Efficient Sequence Clustering for RNA-Seq Data without a Reference Genome.
GI-Edition
Lecture Notes
in Informatics
Dietmar Schomburg,
Andreas Grote (Eds.)
German Conference on
Bioinformatics 2010
September 20 - 22, 2010
Braunschweig, Germany
D. Schomburg, A. Grote (Eds.): GCB 2010
Proceedings
Gesellschaft für Informatik e.V. (GI)
publishes this series in order to make available to a broad public
recent findings in informatics (i.e. computer science and information
systems), to document conferences that are organized in cooperation
with GI and to publish the annual GI Award dissertation.
Broken down into
• seminar
• proceedings
• dissertations
• thematics
current topics are dealt with from the vantage point of research
and development, teaching and further training in theory and practice.
The Editorial Committee uses an intensive review process in
order to ensure high quality contributions.
The volumes are published in German or English.
Information: http://www.gi-ev.de/service/publikationen/lni/
This volume contains papers presented at the 25th German Conference on Bioinformatics
held at the Technische Universität Carolo-Wilhelmina in Braunschweig,
Germany, September 20-22, 2010. The German Conference on Bioinformatics is
an annual, international conference, which provides a forum for the presentation
of current research in bioinformatics and computational biology. It is organized
on behalf of the Special Interest Group on Informatics in Biology of the German
Society of Computer Science (GI) and the German Society of Chemical Technique
and Biotechnology (Dechema) in cooperation with the German Society for
Biochemistry and Molecular Biology (GBM).
173
ISSN 1617-5468
ISBN 978-3-88579-267-3
Dietmar Schomburg, Andreas Grote (Editors)
German Conference on Bioinformatics 2010
September 20-22, 2010
Technische Universität Carolo-Wilhelmina zu
Braunschweig, Germany
Gesellschaft für Informatik e.V. (GI)
Lecture Notes in Informatics (LNI) - Proceedings
Series of the Gesellschaft für Informatik (GI)
Volume P-173
ISBN 978-3-88579-267-3
ISSN 1617-5468
Volume Editors
Prof. Dr. Dietmar Schomburg
Technische Universität Carolo-Wilhelmina zu Braunschweig
Email: d.schomburg@tu-bs.de
Dr. Andreas Grote
Technische Universität Carolo-Wilhelmina zu Braunschweig
Email: andreas.grote@tu-bs.de
Series Editorial Board
Heinrich C. Mayr, Universität Klagenfurt, Austria (Chairman, mayr@ifit.uni-klu.ac.at)
Hinrich Bonin, Leuphana-Universität Lüneburg, Germany
Dieter Fellner, Technische Universität Darmstadt, Germany
Ulrich Flegel, SAP Research, Germany
Ulrich Frank, Universität Duisburg-Essen, Germany
Johann-Christoph Freytag, Humboldt-Universität Berlin, Germany
Thomas Roth-Berghofer, DFKI
Michael Goedicke, Universität Duisburg-Essen
Ralf Hofestädt, Universität Bielefeld
Michael Koch, Universität der Bundeswehr, München, Germany
Axel Lehmann, Universität der Bundeswehr München, Germany
Ernst W. Mayr, Technische Universität München, Germany
Sigrid Schubert, Universität Siegen, Germany
Martin Warnke, Leuphana-Universität Lüneburg, Germany
Dissertations
Dorothea Wagner, Universität Karlsruhe, Germany
Seminars
Reinhard Wilhelm, Universität des Saarlandes, Germany
Thematics
Andreas Oberweis, Universität Karlsruhe (TH)
Gesellschaft für Informatik, Bonn 2010
printed by Köllen Druck+Verlag GmbH, Bonn
Preface
This volume contains papers presented at the 25th German Conference on Bioinformatics
held at the Technische Universität Carolo-Wilhelmina in Braunschweig, Germany,
September 20-22, 2010. The German Conference on Bioinformatics is an annual,
international conference, which provides a forum for the presentation of current research
in bioinformatics and computational biology. It is organized on behalf of the Special
Interest Group on Informatics in Biology of the German Society of Computer Science
(GI) and the German Society of Chemical Technique and Biotechnology (Dechema) in
cooperation with the German Society for Biochemistry and Molecular Biology (GBM).
Five outstanding scientists were invited to give keynote lectures to the conference:
• Edda Klipp - ‘Cellular stress response and regulation of metabolism’
• Thomas Lengauer - ‘HIV Bioinformatics: Analyzing viral evolution for the benefit of AIDS patients’
• Werner Mewes - ‘The data deluge: can simple models explain complex biological
systems?’
• Stefan Schuster - ‘Road games in metabolism - A biotechnological perspective’
• Gregory Stephanopoulos - ‘After a decade of systems biology, time for a record
card’
Besides the keynote lectures, the scientific program comprised 22 contributed talks
presenting 12 regular and 10 short papers. All accepted regular papers are collected
in these proceedings. The remaining accepted contributions, i.e. short papers and
poster abstracts, are published in a separate volume. We would like to thank the program
committee members and all reviewers for their evaluations of the contributions.
Furthermore, we would like to thank the local organizers for keeping the conference
running. The organizers are grateful to all the sponsors and supporting scientific partners.
Last but not least, we would like to thank all contributors and participants of
the GCB 2010.
Braunschweig, August 2010
Dietmar Schomburg and Andreas Grote
Organizers
Conference Chair
Dietmar Schomburg, Braunschweig
Local Organizers
Wolf-Tilo Balke (TU Braunschweig)
Sándor Fekete (TU Braunschweig)
Reinhold Haux (TU Braunschweig)
Dieter Jahn (TU Braunschweig)
Frank Klawonn (Ostfalia University of Applied Sciences)
Constantin Bannert (TU Braunschweig)
Antje Chang (TU Braunschweig)
Andreas Grote (TU Braunschweig)
Katharina Hanke (TU Braunschweig)
Adam Podstawka (TU Braunschweig)
Alexander Riemer (TU Braunschweig)
Maurice Scheer (TU Braunschweig)
Program committee
Mario Albrecht, Saarbrücken
Wolf-Tilo Balke, Braunschweig
Tim Beißbarth, Göttingen
Thomas Dandekar, Würzburg
Sándor Fekete, Braunschweig
Georg Füllen, Rostock
Robert Giegerich, Bielefeld
Reinhold Haux, Braunschweig
Ralf Hofestädt, Bielefeld
Matthias Heinemann, Zürich
Dieter Jahn, Braunschweig
Frank Klawonn, Wolfenbüttel
Edda Klipp, Berlin
Ina Koch, Berlin
Oliver Kohlbacher, Tübingen
Thomas Lengauer, Saarbrücken
Hans-Peter Lenhof, Saarbrücken
Michael Marschollek, Hannover
Werner Mewes, München
Michael Meyer-Hermann, Frankfurt
Burkhard Morgenstern, Göttingen
Stefan Posch, Halle
Matthias Rarey, Hamburg
Falk Schreiber, Gatersleben
Stefan Schuster, Jena
Gregory Stephanopoulos, Cambridge USA
Jens Stoye, Bielefeld
Andrew Torda, Hamburg
Martin Vingron, Berlin
Christian von Mering, Zürich
Edgar Wingender, Göttingen
Andreas Ziegler, Lübeck
Sponsors and supporters
Supporting scientific societies
Gesellschaft für Chemische Technik und Biotechnologie e.V. (DECHEMA)
http://www.dechema.de
Gesellschaft für Biochemie und Molekularbiologie e.V. (GBM)
http://www.gbm-online.de
Gesellschaft für Informatik e.V. (GI)
http://www.gi-ev.de
Non-profit sponsors
Technische Universität Braunschweig
http://www.tu-braunschweig.de
Commercial sponsors
Biobase - Biological Databases
http://www.biobase-international.com
CLC bio
http://www.clcbio.com
Convey Computer
http://www.conveycomputer.com
genomatix
http://www.genomatix.de
geneXplain
http://www.genexplain.com
MEGWARE Computer Cluster
http://www.megware.com
Thalia
http://www.thalia.de
Transtec - IT & Solutions
http://www.transtec.de
Table of Contents
Preface
Organizers
Sponsors and supporters
Table of Contents
Submissions
Andreas R. Gruber, Stephan H. Bernhart, You Zhou, Ivo L. Hofacker
RNALfoldz: efficient prediction of thermodynamically stable, local secondary structures
Florian Battke, Stephan Körner, Steffen Hüttner, Kay Nieselt
Efficient sequence clustering for RNA-seq data without a reference genome
Florian Blöchl, Maria L. Hartsperger, Volker Stümpflen, Fabian J. Theis
Uncovering the structure of heterogenous biological data: fuzzy graph partitioning in the k-partite setting
Sergiy Bogomolov, Martin Mann, Björn Voß, Andreas Podelski, Rolf Backofen
Shape-based barrier estimation for RNAs
Thomas Fober, Marco Mernberger, Gerhard Klebe, Eyke Hüllermeier
Efficient Similarity Retrieval of Protein Binding Sites based on Histogram Comparison
Peter Husemann, Jens Stoye
Repeat-aware Comparative Genome Assembly
Katrin Bohl, Luís F. de Figueiredo, Oliver Hädicke, Steffen Klamt, Christian Kost, Stefan Schuster, Christoph Kaleta
CASOP GS: Computing intervention strategies targeted at production improvement in genome-scale metabolic networks
Jan Grau, Daniel Arend, Ivo Grosse, Artemis G. Hatzigeorgiou, Jens Keilwagen, Manolis Maragkakis, Claus Weinholdt, Stefan Posch
Predicting miRNA targets utilizing an extended profile HMM
Arli A. Parikesit, Peter F. Stadler, Sonja J. Prohaska
Quantitative Comparison of Genomic-Wide Protein Domain Distributions
Jan Budczies, Carsten Denkert, Berit M. Müller, Scarlet F. Brockmöller, Manfred Dietel, Jules L. Griffin, Matej Oresic, Oliver Fiehn
METAtarget – extracting key enzymes of metabolic regulation from high-throughput metabolomics data using KEGG REACTION information
Andreas Gogol-Döring, Wei Chen
Finding Optimal Sets of Enriched Regions in ChIP-Seq Data
Enrico Glaab, Jonathan M. Garibaldi, Natalio Krasnogor
Learning pathway-based decision rules to classify microarray cancer samples
Index of authors
RNALfoldz: efficient prediction of thermodynamically
stable, local secondary structures
Andreas R. Gruber¹, Stephan H. Bernhart¹,
You Zhou¹,², and Ivo L. Hofacker¹
¹Institute for Theoretical Chemistry
University of Vienna, Währingerstraße 17, 1090 Wien, Austria
²College of Computer Science and Technology
Jilin University, Changchun 130012, China
{agruber, berni, ivo}@tbi.univie.ac.at, zyou@jlu.edu.cn
Abstract: The search for local RNA secondary structures and the annotation of unusually
stable folding regions in genomic sequences are two well-motivated bioinformatic
problems. In this contribution we introduce RNALfoldz, an efficient solution to
tackle both tasks. It is an extension of the RNALfold algorithm augmented by support
vector regression for efficient calculation of a structure's thermodynamic stability.
We demonstrate the applicability of this approach on the genome of E. coli and investigate
a potential strategy to determine z-score cutoffs given a predefined false discovery
rate.
1 Introduction
Over the past decade noncoding RNAs (ncRNAs) have risen from a shadowy existence to
one of the primary research topics in modern molecular biology. Today computational
RNA biology faces challenges in the ever-growing amount of sequencing data. Efficient
computational tools are needed to turn these data into information. In this context,
the search for locally stable RNA secondary structures in large sequences is a well-motivated
bioinformatic problem that has drawn considerable attention in the community.
RNALfold [HPS04] was the first in a series of tools that offered an efficient solution
to this task. Instead of a straightforward but costly sliding window approach, a dynamic
programming recursion was formulated that predicts all stable, local RNA structures
in O(N × L²) time, where L is the maximum base-pair span and N the length of the sequence.
Since its publication, the RNALfold algorithm has inspired a lot of work in this field, see
e.g. Rnall by Wan et al. [WLX06] or RNAslider by Horesh et al. [HWL+09]. All
contributions in this field so far have focused on improving the computational complexity of
the algorithm, but none of the approaches has been used to unravel results of biological
significance. In particular, de novo detection of functional RNA structures has been
addressed, but application on a genome-wide scale with a low false discovery rate still seems
out of reach. Even on the moderately sized genome of E. coli (4.6 Mb) one is drowning
in hundreds of thousands of local structures. Unlike in the well-established field of
protein coding gene detection, where one can exploit signals like codon usage, functional
RNA secondary structures, in general, do not show strong characteristics that make them
easily distinguishable from random decoys. Successful approaches for ncRNA detection
operating solely on a single sequence [HHS08, JWW+07] are limited to specific RNA
classes, where some outstanding characteristics can be harnessed. There is no master plan
for the detection of functional RNA structures, but one would certainly want to limit the
RNALfold output to a reasonable amount. So far, only the minimum free energy (MFE)
of the locally stable secondary structures, which is intrinsically computed by the algorithm,
has been considered as a potential discriminator to limit the number of secondary structures.
As demonstrated clearly by Freyhult and colleagues [FGM05], the MFE is roughly a function
of the length of the sequence and the G+C content. Even normalizing the MFE by the
length of the sequence does not serve as a good discriminator between shuffled or coding
sequences and functional RNA structures. A strategy that does work, however, is to compare
the native MFE E of the RNA molecule to the MFEs of a set of shuffled sequences of the
same length and base composition [LM89]. This way we can evaluate the thermodynamic
stability of the secondary structure. A common statistical quantity in this context is the
z-score, which is calculated as follows:
$$z = \frac{E - \mu}{\sigma}$$
where µ and σ are the mean and the standard deviation of the energies of the set of
shuffled sequences. The more negative the z-score, the more thermodynamically stable
the structure. Efficient estimation of a sequence's z-score is a long-standing problem
that was already addressed in the very beginnings of computational RNA biology. A first strategy
to avoid explicit shuffling and folding was based on table look-ups of linear regression
coefficients [CLS+90]. Clote and colleagues [CFKK05] introduced the concept of the
asymptotic z-score, where the efficient calculation is also solved via table look-ups. The
current state-of-the-art approach for fast and efficient estimation of the z-score is to use
support vector regression [WHS05].
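The sampling procedure behind the z-score can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_energy` is a hypothetical stand-in for a real MFE routine such as RNAfold, and `sampled_zscore` with its parameters is an assumed name introduced here.

```python
import random
import statistics


def toy_energy(seq):
    """Hypothetical stand-in for an MFE routine (NOT a real folding energy).

    Scores -1 for every complementary pair between position i and its
    mirror position n-1-i, so the value depends on base order.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    return -float(sum((seq[i], seq[n - 1 - i]) in pairs for i in range(n // 2)))


def sampled_zscore(seq, energy=toy_energy, n_shuffles=1000, seed=0):
    """Estimate z = (E - mu) / sigma from mononucleotide-shuffled sequences."""
    rng = random.Random(seed)
    native = energy(seq)
    letters = list(seq)
    shuffled_energies = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # in-place shuffle; length and base composition are preserved
        shuffled_energies.append(energy("".join(letters)))
    mu = statistics.mean(shuffled_energies)
    sigma = statistics.stdev(shuffled_energies)
    if sigma == 0:
        raise ValueError("all shuffled energies are identical; z-score undefined")
    return (native - mu) / sigma


if __name__ == "__main__":
    # A perfectly self-complementary toy hairpin should score well below zero.
    print(sampled_zscore("GGGGGGAAAACCCCCC"))
```

With a real folding routine plugged in as `energy`, every call costs `n_shuffles` folds, which is the cost the regression approaches cited above are designed to avoid.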
The study by Clote and colleagues and a follow-up to Chen et al. (1990) [LLM02] also
report on efforts to predict thermodynamically stable structures using a sliding window
approach. In this contribution we present RNALfoldz, an algorithm that combines local
RNA secondary structure prediction and the efficient search for thermodynamically stable
structures. RNALfoldz is an extension of the RNALfold algorithm augmented by support
vector regression for efficient calculation of a sequence's z-score. We demonstrate the
applicability of this approach on the genome of E. coli and investigate a potential strategy
to determine z-score cutoffs given a predefined false discovery rate.
2 Methods
2.1 Fast estimation of the z-score using support vector regression
For the efficient estimation of the z-score we follow the strategy first introduced by Washietl
et al. [WHS05]. Instead of explicit generation and folding of shuffled sequences in order to
determine the average free energy and the corresponding standard deviation, support vector
regression (SVR) models are trained to estimate both values. As described in detail in the
previous work, we used a regularly spaced grid to sample sequences for the training set.
Synthetic sequences ranged from 50 to 400 nt in steps of 50 nt. The G+C content, A/(A+T)
ratio and C/(C+G) ratio were, however, extended to a broader spectrum, now ranging from
0.20 to 0.80 in steps of 0.05. A total of 17,576 sequences were used for training. For each
sequence of the training set 1,000 randomized sequences were generated using the Fisher-Yates
shuffle algorithm, and subsequently folded with RNAfold with the dangling ends option
-d2 [HFS+94]. SVR models for the average free energy and standard deviation were
trained using the LIBSVM package (www.csie.ntu.edu.tw/~cjlin/libsvm).
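As a check on the grid description, the stated ranges give 8 lengths and 13 values for each of the three composition parameters, i.e. 8 × 13³ = 17,576 grid points. The sketch below enumerates them; the helper `base_fractions` and all variable names are assumptions made for illustration and do not come from the original software.

```python
from itertools import product

# Grid axes as stated in the text; integer steps avoid floating-point drift.
lengths = list(range(50, 401, 50))                   # 50, 100, ..., 400 nt
fractions = [p / 100 for p in range(20, 81, 5)]      # 0.20, 0.25, ..., 0.80

# One grid point = (length, G+C content, A/(A+T) ratio, C/(C+G) ratio).
grid = list(product(lengths, fractions, fractions, fractions))


def base_fractions(gc, a_ratio, c_ratio):
    """Convert the three composition parameters into A, C, G, T fractions."""
    at = 1.0 - gc
    return {
        "A": at * a_ratio,
        "T": at * (1.0 - a_ratio),
        "C": gc * c_ratio,
        "G": gc * (1.0 - c_ratio),
    }


if __name__ == "__main__":
    print(len(lengths), len(fractions), len(grid))
```

Each grid point would then be turned into one synthetic training sequence with the corresponding length and composition.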
While in the previous work input features and the dependent variables were normalized
to a mean of zero and a standard deviation of one, we apply here a different normalization
strategy that leads to a significantly lower number of support vectors for the final models.
For the average free energy model the dependent variable is normalized
by the length of the sequence, while for the standard deviation model it is normalized by the
square root of the sequence length. The length still remains in the set of input features and is
scaled from 0 to 1. Other features remain unchanged. We used an RBF kernel, and optimized values
for the SVM parameters were determined using standard protocols for this purpose. Final regression
models were selected by balancing two criteria: (i) mean absolute error (MAE) on a
test set of 5,000 randomly drawn sequences of arbitrary length (50-400) from the human
genome, and (ii) complexity of the model (number of support vectors), which translates to the
following procedure: from the top 10% of regression models in terms of MAE we selected
the one that had the lowest number of support vectors. For the average free energy regression
we selected a model with an MAE of 0.453 and 1,088 support vectors, and for the
standard deviation regression a model with an MAE of 0.027 and 2,252 support vectors. To
validate our approach we finally compared z-scores derived from the SVR to traditionally
sampled z-scores on a set of 1,000 randomly drawn sequences from the human genome.
The correlation coefficient (R) is 0.9981 and the MAE is 0.072. This is in fair agreement
with results obtained when comparing two sets of sampled z-scores (R: 0.9986, MAE: 0.054,
number of shuffled sequences = 1,000).
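The normalization of the regression targets and the model selection rule can be sketched as below. This is only an illustration under stated assumptions: the function names are hypothetical, the min-max scaling of the length over the 50 to 400 nt training range is an assumed reading of "scaled from 0 to 1", and the candidate tuples are made-up placeholders rather than reported results.

```python
import math


def normalize_targets(length, mean_energy, sd_energy):
    """Scale the regression targets: mean by length, sd by sqrt(length)."""
    return mean_energy / length, sd_energy / math.sqrt(length)


def scale_length(length, lo=50, hi=400):
    """Assumed min-max scaling of the length feature to the interval [0, 1]."""
    return (length - lo) / (hi - lo)


def select_model(candidates, top_percent=10):
    """Pick the model with the fewest support vectors among the best by MAE.

    candidates: list of (name, mae, n_support_vectors) tuples.
    """
    ranked = sorted(candidates, key=lambda c: c[1])          # ascending MAE
    k = max(1, len(ranked) * top_percent // 100)             # size of the top slice
    return min(ranked[:k], key=lambda c: c[2])               # fewest support vectors


if __name__ == "__main__":
    # Placeholder candidates for illustration only.
    models = [("a", 0.45, 1500), ("b", 0.46, 1088), ("c", 0.47, 500)]
    models += [(f"m{i}", 0.50 + i / 100, 3000) for i in range(17)]
    print(select_model(models))
```

Note that model "c" has the fewest support vectors overall but is not chosen, because it falls outside the top 10% by MAE; the rule trades a small loss in accuracy for a cheaper model only within that slice.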
2.2 Adaption of the RNALfold algorithm
RNALfold computes locally stable structures of long RNA molecules. It uses a Zuker-type
secondary structure prediction algorithm [ZS81] and restricts the maximum base pair
span to L bases to keep the structures local. The sequence is processed from the 3' end
(position n, the sequence length) to the 5' end. In order to keep the number of backtracing
operations low and the output at moderate size, we want to avoid backtracing structures that
differ only by unpaired regions. Furthermore, only the longest helices possible are of interest. To
achieve this, a structure starting at base i is only traced back if the total energy F(i,n) is
smaller than that of its 3' neighbor F(i+1,n) while its 5' neighbor has the same energy:
F(i−1,n) = F(i,n) < F(i+1,n). The local minimum structure is found by identifying
the pairing partner j of i so that C(i,j) + F(j+1,n) = F(i,n), i.e. the minimum energy
from i to n is decomposed into the local minimum part i,j and the rest of the molecule.
Here, C(i,j) stands for the energy of a structural feature enclosed by the base pair i,j.
As a result of this, the output of RNALfold contains components, i.e. structures that are
enclosed by a base pair, only. Before we actually start the trace back, we evaluate two
new criteria: (1) the sequence of the structure traced back has to be within the training
parameters of the SVR, and (2) the z-score of the energy of this structure has to be lower
than a predefined bound. Criterion (1) is simply imposed by the training boundaries of
the SVMs. Boundaries have, however, been chosen carefully to cover a broad range of
today's known spectrum of functional RNA structures. 99.79% of the sequences in the
Rfam v. 10 full data set match the base composition requirements of the SVR and 90% of
Rfam RNA families are within the sequence length restrictions.
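The backtrace trigger, the search for the pairing partner, and criterion (1) can be sketched as follows. This is a simplified illustration, not the RNALfold source: `F` is assumed to be a list with F[i] standing for F(i,n), `C` a dictionary keyed by (i, j), energies are assumed to be integers so that equality tests are exact, and all function names and toy numbers are hypothetical.

```python
def backtrace_starts(F):
    """Return all i with F(i-1,n) == F(i,n) < F(i+1,n)."""
    return [i for i in range(1, len(F) - 1) if F[i - 1] == F[i] < F[i + 1]]


def pairing_partner(i, F, C):
    """Find j such that C(i,j) + F(j+1,n) == F(i,n); None if no such j exists."""
    for (start, j), energy in C.items():
        if start == i and energy + F[j + 1] == F[i]:
            return j
    return None


def within_training_range(seq, min_len=50, max_len=400, lo=0.20, hi=0.80):
    """Criterion (1): length and base composition inside the SVR training grid."""
    n = len(seq)
    if not min_len <= n <= max_len:
        return False
    a, c, g = seq.count("A"), seq.count("C"), seq.count("G")
    t = seq.count("T") + seq.count("U")
    if a + t == 0 or c + g == 0:
        return False  # one of the ratios would be undefined
    ratios = ((c + g) / n, a / (a + t), c / (c + g))
    return all(lo <= r <= hi for r in ratios)


if __name__ == "__main__":
    # Toy numbers chosen for illustration; they are not physically meaningful.
    F = [-5, -5, -3, -3, 0, 0]
    C = {(1, 2): -1, (1, 3): -5, (3, 4): -3}
    for i in backtrace_starts(F):
        print(i, pairing_partner(i, F, C))
```

Requiring equality with the 5' neighbor and a strict drop relative to the 3' neighbor means that a run of positions sharing the same energy triggers at most one backtrace.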
In order to get the exact sequence composition that is needed for the SVR evaluations,
the 3’ end of the structure (j) has to be computed first. This is done by a first, short
backtracing step, where the decomposition F(i,n) = C(i,j) + F(j + 1,n) is used to
find j. Subsequently, the average free energy given the base composition of the sequence
s(i,j) is computed by calling the corresponding SVR model. The SVR model for the
standard deviation has approximately twice the number of support vectors as the average
free energy model. To minimize calls of this model, first the minimal standard deviation
for the particular sequence length is looked up. We can then, using the free energy of
C(i,j), calculate a lower bound of the z-score. Only if this lower bound is below the
minimal required z-score is the support vector regression for the standard deviation called
to calculate the actual z-score. If the z-score then still meets the minimal z-score criterion,
the structure is fully traced back and printed out.
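The two-stage evaluation can be sketched as below. It relies on a short argument: for E − µ < 0 and σ ≥ σ_min > 0, the z-score (E − µ)/σ can never be lower than (E − µ)/σ_min, so that value is a lower bound; for E − µ ≥ 0 the z-score is non-negative. The function name, its signature and the callables standing in for the two SVR models are assumptions made for this illustration.

```python
def zscore_if_stable(energy, features, mean_model, sd_model, min_sd, z_cutoff):
    """Return the z-score if it is below z_cutoff, otherwise None.

    mean_model / sd_model: callables standing in for the two SVR models.
    min_sd: smallest standard deviation possible for this sequence length.
    The more expensive sd_model is only called when the lower bound passes.
    """
    diff = energy - mean_model(features)
    # Lower bound on z: divide a negative difference by the smallest sigma.
    lower_bound = diff / min_sd if diff < 0 else 0.0
    if lower_bound >= z_cutoff:
        return None                      # cannot reach the cutoff; skip sd_model
    z = diff / sd_model(features)
    return z if z < z_cutoff else None


if __name__ == "__main__":
    calls = []

    def mean_stub(features):
        return -10.0                     # placeholder mean energy

    def sd_stub(features):
        calls.append(features)           # record how often the model is used
        return 2.0                       # placeholder standard deviation

    for e in (-12.0, -20.0, -15.0):
        print(e, zscore_if_stable(e, "x", mean_stub, sd_stub, 1.0, -3.0), len(calls))
```

In the toy run the first energy is rejected from the bound alone, so the standard deviation stub is called only for the remaining two candidates.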
3 Results
The concept of fast and efficient estimation of the z-score by support vector regression
was first introduced by Washietl et al. [WHS05], and implemented in the noncoding RNA
gene finder RNAz. The speed-up of this approach compared to explicit shuffling and folding,
which is usually done on 1,000 replicas, is tremendous, at minimum a factor of 1,000.
Moreover, computing time is invariant to the length of the sequence, while RNA folding
is of complexity O(N³). When considering the z-score as an evaluation criterion in the
RNALfold algorithm, calculation of the z-score becomes a time-consuming factor, as in
a worst-case scenario it has to be done for almost every nucleotide of the sequence. It is
therefore a crucial concern to use support vector models that not only have good accuracy,
but also a moderate number of support vectors (SVs). In this work we extended the
z-score support vector regression to cover a broader range of the sequence spectrum, but
managed at the same time to build models that have significantly fewer SVs than the models
used by RNAz. This was accomplished by normalizing the dependent variables of the regression,
i.e. the average free energy and the standard deviation, by the sequence length.
The dependent variables do not strictly linearly correlate with the sequence length and so
we have to keep the sequence length as an input feature. Nevertheless, redundant points
are created in the training set, which eventually leads to a smaller space to be trained. For