Efficient Sequence Clustering for RNASeq Data without a Reference Genome.
GI-Edition
Lecture Notes
in Informatics
Dietmar Schomburg,
Andreas Grote (Eds.)
German Conference on
Bioinformatics 2010
September 20-22, 2010
Braunschweig, Germany
D. Schomburg, A. Grote (Eds.): GCB 2010
Proceedings
Gesellschaft für Informatik e.V. (GI)
publishes this series in order to make available to a broad public
recent findings in informatics (i.e. computer science and information
systems), to document conferences that are organized in cooperation
with GI and to publish the annual GI Award dissertation.
Broken down into
• seminars
• proceedings
• dissertations
• thematics
current topics are dealt with from the vantage point of research
and development, teaching and further training in theory and practice.
The Editorial Committee uses an intensive review process in
order to ensure high quality contributions.
The volumes are published in German or English.
Information: http://www.gi-ev.de/service/publikationen/lni/
This volume contains papers presented at the 25th German Conference on
Bioinformatics held at the Technische Universität Carolo-Wilhelmina in Braunschweig,
Germany, September 20-22, 2010. The German Conference on Bioinformatics is
an annual, international conference, which provides a forum for the presentation
of current research in bioinformatics and computational biology. It is organized
on behalf of the Special Interest Group on Informatics in Biology of the German
Society of Computer Science (GI) and the German Society of Chemical Technique
and Biotechnology (Dechema) in cooperation with the German Society for
Biochemistry and Molecular Biology (GBM).
173
ISSN 1617-5468
ISBN 978-3-88579-267-3
Dietmar Schomburg, Andreas Grote (Editors)
German Conference on Bioinformatics 2010
September 20-22, 2010
Technische Universität Carolo-Wilhelmina zu
Braunschweig, Germany
Gesellschaft für Informatik e.V. (GI)
Lecture Notes in Informatics (LNI) - Proceedings
Series of the Gesellschaft für Informatik (GI)
Volume P-173
ISBN 978-3-88579-267-3
ISSN 1617-5468
Volume Editors
Prof. Dr. Dietmar Schomburg
Technische Universität Carolo-Wilhelmina zu Braunschweig
Email: d.schomburg@tu-bs.de
Dr. Andreas Grote
Technische Universität Carolo-Wilhelmina zu Braunschweig
Email: andreas.grote@tu-bs.de
Series Editorial Board
Heinrich C. Mayr, Universität Klagenfurt, Austria (Chairman, mayr@ifit.uni-klu.ac.at)
Hinrich Bonin, Leuphana-Universität Lüneburg, Germany
Dieter Fellner, Technische Universität Darmstadt, Germany
Ulrich Flegel, SAP Research, Germany
Ulrich Frank, Universität Duisburg-Essen, Germany
Johann-Christoph Freytag, Humboldt-Universität Berlin, Germany
Thomas Roth-Berghofer, DFKI
Michael Goedicke, Universität Duisburg-Essen
Ralf Hofestädt, Universität Bielefeld
Michael Koch, Universität der Bundeswehr, München, Germany
Axel Lehmann, Universität der Bundeswehr München, Germany
Ernst W. Mayr, Technische Universität München, Germany
Sigrid Schubert, Universität Siegen, Germany
Martin Warnke, Leuphana-Universität Lüneburg, Germany
Dissertations
Dorothea Wagner, Universität Karlsruhe, Germany
Seminars
Reinhard Wilhelm, Universität des Saarlandes, Germany
Thematics
Andreas Oberweis, Universität Karlsruhe (TH)
Gesellschaft für Informatik, Bonn 2010
printed by Köllen Druck+Verlag GmbH, Bonn
Preface
This volume contains papers presented at the 25th German Conference on Bioinformatics
held at the Technische Universität Carolo-Wilhelmina in Braunschweig, Germany,
September 20-22, 2010. The German Conference on Bioinformatics is an annual,
international conference, which provides a forum for the presentation of current research
in bioinformatics and computational biology. It is organized on behalf of the Special
Interest Group on Informatics in Biology of the German Society of Computer Science
(GI) and the German Society of Chemical Technique and Biotechnology (Dechema) in
cooperation with the German Society for Biochemistry and Molecular Biology (GBM).
Five outstanding scientists were invited to give keynote lectures to the conference:
• Edda Klipp - ‘Cellular stress response and regulation of metabolism’
• Thomas Lengauer - ‘HIV Bioinformatics: Analyzing viral evolution for the
benefit of AIDS patients’
• Werner Mewes - ‘The data deluge: can simple models explain complex biological
systems?’
• Stefan Schuster - ‘Road games in metabolism - A biotechnological perspective’
• Gregory Stephanopoulos - ‘After a decade of systems biology, time for a record
card’
Besides the keynote lectures, the scientific program comprised 22 contributed talks
presenting 12 regular and 10 short papers. All accepted regular papers are collected
in these proceedings. The remaining accepted contributions, i.e. short papers and
poster abstracts, are published in a separate volume. We would like to thank the
program committee members and all reviewers for their evaluations of the contributions.
Furthermore, we would like to thank the local organizers for keeping the conference
running. The organizers are grateful to all the sponsors and supporting scientific
partners. Last but not least, we would like to thank all contributors and participants of
the GCB 2010.
Braunschweig, August 2010
Dietmar Schomburg and Andreas Grote
Organizers
Conference Chair
Dietmar Schomburg, Braunschweig
Local Organizers
Wolf-Tilo Balke (TU Braunschweig)
Sándor Fekete (TU Braunschweig)
Reinhold Haux (TU Braunschweig)
Dieter Jahn (TU Braunschweig)
Frank Klawonn (Ostfalia University of Applied Sciences)
Constantin Bannert (TU Braunschweig)
Antje Chang (TU Braunschweig)
Andreas Grote (TU Braunschweig)
Katharina Hanke (TU Braunschweig)
Adam Podstawka (TU Braunschweig)
Alexander Riemer (TU Braunschweig)
Maurice Scheer (TU Braunschweig)
Program committee
Mario Albrecht, Saarbrücken
Wolf-Tilo Balke, Braunschweig
Tim Beißbarth, Göttingen
Thomas Dandekar, Würzburg
Sándor Fekete, Braunschweig
Georg Füllen, Rostock
Robert Giegerich, Bielefeld
Reinhold Haux, Braunschweig
Ralf Hofestädt, Bielefeld
Matthias Heinemann, Zürich
Dieter Jahn, Braunschweig
Frank Klawonn, Wolfenbüttel
Edda Klipp, Berlin
Ina Koch, Berlin
Oliver Kohlbacher, Tübingen
Thomas Lengauer, Saarbrücken
Hans-Peter Lenhof, Saarbrücken
Michael Marschollek, Hannover
Werner Mewes, München
Michael Meyer-Hermann, Frankfurt
Burkhard Morgenstern, Göttingen
Stefan Posch, Halle
Matthias Rarey, Hamburg
Falk Schreiber, Gatersleben
Stefan Schuster, Jena
Gregory Stephanopoulos, Cambridge USA
Jens Stoye, Bielefeld
Andrew Torda, Hamburg
Martin Vingron, Berlin
Christian von Mering, Zürich
Edgar Wingender, Göttingen
Andreas Ziegler, Lübeck
Sponsors and supporters
Supporting scientific societies
Gesellschaft für Chemische Technik und Biotechnologie e.V. (DECHEMA)
http://www.dechema.de
Gesellschaft für Biochemie und Molekularbiologie e.V. (GBM)
http://www.gbm-online.de
Gesellschaft für Informatik e.V. (GI)
http://www.gi-ev.de
Non-profit sponsors
Technische Universität Braunschweig
http://www.tu-braunschweig.de
Commercial sponsors
Biobase - Biological Databases
http://www.biobase-international.com
CLC bio
http://www.clcbio.com
Convey Computer
http://www.conveycomputer.com
genomatix
http://www.genomatix.de
geneXplain
http://www.genexplain.com
MEGWARE Computer Cluster
http://www.megware.com
Thalia
http://www.thalia.de
Transtec - IT & Solutions
http://www.transtec.de
Table of Contents
Preface
Organizers
Sponsors and supporters
Table of Contents
Submissions
Andreas R. Gruber, Stephan H. Bernhart, You Zhou, Ivo L. Hofacker
RNALfoldz: efficient prediction of thermodynamically stable, local secondary structures
Florian Battke, Stephan Körner, Steffen Hüttner, Kay Nieselt
Efficient sequence clustering for RNA-seq data without a reference genome
Florian Blöchl, Maria L. Hartsperger, Volker Stümpflen, Fabian J. Theis
Uncovering the structure of heterogenous biological data: fuzzy graph partitioning in the k-partite setting
Sergiy Bogomolov, Martin Mann, Björn Voß, Andreas Podelski, Rolf Backofen
Shape-based barrier estimation for RNAs
Thomas Fober, Marco Mernberger, Gerhard Klebe, Eyke Hüllermeier
Efficient Similarity Retrieval of Protein Binding Sites based on Histogram Comparison
Peter Husemann, Jens Stoye
Repeat-aware Comparative Genome Assembly
Katrin Bohl, Luís F. de Figueiredo, Oliver Hädicke, Steffen Klamt, Christian Kost, Stefan Schuster, Christoph Kaleta
CASOP GS: Computing intervention strategies targeted at production improvement in genome-scale metabolic networks
Jan Grau, Daniel Arend, Ivo Grosse, Artemis G. Hatzigeorgiou, Jens Keilwagen, Manolis Maragkakis, Claus Weinholdt, Stefan Posch
Predicting miRNA targets utilizing an extended profile HMM
Arli A. Parikesit, Peter F. Stadler, Sonja J. Prohaska
Quantitative Comparison of Genomic-Wide Protein Domain Distributions
Jan Budczies, Carsten Denkert, Berit M. Müller, Scarlet F. Brockmöller, Manfred Dietel, Jules L. Griffin, Matej Oresic, Oliver Fiehn
METAtarget – extracting key enzymes of metabolic regulation from high-throughput metabolomics data using KEGG REACTION information
Andreas Gogol-Döring, Wei Chen
Finding Optimal Sets of Enriched Regions in ChIP-Seq Data
Enrico Glaab, Jonathan M. Garibaldi, Natalio Krasnogor
Learning pathway-based decision rules to classify microarray cancer samples
Index of authors
RNALfoldz: efficient prediction of thermodynamically
stable, local secondary structures
Andreas R. Gruber¹, Stephan H. Bernhart¹,
You Zhou¹,², and Ivo L. Hofacker¹
¹ Institute for Theoretical Chemistry
University of Vienna, Währingerstraße 17, 1090 Wien, Austria
² College of Computer Science and Technology
Jilin University, Changchun 130012, China
{agruber, berni, ivo}@tbi.univie.ac.at, zyou@jlu.edu.cn
Abstract: The search for local RNA secondary structures and the annotation of unusually
stable folding regions in genomic sequences are two well-motivated bioinformatic
problems. In this contribution we introduce RNALfoldz, an efficient solution to
tackle both tasks. It is an extension of the RNALfold algorithm augmented by support
vector regression for efficient calculation of a structure’s thermodynamic stability.
We demonstrate the applicability of this approach on the genome of E. coli and investigate
a potential strategy to determine z-score cutoffs given a predefined false discovery
rate.
1 Introduction
Over the past decade non-coding RNAs (ncRNAs) have risen from a shadowy existence to
one of the primary research topics in modern molecular biology. Today computational
RNA biology faces challenges in the ever-growing amount of sequencing data. Efficient
computational tools are needed to turn these data into information. In this context,
the search for locally stable RNA secondary structures in large sequences is a
well-motivated bioinformatic problem that has drawn considerable attention in the community.
RNALfold [HPS04] was the first in a series of tools that offered an efficient solution
to this task. Instead of a straightforward but costly sliding window approach, a dynamic
programming recursion was formulated that predicts all stable, local RNA structures
in O(N × L^2) time, where L is the maximum base-pair span and N the length of the sequence.
Since its publication, the RNALfold algorithm has inspired a lot of work in this field, see
e.g. Rnall by Wan et al. [WLX06] or RNAslider by Horesh et al. [HWL+09]. All
contributions so far in this field focused on improving the computational complexity of
the algorithm, but none of the approaches has ever been used to unravel results of
biological significance. In particular, de novo detection of functional RNA structures has been
addressed, but application on a genome-wide scale with a low false discovery rate still
seems out of reach. Even on the moderately sized genome of E. coli (4.6 Mb) one is
drowning in hundreds of thousands of local structures. Unlike in the well-established field of
protein-coding gene detection, where one can exploit signals like codon usage, functional
RNA secondary structures, in general, do not show strong characteristics that make them
easily distinguishable from random decoys. Successful approaches for ncRNA detection
operating solely on a single sequence [HHS08, JWW+07] are limited to specific RNA
classes, where some outstanding characteristics can be harnessed. There is no master plan
for the detection of functional RNA structures, but one would certainly want to limit the
RNALfold output to a reasonable amount. So far, only the minimum free energy (MFE)
of the locally stable secondary structures, which is intrinsically computed by the algorithm,
has been considered as a potential discriminator to limit the number of secondary structures.
As demonstrated clearly by Freyhult and colleagues [FGM05], the MFE is roughly a
function of the length of the sequence and the G+C content. Even normalizing the MFE by
the length of the sequence does not yield a good discriminator between shuffled or coding
sequences and functional RNA structures. A strategy that does work, however, is to
compare the native MFE E of the RNA molecule to the MFEs of a set of shuffled sequences of
the same length and base composition [LM89]. This way we can evaluate the thermodynamic
stability of the secondary structure. A common statistical quantity in this context is the
z-score, which is calculated as follows
\[ z = \frac{E - \mu}{\sigma} \]
where µ and σ are the average and the standard deviation of the energies of the set of
shuffled sequences. The more negative the z-score, the more thermodynamically stable
the structure. Efficient estimation of a sequence’s z-score is a long-standing problem,
already addressed in the very beginnings of computational RNA biology. A first strategy
to avoid explicit shuffling and folding was based on table lookups of linear regression
coefficients [CLS+90]. Clote and colleagues [CFKK05] introduced the concept of the
asymptotic z-score, where the efficient calculation is also solved via table lookups. The
current state-of-the-art approach for fast and efficient estimation of the z-score is to use
support vector regression [WHS05].
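As an illustration of the sampled z-score defined above, the following minimal Python sketch shuffles a sequence, evaluates an energy function on each shuffle, and applies z = (E − µ)/σ. The function names are hypothetical, and `toy_energy` is only a stand-in so that the example runs without a folding library; it is not an RNA folding model. In practice the energy function would be a real MFE calculation, and the use of the sample standard deviation is an assumption.

```python
import random
import statistics


def fisher_yates_shuffle(seq, rng):
    """Return a shuffled copy of seq; length and base composition are preserved."""
    bases = list(seq)
    for i in range(len(bases) - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive on both ends
        bases[i], bases[j] = bases[j], bases[i]
    return "".join(bases)


def toy_energy(seq):
    """Hypothetical stand-in for an MFE routine (NOT a folding model).

    Scores -1 for every complementary pair obtained by folding the
    sequence back onto itself like a single hairpin.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    return -float(sum((seq[i], seq[n - 1 - i]) in pairs for i in range(n // 2)))


def sampled_z_score(seq, energy_fn, n_shuffles=1000, seed=0):
    """z = (E - mu) / sigma, with mu and sigma taken from shuffled sequences."""
    rng = random.Random(seed)
    native = energy_fn(seq)
    shuffled = [energy_fn(fisher_yates_shuffle(seq, rng)) for _ in range(n_shuffles)]
    mu = statistics.mean(shuffled)
    sigma = statistics.stdev(shuffled)  # sample standard deviation (assumption)
    if sigma == 0:
        raise ValueError("all shuffled energies are identical; z-score undefined")
    return (native - mu) / sigma


if __name__ == "__main__":
    # A perfect toy hairpin should score well below its shuffled versions.
    print(sampled_z_score("GGGGGGGGAAAACCCCCCCC", toy_energy))
```

Each call evaluates the energy function once per shuffle, which is the cost that the table-lookup and regression approaches are meant to avoid.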
The study by Clote and colleagues and a follow-up to Chen et al. (1990) [LLM02] also
report on efforts to predict thermodynamically stable structures using a sliding window
approach. In this contribution we present RNALfoldz, an algorithm that combines local
RNA secondary structure prediction and the efficient search for thermodynamically stable
structures. RNALfoldz is an extension of the RNALfold algorithm augmented by support
vector regression for efficient calculation of a sequence’s z-score. We demonstrate the
applicability of this approach on the genome of E. coli and investigate a potential strategy
to determine z-score cutoffs given a predefined false discovery rate.
2 Methods
2.1 Fast estimation of the z-score using support vector regression
For the efficient estimation of the z-score we follow the strategy first introduced by Washietl
et al. [WHS05]. Instead of explicit generation and folding of shuffled sequences in order to
determine the average free energy and the corresponding standard deviation, support vector
regression (SVR) models are trained to estimate both values. As described in detail in the
previous work, we used a regularly spaced grid to sample sequences for the training set.
Synthetic sequences ranged from 50 to 400 nt in steps of 50 nt. The G+C content, A/(A+T)
ratio and C/(C+G) ratio were, however, extended to a broader spectrum, now ranging from
0.20 to 0.80 in steps of 0.05. A total of 17,576 sequences were used for training. For each
sequence of the training set 1,000 randomized sequences were generated using the
Fisher-Yates shuffle algorithm, and subsequently folded with RNAfold with the dangling ends
option -d2 [HFS+94]. SVR models for the average free energy and standard deviation were
trained using the LIBSVM package (www.csie.ntu.edu.tw/~cjlin/libsvm).
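The training-set size quoted above matches a full grid with one sequence per grid point. That reading is an inference from the stated ranges and step sizes, not something the text says explicitly. A short Python check, with hypothetical variable names:

```python
from itertools import product

# Ranges as stated in the text; integer percentages avoid floating-point drift.
lengths = range(50, 401, 50)                         # 50, 100, ..., 400 nt
fractions = [pct / 100 for pct in range(20, 81, 5)]  # 0.20, 0.25, ..., 0.80

# One grid point per (length, G+C content, A/(A+T) ratio, C/(C+G) ratio).
grid = list(product(lengths, fractions, fractions, fractions))

print(len(lengths), len(fractions), len(grid))  # 8 13 17576
```

With 8 lengths and 13 values for each of the three composition features, the grid has 8 × 13³ = 17,576 points.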
While in the previous work input features and the dependent variables were normalized
to a mean of zero and a standard deviation of one, we apply here a different normalization
strategy that leads to a significantly lower number of support vectors for the final models.
For the regression of the average free energy, the dependent variable is normalized
by the length of the sequence, while for the standard deviation it is normalized by the
square root of the sequence length. The length still remains in the set of input features and
is scaled from 0 to 1. Other features remain unchanged. We used an RBF kernel, and
optimized values for the SVM parameters were determined using standard protocols for this
purpose. Final regression models were selected by balancing two criteria: (i) mean absolute
error (MAE) on a test set of 5,000 randomly drawn sequences of arbitrary length (50-400)
from the human genome, and (ii) complexity of the model (number of support vectors), which
translates to the following procedure: from the top 10% of regression models in terms of MAE
we selected the one that had the lowest number of support vectors. For the average free
energy regression we selected a model with a MAE of 0.453 and 1,088 support vectors, and
for the standard deviation regression a model with a MAE of 0.027 and 2,252 support vectors.
To validate our approach we finally compared z-scores derived from the SVR to traditionally
sampled z-scores on a set of 1,000 randomly drawn sequences from the human genome.
The correlation coefficient (R) is 0.9981 and the MAE is 0.072. This is in fair agreement
with results obtained when comparing two sets of sampled z-scores (R: 0.9986, MAE: 0.054,
number of shuffled sequences = 1,000).
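The normalization described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the two regression models are passed in as plain callables rather than LIBSVM models, and scaling the length linearly over the 50 to 400 nt training range is an assumption about how "scaled from 0 to 1" is realized.

```python
import math

MIN_LEN, MAX_LEN = 50, 400  # training length range given in the text


def _ratio(x, y):
    """x / (x + y), with an explicit error instead of a silent division by zero."""
    if x + y == 0:
        raise ValueError("ratio undefined for this base composition")
    return x / (x + y)


def composition_features(seq):
    """G+C content, A/(A+U) ratio, C/(C+G) ratio and length scaled to [0, 1]."""
    s = seq.upper().replace("T", "U")
    a, c, g, u = (s.count(b) for b in "ACGU")
    n = len(s)
    gc = (g + c) / n
    length_scaled = (n - MIN_LEN) / (MAX_LEN - MIN_LEN)  # assumed linear scaling
    return gc, _ratio(a, u), _ratio(c, g), length_scaled


def z_from_models(energy, seq, mean_model, sd_model):
    """Undo the target normalization and return z = (E - mu) / sigma.

    mean_model predicts the mean energy per nucleotide, sd_model predicts the
    standard deviation per sqrt(nucleotide); both are hypothetical callables.
    """
    feats = composition_features(seq)
    n = len(seq)
    mu = mean_model(feats) * n
    sigma = sd_model(feats) * math.sqrt(n)
    return (energy - mu) / sigma


if __name__ == "__main__":
    seq = "ACGU" * 25  # 100 nt, all three composition features equal 0.5
    # Constant stand-in models, chosen only to make the arithmetic easy to follow.
    print(z_from_models(-40.0, seq, lambda f: -0.25, lambda f: 0.5))  # -3.0
```

With the stand-in models, µ = −0.25 × 100 = −25 and σ = 0.5 × √100 = 5, so an energy of −40 gives z = −3.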
2.2 Adaption of the RNALfold algorithm
RNALfold computes locally stable structures of long RNA molecules. It uses a Zuker-type
secondary structure prediction algorithm [ZS81] and restricts the maximum base pair
span to L bases to keep the structures local. The sequence is processed from the 3’ end (the
sequence length n) to the 5’ end. In order to keep the number of backtrace operations low
and the output at moderate size, we want to avoid backtracing structures that differ only
by unpaired regions. Furthermore, only the longest helices possible are of interest. To
achieve this, a structure starting at base i is only traced back if the total energy F(i,n) is
smaller than that of its 3’ neighbor F(i+1,n) while its 5’ neighbor has the same energy:
F(i−1,n) = F(i,n) < F(i+1,n). The local minimum structure is found by identifying
the pairing partner j of i so that C(i,j) + F(j+1,n) = F(i,n), i.e. the minimum energy
from i to n is decomposed into the local minimum part i,j and the rest of the molecule.
Here, C(i,j) stands for the energy of a structural feature enclosed by the base pair i,j.
As a result of this, the output of RNALfold contains components, i.e. structures that are
enclosed by a base pair, only. Before we actually start the trace back, we evaluate two
new criteria: (1) the sequence of the structure traced back has to be within the training
parameters of the SVR, and (2) the z-score of the energy of this structure has to be lower
than a predefined bound. Criterion (1) is simply imposed by the training boundaries of
the SVMs. Boundaries have, however, been chosen carefully to cover a broad range of
today’s known spectrum of functional RNA structures. 99.79% of the sequences in the
Rfam v. 10 full data set match the base composition requirements of the SVR and 90% of
Rfam RNA families are within the sequence length restrictions.
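The backtrace condition F(i−1,n) = F(i,n) < F(i+1,n) can be illustrated with a small Python sketch. It assumes the values F(i,n) are already available as a list indexed by i and stored as integers, so that the equality test is exact. The function name, the integer representation and the skipping of the two boundary positions are assumptions made for this example only.

```python
def traceback_starts(F):
    """Return all positions i with F[i-1] == F[i] < F[i+1].

    F[i] plays the role of F(i, n) in the text: the optimal energy of the
    subsequence from i to the 3' end. Boundary positions, which lack one of
    the two neighbors, are skipped in this sketch.
    """
    return [i for i in range(1, len(F) - 1) if F[i - 1] == F[i] < F[i + 1]]


if __name__ == "__main__":
    # Toy energies in integer units; F is non-decreasing towards the 3' end
    # because leaving base i unpaired is always allowed.
    F = [-50, -50, -30, -30, -30, -10, 0]
    print(traceback_starts(F))  # [1, 4]
```

In the toy array, positions 2 and 3 are skipped although their energy equals that of a neighbor: position 2 differs from its 5’ neighbor, and position 3 does not improve on its 3’ neighbor.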
In order to get the exact sequence composition that is needed for the SVR evaluations,
the 3’ end of the structure (j) has to be computed first. This is done by a first, short
backtracing step, where the decomposition F(i,n) = C(i,j) + F(j+1,n) is used to
find j. Subsequently, the average free energy given the base composition of the sequence
s(i,j) is computed by calling the corresponding SVR model. The SVR model for the
standard deviation has approximately twice as many support vectors as the average
free energy model. To minimize calls of this model, first the minimal standard deviation
for the particular sequence length is looked up. We can then, using the free energy
C(i,j), calculate a lower bound of the z-score. Only if this lower bound is below the
minimal required z-score is the support vector regression for the standard deviation called
to calculate the actual z-score. If the z-score then still meets the minimal z-score criterion,
the structure is fully traced back and printed out.
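The two-stage test described above can be sketched in Python as follows. The names are hypothetical, the standard deviation model is passed in as a callable, and the sketch assumes a negative cutoff and strict comparisons, neither of which the text specifies. Dividing by the smallest possible σ gives a lower bound on z only when E − µ is negative; with a negative cutoff, candidates with E ≥ µ are rejected by the first test anyway.

```python
def z_score_if_stable(energy, mu, sigma_min, sigma_model, z_cutoff):
    """Return the z-score if it is below z_cutoff, otherwise None.

    sigma_min is the looked-up minimal standard deviation for this sequence
    length; sigma_model is a zero-argument callable standing in for the more
    expensive standard deviation regression (hypothetical interface).
    """
    if z_cutoff >= 0:
        raise ValueError("this sketch assumes a negative z-score cutoff")
    # Cheap test first: the smallest sigma yields the most negative possible z.
    z_bound = (energy - mu) / sigma_min
    if not z_bound < z_cutoff:
        return None  # even the best case fails, so skip the expensive model
    z = (energy - mu) / sigma_model()
    return z if z < z_cutoff else None


if __name__ == "__main__":
    calls = []

    def expensive_sigma():
        calls.append(1)  # count how often the expensive model is evaluated
        return 5.0

    print(z_score_if_stable(-40.0, -25.0, 4.0, expensive_sigma, -2.5))  # -3.0
    print(z_score_if_stable(-30.0, -25.0, 4.0, expensive_sigma, -2.5))  # None
    print(len(calls))  # 1
```

The second candidate is rejected by the bound alone (−1.25 is not below −2.5), so the expensive model is evaluated only once across the two calls.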
3 Results
The concept of fast and efficient estimation of the z-score by support vector regression
was first introduced by Washietl et al. [WHS05], and implemented in the non-coding RNA
gene finder RNAz. The speed-up of this approach compared to explicit shuffling and
folding, which is usually done on 1,000 replicas, is tremendous, at minimum a factor of 1,000.
Moreover, computing time is invariant to the length of the sequence, while RNA folding
has a complexity of O(N^3). When considering the z-score as an evaluation criterion in the
RNALfold algorithm, calculation of the z-score becomes a time-consuming factor, as in
a worst-case scenario it has to be done for almost every nucleotide of the sequence. It is
therefore a crucial concern to use support vector models that not only have good
accuracy, but also a moderate number of support vectors (SVs). In this work we extended the
z-score support vector regression to cover a broader range of the sequence spectrum, but
managed at the same time to build models that have significantly fewer SVs than the models
used by RNAz. This was accomplished by normalizing the dependent variables of the
regression, i.e. the average free energy and the standard deviation, by the sequence length.
The dependent variables do not strictly linearly correlate with the sequence length and so
we have to keep the sequence length as an input feature. Nevertheless, redundant points
are created in the training set, which eventually leads to a smaller space to be trained. For