RNAprofiling 2.0: Enhanced cluster analysis of structural ensembles
Forrest Hurley, University of North Carolina at Chapel Hill
Christine Heitsch, Georgia Institute of Technology
March 29, 2023
Abstract
Understanding the base pairing of an RNA sequence provides insight into its molecular structure.
By mining suboptimal sampling data, RNAprofiling 1.0 identifies the dominant helices in low-energy
secondary structures as features, organizes them into profiles which partition the Boltzmann sample, and
highlights key similarities/differences among the most informative, i.e. selected, profiles in a graphical
format. Version 2.0 enhances every step of this approach. First, the featured substructures are expanded
from helices to stems. Second, profile selection includes low-frequency pairings similar to featured ones.
In conjunction, these updates extend the utility of the method to sequences up to length 600, as
evaluated over a sizable dataset. Third, relationships are visualized in a decision tree which highlights the
most important structural differences. Finally, this cluster analysis is made accessible to experimental
researchers in a portable format as an interactive webpage, permitting a much greater understanding of
trade-offs among different possible base pairing combinations.
1 Introduction
The method of sampling RNA secondary structures from the Boltzmann distribution [1, 2] under the
nearest neighbor thermodynamic model (NNTM) [3] provides critical base pairing alternatives to the
minimum free energy (MFE) configuration. Such information can be essential to understanding how
RNA sequences fold and the functionality of these important molecules. Yet, the power of ensemble
analysis can only be realized by identifying the underlying patterns in a sufficiently large set of suboptimal
structures.
RNAprofiling, or just profiling for short, refers to the overall cluster analysis method that organizes
and analyzes a collection of secondary structures according to a set of features. It was developed [4] to
identify the dominant combinations of base pairing signals in the Boltzmann ensemble. RNAprofiling 1.0
(denoted here Pv1) consistently achieves high sample compression together with low information loss on
“small” sequences, on the order of 100 nucleotides (nt). We present here an updated version,
RNAprofiling 2.0 (denoted Pv2), which can mine a stable, informative structural signal from Boltzmann samples
of much longer sequences.
In contrast to other cluster analysis methods like Sfold [5] and RNAshapes [6], Pv2 does not generate
the sample to be analyzed. Rather, it leverages the ensemble analysis power of
state-of-the-art software packages like RNAstructure [7] and ViennaRNA [8]. Hence, we demonstrate here that
Pv2 will reliably report the high probability base pairing combinations for sequences up to 600 nt.
We note that the signal from the Boltzmann ensemble at the substructural unit level, i.e. the features
being considered, remains strong well beyond 1000 nt. However, the probability of different combinations
of these units, i.e. their profiles, decays with sequence length. Like prediction accuracy, this is a reflection
of the NNTM itself, and the sampling method employed. Given a particular Boltzmann sample as input,
Pv2 outputs high quality information in a useful quantity for further hypothesis generation.
As described, the content of that information is determined directly from the input sample. When
introduced [4], it was established that RNAprofiling provides complementary information to both Sfold
and RNAshapes. Moreover, a thorough analysis [9] compared the three, where Pv1 analyzed Boltzmann
samples generated by GTfold [10]. It was found that all three improved over the MFE, but there was no
clear advantage among cluster analysis methods in terms of base pair prediction accuracy.
©2023. This manuscript version is made available under the CC-BY-NC-ND 4.0 license.
arXiv:2303.15552v1 [q-bio.BM] 27 Mar 2023
In terms of efficiency, for a sequence of length 600 nt, Pv2 analyzes a Boltzmann sample of 10,000
structures in about 20 seconds, with the sample generation taking about 5 sec. Shorter sequences and/or
smaller samples take correspondingly less time to analyze, e.g. about 2 sec to analyze a sequence 200
nt and sample of 1000 structures. In contrast [9], Sfold takes about 25 seconds (sampling + analysis) at
this 200 length/1K size scale, as does RNAshapes.
Regardless of which cluster analysis method is used, there are two key points for experimentalists [9].
First, as is well known to the ribonomics community, prediction quality improves if more than one
conformation is considered. Second, the quality is substantially enhanced if the conformations are initially
considered at lower granularity/higher abstraction. This supports a multilayered approach to RNA
secondary structure determination where an early computational step identifies critical structural differences
“to be vetted by further computational analysis, experimental testing, and/or biological insight.”
The new version of RNAprofiling presented here significantly enhances the method’s ability to do just
that. The new code is freely available at github.com/gtDMMB/RNAprofilingV2 under a GPLv2 license
and can be run online through the rnaprofiling.gatech.edu website.
2 Method
Pv2 follows the same three general steps as Pv1. First, identify key substructural units as features.
Second, cluster secondary structures into profiles based on their features, and select the most informative.
Third, visually highlight relationships among the selected profiles. As described, Pv2 provides significant
enhancements at each step. Additionally, the profiling output is now made available as an interactive
webpage which further facilitates compare/contrast across clusters.
2.1 Feature identification
Pv1 introduced “helix class” as its structural unit. A helix (i, j, k) in an RNA secondary structure is a
set of base pairs $\{(i, j), (i+1, j-1), \ldots, (i+k-1, j-k+1)\}$ where at least one of $\{i-1, j+1\}$ and of
$\{i+k, j-k\}$ are single-stranded. Under the NNTM, a hairpin loop must contain at least 3 nucleotides,
which implies that $j - i \ge 2k + 2$. A helix (i, j, k) is maximal if the minimum hairpin length is respected
and neither $(i-1, j+1)$ nor $(i+k, j-k)$ is a canonical base pair. A helix class (HC) consists of all
helices which are a subset of the same maximal helix, and is denoted by that maximal (i, j, k) triplet.
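As an illustration of the helix definition, the following minimal Python sketch extracts (i, j, k) triplets from a single structure in dot-bracket notation. The function names are hypothetical and not taken from the Pv2 codebase. The sketch splits helices at any break in stacking, which is a simplifying assumption relative to the single-stranded condition above, and it does not extend helices to their maximal form since that requires the sequence.

```python
def parse_pairs(dot_bracket):
    """Return base pairs (i, j), 1-indexed with i < j, from a dot-bracket string."""
    stack, pairs = [], []
    for pos, ch in enumerate(dot_bracket, start=1):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            pairs.append((stack.pop(), pos))
    return sorted(pairs)


def helices(dot_bracket):
    """Group stacked pairs (i, j), (i+1, j-1), ... into (i, j, k) triplets."""
    pairs = set(parse_pairs(dot_bracket))
    found = []
    for i, j in sorted(pairs):
        if (i - 1, j + 1) in pairs:
            continue  # not the outermost pair of its helix
        k = 1
        while (i + k, j - k) in pairs:
            k += 1
        found.append((i, j, k))
    return found


# Two helices of length 2 separated by a 2x2 internal loop.
print(helices("((..((...))..))"))  # [(1, 15, 2), (5, 11, 2)]
```

Counting how often each triplet (or its maximal helix) occurs across the sample would then give the HC frequencies used in the next step.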
Profiling always begins by determining all HC present along with their frequency (i.e. estimated
probability) in the sampled secondary structures. When the observed HC are ordered by decreasing
frequency, this yields a distribution with a long tail of low-probability base pairings. The threshold at
which to cut the tail is determined by maximizing the average Shannon information entropy [4]. This
yields a relatively small set of selected helix classes (SHC).
Pv1 used SHC as features, and demonstrated this achieves both high sample compression as well as
low information loss. Pv2 generalizes this approach by using SHC to generate “stem classes” as defined
below.
Like a HC, a stem class (SC) represents sets of possible base pairings. It is built from SHC and
denoted as [i, j; k, l], where the intervals $i \le x \le i+k-1$ and $j-l+1 \le y \le j$ are the minimal
regions such that all base pairs (x, y) represented have their 5’ end in the first and 3’ one in the second.
A “trivial” SC consists of a single SHC with [i, j; k, k]. The SC frequency is the number of sampled
structures where any of the base pairings from the constituent SHC are present.
For our purposes, two (or more) helices are said to form a stem if they are (successively) separated
by at most two single-stranded bases on either side. In other words, if a secondary structure contains
helices $(i, j, k)$ and $(i', j', k')$ with $i+k \le i' \le i+k+2$ and $j-k-2 \le j' \le j-k$, then they form a
stem. This extends to multiple distinct helices in succession, as long as the separation criterion is met.
The idea of a stem is motivated by the NNTM [3], which treats small internal loops (i.e. with sizes 1×1,
1×2/2×1, and 2×2) as special cases, in part to address noncanonical base pairings.
Generalizing to the sets of base pairs which form profiling’s structural units, two HC are considered
stemmable if there exists a helix from each class which together could form a stem. The default gap size
of 2 can be changed by the user if a smaller or larger value is desired. We note that stemmability
is a reflexive and symmetric mathematical relation on HC. A stem class (SC) is then defined to be the
transitive closure of stemmable pairs of SHC, yielding well-defined equivalence classes.
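The stem and stem class definitions can be sketched in Python as follows. This is an illustrative sketch rather than the Pv2 implementation: the function names are hypothetical, stemmability is checked by brute force over all sub-helices of each maximal helix, and the third helix class in the example is made up. The first two triplets are SHC 4 and 12 of the example discussed in Section 2.4.

```python
from itertools import combinations


def forms_stem(h1, h2, gap=2):
    """True if helix h2 = (i2, j2, k2) sits directly inside h1 = (i, j, k),
    separated by at most `gap` unpaired bases on each side."""
    (i, j, k), (i2, j2, _) = h1, h2
    return i + k <= i2 <= i + k + gap and j - k - gap <= j2 <= j - k


def subhelices(maximal):
    """All helices contained in the maximal helix (i, j, k) of a helix class."""
    i, j, k = maximal
    return [(i + a, j - a, m) for a in range(k) for m in range(1, k - a + 1)]


def stemmable(hc1, hc2, gap=2):
    """True if some helix from each class could together form a stem."""
    return any(
        forms_stem(a, b, gap) or forms_stem(b, a, gap)
        for a in subhelices(hc1)
        for b in subhelices(hc2)
    )


def stem_classes(shc, gap=2):
    """Transitive closure of stemmability, computed with union-find."""
    parent = list(range(len(shc)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(range(len(shc)), 2):
        if stemmable(shc[a], shc[b], gap):
            parent[find(a)] = find(b)

    groups = {}
    for idx, hc in enumerate(shc):
        groups.setdefault(find(idx), []).append(hc)
    return sorted(groups.values())


# (16, 30, 4) and (17, 28, 3) merge into one SC; (40, 50, 3) is hypothetical.
print(stem_classes([(16, 30, 4), (17, 28, 3), (40, 50, 3)]))
```

Here the sub-helix (16, 30, 1) and the helix (17, 28, 3) satisfy both inequalities with one unpaired base on the 3’ side, so the two classes are stemmable and end up in the same stem class.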
Observe that the HC definition naturally satisfies transitivity whereas it is imposed on stemmable
pairs. To confirm that SC remain local substructural units, we consider two properties: length and
width. Both are defined precisely in Supplemental Material. Length is an upper bound on the number
of pairings possible from the SC in one structure. Width is a measure of the ‘spread’ where two helices
which form a stem would have a width of 1, 2, or 3 depending on the asymmetry of the small internal
loop/bulge. As will be shown, nearly all SC have this width as well.
2.2 Profile selection
The next step is to cluster the sampled structures into “profiles” determined by a common set of features.
To focus on the most informative combinations, a maximum average entropy threshold is again used to
filter the low probability tail. This yields a relatively small set of selected profiles for further consideration.
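The clustering step can be illustrated with a short Python sketch, assuming each sampled structure has already been reduced to the set of features it contains. The function name and the feature labels in the example are hypothetical.

```python
from collections import Counter


def profile_counts(structures, features):
    """Map each profile (frozenset of features present) to its frequency.

    `structures` is a list of sets, one per sampled structure, holding the
    labels of everything found in that structure; `features` is the set of
    selected features (e.g. SHC or SC labels).
    """
    return Counter(frozenset(s & features) for s in structures)


sample = [{"A", "B", "5"}, {"A", "B", "5", "x"}, {"A", "C"}]
counts = profile_counts(sample, features={"A", "B", "C", "5"})
for profile, n in counts.most_common():
    print(sorted(profile), n)
```

Since every structure maps to exactly one profile, the counts sum to the sample size, which is the sense in which profiles partition the sample.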
In addition to using SC rather than SHC as the default structural unit, Pv2 updates the estimated
probabilities prior to profile selection by considering low frequency pairings similar to the featured ones.
These augmented counts are distinguished as “fuzzy” stem classes (FSC). Intuitively, if ‘enough’
non-SHC pairings in a secondary structure span the 5’ and 3’ sequence regions [i, j; k, l], when expanded
slightly, then the corresponding SC is added to the FSC profile for that structure and the FSC frequency
is increased accordingly. (Details in Supplemental Material.)
In Pv2, the profiles are selected based on FSC by default. In this way, low frequency pairings which
occupy the same sequence regions, approximately, as the featured ones are included in the summary
profile graph information. As with stem versus helix classes, and the stem gap size, this can be changed
by the user.
2.3 Relationship visualization
The third and final step is to highlight similarities and differences across the selected profiles. This
compare/contrast visualization illuminates crucial differences between the most frequently occurring
combinations of base pairings. In Pv1, the summary profile graph was based on the Hasse diagram for
the selected profiles ordered under set inclusion. Pv2 retains this option but also provides (by default) a
complementary perspective via an augmented decision tree. An example is seen in Figure 1, and discussed
further in Section 2.4. Details of the unsupervised tree building procedure are in Supplemental Material.
The advantage of decision trees over Hasse diagrams is that the number of edges remains small, even at
longer sequence lengths, making it more comprehensible for the user. The trade-off is that support for
non-selected feature combinations (the “intersection profiles” from Pv1) can be spread among different
branches.
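For comparison, the edges of a Hasse diagram under set inclusion can be computed as the covering relations among the selected profiles. The sketch below is a generic construction under that assumption; it uses a hypothetical function name and does not add the intersection profiles that Pv1 includes.

```python
def hasse_edges(profiles):
    """Covering relations among profiles ordered by set inclusion.

    Returns pairs (a, b) where a is a proper subset of b and no other
    profile lies strictly between them.
    """
    profiles = [frozenset(p) for p in profiles]
    edges = []
    for a in profiles:
        for b in profiles:
            if a < b and not any(a < c < b for c in profiles):
                edges.append((a, b))
    return edges


for lower, upper in hasse_edges([{"A"}, {"A", "B"}, {"A", "B", "C"}, {"A", "C"}]):
    print(sorted(lower), "->", sorted(upper))
```

In this toy example there is no edge from {A} to {A, B, C}, because both {A, B} and {A, C} lie between them.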
2.4 Interactive output
The most visible enhancement in Pv2 is the profiling output; the summary profile graph is now embedded
in a portable interactive webpage with auxiliary information. Figure 1 shows an example for the FMN
riboswitch (RF00050) in Actinobacillus succinogenes 130Z (CP000746.1/533105–533234). As illustrated,
this provides a compact, informative summary of the Boltzmann sample input which also highlights
critical structural differences.
The Figure 1 caption gives an overview of the information provided. We now consider the decision tree
example in more detail. Each node (oval or rectangle) corresponds to a subset of the Boltzmann sample
input, and is labeled by its size. Rectangles can only be leaves, and denote a selected profile or collection
thereof (dashed). The root node is the full sample provided. Edges are labeled with choices on feature
inclusion and/or exclusion (denoted ¬) that lead to at least one selected profile.
As detailed in Supplemental Material, decisions are grouped based on the split induced by each
feature on the selected profiles among the current set of structures. If there is no forced decision, the
one maximizing the dissimilarity in the two branches (according to the Hellinger distance [11]) is chosen.
Splits indicate critical structural differences, in order of priority for further computational analysis and/or
experimental testing.
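For reference, the Hellinger distance between two discrete probability distributions $P = (p_1, \ldots, p_n)$ and $Q = (q_1, \ldots, q_n)$ is commonly written as below. The $1/\sqrt{2}$ normalization is one standard convention and is an assumption here; how Pv2 forms the two distributions for the branches of a split is specified in Supplemental Material and is not reproduced in this sketch.

```latex
H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{n} \left( \sqrt{p_i} - \sqrt{q_i} \right)^{2}},
\qquad 0 \le H(P, Q) \le 1
```

With this normalization, $H = 0$ exactly when $P = Q$, and $H = 1$ when the two distributions have disjoint support, so a larger value indicates more dissimilar branches.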
For instance, of the full 1000 structures in the sample analyzed for Figure 1, 978 contain base pairings
from 5, A, and B. For these structures, all selected profiles contain all three features, so this is a “forced”
decision.
Letter indexed features denote nontrivial FSC. As seen in the feature display, and confirmed in the
FSC and SHC tables, feature A is composed of two SHC, 1 and 2, that are separated by a small internal
loop of size 1×1, which could be a noncanonical pairing. In contrast, B contains 4 = (16,30,4) and 12
= (17,28,3) which can form a stem with 4 base pairs and a bulge of size 1. Hence, Pv2 treats them as a
single structural unit.
Fuzziness increases the estimated probability of these SC by less than 5% (from exact frequencies
of 952 for A and 954 for B). However, it raises 5 by more than 30%, from 757 to 985. This is useful
information that the sequence region [54,59; 64,69] most likely forms some kind of hairpin structure.
None of the other combinations of inclusion/exclusion on 5, A, B lead to a selected profile, so there is
only one edge in the tree for this common forced decision.
In contrast, the 979 structures are essentially evenly split between including/excluding C. According
to the Hellinger distance, this makes it the highest priority structural difference among the remaining
features for this Boltzmann sample input.
If C is present, then this forces 3 to be included, 10 and 11 to be excluded. Why exclusions are
forced is useful information, and communicated by red pairings below the sequence line in the feature
display. In this case, the only remaining feature uncertainty (as summarized by a contingency table) is
between 7 and 9. Given the overlap in their base pairing regions, resolving this ambiguity may not even
be necessary.
If C is absent, the next critical split identified is 10. If it exists, then different combinations of
the inclusion/exclusion of 9 and of (11,3) are present in the sample, as would be summarized in the
contingency table for II. The table would show that of the 325 structures on this decision path, most
(0.575) have both but some (0.169) have 9 with (¬11, ¬3) while others (0.178) have the opposite (and
0.077 have neither).
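The proportions in such a contingency table can be reproduced with a few lines of Python. This is a sketch with a hypothetical function name. The per-cell counts used in the example (187, 55, 58, and 25 of 325) are not reported above; they are back-computed from the stated proportions and should be read as an assumption.

```python
from collections import Counter


def contingency(profiles, x, y):
    """Proportion of structures for each (has x, has y) combination.

    `profiles` is a list of feature sets, one per structure; `x` and `y`
    are sets of features, each counted as present only if all of its
    members are present.
    """
    counts = Counter((x <= p, y <= p) for p in profiles)
    total = len(profiles)
    return {key: round(n / total, 3) for key, n in counts.items()}


# Assumed counts, chosen to match the proportions quoted in the text.
paths = (
    [{"9", "11", "3"}] * 187   # 9 together with (11, 3)
    + [{"9"}] * 55             # 9 without (11, 3)
    + [{"11", "3"}] * 58       # (11, 3) without 9
    + [set()] * 25             # neither
)
print(contingency(paths, x={"9"}, y={"11", "3"}))
```

Under these assumed counts the four cells evaluate to 0.575, 0.169, 0.178, and 0.077, matching the values quoted above.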
Finally, excluding both C and 10 yields a single selected profile III pictured in bottom right. Based
on this analysis, the most important difference between the structures in the input sample is the presence
of C, and if ¬C then 10. To facilitate further investigation, after selecting a node, users can download
files containing all associated structures. All images in the webpage can be downloaded directly or are
available in the output folder.
3 Results
Previous results [4] demonstrated that Pv1 consistently achieves high sample compression together with
low information loss on “small” sequences around 100 nt. We confirm this on a much larger dataset, and
also demonstrate that the Pv2 enhancements extend the length at which a useful structural signal can
be extracted from a Boltzmann sample up to 600 nt. Results highlight the value of this cluster analysis
approach, as well as the challenges for very long sequences.
3.1 Dataset
Our dataset was built from the curated CONTRAfold [12] dataset, which contains 151 sequences from distinct
Rfam [13] families (version 7.0). Each existing family was augmented by up to 4 additional
sequences from Rfam (version 14.9). Eight of the original families are no longer in Rfam, while 3 had
just one more sequence, 8 had two, and 7 had three. For the remaining 125 families, GC content was used
as a proxy for phylogenetic diversity; four sequences were added iteratively by maximizing the minimum
difference with the previous ones. This yielded a total of 683 sequences over 143 current families.
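The iterative selection described above can be read as a greedy max-min procedure on GC content. The sketch below follows that reading; the function names and toy sequences are hypothetical, and tie-breaking (first candidate wins) is an assumption not stated in the text.

```python
def gc_content(seq):
    """Fraction of G and C nucleotides in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def pick_diverse(seed, candidates, n_extra=4):
    """Greedily add the candidate whose GC content is farthest from all chosen so far."""
    chosen = [seed]
    pool = list(candidates)
    while pool and len(chosen) < n_extra + 1:
        best = max(
            pool,
            key=lambda s: min(abs(gc_content(s) - gc_content(c)) for c in chosen),
        )
        chosen.append(best)
        pool.remove(best)
    return chosen[1:]  # the added sequences, in order of selection


print(pick_diverse("GCAU", ["GGCC", "AAUU", "GCAA", "GGCA", "GAAA"], n_extra=2))
```

In the toy example, the seed has GC content 0.5, so the extremes "GGCC" (1.0) and "AAUU" (0.0) are chosen first.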
For each sequence, 25 different Boltzmann samples of size 10,000 were generated using the RNAlib
python bindings [8]. Analysis reports the results of these 17,075 trials. As addressed below, larger
sample sizes (beyond the typical 1000 [2]) improved result reproducibility across different runs for longer
sequences.
The maximum CONTRAfold sequence length was 568. Excepting one family (RF00177), all additional
ones have length <600 nt. To facilitate profile-level comparisons, the families were divided by
average sequence length $n$ into 5 categories: extra-small (xs), small (s), medium (m), large (l), and
extra-large (xl). They consist, respectively, of 14 families with $23 < n \le 50$, 85 with $50 < n \le 150$, 22
with $150 < n \le 220$, 21 with $270 < n \le 567$, and 1 with $n = 1331.2$.
The Pv1 proof-of-principle dataset had 15 sequences spread over 5 families with lengths ranging from
72 (tRNA) to 133 (5S rRNA) with a 99 nt average. As shown below, SHC consistently yield profiles
with high sample compression and low information loss over the 85 comparable s families. Moreover, by
moving to FSC as features, Pv2 achieves the same profile qualities on m families, and even extracts a
useful signal for l ones. The xl outlier was retained as an example of how much the structural signal in
a Boltzmann sample, i.e. the combination of base pairings, decays with sequence length.
3.2 Features
Pv1 results found that Boltzmann sampling can be very noisy, with many distinct helices generated
even at small lengths. This is only more true for longer sequences with larger (by 10×) sample sizes.
Nonetheless, relatively few features can reproducibly represent most of the base pairing information in a
Boltzmann sample.
We found linear relationships between sequence length and number of different features. (See Supplemental
Material.) There are about 500 distinct helices per 100 nt on average, and moving to HCs yields
a 2-fold reduction. The maximum average entropy thresholding achieves about 25-fold reduction, with
10 SHC per 100 nt. This is further reduced to a rate of 5.14 SC ($R^2 = 0.96$). Since fuzzy counts only
affect the estimated probability, the number of distinct SC/FSC per 100 nt is two orders of magnitude
less than helices.
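As a quick check, the successive reductions can be chained using only the approximate per-100-nt rates quoted above:

```latex
\begin{align*}
\text{helices} &\approx 500 \\
\text{HC} &\approx 500 / 2 = 250 \\
\text{SHC} &\approx 250 / 25 = 10 \\
\text{SC/FSC} &\approx 5.14, \qquad 500 / 5.14 \approx 97 \approx 10^{2}
\end{align*}
```

The final ratio of roughly 97 is the two orders of magnitude referred to in the text.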
This amount of compression still provides good coverage of the base pairing information sampled.
Coverage was computed as the proportion of helices (with multiplicity) which belong to a SHC. Nearly
half (48.4%) of trials had >0.90 coverage, and most (87.3%) covered more than 3/4 of helices sampled.
Coverage decreased slightly with sequence length (at a rate of 0.025 per 100 nt according to the best line
fit). Only 2.0% had coverage below 0.6, and these outliers were generally shorter sequences (<150 nt).
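The coverage computation can be sketched as follows, assuming helices and SHC are both given as (i, j, k) triplets. The function names and the example counts are hypothetical.

```python
from collections import Counter


def contained_in(helix, maximal):
    """True if helix (i, j, k) is a sub-helix of the maximal helix (I, J, K)."""
    (i, j, k), (I, J, K) = helix, maximal
    offset = i - I
    return offset >= 0 and i + j == I + J and offset + k <= K


def coverage(sampled_helices, shc):
    """Proportion of sampled helices, counted with multiplicity, lying in some SHC."""
    counts = Counter(sampled_helices)
    covered = sum(n for h, n in counts.items() if any(contained_in(h, m) for m in shc))
    return covered / sum(counts.values())


# 8 of the 10 sampled helices fall inside the selected class (16, 30, 4).
sampled = [(16, 30, 4)] * 6 + [(17, 29, 2)] * 2 + [(40, 50, 3)] * 2
print(coverage(sampled, shc=[(16, 30, 4)]))  # 0.8
```

The containment test relies on the fact that all pairs in a helix class share the same value of i + j.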
The SC are built from SHC so provide the same coverage with 1/2 as many structural units. We
confirm that SC remain local substructures by considering the length and width distributions. As defined,
a helix has width $w = 1$, and two helices that form a stem have $1 \le w \le 3$ depending on the asymmetry
of the small internal loop/bulge between them. Over all trials, more than half (52.2%) contained only
one SHC, and these trivial SC had average (std) length $l = 5.45$ (2.39) nt. Allowing stems increases the
feature length without significantly increasing the width. Of the nontrivial SC, nearly all (89.1%) had
$w \le 3$ with $l = 11.9$ (5.21) while only 1.2% had $w \ge 6$ with $l = 28.7$ (18.6).
The reproducibility of SHC is very high, across all lengths. For each sequence, this is the average
over all SHC sampled of the percentage of runs (out of 25) in which it is present. See Supplemental Material
for boxplots. Reproducibility for SC, as well as profiles, is computed likewise. However, the criterion for
“is present” is stringent; the compound structure must include exactly the same set of SHC. This means
that any variation in SHC propagates and is amplified. Nonetheless, median SC reproducibility remained
high. Interestingly, the noisier sequences were the shorter ones.
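The reproducibility measure described above can be sketched in Python, assuming each run is summarized as the set of features it selected. The function name is hypothetical, and the example uses 2 runs rather than 25 for brevity.

```python
def reproducibility(runs):
    """Average, over all features seen in any run, of the fraction of runs containing it."""
    seen = set().union(*runs)
    if not seen:
        return 0.0
    return sum(sum(f in run for run in runs) for f in seen) / (len(seen) * len(runs))


# SHC "a" appears in both runs and "b" in one: (1.0 + 0.5) / 2 = 0.75.
print(reproducibility([{"a", "b"}, {"a"}]))  # 0.75

# For SC, each feature is the exact set of SHC it contains, so a changed
# constituent counts as a different feature: (0.5 + 0.5) / 2 = 0.5.
print(reproducibility([{frozenset({1, 2})}, {frozenset({1})}]))  # 0.5
```

The second example shows how variation in a single SHC lowers SC reproducibility under the exact-match criterion.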
Boltzmann samples of size 1K were originally analyzed, and only 10 (of 683) sequences had an average
SHC reproducibility $h \le 0.9$. Hence, more than 3/4 (76.9%) had an average SC reproducibility $s > 0.9$.
Moving to size 10K, as will be discussed in the next section, meant only one (xs) sequence had $h \le 0.9$,
which improved the percentage with $s > 0.9$ to 93.0%. This included all but 4 (of 106) m sequences and
1 (of 105) l ones, which had $0.8 < s \le 0.9$.
This confirms that Pv2 is reliably extracting a very informative base pairing signal from the sampled
structures. Moreover, since the number of possible profiles grows exponentially with features, the greater
compression obtained by Pv2 makes this approach accessible to longer sequences.
3.3 Profiles
Profiles were also assessed for sample compression, information loss, and reproducibility. Five different
feature sets were considered: helix, HC, SHC, SC, and FSC. For each type, the corresponding profiles
were generated, and the maximum average entropy thresholding applied to select the most informative
ones. Results are broken down into groups by average sequence length per family, and report analysis of
10K-sized Boltzmann samples except where noted.
Results given in Figure 2 demonstrate that Pv2 achieves high sample compression with low information
loss for sequences up to 600 nt. Using SHC as features works well in the length range (up to 150
nt) previously reported. The number of selected profiles is low and the coverage is high: 3 and 0.98 for
xs, 5 and 0.82 for s. However, the coverage for m has already dropped to 0.49. The key to overcoming
this length barrier is consolidating SHC into SC.
As seen, Boltzmann sampling can be very noisy. Even for the s sequences, a median of 1,634 different
structures (i.e. helix profiles) were sampled. Decreasing feature granularity filters sample noise. The
difference is particularly noticeable for l sequences when moving from SHC to SC; the median drops
more than 8-fold (from 2,984 to 360). Moreover, the median number of selected profiles is reduced
to 7, not much higher than for m and s. Importantly, this level of sample compression is accompanied by
a corresponding rise in median coverage, which increases from 0.15 to 0.61 for FSC. The corresponding
rise for m was from 0.49 to 0.85.
When analyzing reproducibility, the standard 1K-sized samples [5] did not yield results comparable
to Pv1, but 10K did. Profile reproducibility is affected by the propagation, and amplification, of feature
variability; increased sample size reduces this effect across all sequence lengths. For example, with 1K
samples, although 80.2% of m sequences have an average SHC selected profile reproducibility s > 0.7,
only 10.4% were above 0.9. With 10K samples, 98.1% have s > 0.7 with 69.4% above 0.9. Although the
corresponding numbers for l sequences also improve significantly (to 84.8% from 20.0% and 18.1% from
0% resp.), they are lower due to the confounding effect of profile growth.
Like the number of structural units sampled, the average number of features in a profile also grows
linearly with sequence length (at a rate of about 6 SHC and 4 SC/FSC per 100 nt). This effect is
particularly apparent for the xl sequences, whose SHC and SC reproducibility is as high as any other,
but whose corresponding profiles are not reproducible. (See Supplemental Material.)
Since Pv2 uses FSC by default, these are the selected profiles whose reproducibility was evaluated.
It is very high for xs and s sequences (92.1% and 73.5% above 0.9, resp.) although the outliers in SC
reproducibility are apparent. Nearly all m sequences (88.7%) are above 0.7, with a majority (57.1%)
above 0.9. Even the l ones have 64.8% and 35.2% resp., with a median of 0.84 and interquartile range
of (0.61,0.93).
4 Conclusions
RNAprofiling 2.0 (Pv2) consistently achieves high sample compression together with low information
loss on sequences up to 600 nt, a 4-fold length increase over Pv1. This is accomplished by expanding
the featured substructures from helices to stems, including low-frequency pairings similar to featured
ones in the profile selection, and visualizing their relationships in a decision tree. Pv2 takes as input
a Boltzmann sample, as provided by software packages like RNAstructure or ViennaRNA. The Pv2
output is a portable interactive webpage which provides a compact, informative summary of the sample
provided. Critical structural differences are highlighted to be evaluated further by some combination of
computational analysis, experimental testing, and biological insight.
5 Acknowledgments
This work was supported by funds from the National Institutes of Health (R01GM126554 to CH) and
the National Science Foundation (DMS1344199 to CH). Additional support for FH provided by a grant
from the National Institute of Environmental Health Sciences (2T32ES007018 to Rebecca Fry, UNC).
References
[1] J. S. McCaskill, “The equilibrium partition function and base pair binding probabilities for RNA
secondary structure,” Biopolymers, vol. 29, no. 6-7, pp. 1105–19, 1990.
[2] Y. Ding and C. E. Lawrence, “A statistical sampling algorithm for RNA secondary structure
prediction,” Nucleic Acids Res, vol. 31, no. 24, pp. 7280–7301, 2003.
[3] D. H. Turner and D. H. Mathews, “NNDB: the nearest neighbor parameter database for predicting
stability of nucleic acid secondary structure,” Nucleic Acids Res, vol. 38, pp. D280–2, 2010.
[4] E. Rogers and C. Heitsch, “Profiling small RNA reveals multimodal substructural signals in a
Boltzmann ensemble,” Nucleic Acids Res, p. gku959, 2014.
[5] Y. Ding, C. Y. Chan, and C. E. Lawrence, “Sfold web server for statistical folding and rational
design of nucleic acids,” Nucleic Acids Res, vol. 32, no. Web Server Issue, pp. W135–W141, 2004.
[6] P. Steffen, B. Voß, M. Rehmsmeier, J. Reeder, and R. Giegerich, “RNAshapes: an integrated RNA
analysis package based on abstract shapes,” Bioinformatics, vol. 22, no. 4, pp. 500–503, 2006.
[7] J. S. Reuter and D. H. Mathews, “RNAstructure: Software for RNA secondary structure prediction
and analysis,” BMC Bioinformatics, vol. 11, 2010.
[8] R. Lorenz, S. H. Bernhart, C. Höner zu Siederdissen, H. Tafer, C. Flamm, P. F. Stadler, and I. L.
Hofacker, “ViennaRNA package 2.0,” Algorithms for Molecular Biology, vol. 6, 2011.
[9] E. Rogers and C. Heitsch, “New insights from cluster analysis methods for RNA secondary structure
prediction,” Wiley Interdiscip Rev RNA, vol. 7, no. 3, pp. 278–94, 2016.
[10] S. Swenson, J. Anderson, A. Ash, P. Gaurav, Z. Sükösd, D. Bader, S. Harvey, and C. Heitsch,
“GTfold: Enabling parallel RNA secondary structure prediction on multi-core desktops,” BMC Res
Notes, vol. 5, no. 1, p. 341, 2012.
[11] L. Le Cam and G. L. Yang, Asymptotics in Statistics. Springer New York, 2 ed., 2000.
[12] C. B. Do, D. A. Woods, and S. Batzoglou, “CONTRAfold: RNA secondary structure prediction
without physics-based models,” Bioinformatics, vol. 22, pp. e90–e98, 2006.
[13] I. Kalvari, E. P. Nawrocki, N. Ontiveros-Palacios, J. Argasinska, K. Lamkiewicz, M. Marz,
S. Griffiths-Jones, C. Toffano-Nioche, D. Gautheret, Z. Weinberg, E. Rivas, S. R. Eddy, R. D.
Finn, A. Bateman, and A. I. Petrov, “Rfam 14: expanded coverage of metagenomic, viral and
microRNA families,” Nucleic Acids Research, vol. 49, pp. D192–D200, 2021.
Figure 1: Example of Pv2 output webpage. Left column displays an interactive summary profile graph, by default
a decision tree. Nodes are clickable and labeled with corresponding number of sampled structures. Center column has
a dynamic panel (top) displaying features in arc diagram format for chosen (grey) node from profile graph. Decisions
on incoming edge are emphasized; positive ones in bold above the sequence line, and negative ones, denoted with
¬ in the tree, in red below. Features are labeled according to the FSC table (center, middle) which lists SC regions
[i, j; k, l] with fuzzy frequencies and SHC contained. Nontrivial FSC are denoted by letters, and trivial ones by their
SHC index. All SHC are listed (center, bottom) with maximal (i, j, k) triplet and integer indexed in decreasing exact
frequency as given. Selected profiles, or groups thereof, are denoted by rectangular leaves in decision tree and labeled
by roman numerals. More than one selected profile is represented if the incoming edge is a contingency (dashed). In
this case, the contingency table is given below the feature display when the leaf (dashed rectangle) is chosen. Right
column shows most frequent secondary structure for each leaf in radial (or arc) diagram format. Users can download
all structures corresponding to the chosen node, or just the most frequent, for further analysis. See Section 2.4, and
Supplemental Material, for further information.
Figure 2: Profile analysis with feature sets ordered by increasing abstraction/decreasing granularity from helices
through helix classes to stem classes. Median and interquartile range (IQR) are reported over distinct trials, excluding
those where every sampled structure is a unique profile. Counts (top and middle) are log scale. Coverage (bottom)
is proportion of sampled structures represented by selected profiles.
Supplemental Material for
RNAprofiling 2.0: Enhanced cluster analysis of structural ensembles
Forrest Hurley, University of North Carolina at Chapel Hill
Christine Heitsch, Georgia Institute of Technology
March 29, 2023
1 Additional results
Figures 1, 2, and 3 provide further details on the growth rate of features, growth rate of profile length, and
reproducibility for 1K and 10K samples respectively.
Recall that Pv2 denotes RNAprofiling 2.0, and Pv1 the original, i.e. RNAprofiling 1.0. Features considered
are helices, helix classes (HC), selected helix classes (SHC), stem classes (SC), and fuzzy stem classes
(FSC). As features, the difference between SC and FSC is their frequency, i.e. estimated probability, in the
Boltzmann sample input. As discussed below in Section 3, this changes both the profile selection and the
decision tree construction.
2 Implementation information
Pv2 is freely available under the GPLv2 license at github.com/gtDMMB/RNAprofilingV2 and can be run
online via the rnaprofiling.gatech.edu website.
When run online, the default option is to upload a Boltzmann sample in either ct or dot file format.
To expedite exploratory analysis, the website also provides the option of generating a sample with either
RNAstructure 6.4 [1] or ViennaRNA 2.4.14 [2].
A command line interface is available in the form of a Python script. The script provides all the same
options as the web interface, although generating samples is disabled unless there is a local install of
RNAstructure or ViennaRNA.
This new version has a completely new codebase, written in Python rather than C/C++. Pv2 is
implemented and tested with Python 3 (3.6.9) using the numpy (1.19.5), networkx (2.4), matplotlib (3.3.4), and
pygraphviz (1.6) libraries. A graphviz install [3] is also required to generate the summary profile graph. The
program is loaded on the web server using PyInstaller (4.10).
The full output from both the web interface and the command line interface is displayed in HTML with
JavaScript and should work in most modern internet browsers. The svg.min library is used to render graphviz
output in the browser.
Table 1 lists the main Pv2 parameters. There are some additional IO options for sequences and samples
available via the command line and website.
Pv1 functionality may be recreated by using SHC as features, and a Hasse diagram as the summary
profile graph output. By default, a maximum average entropy threshold determines the selection of HC and
of profiles. As in Pv1, this can be overridden by a user-specified cutoff.
The default options for stem gap, fuzzy counts, and contingency nodes can be altered by users to suit
their particular analysis goals. For example, by default, 75% of the full binary tree must be present before it
is collapsed into a contingency node, but this proportion can be changed to yield a larger or smaller output tree as needed.
©2023. This manuscript version is made available under the CC-BY-NC-ND 4.0 license.
arXiv:2303.15552v1 [q-bio.BM] 27 Mar 2023
Figure 1: Number of distinct features by type for each sequence. Regression lines assume 0 intercept. Note
difference in y-axis scales between graphs.
Figure 2: Average length of profiles by feature type for each sequence. Difference in regression line slopes
for SC and FSC is not significant (p=0.08).
Figure 3: Reproducibility of features and of profiles for 1K and 10K samples across family length categories.
Although 1K suffices for features, profile reproducibility degrades with sequence length. Larger, e.g. 10K,
samples yield a more reliable structural signal in general.
Finally, we highlight the possibility of having consistent HC labels. This can be very useful if comparing
results across multiple different samples for the same sequence. When invoked, the HC labels are based on
the sequence itself, and so are independent of the particular sample frequencies.
Parameter                           Value                                 Default
Output Type                         {Hasse, Tree}                         Tree
Feature Type                        {Selected Helix Class, Stem Class}    Stem Class
Frequency Format                    {Counts, Percentages, Decimals}       Counts
Helix Class Selection Cutoff        Positive Integer or Auto              Auto
Profile Selection Cutoff            Positive Integer or Auto              Auto
Stem Gap                            Non-Negative Integer                  2
Use Fuzzy Stem Counts               Boolean                               True
Fuzzy Dilation Size                 Positive Integer                      5
Fuzzy Basepair Frequency Margin     Real in [0,1]                         0.333
Min Contingency Node Proportion     Real in [0,1]                         0.75
Use Consistent Helix Class Labels   Boolean                               False
Table 1: Main code parameters. The “Cutoff” options override the standard profiling threshold method; if used,
helix classes, resp. profiles, with lower frequency in the input sample will not be considered. The three
“Fuzzy” options apply only to SC and are ignored if using HC. The “Contingency” option is ignored if the output is
not a decision tree. Consistency in helix labeling can be very useful when comparing multiple analyses for
the same sequence.
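As a reading aid, the defaults in Table 1 can be collected into a single configuration object. The sketch below is illustrative only: the class name `ProfilingOptions` and all field names are hypothetical and do not correspond to actual Pv2 command line flags or internals; only the allowed values and defaults are taken from Table 1. `None` stands in for “Auto”.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProfilingOptions:
    """Hypothetical container mirroring Table 1 (names are assumptions)."""
    output_type: str = "Tree"                      # {"Hasse", "Tree"}
    feature_type: str = "Stem Class"               # or "Selected Helix Class"
    frequency_format: str = "Counts"               # or "Percentages", "Decimals"
    helix_class_cutoff: Optional[int] = None       # None stands for "Auto"
    profile_cutoff: Optional[int] = None           # None stands for "Auto"
    stem_gap: int = 2
    use_fuzzy_stem_counts: bool = True
    fuzzy_dilation_size: int = 5
    fuzzy_basepair_frequency_margin: float = 0.333
    min_contingency_node_proportion: float = 0.75
    use_consistent_helix_class_labels: bool = False

    def __post_init__(self):
        # Range checks follow the "Value" column of Table 1.
        if self.output_type not in {"Hasse", "Tree"}:
            raise ValueError("output_type must be 'Hasse' or 'Tree'")
        if self.feature_type not in {"Selected Helix Class", "Stem Class"}:
            raise ValueError("unknown feature_type")
        if self.stem_gap < 0 or self.fuzzy_dilation_size <= 0:
            raise ValueError("stem_gap >= 0 and fuzzy_dilation_size > 0 required")
        for value in (self.fuzzy_basepair_frequency_margin,
                      self.min_contingency_node_proportion):
            if not 0.0 <= value <= 1.0:
                raise ValueError("proportions must lie in [0, 1]")


# A Pv1-like configuration, following Section 2: SHC features and a Hasse diagram.
pv1_like = ProfilingOptions(output_type="Hasse",
                            feature_type="Selected Helix Class")
```
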
3 Technical details of method
3.1 Stem class length and width
A helix $(i, j, k)$ has length $k$, the number of base pairs it contains. A HC is denoted by its maximal helix
$(i, j, k)$ and has length $k$ since the maximum number of base pairs possible in any of its constituent helices
is $k$. Observe that $k - 1$ is half the Manhattan distance from the outermost possible pairing $(i, j)$ to the
innermost $(i + k - 1, j - k + 1)$ in the usual $(x, y)$ plane; $k - 1 = \frac{1}{2} \left( |i - (i + k - 1)| + |j - (j - k + 1)| \right)$.
The SC length is defined analogously and will be the maximum number of pairings possible in any of its
constituent combinations. First observe that the outermost possible base pair is the one with the greatest
contact distance, i.e. where $j - i$ is maximal. Let $M$ denote this value, and $m$ denote the least possible, which
corresponds to the innermost pair. Since the Manhattan distance between the outermost and innermost
pairs is equal to $M - m$, the length of a stem class is defined to be $1 + (M - m)/2$.
The SC width will capture a measure of the ‘spread’ of observed pairings represented. This is done by
counting the number of “helical diagonals” covered by the SC. Here, a helical diagonal denotes a line in the
$(x, y)$ plane with slope $-1$ which intersects the identity $x = y$ at points where $x$ or $x + 1/2$ is a positive
integer. The HC $h = (i, j, k)$ has width 1 since all pairings lie on a single diagonal with midpoint $x = (i + j)/2$
as the intersection. If the helix class $h' = (i', j', k')$ is stemmable with $h$, then $|(i + j)/2 - (i' + j')/2| \le 1$, and
the number of diagonals covered is either 1, 2, or 3 depending on the asymmetry of the internal loop/bulge
separating them, i.e. whether $((j - k + 1) - j' - 1) - (i' - (i + k - 1) - 1)$ is $0$, $\pm 1$, or $\pm 2$. For a stem class
$[i, j; k, l]$, let $C$ be the set of helix class midpoints. Then its width is $1 + 2 (\max C - \min C)$.
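The length and width definitions above can be checked numerically with a short sketch. It assumes a stem class is supplied as a list of its constituent helix classes, each a triple (i, j, k); this input format and the function names are illustrative choices, not the Pv2 data model. Exact fractions are used so that half-integer midpoints are not rounded.

```python
from fractions import Fraction


def helix_class_length(i, j, k):
    """Length of HC (i, j, k): one plus half the Manhattan distance
    between the outermost pair (i, j) and the innermost (i+k-1, j-k+1)."""
    manhattan = abs(i - (i + k - 1)) + abs(j - (j - k + 1))
    return 1 + manhattan // 2


def stem_class_length(helix_classes):
    """1 + (M - m)/2, with M the largest and m the smallest contact
    distance j - i over all possible pairs of the constituent helix classes."""
    M = max(j - i for i, j, k in helix_classes)
    m = min((j - k + 1) - (i + k - 1) for i, j, k in helix_classes)
    return 1 + Fraction(M - m, 2)


def stem_class_width(helix_classes):
    """1 + 2 (max C - min C), where C is the set of midpoints (i + j)/2."""
    midpoints = [Fraction(i + j, 2) for i, j, k in helix_classes]
    return 1 + 2 * (max(midpoints) - min(midpoints))


# Two helix classes separated by a symmetric 1x1 internal loop: one diagonal.
symmetric = [(1, 20, 3), (5, 16, 3)]
# Two helix classes separated by a single-nucleotide bulge: two diagonals.
bulged = [(1, 20, 3), (4, 16, 3)]
```

For the symmetric example, the midpoints coincide and the width is 1, while the bulged example shifts the midpoint by 1/2 and the width becomes 2, matching the asymmetry cases 0 and ±1 in the text.
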
3.2 Fuzzy stem class frequencies
Fuzzy counts address low-frequency base pairings, i.e. non-SHC ones, that are ‘close’ to a SC. These
augmented frequencies are distinguished as FSC. A secondary structure has both a SC profile as well as a
(possibly enlarged) FSC one. The frequency of a profile is the number of structures in the input sample
with those features (and no more). The difference between SC and FSC as features affects the distribution
of profile frequencies which is used both for selection, and also to build the decision tree.
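The “and no more” condition means that profile frequencies count exact feature sets rather than supersets. A minimal sketch, assuming each sampled structure has already been reduced to its set of features (the function name and input format are hypothetical):

```python
from collections import Counter


def profile_frequencies(structure_features):
    """Count structures per profile, where a profile is the exact set of
    features a structure contains (no more, no fewer)."""
    return Counter(frozenset(features) for features in structure_features)


# Toy sample of four structures over features "A" and "B".
sample = [{"A", "B"}, {"A"}, {"A", "B"}, set()]
freqs = profile_frequencies(sample)
```

Here the structure with only feature A contributes to the profile {A} and not to {A, B}, so the profiles partition the sample.
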
Pv2 uses fuzzy counts by default, which are found as follows. Consider a secondary structure $S$ and its
SC profile $P$. For each SC not already in $P$, expand its region $[i, j; k, l]$ slightly and count the non-SHC
base pairs from $S$ which fall inside. This number is then compared to a baseline. If it is high enough, then
$[i, j; k, l]$ is added to the FSC profile for $S$. A base pair $(x, y)$ falls inside the expanded region of $[i, j; k, l]$
with dilation size $b$ if $i - b \le x < i + k + b$ and $j - l - b < y \le j + b$. The baseline is a fixed fraction, by
default (1/3), of the average number of base pairs in the expanded region over all structures with $[i, j; k, l]$
in their SC profile.
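The fuzzy membership test can be sketched as follows. The sketch assumes a stem class is given as a tuple (i, j, k, l), base pairs as (x, y) tuples, and that the caller has already filtered a structure down to its non-SHC base pairs. Reading “high enough” as “greater than or equal to the baseline” is an assumption, and all function names are hypothetical.

```python
from statistics import mean


def in_expanded_region(pair, stem_class, b=5):
    """True if (x, y) lies in the region of [i, j; k, l] dilated by b."""
    x, y = pair
    i, j, k, l = stem_class
    return (i - b <= x < i + k + b) and (j - l - b < y <= j + b)


def count_in_region(pairs, stem_class, b=5):
    """Number of the given base pairs inside the expanded region."""
    return sum(in_expanded_region(p, stem_class, b) for p in pairs)


def fuzzy_baseline(structures_with_sc, stem_class, b=5, margin=1 / 3):
    """margin times the average number of base pairs in the expanded region,
    taken over structures whose SC profile contains the stem class."""
    return margin * mean(count_in_region(s, stem_class, b)
                         for s in structures_with_sc)


def joins_fsc_profile(non_shc_pairs, stem_class, baseline, b=5):
    """Assumption: the count is 'high enough' when it is >= the baseline."""
    return count_in_region(non_shc_pairs, stem_class, b) >= baseline


sc = (10, 40, 4, 4)
with_sc = [[(10, 40), (11, 39), (12, 38)],
           [(10, 40), (11, 39), (12, 38), (13, 37), (50, 60)]]
baseline = fuzzy_baseline(with_sc, sc)
```

With dilation size 5, the region for this toy stem class covers 5 ≤ x ≤ 18 and 32 ≤ y ≤ 45, so the pair (50, 60) does not contribute to the baseline.
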
3.3 Decision tree construction
A path starting at the root of the tree corresponds to a sequence of choices, positive and/or negative, on
features which results in one or more selected profiles as a leaf. As will be explained, the multiplicity is due
to the use of “contingency” leaf nodes. By default, fuzzy frequencies are used to build the tree, but exact
ones can be chosen instead.
Every node corresponds to a subset of the Boltzmann sample, and is labeled with the number of structures
under consideration. The root node is the full sample input, i.e. all nonempty profiles. Subsequent nodes
consist of groups of profiles determined by prior decisions on feature inclusion/exclusion. Edges denote
decisions and are labeled with one or more features, with negative choices indicated by ¬. Nontrivial SC are
denoted by letters, and trivial ones by their SHC integer index. Subset sizes are updated after every decision
to remove the profiles no longer in consideration.
The choice of inclusion/exclusion for each remaining feature splits the selected profiles currently
under consideration. Features which yield the same split are grouped together into a common decision for
consideration.
A decision is “forced” if the other side of the split is empty, i.e. if none of the selected profiles in the
current group have the opposite choice of feature(s). There is at most one forced common decision possible,
and it has priority. It often includes multiple features, particularly negative options. In this case, there is a
single down edge from the current node.
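The grouping of features into common decisions, and the detection of a forced decision, can be sketched as below. Profiles are assumed to be frozensets of feature labels, and a split is treated as an unordered pair of sides, so that a positive choice on one feature and a negative choice on another may fall into the same group. This representation is an illustrative assumption rather than the Pv2 implementation.

```python
from collections import defaultdict


def common_decisions(profiles, features):
    """Group features by the (unordered) split they induce on the profiles."""
    profiles = frozenset(profiles)
    groups = defaultdict(list)
    for feature in sorted(features):
        with_f = frozenset(p for p in profiles if feature in p)
        split = frozenset({with_f, profiles - with_f})
        groups[split].append(feature)
    return dict(groups)


def forced_decision(profiles, features):
    """Signed features whose opposite side of the split is empty:
    present in every profile (positive) or in none (negative)."""
    profiles = frozenset(profiles)
    literals = []
    for feature in sorted(features):
        n_with = sum(feature in p for p in profiles)
        if n_with == len(profiles):
            literals.append(feature)
        elif n_with == 0:
            literals.append("¬" + feature)
    return literals


profiles = [frozenset("AB"), frozenset("AC")]
groups = common_decisions(profiles, "ABCD")
forced = forced_decision(profiles, "ABCD")
```

In this toy case A is in both profiles and D in neither, so the single forced decision is “A, ¬D”; B and C induce the same nontrivial split and form one common decision.
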
Otherwise, there will be two down edges, one for each side of the split for the chosen decision. Note that
both sides contain at least one selected profile. In this case, the different possible common decisions (which
may consist of a single feature) are considered. The one which maximizes the Hellinger distance is chosen.
The Hellinger distance [4] is computed over discrete probability distributions $p$ and $q$ defined on sample
space $S$ as
$$D_H(p, q, S) = \sqrt{\frac{1}{2} \sum_{s \in S} \left( \sqrt{p(s)} - \sqrt{q(s)} \right)^2}. \tag{1}$$
It is a measure of the similarity of $p$ and $q$, and achieves a maximum of 1, i.e. the greatest dissimilarity, when
$p$ and $q$ have disjoint support.
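Equation (1) translates directly into code. The following sketch represents each distribution as a dictionary from outcomes to probabilities, with missing keys treated as probability zero; the function name and this sparse representation are choices made here for illustration.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as
    dicts mapping outcomes to probabilities (missing keys count as 0)."""
    support = set(p) | set(q)
    total = sum((math.sqrt(p.get(s, 0.0)) - math.sqrt(q.get(s, 0.0))) ** 2
                for s in support)
    return math.sqrt(0.5 * total)


identical = hellinger({"x": 0.5, "y": 0.5}, {"x": 0.5, "y": 0.5})
disjoint = hellinger({"a": 1.0}, {"b": 1.0})
```

Identical distributions give distance 0, and distributions with disjoint support give the maximum of 1.
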
Let $F$ be the set of features corresponding to a common split in the current selected profiles. The sample
space $S$ for the Hellinger distance computation corresponding to $F$ is the set of all possible combinations of
the remaining features after removing prior decisions (corresponding to the path to the current node) and
the feature(s) in $F$. The discrete distributions $p$ and $q$ are the normalized frequencies from the two sides
of the split; their support over $S$ is typically sparse. The decision chosen is the $F$ where $p$ and $q$ are most
dissimilar, with ties broken by lexicographic ordering.
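One way to read this selection step is sketched below. It assumes profile frequencies are given as a dictionary from frozensets of features to counts, that each candidate decision is a frozenset of features inducing a common split, and that the split is determined by presence of any one of those features. Breaking ties on the sorted feature labels is an assumption about what “lexicographic ordering” refers to, and both sides of the split are assumed nonempty (i.e. the decision is not forced).

```python
import math
from collections import Counter


def hellinger(p, q):
    """Hellinger distance for dict-based discrete distributions."""
    support = set(p) | set(q)
    return math.sqrt(0.5 * sum(
        (math.sqrt(p.get(s, 0.0)) - math.sqrt(q.get(s, 0.0))) ** 2
        for s in support))


def side_distribution(freqs, removed):
    """Normalized frequencies of one side of a split, indexed by the
    combination of features left after dropping the removed ones."""
    counts = Counter()
    for profile, n in freqs.items():
        counts[profile - removed] += n
    total = sum(counts.values())  # assumed positive: the side is nonempty
    return {s: n / total for s, n in counts.items()}


def split_distance(freqs, decision, decided=frozenset()):
    """Hellinger distance between the two sides of the split induced by a
    common decision, after removing prior decisions and the decision itself."""
    rep = min(decision)  # any feature of the decision induces the same split
    removed = frozenset(decided) | frozenset(decision)
    with_rep = {pr: n for pr, n in freqs.items() if rep in pr}
    without_rep = {pr: n for pr, n in freqs.items() if rep not in pr}
    return hellinger(side_distribution(with_rep, removed),
                     side_distribution(without_rep, removed))


def choose_decision(freqs, decisions, decided=frozenset()):
    """Most dissimilar split; ties broken by sorted feature labels."""
    return min(decisions,
               key=lambda d: (-split_distance(freqs, d, decided), sorted(d)))


freqs = {frozenset("AB"): 50, frozenset("AC"): 30,
         frozenset("B"): 10, frozenset("C"): 10}
a, bc = frozenset("A"), frozenset("BC")
chosen = choose_decision(freqs, [bc, a])
```

In this toy sample B and C induce the same split and form one candidate decision; splitting on A separates the residual distributions slightly more, so A is chosen.
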
This process of grouping remaining features into decisions, and choosing one of them, proceeds down
each branch of the tree until a selected profile is reached. At this point, the tree is evaluated for contingency
nodes.
All the descendants of a non-leaf node are replaced by a contingency if two conditions are met (and these
are not met by any of its ancestors). First, at least 75% of the full binary tree is present. Second, all paths
from the node being evaluated to a leaf descendant have the same set of common decisions. If so, then that
set is presented in a single contingency table.
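The two conditions can be sketched as a simple check on the leaves below a node. The sketch makes two assumptions that the text does not settle: each leaf is described by the tuple of signed decisions on the path from the node, and the “75% of the full binary tree” condition is measured by comparing the number of leaves present with the 2^n leaves of a full binary tree over n decisions. The function name is hypothetical.

```python
def is_contingency(leaf_paths, min_proportion=0.75):
    """leaf_paths holds one entry per leaf descendant: a tuple of
    (decision_label, is_positive) pairs along the path from the node.
    Assumption: tree completeness is measured by counting leaves."""
    decision_sets = {frozenset(label for label, _ in path)
                     for path in leaf_paths}
    if len(decision_sets) != 1:
        return False  # paths do not share one set of common decisions
    n_decisions = len(next(iter(decision_sets)))
    return len(set(leaf_paths)) >= min_proportion * 2 ** n_decisions


three_of_four = [(("A", True), ("B", True)),
                 (("A", True), ("B", False)),
                 (("A", False), ("B", True))]
two_of_four = three_of_four[:2]
mixed = [(("A", True), ("B", True)), (("A", False), ("C", True))]
```

Under this reading, three of the four possible combinations of decisions A and B meet the default 75% threshold, while two of four do not, and paths over different decision sets never qualify.
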
In this case, all frequencies in the table are reported, not just the ones for selected profiles. The decision
edges are collapsed down to a single (dashed) contingency edge, labeled with all the decisions so condensed.
The resulting contingency node (dashed rectangle) compactly represents multiple selected profiles. Its
frequency is updated to include the low-frequency structures reported in the contingency table.
References
[1] J. S. Reuter and D. H. Mathews, “RNAstructure: Software for RNA secondary structure prediction and
analysis,” BMC Bioinformatics, vol. 11, 2010.
[2] R. Lorenz, S. H. Bernhart, C. Höner zu Siederdissen, H. Tafer, C. Flamm, P. F. Stadler, and I. L.
Hofacker, “ViennaRNA Package 2.0,” Algorithms for Molecular Biology, vol. 6, 2011.
[3] E. R. Gansner and S. C. North, “An open graph visualization system and its applications to software
engineering,” Softw. Pract. Exper., pp. 1–5, 1999.
[4] L. Le Cam and G. L. Yang, Asymptotics in Statistics. Springer New York, 2nd ed., 2000.