Available via license: CC BY-NC-ND 4.0
RNAprofiling 2.0: Enhanced cluster analysis of structural ensembles∗
Forrest Hurley, University of North Carolina at Chapel Hill
Christine Heitsch, Georgia Institute of Technology
March 29, 2023
Abstract
Understanding the base pairing of an RNA sequence provides insight into its molecular structure.
By mining suboptimal sampling data, RNAprofiling 1.0 identifies the dominant helices in low-energy
secondary structures as features, organizes them into profiles which partition the Boltzmann sample, and
highlights key similarities/differences among the most informative, i.e. selected, profiles in a graphical
format. Version 2.0 enhances every step of this approach. First, the featured substructures are expanded
from helices to stems. Second, profile selection includes low-frequency pairings similar to featured ones.
In conjunction, these updates extend the utility of the method to sequences up to length 600, as
evaluated over a sizable dataset. Third, relationships are visualized in a decision tree which highlights the
most important structural differences. Finally, this cluster analysis is made accessible to experimental
researchers in a portable format as an interactive webpage, permitting a much greater understanding of
trade-offs among different possible base pairing combinations.
1 Introduction
The method of sampling RNA secondary structures from the Boltzmann distribution [1, 2] under the
nearest neighbor thermodynamic model (NNTM) [3] provides critical base pairing alternatives to the
minimum free energy (MFE) configuration. Such information can be essential to understanding how
RNA sequences fold — and the functionality of these important molecules. Yet, the power of ensemble
analysis can only be realized by identifying the underlying patterns in a sufficiently large set of suboptimal
structures.
RNAprofiling, or just profiling for short, refers to the overall cluster analysis method that organizes
and analyzes a collection of secondary structures according to a set of features. It was developed [4] to
identify the dominant combinations of base pairing signals in the Boltzmann ensemble. RNAprofiling 1.0
(denoted here Pv1) consistently achieves high sample compression together with low information loss on
“small” sequences, on the order of 100 nucleotides (nt). We present here an updated version,
RNAprofiling 2.0 (denoted Pv2), which can mine a stable, informative structural signal from Boltzmann samples
of much longer sequences.
In contrast to other cluster analysis methods like Sfold [5] and RNAshapes [6], Pv2 does not generate
the sample to be analyzed. Rather, it is available to leverage the ensemble analysis power of
state-of-the-art software packages like RNAstructure [7] and ViennaRNA [8]. Hence, we demonstrate here that
Pv2 will reliably report the high probability base pairing combinations for sequences up to 600 nt.
We note that the signal from the Boltzmann ensemble at the substructural unit level, i.e. the features
being considered, remains strong well beyond 1000 nt. However, the probability of different combinations
of these units, i.e. their profiles, decays with sequence length. Like prediction accuracy, this is a reflection
of the NNTM itself, and the sampling method employed. Given a particular Boltzmann sample as input,
Pv2 outputs high quality information in a useful quantity for further hypothesis generation.
As described, the content of that information is determined directly from the input sample. When
introduced [4], it was established that RNAprofiling provides complementary information to both Sfold
and RNAshapes. Moreover, a thorough analysis [9] compared the three, where Pv1 analyzed Boltzmann
samples generated by GTfold [10]. It was found that all three improved over the MFE, but there was no
clear advantage among cluster analysis methods in terms of base pair prediction accuracy.
∗©2023. This manuscript version is made available under the CC-BY-NC-ND 4.0 license.
arXiv:2303.15552v1 [q-bio.BM] 27 Mar 2023
In terms of efficiency, for a sequence of length ∼600 nt, Pv2 analyzes a Boltzmann sample of 10,000
structures in about 20 seconds, with the sample generation taking about 5 sec. Shorter sequences and/or
smaller samples take correspondingly less time to analyze, e.g. about 2 sec to analyze a sequence ∼200
nt and sample of 1000 structures. In contrast [9], Sfold takes about 25 seconds (sampling + analysis) at
this 200 length/1K size scale, as does RNAshapes.
Regardless of which cluster analysis method is used, there are two key points for experimentalists [9].
First, as is well known to the ribonomics community, prediction quality improves if more than one
conformation is considered. Second, the quality is substantially enhanced if the conformations are initially
considered at lower granularity/higher abstraction. This supports a multilayered approach to RNA
secondary structure determination where an early computational step identifies critical structural differences
“to be vetted by further computational analysis, experimental testing, and/or biological insight.”
The new version of RNAprofiling presented here significantly enhances the method’s ability to do just
that. The new code is freely available at github.com/gtDMMB/RNAprofilingV2 under a GPLv2 license
and can be run online through the rnaprofiling.gatech.edu website.
2 Method
Pv2 follows the same three general steps as Pv1. First, identify key substructural units as features.
Second, cluster secondary structures into profiles based on their features, and select the most informative.
Third, visually highlight relationships among the selected profiles. As described, Pv2 provides significant
enhancements at each step. Additionally, the profiling output is now made available as an interactive
webpage which further facilitates compare/contrast across clusters.
2.1 Feature identification
Pv1 introduced the “helix class” as its structural unit. A helix (i, j, k) in an RNA secondary structure is a
set of base pairs {(i, j), (i+1, j−1), ..., (i+k−1, j−k+1)} where at least one of {i−1, j+1} and at least one of
{i+k, j−k} is single-stranded. Under the NNTM, a hairpin loop must contain at least 3 nucleotides,
which implies that j − i − 2k ≥ 2. A helix (i, j, k) is maximal if the minimum hairpin length is respected
and neither (i−1, j+1) nor (i+k, j−k) is a canonical base pair. A helix class (HC) consists of all
helices which are a subset of the same maximal helix, and is denoted by that maximal (i, j, k) triplet.
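The definitions above can be illustrated with a minimal Python sketch. It is not taken from the Pv2 codebase: the function names, the dot-bracket input format, and the 1-indexed positions are assumptions made for illustration, and a helix is treated here simply as a maximal run of stacked pairs found in one structure.

```python
# Illustrative sketch only; names and input conventions are assumptions.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def base_pairs(dot_bracket):
    """Return the sorted 1-indexed base pairs (i, j) of a dot-bracket string."""
    stack, pairs = [], []
    for pos, ch in enumerate(dot_bracket, start=1):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            pairs.append((stack.pop(), pos))
    return sorted(pairs)


def helices(dot_bracket):
    """Group stacked pairs (i, j), (i+1, j-1), ... into triplets (i, j, k)."""
    pairs = set(base_pairs(dot_bracket))
    found = []
    for (i, j) in sorted(pairs):
        if (i - 1, j + 1) in pairs:
            continue  # (i, j) is not the outermost pair of its helix
        k = 1
        while (i + k, j - k) in pairs:
            k += 1
        found.append((i, j, k))
    return found


def maximal_helix(seq, i, j, k):
    """Extend helix (i, j, k) to the maximal helix that labels its helix class."""
    # Extend outward while (i-1, j+1) is a canonical pair.
    while i > 1 and j < len(seq) and (seq[i - 2], seq[j]) in CANONICAL:
        i, j, k = i - 1, j + 1, k + 1
    # Extend inward while (i+k, j-k) is canonical and j - i - 2k >= 2 still holds.
    while j - i - 2 * (k + 1) >= 2 and (seq[i + k - 1], seq[j - k - 1]) in CANONICAL:
        k += 1
    return (i, j, k)
```

For example, the single pair (2, 8) in the toy sequence GGGAAACCC extends to the maximal helix (1, 9, 3), which leaves exactly the minimum 3 nt hairpin loop.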
Profiling always begins by determining all HC present along with their frequency (i.e. estimated
probability) in the sampled secondary structures. When the observed HC are ordered by decreasing
frequency, this yields a distribution with a long tail of low-probability base pairings. The threshold at
which to cut the tail is determined by maximizing the average Shannon information entropy [4]. This
yields a relatively small set of selected helix classes (SHC).
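The exact entropy objective is defined in [4] and is not reproduced here. As a rough sketch of the thresholding idea only, the Python below assumes one simple reading: each HC is scored by the binary Shannon entropy of its estimated probability, and the cut is placed where the running average over the top-ranked HC is largest. Both that scoring and the function names are assumptions, not the authors' implementation.

```python
import math


def binary_entropy(p):
    """Shannon entropy (in bits) of a present/absent event with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def select_by_average_entropy(counts, sample_size):
    """Return how many top-frequency features to keep.

    Assumption: the cutoff maximizes the mean binary entropy of the kept
    features. This is an illustrative stand-in for the criterion in [4].
    """
    freqs = sorted((c / sample_size for c in counts), reverse=True)
    best_n, best_avg, total = 0, -1.0, 0.0
    for n, p in enumerate(freqs, start=1):
        total += binary_entropy(p)
        avg = total / n
        if avg > best_avg:
            best_n, best_avg = n, avg
    return best_n
```

Under this reading, features in the long tail contribute almost no entropy, so including them lowers the average and the cut falls before them.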
Pv1 used SHC as features, and demonstrated this achieves both high sample compression as well as
low information loss. Pv2 generalizes this approach by using SHC to generate “stem classes” as defined
below.
Like a HC, a stem class (SC) represents sets of possible base pairings. It is built from SHC and
denoted as [i, j; k, l], where the intervals i ≤ x ≤ i+k−1 and j−l+1 ≤ y ≤ j are the minimal
regions such that all base pairs (x, y) represented have their 5’ end in the first and their 3’ end in the second.
A “trivial” SC consists of a single SHC with [i, j; k, k]. The SC frequency is the number of sampled
structures where any of the base pairings from the constituent SHC are present.
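As a small worked illustration of this notation, the following Python sketch computes the region [i, j; k, l] and the SC frequency from the maximal (i, j, k) triplets of the constituent SHC. The function names and the representation of each sampled structure as a set of base pairs are assumptions for illustration.

```python
def helix_class_pairs(i, j, k):
    """All base pairs represented by the maximal helix (i, j, k)."""
    return {(i + a, j - a) for a in range(k)}


def stem_class_region(shc_triplets):
    """Return (i, j, k, l) for the region [i, j; k, l] spanned by the SHC."""
    five_start = min(i for i, j, k in shc_triplets)
    five_end = max(i + k - 1 for i, j, k in shc_triplets)
    three_start = min(j - k + 1 for i, j, k in shc_triplets)
    three_end = max(j for i, j, k in shc_triplets)
    return (five_start, three_end,
            five_end - five_start + 1, three_end - three_start + 1)


def stem_class_frequency(structures, shc_triplets):
    """Count structures containing any base pair from the constituent SHC."""
    sc_pairs = set().union(*(helix_class_pairs(*t) for t in shc_triplets))
    return sum(1 for pairs in structures if sc_pairs & set(pairs))
```

A single SHC (16, 30, 4) gives the trivial region [16, 30; 4, 4], while combining (16, 30, 4) with (17, 28, 3) widens the 3’ interval and gives [16, 30; 4, 5].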
For our purposes, two (or more) helices are said to form a stem if they are (successively) separated
by at most two single-stranded bases on either side. In other words, if a secondary structure contains
helices (i, j, k) and (i′, j′, k′) with i+k ≤ i′ ≤ i+k+2 and j−k−2 ≤ j′ ≤ j−k, then they form a
stem. This extends to multiple distinct helices in succession, as long as the separation criterion is met.
The idea of a stem is motivated by the NNTM [3], which treats small internal loops (i.e. with sizes 1×1,
1×2/2×1, and 2×2) as special cases, in part to address noncanonical base pairings.
Generalizing to the sets of base pairs which form profiling’s structural units, two HC are considered
stemmable if there exists a helix from each class which together could form a stem. The default
gap size of 2 can be changed by the user if a smaller or larger value is desired. We note that stemmability
is a reflexive and symmetric mathematical relation on HC. A stem class (SC) is then defined to be the
transitive closure of stemmable pairs of SHC, yielding well-defined equivalence classes.
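The construction of SC from stemmable pairs can be sketched in Python as follows. This is an illustrative brute-force version, not the Pv2 implementation: the function names are assumptions, each HC is given by its maximal (i, j, k) triplet, and the transitive closure is taken with a small union-find.

```python
from itertools import product


def sub_helices(i, j, k):
    """All helices contained in the maximal helix (i, j, k)."""
    return [(i + a, j - a, m) for a in range(k) for m in range(1, k - a + 1)]


def forms_stem(outer, inner, gap=2):
    """Stem condition: at most `gap` unpaired bases on each side."""
    i, j, k = outer
    i2, j2, _ = inner
    return i + k <= i2 <= i + k + gap and j - k - gap <= j2 <= j - k


def stemmable(hc_a, hc_b, gap=2):
    """True if some helix from each class could together form a stem."""
    for h1, h2 in product(sub_helices(*hc_a), sub_helices(*hc_b)):
        if forms_stem(h1, h2, gap) or forms_stem(h2, h1, gap):
            return True
    return False


def stem_classes(shc, gap=2):
    """Group SHC into stem classes via the transitive closure of stemmability."""
    parent = list(range(len(shc)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a in range(len(shc)):
        for b in range(a + 1, len(shc)):
            if stemmable(shc[a], shc[b], gap):
                parent[find(a)] = find(b)

    groups = {}
    for idx, hc in enumerate(shc):
        groups.setdefault(find(idx), []).append(hc)
    return sorted(groups.values())
```

With the default gap of 2, the helix classes (16, 30, 4) and (17, 28, 3) are stemmable, since the single pair (16, 30) followed by the helix (17, 28, 3) leaves a bulge of size 1, whereas a distant class such as (1, 50, 3) stays in its own trivial SC.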
Observe that the HC definition naturally satisfies transitivity whereas it is imposed on stemmable
pairs. To confirm that SC remain local substructural units, we consider two properties: length and
width. Both are defined precisely in Supplemental Material. Length is an upper bound on the number
of pairings possible from the SC in one structure. Width is a measure of the ‘spread’ where two helices
which form a stem would have a width of 1, 2, or 3 depending on the asymmetry of the small internal
loop/bulge. As will be shown, nearly all SC have this width as well.
2.2 Profile selection
The next step is to cluster the sampled structures into “profiles” determined by a common set of features.
To focus on the most informative combinations, a maximum average entropy threshold is again used to
filter the low probability tail. This yields a relatively small set of selected profiles for further consideration.
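The clustering step itself is simple to state in code. In this minimal Python sketch, each sampled structure is assumed to have already been reduced to the set of features it contains, and the feature labels are hypothetical; structures with the same feature set share a profile.

```python
from collections import Counter


def profile_counts(structure_features):
    """Partition a sample into profiles, keyed by the exact feature set."""
    return Counter(frozenset(features) for features in structure_features)


# Hypothetical sample of five structures described by their features.
sample = [{"A", "B"}, {"B", "A"}, {"A"}, {"A", "B", "5"}, set()]
counts = profile_counts(sample)
ranked = counts.most_common()  # profiles in order of decreasing frequency
```

The ranked list is the distribution to which a frequency threshold would then be applied to obtain the selected profiles.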
In addition to using SC rather than SHC as the default structural unit, Pv2 updates the estimated
probabilities prior to profile selection by considering low frequency pairings similar to the featured ones.
These augmented counts are distinguished as “fuzzy” stem classes (FSC). Intuitively, if ‘enough’ non-
SHC pairings in a secondary structure span the 5’ and 3’ sequence regions [i, j; k, l], when expanded
slightly, then the corresponding SC is added to the FSC profile for that structure and the FSC frequency
is increased accordingly. (Details in Supplemental Material.)
In Pv2, the profiles are selected based on FSC by default. In this way, low frequency pairings which
occupy the same sequence regions, approximately, as the featured ones are included in the summary
profile graph information. As with stem versus helix classes, and the stem gap size, this can be changed
by the user.
2.3 Relationship visualization
The third and final step is to highlight similarities and differences across the selected profiles. This
compare/contrast visualization illuminates crucial differences between the most frequently occurring
combinations of base pairings. In Pv1, the summary profile graph was based on the Hasse diagram for
the selected profiles ordered under set inclusion. Pv2 retains this option but also provides (by default) a
complementary perspective via an augmented decision tree. An example is seen in Figure 1, and discussed
further in Section 2.4. Details of the unsupervised tree building procedure are in Supplemental Material.
The advantage of decision trees over Hasse diagrams is that the number of edges remains small, even at
longer sequence lengths, making it more comprehensible for the user. The trade-off is that support for
non-selected feature combinations (the “intersection profiles” from Pv1) can be spread among different
branches.
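For comparison, the edges of a Hasse diagram under set inclusion can be computed directly from the selected profiles. The Python sketch below is illustrative only (the function name and feature labels are assumptions): it keeps an edge from one profile to another exactly when the first is a proper subset of the second with no selected profile strictly in between.

```python
def hasse_edges(profiles):
    """Covering relations among profiles ordered by set inclusion."""
    profiles = [frozenset(p) for p in profiles]
    edges = []
    for small in profiles:
        for large in profiles:
            if small < large and not any(small < mid < large for mid in profiles):
                edges.append((small, large))
    return edges


# Hypothetical selected profiles.
selected = [{"A"}, {"A", "B"}, {"A", "C"}, {"A", "B", "C"}]
edges = hasse_edges(selected)
```

Even in this four-profile example there are four edges, which hints at how the diagram can grow as more profiles with overlapping features are selected.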
2.4 Interactive output
The most visible enhancement in Pv2 is the profiling output; the summary profile graph is now embedded
in a portable interactive webpage with auxiliary information. Figure 1 shows an example for the FMN
riboswitch (RF00050) in Actinobacillus succinogenes 130Z (CP000746.1/533105–533234). As illustrated,
this provides a compact, informative summary of the Boltzmann sample input which also highlights
critical structural differences.
The Figure 1 caption gives an overview of the information provided. We now consider the decision tree
example in more detail. Each node (oval or rectangle) corresponds to a subset of the Boltzmann sample
input, and is labeled by its size. Rectangles can only be leaves, and denote a selected profile or collection
thereof (dashed). The root node is the full sample provided. Edges are labeled with choices on feature
inclusion and/or exclusion (denoted ¬) that lead to at least one selected profile.
As detailed in Supplemental Material, decisions are grouped based on the split induced by each
feature on the selected profiles among the current set of structures. If there is no forced decision, the
one maximizing the dissimilarity in the two branches (according to the Hellinger distance [11]) is chosen.
Splits indicate critical structural differences, in order of priority for further computational analysis and/or
experimental testing.
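The Hellinger distance itself is standard: for discrete distributions P and Q it is the square root of half the sum of squared differences between the square roots of their probabilities, and it ranges from 0 (identical) to 1 (disjoint support). The Python sketch below computes it for two hypothetical branch distributions; how Pv2 forms the distributions compared at each split is specified in Supplemental Material and is not assumed here.

```python
import math


def normalize(counts):
    """Turn a mapping of counts into a probability distribution."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as dicts."""
    keys = set(p) | set(q)
    squared = sum(
        (math.sqrt(p.get(x, 0.0)) - math.sqrt(q.get(x, 0.0))) ** 2 for x in keys
    )
    return math.sqrt(squared / 2.0)


# Hypothetical counts of outcomes in the two branches of a candidate split.
with_feature = normalize({"profile_1": 50, "profile_2": 50})
without_feature = normalize({"profile_1": 100})
distance = hellinger(with_feature, without_feature)
```

A candidate split whose branches have very different distributions scores closer to 1, which is the sense in which a larger distance marks a more informative structural difference.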
For instance, of the full 1000 structures in the sample analyzed for Figure 1, 978 contain base pairings
from 5, A, and B. For these structures, all selected profiles contain all three features, so this is a “forced”
decision.
Letter indexed features denote nontrivial FSC. As seen in the feature display, and confirmed in the
FSC and SHC tables, feature A is composed of two SHC, 1 and 2, that are separated by a small internal
loop of size 1×1, which could be a noncanonical pairing. In contrast, B contains SHC 4 = (16,30,4) and 12
= (17,28,3) which can form a stem with 4 base pairs and a bulge of size 1. Hence, Pv2 treats them as a
single structural unit.
Fuzziness increases the estimated probability of these SC by less than 5% (from exact frequencies
of 952 for A and 954 for B). However, it raises 5 by more than 30%, from 757 to 985. This is useful
information that the sequence region [54,59; 64,69] most likely forms some kind of hairpin structure.
None of the other combinations of inclusion/exclusion on 5, A, B lead to a selected profile, so there is
only one edge in the tree for this common forced decision.
In contrast, the 979 structures are essentially evenly split between including/excluding C. According
to the Hellinger distance, this makes it the highest priority structural difference among the remaining
features for this Boltzmann sample input.
If C is present, then this forces 3 to be included, 10 and 11 to be excluded. Why exclusions are
forced is useful information, and communicated by red pairings below the sequence line in the feature
display. In this case, the only remaining feature uncertainty (as summarized by a contingency table) is
between 7 and 9. Given the overlap in their base pairing regions, resolving this ambiguity may not even
be necessary.
If C is absent, the next critical split identified is 10. If it exists, then different combinations of
the inclusion/exclusion of 9 and of (11,3) are present in the sample, as would be summarized in the
contingency table for II. The table would show that of the 325 structures on this decision path, most
(0.575) have both but some (0.169) have 9 with (¬11, ¬3) while others (0.178) have the opposite (and
0.077 have neither).
Finally, excluding both C and 10 yields a single selected profile III, pictured at bottom right. Based
on this analysis, the most important difference between the structures in the input sample is the presence
of C, and if ¬C then 10. To facilitate further investigation, after selecting a node, users can download
files containing all associated structures. All images in the webpage can be downloaded directly or are
available in the output folder.
3 Results
Previous results [4] demonstrated that Pv1 consistently achieves high sample compression together with
low information loss on “small” sequences around 100 nt. We confirm this on a much larger dataset, and
also demonstrate that the Pv2 enhancements extend the length at which a useful structural signal can
be extracted from a Boltzmann sample up to 600 nt. Results highlight the value of this cluster analysis
approach, as well as the challenges for very long sequences.
3.1 Dataset
Our dataset started from the curated CONTRAfold [12] set, which contains 151 sequences from distinct Rfam [13]
families (version 7.0). Each existing family was augmented by up to 4 additional
sequences from Rfam (version 14.9). Eight of the original families are no longer in Rfam, while 3 had
just one more sequence, 8 had two, and 7 had three. For the remaining 125 families, GC content was used
as a proxy for phylogenetic diversity; four sequences were added iteratively by maximizing the minimum
difference with the previous ones. This yielded a total of 683 sequences over 143 current families.
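The iterative selection described above is a greedy max-min procedure, which can be sketched in Python as follows. The function names, the toy sequences, and the choice of the original CONTRAfold sequence as the starting point are assumptions for illustration, and ties are broken by input order.

```python
def gc_content(seq):
    """Fraction of G and C nucleotides in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def pick_diverse(seed, candidates, n_extra=4):
    """Greedily add sequences whose GC content is farthest from those chosen."""
    chosen = [seed]
    pool = list(candidates)
    while pool and len(chosen) - 1 < n_extra:
        best = max(
            pool,
            key=lambda s: min(abs(gc_content(s) - gc_content(c)) for c in chosen),
        )
        chosen.append(best)
        pool.remove(best)
    return chosen


# Toy example with GC contents 1.0 (seed), 0.0, 0.25, 0.5, and 0.75.
picked = pick_diverse("GGCC", ["AAUU", "GAUU", "GGAU", "GGCU"], n_extra=2)
```

The counts in the text are consistent with this scheme: 3 × 2 + 8 × 3 + 7 × 4 + 125 × 5 = 683 sequences over 3 + 8 + 7 + 125 = 143 families.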
For each sequence, 25 different Boltzmann samples of size 10,000 were generated using the RNAlib
python bindings [8]. Analysis reports the results of these 17,075 trials. As addressed below, larger
sample sizes (beyond the typical 1000 [2]) improved result reproducibility across different runs for longer
sequences.
The maximum CONTRAfold sequence length was 568. Excepting one family (RF00177), all additional
ones have length < 600 nt. To facilitate profile-level comparisons, the families were divided by
average sequence length n into 5 categories: extra-small (xs), small (s), medium (m), large (l), and
extra-large (xl). They consist, respectively, of 14 families with 23 < n ≤ 50, 85 with 50 < n ≤ 150, 22
with 150 < n ≤ 220, 21 with 270 < n ≤ 567, and 1 with n = 1331.2.
The Pv1 proof-of-principle dataset had 15 sequences spread over 5 families with lengths ranging from
72 (tRNA) to 133 (5S rRNA) with a 99 nt average. As shown below, SHC consistently yield profiles
with high sample compression and low information loss over the 85 comparable s families. Moreover, by
moving to FSC as features, Pv2 achieves the same profile qualities on m families, and even extracts a
useful signal for l ones. The xl outlier was retained as an example of how much the structural signal in
a Boltzmann sample, i.e. the combination of base pairings, decays with sequence length.
3.2 Features
Pv1 results found that Boltzmann sampling can be very noisy, with many distinct helices generated
even at small lengths. This is only more true for longer sequences with larger (by 10×) sample sizes.
Nonetheless, relatively few features can reproducibly represent most of the base pairing information in a
Boltzmann sample.
We found linear relationships between sequence length and number of different features. (See Supplemental
Material.) There are about 500 distinct helices per 100 nt on average, and moving to HCs yields
a 2-fold reduction. The maximum average entropy thresholding achieves about 25-fold reduction, with
10 SHC per 100 nt. This is further reduced to a rate of 5.14 SC (R² = 0.96). Since fuzzy counts only
affect the estimated probability, the number of distinct SC/FSC per 100 nt is two orders of magnitude
less than helices.
This amount of compression still provides good coverage of the base pairing information sampled.
Coverage was computed as the proportion of helices (with multiplicity) which belong to a SHC. Nearly
half (48.4%) of trials had >0.90 coverage, and most (87.3%) covered more than 3/4 of helices sampled.
Coverage decreased slightly with sequence length (at a rate of 0.025 per 100 nt according to the best line
fit). Only 2.0% had coverage below 0.6, and these outliers were generally shorter sequences (<150 nt).
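The coverage statistic can be written out as a short Python sketch. The input format is an assumption for illustration: each sampled structure is given as the list of HC labels of its helices, so that repeated occurrences across structures are counted with multiplicity.

```python
def shc_coverage(sampled_helix_classes, selected):
    """Proportion of sampled helices (with multiplicity) belonging to a SHC."""
    total = 0
    covered = 0
    for labels in sampled_helix_classes:
        total += len(labels)
        covered += sum(1 for label in labels if label in selected)
    return covered / total if total else 0.0


# Hypothetical sample of three structures; HC 1 and 2 are the selected ones.
sample = [[1, 2, 3], [1, 2], [1, 7]]
coverage = shc_coverage(sample, selected={1, 2})  # 5 of 7 helices are covered
```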
The SC are built from SHC so provide the same coverage with 1/2 as many structural units. We
confirm that SC remain local substructures by considering the length and width distributions. As defined,
a helix has width w = 1, and two helices that form a stem have 1 ≤ w ≤ 3 depending on the asymmetry
of the small internal loop/bulge between them. Over all trials, more than half (52.2%) of SC contained only
one SHC, and these trivial SC had average (std) length l = 5.45 (2.39) nt. Allowing stems increases the
feature length without significantly increasing the width. Of the nontrivial SC, nearly all (89.1%) had
w ≤ 3 with l = 11.9 (5.21) while only 1.2% had w ≥ 6 with l = 28.7 (18.6).
The reproducibility of SHC is very high across all lengths. For each sequence, this is the average
over all SHC sampled of the percentage of runs (out of 25) in which each is present. See Supplemental Material
for boxplots. Reproducibility for SC, as well as profiles, is computed likewise. However, the criterion for
“is present” is stringent; the compound structure must include exactly the same set of SHC. This means
that any variation in SHC propagates and is amplified. Nonetheless, median SC reproducibility remained
high. Interestingly, the noisier sequences were the shorter ones.
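The reproducibility measure for one sequence can be sketched in Python as below. The function name and the representation of each run as the set of features it produced are assumptions for illustration; for SC or profiles, each element would be the exact set of constituent SHC (e.g. a frozenset), which is what makes the criterion stringent.

```python
def feature_reproducibility(runs):
    """Average, over all features seen, of the fraction of runs containing each."""
    all_features = set().union(*runs)
    if not all_features:
        return 0.0
    fractions = [
        sum(1 for run in runs if feature in run) / len(runs)
        for feature in all_features
    ]
    return sum(fractions) / len(fractions)


# Hypothetical features from four runs: "a" in all, "b" in two, "c" in one.
runs = [{"a", "b"}, {"a", "b"}, {"a"}, {"a", "c"}]
score = feature_reproducibility(runs)  # (1.0 + 0.5 + 0.25) / 3
```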
Boltzmann samples of size 1K were originally analyzed, and only 10 (of 683) sequences had an average
SHC reproducibility h ≤ 0.9. Hence, more than 3/4 (76.9%) had an average SC reproducibility s > 0.9.
Moving to size 10K, as will be discussed in the next section, meant only one (xs) sequence had h ≤ 0.9
which improved the percentage with s > 0.9 to 93.0%. This included all but 4 (of 106) m sequences and
1 (of 105) l ones, which had 0.8 < s ≤ 0.9.
This confirms that Pv2 is reliably extracting a very informative base pairing signal from the sampled
structures. Moreover, since the number of possible profiles grows exponentially with features, the greater
compression obtained by Pv2 makes this approach accessible to longer sequences.
3.3 Profiles
Profiles were also assessed for sample compression, information loss, and reproducibility. Five different
feature sets were considered: helix, HC, SHC, SC, and FSC. For each type, the corresponding profiles
were generated, and the maximum average entropy thresholding applied to select the most informative
ones. Results are broken down into groups by average sequence length per family, and report analysis of
10K-sized Boltzmann samples except where noted.
Results given in Figure 2 demonstrate that Pv2 achieves high sample compression with low information
loss for sequences up to 600 nt. Using SHC as features works well in the length range (up to 150
nt) previously reported. The number of selected profiles is low and the coverage is high: 3 and 0.98 for
xs, 5 and 0.82 for s. However, the coverage for m has already dropped to 0.49. The key to overcoming
this length barrier is consolidating SHC into SC.
As seen, Boltzmann sampling can be very noisy. Even for the s sequences, a median of 1,634 different
structures (i.e. helix profiles) were sampled. Decreasing feature granularity filters sample noise. The
difference is particularly noticeable for l sequences when moving from SHC to SC; the median drops
more than 8-fold (from 2,984 to 360). Moreover, the median number of selected profiles is reduced down
to 7, not much higher than m and s. Importantly, this level of sample compression is accompanied by
a corresponding rise in median coverage, which increases from 0.15 to 0.61 for FSC. The corresponding
rise for m was from 0.49 to 0.85.
When analyzing reproducibility, the standard 1K-sized samples [5] did not yield results comparable
to Pv1, but 10K did. Profile reproducibility is affected by the propagation, and amplification, of feature
variability; increased sample size reduces this effect across all sequence lengths. For example, with 1K
samples, although 80.2% of m sequences have an average SHC selected profile reproducibility s > 0.7,
only 10.4% were above 0.9. With 10K samples, 98.1% have s > 0.7 with 69.4% above 0.9. Although the
corresponding numbers for l sequences also improve significantly (to 84.8% from 20.0% and 18.1% from
0% resp.), they are lower due to the confounding effect of profile growth.
Like the number of structural units sampled, the average number of features in a profile also grows
linearly with sequence length (at a rate of about 6 SHC and 4 SC/FSC per 100 nt). This effect is
particularly apparent for the xl sequences, whose SHC and SC reproducibility is as high as any other,
but whose corresponding profiles are not reproducible. (See Supplemental Material.)
Since Pv2 uses FSC by default, these are the selected profiles whose reproducibility was evaluated.
It is very high for xs and s sequences (92.1% and 73.5% above 0.9, resp.) although the outliers in SC
reproducibility are apparent. Nearly all m sequences (88.7%) are above 0.7, with a majority (57.1%)
above 0.9. Even the l ones have 64.8% and 35.2% resp., with a median of 0.84 and interquartile range
of (0.61,0.93).
4 Conclusions
RNAprofiling 2.0 (Pv2) consistently achieves high sample compression together with low information
loss on sequences up to 600 nt, a 4-fold length increase over Pv1. This is accomplished by expanding
the featured substructures from helices to stems, including low-frequency pairings similar to featured
ones in the profile selection, and visualizing their relationships in a decision tree. Pv2 takes as input
a Boltzmann sample, as provided by software packages like RNAstructure or ViennaRNA. The Pv2
output is a portable interactive webpage which provides a compact, informative summary of the sample
provided. Critical structural differences are highlighted to be evaluated further by some combination of
computational analysis, experimental testing, and biological insight.
5 Acknowledgments
This work was supported by funds from the National Institutes of Health (R01GM126554 to CH) and
the National Science Foundation (DMS1344199 to CH). Additional support for FH provided by a grant
from the National Institute of Environmental Health Sciences (2T32ES007018 to Rebecca Fry, UNC).
References
[1] J. S. McCaskill, “The equilibrium partition function and base pair binding probabilities for RNA
secondary structure,” Biopolymers, vol. 29, no. 6-7, pp. 1105–19, 1990.
[2] Y. Ding and C. E. Lawrence, “A statistical sampling algorithm for RNA secondary structure
prediction,” Nucleic Acids Res, vol. 31, no. 24, pp. 7280–7301, 2003.
[3] D. H. Turner and D. H. Mathews, “NNDB: the nearest neighbor parameter database for predicting
stability of nucleic acid secondary structure,” Nucleic Acids Res, vol. 38, pp. D280–2, 2010.
[4] E. Rogers and C. Heitsch, “Profiling small RNA reveals multimodal substructural signals in a
Boltzmann ensemble,” Nucleic Acids Res, p. gku959, 2014.
[5] Y. Ding, C. Y. Chan, and C. E. Lawrence, “Sfold web server for statistical folding and rational
design of nucleic acids,” Nucleic Acids Res, vol. 32, no. Web Server Issue, pp. W135–W141, 2004.
[6] P. Steffen, B. Voß, M. Rehmsmeier, J. Reeder, and R. Giegerich, “RNAshapes: an integrated RNA
analysis package based on abstract shapes,” Bioinformatics, vol. 22, no. 4, pp. 500–503, 2006.
[7] J. S. Reuter and D. H. Mathews, “RNAstructure: Software for RNA secondary structure prediction
and analysis,” BMC Bioinformatics, vol. 11, 2010.
[8] R. Lorenz, S. H. Bernhart, C. Höner zu Siederdissen, H. Tafer, C. Flamm, P. F. Stadler, and I. L.
Hofacker, “ViennaRNA package 2.0,” Algorithms for Molecular Biology, vol. 6, 2011.
[9] E. Rogers and C. Heitsch, “New insights from cluster analysis methods for RNA secondary structure
prediction,” Wiley Interdiscip Rev RNA, vol. 7, no. 3, pp. 278–94, 2016.
[10] S. Swenson, J. Anderson, A. Ash, P. Gaurav, Z. Sükösd, D. Bader, S. Harvey, and C. Heitsch,
“GTfold: Enabling parallel RNA secondary structure prediction on multi-core desktops,” BMC Res
Notes, vol. 5, no. 1, p. 341, 2012.
[11] L. Le Cam and G. L. Yang, Asymptotics in Statistics. Springer New York, 2nd ed., 2000.
[12] C. B. Do, D. A. Woods, and S. Batzoglou, “CONTRAfold: RNA secondary structure prediction
without physics-based models,” Bioinformatics, vol. 22, pp. e90–e98, 2006.
[13] I. Kalvari, E. P. Nawrocki, N. Ontiveros-Palacios, J. Argasinska, K. Lamkiewicz, M. Marz,
S. Griffiths-Jones, C. Toffano-Nioche, D. Gautheret, Z. Weinberg, E. Rivas, S. R. Eddy, R. D.
Finn, A. Bateman, and A. I. Petrov, “Rfam 14: expanded coverage of metagenomic, viral and
microRNA families,” Nucleic Acids Research, vol. 49, pp. D192–D200, 2021.
Figure 1: Example of Pv2 output webpage. Left column displays an interactive summary profile graph — by default
a decision tree. Nodes are clickable and labeled with corresponding number of sampled structures. Center column has
a dynamic panel (top) displaying features in arc diagram format for chosen (grey) node from profile graph. Decisions
on incoming edge are emphasized; positive ones in bold above the sequence line, and negative ones, denoted with
¬ in tree, in red below. Features are labeled according to the FSC table (center, middle) which lists SC regions
[i, j;k, l] with fuzzy frequencies and SHC contained. Nontrivial FSC are denoted by letters, and trivial ones by their
SHC index. All SHC are listed (center, bottom) with maximal (i, j, k) triplet and integer indexed in decreasing exact
frequency as given. Selected profiles, or groups thereof, are denoted by rectangular leaves in decision tree and labeled
by roman numerals. More than one selected profile is represented if the incoming edge is a contingency (dashed). In
this case, the contingency table is given below the feature display when the leaf (dashed rectangle) is chosen. Right
column shows most frequent secondary structure for each leaf in radial (or arc) diagram format. Users can download
all structures corresponding to the chosen node, or just the most frequent, for further analysis. See Section 2.4, and
Supplemental Material, for further information.
Figure 2: Profile analysis with feature sets ordered by increasing abstraction/decreasing granularity from helices
through helix classes to stem classes. Median and interquartile range (IQR) are reported over distinct trials, excluding
those where every sampled structure is a unique profile. Counts (top and middle) are log scale. Coverage (bottom)
is proportion of sampled structures represented by selected profiles.
Supplemental Material for
RNAprofiling 2.0: Enhanced cluster analysis of structural ensembles∗
Forrest Hurley, University of North Carolina at Chapel Hill
Christine Heitsch, Georgia Institute of Technology
March 29, 2023
1 Additional results
Figures 1, 2, and 3 provide further details on the growth rate of features, growth rate of profile length, and
reproducibility for 1K and 10K samples respectively.
Recall that Pv2 denotes RNAprofiling 2.0, and Pv1 the original, i.e. RNAprofiling 1.0. Features considered
are helices, helix classes (HC), selected helix classes (SHC), stem classes (SC), and fuzzy stem classes
(FSC). As features, the difference between SC and FSC is their frequency, i.e. estimated probability, in the
Boltzmann sample input. As discussed below in Section 3, this changes both the profile selection and the
decision tree construction.
2 Implementation information
Pv2 is freely available under the GPLv2 license at github.com/gtDMMB/RNAprofilingV2 and can be run
online via the rnaprofiling.gatech.edu website.
When run online, the default option is to upload a Boltzmann sample in either ct or dot file format.
To expedite exploratory analysis, the website also provides the option of generating a sample with either
RNAstructure 6.4 [1] or ViennaRNA 2.4.14 [2].
A command line interface is available in the form of a python script. The script provides all the same
options as the web interface, although generating samples is disabled unless there is a local install of
RNAstructure or ViennaRNA.
This new version has a completely new codebase, written in Python rather than C/C++. Pv2 is
implemented and tested with Python 3 (3.6.9) using the numpy (1.19.5), networkx (2.4), matplotlib (3.3.4), and
pygraphviz (1.6) libraries. A graphviz install [3] is also required to generate the summary profile graph. The
program is loaded on the web server using PyInstaller (4.10).
The full output from both the web interface and the command line interface is displayed in HTML with
JavaScript and should work in most modern internet browsers. The svg.min library is used to render graphviz
output in the browser.
Table 1 lists the main Pv2 parameters. There are some additional IO options for sequences and samples
available via the command line and website.
Pv1 functionality may be recreated by using SHC as features, and a Hasse diagram as the summary
profile graph output. By default, a maximum average entropy threshold determines the selection of HC and
of profiles. As in Pv1, this can be overridden by a user-specified cutoff.
The default options for stem gap, fuzzy counts, and contingency nodes can be altered by users to suit
their particular analysis goals. For example, by default, 75% of the full binary tree must be present before it
is collapsed into a contingency node, but this threshold can be changed to yield a larger or smaller output tree.
∗©2023. This manuscript version is made available under the CC-BY-NC-ND 4.0 license.
arXiv:2303.15552v1 [q-bio.BM] 27 Mar 2023
Figure 1: Number of distinct features by type for each sequence. Regression lines assume 0 intercept. Note
difference in y-axis scales between graphs.
Figure 2: Average length of profiles by feature type for each sequence. Difference in regression line slopes
for SC and FSC is not significant (p=0.08).
Figure 3: Reproducibility of features and of profiles for 1K and 10K samples across family length categories.
Although 1K suffices for features, profile reproducibility degrades with sequence length. Larger, e.g. 10K,
samples yield a more reliable structural signal in general.
Finally, we highlight the possibility of having consistent HC labels. This can be very useful if comparing
results across multiple different samples for the same sequence. When invoked, the HC labels are based on
the sequence itself, and so are independent of the particular sample frequencies.
Parameter | Value | Default
Output Type | {Hasse, Tree} | Tree
Feature Type | {Selected Helix Class, Stem Class} | Stem Class
Frequency Format | {Counts, Percentages, Decimals} | Counts
Helix Class Selection Cutoff | Positive Integer or Auto | Auto
Profile Selection Cutoff | Positive Integer or Auto | Auto
Stem Gap | Non-Negative Integer | 2
Use Fuzzy Stem Counts | Boolean | True
Fuzzy Dilation Size | Positive Integer | 5
Fuzzy Basepair Frequency Margin | Real in [0,1] | 0.333
Min Contingency Node Proportion | Real in [0,1] | 0.75
Use Consistent Helix Class Labels | Boolean | False
Table 1: Main code parameters. “Cutoff” options override the standard profiling threshold method; if used,
helix classes, resp. profiles, with lower frequency in the input sample will not be considered. The three
“Fuzzy” options apply only to SC and are ignored if using HC. “Contingency” option is ignored if output is
not a decision tree. Consistency in helix labeling can be very useful when comparing multiple analyses for
the same sequence.
3 Technical details of method
3.1 Stem class length and width
A helix (i, j, k) has length k, the number of base pairs it contains. A HC is denoted by its maximal helix
(i, j, k) and has length k since the maximum number of base pairs possible in any of its constituent helices
is k. Observe that k − 1 is half the Manhattan distance from the outermost possible pairing (i, j) to the
innermost (i + k − 1, j − k + 1) in the usual (x, y) plane; k − 1 = (1/2) (|i − (i + k − 1)| + |j − (j − k + 1)|).
The SC length is defined analogously and will be the maximum number of pairings possible in any of its
constituent combinations. First observe that the outermost possible base pair is the one with the greatest
contact distance, i.e. where j − i is maximal. Let M denote this value, and m denote the least possible, which
corresponds to the innermost pair. Since the Manhattan distance between the outermost and innermost
pairs is equal to M − m, the length of a stem class is defined to be 1 + (M − m)/2.
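As an illustration of the length definitions, the following minimal Python sketch computes 1 + (M − m)/2 from a list of maximal helices. The function names, and the representation of a stem class as a list of (i, j, k) triplets, are assumptions made for this example and are not taken from the Pv2 codebase.

```python
def helix_pairs(i, j, k):
    """Base pairs (i, j), (i+1, j-1), ..., (i+k-1, j-k+1) of a maximal helix."""
    return [(i + t, j - t) for t in range(k)]


def stem_class_length(helix_classes):
    """Return 1 + (M - m)/2 over all pairings of the given helix classes.

    Assumption: the stem class is passed as a list of (i, j, k) maximal
    helices; M and m are the greatest and least contact distances j - i.
    """
    distances = [j - i for h in helix_classes for (i, j) in helix_pairs(*h)]
    return 1 + (max(distances) - min(distances)) / 2


# A single helix class of length k = 5 recovers its own length.
single = stem_class_length([(10, 50, 5)])

# Two helix classes separated by a symmetric internal loop
# (one unpaired base on each side) span positions 10 through 15.
combined = stem_class_length([(10, 50, 3), (14, 46, 2)])
```

With an asymmetric loop or bulge, M − m can be odd, so this formula may return a half-integer value.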
The SC width will capture a measure of the ‘spread’ of observed pairings represented. This is done by
counting the number of “helical diagonals” covered by the SC. Here, a helical diagonal denotes a line in the
(x, y) plane with slope −1 which intersects the identity x = y at points where x or x + 1/2 is a positive
integer. The HC h = (i, j, k) has width 1 since all pairings lie on a single diagonal with midpoint x = (i + j)/2
as the intersection. If the helix class h′ = (i′, j′, k′) is stemmable with h, then |(i + j)/2 − (i′ + j′)/2| ≤ 1, and
the number of diagonals covered is either 1, 2, or 3 depending on the asymmetry of the internal loop/bulge
separating them, i.e. whether ((j − k + 1) − j′ − 1) − (i′ − (i + k − 1) − 1) is 0, ±1, or ±2. For a stem class
[i, j; k, l], let C be the set of helix class midpoints. Then its width is 1 + 2 (max C − min C).
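The width formula can be checked with a short Python sketch. The function name and the list-of-triplets input are assumptions for this example only; exact fractions are used so that half-integer midpoints are handled without rounding.

```python
from fractions import Fraction


def stem_class_width(helix_classes):
    """Return 1 + 2 * (max C - min C), where C is the set of midpoints (i + j)/2.

    Assumption: the stem class is given as a list of (i, j, k) maximal helices.
    """
    midpoints = [Fraction(i + j, 2) for (i, j, _k) in helix_classes]
    return int(1 + 2 * (max(midpoints) - min(midpoints)))


# One helix class lies on a single diagonal.
w_single = stem_class_width([(10, 50, 5)])

# Loop asymmetries of 0, 1, and 2 between (10, 50, 3) and a second
# helix class cover 1, 2, and 3 diagonals respectively.
w_sym = stem_class_width([(10, 50, 3), (14, 46, 2)])
w_asym1 = stem_class_width([(10, 50, 3), (14, 45, 2)])
w_asym2 = stem_class_width([(10, 50, 3), (14, 44, 2)])
```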
3.2 Fuzzy stem class frequencies
Fuzzy counts address low-frequency base pairings, i.e. non-SHC ones, that are ‘close’ to a SC. These
augmented frequencies are distinguished as FSC. A secondary structure has both a SC profile as well as a
(possibly enlarged) FSC one. The frequency of a profile is the number of structures in the input sample
with those features (and no more). The difference between SC and FSC as features affects the distribution
of profile frequencies which is used both for selection, and also to build the decision tree.
Pv2 uses fuzzy counts by default, which are found as follows. Consider a secondary structure S and its
SC profile P. For each SC not already in P, expand its region [i, j; k, l] slightly and count the non-SHC
base pairs from S which fall inside. This number is then compared to a baseline. If it is high enough, then
[i, j; k, l] is added to the FSC profile for S. A base pair (x, y) falls inside the expanded region of [i, j; k, l]
with dilation size b if i − b ≤ x < i + k + b and j − l − b < y ≤ j + b. The baseline is a fixed fraction, by
default 1/3, of the average number of base pairs in the expanded region over all structures with [i, j; k, l]
in their SC profile.
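The region test and baseline comparison above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Pv2 implementation: the function names are hypothetical, "high enough" is taken to mean greater than or equal to the baseline, and the average is taken over all base pairs of the member structures that fall in the expanded region.

```python
def in_expanded_region(pair, region, b=5):
    """True if base pair (x, y) lies in [i, j; k, l] dilated by b."""
    x, y = pair
    i, j, k, l = region
    return (i - b <= x < i + k + b) and (j - l - b < y <= j + b)


def count_in_region(pairs, region, b=5):
    """Number of the given base pairs that fall inside the expanded region."""
    return sum(in_expanded_region(p, region, b) for p in pairs)


def passes_fuzzy_baseline(non_shc_pairs, member_structures, region,
                          b=5, margin=1 / 3):
    """Decide whether a stem class is added to the FSC profile of a structure.

    non_shc_pairs: the non-SHC base pairs of the structure S being examined.
    member_structures: base pair lists of the structures that already have
        the stem class in their SC profile.
    Assumption: the comparison with the baseline is >=.
    """
    if not member_structures:
        return False
    average = sum(count_in_region(s, region, b)
                  for s in member_structures) / len(member_structures)
    return count_in_region(non_shc_pairs, region, b) >= margin * average


region = (10, 50, 4, 4)
members = [[(10, 50), (11, 49), (12, 48), (13, 47)], [(10, 50), (11, 49)]]
added = passes_fuzzy_baseline([(9, 51), (14, 46)], members, region, b=2)
not_added = passes_fuzzy_baseline([], members, region, b=2)
```

The defaults b=5 and margin=1/3 mirror the "Fuzzy Dilation Size" and "Fuzzy Basepair Frequency Margin" entries in Table 1.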
3.3 Decision tree construction
A path starting at the root of the tree corresponds to a sequence of choices, positive and/or negative, on
features which results in one or more selected profiles as a leaf. As will be explained, the multiplicity is due
to the use of “contingency” leaf nodes. By default, fuzzy frequencies are used to build the tree, but exact
ones can be chosen instead.
Every node corresponds to a subset of the Boltzmann sample, and is labeled with the number of structures
under consideration. The root node is the full sample input, i.e. all nonempty profiles. Subsequent nodes
consist of groups of profiles determined by prior decisions on feature inclusion/exclusion. Edges denote
decisions and are labeled with one or more features, with negative choices indicated by ¬. Nontrivial SC are
denoted by letters, and trivial ones by their SHC integer index. Subset sizes are updated after every decision
to remove the profiles no longer in consideration.
The choice of inclusion/exclusion for each remaining feature splits the selected profiles currently un-
der consideration. Features which yield the same split are grouped together into a common decision for
consideration.
A decision is “forced” if the other side of the split is empty, i.e. if none of the selected profiles in the
current group have the opposite choice of feature(s). There is at most one forced common decision possible,
and it has priority. It often includes multiple features, particularly negative options. In this case, there is a
single down edge from the current node.
Otherwise, there will be two down edges, one for each side of the split for the chosen decision. Note that
both sides contain at least one selected profile. In this case, the different possible common decisions (which
may consist of a single feature) are considered. The one which maximizes the Hellinger distance is chosen.
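The grouping of features into common decisions, and the detection of a forced one, can be sketched in Python as below. This is an illustrative reading of the description, with hypothetical function names; profiles are assumed to be frozensets of feature labels, and a decision is a dict mapping each feature to True (include) or False (exclude, i.e. ¬).

```python
def group_common_decisions(profiles, features):
    """Group features that split the profiles identically.

    Returns a list of (yes_side, decision) pairs, where yes_side is the set
    of profiles consistent with the decision.
    Assumption: profiles is a nonempty collection of frozensets.
    """
    everything = frozenset(profiles)
    groups = {}
    for feature in sorted(features):
        with_f = frozenset(p for p in everything if feature in p)
        without_f = everything - with_f
        key = frozenset([with_f, without_f])  # the split, ignoring orientation
        if key not in groups:
            # Orient the split so that the "yes" side is never empty.
            groups[key] = (with_f if with_f else without_f, {})
        yes_side, decision = groups[key]
        decision[feature] = (with_f == yes_side)
    return list(groups.values())


def forced_decision(profiles, features):
    """Return the common decision whose other side is empty, or None."""
    everything = frozenset(profiles)
    for yes_side, decision in group_common_decisions(profiles, features):
        if yes_side == everything:
            return decision
    return None


profiles = [frozenset("AC"), frozenset("A"), frozenset("BC"), frozenset("B")]
decisions = [d for _side, d in group_common_decisions(profiles, "ABCD")]
forced = forced_decision(profiles, "ABCD")
```

Here A and ¬B yield the same split and so form one common decision, while D, which appears in no profile, gives the forced decision ¬D. Features present in every profile and features absent from every profile share the same trivial split, which is consistent with there being at most one forced common decision.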
The Hellinger distance [4] is computed over discrete probability distributions p and q defined on sample
space S as

\[ D_H(p, q, S) = \sqrt{\frac{1}{2} \sum_{s \in S} \left( \sqrt{p(s)} - \sqrt{q(s)} \right)^{2}} \tag{1} \]

It is a measure of the similarity of p and q, and achieves a maximum of 1, i.e. the greatest dissimilarity, when
p and q have disjoint support.
Let F be the set of features corresponding to a common split in the current selected profiles. The sample
space S for the Hellinger distance computation corresponding to F is the set of all possible combinations of
the remaining features after removing prior decisions (corresponding to the path to the current node) and
the feature(s) in F. The discrete distributions p and q are the normalized frequencies from the two sides
of the split; their support over S is typically sparse. The decision chosen is the F where p and q are most
dissimilar, with ties broken by lexicographic ordering.
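A minimal Python sketch of this selection step follows. It is an illustration under stated assumptions rather than the Pv2 code: the function names are hypothetical, profiles are given as a dict from frozensets of feature labels to counts, a decision is a dict from feature to True/False, both sides of every candidate split are assumed nonempty (i.e. no forced decision), and ties are broken by the sorted feature labels.

```python
import math
from collections import Counter


def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as dicts."""
    support = set(p) | set(q)
    total = sum((math.sqrt(p.get(s, 0.0)) - math.sqrt(q.get(s, 0.0))) ** 2
                for s in support)
    return math.sqrt(total / 2)


def normalized(counts):
    """Turn counts into relative frequencies."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def split_distance(profiles, decision, decided=frozenset()):
    """Distance between the two sides of the split induced by a decision.

    Features in the decision and previously decided features are removed
    before comparing the remaining feature combinations.
    """
    # All features in a common decision give the same split, so one suffices.
    feature, sign = next(iter(decision.items()))
    removed = set(decision) | set(decided)
    sides = {True: Counter(), False: Counter()}
    for profile, count in profiles.items():
        on_yes_side = (feature in profile) == sign
        sides[on_yes_side][frozenset(profile - removed)] += count
    return hellinger(normalized(sides[True]), normalized(sides[False]))


def choose_decision(profiles, decisions, decided=frozenset()):
    """Pick the decision with the greatest distance; ties go to sorted labels."""
    return min(decisions,
               key=lambda d: (-split_distance(profiles, d, decided), sorted(d)))


profiles = {frozenset("AC"): 40, frozenset("A"): 10,
            frozenset("BC"): 30, frozenset("B"): 20}
candidates = [{"A": True, "B": False}, {"C": True}]
chosen = choose_decision(profiles, candidates)
```

In this small example, splitting on C leaves the distributions (A: 4/7, B: 3/7) and (A: 1/3, B: 2/3), which are slightly further apart than the two sides obtained by splitting on A and ¬B, so the decision on C is chosen.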
This process of grouping remaining features into decisions, and choosing one of them, proceeds down
each branch of the tree until a selected profile is reached. At this point, the tree is evaluated for contingency
nodes.
All the descendants of a non-leaf node are replaced by a contingency node if two conditions are met (and these
are not met by any of its ancestors). First, at least 75% of the full binary tree is present; this is the default
minimum proportion (see Table 1). Second, all paths
from the node being evaluated to a leaf descendant have the same set of common decisions. If so, then that
set is presented in a single contingency table.
In this case, all frequencies in the table are reported, not just the ones for selected profiles. The decision
edges are collapsed down to a single (dashed) contingency edge, labeled with all the decisions so condensed.
The resulting contingency node (dashed rectangle) compactly represents multiple selected profiles. Its fre-
quency is updated to include the low frequency structures reported in the contingency table.
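One possible reading of the two contingency conditions is sketched below in Python. The function name is hypothetical, and the sketch rests on an assumption that the text does not settle: the proportion of the full binary tree is measured here as the number of leaves present divided by 2^d, where d is the number of common decisions on each path. Pv2 may measure this proportion differently.

```python
def can_collapse(leaf_paths, min_proportion=0.75):
    """Check whether the subtree below a node qualifies as a contingency node.

    leaf_paths: one dict per leaf below the node, mapping each common
        decision label on the path to True/False (the side taken).
    Assumption: "proportion of the full binary tree" is the fraction of
        the 2**d possible leaves that are present.
    """
    decision_sets = {frozenset(path) for path in leaf_paths}
    if len(decision_sets) != 1:
        # Paths differ in their decisions (or there are no leaves at all).
        return False
    depth = len(next(iter(decision_sets)))
    return len(leaf_paths) / 2 ** depth >= min_proportion


three_of_four = [{"X": True, "Y": True}, {"X": True, "Y": False},
                 {"X": False, "Y": True}]
two_of_four = three_of_four[:2]
mixed = [{"X": True, "Y": True}, {"X": False, "Z": True}]
```

Under this reading, three of the four possible leaves over decisions X and Y meet the default 0.75 threshold, two do not, and paths with different decision sets never qualify.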
References
[1] J. S. Reuter and D. H. Mathews, “RNAstructure: Software for RNA secondary structure prediction and
analysis,” BMC Bioinformatics, vol. 11, 2010.
[2] R. Lorenz, S. H. Bernhart, C. Höner zu Siederdissen, H. Tafer, C. Flamm, P. F. Stadler, and I. L.
Hofacker, “ViennaRNA package 2.0,” Algorithms for Molecular Biology, vol. 6, 2011.
[3] E. R. Gansner and S. C. North, “An open graph visualization system and its applications to software
engineering,” Pract. Exper, pp. 1–5, 1999.
[4] L. L. Cam and G. L. Yang, Asymptotics in Statistics. Springer New York, 2 ed., 2000.