Sequence Bundles: a novel method for visualising, discovering and exploring sequence motifs

Authors:
  • Science Practice

RESEARCH Open Access
Marek Kultys1*, Lydia Nicholas1, Roland Schwarz2, Nick Goldman2, James King1
From 3rd IEEE Symposium on Biological Data Visualization
Atlanta, GA, USA. 13-14 October 2013
Abstract
Background: We introduce Sequence Bundles, a novel data visualisation method for representing multiple
sequence alignments (MSAs). We identify and address key limitations of the existing bioinformatics data
visualisation methods (i.e. the Sequence Logo) by enabling Sequence Bundles to give salient visual expression to
sequence motifs and other data features, which would otherwise remain hidden.
Methods: For the development of Sequence Bundles we employed research-led information design
methodologies. Sequences are encoded as uninterrupted, semi-opaque lines plotted on a 2-dimensional
reconfigurable grid. Each line represents a single sequence. The thickness and opacity of the stack at each residue
in each position indicate the level of conservation and the lines' curved paths expose patterns in correlation and
functionality. Several MSAs can be visualised in a composite image. The Sequence Bundles method is designed to
favour a tangible, continuous and intuitive display of information.
Results: We have developed a software demonstration application for generating a Sequence Bundles
visualisation of MSAs provided for the BioVis 2013 redesign contest. A subsequent exploration of the visualised
line patterns allowed for the discovery of a number of interesting features in the dataset. Reported features
include the extreme conservation of sequences displaying a specific residue and bifurcations of the consensus
sequence.
Conclusions: Sequence Bundles is a novel method for visualisation of MSAs and the discovery of sequence
motifs. It can aid in generating new insight and hypothesis making. Sequence Bundles is well disposed for
future implementation as an interactive visual analytics software, which can complement existing visualisation
tools.
Background
Sequence Bundles is a novel method for collation, visual
representation, exploration and analysis of multiple
sequence alignment (MSA) data [1]. Since its develop-
ment, this method has been used to visualise and expose
a number of sequence motifs and data features in pro-
tein alignments. The Sequence Bundles method was pre-
sented at the IEEEVis 2013 conference in Atlanta,
Georgia, where it was awarded the ex aequo honourable
mention in the BioVis 2013 data redesign contest.
Motivation
With the continuous development of ever more powerful
methods for data collection and generation, we are faced
with the challenge of not only making sense of this abun-
dance of information, but also making good use of it.
Modern computational methods for structuring data,
finding patterns and querying databases address many of
these challenges already. However, in many processes,
the abilities intrinsic to human perception are still not
matched by computers. Such processes include: rapidly
recognising complex and non-obvious patterns; instant
inferring, deducing and ad hoc hypothesis-forming;
following sound and scientifically informed intuition. We
aimed at capitalising on these human abilities and tried
to bring sequence data analysis closer to human
experience.
* Correspondence: marek@science-practice.com
1 Science Practice Ltd, London, 83-85 Paul Street, EC2A 4NQ, UK
Full list of author information is available at the end of the article
Kultys et al. BMC Proceedings 2014, 8(Suppl 2):S8
http://www.biomedcentral.com/1753-6561/8/S2/S8
© 2014 Kultys et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Our motivation in creating, developing, and putting
Sequence Bundles to practical use was to allow for the
discovery of hidden sequence motifs and other data fea-
tures in a visualised dataset by direct manipulation and
visual analysis of that data visualisation itself. Sequence
Bundles is a visualisation method aimed at aiding scienti-
fic discovery by enabling the process of direct exploration
where visualisation can be used as a sandbox for rapid
testing of hypotheses, suppositions and even speculations
about MSAs.
We also aimed to design a visualisation method
that would be relatively accessible to readers from
outside the domain (e.g. prospective collaborators).
By revealing more, and doing so more intuitively
than existing MSA visualisation methods, the Sequence
Bundles method is intended to be
equally approachable and attractive to both practitioner
and non-practitioner audiences.
Related work
With the current growth in the amount of biological data,
its scale, variety and complexity, new strategies and tools
for exploring this wealth of knowledge are required [2,3].
Moreover, in order for this knowledge to be understand-
able and usable for both expert and interdisciplinary audi-
ences, it needs to be presented in accessible, transparent
and intuitive ways.
In bioinformatics, a convention of the Sequence Logo
has been developed [4] in order to enable the display of a
range of MSA features in a single graphic: the consensus
sequence, relative frequencies of residues at every position,
the amount of information present at every position mea-
sured in bits, as well as significant locations in the input
alignment. Further developments which build on the
Sequence Logo method include, inter alia: HMMLogo (giving
visual representation to both emission and transition
probabilities of Profile Hidden Markov Models, pHMMs)
[5]; Seq2Logo (including other important information in
the visual output, e.g. about the low number of observa-
tions) [6]; CodonLogo (a tool that allows for visual discri-
mination between patterns of codon and nucleotide
conservation) [7]; and pLogo (visualising residue heights
scaled relative to their statistical significance) [8]. All of
these developments are in essence variations on the origi-
nal Sequence Logo visualisation method by Schneider and
Stephens [4] and even though they enhance the Logo
visualisation by the addition of novel features, they also
retain the Logo's inherent limitations.
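For orientation, the per-position information content that a Sequence Logo plots on its Y-axis can be sketched as follows. This is a minimal illustration, not code from any of the cited tools: it assumes a 20-letter amino acid alphabet, ignores gap characters, and omits the small-sample correction described by Schneider and Stephens [4]. The function name `column_information_bits` is hypothetical.

```python
import math
from collections import Counter


def column_information_bits(column, alphabet_size=20, gap_char="-"):
    """Information content (in bits) of one alignment column.

    Computed as log2(alphabet_size) minus the Shannon entropy of the
    observed residue frequencies. Gaps are ignored and no small-sample
    correction is applied (both are simplifying assumptions).
    """
    residues = [r for r in column if r != gap_char]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


# A fully conserved column reaches the maximum of log2(20) bits,
# whether it was built from 9 or from 9,000 sequences.
few = column_information_bits("R" * 9)
many = column_information_bits("R" * 9000)
```

Without a correction for sample size, the two columns above are indistinguishable, which is one reason the number of underlying sequences matters when a Logo is read.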
Some kinds of information buried in MSAs cannot be
easily exposed by either the Sequence Logo method, or
any of its variations. When addressing those MSA features,
designers of visualisation tools need to rely on
combining other methods [9] or, as in the case of
Sequence Bundles, creating new ones.
Objectives
In a series of interviews and workshops with bioinformati-
cians from the United Kingdom, United States and Poland
(see the "Acknowledgements" section), we identified a
number of requirements that a successful MSA visualisa-
tion should support, as well as a number of limitations
and redundant features of the existing Sequence Logo
method that should be addressed. This led our design
efforts towards the following objectives:
1. Although Sequence Logos are very effective in
exposing the general consensus sequence, as well as
amino acid distribution on each position, they also
obscure patterns in the relationships between sites
within the sequences. This results in very important
information about residue correlation and non-obvious
sequence affinity being removed completely from the
visualisation. Our general goal was, therefore, to rein-
troduce this relational information to the visualisation
in order to facilitate and assist visual exposure of
sequence motifs.
2. Our scientific interviewees saw little benefit in
showing the amount of information on each position,
measured in Sequence Logos against the Y-axis and
expressed in bits. In fact, some scientists were sur-
prised to learn about that during the interview, as they
had never used this measure before. Displaying the
amount of information seemed to be addressed to a
far more specialised user. Therefore, our aim was to
remove this data from the Y-axis and repurpose the
axis for the benefit of a larger and more interdisciplin-
ary audience.
3. Some visualisation tools are well suited for showing
details, while others favour a more global inspection.
Residue statistical detail and localised sequence
properties can be easily identified and described by
using Sequence Logos (or even by inspecting parts of a
MSA itself). However, the Logo method is of limited
value when applied to datasets with longer sequences,
because of its site-specific focus. Thus, our objective
was to favour global inspection of datasets by design-
ing a visualisation encoding which is capable of expos-
ing macroscopic patterns and generating findings of
sequence-wide significance.
4. A Sequence Logo hides important information
about the total number of analysed sequences (this
information exists in the length of a MSA itself) and
their relative affinity (relative distance from each
other on the phylogenetic tree). Consequently, our
aim was to provide an indication of the sample size
(number of sequences in a visualised MSA).
5. The Sequence Logo visualisation method is
equally well equipped to display either DNA or pro-
tein MSAs. In fact, the Logo visualisation principles
should be easily applied to any sequential dataset
which can be formatted as a MSA. Our goal was to
retain this universal scope of application.
In line with our motivation, and in order to address
Sequence Logo limitations and other visualisation chal-
lenges identified during our research, we decided to
abandon the convention of Sequence Logo and develop a
completely new method for visualising MSA data, which
we explain below. First, in the "Methods" section, we outline
the iterative design methodologies employed in the project,
followed by an explanation of the Sequence Bundles
visual encoding and a summary of key departures from
the Sequence Logo. Later, in the "Results" section, we
describe the extent to which Sequence Bundles has been
developed and list a number of interesting data features
exposed in the competition dataset by using our visuali-
sation method. Finally, we conclude with a discussion
around the interactive potential of the Sequence Bundles
method, which can complement existing visualisation
tools to expose what otherwise could remain hidden.
Methods
Design methods
We approach bioinformatics visualisation from the per-
spective of information design. Information design is a
design discipline focused on defining, planning, and shap-
ing of the contents of a message and the environments in
which it is presented, with the intention of satisfying the
information needs of the intended recipients [10]. In our
case the MSA is the contents of a message and the recipi-
ents are bioinformatics practitioners. Taking this approach
and using methodologies and techniques practiced in the
design world, we developed Sequence Bundles in the
following research-led and iterative design process:
1. Desk research phase in which we conducted a
multidisciplinary and multi-level literature review
and acquired basic understanding of bioinformatics
fundamentals;
2. Initial sketching phase in which we tried to
produce Sequence Logos ourselves by using both fic-
tional and real data. This enabled us to understand
how exactly Sequence Logo visual encoding works,
which features it exposes, and which it conceals;
3. External research phase in which we interviewed
a number of molecular biology and bioinformatics
experts to learn about their scientific work, their opi-
nion on Sequence Logos and its strengths and limita-
tions, as well as their reasons for which they decide to
use or not to use the Logo in their practice;
4. Prototyping on paper and idea generating phase
in which we brainstormed new concepts for sequence
data representation, explored diverse strategies for
visually encoding bioinformatics data, investigated
ways in which Sequence Logos can be redesigned,
and prototyped all our ideas in sketches, drawings
and mock-ups;
5. Stimulus research and ideas refinements phase
in which we consulted with bioinformatics experts
presenting them our prototyped ideas once again to
obtain detailed explanations of how selected
approaches can function. For this phase we simulated
visualisation outcomes with real small MSAs;
6. Prototyping in code phase in which we developed
the Sequence Bundles demonstration application
to generate actual visualisations of the BioVis
2013 redesign contest dataset, which helped in
further refinements of the visual encoding;
7. Visual analysis and insight generation phase
which emerged unplanned, when we started exploring
and editing vector visualisations generated with the
demonstration application. In this phase we discovered
a number of features in the competition data, which
were given salient expression by the Sequence Bundles
visual encoding. We discuss some of these features in
the "Results" section.
8. Presentation and expert feedback phase, which took
place at the IEEEVis 2013 conference in Atlanta,
Georgia, where we presented the Sequence Bundles
method and our findings to the BioVis 2013 contest
jury and other experts in the field. We received valu-
able feedback regarding our developments thus far
and discussed potential directions for future work.
Visual encoding
Figure 1 shows a Sequence Bundles visualisation of the
BioVis 2013 redesign contest dataset [11]. The visualised
MSA contains 1809 aligned sequences of the adenylate
kinase lid (AKL) domain sampled from two groups of
bacteria: Gram-positive (886 sequences labelled black)
and Gram-negative (923 sequences labelled blue). Each
sequence in the MSA is 36 positions long. All visualisa-
tions throughout the paper are based on this dataset
provided for the contest entrants (see the
"Acknowledgements" section).
The Sequence Bundles method plots sequences as
stacked lines against a horizontal X-axis, which marks
sequence base or residue numbers, and against a vertical
Y-axis, on which residues are arranged on a scale of their
physicochemical properties (in Figure 1 it is the scale of
amino acid hydrophobicity ordered after Wampler [12])
and marked with their letter symbols. A distinct Y-axis
position is used for gap characters in the MSA. One line
represents each protein sequence. Read from left to right,
each line's precise shape plotted against both axes corresponds
to the sequence of specific residues displayed on
each subsequent site. This visual encoding of sequences
combined with their meaningful vertical organisation
allows for saliently exposing patterns in their properties
and functionality (e.g. when amino acid hydrophobicity
defines the Y-axis, the more hydrophilic each sequence
fragment is, the closer to the top of the chart it will
appear; conversely, the more hydrophobic it is, the lower
the line will be plotted).
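The mapping from a sequence to a plotted line can be sketched as below. This is an illustrative Python sketch, not the authors' Processing implementation. The function name `sequence_to_points`, the short residue ordering and the placement of the gap row below all residues are assumptions made for the example; a real Y-axis would use a published scale such as the hydrophobicity ordering shown in Figure 1.

```python
def sequence_to_points(sequence, residue_order, gap_char="-"):
    """Map one aligned sequence to (x, y) anchor points for its line.

    x is the 1-based alignment position; y is the row index of the residue
    in `residue_order` (row 0 is the top of the chart). Gap characters get
    their own row below all residues (an assumed placement).
    """
    rows = {residue: row for row, residue in enumerate(residue_order)}
    rows[gap_char] = len(residue_order)  # distinct Y-axis row for gaps
    return [(x, rows[residue]) for x, residue in enumerate(sequence, start=1)]


# Hypothetical three-letter ordering, for illustration only.
order = ["R", "K", "P"]
points = sequence_to_points("RP-K", order)
```

A renderer would then draw a smooth curve through these points. Reordering `residue_order` changes only the y values, so the same alignment can be replotted against a different physicochemical scale without touching the sequence data.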
In Figure 1 we contrast two groups of bacteria by
compositing two coloured sub-Bundles (Gram-positives
are black and Gram-negatives are blue). Each sub-Bundle
is created by plotting all lines representing individual
sequences from the respective MSA and stacking them in
sets of 10. In Figure 1, for the Gram-positive sub-Bundle
all black lines displaying arginine (R) in position 1 will be
arranged in stacks of 10 and overlaid at least 88 times.
Lines are collated in the same order in which sequences
reside in the MSA. Line thickness in Sequence Bundles is
uniform and set to prevent white gaps from appearing
between neighbouring lines; thereby a stack of many
lines appears bundled together. In order to enable the
distinction between denser and less dense stacks, lines in
Sequence Bundles are semi-transparent. In all figures in
this paper line transparency is set to 98% (2% opacity) in
normal blending mode to enable clear display of overlay-
ing lines and motifs. Both the thickness and the opacity
of the stack of lines at each letter in each position indi-
cate the level of localised consensus between sequences.
The general consensus sequence for each group of
sequences compared in the MSA is also shown. Optimal
line tangency in Sequence Bundles was selected in our
iterative design process, providing reductions in visual
clutter created by intersecting lines and improvements in
perceptual clarity of the image.
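The effect of overlaying many semi-transparent lines can be estimated with the standard formula for normal ("source-over") alpha blending of identical layers. The sketch below is illustrative arithmetic rather than the authors' code: the function name is hypothetical, and it assumes that every layer has the same opacity and covers the same pixels.

```python
def stacked_opacity(layers, layer_opacity=0.02):
    """Combined opacity of `layers` identical overlays in normal blending.

    Each layer lets (1 - layer_opacity) of the background through, so the
    combined opacity is 1 - (1 - layer_opacity) ** layers.
    """
    return 1.0 - (1.0 - layer_opacity) ** layers


single = stacked_opacity(1)   # one line at 2% opacity
dense = stacked_opacity(88)   # 88 overlaid lines, as in the position 1 example
```

Under these assumptions a single line is barely visible, while 88 overlays are mostly opaque, which is why densely populated stacks stand out against sparse ones.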
Comparison of sub-Bundles in the composite Sequence
Bundles visualisation is facilitated by the use of labelling
by colour, as well as by plotting each group with a vertical
offset relative to one another. The selection of black and
saturated light blue colours in Figure 1 complies with the
best practices of visual design [13], as it enables users with
any kind of colour-blindness to discern each sub-Bundle,
thus allowing an even greater range of users to comforta-
bly work with Sequence Bundles.
Key departures from sequence logos
The Sequence Bundles method was conceived as a rede-
sign of the existing long-standing convention of Sequence
Logos. However, the extent to which Sequence Bundles
departed from the Logo qualifies it as an altogether
Figure 1 Sequence Bundles comparing amino acid distribution and correlation in the AKL domain. Bundled visualisation plots sequences
as stacked lines against a Y-axis of letters arranged on a scale representing amino acid hydrophobicity. The lines' curved paths expose the
conservation of residues by converging at matched positions. Their place relative to letters on the Y-axis exposes patterns in functionality. The
consensus sequence is indicated. Lines representing two groups of organisms differ by colour: Gram-positive bacteria (black lines) and
Gram-negative bacteria (blue lines). The visualisation is generated from a total of 1809 AKL protein sequences. The number of samples is: 923
Gram-negative sequences vs. 886 Gram-positive sequences, a ratio of approximately 100:96.
separate, novel approach to the same problem. Here we
list six key departures from the Sequence Logo which
allow Sequence Bundles to overcome main limitations and
weaknesses of the Logo:
A. Shifting the focus of the visualisation from
being position-oriented to sequence-oriented by
explicitly maintaining continuity and integrity of
each plotted sequence;
Reason: Residues' functions are associated with their
position in relation to one another within proteins.
Because Sequence Logos represent residues in isola-
tion without valuable contextual information, their
position-oriented focus limits their uses. The Sequence
Bundles method is sequence-focused; it therefore
allows a string of residues to be viewed holistically as a
functional protein, and exposes correlations and
motifs, potentially assisting discovery (see the "Results"
section for examples).
B. Using semi-opaque curved paths instead of
deformed letters;
Reason: Deformed type is hard to read and stacking
letters means that highly conserved ones rest on an
uneven bed of less conserved ones, which makes
them difficult to compare. Unfortunate stacking can
also lead to letter misinterpretation (e.g. V above I in
position 23 of the contest Logo could be misread as
Y). Representing sequences with curved paths allows
for their equal and proportional display with strong
focus on sequence continuity. Atypical sequences are
never removed but are faint enough to be
inconspicuous.
C. Reassigning the Y-axis from displaying the
amount of information measured in bits to displaying
letter-coded amino acids arranged by physicochemical
properties;
Reason: We found that many bioinformaticians were
uninterested in the level of detail about mutual infor-
mation shown in protein alignments. For the purpose
of protein conformation research, residue physicochemical
properties are reportedly a far more important
measure and deserve more refined and structured
representation than crude colour-coding used in
Sequence Logos (this also allows the Sequence Bundles
method to adhere to the best practice of design acces-
sibility for users with colour vision deficiency).
A comparison of two different Y-axis arrangements by
amino acid physicochemical properties and their effects
on the Sequence Bundles plots is shown in Figure 2
(ordering of amino acids by molecular weight after
Lide [14]).
D. Integrating three separate contest Sequence
Logo figures into one combined visualisation, where
both Gram-positive and Gram-negative bacteria can
be directly juxtaposed;
Reason: It is very difficult to compare stacked letters
across separate Sequence Logo figures, and we found
that users frequently misjudged letters' height and
relative proportions. By placing the two datasets on
the same graph and differentiating by colour, the
Sequence Bundles method enables an easy and direct
comparison of both groups, whilst also offering a gen-
eral overview of the whole population. Thus, any
arbitrary collection of sets of sequences can be readily
compared by stacking a given number of lines in
Sequence Bundles, with each sub-Bundle remaining
in direct visual relationship with the rest. The com-
pound plot allows both overall and relative features
to be observed, easily compared and contrasted.
E. Visualising MSA gaps as a separate unit on the
Y-axis;
Reason: MSAs rely on gaps to optimise alignment.
Gaps are never shown in Sequence Logos, which dis-
sociates visual representations from visualised data
(although some Sequence Logo modifications visualise
information about sequence insertions and deletions
included in the alignment). Sequence Bundles displays
gap locations within each sequence alongside each gap's
actual length.
F. Visualising explicitly all sequences included in
the alignment and providing the total number of
sequences in each colour group;
Reason: We discovered that scientists are often hesi-
tant to trust Sequence Logos as they give no indication
of the total number of compared sequences and pro-
portions of sequences distributed between juxtaposed
coloured groups. Logos generated from 9 or 9,000
sequences can look the same, but their credibility
would be very different. To make this information suf-
ficiently explicit and avoid visual clutter, transparency
level applied to plotted lines is balanced against the
total number of all visualised sequences.
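Departure A can be illustrated with a toy example: two alignments with identical per-position residue frequencies, and therefore identical Sequence Logos, can still differ in which residues occur together within the same sequence. The alignments and function names below are invented for illustration and are not drawn from the contest dataset.

```python
from collections import Counter


def column_counts(msa):
    """Per-position residue counts: the information a Sequence Logo uses."""
    return [Counter(column) for column in zip(*msa)]


def pair_counts(msa, pos_a, pos_b):
    """Counts of residue pairs seen together at two 1-based positions."""
    return Counter((seq[pos_a - 1], seq[pos_b - 1]) for seq in msa)


# Two made-up alignments of four sequences, two positions each.
correlated = ["KA", "KA", "AK", "AK"]
shuffled = ["KA", "KK", "AK", "AA"]

same_columns = column_counts(correlated) == column_counts(shuffled)
same_pairs = pair_counts(correlated, 1, 2) == pair_counts(shuffled, 1, 2)
```

Here `same_columns` is true while `same_pairs` is false: a position-oriented summary cannot tell the two alignments apart, whereas a display that keeps each sequence as a continuous line retains the pairing.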
Results
Current developments
The Sequence Bundles method has been implemented as
a demonstration application written in the open source
Processing language [15]. This application includes algo-
rithms and methods responsible for visual encoding of
already structured and formatted databases. In the visua-
lisation pipeline outlined by Ward et al. [16], Sequence
Bundles facilitates the "Data to Visual Mapping" process
and to some extent also the "View Transformation" process.
It does not support the "Data Modelling" process or
the "Data Selection" process; these need to be completed
outside of the Sequence Bundles demonstration
application.
At this stage of development the Sequence Bundles
demonstration application offers automated means of plot-
ting and visually encoding a large number of sequences
organised in a previously curated MSA. The Sequence Bun-
dles demonstration application accepts input of sequences
in plain text (TXT) file formats (including gaps). Two out-
put types are supported: bitmap and vector graphics files.
Bitmaps can be exported from the Processing default image
renderer in a specified resolution. Vector graphics files can
be exported to Portable Document Format (PDF) with
preserved editing capabilities, measurements and scale, as
well as specified colour and transparency settings (this is
attained via an open source PDF Export library for
Processing).
Using the Sequence Bundles demonstration applica-
tion we have managed to discover a number of interest-
ing features in the contest dataset, which are outlined
below.
Data features identified with sequence bundles
The development of the Sequence Bundles visual encoding
and the demonstration application for generating vector
visualisations enabled the exploration of the competition
dataset in a novel visual manner. Various actions, such as
rendering of the data according to different residue order-
ing principles (Figure 2), brushing (i.e. making a selection
Figure 2 Comparison of two Sequence Bundles plots differentiated by the Y-axis organisation. Reorganising the Y-axis by different
principles enables a more in-depth exploration of visualised data and assists in finding meaningful links between data alignment and the
physical properties of amino acids displayed in the sequence. Panel A shows the Sequence Bundles visualisation of the AKL domain with the
Y-axis organised according to amino acid hydrophobicity (hydrophilic to hydrophobic residues arranged top to bottom after Wampler [12]).
Panel B shows the Sequence Bundles visualisation of the same dataset with the Y-axis organised according to amino acid molecular weight
(small to large molecules arranged top to bottom after Lide [14]).
by dragging the mouse cursor in an interactive visualisa-
tion view [17]) and highlighting of selected regions or
close-up examination of interesting sections of sequences,
led to the discovery of a number of interesting and poten-
tially insightful features of the AKL domain dataset. In
Figures 3, 4, 5 and 6 we illustrate four of those features, specify
details about each of them in figure legends and outline
the methods by which they became exposed.
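The highlighting step behind Features 1 and 2 amounts to selecting every sequence that displays a given residue at a given position and then checking how similar the selected sequences are. A minimal sketch follows, assuming the MSA is held as a list of equal-length strings. The function names and the tiny alignment are hypothetical and are not taken from the contest dataset.

```python
from collections import Counter


def select_by_residue(msa, position, residue):
    """Return the sequences that display `residue` at a 1-based position."""
    return [seq for seq in msa if seq[position - 1] == residue]


def largest_identical_group(sequences):
    """Size of the largest group of identical sequences (0 if empty)."""
    if not sequences:
        return 0
    return Counter(sequences).most_common(1)[0][1]


# Made-up alignment for illustration only.
msa = ["RNPK", "RNPK", "RNAK", "RGPK"]
selected = select_by_residue(msa, position=2, residue="N")
identical = largest_identical_group(selected)
```

In an interactive tool the same selection would be driven by brushing; the counts give a quick numerical check on what the highlighted lines suggest visually.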
Conclusions
We have created a novel visualisation method for dis-
playing MSAs called Sequence Bundles and developed it
as a demonstration application running in Processing.
We have demonstrated the efficacy of our design deci-
sions and the value of Sequence Bundles presentation of
data by exposing a number of interesting and surprising
features in the contest dataset, which would otherwise
have remained hidden. Although it remains to be con-
firmed what scientific meaning the observed features
have, the ability of the Sequence Bundles method to
identify features in data that are of interest to the data
authors themselves demonstrates the intuitiveness and
high suitability of Sequence Bundles for visual explora-
tion of MSAs.
The results of our visual investigation into the hidden
patterns in the contest data also demonstrate the predis-
position of the Sequence Bundles method for prospec-
tive implementation as a dynamic and interactive
software tool for MSA visualisation and visual analysis.
Conventional controls such as updatable rendering,
zooming in and out, panning, colour-coding, as well as
partitioning and splicing of datasets, currently attainable
Figure 3 Feature 1: Extreme conservation of sequences displaying asparagine in Gram-negatives in position 13. Highlighting all AKL
domain sequences in Gram-negative bacteria displaying asparagine (N) in position 13 exposes extreme conservation of the selected sequences
throughout the length of the visualised protein. The total number of highlighted sequences is 48, including 46 identical (minor variation occurs
only in two sequences in positions 2, 3, 6, 12, 14, 15, 20, 21, 23, 26, 30, 33 and 35). The underlying causes for this unusual conservation remain
to be investigated. Due to the scale of this extreme conservation (about 5.2% of the whole Gram-negative dataset, i.e. 48 of 923 sequences), the findings can have
significant implications for the interpretation and evaluation of data.
Figure 4 Feature 2: Extreme conservation of sequences displaying phenylalanine in Gram-positives in position 35. Highlighting all AKL
domain sequences in Gram-positive bacteria displaying phenylalanine (F) in position 35 reveals that all of these sequences are extremely
conserved throughout the whole length of the visualised protein with rare variation in positions 9, 20, 23 and 25. The total number of
sequences in this selection is 12, of which 6 are identical. The underlying causes for this unusual conservation remain to be investigated.
One possible reason could be a mis-curation of the MSA (which remains to be confirmed).
Figure 5 Feature 3: Dissimilarity of distribution of two pairs of prolines and lysines in positions 17-20. Close inspection of the AKL domain
visualised with Sequence Bundles exposes the dissimilar nature of the distribution of two pairs of residues in the Gram-positives (black) sub-dataset,
which is difficult to observe with existing visualisation methods or in the general consensus sequence (in positions 17-20 it is: ...PPKK...).
The Sequence Bundles method preserves continuity of sequences by visualising them as uninterrupted lines which reveals that while the majority of
sequences in positions 17-18 display a consecutive pair of prolines (indicated by a thick horizontal "bridge" between P-P in panel A), one part of the
Gram-positive sequences displays a lysine in position 19, while another part displays a lysine in position 20. Note that very few black lines bridge the gap
between K-K in the Sequence Bundles visualisation (panel B); the majority of sequences include only one of the lysines. This data feature remains hidden
in the Sequence Logo, as well as in the general consensus sequence itself. In fact, only 23 sequences display the exact ...PPKK... motif fully consistent with
the general consensus sequence. The reason for this dissimilarity of residue distribution in the MSA remains to be explained and interpreted.
Figure 6 Feature 4: Bifurcation of the general consensus sequence in the 'streamgraph' variant of the visualisation. Restructuring the
Sequence Bundles chart into a 'streamgraph' visualisation exposes the pattern of variation formed by sequences displaying the most frequent
residues in positions 19 to 23 in Gram-positive bacteria (black). Note two interweaving threads bifurcating in position 19 and converging in
position 24, one displaying: ...KVEGI... and the other: ...AKADV... Neither of these two parallel threads adheres to the general consensus sequence in
Gram-positives, which in positions 19-23 is: ...KKAGV... Connection 'bridges' between consensus residues in Gram-positives in positions 19-20, 21-22
and 22-23 are much less pronounced than the strong interweaving links between other frequent residues. This data feature can
be saliently exposed because the Sequence Bundles visualisation method displays sequences as continuous lines and not as
discrete items of statistical data in each position (as in the case of the Sequence Logo).
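The contrast between a per-position consensus and the threads actually followed by sequences can be illustrated with a short Python sketch. It assumes an MSA stored as a list of equal-length strings; the three-column toy data and the function names are hypothetical and only mimic the situation described in the caption.

```python
from collections import Counter


def column_consensus(msa, start, end):
    """Most frequent residue in each 1-based column from `start` to `end` inclusive."""
    return "".join(
        Counter(seq[i] for seq in msa).most_common(1)[0][0]
        for i in range(start - 1, end)
    )


def window_threads(msa, start, end):
    """Count whole sub-sequences spanning 1-based columns `start` to `end` inclusive."""
    return Counter(seq[start - 1:end] for seq in msa)


# Toy alignment (hypothetical): two frequent threads, KVE and AKA,
# and a column-wise consensus, KKA, that only one sequence follows.
toy_msa = ["KVE", "KVE", "AKA", "AKA", "KKA"]

print(column_consensus(toy_msa, 1, 3))
print(window_threads(toy_msa, 1, 3).most_common())
```

Counting whole sub-sequences keeps each sequence intact across the window, which is the property that lets continuous lines expose bifurcations that column-wise statistics average away.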
only through edits to the demonstration application
code, would immediately streamline the process of
exploration of the visualisation. Additional features such
as additive and subtractive brushing, highlighting,
annotating and toggling of axes would enable considerable
flexibility, introduce instant user feedback and improve
the general workflow. These operations are currently performed
through the user interface of vector graphics
software. In the longer term, software tools employing
the Sequence Bundles method may benefit from
introducing independent localised Y-axis organisation (as
opposed to the prevalent global arrangement), smart
algorithms to optimise disentanglement of the lines, or
3-dimensional presentation of data. Development of
visual analytics programmes which would take full
advantage of the Sequence Bundles visual encoding
would complement existing MSA visualisation tools
well. This would not only increase the efficiency and
scope of the bioinformatics workflow, but also open the
bioinformatics domain to collaborators from
other fields, as well as to interested non-experts.
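Additive and subtractive brushing, as proposed above, can be thought of as set operations on the indices of selected sequences. The following Python sketch shows one possible formulation; it assumes the MSA is a list of equal-length strings, and the function names and toy data are hypothetical rather than part of any existing Sequence Bundles software.

```python
def brush(msa, position, residues):
    """Indices of sequences whose residue at 1-based `position` is in `residues`."""
    return {i for i, seq in enumerate(msa) if seq[position - 1] in residues}


def additive(selection, brushed):
    """Additive brushing: extend the current selection with the brushed sequences."""
    return selection | brushed


def subtractive(selection, brushed):
    """Subtractive brushing: remove the brushed sequences from the current selection."""
    return selection - brushed


# Toy alignment (hypothetical data).
toy_msa = ["ACFDE", "ACFDK", "GCYDE", "ACYDE"]

selection = brush(toy_msa, 3, {"F"})                          # sequences 0 and 1
selection = additive(selection, brush(toy_msa, 1, {"G"}))     # adds sequence 2
selection = subtractive(selection, brush(toy_msa, 5, {"K"}))  # removes sequence 1
print(sorted(selection))
```

Keeping the selection as a set of sequence indices means that the same selection could drive highlighting in several linked views, although how an interactive tool would do this is left open here.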
List of abbreviations
AKL: adenylate kinase lid; EMBL-EBI: European Molecular Biology
Laboratory, European Bioinformatics Institute; MSA: Multiple Sequence
Alignment; MSAs: Multiple Sequence Alignments (plural); PDF: Portable
Document Format; pHMMs: profile Hidden Markov Models.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
MK and JK conceived the Sequence Bundles project as a submission to the
BioVis 2013 redesign contest. MK acted as project coordinator and lead
designer. MK developed the complete code for the Sequence Bundles
demonstration application and is responsible for exposing all data features
presented. JK oversaw the team, managed the project and provided design
direction. LN acted as the researcher for the project and wrote the contest
submission paper. NG and RS acted as scientific consultants, proposed new
features and gave feedback on preliminary versions of Sequence Bundles. All
authors contributed to team meetings, discussions and workshops. MK
wrote this manuscript, and JK, LN, NG and RS helped to draft it. All authors
read and approved the final version of the manuscript.
Authors' information
MK, JK and LN work at Science Practice, a design consultancy with a strong
focus on collaborations with the biological sciences and the bio-medical industry.
MK is an information designer with a strong interest in data visualisation and
visual analytics. He has lectured at universities in Europe and the USA. He also
authored and led tutorials in visual communication at the 2012 and 2013 IEEE Vis
conferences.
JK is an interaction designer, the principal and the founder of Science
Practice. He has lectured at universities in Europe, the USA and Asia. He is also one
of the winners of the 2008 iGEM grand prix.
LN is an anthropologist and a developer. She is reading for an MSc in digital
anthropology at UCL, London.
NG and RS are scientists at the European Molecular Biology Laboratory,
European Bioinformatics Institute (EMBL-EBI). NG is the leader of the
Goldman research group at EMBL-EBI. The group's interest is in the
development of new models and data analysis methods for the study of
molecular sequence evolution.
RS is a post-doctoral fellow in the Goldman group at EMBL-EBI.
Acknowledgements
We are grateful for the help received from the following scientists during
our research into bioinformatics: Dr Daniel Buchan (Department of
Computer Science, Bioinformatics Group, University College London),
Dr Bruce Palfey (Department of Biological Chemistry, University of Michigan
Medical School), Fulla Abdul-Jabbar (University of Michigan), Dr Joanna
Sułkowska (University of Warsaw, Faculty of Chemistry and University of
California San Diego, Center for Theoretical Biological Physics) and Dr
Efstathios Sideris (Pixelated Noise Ltd). We are also grateful for editorial help
provided by Dr Anna Mieczakowski (Eclipse Experience Ltd).
Acknowledgement of funding support: RS was supported by an EMBL
Interdisciplinary Postdoc (EIPOD) fellowship with Cofunding from Marie Curie
Actions COFUND. NG was supported by EMBL.
We gratefully acknowledge the dataset provided by Drs. Magliery and
Sullivan at The Ohio State University for the purposes of the BioVis 2013
Contest.
Declarations
Publication of this work was supported by Science Practice and EMBL.
This article has been published as part of BMC Proceedings Volume 8
Supplement 2, 2014: Proceedings of the 3rd Annual Symposium on
Biological Data Visualization: Data Analysis and Redesign Contests. The full
contents of the supplement are available online at
http://www.biomedcentral.com/bmcproc/supplements/8/S2
Authors' details
1 Science Practice Ltd, London, 83-85 Paul Street, EC2A 4NQ, UK.
2 European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome
Trust Genome Campus, Cambridge, Hinxton, CB10 1SD, UK.
Published: 28 August 2014
References
1. Science Practice: Sequence Bundles. [http://science-practice.com/projects/
sequence-bundles/], http://dx.doi.org/10.14435/sequence-bundles-biovis.
2. Aniba MR, Poch O, Thompson JD: Issues in bioinformatics benchmarking:
the case study of multiple sequence alignment. Nucleic Acids Research
2010, 38(21):7353-7363[http://dx.doi.org/10.1093/nar/gkq625].
3. Kamena C, Notredame C: Upcoming challenges for multiple sequence
alignment methods in the high-throughput era. Bioinformatics 2009,
25(19):2455-2465 [http://dx.doi.org/10.1093/bioinformatics/btp452].
4. Schneider TD, Stephens RM: Sequence Logos: A New Way to Display
Consensus Sequences. Nucleic Acids Research 1990, 18(20):6097-6100
[http://dx.doi.org/10.1093/nar/18.20.6097].
5. Schuster-Böckler B, Schultz J, Rahmann S: HMM Logos for visualization of
protein families. BMC Bioinformatics 2004, 5:7 [http://dx.doi.org/10.1186/
14712105-5-7].
6. Thomsen MCF, Nielsen M: Seq2Logo: a method for construction and
visualisation of amino acid binding motifs and sequence profiles
including sequence weighting, pseudo counts and two-sided
representation of amino acid enrichment and depletion. Nucleic Acids
Research 2012, 40:W281-W287 [http://dx.doi.org/10.1093/nar/gks469].
7. Sharma V, Murphy DP, Provan G, Baranov PV: CodonLogo: a sequence
logo-based viewer for codon patterns. Bioinformatics 2012,
28(14):1935-1936 [http://dx.doi.org/10.1093/bioinformatics/bts295].
8. OShea JP, Chou MF, Quader SA, Ryan JK, Church GM, Schwartz D: pLogo: a
probabilistic approach to visualizing sequence motifs. Nature Methods
2013, 10(12):1211-1212 [http://dx.doi.org/10.1038/nmeth.2646].
9. Schwarz R, Seibel PN, Rahmann S, Schoen C, Huenerberg M, Müller-
Reible C, Dandekar T, Karchin R, Schultz J, Müller T: Detecting species-site
dependencies in large multiple sequence alignments. Nucleic Acid
Research 2009, 37(18):5959-5968.
10. International Institute for Information Design: idX (information design
exchange) Information Design: Core Competencies, What information
designers know and can do. International Institute for Information Design,
Vienna; 2007 [http://www.iiid.net/PDFs/idxPublication.pdf].
11. 3rd IEEE Symposium on Biological Data Visualisation, BioVis 2013 Data
Redesign Contest. [http://biovis.net/year/2013/info/redesign-contest].
12. Wampler JE: Tutorial on Peptide and Protein Structure. [http://www.bmb.
uga.edu/wampler/tutorial/].
Kultys et al.BMC Proceedings 2014, 8(Suppl 2):S8
http://www.biomedcentral.com/1753-6561/8/S2/S8
Page 9 of 10
13. Kultys M: Visual Alpha-Beta-Gamma: Rudiments of Visual Design for Data
Explorers. Parsons Journal for Information Mapping 2013, 5(1) [http://piim.
newschool.edu/journal/issues/2013/01/index.php].
14. Properties of Amino Acids. In Handbook of Chemistry and Physics, Internet
Version 2005. Boca Raton FL: CRC Press;Lide DR 2005:.
15. Processing 2. [http://processing.org].
16. Ward M, Grinstein G, Keim D: Interactive Data Visualisation Natick MA: A K
Peters; 2010.
17. Becker RA, Cleveland WS: Brushing Scatterplots. Technometrics 1987,
29(2):127-142 [http://dx.doi.org/10.1080/00401706.1987.10488204].
doi:10.1186/1753-6561-8-S2-S8
Cite this article as: Kultys et al.: Sequence Bundles: a novel method for
visualising, discovering and exploring sequence motifs. BMC Proceedings
2014 8(Suppl 2):S8.
Article
Full-text available
Sequence Logos and its variants are the most commonly used method for visualization of multiple sequence alignments (MSAs) and sequence motifs. They provide consensus-based summaries of the sequences in the alignment. Consequently, individual sequences cannot be identified in the visualization and covariant sites are not easily discernible. We recently proposed Sequence Bundles, a motif visualization technique that maintains a one-to-one relationship between sequences and their graphical representation and visualizes covariant sites. We here present Alvis, an open-source platform for the joint explorative analysis of MSAs and phylogenetic trees, employing Sequence Bundles as its main visualization method. Alvis combines the power of the visualization method with an interactive toolkit allowing detection of covariant sites, annotation of trees with synapomorphies and homoplasies, and motif detection. It also offers numerical analysis functionality, such as dimension reduction and classification. Alvis is user-friendly, highly customizable and can export results in publication-quality figures. It is available as a full-featured standalone version (http://www.bitbucket.org/rfs/alvis) and its Sequence Bundles visualization module is further available as a web application (http://science-practice.com/projects/sequence-bundles).
Article
Full-text available
This article discusses terminology useful toward the creation and discussion of visualizations. It continues and expands on a hands-on tutorial delivered by the author at the VisWeek2012 conference in October 2012 in Seattle, WA. Today, many fields of scientific activity have become increasingly reliant on visuals as means of dissemination, exploration, and analysis of information. Consequently, either directly or through computer coding, many scientists become de facto image-makers. This article aims at providing basic introduction to visual language to those data explorers, those who depend in their fields of activity for producing (creating or generating) images. Five elements (the vocabulary) of visual language and four visual structures (the grammar) are defined, explained, and discussed. These aspects of the vocabulary and grammar of visual language are then elucidated through pictorial examples, thus presenting basic interdisciplinary knowledge of the subject from a design practitioner’s perspective. The article concludes with a bibliography with recommendations of valuable readings for further investigation of the topic. It is hoped that the paper provides useful background to be informedly used in readers’ own imagemaking activities.
Article
Full-text available
Seq2Logo is a web-based sequence logo generator. Sequence logos are a graphical representation of the information content stored in a multiple sequence alignment (MSA) and provide a compact and highly intuitive representation of the position-specific amino acid composition of binding motifs, active sites, etc. in biological sequences. Accurate generation of sequence logos is often compromised by sequence redundancy and low number of observations. Moreover, most methods available for sequence logo generation focus on displaying the position-specific enrichment of amino acids, discarding the equally valuable information related to amino acid depletion. Seq2logo aims at resolving these issues allowing the user to include sequence weighting to correct for data redundancy, pseudo counts to correct for low number of observations and different logotype representations each capturing different aspects related to amino acid enrichment and depletion. Besides allowing input in the format of peptides and MSA, Seq2Logo accepts input as Blast sequence profiles, providing easy access for non-expert end-users to characterize and identify functionally conserved/variable amino acids in any given protein of interest. The output from the server is a sequence logo and a PSSM. Seq2Logo is available at http://www.cbs.dtu.dk/biotools/Seq2Logo (14 May 2012, date last accessed).
Article
Full-text available
Conserved patterns across a multiple sequence alignment can be visualized by generating sequence logos. Sequence logos show each column in the alignment as stacks of symbol(s) where the height of a stack is proportional to its informational content, whereas the height of each symbol within the stack is proportional to its frequency in the column. Sequence logos use symbols of either nucleotide or amino acid alphabets. However, certain regulatory signals in messenger RNA (mRNA) act as combinations of codons. Yet no tool is available for visualization of conserved codon patterns. We present the first application which allows visualization of conserved regions in a multiple sequence alignment in the context of codons. CodonLogo is based on WebLogo3 and uses the same heuristics but treats codons as inseparable units of a 64-letter alphabet. CodonLogo can discriminate patterns of codon conservation from patterns of nucleotide conservation that appear indistinguishable in standard sequence logos. The CodonLogo source code and its implementation (in a local version of the Galaxy Browser) are available at http://recode.ucc.ie/CodonLogo and through the Galaxy Tool Shed at http://toolshed.g2.bx.psu.edu/.
Article
Full-text available
The post-genomic era presents many new challenges for the field of bioinformatics. Novel computational approaches are now being developed to handle the large, complex and noisy datasets produced by high throughput technologies. Objective evaluation of these methods is essential (i) to assure high quality, (ii) to identify strong and weak points of the algorithms, (iii) to measure the improvements introduced by new methods and (iv) to enable non-specialists to choose an appropriate tool. Here, we discuss the development of formal benchmarks, designed to represent the current problems encountered in the bioinformatics field. We consider several criteria for building good benchmarks and the advantages to be gained when they are used intelligently. To illustrate these principles, we present a more detailed discussion of benchmarks for multiple alignments of protein sequences. As in many other domains, significant progress has been achieved in the multiple alignment field and the datasets have become progressively more challenging as the existing algorithms have evolved. Finally, we propose directions for future developments that will ensure that the bioinformatics benchmarks correspond to the challenges posed by the high throughput data.
Article
Full-text available
Multiple sequence alignments (MSAs) are one of the most important sources of information in sequence analysis. Many methods have been proposed to detect, extract and visualize their most significant properties. To the same extent that site-specific methods like sequence logos successfully visualize site conservations and sequence-based methods like clustering approaches detect relationships between sequences, both types of methods fail at revealing informational elements of MSAs at the level of sequence–site interactions, i.e. finding clusters of sequences and sites responsible for their clustering, which together account for a high fraction of the overall information of the MSA. To fill this gap, we present here a method that combines the Fisher score-based embedding of sequences from a profile hidden Markov model (pHMM) with correspondence analysis. This method is capable of detecting and visualizing group-specific or conflicting signals in an MSA and allows for a detailed explorative investigation of alignments of any size tractable by pHMMs. Applications of our methods are exemplified on an alignment of the Neisseria surface antigen LP2086, where it is used to detect sites of recombinatory horizontal gene transfer and on the vitamin K epoxide reductase family to distinguish between evolutionary and functional signals.
This review focuses on recent trends in multiple sequence alignment tools. It describes the latest algorithmic improvements including the extension of consistency-based methods to the problem of template-based multiple sequence alignments. Some results are presented suggesting that template-based methods are significantly more accurate than simpler alternative methods. The validation of existing methods is also discussed at length with the detailed description of recent results and some suggestions for future validation strategies. The last part of the review addresses future challenges for multiple sequence alignment methods in the genomic era, most notably the need to cope with very large sequences, the need to integrate large amounts of experimental data, the need to accurately align non-coding and non-transcribed sequences and finally, the need to integrate many alternative methods and approaches. Contact: cedric.notredame@crg.es
Profile Hidden Markov Models (pHMMs) are a widely used tool for protein family research. Up to now, however, there exists no method to visualize all of their central aspects graphically in an intuitively understandable way. We present a visualization method that incorporates both emission and transition probabilities of the pHMM, thus extending sequence logos introduced by Schneider and Stephens. For each emitting state of the pHMM, we display a stack of letters. The stack height is determined by the deviation of the position's letter emission frequencies from the background frequencies. The stack width visualizes both the probability of reaching the state (the hitting probability) and the expected number of letters the state emits during a pass through the model (the state's expected contribution). A web interface offering online creation of HMM Logos and the corresponding source code can be found at the Logos web server of the Max Planck Institute for Molecular Genetics http://logos.molgen.mpg.de. We demonstrate that HMM Logos can be a useful tool for the biologist: we use them to highlight differences between two homologous subfamilies of GTPases, Rab and Ras, and we show that they are able to indicate structural elements of Ras.
Methods for visualizing protein or nucleic acid motifs have traditionally relied upon residue frequencies to graphically scale character heights. We describe the pLogo, a motif visualization in which residue heights are scaled relative to their statistical significance. A pLogo generation tool is publicly available at http://plogo.uconn.edu/ and supports real-time conditional probability calculations and visualizations.
A dynamic graphical method is one in which a data analyst interacts in real time with a data display on a computer graphics terminal. Using a screen input device such as a mouse, the analyst can specify, in a visual way, points or regions on the display and cause aspects of the display to change nearly instantaneously. Brushing is a collection of dynamic methods for viewing multidimensional data. It is very effective when used on a scatterplot matrix, a rectangular array of all pairwise scatterplots of the variables. Four brushing operations—highlight, shadow highlight, delete, and label—are carried out by moving a mouse-controlled rectangle, called the brush, over one of the scatterplots. The effect of an operation appears simultaneously on all scatterplots. Three paint modes—transient, lasting, and undo—and the ability to change the shape of the brush allow the analyst to specify collections of points on which the operations are carried out. Brushing can be used in various ways or on certain types of data; these usages are called brush techniques and include the following: single-point and cluster linking, conditioning on a single variable, conditioning on two variables, subsetting with categorical variables, and stationarity probing of a time series.
A graphical method is presented for displaying the patterns in a set of aligned sequences. The characters representing the sequence are stacked on top of each other for each position in the aligned sequences. The height of each letter is made proportional to its frequency, and the letters are sorted so the most common one is on top. The height of the entire stack is then adjusted to signify the information content of the sequences at that position. From these ‘sequence logos’, one can determine not only the consensus sequence but also the relative frequency of bases and the information content (measured in bits) at every position in a site or sequence. The logo displays both significant residues and subtle sequence patterns.
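The letter-height rule described in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `logo_heights` and the default DNA alphabet are assumptions made here, gaps and other symbols outside the alphabet are simply ignored, and the small-sample correction used in practice is omitted. The information content of a column is taken as the maximum entropy, log2 of the alphabet size, minus the observed Shannon entropy.

```python
import math
from collections import Counter


def logo_heights(column, alphabet="ACGT"):
    """Return (letter, height_in_bits) pairs for one alignment column.

    Hypothetical helper for illustration. Pairs are sorted by ascending
    height, so the last pair is the letter drawn on top of the stack.
    """
    # Keep only symbols from the alphabet (assumption: gaps are ignored).
    symbols = [c for c in column if c in alphabet]
    if not symbols:
        return []

    counts = Counter(symbols)
    total = len(symbols)
    freqs = {a: counts[a] / total for a in alphabet if counts[a] > 0}

    # Shannon entropy of the observed letter frequencies, in bits.
    entropy = -sum(f * math.log2(f) for f in freqs.values())

    # Information content: maximum possible entropy minus observed entropy.
    info = math.log2(len(alphabet)) - entropy

    # Each letter's height is its frequency times the total stack height.
    heights = [(a, f * info) for a, f in freqs.items()]
    return sorted(heights, key=lambda pair: pair[1])


if __name__ == "__main__":
    # One string per alignment position, read down the aligned sequences.
    for col in ["AAAA", "AACC", "ACGT"]:
        print(col, logo_heights(col))
```

A fully conserved DNA column yields a single letter at the maximum height of 2 bits, while a column in which all four bases are equally frequent yields a stack of zero height, which is why unconserved positions nearly vanish in a logo.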