Triplex: An R/Bioconductor package for identification and visualization of potential intramolecular triplex patterns in DNA sequences

BIOINFORMATICS Vol. 00 no. 00 2013
Pages 1–2
Jiří Hon1, Tomáš Martínek1, Kamil Rajdl2 and Matej Lexa2
1Department of Computer Systems, Faculty of Information Technology, Brno Technical University,
Božetěchova 2, 61266 Brno, Czech Republic
2Department of Information Technology, Faculty of Informatics, Masaryk University, Botanická 68a,
60200 Brno, Czech Republic
Received on XXXXX; revised on XXXXX; accepted on XXXXX
Associate Editor: XXXXXXX
ABSTRACT
Motivation: Upgrade and integration of triplex software into the
R/Bioconductor framework.
Results: We combined a previously published implementation
of a triplex DNA search algorithm with visualization to create
a versatile R/Bioconductor package 'triplex'. The new package
provides functions that can be used to search Bioconductor genomes
and other DNA sequence data for the occurrence of nucleotide patterns
capable of forming intramolecular triplexes (H-DNA). Functions
producing 2D and 3D diagrams of the identified triplexes allow
instant visualization of the search results. By building on the
Biostrings and GRanges classes, the results are fully integrated
into the existing Bioconductor framework and can be passed
to other genome visualization and annotation packages, such as
GenomeGraphs, rtracklayer or Gviz.
Availability: R package 'triplex' is available from Bioconductor
(bioconductor.org).
Contact: lexa@fi.muni.cz
1 INTRODUCTION
DNA sequence analysis and annotation are important steps in
uncovering the molecular basis of life. While protein-coding
sequences have been intensively studied in the past, recent focus
has shifted towards the less-known biological functions encoded in
intergenic DNA, as well as the study of structural and regulatory
aspects of genetic information packaging in chromosomes. Tools for
the necessary sequence analysis of non-coding sequences are less
common than their gene-centered counterparts. We have recently
formulated and implemented an algorithm to detect potential
triplex-forming sequences in genomes (Lexa et al., 2011). Such
sequences have been implicated as important players in several key
processes, such as transcriptional regulation (Walter et al., 2001) or
DNA recombination (Rooney and Moore, 1995).
Triplex DNA forms when a third strand of nucleotides is allowed
to align with a Watson-Crick duplex using Hoogsteen bonds to
stabilize the nascent structure (Soyfer and Potaman, 1995). H-DNA
is a form of DNA where triplexes form intramolecularly, without the
participation of other DNA molecules (Htun and Dahlberg, 1989).
Recently, several research groups have reported on their efforts
to map triplex-forming sites in known genomes, as well as on
the development of tools to carry out such searches. Hoyne
et al. (2000) used pattern recognition tools to search for
homopurine/homopyrimidine stretches in DNA as likely triplex
formation sites. Cer et al. (2012) created a non-B DNA search
tool (nBMST) that includes mirror repeat detection functionality
to identify potential triplexes. Buske et al. (2012) and Lexa et al.
(2011) created triplex detection procedures allowing for a small
percentage of imperfections in the sequences, leading to higher
sensitivity of searches. Often, the tools exist as standalone software
or web tools, which led us to integrate triplex search,
visualization and genome annotation into a unified Bioconductor
software package in R for increased flexibility.
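To make the idea of homopurine/homopyrimidine tract detection concrete, the following base R sketch locates such runs with a regular expression. It illustrates only the simplest class of search mentioned above. The function name find_tracts, the min_len parameter and the example sequence are hypothetical, and the code does not reproduce the algorithm of any of the cited tools.

```r
# Toy illustration only: locate homopurine (A/G) or homopyrimidine (C/T)
# runs of a minimum length with a regular expression. This is NOT the
# algorithm used by triplex, nBMST or Triplexator.
find_tracts <- function(seq, min_len = 10) {
  n <- as.integer(min_len)
  # One pattern per tract class; {n,} means "n or more" repeats.
  patterns <- c(purine     = sprintf("[AG]{%d,}", n),
                pyrimidine = sprintf("[CT]{%d,}", n))
  hits <- lapply(names(patterns), function(cls) {
    m <- gregexpr(patterns[[cls]], seq)[[1]]
    if (m[1] == -1) return(NULL)            # no match for this class
    data.frame(class = cls,
               start = as.integer(m),
               width = attr(m, "match.length"),
               stringsAsFactors = FALSE)
  })
  hits <- do.call(rbind, hits)
  if (is.null(hits)) {
    # Return an empty table with the same columns when nothing is found.
    return(data.frame(class = character(0),
                      start = integer(0),
                      width = integer(0)))
  }
  hits <- hits[order(hits$start), ]
  rownames(hits) <- NULL
  hits
}

# Hypothetical sequence with one purine run and one pyrimidine run.
example_seq <- "ACGTAGAGAGGAGAAGGACGTCTCTTCCTCTCTTCAG"
print(find_tracts(example_seq, min_len = 10))
```

A search of this kind requires perfect tracts, which is the limitation that the imperfection-tolerant procedures cited above were designed to address.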
Here we describe triplex, demonstrating its use in sequence
analysis of sample data, focusing on functions integrating it with the
rest of the R/Bioconductor suite. Of the abovementioned software,
only triplex provides specialized H-DNA searching. The other
tools treat H-DNA as general mirror repeats and lack fine-grained
or configurable mismatch evaluation (nBMST), focus
on a different class of triplexes (Hoyne et al., 2000) or provide
general results that need to be further filtered to identify H-DNA
(triplexator), requiring several orders of magnitude more processing
time than triplex. The software by Lexa et al. (2011) used to create
the package was improved by i) integration into R/Bioconductor,
ii) elimination of recognized bugs in scoring and alignment and
iii) providing base-pairing information, either as text/variables or
visualizations.
We performed a simple comparison of the nBMST and triplexator
programs with triplex (see Supplementary material). It showed
that the reported (CT)n and (TA)n mirror repeats coincide with
H-DNA found by triplex. Triplexator returned several of the longer
patterns reported by triplex as fragments. This problem may depend on
the precise settings, although we found that computation time and memory
use increased significantly when we attempted to adjust them. This is
likely caused by the design of triplexator, which finds any combination
of triplex-forming sequences, not only local patterns leading to H-DNA.
© Oxford University Press 2013.
Fig. 1. (A) 2D diagram and (B) 3D model of one of the best-scored triplexes
2 THE SOFTWARE
The R triplex package is essentially an R interface to the underlying
C implementation of a dynamic-programming search strategy of
the same name (Lexa et al., 2011). The main functionality of the
original program was to detect the positions of subsequences in
a much larger sequence capable of folding into an intramolecular
triplex (H-DNA) made of as many canonical nucleotide triplets
as possible. We extended this basic functionality to include the
calculation of exact base-pairing in the triple helices. This allowed
us to add visualization showing the exact base-pairing in 1D,
2D or 3D (see Usage Example).
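The mirror symmetry that underlies H-DNA candidates can be illustrated with a deliberately simplified base R sketch. It counts how many positions of a left arm match the right arm read backwards, with a loop in between. The function name mirror_matches, its parameters and the two example sequences are hypothetical. The package itself uses dynamic programming with triplet-specific scoring (Lexa et al., 2011), which this toy does not attempt to reproduce.

```r
# Toy sketch: count mirror-symmetric positions between two arms separated
# by a loop. This is a simplification for illustration, not the scoring
# scheme of the triplex package.
mirror_matches <- function(seq, arm_len, loop_len, start = 1) {
  bases <- strsplit(seq, "")[[1]]
  left_idx  <- start:(start + arm_len - 1)
  right_idx <- (start + arm_len + loop_len):(start + 2 * arm_len + loop_len - 1)
  stopifnot(max(right_idx) <= length(bases))   # arms must fit in the sequence
  left  <- bases[left_idx]
  right <- rev(bases[right_idx])               # read the right arm backwards
  sum(left == right)                           # number of symmetric positions
}

# Arm "AGGAGAAG", loop "TTTT", mirrored arm "GAAGAGGA".
perfect   <- "AGGAGAAGTTTTGAAGAGGA"
# Same layout with one substitution in the right arm.
imperfect <- "AGGAGAAGTTTTGAACAGGA"

print(mirror_matches(perfect,   arm_len = 8, loop_len = 4))
print(mirror_matches(imperfect, arm_len = 8, loop_len = 4))
```

Allowing a small number of non-matching positions, as in the second call, is the kind of tolerance that distinguishes approximate searches from strict mirror-repeat detection.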
The package takes advantage of the existing Bioconductor
infrastructure. For example, the triplex search method takes a
DNAString object as input. As a result, all available genomes
(BSgenome objects) can be easily analysed. As for the output,
identified triplexes are stored in data objects of a class based on
XStringViews. Thus all other libraries or methods working with
IRanges can be applied to triplexes as well. Alternatively, the
results can be transformed into GRanges objects, which enable
further possibilities, such as visualization of genome tracks using
GenomeGraphs or export of results to the GFF3 annotation format.
3 USAGE EXAMPLE
In the following example, we load a genomic sequence from one
of the BSgenome packages, identify potential triplexes with a minimum
length of 8 triplets of nucleotides and a minimum score of 17, and
create two different visualizations of the best-scored triplex. Finally,
we export the identified positions into a genome annotation track
(via a GFF3 file) and store the sequences in a FASTA file.
I) Load necessary libraries and genomes.
> library(triplex)
> library(BSgenome.Celegans.UCSC.ce10)
II) Search for potential triplex positions and display the results.
> t <- triplex.search(Celegans[["chrX"]],
+ min_score=17,min_len=8)
> t
Triplex views on a 17718866-letter DNAString subject
subject: CTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA...TAGGCTTAGGCTTAGGCTTAGGCTTAGGCTTAGG
triplexes: start width score pvalue ins type s
[1] 762 28 17 6.5e-04 0 4 - [TCTAAAAGACACACAATTTAGAAAAAAA]
[2] 1160 26 17 3.7e-04 0 7 + [ACAAAAACTTCATCAACAAGAAAAAA]
...
[20033] 17715172 29 17 3.7e-04 0 6 + [AAAAAAAAGTGAAAAAAACTGAATTTCAT]
[20034] 17718247 27 17 3.7e-04 0 6 + [AAAAAAAAACACTTAAACATAAAACTA]
III) Sort the results by score and display the best-scoring non-trivial
triplex. Graphical output is shown in Fig. 1.
> ts <- t[order(score(t),decreasing=TRUE)]
> triplex.diagram(ts[1])
> triplex.3D(ts[1])
IV) Export the results as GFF3 and FASTA files.
> library(rtracklayer)
> export(as(t, "GRanges"), "test.gff", version="3")
> writeXStringSet(as(t, "DNAStringSet"), file="test.fa",
+ format="fasta")
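For readers unfamiliar with the GFF3 layout, the sketch below builds GFF3-style lines by hand in base R from the first two hits printed in step II. The values used for the source and type columns and for the ID attribute are assumptions chosen for illustration, as is the helper name to_gff3. The file actually written by rtracklayer may differ in these fields.

```r
# Hypothetical table of hits, copied from the first two rows of the
# printed search result (start, width, score, strand on chrX).
hits <- data.frame(seqid  = "chrX",
                   start  = c(762L, 1160L),
                   width  = c(28L, 26L),
                   score  = c(17, 17),
                   strand = c("-", "+"),
                   stringsAsFactors = FALSE)

# Build one tab-separated, nine-column GFF3 line per hit.
# 'source', 'type' and the ID attribute are illustrative placeholders.
to_gff3 <- function(hits, source = "triplex", type = "sequence_feature") {
  end <- hits$start + hits$width - 1L      # GFF3 ends are 1-based, inclusive
  paste(hits$seqid, source, type, hits$start, end, hits$score,
        hits$strand, ".", paste0("ID=triplex", seq_len(nrow(hits))),
        sep = "\t")
}

gff_lines <- c("##gff-version 3", to_gff3(hits))
writeLines(gff_lines)
```

The end coordinate is derived from start and width, which is the same start/width representation shown in the printed search result.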
4 CONCLUSION
We present a new R/Bioconductor package which integrates our
previously defined algorithm for identification of triplex-forming
sequences with two new methods for their visualization (2D diagram
and 3D model). The package utilizes the existing Bioconductor
infrastructure in such a way that available genomes (BSgenome objects)
can easily be used as input. The identified triplexes can be further
analysed as IRanges or GRanges objects (and optionally exported
into a GFF3 or FASTA file). In combination with the R language and
existing libraries for statistical analysis, the package represents
a powerful tool for molecular biologists interested in the analysis of
non-canonical DNA structures such as triplexes.
ACKNOWLEDGEMENTS
Funding: This article was supported by the framework of
IT4Innovations project CZ.1.05/1.1.00/02.0070 funded by the
EU Operational Programme 'Research and Development for
Innovations', MSMT Grants No. 0021630528 'Security-Oriented
Research in Information Technology' and LA09016 'Participation
of CR in ERCIM', and BUT grant FIT-S-11-1 'Advanced secured,
reliable and adaptive IT'.
Conflict of Interest: none declared.
REFERENCES
Buske, F. A., Bauer, D. C., Mattick, J. S., and Bailey, T. L. (2012). Triplexator: detecting
nucleic acid triple helices in genomic and transcriptomic data. Genome Research, 22(7),
1372–1381.
Cer, R., Bruce, K., Donohue, D., Temiz, N., Mudunuri, U., Yi, M., Volfovsky, N.,
Bacolla, A., Luke, B., Collins, J., and Stephens, R. (2012). Searching for non-B
DNA-forming motifs using nBMST (non-B DNA motif search tool), volume Chapter
18, pages Unit 18.7.1–22.
Hoyne, P. R., Edwards, L. M., Viari, A., and Maher, L. J. (2000). Searching genomes for
sequences with the potential to form intrastrand triple helices. Journal of Molecular
Biology, 302(4), 797–809.
Htun, H. and Dahlberg, J. E. (1989). Topology and formation of triple-stranded H-DNA.
Science, 243, 1571–1576.
Lexa, M., Martínek, T., Burgetová, I., Kopeček, D., and Brázdová, M. (2011). A
dynamic programming algorithm for identification of triplex-forming sequences.
Bioinformatics, 27(18), 2510–7.
Rooney, S. M. and Moore, P. D. (1995). Antiparallel, intramolecular triplex DNA
stimulates homologous recombination in human cells. Proceedings of the National
Academy of Sciences, 92(6), 2141–2144.
Soyfer, V. N. and Potaman, V. N. (1995). Triple-helical nucleic acids. Springer-Verlag.
Walter, A., Schütz, H., Simon, H., and Birch-Hirschfeld, E. (2001). Evidence for a DNA
triplex in a recombination-like motif: I. Recognition of Watson-Crick base pairs by
natural bases in a high-stability triplex. J Mol Recognit, 14, 122–139.
2
... Here, we use the triplex package to predict intramolecular triplex motifs because it has several advantages compared to other software (Hon et al. 2013). For example, using the nBMST tool, as in a previous study of mtDNA instability (Oliveira et al. 2013), we only identified two potential triplex motifs within the major arc that did not overlap with the six motifs identified by the triplex package (Table S1). ...
... Why did an earlier study (Oliveira et al. 2013) fail to uncover an association between triplex motifs and mtDNA deletions? One reason may be that the nBMST tool used in that study, is not very sensitive to detect bona fide triplexes (Hon et al. 2013), whereas the triplex package used here, not only detected (which was not certified by peer review) is the author/funder. All rights reserved. ...
Preprint
The “theory of resistant biomolecules” posits that long-lived species show resistance to molecular damage at the level of their biomolecules. Here, we test this hypothesis in the context of mitochondrial DNA (mtDNA) as it implies that predicted mutagenic DNA motifs should be inversely correlated with species maximum lifespan (MLS). First, we confirmed that guanine-quadruplex (GQ) and direct repeat (DR) motifs are mutagenic, as they associate with mtDNA deletions in the human major arc of mtDNA, while also adding mirror repeat (MR) and intramolecular triplex motifs to a growing list of potentially mutagenic features. What is more, triplex motifs showed disease-specific associations with deletions and an apparent interaction with GQ motifs. Surprisingly, even though DR, MR and GQ motifs were associated with mtDNA deletions, their correlation with MLS was explained by the biased base composition of mtDNA. Only triplex motifs negatively correlated with MLS and these results remained stable even after adjusting for body mass, phylogeny, mtDNA base composition and effective number of codons. Taken together, our work highlights the importance of base composition for the comparative biogerontology of mtDNA and suggests that future research on mitochondrial triplex motifs is warranted.
... For each extracted window, the probabilities of cruciform, denaturation and Z-DNA formation at every base position were computed using perl software package SIST (Stress-Induced Structural Transitions) [39], set to the algorithm that considers competition among the three structures. Potential triplex and Gquadruplex formation was analyzed separately using R packages triplex [40] and pqsfinder [41], respectively. Both algorithms output a score at each position, which was normalized to a maximum of 300 [41] and 50, respectively. ...
Article
Full-text available
Kaposi sarcoma (KS), a common HIV-associated malignancy, presents a range of clinicopathological features. Kaposi sarcoma-associated herpesvirus (KSHV) is its etiologic agent, but the contribution of viral genomic variation to KS development is poorly understood. To identify potentially influential viral polymorphisms, we characterized KSHV genetic variation in 67 tumors from 1-4 distinct sites from 29 adults with advanced KS in Kampala, Uganda. Whole KSHV genomes were sequenced from 20 tumors with the highest viral load, whereas only polymorphic genes were screened by PCR and sequenced from 47 other tumors. Nine individuals harbored ≥1 tumors with a median 6-fold over-coverage of a region centering on K5 and K6 genes. K8.1 gene was inactivated in 8 individuals, while 5 had mutations in the miR-K10 microRNA coding sequence. Recurring inter-host polymorphisms were detected in K4.2 and K11.2. The K5-K6 region rearrangement breakpoints and K8.1 mutations were all unique, indicating that they arise frequently de novo. Rearrangement breakpoints were associated with potential G-quadruplex and Z-DNA forming sequences. Exploratory evaluations of viral mutations with clinical and tumor traits were conducted by logistic regression without multiple test corrections. K5-K6 over-coverage and K8.1 inactivation were tentatively correlated (p
... In many cases, the sequence dependence of DNA secondary structures allows computational prediction of where in the genome each has the potential to form (e.g. Hon et al. 2013;Bedrat et al. 2016). While these predictions have been applied to many non-B DNA conformations, in some cases the prediction is complicated by the fact that physiologically relevant structures can require the presence of an additional strand of nucleic acid, e.g. ...
Article
Full-text available
During replication, folding of the DNA template into non-B-form secondary structures provides one of the most abundant impediments to the smooth progression of the replisome. The core replisome collaborates with multiple accessory factors to ensure timely and accurate duplication of the genome and epigenome. Here, we discuss the forces that drive non-B structure formation and the evidence that secondary structures are a significant and frequent source of replication stress that must be actively countered. Taking advantage of recent advances in the molecular and structural biology of the yeast and human replisomes, we examine how structures form and how they may be sensed and resolved during replication.
... LongTarget is a web-based tool that is far slower than Triplexator, making genome-wide analyses prohibitively slow. In addition to canonical rules, Triplex [151,152] incorporated six less-common binding rules and then applied a BLAST-like dynamic programming algorithm [162] to select the best triplexes with the highest alignment scores. TTSMI is a database containing inferred triplex sites in the human genome and allows users to study co-occurrences of triplexes with other types of genomic regulatory elements, for example, TF binding sites, CpG islands, and single-nucleotide polymorphisms [153,154]. ...
Chapter
Full-text available
With the advances in sequencing technology and transcriptome analysis, it is estimated that up to 75% of the human genome is transcribed into RNAs. This finding prompted intensive investigations on the biological functions of non-coding RNAs and led to very exciting discoveries of microRNAs as important players in disease pathogenesis and therapeutic applications. Research on long non-coding RNAs (lncRNAs) is in its infancy, yet a broad spectrum of biological regulations has been attributed to lncRNAs. As a novel class of RNA transcripts, the expression level and splicing variants of lncRNAs are various. Northern blot analysis can help us learn about the identity, size, and abundance of lncRNAs. Here we describe how to use northern blot to determine lncRNA abundance and identify different splicing variants of a given lncRNA.
... Until now, computational prediction of lncRNA-DNA interactions has received relatively little attention from the scientific community working in lncRNAome [95]. We did find several tools that assessed the triple helix formation of RNA-DNA interactions, namely Triplex [96], Triplex Domain Finder [97], Triplexator [98], Triplex-Inspector [99], and LongTarget [100]. ...
Article
Full-text available
Long non-coding RNAs (lncRNA), the pervasively transcribed part of the mammalian genome, have played a significant role in changing our protein-centric view of genomes. The abundance of lncRNAs and their diverse roles across cell types have opened numerous avenues for the research community regarding lncRNAome. To discover and understand lncRNAome, many sophisticated computational techniques have been leveraged. Recently, deep learning (DL)-based modeling techniques have been successfully used in genomics due to their capacity to handle large amounts of data and produce relatively better results than traditional machine learning (ML) models. DL-based modeling techniques have now become a choice for many modeling tasks in the field of lncRNAome as well. In this review article, we summarized the contribution of DL-based methods in nine different lncRNAome research areas. We also outlined DL-based techniques leveraged in lncRNAome, highlighting the challenges computational scientists face while developing DL-based models for lncRNAome. To the best of our knowledge, this is the first review article that summarizes the role of DL-based techniques in multiple areas of lncRNAome.
... The overrepresented gene ontology (GO) terms were determined using GAGE (version 3.4.1, BioConductor) (51). The mouse annotation "org.Mm.eg.db" (version 3.4.1, ...
Article
Full-text available
Bone provides supportive microenvironments for hematopoietic stem cells (HSCs) and mesenchymal stem cells (MSCs) and is a frequent site of metastasis. While incidences of bone metastases increase with age, the properties of the bone marrow microenvironment that regulate dormancy and reactivation of disseminated tumor cells (DTCs) remain poorly understood. Here, we elucidate the age-associated changes in the bone secretome that trigger proliferation of HSCs, MSCs, and DTCs in the aging bone marrow microenvironment. Remarkably, a bone-specific mechanism involving expansion of pericytes and induction of quiescence-promoting secretome rendered this proliferative microenvironment resistant to radiation and chemotherapy. This bone- specific expansion of pericytes was triggered by an increase in PDGF signaling via remodeling of specialized type H blood vessels in response to therapy. The decline in bone marrow pericytes upon aging provides an explanation for loss of quiescence and expansion of cancer cells in the aged bone marrow microenvironment. Manipulation of blood flow — specifically, reduced blood flow — inhibited pericyte expansion, regulated endothelial PDGF-B expression, and rendered bone metastatic cancer cells susceptible to radiation and chemotherapy. Thus, our study provides a framework to recognize bone marrow vascular niches in age-associated increases in metastasis and to target angiocrine signals in therapeutic strategies to manage bone metastasis.
Chapter
The massive amount of experimental DNA and RNA sequence information provides an encyclopedia for cell biology that requires computational tools for efficient interpretation. The ability to write and apply simple computing scripts propels the investigator beyond the boundaries of online analysis tools to more broadly interrogate laboratory experimental data and to integrate them with all available datasets to test and challenge hypotheses. Here we describe robust prototypic bash and C++ scripts with metrics and methods for validation that we have made publicly available to address the roles of non-B DNA-forming motifs in eliciting genetic instability and to query The Cancer Genome Atlas. Importantly, the methods presented provide practical data interpretation tools to examine fundamental relationships and to enable insights and correlations between alterations in gene expression patterns and patient outcome. The exemplary source codes described are simple and can be efficiently modified, elaborated, and applied to other relationships and areas of investigation.
Chapter
Most of the transcribed human genome codes for noncoding RNAs (ncRNAs), and long noncoding RNAs (lncRNAs) make for the lion’s share of the human ncRNA space. Despite growing interest in lncRNAs, because there are so many of them, and because of their tissue specialization and, often, lower abundance, their catalog remains incomplete and there are multiple ongoing efforts to improve it. Consequently, the number of human lncRNA genes may be lower than 10,000 or higher than 200,000. A key open challenge for lncRNA research, now that so many lncRNA species have been identified, is the characterization of lncRNA function and the interpretation of the roles of genetic and epigenetic alterations at their loci. After all, the most important human genes to catalog and study are those that contribute to important cellular functions—that affect development or cell differentiation and whose dysregulation may play a role in the genesis and progression of human diseases. Multiple efforts have used screens based on RNA-mediated interference (RNAi), antisense oligonucleotide (ASO), and CRISPR screens to identify the consequences of lncRNA dysregulation and predict lncRNA function in select contexts, but these approaches have unresolved scalability and accuracy challenges. Instead—as was the case for better-studied ncRNAs in the past—researchers often focus on characterizing lncRNA interactions and investigating their effects on genes and pathways with known functions. Here, we focus most of our review on computational methods to identify lncRNA interactions and to predict the effects of their alterations and dysregulation on human disease pathways.
Article
The “theory of resistant biomolecules” posits that long-lived species show resistance to molecular damage at the level of their biomolecules. Here, we test this hypothesis in the context of mitochondrial DNA (mtDNA) as it implies that predicted mutagenic DNA motifs should be inversely correlated with species maximum lifespan (MLS). First, we confirmed that guanine-quadruplex and direct repeat (DR) motifs are mutagenic, as they associate with mtDNA deletions in the human major arc of mtDNA, while also adding mirror repeat (MR) and intramolecular triplex motifs to a growing list of potentially mutagenic features. What is more, triplex motifs showed disease-specific associations with deletions and an apparent interaction with guanine-quadruplex motifs. Surprisingly, even though DR, MR and guanine-quadruplex motifs were associated with mtDNA deletions, their correlation with MLS was explained by the biased base composition of mtDNA. Only triplex motifs negatively correlated with MLS even after adjusting for body mass, phylogeny, mtDNA base composition and effective number of codons. Taken together, our work highlights the importance of base composition for the comparative biogerontology of mtDNA and suggests that future research on mitochondrial triplex motifs is warranted.
Article
Nucleic acids (DNA and RNA) dynamically fold and unfold to exert their functions in cells. These folding and unfolding behaviours are also the basis for various technical applications. To understand the biological mechanism of nucleic acid function, and design active materials using nucleic acids, biophysical approaches based on thermodynamics are very useful. Methods for predicting the stability of canonical duplexes of nucleic acids have been extensively investigated for more than half a century and are now widely used. However, such predictions are not always accurate under various solution conditions, particularly cellular conditions, as the concentrations of cations and cosolutes under intracellular conditions, named as molecular crowding, differ from those under standard experimental conditions. Moreover, the crowding condition in cells is spatiotemporally variable. Furthermore, non-canonical structures such as triplex and tetraplex exist in cells and play important roles in gene expression. Therefore, a prediction method reflecting the cellular conditions must be established to determine the stability of various nuclei acid structures. This article reviews the biophysicochemical background of predicting nucleic acid stability and recent advances in the prediction of this stability under cellular conditions.
Article
Full-text available
Double-stranded DNA is able to form triple-helical structures by accommodating a third nucleotide strand in its major groove. This sequence-specific process offers a potent mechanism for targeting genomic loci of interest that is of great value for biotechnological and gene-therapeutic applications. It is likely that nature has leveraged this addressing system for gene regulation, because computational studies have uncovered an abundance of putative triplex target sites in various genomes, with enrichment particularly in gene promoters. However, to draw a more complete picture of the in vivo role of triplexes, not only the putative targets but also the sequences acting as the third strand and their capability to pair with the predicted target sites need to be studied. Here we present Triplexator, the first computational framework that integrates all aspects of triplex formation, and showcase its potential by discussing research examples for which the different aspects of triplex formation are important. We find that chromatin-associated RNAs have a significantly higher fraction of sequence features able to form triplexes than expected at random, suggesting their involvement in gene regulation. We furthermore identify hundreds of human genes that contain sequence features in their promoter predicted to be able to form a triplex with a target within the same promoter, suggesting the involvement of triplexes in feedback-based gene regulation. With focus on biotechnological applications, we screen mammalian genomes for high-affinity triplex target sites that can be used to target genomic loci specifically and find that triplex formation offers a resolution of ~1300 nt.
Article
Full-text available
This unit describes basic protocols on using the non-B DNA Motif Search Tool (nBMST) to search for sequence motifs predicted to form alternative DNA conformations that differ from the canonical right-handed Watson-Crick double-helix, collectively known as non-B DNA, and on using the associated PolyBrowse, a GBrowse-based genomic browser. The nBMST is a Web-based resource that allows users to submit one or more DNA sequences to search for inverted repeats (cruciform DNA), mirror repeats (triplex DNA), direct/tandem repeats (slipped/hairpin structures), G4 motifs (tetraplex, G-quadruplex DNA), alternating purine-pyrimidine tracts (left-handed Z-DNA), and A-phased repeats (static bending). The nBMST is versatile, simple to use, does not require bioinformatics skills, and can be applied to any type of DNA sequences, including viral and bacterial genomes, up to an aggregate of 20 megabasepairs (Mbp).
Article
Full-text available
Current methods for identification of potential triplex-forming sequences in genomes and similar sequence sets rely primarily on detecting homopurine and homopyrimidine tracts. Procedures capable of detecting sequences supporting imperfect, but structurally feasible intramolecular triplex structures are needed for better sequence analysis. We modified an algorithm for detection of approximate palindromes, so as to account for the special nature of triplex DNA structures. From available literature, we conclude that approximate triplexes tolerate two classes of errors. One, analogical to mismatches in duplex DNA, involves nucleotides in triplets that do not readily form Hoogsteen bonds. The other class involves geometrically incompatible neighboring triplets hindering proper alignment of strands for optimal hydrogen bonding and stacking. We tested the statistical properties of the algorithm, as well as its correctness when confronted with known triplex sequences. The proposed algorithm satisfactorily detects sequences with intramolecular triplex-forming potential. Its complexity is directly comparable to palindrome searching. Our implementation of the algorithm is available at http://www.fi.muni.cz/lexa/triplex as source code and a web-based search tool. The source code compiles into a library providing searching capability to other programs, as well as into a stand-alone command-line application based on this library. lexa@fi.muni.cz Supplementary data are available at Bioinformatics online.
Article
Thesis (Ph. D. in mammalian genetics)--University of Illinois at Chicago, 1996. Vita. Includes bibliographical references (leaves 95-121).
Article
Repeating copolymers of (dT-dC)n.(dA-dG)n sequences (TC.AGn) can assume a hinged DNA structure (H-DNA) which is composed of triple-stranded and single-stranded regions. A model for the formation of H-DNA is proposed, based on two-dimensional gel electrophoretic analysis of DNA's with different lengths of (TC.AG)n copolymers. In this model, H-DNA formation is initiated at a small denaturation bubble in the interior of the copolymer, which allows the duplexes on either side to rotate slightly and to fold back, in order to make the first base triplet. This nucleation establishes which of several nonequivalent H-DNA conformations is to be assumed by any DNA molecule, thereby trapping each molecule in one of several metastable conformers that are not freely interconvertible. Subsequently, the acceptor region spools up single-stranded polypyrimidines as they are released by progressive denaturation of the donor region; both the spooling and the denaturation result in relaxation of negative supercoils in the rest of the DNA molecule. From the model, it can be predicted that the levels of supercoiling of the DNA determine which half of the (dT-dC)n repeat is to become the donated third strand.
Article
The DNA motif 5'-AAGGGAGAAXGGGTATAGGGYAAGAGGGAA-3' (named XY32) is an H-palindrome and has been shown to undergo a superhelix-induced, pH-dependent structural transition to H-form (pyrimidine·purine·pyrimidine triplex) DNA when X = Y = A (AA32) or X = Y = G (GG32), but when X = A and Y = G (AG32) or X = G and Y = A (GA32), the transition is much more difficult [Mirkin, S. (1987) Nature (London) 330, 495-497]. Furthermore, AA32, GG32, and GA32 triplexes have the proper sequence structure to potentially form pyrimidine·purine·purine (*H-form) triplexes, but AG32 does not [Beal, P. A. & Dervan, P. B. (1992) Nucleic Acids Res. 20, 2773-2776]. Using an in vivo plasmid-plasmid recombination assay system in cultured human cells, we have found that AA32, GA32, and GG32 stimulate homologous recombination between plasmids 3- to 5-fold when both recombination substrates contain these triplex-forming sequences, whereas AG32, which differs from the others by only 1 or 2 bp, does not significantly affect the frequency of recombination. Double-strand breaks, which destroy supercoiling, nullify the stimulation. Therefore, stimulation of homologous recombination between plasmids containing these sequences correlates with their triplex-forming potential. Crosses in which the triplex-forming sequence is inserted into only one substrate exhibit an intermediate stimulation, suggesting that the inserts are acting alone as intramolecular triplexes.
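The H-palindrome (mirror-repeat) property described in this abstract can be checked directly on the motif as printed. The sketch below substitutes X and Y and counts positions where the left arm fails to mirror the right arm. The helper names `fill` and `mirror_mismatches` are hypothetical, and the arm length of 13 is an assumption read off the printed sequence (two arms around the central TATA), not a value stated in the abstract.

```python
MOTIF = "AAGGGAGAAXGGGTATAGGGYAAGAGGGAA"  # as printed in the abstract
ARM = 13  # assumed arm length: bases on each side of the central TATA

def fill(x, y):
    """Substitute concrete bases for the X and Y placeholders."""
    return MOTIF.replace("X", x).replace("Y", y)

def mirror_mismatches(seq, arm=ARM):
    """Count positions i in the left arm whose base differs from the
    base at the mirrored position counted from the 3' end."""
    return sum(1 for i in range(arm) if seq[i] != seq[-1 - i])

for x, y in [("A", "A"), ("G", "G"), ("A", "G"), ("G", "A")]:
    print(x + y, mirror_mismatches(fill(x, y)))
```

Under these assumptions the X = Y variants are perfect mirror repeats, while AG and GA each carry a single mirror mismatch at the X/Y position, consistent with the abstract's statement that the H-form transition is much more difficult for those two.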
Article
The canonical double-helix form of DNA is thought to predominate both in dilute solution and in living cells. Sequence-dependent fluctuations in local DNA shape occur within the double helix. Besides these relatively modest variations in shape, more extreme and remarkable structures have been detected in which some bases become unpaired. Examples include unusual three-stranded structures such as H-DNA. Certain RNA and DNA strands can also fold onto themselves to form intrastrand triplexes. Although they have been extensively studied in vitro, it remains unknown whether nucleic acid triplexes play natural roles in cells. If natural nucleic acid triplexes were identified in cells, much could be learned by examining the formation, stabilization, and function of such structures. With these goals in mind, we adapted a pattern-recognition program to search genetic databases for a type of potential triplex structure whose presence in genomes has not been previously investigated. We term these sequences Potential Intrastrand Triplex (PIT) elements. The formation of an intrastrand triplex requires three consecutive sequence domains with appropriate symmetry along a single nucleic acid strand. It is remarkable that we discovered multiple copies of sequence elements with the potential to form one particular class of intrastrand triplexes in the fully sequenced genomes of several bacteria. We then focused on the characterization of the 25 copies of a particular approximately 37 nt PIT sequence detected in Escherichia coli. Through biochemical studies, we demonstrate that an isolated DNA strand from this family of E. coli PIT elements forms a stable intrastrand triplex at physiological temperature and pH in the presence of physiological concentrations of Mg(2+).
Article
Data are presented on a triplex type with two parallel homologous strands for which triplex formation is almost as strong as duplex formation, at least for some sequences, even at pH 7 and 0.2 M NaCl. The evidence mainly rests upon comparing thermodynamic properties of similar systems. A paperclip oligonucleotide d(A12C4T12C4A12) with two C4 linkers can evidently form a triplex with parallel back-folded adenine strand regions, because the single melting transition of this complex splits into two transitions when mismatches are introduced only in the third strand region. Similarly, a hairpin duplex d(A12C4T12) and a single strand d(A12) form a triplex as a 1:1 complex in which the second adenine strand is oriented parallel to the homologous one in the Watson-Crick paired duplex. In this system the melting temperature T(m) of the triplex is practically the same as that of the duplex d(A12)-d(T12), at least within a complex concentration range of 0.2-4.0 µM. The melting behaviour of complexes between the triplex-stabilizing ligand BePI and the hairpin duplex plus single strand system supports the triplex model. Non-denaturing gel electrophoresis suggests the existence of a triplex for a system in which five of the twelve A-T*A base triads are substituted by C-G*C base triads. The recognition between any substituted Watson-Crick base pair (X-Y) in the hairpin duplex d(A4XA7C4T7YT4) and the correspondingly replaced base (Z) in the third strand d(A4ZA7) is mutually selective. All triplexes with matching base substitutions (Z = X) have nearly the same stability (T(m) values from 29 to 33.5 degrees C), whereas triplexes with non-matching substitutions (Z not equal X) show a clearly reduced stability (T(m) values from 15 to 22 degrees C) at 2 µM equimolar oligonucleotide concentration. Most nucleic acid triple helices hitherto known are limited to homopurine-homopyrimidine sequences in the target duplex. A stable triplex formation is demonstrated for inhomogeneous sequences tolerating at least 50% pyrimidine content in the homologous strands. On the basis of the surprisingly similar thermodynamic parameters for duplex and triplex, the fact that this triplex type seems to be more stable than many other natural DNA triplexes known, and semiempirical and molecular mechanical calculations, we postulate bridging interactions of the third strand with the two other strands in the triplex according to the recombination motif. This triplex, which we denote the 'recombination-like form', tolerates heterogeneous base sequences.
Evidence for a DNA triplex in a recombination-like motif: I. recognition of Watson-Crick base pairs by natural bases in a high-stability triplex
  • A Walter