ABSTRACT: Deep transcriptome sequencing has revealed the existence of many transcripts
that lack long or conserved open reading frames and which have been termed long
non-coding RNAs (lncRNAs). Despite the existence of several well-characterized
lncRNAs that play roles in the regulation of gene expression, the vast majority
of them do not yet have a known function. Motivated by the existence of
ribosome profiling data for several species, we have tested the hypothesis that
they may act as a repository for the synthesis of new peptides using data from
human, mouse, zebrafish, fruit fly, Arabidopsis and yeast. The ribosome
protection patterns are consistent with the presence of translated open reading
frames (ORFs) in a very large number of lncRNAs. Most of the ribosome-protected
ORFs are shorter than 100 amino acids and usually cover less than half the
transcript. Ribosome density in these ORFs is high and contrasts sharply with
the 3'UTR region, in which very often there is no detectable ribosome binding,
similar to bona fide protein-coding genes. The coding potential of
ribosome-protected ORFs, measured using hexamer frequencies, is significantly
higher than that of randomly selected intronic ORFs and similar to that of
evolutionarily young coding sequences. Selective constraints in
ribosome-protected ORFs from lncRNAs are lower than in typical protein-coding
genes but again similar to young proteins. These results strongly suggest that
lncRNAs play an important role in de novo protein evolution.
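The hexamer-based coding potential mentioned above can be illustrated with a minimal sketch. It assumes a simple log-ratio score of in-frame hexamer frequencies between a coding training set and a background (for example intronic) set; the function names, the pseudocount and the in-frame step of 3 are assumptions made for illustration and are not taken from the study.

```python
import math
from collections import Counter

def hexamer_counts(seqs):
    """Count in-frame hexamers (two consecutive codons) over a set of sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(0, len(s) - 5, 3):
            counts[s[i:i + 6]] += 1
    return counts

def coding_score(orf, coding, background, pseudo=1.0):
    """Mean log-ratio of hexamer frequency in the coding vs. background sets.

    Positive values mean the ORF looks more like the coding training set.
    The pseudocount avoids division by zero for unseen hexamers.
    """
    n_coding = sum(coding.values())
    n_background = sum(background.values())
    n_hexamers = 4 ** 6  # number of possible DNA hexamers
    orf = orf.upper()
    total, n = 0.0, 0
    for i in range(0, len(orf) - 5, 3):
        h = orf[i:i + 6]
        f_c = (coding[h] + pseudo) / (n_coding + pseudo * n_hexamers)
        f_b = (background[h] + pseudo) / (n_background + pseudo * n_hexamers)
        total += math.log(f_c / f_b)
        n += 1
    return total / n if n else 0.0
```

In practice the two count tables would be built once from annotated coding sequences and from the chosen background, and the score of each ribosome-protected ORF would then be compared against the distribution of scores for control ORFs.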
ABSTRACT: Centromere sequences in the genome are associated with the formation of kinetochores, where spindle microtubules grow in mitosis. Centromere sequences usually have long tandem repeats (satellites). In holocentric nematodes it is not clear how kinetochores are formed during mitosis; they are distributed throughout the chromosomes. For this reason it appeared of interest to study the satellites in nematodes in order to determine if they offer any clue on how kinetochores are assembled in these species. We have studied the satellites in the genome of six nematode species. We found that the presence of satellites depends on whether the nematode chromosomes are holocentric or monocentric. It turns out that holocentric nematodes are unique because they have a large number of satellites scattered throughout their genome. Their number, length and composition are different in each species: they apparently have very little evolutionary conservation. In contrast, no scattered satellites are found in the monocentric nematode Trichinella spiralis. It appears that the absence/presence of scattered satellites in the genome distinguishes monocentric from holocentric nematodes. We conclude that the presence of satellites is related to the holocentric nature of the chromosomes of most nematodes. Satellites may stabilize a higher order structure of chromatin and facilitate the formation of kinetochores. We also present a new program, SATFIND, which is suited to find satellite sequences.
PLoS ONE 01/2013; 8(4):e62221. · 3.53 Impact Factor
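To make the notion of a tandem repeat concrete, the following is a minimal sketch of an exact tandem-repeat scan. It is not the SATFIND algorithm, whose details are not given in the abstract; the function name and the default parameters are hypothetical, and real satellites would require longer units and tolerance to mismatches.

```python
def find_tandem_repeats(seq, min_unit=2, max_unit=6, min_copies=3):
    """Return (start, unit, copies) for exact, back-to-back repeats of a unit.

    Illustrative only: no mismatches are allowed, and a repeat may be
    reported again under a unit that is a multiple of a shorter unit.
    """
    seq = seq.upper()
    hits = []
    for k in range(min_unit, max_unit + 1):
        i = 0
        while i + k <= len(seq):
            unit = seq[i:i + k]
            copies = 1
            # Extend while the next k bases equal the unit.
            while seq[i + copies * k:i + (copies + 1) * k] == unit:
                copies += 1
            if copies >= min_copies:
                hits.append((i, unit, copies))
                i += copies * k  # jump past the whole repeat
            else:
                i += 1
    return hits
```

For example, scanning `"GGACGACGACGTT"` with a unit length of 3 reports a single repeat of `GAC` starting at position 1 with three copies.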
ABSTRACT: We carried out a systems-level study of the mechanisms underlying organ-specific metastases of breast cancer. We followed a network-based approach using microarray expression data from human breast cancer metastases to select organ-specific proteins that exert a range of functions allowing cell survival and growth in the microenvironment of distant organs. MinerProt, a home-made software application, was used to group organ-specific signatures of brain (1191 genes), bone (1623 genes), liver (977 genes) and lung (254 genes) metastases by function and select the most differentially expressed gene in each function. As a result, we obtained 19 functional representative proteins in brain, 23 in bone, 15 in liver and 9 in lung, with which we constructed four organ-specific protein-protein interaction networks. The network taxonomy included seven proteins that interacted in brain metastasis, which were mainly associated with signal transduction. Proteins related to immune response functions were bone specific, while those involved in proteolysis, signal transduction and hepatic glucose metabolism were found in liver metastasis. No experimental protein-protein interaction was found in lung metastasis; thus, computationally determined interactions were included in this network. Moreover, three of these selected genes (CXCL12, DSC2 and TFDP2) were associated with progression to specific organs when tested in an independent dataset. In conclusion, we present a network-based approach to filter information by selecting key protein functions as metastatic markers or therapeutic targets.
ABSTRACT: There are general features of chromosome dynamics, such as homologue recognition in early meiosis, which are expected to involve related sequence motifs in non-coding DNA, with a similar distribution in different species. A search for such motifs is presented here. It has been carried out with the CONREPP programme. It has been found that short alternating AT sequences (10-20 bases) have a similar distribution in most eukaryotic organisms, with some exceptions related to unique meiotic features. All other microsatellite and repeat sequences vary significantly in different organisms. It is concluded that the unique structural features and uniform distribution of alternating AT sequences indicate that they may facilitate homologous chromosome pairing in the early preleptotene stage of meiosis. They may also play a role in the compaction of DNA in mitotic chromosomes.
Journal of Theoretical Biology 05/2011; 283(1):28-34. · 2.35 Impact Factor
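A search for short alternating AT sequences of 10-20 bases, as described above, could be sketched as follows. This is not the CONREPP programme; the function name is hypothetical, and the sketch assumes that only maximal, strictly alternating runs of A and T are of interest.

```python
def alternating_at_runs(seq, min_len=10, max_len=20):
    """Return (start, run) for maximal strictly alternating A/T runs.

    Only runs whose length lies in [min_len, max_len] are reported.
    """
    seq = seq.upper()
    runs = []
    start = None
    # A trailing sentinel closes any run that reaches the end of the sequence.
    for i, base in enumerate(seq + "N"):
        is_at = base in "AT"
        if is_at and start is not None and seq[i - 1] != base:
            continue  # the current run keeps alternating
        if start is not None:
            length = i - start
            if min_len <= length <= max_len:
                runs.append((start, seq[start:i]))
        # A repeated base (AA or TT) ends one run and starts a new one here.
        start = i if is_at else None
    return runs
```

Collecting the start positions over whole chromosomes would give the kind of distribution that can then be compared between species.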
ABSTRACT: The purpose of this work is to determine the most frequent short sequences in non-coding DNA. They may play a role in maintaining the structure and function of eukaryotic chromosomes. We present a simple method for the detection and analysis of such sequences in several genomes, including Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster and Homo sapiens. We also study two chromosomes of man and mouse with a length similar to the whole genomes of the other species. We provide a list of the most common sequences of 9-14 bases in each genome. As expected, they are present in human Alu sequences. Our programs may also give a graph and a list of their position in the genome. Detection of clusters is also possible. In most cases, these sequences contain few alternating regions. Their intrinsic structure and their influence on nucleosome formation are not known. In particular, we have found new features of short sequences in C. elegans, which are distributed in heterogeneous clusters. They appear as punctuation marks in the chromosomes. Such clusters are not found in either A. thaliana or D. melanogaster. We discuss the possibility that they play a role in centromere function and homolog recognition in meiosis.
Nucleic Acids Research 12/2009; 38(4):1172-81. · 8.81 Impact Factor
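Finding the most common sequences of a given length can be illustrated with a short k-mer counting sketch. The authors' programs are not described in detail, so the function name, the use of overlapping windows and the handling of ambiguous bases are assumptions.

```python
from collections import Counter

def most_common_kmers(seq, k, top=5):
    """Count overlapping k-mers and return the `top` most frequent ones.

    Windows containing characters other than A, C, G or T are skipped.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts.most_common(top)
```

Running this for each k from 9 to 14 on a non-coding sequence would give per-length lists comparable in spirit to those described in the abstract; whole genomes would call for a more memory-efficient counter.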
ABSTRACT: Pairwise local sequence alignment methods have been the prevailing technique to identify homologous nucleotides between related species. However, existing methods that identify and align all homologous nucleotides in one or more genomes have suffered from poor scalability and limited accuracy. We propose a novel method that couples a gapped extension heuristic with an efficient filtration method for identifying interspersed repeats in genome sequences. During gapped extension, we use the MUSCLE implementation of progressive global multiple alignment with iterative refinement. The resulting gapped extensions potentially contain alignments of unrelated sequence. We detect and remove such undesirable alignments using a hidden Markov model (HMM) to predict the posterior probability of homology. The HMM emission frequencies for nucleotide substitutions can be derived from any time-reversible nucleotide substitution matrix. We evaluate the performance of our method and previous approaches on a hybrid data set of real genomic DNA with simulated interspersed repeats. Our method outperforms a related method in terms of sensitivity, positive predictive value, and localizing boundaries of homology. The described methods have been implemented in freely available software, Repeatoire, available from: http://wwwabi.snv.jussieu.fr/public/Repeatoire.
IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM 01/2009; 6(2):180-9. · 2.25 Impact Factor
ABSTRACT: We present an analysis of tandem repeats of short sequence motifs (microsatellites) in twelve eukaryotes for which a large part of the genome has been sequenced and assembled. The pattern of motif abundance varies significantly in different species, but it is very similar in different chromosomes of the same species. The most abundant repeats can be classified in two main families. The first family has a rigid conformation, with purines in one strand and pyrimidines in the complementary strand, mainly A(n)/T(n) and (AG)(n)/(CT)(n). The second family has alternating, flexible sequences, such as (AT)(n), (AC)(n) and related sequences. In the pluricellular organisms the relative frequency of both families is rather constant. These observations indicate that microsatellites have structural information and may be involved in the organization of chromatin fibers and in chromosome architecture in general. An additional intriguing finding is the absence of microsatellites with sequences which appear to be forbidden, such as (AATT)(n).
ABSTRACT: The identification of homologous DNA is a fundamental building block of comparative genomic and molecular evolution studies. To date, pairwise local sequence alignment methods have been the prevailing technique to identify homologous nucleotides. However, existing methods that identify and align all homologous nucleotides in one or more genomes have suffered poor scalability and limited accuracy. We propose a novel method that couples a gapped extension heuristic with a previously described efficient filtration method for local multiple alignment. During gapped extension, we use the MUSCLE implementation of progressive multiple alignment with iterative refinement. The resulting gapped extensions potentially contain alignments of unrelated sequence. We detect and remove such undesirable alignments using a hidden Markov model to predict the posterior probability of homology. The HMM emission frequencies for nucleotide substitutions can be derived from any strand/species-symmetric nucleotide substitution matrix, and we have developed a method to adapt an arbitrary substitution matrix (i.e. HOXD) to organisms with different G+C content. We evaluate the performance of our method and previous approaches on a hybrid dataset of real genomic DNA with simulated interspersed repeats. Our method outperforms existing methods in terms of sensitivity, positive predictive value, and localizing boundaries of homology. The described methods have been implemented in the free, open-source procrastAligner software, available from: http://alggen.lsi.upc.es/recerca/align/procrastination.
ABSTRACT: The Distributed Annotation System (DAS) allows clients to access many dispersed genome and protein annotation sources in a
coordinated manner. Here we present DASGenExp, a web-based DAS client for interactive visualisation and exploration of
genome-based annotations, inspired by the Google Maps user interface. The client is easy to use and intuitive and integrates some
unique functions not found in other DAS clients: interactivity, multiple genomes at the same time, arbitrary zoom windows,...
DASGenExp can be freely accessed at
2nd International Workshop on Practical Applications of Computational Biology and Bioinformatics, IWPACBB 2008, Salamanca, Spain, 22th-24th October 2008; 01/2008
ABSTRACT: The current wealth of available genomic data provides an unprecedented opportunity to compare and contrast evolutionary histories
of closely and distantly related organisms. The focus of this dissertation is on developing novel algorithms and software
for efficient global and local comparison of multiple genomes and the application of these methods for a biologically relevant
case study. The thesis research is organized into three successive phases, specifically: (1) multiple genome alignment of
closely related species, (2) local multiple alignment of interspersed repeats, and finally, (3) a comparative genomics case
study of Neisseria. In Phase 1, we first develop an efficient algorithm and data structure for maximal unique match search in multiple genome
sequences. We implement these contributions in an interactive multiple genome comparison and alignment tool, M-GCAT, that
can efficiently construct multiple genome comparison frameworks in closely related species. In Phase 2, we present a novel
computational method for local multiple alignment of interspersed repeats. Our method for local alignment of interspersed
repeats features a novel method for gapped extensions of chained seed matches, joining global multiple alignment with a homology
test based on a hidden Markov model (HMM). In Phase 3, using the results from the previous two phases we perform a case study
of neisserial genomes by tracking the propagation of repeat sequence elements in an attempt to understand why the important pathogens
of the neisserial group have sexual exchange of DNA by natural transformation. In conclusion, our global contributions in
this dissertation have focused on comparing and contrasting evolutionary histories of related organisms via multiple alignment
2nd International Workshop on Practical Applications of Computational Biology and Bioinformatics, IWPACBB 2008, Salamanca, Spain, 22th-24th October 2008; 01/2008
ABSTRACT: During the course of evolution, genomes can undergo large-scale mutation events such as rearrangement and lateral transfer. Such mutations can result in significant variations in gene order and gene content among otherwise closely related organisms. The Mauve genome alignment system can successfully identify such rearrangement and lateral transfer events in comparisons of multiple microbial genomes even under high levels of recombination. This chapter outlines the main features of Mauve and provides examples that describe how to use Mauve to conduct a rigorous multiple genome comparison and study evolutionary patterns.
ABSTRACT: Understanding the constraints that operate in mammalian gene promoter sequences is of key importance to understand the evolution of gene regulatory networks. The level of promoter conservation varies greatly across orthologous genes, denoting differences in the strength of the evolutionary constraints. Here we test the hypothesis that the number of tissues in which a gene is expressed is related in a significant manner to the extent of promoter sequence conservation.
We show that mammalian housekeeping genes, expressed in all or nearly all tissues, show significantly lower promoter sequence conservation, especially upstream of position -500 with respect to the transcription start site, than genes expressed in a subset of tissues. In addition, we evaluate the effect of gene function, CpG island content and protein evolutionary rate on promoter sequence conservation. Finally, we identify a subset of transcription factors that bind to motifs that are specifically over-represented in housekeeping gene promoters.
This is the first report that shows that the promoters of housekeeping genes show reduced sequence conservation with respect to genes expressed in a more tissue-restricted manner. This is likely to be related to simpler gene expression, requiring a smaller number of functional cis-regulatory motifs.
ABSTRACT: The analysis of the promoter sequences of genes with similar expression patterns is a basic tool to annotate common regulatory elements. Multiple sequence alignments are the basis of most comparative approaches. The characterization of regulatory regions from co-expressed genes at the sequence level, however, in many cases does not yield satisfactory results, as promoter regions of genes sharing similar expression programs often do not show nucleotide sequence conservation.
In a recent approach to circumvent this limitation, we proposed to align the maps of predicted transcription factors (referred to as TF-maps) instead of the nucleotide sequences of two related promoters, taking into account the label of the corresponding factor and the position in the primary sequence. We have now extended the basic algorithm to permit multiple promoter comparisons using the progressive alignment paradigm. In addition, non-collinear conservation blocks might now be identified in the resulting alignments. We have optimized the parameters of the algorithm in a small, but well-characterized collection of human-mouse-chicken-zebrafish orthologous gene promoters.
Results in this dataset indicate that TF-map alignments are able to detect high-level regulatory conservation at the promoter and the 3'UTR gene regions, which cannot be detected by the typical sequence alignments. Three particular examples are introduced here to illustrate the power of the multiple TF-map alignments to characterize conserved regulatory elements in absence of sequence similarity. We consider this kind of approach can be extremely useful in the future to annotate potential transcription factor binding sites on sets of co-regulated genes from high-throughput expression experiments.
ABSTRACT: We address the problem of comparing and characterizing the promoter regions of genes with similar expression patterns. This remains a challenging problem in sequence analysis, because often the promoter regions of co-expressed genes do not show discernible sequence conservation. In our approach, thus, we have not directly compared the nucleotide sequence of promoters. Instead, we have obtained predictions of transcription factor binding sites, annotated the predicted sites with the labels of the corresponding binding factors, and aligned the resulting sequences of labels--to which we refer here as transcription factor maps (TF-maps). To obtain the global pairwise alignment of two TF-maps, we have adapted an algorithm initially developed to align restriction enzyme maps. We have optimized the parameters of the algorithm in a small, but well-curated, collection of human-mouse orthologous gene pairs. Results in this dataset, as well as in an independent much larger dataset from the CISRED database, indicate that TF-map alignments are able to uncover conserved regulatory elements, which cannot be detected by the typical sequence alignments.
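The idea of aligning sequences of factor labels rather than nucleotides can be illustrated with a simplified sketch. Note the substitution: the abstract describes an algorithm adapted from restriction enzyme map alignment, which also uses site positions, whereas the sketch below uses plain Needleman-Wunsch global alignment over labels only. The function name and the scoring values are hypothetical.

```python
def align_labels(a, b, match=2, mismatch=-1, gap=-1):
    """Globally align two lists of factor labels (Needleman-Wunsch).

    Returns the alignment score and the list of (i, j) index pairs
    where identical labels were aligned.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover matched labels.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            if a[i - 1] == b[j - 1]:
                pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return score[n][m], pairs
```

With the hypothetical maps `["SP1", "TBP", "CEBP"]` and `["SP1", "GATA1", "TBP", "CEBP"]`, the three shared labels are aligned and the extra `GATA1` site is treated as a gap. A faithful TF-map alignment would additionally penalise differences in the spacing between consecutive aligned sites.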
ABSTRACT: Due to recent advances in whole genome shotgun sequencing and assembly technologies, the financial cost of decoding an organism's DNA has been drastically reduced, resulting in a recent explosion of genomic sequencing projects. This increase in related genomic data will allow for in depth studies of evolution in closely related species through multiple whole genome comparisons.
To facilitate such comparisons, we present an interactive multiple genome comparison and alignment tool, M-GCAT, that can efficiently construct multiple genome comparison frameworks in closely related species. M-GCAT is able to compare and identify highly conserved regions in up to 20 closely related bacterial species in minutes on a standard computer, and as many as 90 (containing 75 cloned genomes from a set of 15 published enterobacterial genomes) in an hour. M-GCAT also incorporates a novel comparative genomics data visualization interface allowing the user to globally and locally examine and inspect the conserved regions and gene annotations.
M-GCAT is an interactive comparative genomics tool well suited for quickly generating multiple genome comparisons frameworks and alignments among closely related species. M-GCAT is freely available for download for academic and non-commercial use at: http://alggen.lsi.upc.es/recerca/align/mgcat/intro-mgcat.html.
ABSTRACT: Information about the genomic coordinates and the sequence of experimentally identified transcription factor binding sites is found scattered under a variety of diverse formats. The availability of standard collections of such high-quality data is important to design, evaluate and improve novel computational approaches to identify binding motifs on promoter sequences from related genes. ABS (http://genome.imim.es/datasets/abs2005/index.html) is a public database of known binding sites identified in promoters of orthologous vertebrate genes that have been manually curated from bibliography. We have annotated 650 experimental binding sites from 68 transcription factors and 100 orthologous target genes in human, mouse, rat or chicken genome sequences. Computational predictions and promoter alignment information are also provided for each entry. A simple and easy-to-use web interface facilitates data retrieval allowing different views of the information. In addition, the release 1.0 of ABS includes a customizable generator of artificial datasets based on the known sites contained in the collection and an evaluation tool to aid during the training and the assessment of motif-finding programs.
Nucleic Acids Research 02/2006; 34(Database issue):D63-7. · 8.81 Impact Factor
ABSTRACT: We describe an efficient local multiple alignment filtration heuristic for identification of conserved regions in one or more DNA sequences. The method incorporates several novel ideas: (1) palindromic spaced seed patterns to match both DNA strands simultaneously, (2) seed extension (chaining) in order of decreasing multiplicity, and (3) procrastination when low multiplicity matches are encountered. The resulting local multiple alignments may have nucleotide substitutions and internal gaps as large as w characters in any occurrence of the motif. The algorithm consumes O(wN) memory and O(wN log wN) time where N is the sequence length. We score the significance of multiple alignments using entropy-based motif scoring methods. We demonstrate the performance of our filtration method on Alu-repeat rich segments of the human genome and a large set of Hepatitis C virus genomes. The GPL implementation of our algorithm in C++ is called procrastAligner and is freely available from http://gel.ahabs.wisc.edu/procrastination
Algorithms in Bioinformatics, 6th International Workshop, WABI 2006, Zurich, Switzerland, September 11-13, 2006, Proceedings; 01/2006
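The first idea listed above, palindromic spaced seeds that match both strands at once, can be sketched in a few lines. This is an illustration of the principle only, not the procrastAligner implementation; the pattern `1101011`, the function names and the canonical-key scheme are assumptions.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    """Reverse complement of a DNA string over A, C, G, T."""
    return s.translate(_COMPLEMENT)[::-1]

def seed_index(seq, pattern="1101011"):
    """Index every window of `seq` under a strand-independent spaced-seed key.

    '1' marks a position that must match, '0' a don't-care position.
    Because the pattern reads the same in both directions, masking the
    reverse complement of a window gives the reverse complement of the
    masked word, so the smaller of the two words serves as one key for
    both strands.
    """
    if pattern != pattern[::-1]:
        raise ValueError("pattern must be palindromic")
    keep = [i for i, c in enumerate(pattern) if c == "1"]
    width = len(pattern)
    index = defaultdict(list)
    for pos in range(len(seq) - width + 1):
        window = seq[pos:pos + width]
        fwd = "".join(window[i] for i in keep)
        rev = revcomp(fwd)
        key = min(fwd, rev)
        index[key].append((pos, "+" if key == fwd else "-"))
    return index
```

Keys that collect several positions are candidate seed matches; a forward occurrence and an inverted occurrence of the same word end up in the same bucket with opposite strand marks, which is what allows a single pass to cover both strands.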
ABSTRACT: A program has been developed to determine the diffraction pattern given by partially ordered fibres formed by macromolecules with helical symmetry. It is particularly useful for visualizing the splitting of layer lines typical of coiled coils. The program produces as output the diffraction diagram calculated for helices that are oriented along their axis but are randomly oriented in other directions. The results can be numerically analyzed and also visualized on-screen. The program has been applied to the diffraction patterns given by DNA and protein coiled coils.
ABSTRACT: Introduction: To increase data reliability and reduce the costs associated with the HTR, the Catalan Institute of Oncology programmed the manual procedures of data collection from databases by means of a computer application (ASEDAT).