ABSTRACT: The Amoebozoa represent a clade of unicellular amoeboid organisms that display a wide variety of lifestyles, including free-living and parasitic species. For example, the social amoeba Dictyostelium discoideum has the ability to aggregate into a multicellular fruiting body upon starvation, while the pathogenic amoeba Entamoeba histolytica is a parasite of humans. Globins are small heme proteins that are present in almost all extant organisms. Although several genomes of amoebozoan species have been sequenced, little is known about the phyletic distribution of globin genes within this phylum. Only two flavohemoglobins (FHbs) of D. discoideum have been reported and characterized previously while the genomes of Entamoeba species are apparently devoid of globin genes. We investigated eleven amoebozoan species for the presence of globin genes by genomic and phylogenetic in silico analyses. Additional FHb genes were identified in the genomes of four social amoebas and the true slime mold Physarum polycephalum. Moreover, a single-domain globin (SDFgb) of Hartmannella vermiformis, as well as two truncated hemoglobins (trHbs) of Acanthamoeba castellanii were identified. Phylogenetic evidence suggests that these globin genes were independently acquired via horizontal gene transfer from some ancestral bacteria. Furthermore, the phylogenetic tree of amoebozoan FHbs indicates that they do not share a common ancestry and that a transfer of FHbs from bacteria to amoeba occurred multiple times.
International journal of biological sciences. 01/2014; 10(7):689-701.
ABSTRACT: Most genomes are populated by thousands of sequences that originated from mobile elements. On the one hand, these sequences present a real challenge in the process of genome analysis and annotation. On the other hand, they are very interesting biological subjects involved in many cellular processes. Here, we present an overview of transposable element (TE) biodiversity and the impact of TEs on genomic evolution. Finally, we discuss different approaches to TE detection and analysis.
ABSTRACT: Neuroglobin (Ngb) is a hexacoordinated globin expressed mainly in the central and peripheral nervous system of vertebrates. Although several hypotheses have been put forward regarding the role of neuroglobin, its definite function remains uncertain. Ngb appears to have a neuro-protective role enhancing cell viability under hypoxia and other types of oxidative stress. Ngb is phylogenetically ancient and has a substitution rate nearly four times lower than that of other vertebrate globins, e.g. hemoglobin. Despite its high sequence conservation among vertebrates, Ngb seems to be elusive in invertebrates.
We determined candidate orthologs in invertebrates and identified a globin of the placozoan Trichoplax adhaerens that is most likely orthologous to vertebrate Ngb and confirmed the orthologous relationship of the polymeric globin of the sea urchin Strongylocentrotus purpuratus to Ngb. The putative orthologous globin genes are located next to genes orthologous to vertebrate POMT2 similarly to localization of vertebrate Ngb. The shared syntenic position of the globins from Trichoplax, the sea urchin and of vertebrate Ngb strongly suggests that they are orthologous. A search for conserved transcription factor binding sites (TFBSs) in the promoter regions of the Ngb genes of different vertebrates via phylogenetic footprinting revealed several TFBSs, which may contribute to the specific expression of Ngb, whereas a comparative analysis with myoglobin revealed several common TFBSs, suggestive of regulatory mechanisms common to globin genes.
Identification of the placozoan and echinoderm genes orthologous to vertebrate neuroglobin strongly supports the hypothesis of the early evolutionary origin of this globin, as it shows that neuroglobin was already present in the placozoan-bilaterian last common ancestor. Computational determination of the transcription factor binding sites repertoire provides on the one hand a set of transcriptional factors that are responsible for the specific expression of the Ngb genes and on the other hand a set of factors potentially controlling expression of a couple of different globin genes.
PLoS ONE 01/2012; 7(10):e47972. · 3.73 Impact Factor
ABSTRACT: Most eukaryotic genes are interrupted by introns that need to be removed from pre-mRNAs before they can perform their function. This is done by a complex machinery called the spliceosome. Many eukaryotes possess two separate spliceosomal systems that process separate sets of introns. The major (U2) spliceosome removes the majority of introns, while a minute fraction of the intron repertoire is processed by the minor (U12) spliceosome. These two populations of introns are called U2-type and U12-type, respectively. The latter fall into two subtypes based on the terminal dinucleotides. The minor spliceosomal system has been lost independently in some lineages, while in others a few U12-type introns persist. We investigated twenty insect genomes in order to better understand the evolutionary dynamics of U12-type introns. Our work confirms a dramatic drop in the number of U12-type introns in Diptera, leaving these genomes with just a handful of cases. This is mostly the result of intron deletion, but in a number of dipteran cases minor-type introns were switched to the major type as well. Insect genes that harbor U12-type introns belong to several functional categories, among which proteins binding ions and nucleic acids are enriched; these few categories are also overrepresented among the genes that preserved minor-type introns in Diptera.
International journal of biological sciences 01/2012; 8(3):344-52. · 3.17 Impact Factor
ABSTRACT: Many multicellular eukaryotes have two types of spliceosomes for the removal of introns from messenger RNA precursors. The major (U2) spliceosome processes the vast majority of introns, referred to as U2-type introns, while the minor (U12) spliceosome removes a small fraction (less than 0.5%) of introns, referred to as U12-type introns. U12-type introns have distinct sequence elements and usually occur together in genes with U2-type introns. A phylogenetic distribution of U12-type introns shows that the minor splicing pathway appeared very early in eukaryotic evolution and has been lost repeatedly.
We have investigated the evolution of U12-type introns among eighteen metazoan genomes by analyzing orthologous U12-type intron clusters. Examination of gain, loss, and type switching shows that intron type is remarkably conserved among vertebrates. Among 180 intron clusters, only eight show intron loss in any vertebrate species and only five show conversion between the U12 and the U2-type. Although there are only nineteen U12-type introns in Drosophila melanogaster, we found one case of U2 to U12-type conversion, apparently mediated by the activation of cryptic U12 splice sites early in the dipteran lineage. Overall, loss of U12-type introns is more common than conversion to U2-type and the U12 to U2 conversion occurs more frequently among introns of the GT-AG subtype than among introns of the AT-AC subtype. We also found support for natural U12-type introns with non-canonical terminal dinucleotides (CT-AC, GG-AG, and GA-AG) that have not been previously reported.
Although complete loss of the U12-type spliceosome has occurred repeatedly, U12 introns are extremely stable in some taxa, including eutheria. Loss of U12 introns or the genes containing them is more common than conversion to the U2-type. The degeneracy of U12-type terminal dinucleotides among natural U12-type introns is higher than previously thought.
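The subtypes named in the abstract above (GT-AG, AT-AC, and the non-canonical CT-AC, GG-AG, and GA-AG) are defined by the first and last two nucleotides of the intron. The following is a minimal sketch of how such boundary labels could be extracted and tallied. It is not the authors' pipeline: the function names and the example sequences are hypothetical, and terminal dinucleotides alone do not distinguish U12-type from U2-type introns, which also depends on the other sequence elements the abstract mentions.

```python
from collections import Counter
from typing import Iterable


def terminal_dinucleotides(intron: str) -> str:
    """Return the boundary label of an intron, e.g. 'GT-AG' or 'AT-AC'.

    The input is the intron sequence only (no flanking exon bases).
    RNA input is accepted: U is converted to T before labelling.
    """
    seq = intron.upper().replace("U", "T")
    if len(seq) < 4:
        raise ValueError("intron must be at least 4 nucleotides long")
    return f"{seq[:2]}-{seq[-2:]}"


def tally_boundaries(introns: Iterable[str]) -> Counter:
    """Count how often each terminal dinucleotide pair occurs."""
    return Counter(terminal_dinucleotides(s) for s in introns)


# Made-up sequences for illustration only.
example_introns = ["gtaagtccttaacag", "AUAUCCUUUUAC", "GTATCCTTTCCAG"]
boundary_counts = tally_boundaries(example_introns)
```

A tally like this only summarizes boundary labels across a set of introns that have already been classified by other means.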
ABSTRACT: Interspersed repetitive sequences are major components of eukaryotic genomes. They comprise about 50% of the mammalian genome. They interact with the whole genome and influence its evolution. They do this in many ways, e.g. by serving as recombination hotspots, providing a mechanism for genomic shuffling and a source of 'ready-to-use' motifs for new transcriptional regulatory elements, polyadenylation signals, and protein-coding sequences. In this review we discuss the consequences of exaptation of sequences that originated in transposable elements, with a focus on events that influence protein-coding genes.
ABSTRACT: Recent studies indicate that the initial classification of transposable elements (TEs) as 'useless', 'selfish' or 'junk' pieces of DNA is not an accurate one. TEs seem to have complex regulatory functions and contribute to the coding regions of many genes. Because this contribution had been documented only at transcript level, we searched for evidence that would also support the translation of TE cassettes. Our findings suggest that the proportion of proteins with TE-encoded fragments (approximately 0.1%), although probably underestimated, is much less than what the data at transcript level suggest (approximately 4%). In all cases, the TE cassettes are derived from old TEs, consistent with the idea that incorporation (exaptation) of TE fragments into functional proteins requires long evolutionary periods. We therefore argue that functional proteins are unlikely to contain TE cassettes derived from young TEs, the role of which is probably limited to regulatory functions.
Trends in Genetics 06/2006; 22(5):260-7. · 9.77 Impact Factor
ABSTRACT: Transposable elements (TEs) are major components of eukaryotic genomes, contributing about 50% to the size of mammalian genomes. TEs serve as recombination hot spots and may acquire specific cellular functions, such as controlling protein translation and gene transcription. The latter is the subject of the analysis presented. We scanned TE sequences located in promoter regions of all annotated genes in the human genome for their content in potential transcription regulating signals. All investigated signals are likely to be over-represented in at least one TE class, which shows that TEs have an important potential to contribute to pre-transcriptional gene regulation, especially by moving transcriptional signals within the genome and thus potentially leading to new gene expression patterns. We also found that some TE classes are more likely than others to carry transcription regulating signals, which can explain why they have different retention rates in regions neighboring genes.
ABSTRACT: The chimpanzee is our closest living relative. The morphological differences between the two species are so large that there is no problem in distinguishing between them. However, the nucleotide difference between the two species is surprisingly small. The early genome comparison by DNA hybridization techniques suggested a nucleotide difference of 1-2%. Recently, direct nucleotide sequencing confirmed this estimate. These findings generated the common belief that the human is extremely close to the chimpanzee at the genetic level. However, if one looks at proteins, which are mainly responsible for phenotypic differences, the picture is quite different, and about 80% of proteins are different between the two species. Still, the number of proteins responsible for the phenotypic differences may be smaller since not all genes are directly responsible for phenotypic characters.
ABSTRACT: Classification of proteins into families is one of the main goals of functional analysis. Proteins are usually assigned to a family on the basis of the presence of family-specific patterns, domains, or structural elements. Whereas proteins belonging to the same family are generally similar to each other, the extent of similarity varies widely across families. Some families are characterized by short, well-defined motifs, whereas others contain longer, less-specific motifs. We present a simple method for visualizing such differences. We applied our method to the Arabidopsis thaliana families listed at The Arabidopsis Information Resource (TAIR) Web site and for 76% of the nontrivial families (families with more than one member), our method identifies simple similarity measures that are necessary and sufficient to cluster members of the family together. Our visualization method can be used as part of an annotation pipeline to identify potentially incorrectly defined families. We also describe how our method can be extended to identify novel families and to assign unclassified proteins into known families.
Genome Research 07/2004; 14(6):1160-9. · 14.40 Impact Factor
ABSTRACT: It is believed that 3.2 billion bp of the human genome harbor approximately 35000 protein-coding genes. On average, one could expect one gene per 300000 nucleotides (nt). Although the distribution of the genes in the human genome is not random, it is rather surprising that a large number of genes overlap in the mammalian genomes. Thousands of overlapping genes were recently identified in the human and mouse genomes. However, the origin and evolution of overlapping genes are still unknown. We identified 1316 pairs of overlapping genes in humans and mice and studied their evolutionary patterns. It appears that these genes do not demonstrate greater than usual conservation. Studies of the gene structure and overlap pattern showed that only a small fraction of analyzed genes preserved exactly the same pattern in both organisms.
Genome Research 03/2004; 14(2):280-6. · 14.40 Impact Factor
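As an illustration of what identifying pairs of overlapping genes involves computationally, here is a minimal sketch that finds all pairs of genes whose coordinate ranges intersect on the same chromosome. It is not the procedure used in the paper, whose criteria are not given in the abstract. The `Gene` record, the function name, and the coordinates are hypothetical, and strand and exon structure are ignored.

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def overlapping_pairs(genes: List[Gene]) -> List[Tuple[str, str]]:
    """Return (name, name) pairs of genes that share at least one base."""
    pairs: List[Tuple[str, str]] = []
    active: List[Gene] = []  # earlier genes that may still overlap later ones
    for gene in sorted(genes, key=lambda g: (g.chrom, g.start, g.end)):
        # Keep only genes on the same chromosome that end at or after
        # the start of the current gene; everything else cannot overlap.
        active = [a for a in active
                  if a.chrom == gene.chrom and a.end >= gene.start]
        pairs.extend((a.name, gene.name) for a in active)
        active.append(gene)
    return pairs


# Made-up coordinates for illustration only.
example_genes = [
    Gene("A", "chr1", 100, 500),
    Gene("B", "chr1", 400, 900),
    Gene("C", "chr1", 1000, 1200),
    Gene("D", "chr2", 450, 600),
]
example_pairs = overlapping_pairs(example_genes)
```

Sorting by start coordinate keeps the comparison set small when most genes do not overlap, which is the typical case in a genome annotation.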
ABSTRACT: One of the most common activities in bioinformatics is the search for similar sequences. These searches are usually carried out with the help of programs from the NCBI BLAST family. As the majority of searches are routinely performed with default parameters, a question that should be addressed is how reliable the results obtained using the default parameter values are, i.e. what fraction of potential matches have been retrieved by these searches. Our primary focus is on the initial hit parameter, also known as the seed or word, used by the NCBI BLASTn, MegaBLAST and other similar programs in searches for similar nucleotide sequences. We show that the use of default values for the initial hit parameter can have a big negative impact on the proportion of potentially similar sequences that are retrieved. We also show how the hit probability of different seeds varies with the minimum length and similarity of sequences desired to be retrieved and describe methods that help in determining appropriate seeds. The experimental results described in this paper illustrate situations in which these methods are most applicable and also show the relationship between the various BLAST parameters.
Nucleic Acids Research 01/2004; 31(23):6935-41. · 8.28 Impact Factor
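To make the notion of seed hit probability concrete, here is a minimal sketch under a simplified model. It is not the method described in the paper: it assumes an ungapped alignment of a given length in which each position matches independently with probability equal to the sequence identity, and it asks how likely it is that the alignment contains at least one run of consecutive matches as long as the word size. The function name and parameters are hypothetical.

```python
def seed_hit_probability(length: int, identity: float, word: int) -> float:
    """Probability of at least one run of `word` consecutive matches in an
    ungapped alignment of `length` positions, each matching independently
    with probability `identity` (a simplifying assumption)."""
    if word <= 0 or length < 0 or not 0.0 <= identity <= 1.0:
        raise ValueError("need word > 0, length >= 0, 0 <= identity <= 1")
    if word > length:
        return 0.0
    # no_hit[k]: probability that no full run has occurred so far and the
    # alignment currently ends in exactly k consecutive matches (k < word).
    no_hit = [0.0] * word
    no_hit[0] = 1.0
    for _ in range(length):
        nxt = [0.0] * word
        for k, prob in enumerate(no_hit):
            nxt[0] += prob * (1.0 - identity)   # mismatch resets the run
            if k + 1 < word:
                nxt[k + 1] += prob * identity   # match extends the run
            # Otherwise the run reaches `word`: a seed hit, so this
            # probability mass leaves the no-hit states for good.
        no_hit = nxt
    return 1.0 - sum(no_hit)
```

Under this model, comparing `seed_hit_probability(100, 0.9, 11)` with `seed_hit_probability(100, 0.9, 28)` shows how a longer word lowers the chance that a similar region is seeded at all, which is the kind of trade-off the abstract describes.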
ABSTRACT: Interspersed repetitive sequences are major components of eukaryotic genomes. Repetitive elements comprise about 50% of the mammalian genome. They interact with the whole genome and influence its evolution. Repetitive elements may serve as recombination hot spots, acquire specific cellular functions such as RNA transcription control, or become part of protein-coding regions. The latter is the subject of the presented analysis. We searched all currently available vertebrate protein sequences, including the human proteome complement, for the presence of transposable elements. It appears that insertion of TE-cassettes into open reading frames is a general phenomenon. They can be found in all vertebrate lineages and originate in all types of transposable elements. It seems that genomes use those cassettes as 'ready to use' motifs in their evolutionary experiments. Most TE-cassettes are used to create alternative forms of a message, and usually the other form, without the TE-cassette, is expressed in a cell. Tables listing vertebrate messages with TE-cassettes are available at http://warta.bio.psu.edu/ScrapYard/.
ABSTRACT: Human genetic information is encoded in two genomes: nuclear and mitochondrial. Both reflect the molecular evolution of humans, from the beginning of life (about 4.5 billion years ago) to the origin of the species Homo sapiens about 100,000 years ago. For this reason, the human genome contains some features that are common to different groups of organisms and some that are unique to Homo sapiens. The 3.2 x 10^9 base pairs of the human nuclear genome are packed into 23 chromosomes of different sizes. The smallest chromosome, the 21st, contains 5 x 10^7 base pairs, while the biggest, the 1st, contains 2.63 x 10^8 base pairs. Although the nucleotide sequence of all chromosomes has been established, the organisation of the nuclear genome still raises questions; for example, the exact number of genes encoded by the human genome is still unknown, with estimates ranging from 30 to 150 thousand genes. Coding sequences represent a few percent of the human nuclear genome. The majority of the genome is represented by repetitive sequences (about 50%) and noncoding unique sequences. This part of the genome is frequently, and wrongly, called "junk DNA". The distribution of genes on chromosomes is irregular: DNA fragments with a low percentage of GC pairs encode fewer genes than fragments with a high percentage of GC pairs.
ABSTRACT: Interspersed repetitive sequences are major components of eukaryotic genomes. Repetitive elements comprise over 50% of the mammalian genome. Because the specific function of these elements remains to be defined and because of their unusual 'behaviour' in the genome, they are often referred to as selfish or junk DNA. Our view of the entire phenomenon of repetitive elements now has to be revised in the light of data on their biology and evolution, especially in the light of what we know about the retroposons. I would like to argue that even if we cannot define the specific function of these elements, we can still show that they are not useless pieces of the genome. The repetitive elements interact with the surrounding sequences and nearby genes. They may serve as recombination hot spots, acquire specific cellular functions such as RNA transcription control, or even become part of protein-coding regions. Finally, they provide a very efficient mechanism for genomic shuffling. As such, repetitive elements should be called a genomic scrap yard rather than junk DNA. Tables listing examples of recruited (exapted) transposable elements are available at http://www.ncbi.nlm.gov/Makalowski/ScrapYard/
ABSTRACT: Human L1 retrotransposons can produce DNA transduction events in which unique DNA segments downstream of L1 elements are mobilized as part of aberrant retrotransposition events. That L1s are capable of carrying out such a reaction in tissue culture cells was elegantly demonstrated. Using bioinformatic approaches to analyze the structures of L1 element target site duplications and flanking sequence features, we provide evidence suggesting that approximately 15% of full-length L1 elements bear evidence of flanking DNA segment transduction. Extrapolating these findings to the 600,000 copies of L1 in the genome, we predict that the amount of DNA transduced by L1 represents approximately 1% of the genome, a fraction comparable with that occupied by exons.
Genome Research 05/2000; 10(4):411-5. · 14.40 Impact Factor
ABSTRACT: We have curated a reference set of cancer-related genes and reanalyzed their sequences in the light of molecular information and resources that have become available since they were first cloned. Homology studies were carried out for human oncogenes and tumor suppressors, compared with the complete proteome of the nematode, Caenorhabditis elegans, and partial proteomes of mouse and rat and the fruit fly, Drosophila melanogaster. Our results demonstrate that simple, semi-automated bioinformatics approaches to identifying putative functionally equivalent gene products in different organisms may often be misleading. An electronic supplement to this article provides an integrated view of our comparative genomics analysis as well as mapping data, physical cDNA resources and links to published literature and reviews, thus creating a "window" into the genomes of humans and other organisms for cancer biology.
ABSTRACT: Recently, we have defined and analyzed over 1800 orthologous human and rodent genes. Here we extend this work to compare human and Caenorhabditis elegans coding sequences. 1880 human proteins were compared with about 20000 predicted nematode proteins presumably comprising nearly the complete proteome of C. elegans. We found that 44% of human/rodent orthologs have convincing nematode counterparts. On average, the amino acid similarity and identity between aligned human and C. elegans orthologous gene products are 69.3% and 49.1% respectively, and the nucleotide identity is 49.8%. Detailed investigation of our results suggests that some nematode gene predictions are incorrect, leading to erroneous pairing with human genes (e.g. calcineurin and polymerase II elongation factor III). Furthermore, other proteins (i.e. homologs of human ribosomal proteins S20 and L41, thymosin) are missing entirely from the nematode proteome, suggesting that it may not be complete. These results underscore the fact that metazoan gene prediction is a very challenging task and that most computer-predicted nematode genes require supporting evidence of their existence from comparative genomics and/or laboratory investigation.