Article

Differences in dinucleotide frequencies of human, yeast, and Escherichia coli genes


Abstract

Nucleotide sequences coding for proteins in human, yeast and Escherichia coli genes were analyzed in terms of dinucleotide occurrences. Every gene is plotted as a point in the dinucleotide space, which is spanned by 16 axes corresponding to the 16 possible dinucleotides. The metric unit in the space is defined using the log-odds ratio of dinucleotide occurrences in a gene. The distribution of points showed that genes from the same organism are clustered in the space. The clusters of human and E. coli are completely separated, and the yeast cluster sits between them, implying that individual genes can be assigned to one of the three sources from their location. In fact, they could be identified with an accuracy of 90%, using the DNA data alone. Even genes encoding homologous proteins belonging to the same protein superfamily were discriminated by the DNA data, and were correctly assigned to their sources with the same accuracy as above. DNA sequences of non-coding regions, including human introns, as well as human genes of GC-rich and GC-poor types, were also analyzed in the same manner. The most significant finding is that human genomic DNA sequences, including genes and introns together, exhibit the largest deviation of dinucleotide occurrence from the random expectation. Possible origins for this phenomenon are discussed.
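
The log-odds representation described in the abstract can be illustrated with a minimal sketch. Since the full text is not available here, the exact definition used by the authors is not known; the version below assumes the common form log(f_xy / (f_x f_y)), with frequencies taken from overlapping positions in a single sequence. The function name `log_odds_vector` and the `pseudocount` parameter are hypothetical and introduced only for illustration.

```python
import math
from itertools import product

BASES = "ACGT"
# The 16 axes of the dinucleotide space: AA, AC, ..., TT.
DINUCS = ["".join(p) for p in product(BASES, repeat=2)]


def log_odds_vector(seq, pseudocount=0.5):
    """Return a dict mapping each dinucleotide to log(f_xy / (f_x * f_y)).

    Assumption: observed frequencies come from overlapping dinucleotides,
    expected frequencies from the base composition of the same sequence.
    The pseudocount (hypothetical) avoids log(0) for absent dinucleotides.
    """
    seq = seq.upper()
    mono = {b: 0 for b in BASES}
    di = {d: 0 for d in DINUCS}
    for b in seq:
        if b in mono:  # ignore ambiguous symbols such as N
            mono[b] += 1
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in di:
            di[pair] += 1
    n_mono = sum(mono.values()) + 4 * pseudocount
    n_di = sum(di.values()) + 16 * pseudocount
    vec = {}
    for d in DINUCS:
        f_xy = (di[d] + pseudocount) / n_di
        f_x = (mono[d[0]] + pseudocount) / n_mono
        f_y = (mono[d[1]] + pseudocount) / n_mono
        vec[d] = math.log(f_xy / (f_x * f_y))
    return vec
```

Under these assumptions each gene becomes a 16-component vector, and positive entries mark dinucleotides that occur more often than base composition alone would predict.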


... Applications include PCR primer (Fislage 1998; Fislage et al. 1997) and microarray probe design (Southern 2001). Several attempts (Deschavanne et al. 1999; Karlin and Ladunga 1994; Karlin and Mrazek 1997; Nakashima et al. 1997; Nakashima et al. 1998; Nussinov 1984; Sandberg et al. 2001) have been made to employ the frequency distribution of short subsequences (n-mers) to identify species with relatively short genome sizes (microbial). In such an approach, the shape of the frequency distribution for certain short subsequences, 2-4-mers (Deschavanne et al. 1999; Karlin and Ladunga 1994; Karlin and Mrazek 1997; Nakashima et al. 1997; Nakashima et al. 1998; Nussinov 1984) and 8-9-mers (Deschavanne et al. 1999; Sandberg et al. 2001), has been used to decide what microbial genome one is dealing with, based on a given piece of genome or a whole genome. ...
... It is well-known that when genome size M > 4^n, the appearance of n-mers in various genomes is not random (Karlin and Ladunga 1994; Karlin and Mrazek 1997; Nakashima et al. 1997; Nakashima et al. 1998; Nussinov 1984). The basic motivation of our analysis is to explore the statistical properties of the presence of longer n-mers when the condition M << 4^n holds. ...
Article
Full-text available
Abstract A comparative statistical analysis of the presence of all possible short subsequences
... The nucleotide sequences of various sources have been analyzed by many investigators 1-10 in terms of the frequency of occurrence of oligomers such as dinucleotides, trinucleotides, and tetranucleotides. The general analytical method is based on the odds ratio between observed and expected values calculated from base composition. ... Recently, we have reported that protein coding nucleotide sequences of human, yeast (Saccharomyces cerevisiae) and Escherichia coli have different frequencies of dinucleotides. 13 The genes are distributed in a cluster around its average when each gene is expressed as a vector of the log-odds ratio of 16 components of dinucleotides ...
... The 489 genes and 108 introns from humans used here were the same as used in the previous work. 13 Amino acid sequence data for proteins longer than 50 residues were used. ...
... Our main aim in the present study was to examine to what extent genes of various organisms could be separated into clusters in the dinucleotide space, as suggested in a previous study. 13 The degree of separation depends on the detailed definition of the scaling. In our previous work, the log-odds ratio of dinucleotide frequencies in a sequence was employed. ...
Article
Full-text available
A set of 16 kinds of dinucleotide compositions was used to analyze the protein-encoding nucleotide sequences in nine complete genomes: Escherichia coli, Haemophilus influenzae, Helicobacter pylori, Mycoplasma genitalium, Mycoplasma pneumoniae, Synechocystis sp., Methanococcus jannaschii, Archaeoglobus fulgidus, and Saccharomyces cerevisiae. The dinucleotide composition was significantly different between the organisms. The distribution of genes from an organism was clustered around its center in the dinucleotide composition space. The genes from closely related organisms such as Gram-negative bacteria, mycoplasma species and eukaryotes showed some overlap in the space. The genes from nine complete genomes together with those from human were discriminated into respective clusters with 80% accuracy using the dinucleotide composition alone. The composition data estimated from a whole genome was close to that obtained from genes, indicating that the characteristic feature of dinucleotides holds not only for protein coding regions but also noncoding regions. When a dendrogram was constructed from the disposition of the clusters in the dinucleotide space, it resembled the real phylogenetic tree. Thus, the distinct feature observed in the dinucleotide composition may reflect the phylogenetic relationship of organisms.
... In addition, knowledge of the distribution of appearance of n-mers is necessary for PCR primer (Fislage et al., 1997; Fislage, 1998) and microarray probe design (Southern, 2001). Several attempts (Nussinov, 1984; Karlin and Ladunga, 1994; Karlin et al., 1997; Nakashima et al., 1997, 1998; Deschavanne et al., 1999; Sandberg et al., 2001) have been made to employ the distributions of appearance for n-mers to identify species with relatively short genome sizes (microbial). In such an approach, the shapes of the frequency distributions for particular short subsequences [2–4mers (Nussinov, 1984; Karlin and Ladunga, 1994; Karlin et al., 1997; Nakashima et al., 1997, 1998; Campbell et al., 1999) and 8–9mers (Deschavanne et al., 1999; Sandberg et al., 2001)] have been proposed as a measure to decide what microbial genome we are dealing with, based on a given piece of genome or a whole genome. ...
... Several attempts (Nussinov, 1984; Karlin and Ladunga, 1994; Karlin et al., 1997; Nakashima et al., 1997, 1998; Deschavanne et al., 1999; Sandberg et al., 2001) have been made to employ the distributions of appearance for n-mers to identify species with relatively short genome sizes (microbial). In such an approach, the shapes of the frequency distributions for particular short subsequences [2–4mers (Nussinov, 1984; Karlin and Ladunga, 1994; Karlin et al., 1997; Nakashima et al., 1997, 1998; Campbell et al., 1999) and 8–9mers (Deschavanne et al., 1999; Sandberg et al., 2001)] have been proposed as a measure to decide what microbial genome we are dealing with, based on a given piece of genome or a whole genome. The above-mentioned papers deal with the case for frequency of appearance when n is small, such that the total number of n-mers, 4^n, is smaller than the genome sequence length, M, i.e. 4^n < M. It is clear that distributions of appearance of n-mers in this range are essentially different from those for random sequences of the same lengths. ...
... Since DNA replication biases are partly visible at the dinucleotide level [30,31,32], we have constructed individual codon-pair context maps in which rows and columns were sorted to separate P-site codons ending with a particular nucleotide (N3; rows) and A-site codons starting with a particular nucleotide (N1; columns) (Figure 4A). These two consecutive positions of codon-pair context discriminated rather well codon-pair preferences and such discrimination was very strong for high eukaryotes and weak for low eukaryotes and bacteria (Figure 4A). ...
... species appeared as an important determinant of its codon-pair context behavior (Figure 2), in a similar manner to that described for codon usage bias [39] or dinucleotide genome signatures [31,32]. ...
... In this scenario, one is prompted to hypothesize that the translational process may work with sub-optimized mRNA sequences since codon-context fine tunes decoding fidelity [15,22,23]. Genomes are known to have biased dinucleotide frequencies [31], a feature that has frequently been used to produce genomic signatures of phylogenetical and taxonomical relevance [31,32]. At the ORFeome level this bias influences codon usage [32] but may also interfere with codon-context, whenever the last nucleotide of one codon is associated with the first nucleotide of the second codon of the pair. ...
Article
Full-text available
Codon usage and codon-pair context are important gene primary structure features that influence mRNA decoding fidelity. In order to identify general rules that shape codon-pair context and minimize mRNA decoding error, we have carried out a large scale comparative codon-pair context analysis of 119 fully sequenced genomes. We have developed mathematical and software tools for large scale comparative codon-pair context analysis. These methodologies unveiled general and species specific codon-pair context rules that govern evolution of mRNAs in the 3 domains of life. We show that evolution of bacterial and archaeal mRNA primary structure is mainly dependent on constraints imposed by the translational machinery, while in eukaryotes DNA methylation and tri-nucleotide repeats impose strong biases on codon-pair context. The data highlight fundamental differences between prokaryotic and eukaryotic mRNA decoding rules, which are partially independent of codon usage.
... This approach is based on the assumption that members of a species share sequence attributes that are absent in a sister species [19]. We examined whether oligonucleotide frequencies in different barcode loci can discriminate species. In earlier studies, oligonucleotide frequencies have been reported to exhibit species-specific signals [20-25], but most of these studies were based on the analysis of whole genomes. Thus these were applied to small genomes only and used for classification of bacteria. ...
... They suggested that oligonucleotide frequency is useful not only for classification of bacteria, but also for estimation of phylogenetic relationships among closely related species. This and other reports [20-25] considered the whole genome sequences for species clustering by Euclidean distance derived from oligonucleotide frequencies. Our study shows that the barcode loci can efficiently discriminate species using di- or trinucleotide frequencies of the loci across the kingdom. ...
Article
Full-text available
DNA barcoding refers to the use of short DNA sequences for rapid identification of species. Genetic distance or character attributes of a particular barcode locus discriminate the species. We report an efficient approach to analyze short sequence data for discrimination between species. A new approach, Oligonucleotide Frequency Range (OFR) of barcode loci for species discrimination is proposed. OFR of the loci that discriminates between species was characteristic of a species, i.e., the maxima and minima within a species did not overlap with that of other species. We compared the species resolution ability of different barcode loci using p-distance, Euclidean distance of oligonucleotide frequencies, nucleotide-character based approach and OFR method. The species resolution by OFR was either higher or comparable to the other methods. A short fragment of 126 bp of internal transcribed spacer region in ribosomal RNA gene was sufficient to discriminate a majority of the species using OFR. Oligonucleotide frequency range of a barcode locus can discriminate between species. Ability to discriminate species using very short DNA fragments may have wider applications in forensic and conservation studies.
... Dinucleotide biases are generally unique and individual for most organisms and have therefore been used as measures of genomic signatures, status of genes and as a method to determine phylogeny (Campbell et al. 1999). A dinucleotide method has been used at the gene level by Nakashima et al. (1997), who studied the signatures and separation of genes from E. coli, Saccharomyces cerevisiae, and Homo sapiens. ...
... The Karlin method (Karlin and Ladunga 1994; Campbell et al. 1999) uses average frequencies of all dinucleotides to determine genomic signatures and a Euclidean distance measure to determine deviations. This is not directly applicable to gene level investigations of convergence to mutational biases due to several circumstances: (i) coding requirements cause 1:2, 2:3, and 3:1 dinucleotide average biases to be dissimilar, (ii) 1:2 dinucleotide bias is primarily a vector of nonrandom amino acid requirements and it is known that these requirements differ among organisms (Nakashima et al. 1997). Furthermore, (iii) 2:3 dinucleotide bias may have large contributions from the codon bias, and (iv) average frequencies do not include the information contained in the covariances. ...
Article
Along the gene, nucleotides in various codon positions tend to exert a slight but observable influence on the nucleotide choice at neighboring positions. Such context biases are different in different organisms and can be used as genomic signatures. In this paper, we will focus specifically on the dinucleotide composed of a third codon position nucleotide and its succeeding first position nucleotide. Using the 16 possible dinucleotide combinations, we calculate how well individual genes conform to the observed mean dinucleotide frequencies of an entire genome, forming a distance measure for each gene. It is found that genes from different genomes can be separated with a high degree of accuracy, according to these distance values. In particular, we address the problem of recent horizontal gene transfer, and how imported genes may be evaluated by their poor assimilation to the host's context biases. By concentrating on the third- and succeeding first position nucleotides, we eliminate most spurious contributions from codon usage and amino-acid requirements, focusing mainly on mutational effects. Since imported genes are expected to converge only gradually to genomic signatures, it is possible to question whether a gene present in only one of two closely related organisms has been imported into one organism or deleted in the other. Striking correlations between the proposed distance measure and poor homology are observed when Escherichia coli genes are compared to Salmonella typhi, indicating that sets of outlier genes in E. coli may contain a high number of genes that have been imported into E. coli, and not deleted in S. typhi.
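
The restriction to dinucleotides that span a codon boundary (third position followed by the next first position) can be sketched as follows. This is only an illustration under stated assumptions: the coding sequence is taken to be in frame starting at index 0, and the abstract does not specify the distance measure, so a plain Euclidean distance is swapped in. The names `boundary_dinucleotide_freqs` and `euclidean_distance` are hypothetical.

```python
from itertools import product

DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]


def boundary_dinucleotide_freqs(cds):
    """Relative frequencies of the 16 dinucleotides formed by a third codon
    position and the following first codon position.

    Assumption: cds is in frame, so codon k occupies indices 3k..3k+2.
    """
    cds = cds.upper()
    counts = {d: 0 for d in DINUCS}
    # Index 2, 5, 8, ... is a third codon position; i + 1 starts the next codon.
    for i in range(2, len(cds) - 1, 3):
        pair = cds[i:i + 2]
        if pair in counts:  # skip pairs containing ambiguous symbols
            counts[pair] += 1
    total = sum(counts.values())
    return {d: (counts[d] / total if total else 0.0) for d in DINUCS}


def euclidean_distance(f, g):
    """Illustrative stand-in for the paper's (unspecified) distance measure."""
    return sum((f[d] - g[d]) ** 2 for d in DINUCS) ** 0.5
```

A gene's vector could then be compared with the mean vector over all genes of a genome; genes far from that mean would be candidates for closer inspection.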
... In this communication, using the same approach, that is, analyzing together all available non-redundant viral sequences, we report two important features of genomic signatures, dinucleotide (diN) frequency and codon usage bias (CUB), investigated using principal component analysis (PCA), a multivariate analytical tool. We show that, as expected from results obtained for eukaryotes and prokaryotes, using multivariate analysis (and other approaches), genomic GC content tends to be the main factor determining the frequency of diNs and CUB [for relatively "old" papers analyzing these features see, for example, [18,30,31,34,41,43,48]]. ...
Article
Full-text available
Viruses are, by far, the most abundant biological entities on earth. They are found in all known ecological niches and are the causative agents of many important diseases in plants and animals. From an evolutionary point of view, since viruses do not share any orthologous genes, there is a general consensus that they are polyphyletic; that is, they do not have a common ancestor. This means that they appeared several times during the course of evolution. For their life cycle, they are always obligate parasites of a free cellular life form, which can be bacteria, archaea, or eukaryotes. More complexity is added to these entities by the fact that their genetic material can be DNA or RNA (double- or single-stranded) or retrotranscribed. Given these features, we wondered if some general rules can be inferred when studying two basic genomic signatures—dinucleotides and codon usage—analyzing all available complete and non-redundant viral sequences. In spite of the obviously biased sample of sequences available, some general features appear to emerge.
... We can provide two possible explanations for the excess of SNVs at 5′ss. Firstly, the canonical AG dinucleotides at 3′ss can be created by SNV of CG dinucleotides, which are under-represented in vertebrate genomes [29]. Indeed, we did not observe any SNVs of CG to AG to create 3′ss (Fig. 5). ...
Article
Full-text available
Causative mutations for human genetic disorders have mainly been identified in exonic regions that code for amino acid sequences. Recently, however, it has been reported that mutations in deep intronic regions can also cause certain human genetic disorders by creating novel splice sites, leading to pseudo-exon activation. To investigate how frequently pseudo-exon activation events occur in normal individuals, we conducted in silico identification of such events using personal genome data and corresponding high-quality transcriptome data. With rather stringent conditions, on average, 2.6 pseudo-exon activation events per individual were identified. More pseudo-exon activation events were found in 5′ donor splice sites than in 3′ acceptor splice sites. Although pseudo-exon activation events have sporadically been reported as causative mutations in genetic disorders, it is revealed in this study that such events can be observed in normal individuals at a certain frequency. We estimate that human genomes typically contain on average at least 10 pseudo-exon activation events. The actual number should be higher than this, because we used stringent criteria to identify pseudo-exon activation events. This suggests that it is worth considering the possibility of pseudo-exon activation when searching for causative mutations of genetic disorders if candidate mutations are not identified in coding regions or RNA splice sites.
... Analysis of statistical attributes of DNA sequences is significant for evolutionary biology and for technologies to identify living organisms. Several attempts have been made to identify relatively small size (microbial) genomes by using the distribution of the appearance of short consecutive nucleotide strings of length k called k-tuples [1-7]. To describe the distribution, many scholars used information quantities including Shannon entropy and Fisher information [8-16]. ...
... It is hence important to utilize scores that accurately represent the true event probability distributions. Although it is known that higher order genome nucleotide distributions contain unique information, and vary significantly across different types of organisms (see for example [11]), standard FSAs are limited to 0th order likelihood representations that are often approximated using integer values. ...
Article
Full-text available
Background: Genome sequencing provides a powerful tool for pathogen detection and can help resolve outbreaks that pose public safety and health risks. Mapping of DNA reads to genomes plays a fundamental role in this approach, where accurate alignment and classification of sequencing data is crucial. Standard mapping methods crudely treat bases as independent from their neighbors. Accuracy might be improved by using higher order paired hidden Markov models (HMMs), which model neighbor effects, but introduce design and implementation issues that have typically made them impractical for read mapping applications. We present a variable-order paired HMM that we term VarHMM, which addresses central issues involved with higher order modeling for sequence alignment. Results: Compared with existing alignment methods, VarHMM is able to model higher order distributions and quantify alignment probabilities with greater detail and accuracy. In a series of comparison tests, in which Ion Torrent sequenced DNA was mapped to similar bacterial strains, VarHMM consistently provided better strain discrimination than any of the other alignment methods that we compared with. Conclusions: Our results demonstrate the advantages of higher ordered probability distribution modeling and also suggest that further development of such models would benefit read mapping in a range of other applications as well. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-017-1710-0) contains supplementary material, which is available to authorized users.
... Nakashima et al. [89] analyzed human, yeast, and E. coli coding sequences in terms of dinucleotide occurrences. In 16-dimensional space, they observed that the human and E. coli clusters were distinctly separated while that for yeast was positioned in between. ...
... The frequency of the appearance of n-mers has been applied to the identification of regions of compositional peculiarities, most commonly studies of di- and tri-nucleotides (7, 8). We have also observed correlation between regions of unusual compositional properties with the frequency of appearance of longer n-mers in genomic sequences. ...
Conference Paper
Full-text available
Numerous sequencing projects have unveiled partial and full microbial genomes. The data produced far exceeds one person’s analytical capabilities and thus requires the power of computing. A significant amount of work has focused on the diversity of statistical characteristics along microbial genomic sequences, e.g. codon bias, G+C content, the frequencies of short subsequences (n‐mers), etc. Based upon the results of these studies, two observations were made: (1) there exists a correlation between regions of unusual statistical properties, e.g. difference in codon bias, etc., from the rest of the genomic sequence, and evolutionary significant regions, e.g. regions of horizontal gene transfer; and (2) because no two microbial genomes look statistically identical, statistical properties can be used to distinguish between genomic sequences. Recently, we conducted extensive analysis on the presence/absence of n‐mers for many microbial genomes as well as several viral and eukaryotic genomes. This analysis revealed that the presence of n‐mers in all genomes considered (in the range of n, when the condition M << 4^n holds, where M is the genome length) can be treated as a nearly random and independent process. Thus we hypothesize that one may use relatively small sets of randomly picked n‐mers for differentiating between different microorganisms. Recently, we analyzed the frequency of appearance of all 8‐ to 12‐mers present in each of the 200+ publicly available microbial genomes. For nearly all of the genomes under consideration, we observed that some n‐mers are present much more frequently than expected: from 50 to over a thousand copies. Upon closer inspection of these sequences, we found several cases in which an overrepresented n‐mer exhibits a bias towards being located in the coding or being located in the non‐coding region.
Although the evolutionary reason for the conservation of such sequences remains unclear, in some cases it is plausible to believe that sequences having a clear bias for non‐coding regions may be because of their role in the DNA uptake/recombination process, being parts in insertion sequences, or serving as transcription factors recognition sites. Our analysis of the frequency of appearance of 6‐mers for each microbial genome revealed regions that display unusual statistical properties with respect to their own genome. After inspection of the genes contained within these regions, we believe that such regions are likely to have been acquired into the genomic sequence through horizontal gene transfer.
... In literature, several previous attempts have also been made to employ the frequency distribution of short subsequences (n-mers or motifs) to identify species for relatively short genome sizes (e.g., viruses and microbes). In such an approach, the shape of the frequency distribution for certain short subsequences, 2-4-mers [4-8] and 8-9-mers [9,10], was proposed to be used to decide what microbial genome is being considered based on a given random piece of genome or the entire genome. Algorithmically, such types of analyses employ a repeatable search for the short patterns in genomes, also known as the exact string matching problem. ...
Article
Full-text available
Statistical analysis of the appearance of short subsequences in different DNA sequences, from individual genes to full genomes, is important for various reasons. Applications include PCR primer and microarray probe design. Moreover, the distribution of short subsequences (n-mers) in a genome can be used to distinguish between species with relatively short genome sizes (e.g., viruses and microbes). To be able to perform such an analysis, a group of algorithms was developed to specifically deal with the problem of finding the appearance of all possible patterns of size n (n-mers) in a sequence or text of size m. The concept of a counting array allows us to map our problem for large subsequences onto a useful data structure. The run-time operation count estimate of O(4^n + m) makes it computationally convenient to accomplish the calculation of the statistics of the presence/absence of all possible 7-20-mers in more than 250 genomes including the human genome.
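
The counting-array idea can be sketched as follows. This is not the authors' implementation; it is a minimal illustration that assumes each n-mer is encoded as a base-4 integer and used as an index into an array of length 4^n, so that allocating the array costs O(4^n) and a single pass over the text costs O(m). The function name `count_nmers` and the handling of ambiguous symbols are assumptions.

```python
def count_nmers(seq, n):
    """Count every n-mer of seq in an array of length 4**n.

    Each n-mer is mapped to an integer by reading A, C, G, T as the
    base-4 digits 0, 1, 2, 3. The index is updated in O(1) per character.
    """
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    counts = [0] * (4 ** n)      # O(4^n) allocation
    mask = 4 ** n - 1            # keeps only the last n base-4 digits
    index = 0
    valid = 0                    # unambiguous characters seen in a row
    for ch in seq.upper():       # O(m) scan
        c = code.get(ch)
        if c is None:            # ambiguous symbol: restart the window
            valid = 0
            index = 0
            continue
        index = ((index * 4) + c) & mask
        valid += 1
        if valid >= n:
            counts[index] += 1
    return counts
```

Presence/absence statistics then follow directly, for example by counting the zero entries of the returned array. A plain Python list is used here for clarity; for n near 20 a compact bit array would be needed in practice.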
... Karlin's group has reported that genes from bacteria have their species specific nucleotide compositions based on the relative ratio between observed and expected dinucleotide frequencies [2,3,4]. Even genes encoding homologous proteins from different species were discriminated by their dinucleotide frequencies [5]. The protein-coding genes from nine genomes were classified into their species with accuracy of 80% in terms of dinucleotide frequencies [6]. ...
Article
The nucleotide composition of protein-coding genes in the two DNA strands of Escherichia coli, Bacillus subtilis, Methanococcus jannaschii and mitochondrial genes of human and fruit fly was studied. E. coli, B. subtilis and M. jannaschii indicated compositional asymmetry in their genomic sequences. The protein-coding genes of E. coli and B. subtilis showed the influence of compositional asymmetry in their compositions, however, no influence was observed in M. jannaschii. Mitochondrial protein-coding genes showed significant difference in composition in the two DNA strands. The deviation of nucleotide composition in the two DNA strands is discussed. © 2008 Springer-Verlag.
... As an alternative method to whole bacterial genome comparison, many studies have shown that di-nucleotide frequencies within DNA sequences exhibit species-specific signals [14-19]. Species-specific signals for oligomers up to a length of four nucleotides have also been detected [20,21]. ...
Article
Classification of bacteria is mainly based on sequence comparisons of certain homologous genes such as 16S rRNA. Recently there are challenges to classify bacteria using oligonucleotide frequency pattern of nonhomologous sequences. However, the evolutionary significance of oligonucleotides longer than tetra-nucleotide is not studied well. We performed phylogenetic analysis by using the Euclidean distances calculated from the di to deca-nucleotide frequencies in bacterial genomes, and compared these oligonucleotide frequency-based tree topologies with those for 16S rRNA gene and concatenated seven genes. When oligonucleotide frequency-based trees were constructed for bacterial species with similar GC content, their topologies at genus and family level were congruent with those based on homologous genes. Our results suggest that oligonucleotide frequency is useful not only for classification of bacteria, but also for estimation of their phylogenetic relationships for closely related species.
... This claim could be reconciled with the implication of the neutral theory if higher order features ameliorate more rapidly and uniformly than lower order features. For example, dinucleotide composition can be mathematically decomposed into two parts: (i) the mononucleotide composition and (ii) the matrix of ‘odds ratios’ that compare the observed proportions of the individual dinucleotides to their expectations under the assumption of pure randomness [15]. Nakashima et al. [16] examined 10 complete genomes and concluded that part (ii) reflects phylogenetic relations better than part (i). ...
Article
Various methods have been developed to detect horizontal gene transfer in bacteria, based on anomalous nucleotide composition, assuming that compositional features undergo amelioration in the host genome. Evolutionary theory predicts the inevitability of false positives when essential sequences are strongly conserved. Foreign genes could become more detectable on the basis of their higher order compositions if such features ameliorate more rapidly and uniformly than lower order features. This possibility is tested by comparing the heterogeneities of bacterial genomes with respect to strand-independent first- and second-order features, (i) G + C content and (ii) dinucleotide relative abundance, in 1 kb segments. Although statistical analysis confirms that (ii) is less inhomogeneous than (i) in all 12 species examined, extreme anomalies with respect to (ii) in the Escherichia coli K12 genome are typically co-located with essential genes.
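
A strand-independent dinucleotide relative abundance over fixed-size segments can be sketched as below. The abstract does not give the formula, so this assumes the widely used symmetrized odds ratio, in which counts are pooled over a segment and its reverse complement before computing f_xy / (f_x f_y). The helper names `split_segments` and `symmetrized_relative_abundance` are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = set("ACGT")
DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def split_segments(genome, size=1000):
    """Cut a sequence into consecutive windows of `size` bases (1 kb default)."""
    return [genome[i:i + size] for i in range(0, len(genome), size)]


def symmetrized_relative_abundance(segment):
    """Odds ratio f_xy / (f_x * f_y) with counts pooled over both strands.

    Dinucleotides whose constituent bases are absent are left out of the
    result instead of dividing by zero.
    """
    s = segment.upper()
    rc = s.translate(COMPLEMENT)[::-1]  # reverse complement
    mono, di = Counter(), Counter()
    for strand in (s, rc):
        mono.update(b for b in strand if b in BASES)
        di.update(strand[i:i + 2] for i in range(len(strand) - 1)
                  if set(strand[i:i + 2]) <= BASES)
    n_mono, n_di = sum(mono.values()), sum(di.values())
    rho = {}
    for d in DINUCS:
        if n_di and mono[d[0]] and mono[d[1]]:
            f_x = mono[d[0]] / n_mono
            f_y = mono[d[1]] / n_mono
            rho[d] = (di[d] / n_di) / (f_x * f_y)
    return rho
```

Because both strands are pooled, a dinucleotide and its reverse complement (for example AC and GT) receive the same value, which is what makes the feature strand-independent.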
... The analysis of the frequency of appearance of short n-mers, first discussed over a decade ago for short (2-9 nt long) sequences (Campbell et al., 1999; Deschavanne et al., 1999; Karlin and Ladunga, 1994; Karlin et al., 1997; Nakashima et al., 1997, 1998; Nussinov, 1984; Sandberg et al., 2001), remains in the scope of interest of many practical applications including pathogen identification (Lehner, 2005; Karlin, 1998; Phillippy et al., 2007; Putonti et al., 2006). In contrast, from the analysis of the presence/absence of longer (4^n >> m) subsequences (or n-mers of length n), one may conclude that the appearance of these longer n-mers in genomes can be approximated as a random and independent process (Chumakov et al., 2005; Fofanov et al., 2004). ...
Article
Genomic-based methods have significant potential for fast and accurate identification of organisms or even genes of interest in complex environmental samples (air, water, soil, food, etc.), especially when isolation of the target organism cannot be performed by a variety of reasons. Despite this potential, the presence of the unknown, variable and usually large quantities of background DNA can cause interference resulting in false positive outcomes. In order to estimate how the genomic diversity of the background (total length of all of the different genomes present in the background), target length and target mutation rate affect the probability of misidentifications, we introduce a mathematical definition for the quality of an individual signature in the presence of a background based on its length and number of mismatches needed to transform the signature into the closest subsequence present in the background. This definition, in conjunction with a probabilistic framework, allows one to predict the minimal signature length required to identify the target in the presence of different sizes of backgrounds and the effect of the target's mutation rate on the quality of its identification. The model assumptions and predictions were validated using both Monte Carlo simulations and real genomic data examples. The proposed model can be used to determine appropriate signature lengths for various combinations of target and background genome sizes. It also predicted that any genomic signatures will be unable to identify target if its mutation rate is >5%. Supplementary data are available at Bioinformatics online.
... We believe that these miPPRs are also good candidates for the miRNA promoter and discussed the basis for it partly using computational analysis in Supplementary Note. Moreover, calculation of odds ratio of CpG dinucleotides ( CpG ) (Nakashima et al., 1997) shows that the human miPPRs had 1.43- and 1.64-fold higher CpG than that of the background cohorts described above and that of entire human genome, respectively (Fig. 1B). Considering that a significant fraction of known promoters locate in CpG island, these results support that the miPPRs are rich in functional promoter sequences. ...
Article
Motivation: Just as transcription factors, miRNA genes modulate global patterns of gene expression during differentiation, metabolic activation, stimulus response and also carcinogenesis. However, little is currently known how the miRNA gene expression itself is regulated owing to lack of basic information of their gene structure. Global prediction of promoter regions of miRNA genes would allow us to explore the mechanisms underlying gene-regulatory mechanisms involving these miRNAs. Results: We speculate that if specific miRNA molecules are involved in evolutionarily conserved regulatory systems in vertebrates, this would entail a high level of conservation of the promoter of miRNA gene as well as the miRNA molecule. By our current screening of putative promoter regions of miRNA genes (miPPRs) on this base, we identified 59 miPPRs that would direct production of 79 miRNAs. We present both biochemical and bioinformatical verifications of these putative promoters.
... Recently, we have reported that protein coding nucleotide sequences of human, yeast (Saccharomyces cerevisiae) and Escherichia coli have different features in the frequency of occurrence of dinucleotides [1]. The genes of human are completely separated in the dinucleotide composition space from those of E. coli, and those of yeast sit in-between. ...
Article
occus jannaschii, Archaeoglobus fulgidus, and Saccharomyces cerevisiae have been investigated [2]. The dinucleotide composition was significantly different between the organisms. The distribution of genes from an organism was clustered around its center in the dinucleotide composition space. The genes from closely related organisms such as Gram-negative bacteria, mycoplasma species and eukaryotes showed some overlaps in the space. The genes from nine complete genomes together with those from human were discriminated into respective clusters with 80% accuracy using the dinucleotide composition alone. The composition data estimated from a whole genome was close to that obtained from genes, indicating that the characteristic feature of dinucleotides holds not only for protein coding regions but also noncoding regions. When a dendrogram was constructed from the disposition of the clusters in the dinucleotide space, it resembled the real phylogenetic tree. Thus, the distinct feature
Article
The evolution of drug-resistant pathogenic microbial species is a major global health concern. Naturally occurring, antimicrobial peptides (AMPs) are considered promising candidates to address antibiotic resistance problems. A variety of computational methods have been developed to accurately predict AMPs. The majority of such methods are not microbial strain specific (MSS): they can predict whether a given peptide is active against some microbe, but cannot accurately calculate whether such peptide would be active against a particular MS. Due to insufficient data on most MS, only a few MSS predictive models have been developed so far. To overcome this problem, we developed a novel approach that allows to improve MSS predictive models (MSSPM), based on properties, computed for AMP sequences and characteristics of genomes, computed for target MS. New models can perform predictions of AMPs for MS that do not have data on peptides tested on them. We tested various types of feature engineering as well as different machine learning (ML) algorithms to compare the predictive abilities of resulting models. Among the ML algorithms, Random Forest and AdaBoost performed best. By using genome characteristics as additional features, the performance for all models increased relative to models relying on AMP sequence-based properties only. Our novel MSS AMP predictor is freely accessible as part of DBAASP database resource at http://dbaasp.org/prediction/genome
Article
Orthologous genes from two mycoplasma species, Mycoplasma genitalium and Mycoplasma pneumoniae, were analyzed in terms of trinucleotide composition. A nucleotide sequence can be converted to a composition vector and plotted as a point in a multidimensional composition space according to its composition. It was found that the distribution of orthologous genes in the composition space along the axis of the first principal component, showed a correlation with the degree of sequence identity between pairs of orthologs. Further, as the first principal components of individual genes are also strongly correlated with their G+C content, both the G+C content and the degree of sequence identity are correlated as well. In this paper, we demonstrate that highly conserved sequences showed higher G+C content and poorly conserved ones had lower G+C content for the genes of two mycoplasma species. The presence of respective directional gene distributions in the composition space is suggestive of the differing evolutionary pathways of the two mycoplasma after the divergence from a common ancestral species.
Article
To overcome disadvantages of long time consumption and low efficiency when k-mer frequency is used for DNA segment recognition, the attributes reduction of rough set theory is adopted to reduce the k-mer frequency. Signal reduction experiment in the whole genome of 30 microbial strains was carried out. Results show that using this method can reduce 72.27% of the original high-dimensional genetic signals, and increase the accuracy by 0.62%, meanwhile, the running time is shortened by 73.3%. ©, 2015, Editorial Board of Jilin University. All right reserved.
Thesis
Chronic hepatitis B virus (HBV) infection causes liver disease that can progress to cirrhosis and hepatocellular carcinoma (HCC). Changes in the hepatocyte population that occur from the early immune-tolerant stage of infection to late-stage disease outcomes remain unclear. We hypothesised that some hepatocytes lose HBV antigen expression and escape the HBV-specific immune response, allowing them to undergo clonal proliferation. Clonal proliferation of altered hepatocytes may be a marker of disease progression and may have a direct role in the development of HCC. Liver tissues from 30 patients were analysed, including patients with early-stage HBV infection, late-stage infection with cirrhosis, or with HCC. Unique virus-cell DNA junctions formed by the integration of HBV DNA into the host cell genome were detected using inverse nested PCR (invPCR). The copy number of unique virus-cell DNA junctions was used as a measure of clonal proliferation of hepatocytes. A computer simulation of a liver undergoing stochastic liver turnover was used to determine if the hepatocyte clones observed by invPCR could have been formed by random chance. Immunohistochemistry for HBV surface antigen (HBsAg) expression and Imaging Mass Spectrometry (IMS) for cellular protein expression were carried out to detect cellular changes that may be associated with clonal proliferation. Significantly (p<0.01) larger clones were observed by invPCR in liver DNA extracts of patients with late-stage HBV-associated disease (≤280000 hepatocytes) compared to patients in early-stage HBV infection (8-1124 hepatocytes). Computer simulations indicated that stochastic turnover could not produce clones of >10000 hepatocytes, suggesting that the hepatocytes that had formed large clones had a survival advantage. No significant difference in the extent of clonal proliferation was observed in foci of HBsAg-positive and -negative hepatocytes isolated by laser-microdissection.
Heterogeneous expression of cellular proteins was detected using IMS in hepatocytes with apparently normal histology. These results indicate that clonal proliferation of hepatocytes with survival advantage does occur in the hepatocyte population during chronic HBV infection and can be detected before histological changes are evident in the hepatocytes of patients with both early- and late-stage disease. Consistent with our hypothesis, larger hepatocyte clones were associated with disease progression. The cause of the clonal proliferation remains unknown. Contrary to our hypothesis, loss of HBsAg expression was not associated with increased clonal proliferation, suggesting that escape from HBsAg-specific immune attack is not a survival advantage. While not investigated in this thesis, the loss of expression of other HBV antigens may provide a survival advantage. Heterogeneous cellular protein expression suggests that hepatocyte phenotype has been altered in some hepatocytes. However, we could not show using invPCR approaches that the foci of hepatocytes with altered cellular protein expression were clonal. In conclusion, this research has provided groundwork in determining the relationship between the clonal proliferation of hepatocytes, altered hepatocyte phenotype and HBV-associated disease progression. Further studies into the molecular causes of clonal proliferation of hepatocytes with survival advantages could elucidate pathways of HBV-associated disease progression and novel ways to curb the evolution of the hepatocyte population to a less pathogenic state.
Article
Zipf's approach in linguistics is utilized to analyze the statistical features of frequency and correlation of 16 nearest neighboring nucleotides (AA, AC, AG, ..., TT) in 12 human chromosomes (Y, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, and 12). Two statistical features of nearest neighboring nucleotides in the human genome are found: (i) the frequency distribution is a linear function, and (ii) the correlation distribution is an inverse function. The coefficients of the linear function and inverse function depend on the GC content. The study proposes the correlation distribution of nearest neighboring nucleotides for the first time and extends the descriptor of nearest neighboring nucleotides.
Conference Paper
This paper presents a novel approach toward high precision biology species categorization which is mainly based on KNN algorithm. KNN has been successfully used in natural language processing (NLP). Our work extends the learning method for biological data. We view the DNA or RNA sequences of certain species as special natural language texts. The approach for constructing composition vectors of DNA and RNA sequences is described. A learning method based on KNN algorithm is proposed. An experimental system for biology species categorization is implemented. Forty three different bacteria organisms selected randomly from EMBL are used for evaluation purpose. And the preliminary experiments show promising results on precision.
Article
Bacterial genomes have diverged during evolution, resulting in clearcut differences in their nucleotide composition, such as their GC content. The analysis of complete sequences of bacterial genomes also reveals the presence of nonrandom sequence variation, manifest in the frequency profile of specific short oligonucleotides. These frequency profiles constitute highly specific genomic signatures. Based on these differences in oligonucleotide frequency between bacterial genomes, we investigated the possibility of predicting the genome of origin for a specific genomic sequence. To this end, we developed a naïve Bayesian classifier and systematically analyzed 28 eubacterial and archaeal genomes. We found that sequences as short as 400 bases could be correctly classified with an accuracy of 85%. We then applied the classifier to the identification of horizontal gene transfer events in whole-genome sequences and demonstrated the validity of our approach by correctly predicting the transfer of both the superoxide dismutase (sodC) and the bioC gene from Haemophilus influenzae to Neisseria meningitidis, correctly identifying both the donor and recipient species. We believe that this classification methodology could be a valuable tool in biodiversity studies.
Article
Dinucleotide composition has been recognized as a species-specific characteristic of organisms for more than 20 years. Lang (2000, Bioinformatics, 16, 212-221), found that in Monilinia rRNA a species-specific identity is conserved when dinucleotide counts are compressed into net dinucleotide counts (e.g., 50AC + 20CA = 30nAC) and clusters of net dinucleotides of equal value (e.g., 30nAC + 30nCT + 30nTA = 30ACTA) which were called circuits. This study evaluates circuit assemblages (CAs)--the collection of all net dinucleotide circuits derived from a sequence--in a diverse set of 110 HIV-1 genomes. The circuit composition, which is often based on ≤15% of the total dinucleotides of a sequence, uniquely characterizes each gene and genome, although the pairwise similarity of the sequences is as low as 70%. Variations in net dinucleotide distributions are associated with structural and functional features of the genome and its proteins. Circuit values of the env signal sequence are different between subtypes that have remained localized and those that have become pandemic. CAs of complete genomes of HIV-1 are similar to other retro-transcribing viruses, and distinct from viroids and single- and double-stranded DNA and RNA viruses. CAs provide a succinct, quantitative, and species-specific description of DNA composition that is consistent with the results of traditional analytic methods at multiple levels of genome organization.
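The net dinucleotide count in the abstract's example (50 AC and 20 CA giving 30 net AC) can be sketched in Python as follows. This is an illustration inferred from that arithmetic alone: the function name `net_dinucleotide_counts` is hypothetical, overlapping counting is an assumption, and the grouping of equal-valued net counts into circuits is not attempted here.

```python
from collections import Counter


def net_dinucleotide_counts(seq):
    """Return {XY: count(XY) - count(YX)} for the more abundant orientation.

    Assumptions: overlapping dinucleotides on one strand; pairs with equal
    counts in both orientations cancel and are omitted; homodinucleotides
    (AA, CC, GG, TT) have no reverse partner and are skipped.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    net = {}
    bases = "ACGT"
    for i, x in enumerate(bases):
        for y in bases[i + 1:]:
            diff = counts[x + y] - counts[y + x]
            if diff > 0:
                net[x + y] = diff        # e.g. 50 AC and 20 CA -> 30 net AC
            elif diff < 0:
                net[y + x] = -diff
    return net
```

For instance, the sequence `ACGACG` contains two AC, two CG, and one GA, so the sketch reports net counts for AC, CG, and GA only.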
Article
Solar UV radiation is a major mutagen that damages DNA through the formation of dimeric photoproducts between adjacent thymine and cytosine bases. A major effect of the GC content of the genome is thus anticipated, in particular in prokaryotes where this parameter significantly varies among species. We quantified the formation of UV-induced photolesions within both isolated and cellular DNA of bacteria of different GC content. First, we could unambiguously show the favored formation of cytosine-containing photoproducts with increasing GC content (from 28 to 72%) in isolated DNA. Thymine-thymine cyclobutane dimer was a minor lesion at high GC content. This trend was confirmed by an accurate and quantitative analysis of the photochemical data based on the exact dinucleotide frequencies of the studied genomes. The observation of the effect of the genome composition on the distribution of photoproducts was then confirmed in living cells, using two marine bacteria exhibiting different GC content. Because cytosine-containing photoproducts are highly mutagenic, it may be predicted that species with genomes exhibiting a high GC content are more susceptible to UV-induced mutagenesis.
Article
Several statistical methods were tested for accuracy in predicting observed frequencies of di- through hexanucleotides in 74,444 bp of E. coli DNA. A Markov chain was most accurate overall, whereas other methods, including a random model based on mononucleotide frequencies, were very inaccurate. When ranked highest to lowest abundance, the observed frequencies of oligonucleotides up to six bases in length in E. coli DNA were highly asymmetric. All ordered abundance plots had a wide linear range containing the majority of the oligomers which deviated sharply at the high and low ends of the curves. In general, values predicted by a Markov chain closely followed the overall shape of the ordered abundance curves. A simple equation was derived by which the frequency of any nucleotide longer than four bases in the E. coli genome (or any genome) can be relatively accurately estimated from the nested set of component tri- and tetranucleotides by serial application of a 3rd order Markov chain. The equation yielded a mean ratio of 1.03±0.94 for the observed-to-expected frequencies of the 4,096 hexanucleotides. Hence, the method is a relatively accurate but not perfect predictor of the length in nucleotides between hexanucleotide sites. Higher accuracy can be achieved using a 4th order Markov chain and larger data sets. The high asymmetry in oligonucleotide abundance means that in the E. coli genome of 4.2 × 10^6 bp many relatively short sequences of 7-9 bp are very rare or absent.
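The abstract does not reproduce the equation itself. A minimal Python sketch of the usual 3rd-order Markov estimate, in which the product of the overlapping tetranucleotide frequencies is divided by the product of the interior trinucleotide frequencies, might look like this. The helper names `kmer_freqs` and `markov3_expected` are hypothetical, and the formula is an assumed standard form rather than the authors' published equation.

```python
from collections import Counter


def kmer_freqs(seq, k):
    """Relative frequencies of overlapping k-mers in seq."""
    total = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(total))
    return {word: c / total for word, c in counts.items()}


def markov3_expected(word, tetra, tri):
    """Estimate the frequency of word (length >= 5) from nested 4-mers and 3-mers.

    For a hexamer w1..w6 this evaluates
        f(w1w2w3w4) * f(w2w3w4w5) * f(w3w4w5w6) / (f(w2w3w4) * f(w3w4w5)).
    """
    if len(word) < 5:
        raise ValueError("word must be longer than four bases")
    estimate = 1.0
    for i in range(len(word) - 3):               # every overlapping tetranucleotide
        estimate *= tetra.get(word[i:i + 4], 0.0)
    for i in range(1, len(word) - 3):            # interior trinucleotides
        denom = tri.get(word[i:i + 3], 0.0)
        if denom == 0.0:
            return 0.0                           # unseen context: estimate is zero
        estimate /= denom
    return estimate
```

In use, `tetra = kmer_freqs(genome, 4)` and `tri = kmer_freqs(genome, 3)` would be computed once, and the estimate compared with the observed frequency from `kmer_freqs(genome, 6)`.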
Article
The sequences of the human genome compiled in DNA databases are now about 10 megabase pairs (Mb), and thus the size of the sequences is several times the average size of chromosome bands at high resolution. By surveying this large quantity of data, it may be possible to clarify the global characteristics of the human genome, that is, correlation of gene sequence data (kb-level) to cytogenetic data (Mb-level). By extensively searching the GenBank database, we calculated codon usages in about 2000 human sequences. The highest G + C percentage at the third codon position was 97%, and that of about 250 sequences was 80% or more. The lowest G + C% was 27%, and that in about 150 sequences was 40% or less. A major portion of the GC-rich genes was found to be on special subsets of R-bands (T-bands and/or terminal R-bands). AT-rich genes, however, were mainly on G bands or non-T-type internal R-bands. Average G + C% at the third position for individual chromosomes differed among chromosomes, and were related to T-band density, quinacrine dullness, and mitotic chiasmata density in the respective chromosomes.
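The G + C percentage at the third codon position used in this survey can be computed as in the following Python sketch. It assumes an in-frame coding sequence given as a plain string; the function name `gc3_percent` is hypothetical, and whether the original study counted the stop codon is not stated, so here every complete codon is included.

```python
def gc3_percent(cds):
    """Percentage of G or C at the third position of each complete codon.

    Assumption: cds starts in frame; trailing bases that do not complete
    a codon are ignored, and the stop codon (if present) is counted.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3        # drop an incomplete final codon
    third = cds[2:usable:3]                 # bases at codon position 3
    if not third:
        raise ValueError("sequence contains no complete codon")
    gc = sum(base in "GC" for base in third)
    return 100.0 * gc / len(third)
```

For the toy sequence `ATGGCCAAATAA` the third-position bases are G, C, A, and A, so the sketch returns 50.0.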
Article
Noting the scarcity of CpG dinucleotide in total genomic DNA derived from higher organisms and the scarcity of TpA dinucleotide in total genomic DNA derived from most life forms, we examined the distribution of these dinucleotides in sequences derived from functionally distinct types of human DNA, including mitochondrial DNA, intergenic DNA, intron DNA, and DNA destined to be represented in the cytoplasm as mRNA, tRNA, or rRNA. While CpG frequency has fallen to its lowest levels in DNA that is transcriptionally silent, TpA is most stringently excluded in DNA destined to be expressed as mRNA in the cytosol. This observation suggests that the selective pressures leading to the removal of CpG and TpA operate at different levels. With respect to TpA, dinucleotide scarcity may reflect a requirement for mRNA stability and may indicate the action of UpA-selective ribonucleases. We propose that, by reason of its instability, UpA must have been very rare in primordial RNA. Therefore, tRNA with the anticodon for this dinucleotide may have failed to evolve, making UpA the primordial doublet "stop" codon. The modern triplet code has faithfully conserved this arrangement in the two universal stop codons, UAA and UAG.
Article
Recently, nearest neighbor patterns were observed in prokaryotic and eukaryotic DNA sequences. These are discussed with respect to some of their biological implications. It is suggested that their origins relate to different specific structures of nearest neighbor base pairs. These patterns strongly constrain the DNA sequence. As such, they "explain" to some degree the amino acid codon choice and have direct bearing on questions related to evolution.
Article
A method for assessing genomic similarity based on relative abundances of short oligonucleotides in large DNA samples is introduced. The method requires neither homologous sequences nor prior sequence alignments. The analysis centers on (i) dinucleotide (and tri- and tetra-) relative abundance extremes in genomic sequences, (ii) distances between sequences based on all dinucleotide relative abundance values, and (iii) a multidimensional partial ordering protocol. The emphasis in this paper is on assessments of general relatedness of genomes as distinguished from phylogenetic reconstructions. Our methods demonstrate that the relative abundance distances almost always differ more for genomic interspecific sequence comparisons than for genomic intraspecific sequence comparisons, indicating congruence over different genome sequence samples. The genomic comparisons are generally concordant with accepted phylogenies among vertebrate and among fungal species sequences. Several unexpected relationships between the major groups of metazoa, fungal, and protist DNA emerge, including the following. (i) Schizosaccharomyces pombe and Saccharomyces cerevisiae in dinucleotide relative abundance distances are as similar to each other as human is to bovine. (ii) S. cerevisiae, although substantially far from, is significantly closer to the vertebrates than are the invertebrates (Drosophila melanogaster, Bombyx mori, and Caenorhabditis elegans). This phenomenon may suggest variable evolutionary rates during the metazoan radiations and slower changes in the fungal divergences, and/or a polyphyletic origin of metazoa. (iii) The genomic sequences of D. melanogaster and Trypanosoma brucei are strikingly similar. This DNA similarity might be explained by some molecular adaptation of the parasite to its dipteran (tsetse fly) host, a host-parasite gene transfer hypothesis. 
Robustness of the methods may be due to a genomic signature of dinucleotide relative abundance values reflecting DNA structures related to dinucleotide stacking energies, constraints of DNA curvature, and mechanisms attendant to replication, repair, and recombination.
Article
A compact mitochondrial gene contains all essential information about the synthesis of mitochondrial proteins which play their roles in a small compartment of the mitochondrium. Almost no noncoding regions have been found through the gene, but a necessary set of tRNAs for the 20 amino acids is provided for biosynthesis, some of them coding different amino acids from those in a usual cell. Since the gene is so compact that the produced proteins would have some characteristic aspects for the mitochondrium, amino acid compositions of mitochondrial proteins (mt-proteins) were examined in the 20-dimensional composition space. The results show that compositions of proteins translated from the mitochondrial genes have a distinct character having more hydrophobic content than others, which is illustrated by a clustered distribution in the multidimensional composition space. The cluster is located at the tail edge of the global distribution pattern of a Gaussian shape for other various kinds of proteins in the space. The mt-proteins are rich in hydrophobic amino acids as is a membrane protein, but are different from other membrane proteins in a lesser content of Val. A good correlation found between the base and amino acid compositions for the mitochondria was examined in comparison to those of organisms such as thermophilic bacterium having an extreme G-C-rich base composition.
Article
The global, rather than local, variation in G+C content along the nuclear DNA sequences of various organisms was studied using GenBank sequence data. When long DNA sequences of the genomes of Escherichia coli and Saccharomyces cerevisiae were examined, the levels of their G+C content (G+C%) were found to be within a narrow range around that of the whole genome. The G+C% levels for sequences of vertebrate genomes, however, were found to cover a wide range, showing that their genome is a mosaic of sequences with different G+C% levels, in each of which the sequence is fairly homogeneous in its G+C% for a very long distance. Through surveying a human genetic map and GenBank DNA sequences, the global variations in G+C% along the human genome DNA were found to be correlated with chromosome band structures.
Article
The folding types of 135 proteins, the three-dimensional structures of which are known, were analyzed in terms of the amino acid composition. The amino acid composition of a protein was expressed as a point in a multidimensional space spanned with 20 axes, on which the corresponding contents of 20 amino acids in the protein were represented. The distribution pattern of proteins in this composition space was examined in relation to five folding types, alpha, beta, alpha/beta, alpha + beta, and irregular type. The results show that amino acid compositions of the alpha, beta, and alpha/beta types are located in different regions in the composition space, thus allowing distinct separation of proteins depending on the folding types. The points representing proteins of the alpha + beta and irregular types, however, are widely scattered in the space, and the existing regions overlap with those of the other folding types. A simple method of utilizing the "distance" in the space was found to be convenient for classification of proteins into the five folding types. The assignment of the folding type with this method gave an accuracy of 70% in the coincidence with the experimental data.
Article
We analyze the dinucleotide frequencies of occurrence and preferences separately within the vertebrates, nonvertebrates, DNA viruses, mitochondria, RNA viruses, bacteria and phage sequences. Over half a million nucleotides from more than 400 sequences were used in this study. Distinct patterns are observed. Some of the patterns are common to all sequences, some to either eukaryotes or prokaryotes and others to the subgroups within them. Doublets are the most basic ingredient of order in nucleotide sequences. We suggest that their preferences and the arrangement of nucleotides in the DNA in general is determined to a large extent by the conformational and packaging considerations of the double helix. Some principles of DNA conformation are viewed in light of our results.
Article
Correlations of the amino acid composition of a protein to its location in an organism, biological function, folding type, and disulfide bond(s) were examined for 356 proteins. In the present data set, 325 proteins of known location and biological characters were divided into 122 intracellular enzymes (BI), 73 intracellular non-enzymes (BII), 45 extracellular enzymes (BIII), and 85 extracellular nonenzymes (BIV). The composition of these proteins were expressed as points in the composition space of 18 orthogonal axes, each representing the content of an amino acid. The distributions of points of BI and BIII were narrow and approximately spherical but those of BII and BIV were distributed rather widely. The groups are separated from each other in the space. We divided the space into four regions (A1 to A4) corresponding to the groups BI to BIV. A protein could be assigned to one of the four groups (A1 to A4) from its amino acid composition: The proteins correctly assigned amounted to 177 out of 195 intracellular proteins, and 94 out of 130 extracellular proteins. The correspondence was about 80% for classification into intracellular and extracellular proteins and 66% for that into the four groups. The folding type also had a significant correlation to the above groups, i.e., intracellular enzymes are rich in alpha/beta, nonenzymes alpha, extracellular enzymes beta and alpha + beta, and nonenzymes beta. The differences in average composition between intra- and extracellular proteins, and between enzymes and nonenzymes were related to the structural characters, i.e., intracellular proteins contain more amino acids favoring alpha-helix than extracellular ones, and enzymes contain more hydrophobic amino acids than nonenzymes. The statistics on 213 Cys-containing proteins showed that disulfide bond(s) are found mostly (90%) in the extracellular proteins. 
The results indicate that amino acid composition is well correlated to location in an organism, biological function, folding type, and disulfide bonding. The implications of the new findings are discussed from the protein-taxonomical point of view, and the validity of the present method is assessed.
Article
Data on amino acid composition were collected in order to classify proteins into groups. The composition of a protein is expressed as a point in an orthogonal coordinate system, taking fractions of amino acids along 18 axes, which represent 18 amino acids (we use Asx and Glx for the sum of Asp and Asn and that of Glu and Gln, respectively). Thus, proteins of known amino acid compositions (356 single polypeptides chains) are distributed as points in this composition space. Since the radial distribution of the points from the origin (the average composition) did not show any distinct separation into groups, we checked the angular distribution of points in the space. Thirteen groups were found by a computer method based on the density. Analysis of the groups in terms of various characters of proteins, such as source (eukaryote or prokaryote), location in an organism, biological function, etc. revealed that the groups have strong correlations to the location (inside or outside the cell) and functional character (enzyme or nonenzyme). Also, the presence of disulfide bond(s) seems to be characteristic of extracellular proteins. Protein source, molecular size and ability to form an oligomer had little correlation to the grouping. Therefore, proteins may be classified into four types as follows: BI, intracellular enzymes; BII, intracellular nonenzymes; BIII, extracellular enzymes; and BIV, extracellular nonenzymes.
Article
Amino acid compositions of 356 proteins are expressed as points in an 18 dimensional space of 18 axes representing the contents of amino acids. The proteins are classified into four groups of intra- and extracellular enzymes and nonenzymes according to analysis of the distribution of the points. The groups have a significant correlation to four folding types of secondary structures, and extra- and intracellular proteins to those with and without disulfide bond(s), respectively. The location and function of a protein seem to determine its amino acid composition and folding type.
Article
Early biochemical experiments established that the set of dinucleotide odds ratios or 'general design' is a remarkably stable property of the DNA of an organism, which is essentially the same in protein-coding DNA, bulk genomic DNA, and in different renaturation rate and density gradient fractions of genomic DNA in many organisms. Analysis of currently available genomic sequence data has extended these earlier results, showing that the general designs of disjoint samples of a genome are substantially more similar to each other than to those of sequences from other organisms and that closely related organisms have similar general designs. From this perspective, the set of dinucleotide odds ratio (relative abundance) values constitute a signature of each DNA genome, which can discriminate between sequences from different organisms. Dinucleotide-odds ratio values appear to reflect not only the chemistry of dinucleotide stacking energies and base-step conformational preferences, but also the species-specific properties of DNA modification, replication and repair mechanisms.
Article
Genomic homogeneity is investigated for a broad base of DNA sequences in terms of dinucleotide relative abundance distances (abbreviated delta-distances) and of oligonucleotide compositional extremes. It is shown that delta-distances between different genomic sequences in the same species are low, only about 2 or 3 times the distance found in random DNA, and are generally smaller than the between-species delta-distances. Extremes in short oligonucleotides include underrepresentation of TpA and overrepresentation of GpC in most temperate bacteriophage sequences; underrepresentation of CTAG in most eubacterial genomes; underrepresentation of GATC in most bacteriophage; CpG suppression in vertebrates, in all animal mitochondrial genomes, and in many thermophilic bacterial sequences; and overrepresentation of GpG/CpC in all animal mitochondrial sets and chloroplast genomes. Interpretations center on DNA structures (dinucleotide stacking energies, DNA curvature and superhelicity, nucleosome organization), context-dependent mutational events, methylation effects, and processes of replication and repair.
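The delta-distance mentioned above can be sketched as follows. This assumes the usual definition, the mean absolute difference between the 16 dinucleotide relative abundance values of two sequences, with counts pooled over each sequence and its reverse complement; the function names are hypothetical, and the abstract does not spell out the formula.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def symmetrized_odds_ratios(seq):
    """Odds ratios f_XY / (f_X * f_Y) pooled over a strand and its reverse complement."""
    seq = seq.upper()
    if len(seq) < 2 or set(seq) - set(BASES):
        raise ValueError("expected an A/C/G/T sequence of length >= 2")
    strands = (seq, seq.translate(COMPLEMENT)[::-1])
    mono, di = Counter(), Counter()
    for strand in strands:
        mono.update(strand)
        di.update(strand[i:i + 2] for i in range(len(strand) - 1))
    n_mono, n_di = sum(mono.values()), sum(di.values())
    ratios = {}
    for x, y in product(BASES, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono)
        # Assumption: a dinucleotide with an absent base is scored as 0.
        ratios[x + y] = (di[x + y] / n_di) / expected if expected else 0.0
    return ratios

def delta_distance(seq_f, seq_g):
    """Mean absolute difference of the 16 symmetrized odds ratios."""
    rf = symmetrized_odds_ratios(seq_f)
    rg = symmetrized_odds_ratios(seq_g)
    return sum(abs(rf[k] - rg[k]) for k in rf) / 16
```

Because the counts are pooled over both strands, a sequence and its reverse complement are at distance zero, which suits comparisons of double-stranded DNA samples whose strand orientation is arbitrary.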
Article
Genomic similarities and contrasts are investigated in a collection of 23 bacteriophages, including phages with temperate, lytic, and parasitic life histories, with varied sequence organizations, and with different hosts and morphologies. Comparisons use relative abundances of di-, tri-, and tetranucleotides from entire genomes. We highlight several specific findings. (i) As previously shown for cellular genomes, each viral genome has a distinctive signature of short oligonucleotide abundances that pervade the entire genome and distinguish it from other genomes. (ii) The enteric temperate double-stranded (ds) phages, like enterobacteria, exhibit significantly high relative abundances of GpC = GC and significantly low values of TA, but no such extremes exist in ds lytic phages. (iii) The tetranucleotide CTAG is of statistically low relative abundance in most phages. (iv) The DAM methylase site GATC is of statistically low relative abundance in most phages, but not in P1. This difference may relate to controls on replication (e.g., actions of the host SeqA gene product) and to MutH cleavage potential of the Escherichia coli DAM mismatch repair system. (v) The enteric temperate dsDNA phages form a coherent group: they are relatively close to each other and to their bacterial hosts in average differences of dinucleotide relative abundance values. By contrast, the lytic dsDNA phages do not form a coherent group. This difference may come about because the temperate phages acquire more sequence characteristics of the host because they use the host replication and repair machinery, whereas the analyzed lytic phages are replicated by their own machinery. (vi) The nonenteric temperate phages with mycoplasmal and mycobacterial hosts are relatively close to their respective hosts and relatively distant from any of the enteric hosts and from the other phages. (vii) The single-stranded RNA phages have dinucleotide relative abundance values closest to those for random sequences, presumably attributable to the mutation rates of RNA phages being much greater than those of DNA phages.