Conference Paper

Species identification based on approximate matching

Abstract

Genomic data mining and knowledge extraction are important problems in bioinformatics. Existing methods for species identification are based on n-grams. In this paper, we propose a novel approach to species identification. Given a database of genomic sequences, the proposed method extracts all candidate subsequences that satisfy three conditions: length greater than or equal to a given minimum length, a given number of mismatches, and support greater than or equal to a user threshold. These patterns are used as features for a classifier. Genome sequences are classified using the data mining techniques Naive Bayes, support vector machine, and nearest neighbor. Individual classifier accuracies are reported. We also show the effect of sampling size on classification accuracy; accuracy was observed to increase with sampling size. Genome data of two species, E. coli and yeast, are used to verify the proposed method.

... This process is repeated for all the analysed sequences. The n-gram can be represented in binary form [14,15] or either in dinucleotide [16] or trinucleotide [17,18] frequencies. ...
... Patil et al. [14] proposed a method for species identification based on approximate pattern matching. The novelty in the work was feature extraction technique for genome data. ...
... The existing n-gram based methods extract frequencies of 4^n features from genome data. In [14], the authors extracted all candidate subsequences that satisfy: length greater than or equal to a given minimum length, a given number of mismatches, and support greater than or equal to a user threshold. These frequent subsequences are used as features to construct a binary table where the presence or absence of an attribute/feature in a sequence is represented by 1 or 0 respectively. ...
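The feature construction described in this excerpt can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that "mismatches" means Hamming distance between a pattern and an equal-length window, that support is the number of sequences containing an approximate occurrence, and that candidates have exactly the minimum length (the paper allows longer ones). All function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_approx(pattern, sequence, max_mismatches):
    """True if some window of `sequence` is within max_mismatches of `pattern`."""
    m = len(pattern)
    return any(
        hamming(pattern, sequence[i:i + m]) <= max_mismatches
        for i in range(len(sequence) - m + 1)
    )


def frequent_patterns(sequences, min_len, max_mismatches, min_support):
    """Keep candidates whose approximate support reaches min_support.

    Assumption: candidates are all substrings of length exactly min_len.
    """
    candidates = sorted(
        {s[i:i + min_len] for s in sequences for i in range(len(s) - min_len + 1)}
    )
    kept = []
    for pattern in candidates:
        support = sum(occurs_approx(pattern, s, max_mismatches) for s in sequences)
        if support >= min_support:
            kept.append(pattern)
    return kept


def binary_table(sequences, patterns, max_mismatches):
    """One row per sequence, one column per pattern: 1 if present, else 0."""
    return [
        [int(occurs_approx(p, s, max_mismatches)) for p in patterns]
        for s in sequences
    ]


seqs = ["ACGT", "ACGA", "TTTT"]
patterns = frequent_patterns(seqs, min_len=3, max_mismatches=1, min_support=2)
table = binary_table(seqs, patterns, max_mismatches=1)
```

Each row of `table` could then be passed to a classifier as the feature vector for the corresponding sequence.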
Article
Genomic data mining and knowledge extraction are important problems in bioinformatics. Some research has been done on unknown genome identification, based on exact pattern matching of n-grams. In most real-world biological problems, exact matching may not give the desired results, and n-grams suffer from exponential explosion. In this paper we propose a method for genome data classification based on approximate matching. The algorithm works by selecting random samples from the genome database. Tolerance is allowed by generating candidates of varied length to query from these sample sequences. The Levenshtein distance is then computed for each candidate to check whether the strings are k-fuzzily equal. The total number of fuzzy matches for each sequence is then calculated. The sequences are then classified using the data mining techniques naive Bayes, support vector machine, back propagation, and nearest neighbor. Experimental results are provided for different tolerance levels, and they show that accuracy increases as tolerance does. We also show the effect of sampling size on classification accuracy; accuracy was observed to increase with sampling size. Genome data of two species, yeast and E. coli, are used to verify the proposed method.
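The fuzzy-matching step can be illustrated with a short sketch. It assumes that two strings are "k-fuzzily equal" when their Levenshtein distance is at most k, and it compares the query only against windows of the same length, which simplifies the varied-length candidates mentioned in the abstract. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def k_fuzzy_equal(a, b, k):
    """Assumed definition: equal up to at most k edits."""
    return levenshtein(a, b) <= k


def fuzzy_match_count(query, sequence, k):
    """Count windows of len(query) in `sequence` that are k-fuzzily equal to `query`."""
    m = len(query)
    return sum(
        k_fuzzy_equal(query, sequence[i:i + m], k)
        for i in range(len(sequence) - m + 1)
    )
```

Under these assumptions, the per-sequence match counts for a set of sampled queries would form the feature vector passed to the classifiers.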
... Nonetheless, it is more frequently seen in small datasets. The field of influenza subtyping [24], viral fragment classification, HPV genotype prediction [25], taxonomic grouping of bacterial and eukaryotic genomes [17], identification of microbial DNA sequences [26], differentiation between the genomes of E. coli and yeast [27], classification of bacterial genome fragments [25], classification of splicing-related sequence snippets [28], and classification of archaeal and bacterial groups [29] have all been observed applications of these techniques. ...
Article
The application of deep learning for taxonomic categorization of DNA sequences is investigated in this study. Two deep learning architectures, namely the Stacked Convolutional Autoencoder (SCAE) with Multilabel Extreme Learning Machine (MLELM) and the Variational Convolutional Autoencoder (VCAE) with MLELM, have been proposed. These designs provide precise feature maps for individual and inter-label interactions within DNA sequences, capturing their spatial and temporal properties. The collected features are subsequently fed into MLELM networks, which yield soft classification scores and hard labels. The proposed algorithms underwent thorough training and testing on unsupervised data, whereby one or more labels were concurrently taken into account. The introduction of the clade label resulted in improved accuracy for both models compared to the class or genus labels, probably owing to the occurrence of large clusters of similar nucleotides inside a DNA strand. In all circumstances, the VCAE-MLELM model consistently outperformed the SCAE-MLELM model. The best accuracy attained by the VCAE-MLELM model when the clade and family labels were combined was 94%. However, accuracy ratings for single-label categorization using either approach were less than 65%. The approach’s effectiveness is based on MLELM networks, which record connected patterns across classes for accurate label categorization. This study advances deep learning in biological taxonomy by emphasizing the significance of combining numerous labels for increased classification accuracy.
... In supervised classification, k-mer frequency or percentage vectors have been utilized, but usually with tiny datasets. These vectors have been used to subtype influenza and classify polyoma and rhinovirus fragments [102], to predict HPV genotypes [144], [135], to classify full bacterial genomes to their appropriate taxonomic groups at various levels [150], to classify whole eukaryotic mitochondrial genomes [112][113][114][115], to categorize 27 microbial nuclear DNA sequences [132], to classify hundreds of thousands of short (less than 10,000 base pairs long) prokaryote sequences into various phylogenetic groupings [80], [81], [149], to discriminate extremely small samples of the E. coli and yeast genomes [145], to classify short bacterial genome fragments from 28 species [144], to identify larger bacterial genome fragments from 118 species [148], to categorize short splicing-related sequence snippets [147], [146], and to classify various archaeal and bacterial groups [97]. ...
Thesis
Deep learning is one of the most prominent machine learning approaches today because of its capacity to autonomously extract features from massive volumes of data and automatically learn meaningful representations from them. Image and speech recognition, as well as robotics, are some of the domains where it is used. Deep learning has found usage in biology as a result of the recent increase in biological 'omics' data, including applications in early cancer detection and protein-protein interactions. In this research, we have used one-hot encoding on the DNA strands to convert the text into numbers and unique color platelets for each 4-mer in a sequence. Additionally, each DNA sequence number is tagged with labels such as genus, family, order, class, phylum, and clade. The labels dataset, along with the one-hot encoding or image dataset, is then fed into the deep learning algorithms to classify the taxonomic labels of the DNA strand. The deep learning architectures proposed in this research are Stacked Convolutional Autoencoder (SCAE) with Multi-label Extreme Learning Machine (MLELM) and Variational Convolutional Autoencoder (VCAE) with MLELM. SCAE and VCAE generate the detailed feature map for individual taxonomic labels, and for interactions between labels, of a DNA sequence from the one-hot encoding of the DNA sequence input data by identifying the spatial and temporal salient qualities. The feature vector is then fed to the first MLELM network to produce a soft classification score for each data point, based on which the second MLELM network generates hard labels. The suggested methods were extensively trained and tested on unsupervised data by considering one or more labels at a time. The model is also able to classify the DNA sequence characteristics based on the phylogenetic tree.
Through experimentation, it was found that, for both models, accuracy is better when classifying the host of the DNA sequence with the clade label than with the class or genus label, owing to the presence of large, similar groups of nucleotides within a DNA strand. Moreover, it was also observed that VCAE-MLELM performs much better than SCAE-MLELM under all circumstances, owing to its neural network structure. The highest accuracy obtained by the VCAE-MLELM model when classifying the DNA sequences with the clade and family labels considered together is 94%, while SCAE-MLELM obtains 78% with the clade-family labels. Single-label classification with either algorithm yields accuracy scores lower than 65%. It is because of the MLELM networks that it is possible to classify labels based on linked patterns between classes.
Article
String matching has become essential for modern computers. It is used in many applications ranging from data mining to network security. A problem is that current general purpose computers are no longer fast enough to deal with the ever increasing amounts of data that are passed through them due to the massive increases in network traffic and data storage capacities offered. This paper aims to demonstrate the significant performance gains that can be achieved by employing string matching algorithms directly in hardware using an FPGA, as opposed to the traditional software-only solution. A possible future FPGA-based string matching board that could be installed in current computers is discussed.
Article
Classification of organisms into different categories using their genomic sequences has found its importance in the study of evolutionary characteristics of organisms and specific identification of previously unknown organisms in biodiversity studies and related areas. Chaos game representation (CGR) uniquely represents DNA sequence in a visual format and reveals hidden patterns in it. Frequency CGR (FCGR) derived from CGR shows the frequency of sub-sequences present in the DNA sequence. In this paper, a novel method for classification of organisms based on a combination of FCGR and Artificial Neural network (ANN) is proposed. Eight categories, from the taxonomical distribution of Eukaryotic organisms are taken from NCBI and ANN is used for classification. Different configurations of ANN are tested and good accuracy is obtained. The way, the fractal nature of CGR helps in classification, is also investigated.
Article
The 4,639,221–base pair sequence of Escherichia coli K-12 is presented. Of 4288 protein-coding genes annotated, 38 percent have no attributed function. Comparison with five other sequenced microbes reveals ubiquitous as well as narrowly distributed gene families; many families of similar genes within E. coli are also evident. The largest family of paralogous proteins contains 80 ABC transporters. The genome as a whole is strikingly organized with respect to the local direction of replication; guanines, oligonucleotides possibly related to replication and recombination, and most genes are so oriented. The genome also contains insertion sequence (IS) elements, phage remnants, and many other patches of unusual composition indicating genome plasticity through horizontal transfer.
Article
A set of 16 kinds of dinucleotide compositions was used to analyze the protein-encoding nucleotide sequences in nine complete genomes: Escherichia coli, Haemophilus influenzae, Helicobacter pylori, Mycoplasma genitalium, Mycoplasma pneumoniae, Synechocystis sp., Methanococcus jannaschii, Archaeoglobus fulgidus, and Saccharomyces cerevisiae. The dinucleotide composition was significantly different between the organisms. The distribution of genes from an organism was clustered around its center in the dinucleotide composition space. The genes from closely related organisms such as Gram-negative bacteria, mycoplasma species and eukaryotes showed some overlap in the space. The genes from nine complete genomes together with those from human were discriminated into respective clusters with 80% accuracy using the dinucleotide composition alone. The composition data estimated from a whole genome was close to that obtained from genes, indicating that the characteristic feature of dinucleotides holds not only for protein coding regions but also noncoding regions. When a dendrogram was constructed from the disposition of the clusters in the dinucleotide space, it resembled the real phylogenetic tree. Thus, the distinct feature observed in the dinucleotide composition may reflect the phylogenetic relationship of organisms.
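A 16-dimensional dinucleotide composition vector of the kind described above can be computed as in the sketch below. The abstract does not state the exact normalization, so the choices here are assumptions: overlapping dinucleotides are counted, pairs containing characters other than A, C, G, T are skipped, and counts are divided by the total number of valid pairs. The function name is hypothetical.

```python
from itertools import product


def dinucleotide_composition(seq):
    """Relative frequency of each of the 16 dinucleotides (overlapping windows)."""
    seq = seq.upper()
    keys = ["".join(p) for p in product("ACGT", repeat=2)]
    counts = dict.fromkeys(keys, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs with ambiguous bases such as N
            counts[pair] += 1
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


composition = dinucleotide_composition("ACGT")
```

Each gene or genome then corresponds to a point in a 16-dimensional space, where clustering or discrimination between organisms can be attempted.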
Article
We explored DNA structures of genomes by means of a new tool derived from the "chaotic dynamical systems" theory (the so-called chaos game representation [CGR]), which allows the depiction of frequencies of oligonucleotides in the form of images. Using CGR, we observe that subsequences of a genome exhibit the main characteristics of the whole genome, attesting to the validity of the genomic signature concept. Base concentrations, stretches (runs of complementary bases or purines/pyrimidines), and patches (over- or underexpressed words of various lengths) are the main factors explaining the variability observed among sequences. The distance between images may be considered a measure of phylogenetic proximity. Eukaryotes and prokaryotes can be identified merely on the basis of their DNA structures.
Article
Motivation: In drug discovery a key task is to identify characteristics that separate active (binding) compounds from inactive (non-binding) ones. An automated prediction system can help reduce resources necessary to carry out this task. Results: Two methods for prediction of molecular bioactivity for drug design are introduced and shown to perform well in a data set previously studied as part of the KDD (Knowledge Discovery and Data Mining) Cup 2001. The data is characterized by very few positive examples, a very large number of features (describing three-dimensional properties of the molecules) and rather different distributions between training and test data. Two techniques are introduced specifically to tackle these problems: a feature selection method for unbalanced data and a classifier which adapts to the distribution of the unlabeled test data (a so-called transductive method). We show both techniques improve identification performance and in conjunction provide an improvement over using only one of the techniques. Our results suggest the importance of taking into account the characteristics in this data which may also be relevant in other problems of a similar type.
Article
Traditionally, housekeeping and tissue specific genes have been classified using direct assay of mRNA presence across different tissues, but these experiments are costly and the results not easy to compare and reproduce. In this work, a Naive Bayes classifier based only on physical and functional characteristics of genes already available in databases, like exon length and measures of chromatin compactness, has achieved a 97% success rate in classification of human housekeeping genes (93% for mouse and 90% for fruit fly). The newly obtained lists of housekeeping and tissue specific genes adhere to the expected functions and tissue expression patterns for the two classes. Overall, the classifier shows promise, and in the future additional attributes might be included to improve its discriminating power.
Article
The present article provides a survey of the available literature on data mining using soft computing. A categorization has been provided based on the different soft computing tools and their hybridizations used, the data mining function implemented, and the preference criterion selected by the model. The utility of the different soft computing methodologies is highlighted. Generally fuzzy sets are suitable for handling the issues related to understandability of patterns, incomplete/noisy data, mixed media information and human interaction, and can provide approximate solutions faster. Neural networks are nonparametric, robust, and exhibit good learning and generalization capabilities in data-rich environments. Genetic algorithms provide efficient search algorithms to select a model, from mixed media data, based on some preference criterion/objective function. Rough sets are suitable for handling different types of uncertainty in data. Some challenges to data mining and the application of soft computing methodologies are indicated. An extensive bibliography is also included.
Article
Classification of unknown genomes finds wide application in areas like evolutionary studies, bio-diversity researches and forensic studies which are viewed in a renewed 'genomic' perspective, lately. Only a few attempts are seen in literature focusing on unknown genome identification, and the reported accuracies are not more than 85%. Most works report classification into the major kingdoms only, not venturing further into their sub-classes. A novel combined technique of Chaos Game Representation (CGR) and machine learning is proposed, the former for feature extraction and the latter for subsequent sequence classification. Eight sub categories of eukaryotic mitochondrial genomes from NCBI are used for the study. The sequences are initially mapped into their Chaos Game Representation format. Genomic feature extraction is implemented by computing the Frequency Chaos Game Representation (FCGR) matrix. An order 3 FCGR matrix is considered here, which consists of 64 elements. The 64 element matrix acts as the feature descriptor for classification. The classification methods used are Difference Boosting Naïve Bayesian (DBNB) based method, Artificial Neural Network (ANN) based and Support Vector Machine (SVM) based methods. Accuracies of individual methods are reported. Although the average accuracy is seen highest for the SVM-CGR combination, better accuracies are seen for some categories in other methods too. Hence a voting classifier is implemented combining all the three methods. Accuracies of 100% were obtained for Vertebrata and Porifera whereas Acoelomata, Cnidaria and Fungi were classified with accuracies above 90%. The accuracies obtained for Protostomia, Plant, and Pseudocoelomata were respectively 90, 82 and 77%.
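The order-3 FCGR feature descriptor can be sketched as an 8 x 8 grid of 3-mer counts. The corner assignment of the four bases varies between CGR implementations and is not given in the abstract, so the mapping below is an assumption, as is the convention that later bases in a k-mer select the coarser quadrant. The names `CORNERS` and `fcgr` are hypothetical.

```python
# Assumed (x, y) corner bit for each base; other CGR layouts exist.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(seq, k=3):
    """Count k-mers into a 2^k x 2^k grid addressed by CGR-style quadrants."""
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in CORNERS for base in kmer):
            continue  # skip k-mers containing ambiguous bases
        x = y = 0
        for pos, base in enumerate(kmer):
            bx, by = CORNERS[base]
            # Later bases set higher bits, so the last base picks the largest quadrant.
            x |= bx << pos
            y |= by << pos
        matrix[y][x] += 1
    return matrix


grid = fcgr("ACGTACGT")
features = [count for row in grid for count in row]  # 64-element descriptor
```

Flattening the grid gives the 64-element vector that a classifier such as an SVM or ANN would take as input.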
Article
Problem statement: Sequence analysis problems are NP-hard and need optimal solutions. Interesting problems include duplicate sequence detection, sequence matching by relevance, sequence analysis using approximate comparison in general or using tools such as Matlab, and multi-lingual sequence analysis. The usefulness of these operations is highlighted and future expectations are described. Approach: This study described the concepts, tools, methodologies, and algorithms being used for sequence analysis. The sequences contained precious information that needed to be mined for useful purposes. Considerable care was required to model the optimal solution. The similarity and alignment concepts cannot be addressed directly with one technique or algorithm; better performance was achieved by combining different concepts. Results: We compared different approaches using exemplary data and found that ClustalW2 is a fairly good tool in terms of analysis. We assigned different weight values for relevant features and obtained a score of 95 in comparison and 45 in alignment. Conclusion: Different techniques and approaches were evaluated and compared.
Article
In this paper we address the problem of automated classification of isolates, i.e., the problem of determining the family of genomes to which a given genome belongs. Additionally, we address the problem of automated unsupervised hierarchical clustering of isolates according only to their statistical substring properties. For both of these problems we present novel algorithms based on nucleotide n-grams, with no required preprocessing steps such as sequence alignment. Results obtained experimentally are very positive and suggest that the proposed techniques can be successfully used in a variety of related problems. The reported experiments demonstrate better performance than some of the state-of-the-art methods. We report on a new distance measure between n-gram profiles, which shows superior performance compared to many other measures, including commonly used Euclidean distance.
Conference Paper
The exploration of huge genomic DNA sequences (up to several megabases) needs a new kind of data representation allowing robust analyses. With the help of the chaos game representation method (CGR), fractal images can be generated, which make it possible to observe, at a glance, frequencies of words (small sequences of the four bases: G, A, T, C) in DNA sequences. Classification of CGR images and extraction of main features are the issues addressed in this work, using a classical statistical analysis (principal component analysis) and neural networks grounded on the curvilinear component analysis algorithm and the Kohonen map.
Article
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices according to each case. We conclude with some future work directions and open problems.