International Journal of Data Mining and Bioinformatics (INT J DATA MIN BIOIN)


Mining bioinformatics data is an emerging area at the intersection of bioinformatics and data mining. The objective of the IJDMB is to facilitate collaboration between data mining researchers and bioinformaticians by presenting cutting-edge research topics and methodologies in the area of data mining for bioinformatics. This perspective acknowledges the interdisciplinary nature of the research in data mining and bioinformatics and provides a unified forum for researchers, practitioners, students and policy makers to share the latest research and developments in this fast-growing multidisciplinary research area.

Impact factor 0.66

  • Website
    International Journal of Data Mining and Bioinformatics website
  • Other titles
    IJDMB, Data mining and bioinformatics
  • Material type
    Document, Periodical, Internet resource
  • Document type
    Internet Resource, Computer File, Journal / Magazine / Newspaper

Publications in this journal

  •
    ABSTRACT: Early stage infections caused by fungal/oomycete spores may not be detected until signs or symptoms develop. Serological and molecular techniques are currently used for detecting these pathogens. Next-generation sequencing (NGS) has potential as a diagnostic tool, due to the capacity to target multiple unique signature loci of pathogens in an infected plant metagenome. NGS has significant potential for diagnosis of important eukaryotic plant pathogens. However, the assembly and analysis of huge amounts of sequence is laborious, time consuming, and not necessary for diagnostic purposes. Previous work demonstrated that a bioinformatic tool termed Electronic probe Diagnostic Nucleic acid Analysis (EDNA) had potential for greatly simplifying the detection of fungal and oomycete plant pathogens in simulated metagenomes. The initial study demonstrated limitations for detection accuracy related to the analysis of matches between queries and metagenome reads. This study is a modification of EDNA demonstrating better accuracy for detecting fungal and oomycete plant pathogens.
    International Journal of Data Mining and Bioinformatics 01/2015; In press.
  •
    ABSTRACT: Computational annotation and prediction of protein structure is very important in the post-genome era due to the existence of many different proteins, most of which are yet to be verified. Mutual information based feature selection methods can be used in selecting such minimal yet predictive subsets of features. However, as protein features are organised into natural partitions, individual feature selection that ignores the presence of these views, dismantles them, and treats their variables intermixed along with those of others at best results in a complex, un-interpretable predictive system for such multi-view datasets. In this paper, instead of selecting a subset of individual features, each feature subset is passed through a clustering step so that it is represented in discrete form using the cluster indices; this makes mutual information based methods applicable to view-selection. We present our experimental results on a multi-view protein dataset that is used to predict protein structure.
    International Journal of Data Mining and Bioinformatics 04/2014; 10(2):162-174.
  •
    ABSTRACT: Oligonucleotide sets are widely used in molecular biology to target a group of nucleic acid sequences using Polymerase Chain Reaction (PCR)-based technologies. Currently, the global matching efficiency of an oligonucleotide set is considered to be equal to the lowest matching efficiency calculated for each oligonucleotide. However, sequences matching the limiting oligonucleotide do not always match the other oligonucleotides of the set, resulting in a biased evaluation of the matching efficiency. The OligoSpecificitySystem program avoids this bias by calculating the real global matching efficiency of oligonucleotide sets. It can process all kinds of oligonucleotide sets, including the number of oligonucleotides, base pair degeneracy occurrences or mismatch occurrences.
    International Journal of Data Mining and Bioinformatics 01/2014; 9(4):417 - 423.
  •
    ABSTRACT: The large amount of textual knowledge in the existing biomedical literature is growing rapidly, and the creation of manual patterns from the available literature is becoming more difficult. There is an increasing demand to extract potential generic regulatory relationships from unlabelled data sets. In this paper, we describe a Semi-Supervised, Weighted Pattern Learning method (SSWPL) to extract such generic regulatory information from the literature. SSWPL can build new regulatory patterns according to predefined initial patterns from unlabelled data in the literature. These constructed regulatory patterns are then used to extract generic regulatory information from PubMed abstracts. The results presented herein demonstrate that our method can be utilised to effectively extract generic regulatory relationships from the literature by using learned, weighted patterns through semi-supervised pattern learning.
    International Journal of Data Mining and Bioinformatics 01/2014; 9(4):401 - 416.
  •
    ABSTRACT: Messenger RNA (mRNA) is the template for protein synthesis. It carries information from DNA in the nucleus to the ribosome sites of protein synthesis in the cell. The turnover process of mRNA is a chemical event with multiple small step reactions and the degradation of mRNA molecules is an important step in gene expression. A number of mathematical models have been proposed to study the dynamics of mRNA turnover, ranging from a one-step first-order reaction model to the linear multi-component models. Although the linear multi-component models provide detailed dynamics of mRNA degradation, the simple first-order reaction model has been widely used in mathematical modelling of genetic regulatory networks. To illustrate the difference between these models, we first considered a stochastic model based on the multi-component model. Then a simpler stochastic model was proposed to approximate the linear multi-component model. We also discussed the delayed one-step reaction models with different types of time delay, including the constant delay, exponentially distributed delay and Erlang distributed delay. The comparison study suggested that the one-step reaction models failed to realise the dynamics of mRNA turnover accurately. Therefore, more sophisticated one-step reaction models are needed to study the dynamics of mRNA degradation.
    International Journal of Data Mining and Bioinformatics 01/2014; 10(1):18 - 32.
  •
    ABSTRACT: The aim of the study is to evaluate gene component analysis for microarray studies. Three dimension reduction strategies, Principal Component Regression (PCR), Partial Least Squares (PLS) and Reduced Rank Regression (RRR), were applied to a publicly available breast cancer microarray dataset and the derived gene components were used for tumor classification by Logistic Regression (LR) and Linear Discriminant Analysis (LDA). The impact of gene selection/filtration was evaluated as well. We demonstrated that gene component classifiers could reduce the high dimensionality of gene expression data and the collinearity problem inherent in most modern microarray experiments. In our study, gene component analysis could discriminate Estrogen Receptor (ER) positive breast cancers from negative cancers, and the proposed classifiers were successfully reproduced and projected into an independent microarray dataset with high predictive accuracy.
    International Journal of Data Mining and Bioinformatics 01/2014; 9(2):149-71.
  •
    ABSTRACT: The prediction of operons is a critical step for the reconstruction of biochemical and regulatory networks at the whole genome level. In this paper, a novel operon prediction model is proposed based on Markov Clustering (MCL). The model employs a graph-clustering method by MCL for prediction and does not need a classifier. In the cross-species validation, the accuracies of E. coli K12, Bacillus subtilis and P. furiosus are 92.1, 86.9 and 87.3%, respectively. Experimental results show that the proposed method has a powerful capability of operon prediction. The compiled program and test data sets are publicly available at
    International Journal of Data Mining and Bioinformatics 01/2014; 9(4):424 - 443.
  •
    ABSTRACT: Understanding the interaction patterns among biological entities in a pathway can potentially reveal the role of the entities in biological systems. Although considerable effort has been devoted to this direction, querying biological pathways has remained relatively unexplored. Querying is fundamentally different in that we retrieve pathways satisfying a given property in terms of their topology or constituents. One such property is subnetwork matching using various constituent parameters. In this paper, we introduce a logic based framework for querying biological pathways using a novel and generic subgraph isomorphism computation technique. We develop a graphical interface called IsoKEGG to facilitate flexible querying of KEGG pathways based on isomorphic pathway topologies as well as matching any combination of node names, types, and edges. It allows editing KGML represented query pathways and returns all isomorphic patterns in KEGG pathways satisfying a given query condition for further analysis.
    International Journal of Data Mining and Bioinformatics 01/2014; 9(1):1-21.
  •
    ABSTRACT: In this paper, some methods for ensemble learning of protein fold recognition based on a decision tree (DT) are compared and contrasted against each other over three datasets taken from the literature. According to previously reported studies, the features of the datasets are divided into some groups. Then, for each of these groups, three ensemble classifiers, namely, random forest, rotation forest and AdaBoost.M1 are employed. Also, some fusion methods are introduced for combining the ensemble classifiers obtained in the previous step. After this step, three classifiers are produced based on the combination of classifiers of types random forest, rotation forest and AdaBoost.M1. Finally, the three different classifiers achieved are combined to make an overall classifier. Experimental results show that the overall classifier obtained by the genetic algorithm (GA) weighting fusion method, is the best one in comparison to previously applied methods in terms of classification accuracy.
    International Journal of Data Mining and Bioinformatics 01/2014; 9(1):89-105.
  •
    ABSTRACT: In recent years, mass spectrometry data analysis has become an important protein identification technique. The mass spectrometry technologies emerge as useful tools for biomarker discovery through studying protein profiles in various biological specimens. In mining mass spectrometry datasets, peak alignment is a critical issue among the preprocessing steps that affect the quality of analysis results. However, the existing peak alignment methods are sensitive to noise peaks across various mass spectrometry samples. In this paper, we proposed a novel algorithm named Two-Phase Clustering for peak Alignment (TPC-Align) to align mass spectrometry peaks across samples in the pre-processing phase. The TPC-Align algorithm sequentially considers the distribution of intensity values and the locations of mass-to-charge ratio values of peaks between samples. Moreover, TPC-Align algorithm can also report a list of significantly differential peaks between samples, which serve as the candidate biomarkers for further biological study. The proposed peak alignment method was compared to the current peak alignment approach based on one-dimension hierarchical clustering through experimental evaluations and the results show that TPC-Align outperforms the traditional method on the real dataset.
    International Journal of Data Mining and Bioinformatics 01/2014; 9(1):52-66.
  •
    ABSTRACT: We present NetLoc, a novel diffusion Kernel-based Logistic Regression (KLR) algorithm for predicting protein subcellular localisation using four types of protein networks including physical Protein-Protein Interaction (PPI) networks, genetic PPI networks, mixed PPI networks and co-expression networks. NetLoc is applied to yeast protein localisation prediction. The results showed that protein networks can provide rich information for protein localisation prediction, achieving an Area Under Curve (AUC) score of 0.93. We also showed that networks with high connectivity and a high percentage of co-localised PPI lead to better prediction performance. Investigation showed that NetLoc is a very robust approach which can produce good performance (AUC = 0.75) using only 30% of original interactions and is capable of producing overall accuracy greater than 0.5 with only 20% annotation coverage. Compared to the previous network feature based prediction algorithm which achieved AUC scores of 0.49 and 0.52 on the yeast PPI network, NetLoc achieved significantly better overall performance with an AUC of 0.74.
    International Journal of Data Mining and Bioinformatics 01/2014; 9(4):386 - 400.
  •
    ABSTRACT: Although Ant-Miner has been used with relative ease for datasets with categorical data and small-sized feature vectors, microarray datasets, which contain a few samples with a large number of genes, are a totally different story. Ant-Miner is an ant colony optimisation algorithm that extracts predictive rules from datasets and intrinsically works on discrete values. This study has developed a new algorithm, "Enhanced Ant-Miner" (EAM), based on previous works. EAM deals with continuous attributes as well as categorical ones and presents its captured models in the form of predictive rules. EAM has been tested versus SVM, CN2, K-means and hierarchical clustering and the results show that EAM is the best in terms of predictive accuracy. Additionally, its agent-based nature gives it a much more appealing ability to speed up the whole process when compared to other trivial miners.
    International Journal of Data Mining and Bioinformatics 01/2014; 10(1):83 - 97.
  •
    ABSTRACT: With the availability of full-text documents in many online databases, the paradigm of biomedical literature mining and document understanding has shifted to analysis of both text and figures to derive implicit messages that are unforeseen with text mining only. To enable automatic, massive processing, a key step is to extract and parse figures embedded in papers. In this paper, we present a novel model-driven, hierarchical method to classify and extract panels from figures in scientific papers. Our method consists of two integrated components: figure (or panel) classification and panel segmentation. Figure classification evaluates each panel and decides the existence of photographs and drawings. Mixtures of photographs and non-photographs are divided into subfigures. The splitting process repeats until no further panel collage can be identified. Detection of highlighted views is addressed with Hough space analysis. Using reconstruction from Hough peaks, enclosed panels are retrieved and saved into separate files. Experiments were conducted with a total of 360 figures extracted from two sets of papers that were retrieved with different sets of keywords. Experimental results demonstrated that our method successfully segmented figures and extracted photographs and non-photographs with high accuracy and robustness. In addition, our method was able to identify zoom-in views that are superimposed on the original photographs. The efficiency of our method allows online implementation.
    International Journal of Data Mining and Bioinformatics 01/2014; 9(1):22-36.
  •
    ABSTRACT: Identifying glioma cancer-related genetic markers through analysis of microarray data allows us to detect tumours at the genome-wide level. To this end, we propose to identify glioma gene markers based primarily on their correlation with the glioma diagnostic outcomes, rather than merely on the classification quality or differential expression levels, as it is not the classification or expression level per se that is crucial, but the selection of biologically relevant biomarkers. With the help of singular value decomposition, microarray data are decomposed and the eigenvectors corresponding to the biological effect of diagnostic outcomes are identified. Genes that play important roles in determining this biological effect are thus detected. Therefore, genes are essentially identified in terms of their strength of association with diagnostic outcomes. Monte Carlo simulations are then used to fine tune the selected gene set in terms of classification accuracy. Experiments show that the proposed method achieves better classification accuracies and is data set independent. Graph-based statistical analysis showed that the selected genes have close relationships with glioma diagnostic outcomes. Further biological database and literature study confirms that the identified genes are biologically relevant.
    International Journal of Data Mining and Bioinformatics 01/2014; 9(1):67-88.
  • International Journal of Data Mining and Bioinformatics 01/2014; 10(2):206.
  •
    ABSTRACT: Local protein structure prediction is one of the important tasks in bioinformatics research. In order to further enhance the performance of local protein structure prediction, we propose the Multi-level Clustering Support Vector Machine Trees (MLSVMTs). Building on the multi-cluster tree structure, the MLSVMTs model uses multiple SVMs, each of which is customized to learn the unique sequence-to-structure relationship for one cluster. Both the combined 5 x 2 CV F test and the independent test show that the local structure prediction accuracy of MLSVMTs is significantly better than that of one-level K-means clustering, Multi-level clustering and Clustering Support Vector Machines.
    International Journal of Data Mining and Bioinformatics 01/2014; 9(2):172-98.
  •
    ABSTRACT: Glycoside Hydrolases (GHs) have played key roles in the development of biofuels as well as many other industries. Research aimed at accurate classification of catalytic mechanisms to increase the catalytic activity of GHs is receiving extensive attention. The traditional theories or methods used in the study of catalytic mechanisms of GHs are limited by reaction conditions. They are not suitable for the study of various GHs because different enzymes show diverse physicochemical properties. In this paper, a new method is proposed to classify and predict the catalytic mechanism of a given glycoside hydrolase according to its sequence and structure features using a k-Nearest Neighbour (kNN) classifier, a Support Vector Machine (SVM), a Naive Bayes (NB) classifier and a Multilayer Perceptron (MLP) classifier. The classification performance of the four computational methods was evaluated and compared. Experimental results show that each classifier has its own advantages, but the kNN classifier is more accurate at the overall level. This research also helps us to gain a better understanding of the catalytic mechanisms in different GHs.
    International Journal of Data Mining and Bioinformatics 01/2014; 9(4):444 - 457.