ABSTRACT: Genetic and pharmacological perturbation experiments, such as deleting a gene and monitoring gene expression responses, are powerful tools for studying cellular signal transduction pathways. However, it remains a challenge to automatically derive knowledge of a cellular signaling system at a conceptual level from systematic perturbation-response data. In this study, we explored a framework that unifies knowledge mining and data mining towards this goal. The framework consists of the following automated processes: 1) applying an ontology-driven knowledge mining approach to identify functional modules among the genes responding to a perturbation in order to reveal potential signals affected by the perturbation; 2) applying a graph-based data mining approach to search for perturbations that affect a common signal; and 3) revealing the architecture of a signaling system by organizing signaling units into a hierarchy based on their relationships. Applying this framework to a compendium of yeast perturbation-response data, we have successfully recovered many well-known signal transduction pathways; in addition, our analysis has led to many new hypotheses regarding the yeast signal transduction system; finally, our analysis automatically organized perturbed genes as a graph reflecting the architecture of the yeast signaling system. Importantly, this framework transformed molecular findings from a gene level to a conceptual level, which can be readily translated into computable knowledge in the form of rules regarding the yeast signaling system, such as "if genes involved in the MAPK signaling are perturbed, genes involved in pheromone responses will be differentially expressed."
PLoS ONE 04/2013; 8(4):e61134. · 3.53 Impact Factor
ABSTRACT: Most of the knowledge regarding genes and proteins is stored in biomedical literature as free text. Extracting information from complex biomedical texts demands techniques capable of inferring biological concepts from local text regions and mapping them to controlled vocabularies. To this end, we present a sentence-based correspondence latent Dirichlet allocation (scLDA) model which, when trained with a corpus of PubMed documents with known GO annotations, performs the following tasks: 1) learning major biological concepts from the corpus, 2) inferring the biological concepts existing within text regions (sentences), and 3) identifying the text regions in a document that provide evidence for the observed annotations. When applied to new gene-related documents, a trained scLDA model is capable of predicting GO annotations and identifying text regions as textual evidence supporting the predicted annotations. This study uses GO annotation data as a testbed; the approach can be generalized to other annotated data, such as MeSH and MEDLINE documents.
ABSTRACT: The Gene Ontology (GO) is a controlled vocabulary designed to represent the biological concepts pertaining to gene products. This study investigates methods for identifying informative subsets of GO terms in an automatic and objective fashion. This task in turn requires addressing the following issues: how to represent the semantic context of GO terms, what metrics are suitable for measuring the semantic differences between terms, and how to identify an informative subset that retains as much as possible of the original semantic information of GO.
We represented the semantic context of a GO term using the word-usage-profile associated with the term, which enables one to measure the semantic differences between terms based on the differences in their semantic contexts. We further employed the information bottleneck methods to automatically identify subsets of GO terms that retain as much as possible of the semantic information in an annotation database. The automatically retrieved informative subsets align well with an expert-picked GO slim subset, cover important concepts and proteins, and enhance literature-based GO annotation.
ABSTRACT: The Gene Ontology is the most commonly used controlled vocabulary for annotating proteins. The concepts in the ontology are organized as a directed acyclic graph, in which a node corresponds to a biological concept and a directed edge denotes the parent-child semantic relationship between a pair of terms. A large number of protein annotations further create links between proteins and their functional annotations, reflecting the contemporary knowledge about proteins and their functional relationships. This leads to a complex graph consisting of interleaved biological concepts and their associated proteins. What is needed is a simple, open source library that provides tools to not only create and view the Gene Ontology graph, but to analyze and manipulate it as well. Here we describe the development and use of GOGrapher, a Python library that can be used for the creation, analysis, manipulation, and visualization of Gene Ontology related graphs.
An object-oriented approach was adopted to organize the hierarchy of the graph types and associated classes. An Application Programming Interface is provided through which different types of graphs can be programmatically created, manipulated, and visualized. GOGrapher has been successfully utilized in multiple research projects, e.g., a graph-based multi-label text classifier for protein annotation.
The GOGrapher project provides a reusable programming library designed for the manipulation and analysis of Gene Ontology graphs. The library is freely available for the scientific community to use and improve.
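To make the graph structure described in this abstract concrete, here is a minimal sketch of a directed acyclic graph with child-to-parent edges and an ancestor query. It does not use or reproduce the GOGrapher API; the class name `OntologyGraph`, its methods, and the toy term names and edges are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class OntologyGraph:
    """Toy directed acyclic graph; edges point from a child term to a parent term."""

    def __init__(self):
        # Maps each term to the set of its direct parents.
        self.parents = defaultdict(set)

    def add_edge(self, child, parent):
        self.parents[child].add(parent)

    def ancestors(self, term):
        """Return every term reachable by following child-to-parent edges."""
        seen = set()
        stack = [term]
        while stack:
            current = stack.pop()
            for parent in self.parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Hypothetical toy terms and edges, not taken from the real Gene Ontology.
graph = OntologyGraph()
graph.add_edge("signaling", "cellular process")
graph.add_edge("cellular process", "biological process")
graph.add_edge("MAPK signaling", "signaling")
graph.add_edge("MAPK signaling", "response to stimulus")
graph.add_edge("response to stimulus", "biological process")

mapk_ancestors = graph.ancestors("MAPK signaling")
```

Because a term may have several parents, the traversal tracks visited terms so that shared ancestors are collected only once.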
ABSTRACT: Functional magnetic resonance imaging (fMRI) is a technology used to detect brain activity. Patterns of brain activation have been utilized as biomarkers for various neuropsychiatric applications. Detecting deception based on the pattern of brain activation characterized with fMRI has attracted attention, with machine learning algorithms applied to this field in recent years. The high dimensionality of fMRI data makes it difficult to use the original data directly as input for classification algorithms in detecting deception. In this paper, we investigated feature selection procedures to enhance fMRI-based deception detection.
We used the t-statistic map derived from the statistical parametric mapping analysis of fMRI signals to construct features that reflect brain activation patterns. We subsequently investigated various feature selection methods including an ensemble method to identify discriminative features to detect deception. Using 124 features selected from a set of 65,166 original features as inputs for a support vector machine classifier, our results indicate that feature selection significantly enhanced the classification accuracy of the support vector machine in comparison to the models trained using all features and dimension reduction based models. Furthermore, the selected features are shown to form anatomic clusters within brain regions, which supports the hypothesis that specific brain regions may play a role during deception processes.
Feature selection not only enhances classification accuracy in fMRI-based deception detection but also provides support for the biological hypothesis that brain activities in certain regions of the brain are important for discrimination of deception.
ABSTRACT: The Gene Ontology is a controlled vocabulary for representing knowledge related to genes and proteins in a computable form. The current effort of manually annotating proteins with the Gene Ontology is outpaced by the rate of accumulation of biomedical knowledge in literature, which urges the development of text mining approaches to facilitate the process by automatically extracting the Gene Ontology annotation from literature. The task is usually cast as a text classification problem, and contemporary methods are confronted with unbalanced training data and the difficulties associated with multi-label classification.
In this research, we investigated the methods of enhancing automatic multi-label classification of biomedical literature by utilizing the structure of the Gene Ontology graph. We have studied three graph-based multi-label classification algorithms, including a novel stochastic algorithm and two top-down hierarchical classification methods for multi-label literature classification. We systematically evaluated and compared these graph-based classification algorithms to a conventional flat multi-label algorithm. The results indicate that, through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods can significantly improve predictions of the Gene Ontology terms implied by the analyzed text. Furthermore, the graph-based multi-label classifiers are capable of suggesting Gene Ontology annotations (to curators) that are closely related to the true annotations even if they fail to predict the true ones directly. A software package implementing the studied algorithms is available for the research community.
Through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods have better potential than the conventional flat multi-label classification approach to facilitate protein annotation based on the literature.
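As a rough illustration of what a top-down hierarchical classification pass over a term graph can look like, the sketch below descends from a root and keeps a child term only when its own score passes a threshold, so the children of a rejected term are never visited. This is a generic sketch under stated assumptions, not the algorithms evaluated in the paper: the function `top_down_predict`, the threshold value, the toy hierarchy, and the per-term scores (standing in for the outputs of trained local classifiers) are all hypothetical.

```python
def top_down_predict(children, scores, root, threshold=0.5):
    """Descend from the root; keep a child only if its own score passes the threshold."""
    predicted = []
    frontier = [root]
    while frontier:
        node = frontier.pop()
        for child in children.get(node, ()):
            if scores.get(child, 0.0) >= threshold and child not in predicted:
                predicted.append(child)
                frontier.append(child)
    return sorted(predicted)


# Hypothetical toy hierarchy (parent -> children) and per-term scores.
children = {
    "root": ["signaling", "metabolism"],
    "signaling": ["MAPK signaling", "pheromone response"],
    "metabolism": ["lipid metabolism"],
}
scores = {
    "signaling": 0.9,
    "metabolism": 0.2,
    "MAPK signaling": 0.7,
    "pheromone response": 0.4,
    "lipid metabolism": 0.95,
}

labels = top_down_predict(children, scores, "root")
```

In this toy run "lipid metabolism" is not predicted despite its high score, because its parent "metabolism" was rejected first; that pruning is the defining property of a top-down scheme, in contrast to a flat classifier that scores every term independently.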
ABSTRACT: Predicting protein subcellular locations may help us understand protein functions and analyse protein interactions with other molecules. Many machine learning and computational techniques have been used to predict protein subcellular locations. In this paper, we propose a new hybrid classification system called SVM-ANFIS based on Support Vector Machines and Adaptive Neuro Fuzzy Inference System for protein subcellular location prediction. The experimental results show that the new system can not only achieve high total accuracies but also improve local accuracies in protein subcellular location prediction.
International Journal of Computational Intelligence in Bioinformatics and Systems Biology 01/2009; 1(1).
ABSTRACT: To deal with different membership functions of the same linguistic term, a new interval reasoning method using new granular sets is proposed based on Yin Yang methodology. To perform interval-valued granular reasoning efficiently and to optimize interval membership functions effectively based on training data, a granular neural network (GNN) with a new high-speed evolutionary interval learning algorithm is designed. Simulation results in nonlinear function approximation and bioinformatics show that the GNN with evolutionary interval learning is able to extract interval-valued granular rules effectively and efficiently from training data.
IEEE Transactions on Fuzzy Systems 05/2008; · 6.31 Impact Factor
ABSTRACT: With the growing interest in biological data prediction and chemical data prediction, more powerful and flexible kernels need to be designed so that the prior knowledge and relationships within data can be expressed effectively in kernel functions. In this paper, Granular Kernel Trees (GKTs) are proposed and parallel Genetic Algorithms (GAs) are used to optimise the parameters of GKTs. In applications, SVMs with new kernel trees are employed for drug activity comparisons. The experimental results show that GKTs and evolutionary GKTs can achieve better performances than traditional RBF kernels in terms of prediction accuracy.
International Journal of Data Mining and Bioinformatics 02/2007; 1(3):270-85. · 0.66 Impact Factor
ABSTRACT: In this paper, we present a genetic fuzzy feature transformation method for support vector machines (SVMs) to achieve more accurate data classification. Given data are first transformed into a high feature space by a fuzzy system, and then SVMs are used to map data into a higher feature space and construct the hyperplane to make a final decision. Genetic algorithms are used to optimize the fuzzy feature transformation so that the newly generated features help SVMs classify biomedical data more accurately under uncertainty. The experimental results show that the new genetic fuzzy SVMs have better generalization abilities than the traditional SVMs in terms of prediction accuracy.
Information Sciences 01/2007; 177:476-489. · 3.89 Impact Factor
ABSTRACT: To perform interval-valued granular reasoning efficiently and to optimize interval membership functions effectively based on training data, a new Genetic Granular Neural Network (GGNN) is designed. Simulation results show that the GGNN is able to extract useful fuzzy knowledge effectively and efficiently from training data and to achieve high training accuracy.
Advances in Neural Networks - ISNN 2007, 4th International Symposium on Neural Networks, ISNN 2007, Nanjing, China, June 3-7, 2007, Proceedings, Part II; 01/2007
ABSTRACT: With the growing interest in biological data prediction and chemical data prediction, more and more complicated kernels are designed to integrate data structures and relationships. We have proposed a kind of evolutionary granular kernel trees (EGKTs) for drug activity comparisons. In EGKTs, feature granules and tree structures are predefined based on the possible substituent locations. In this paper, we present a new system to evolve the structures of granular kernel trees (GKTs) in the case that we lack the knowledge to predefine kernel trees. The new granular kernel tree structure evolving system is used for cyclooxygenase-2 inhibitor activity comparison. Experimental results show that the new system can achieve better performance than SVMs with traditional RBF kernels in terms of prediction accuracy.
Transactions on Computational Systems Biology - TCSB. 01/2006;
ABSTRACT: Because the training time and space complexities of SVMs depend mainly on the size of the training set, SVMs are not suitable for classifying large data sets with several million examples. To solve this problem, we propose a new algorithm called minimum enclosing ball (MEB) based SVM (MEB-SVM). In MEB-SVM, the boundary of each class data set is first measured by several MEBs, and then an SVM is trained on the data located on the two class boundaries. Experiments on the KDDCUP-99 intrusion detection data set with about five million examples, the Ringnorm artificial data set with one hundred million examples, and the NDC data set with two million examples show that the new algorithm has competitive performance in terms of running time, testing accuracy and number of support vectors.
Fuzzy Systems, 2006 IEEE International Conference on; 01/2006
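The abstract above does not say how the minimum enclosing balls are computed or how boundary data are selected, so the following is only a sketch under stated assumptions. It approximates a single enclosing ball with the simple Badoiu-Clarkson iteration (repeatedly step the center toward the current farthest point with a shrinking step size) and then keeps the points lying near the ball surface. The function names, the `fraction` cutoff, and the toy data are hypothetical, and this is not claimed to be the MEB-SVM procedure itself.

```python
import math


def distance(p, q):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def approx_meb(points, iterations=1000):
    """Approximate the minimum enclosing ball with the Badoiu-Clarkson iteration."""
    center = list(points[0])
    for i in range(1, iterations + 1):
        farthest = max(points, key=lambda p: distance(p, center))
        # Move the center toward the farthest point by a step of 1 / (i + 1).
        center = [c + (f - c) / (i + 1) for c, f in zip(center, farthest)]
    radius = max(distance(p, center) for p in points)
    return center, radius


def boundary_points(points, center, radius, fraction=0.9):
    """Keep points whose distance from the center is at least fraction * radius (assumed rule)."""
    return [p for p in points if distance(p, center) >= fraction * radius]


# Toy data: four corners of the unit square plus one interior point.
square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
center, radius = approx_meb(square)
kept = boundary_points(square, center, radius)
```

On this toy set the center settles close to (0.5, 0.5), the four corners are kept as boundary points, and the interior point is dropped; a classifier would then be trained on the reduced set only.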
ABSTRACT: How to design powerful and flexible kernels to improve system performance is an important topic in kernel-based classification. In this paper, we present a new granular kernel method to improve the performance of Support Vector Machines (SVMs). In the system, genetic algorithms (GAs) are used to generate feature granules and optimize them together with the fusions and parameters of granular kernels. The new granular kernel method is used for cyclooxygenase-2 inhibitor activity comparison. Experimental results show that the new method can achieve better performance than SVMs with traditional RBF kernels in terms of prediction accuracy.
Advances in Neural Networks - ISNN 2006, Third International Symposium on Neural Networks, Chengdu, China, May 28 - June 1, 2006, Proceedings, Part I; 01/2006
ABSTRACT: Kernel methods, specifically support vector machines (SVMs), have been widely used in many fields for data classification and pattern recognition. The performance of SVMs is mainly affected by kernel functions. With the growing interest in biological data prediction and chemical data prediction, such as structure-property based molecule comparison, protein structure prediction and long DNA sequence comparison, more powerful and flexible kernels need to be designed in order to express the prior knowledge and relationships within each data item effectively. In this paper, the granular kernel concept is presented and related properties are described in detail. A hierarchical kernel design method is proposed to construct granular kernel trees (GKTs). For a particular problem, genetic algorithms (GAs) are used to find the optimum parameter settings of GKTs. In applications, SVMs with new kernel trees are employed for the comparisons of drug activities. The experimental results show that SVMs with GKTs and evolutionary GKTs can achieve better performances than SVMs with traditional RBF kernels in terms of prediction accuracy.
Computational Intelligence in Bioinformatics and Computational Biology, 2005. CIBCB '05. Proceedings of the 2005 IEEE Symposium on; 12/2005
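One way to picture a hierarchical kernel built from feature granules is a small tree whose leaves are RBF kernels restricted to subsets of the features and whose internal nodes fuse their children. The sketch below assumes sum and product as the fusion operations, which is a safe illustrative choice because sums and products of valid kernels are again valid kernels; the abstract does not state which fusions the authors use. The helper names, the granule index sets, and the gamma values are hypothetical, and no genetic-algorithm parameter search is shown.

```python
import math


def rbf(x, y, gamma):
    """Standard RBF kernel exp(-gamma * ||x - y||^2)."""
    squared = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared)


def granule_kernel(indices, gamma):
    """Leaf node: an RBF kernel evaluated only on the features in one granule."""
    def kernel(x, y):
        return rbf([x[i] for i in indices], [y[i] for i in indices], gamma)
    return kernel


def fuse_sum(*kernels):
    """Internal node: the sum of the child kernels."""
    return lambda x, y: sum(k(x, y) for k in kernels)


def fuse_product(*kernels):
    """Internal node: the product of the child kernels."""
    def kernel(x, y):
        value = 1.0
        for k in kernels:
            value *= k(x, y)
        return value
    return kernel


# Hypothetical two-level tree over five features split into three granules.
tree = fuse_sum(
    fuse_product(granule_kernel([0, 1], 0.5), granule_kernel([2], 1.0)),
    granule_kernel([3, 4], 0.1),
)

a = [1.0, 0.0, 2.0, 0.5, 0.5]
b = [0.0, 1.0, 2.0, 0.5, 1.5]
similarity = tree(a, b)
```

In a setting like the one described, the gamma of each leaf (and possibly the tree shape) would be the quantities tuned by the genetic algorithm, and the resulting function would be supplied to an SVM as a custom kernel.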
ABSTRACT: Protein homology prediction between protein sequences is one of the critical problems in computational biology. Such a complex classification problem is common in medical or biological information processing applications. How to build a model with superior generalization capability from training samples is an essential issue for mining knowledge to accurately predict or classify unseen samples and to effectively support human experts in making correct decisions.
A new learning model called granular support vector machines (GSVM) is proposed based on our previous work. GSVM systematically and formally combines the principles from statistical learning theory and granular computing theory and thus provides an interesting new mechanism to address complex classification problems. It works by building a sequence of information granules and then building support vector machines (SVM) in some of these information granules on demand. A good granulation method to find suitable granules is crucial for modeling a GSVM with good performance. In this paper, we also propose an association rules-based granulation method. For the granules induced by association rules with high enough confidence and significant support, we leave them as they are because of their high "purity" and significant effect on simplifying the classification task. For every other granule, a SVM is modeled to discriminate the corresponding data. In this way, a complex classification problem is divided into multiple smaller problems so that the learning task is simplified.
The proposed algorithm, here named GSVM-AR, is compared with SVM on the KDDCUP04 protein homology prediction data. The experimental results show that finding the splitting hyperplane is not a trivial task (the association rules must be selected carefully to avoid overfitting) and that GSVM-AR does show significant improvement compared to building one single SVM in the whole feature space. Another advantage is that GSVM-AR is practical because it is easy to implement. More importantly and more interestingly, GSVM provides a new mechanism to address complex classification problems.
Artificial Intelligence in Medicine 09/2005; 35(1-2):121-34. · 1.36 Impact Factor
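The granulation idea in this abstract, keeping granules induced by rules with high confidence and significant support and handing everything else to a classifier, can be sketched with the standard association-rule definitions: support is the fraction of all samples that satisfy both the antecedent and the target label, and confidence is the fraction of antecedent matches that carry the target label. The function names, thresholds, feature name `score`, and toy data below are hypothetical, and the step of training an SVM on the remaining granule is left out.

```python
def rule_stats(samples, labels, antecedent, target):
    """Return (support, confidence) of the rule 'antecedent -> target label'."""
    matched = [label for sample, label in zip(samples, labels) if antecedent(sample)]
    if not matched:
        return 0.0, 0.0
    hits = sum(1 for label in matched if label == target)
    return hits / len(samples), hits / len(matched)


def split_granules(samples, labels, antecedent, target,
                   min_support=0.2, min_confidence=0.9):
    """Split sample indices into a rule-covered granule and a remainder for a classifier."""
    support, confidence = rule_stats(samples, labels, antecedent, target)
    everyone = list(range(len(samples)))
    if support < min_support or confidence < min_confidence:
        # Rule is not trustworthy enough: every sample goes to the classifier.
        return {"rule_granule": [], "needs_classifier": everyone}
    inside = [i for i in everyone if antecedent(samples[i])]
    outside = [i for i in everyone if not antecedent(samples[i])]
    return {"rule_granule": inside, "needs_classifier": outside}


# Hypothetical toy data: one numeric feature and a binary label.
samples = [{"score": 0.9}, {"score": 0.8}, {"score": 0.85},
           {"score": 0.3}, {"score": 0.4}, {"score": 0.2}]
labels = [1, 1, 1, 0, 1, 0]


def high_score(sample):
    return sample["score"] > 0.7


support, confidence = rule_stats(samples, labels, high_score, target=1)
granules = split_granules(samples, labels, high_score, target=1)
```

Here the rule covers half of the data with confidence 1.0, so those three samples are labeled by the rule alone and only the remaining three would be passed to a trained classifier.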
ABSTRACT: In support vector machine (SVM) learning, the data to be classified are fed directly to the algorithm without modification. In many real-world applications, however, objects cannot be represented accurately by their original feature vectors, because the original features might contain noise, imprecise descriptions, or unrelated information, which hinders SVMs from learning useful knowledge from the raw data. To address this problem, we present an evolutionary feature weights optimization method, which is used to transform the raw data into a "better" feature space to improve SVM classification accuracy.
Fuzzy Information Processing Society, 2005. NAFIPS 2005. Annual Meeting of the North American; 07/2005