ABSTRACT: Background
Detecting epistatic interactions plays a significant role in understanding the pathogenesis of complex human diseases and in improving their prevention, diagnosis, and treatment. Applying machine learning or statistical methods to epistatic interaction detection encounters some common problems, e.g., a very limited number of samples, an extremely large search space, a large number of false positives, and the choice of how to measure the association between disease markers and the phenotype.
To address the problems of computational methods in epistatic interaction detection, we propose a score-based Bayesian network structure learning method, EpiBN, to detect epistatic interactions. We apply the proposed method to both simulated datasets and three real disease datasets. Experimental results on simulation data show that our method outperforms some other commonly-used methods in terms of power and sample-efficiency, and is especially suitable for detecting epistatic interactions with weak or no marginal effects. Furthermore, our method is scalable to real disease data.
We propose a Bayesian network-based method, EpiBN, to detect epistatic interactions. In EpiBN, we develop a new scoring function, which can reflect higher-order epistatic interactions by estimating the model complexity from data, and apply a fast Branch-and-Bound algorithm to learn the structure of a two-layer Bayesian network containing only one target node. To make our method scalable to real data, we propose the use of a Markov chain Monte Carlo (MCMC) method to perform the screening process. Applications of the proposed method to some real GWAS (genome-wide association studies) datasets may provide helpful insights into understanding the genetic basis of Age-related Macular Degeneration, late-onset Alzheimer’s disease, and autism.
BMC Systems Biology 08/2013; 6(3). · 2.98 Impact Factor
ABSTRACT: Most proteins execute their functions through interacting with other proteins. Thus, understanding protein-protein interactions (PPI) is essential to decipher biological functions in a living cell. To predict large-scale PPIs, effective and efficient computational approaches are desirable to integrate heterogeneous data sources provided by advanced technologies. In this paper, we extend our previous work on a Bayesian classifier for human PPI predictions from model organisms, by introducing a tree-augmented naïve Bayes (TAN) classifier. TAN maintains the simplicity and robustness of a naïve Bayes classifier while allowing for dependence among variables. Our empirical results show that by integrating features extracted from microarray expression measurements, Gene Ontology values, and orthologous scores, TAN achieves higher classification accuracy than the manually constructed Bayesian network classifier and naïve Bayes. For human PPI prediction, TAN obtains 88% sensitivity while keeping a reasonable 70% specificity on testing samples.
ABSTRACT: Automatic detection and identification of landmarks in cephalometry is of great significance to orthognathic surgery and clinical applications. Motivated by the increasing demands of computerized cephalometric analysis, we present a tree-shaped deformable template which detects the landmark points of a grayscale cephalometric x-ray image. After normalization, a group of randomly selected images is used to train the geometric prior, and a dynamic programming algorithm enhanced by downsampling is employed to find the optimal landmark configuration. The proposed algorithm demonstrates promising detection results as well as time efficiency on both soft and hard contours. This leads to a significant improvement over the state-of-the-art diagnostic tools in the area of cephalometric diagnosis.
Healthcare Informatics, Imaging and Systems Biology (HISB), 2011 First IEEE International Conference on; 08/2011
ABSTRACT: The availability of rapidly increasing repositories of microarray data requires the help of computer-aided analysis techniques. This data, combined with a growing knowledge base about molecular processes, enables the use of intelligent machine learning algorithms to expand the existing knowledge base. In this paper, we propose a novel algorithm, namely the iterated Hidden Markov Model, which queries microarray expression data with genes known to be involved in the same function in order to find novel genes involved in the same cellular function. We run this algorithm on publicly available benchmark data sets and show that it outperforms comparable machine learning approaches.
IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2011, Atlanta, GA, USA, 12-15 November, 2011; 01/2011
ABSTRACT: Detecting epistatic interactions plays a significant role in understanding the pathogenesis of complex human diseases and in improving their prevention, diagnosis and treatment. A recent study in automatic detection of epistatic interactions shows that Markov Blanket-based methods are capable of finding genetic variants strongly associated with common diseases and reducing false positives when the number of instances is large. Unfortunately, a typical dataset from genome-wide association studies consists of a very limited number of examples, where current methods, including Markov Blanket-based methods, may perform poorly.
To address small sample problems, we propose a Bayesian network-based approach (bNEAT) to detect epistatic interactions. The proposed method also employs a Branch-and-Bound technique for learning. We apply the proposed method to simulated datasets based on four disease models and a real dataset. Experimental results show that our method outperforms Markov Blanket-based methods and other commonly-used methods, especially when the number of samples is small.
Our results show that bNEAT achieves strong power regardless of the number of samples and is especially suitable for detecting epistatic interactions with slight or no marginal effects. The merits of the proposed approach lie in two aspects: a suitable score for Bayesian network structure learning that can reflect higher-order epistatic interactions and a heuristic Bayesian network structure learning method.
ABSTRACT: The interactions among genetic factors related to diseases are called epistasis. With the availability of genotyped data from genome-wide association studies, it is now possible to computationally unravel epistasis related to the susceptibility to common complex human diseases such as asthma, diabetes, and hypertension. However, the difficulties of detecting epistatic interactions arise from the large number of genetic factors and the enormous number of possible combinations of genetic factors. Most computational methods to detect epistatic interactions are predictor-based methods and cannot find true causal factor elements. Moreover, they are both time-consuming and sample-consuming.
We propose a new and fast Markov Blanket-based method, FEPI-MB (Fast EPistatic Interactions detection using Markov Blanket), for epistatic interaction detection. The Markov Blanket is a minimal set of variables that can completely shield the target variable from all other variables. Learning of Markov blankets can be used to detect epistatic interactions by a heuristic search for a minimal set of SNPs, which may cause the disease. Experimental results on both simulated data sets and a real data set demonstrate that FEPI-MB significantly outperforms other existing methods and is capable of finding SNPs that have a strong association with common diseases.
The FEPI-MB algorithm outperforms other computational methods for detection of epistatic interactions in terms of both power and sample-efficiency. Moreover, compared to other Markov Blanket learning methods, FEPI-MB is more time-efficient and achieves better performance.
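The Markov blanket definition in the entry above can be made concrete with a minimal sketch. It assumes the network structure is already known as a directed acyclic graph, in which case the blanket of a node is its parents, its children, and the other parents of its children. This is not the FEPI-MB algorithm, which learns the blanket from genotype data; the function name `markov_blanket` and the toy network are hypothetical.

```python
def markov_blanket(parents, target):
    """Return the Markov blanket of `target` in a known DAG.

    `parents` maps every node to the set of its parent nodes.
    """
    blanket = set(parents.get(target, set()))  # parents of the target
    for node, node_parents in parents.items():
        if target in node_parents:             # `node` is a child of the target
            blanket.add(node)
            blanket.update(node_parents)       # the child's other parents
    blanket.discard(target)                    # the target is not in its own blanket
    return blanket


# Hypothetical toy network: snp1 and snp2 jointly influence the disease node,
# while snp3 depends only on snp1.
toy_parents = {
    "snp1": set(),
    "snp2": set(),
    "snp3": {"snp1"},
    "disease": {"snp1", "snp2"},
}

print(sorted(markov_blanket(toy_parents, "disease")))
```

In this toy graph snp3 lies outside the blanket of the disease node, which is the sense in which the blanket shields the target from all other variables.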
ABSTRACT: While genome sequencing projects have generated tremendous amounts of protein sequence data for a vast number of genomes, substantial portions of most genomes are still unannotated. Despite the success of experimental methods for identifying protein functions, they are often labor intensive and time consuming. Thus, it is only practical to use in silico methods for genome-wide functional annotation. In this paper, we propose new features extracted from protein sequence only and machine learning-based methods for computational function prediction. These features are derived from a position-specific scoring matrix, which has shown great potential in other bioinformatics problems. We evaluate these features using four different classifiers and yeast protein data. Our experimental results show that features derived from the position-specific scoring matrix are appropriate for automatic function annotation.
IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM 01/2011; 8(2):308-15. · 2.25 Impact Factor
ABSTRACT: KUPS (The University of Kansas Proteomics Service) provides high-quality protein-protein interaction (PPI) data for researchers developing and evaluating computational models for predicting PPIs by allowing users to construct ready-to-use data sets of interacting protein pairs (IPPs), non-interacting protein pairs (NIPs) and associated features. Multiple filters and options allow the user to control the make-up of the IPPs and NIPs as well as the quality of the resultant data sets. Each data set is built from the overall database, which includes 185 446 IPPs and ∼1.5 billion NIPs from five primary databases: IntAct, HPRD, MINT, UniProt and the Gene Ontology. The IPP set can be set to specific model organisms, interaction types and experimental evidence. The NIP set can be generated using four different strategies, which can alleviate biased estimation problems. Lastly, multiple features can be provided for all of the IPP and NIP pairs. Additionally, KUPS provides two benchmark data sets to help researchers compare their algorithms to existing approaches. KUPS is freely available at http://www.ittc.ku.edu/chenlab.
Nucleic Acids Research 10/2010; 39(Database issue):D750-4. · 8.28 Impact Factor
ABSTRACT: Multi-label classification refers to learning tasks with each instance belonging to one or more classes simultaneously. It arose from real-world applications such as information retrieval, text categorization and functional genomics. Currently, most multi-label learning methods use the strategy called binary relevance, which constructs a classifier for each unique label by grouping data into positives (examples with this label) and negatives (examples without this label). With binary relevance, an example with multiple labels is considered a positive example for each label it belongs to. For some classes, this data point may behave like an outlier that confuses classifiers, especially in the case of well-separated classes. In this paper, we first introduce a new strategy called soft relevance, where each multi-label example is assigned a relevance score for each label it belongs to. This soft relevance is then employed in a voting function used in a k nearest neighbor classifier. Furthermore, a voting-margin ratio is introduced to the k nearest neighbor classifier for better performance. We compare the proposed method to other multi-label learning methods over three multi-label datasets and demonstrate that the proposed method provides an effective approach to multi-label learning.
Proceedings of the 19th ACM Conference on Information and Knowledge Management, CIKM 2010, Toronto, Ontario, Canada, October 26-30, 2010; 01/2010
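The binary relevance strategy described in the entry above can be sketched as follows. This covers only the baseline decomposition into one positive/negative split per label, not the soft relevance scoring or the voting function proposed in the paper; the function name `binary_relevance_split` and the toy documents are hypothetical.

```python
def binary_relevance_split(examples):
    """Build one positive/negative split per label.

    `examples` is a list of (example_id, set_of_labels) pairs.
    Returns a dict mapping each label to (positives, negatives).
    """
    all_labels = set()
    for _, labels in examples:
        all_labels |= labels

    splits = {}
    for label in sorted(all_labels):
        # An example is a positive for every label it carries.
        positives = [ex for ex, labels in examples if label in labels]
        negatives = [ex for ex, labels in examples if label not in labels]
        splits[label] = (positives, negatives)
    return splits


# Hypothetical toy data: d2 carries two labels at once.
toy = [
    ("d1", {"sports"}),
    ("d2", {"sports", "politics"}),
    ("d3", {"politics"}),
]

for label, (pos, neg) in binary_relevance_split(toy).items():
    print(label, "positives:", pos, "negatives:", neg)
```

The multi-label example d2 appears as a full positive in both splits, which is the behavior the soft relevance score is meant to temper.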
ABSTRACT: Detecting epistatic interactions associated with complex and common diseases can help to improve prevention, diagnosis and treatment of these diseases. With the development of genome-wide association studies (GWAS), designing powerful and robust computational methods for identifying epistatic interactions associated with common diseases has become a great challenge for the bioinformatics community, because the study of epistatic interactions often deals with the large size of the genotyped data and the huge number of combinations of all the possible genetic factors. Most existing computational detection methods are based on the classification capacity of SNP sets, which may fail to identify SNP sets that are strongly associated with the diseases and may introduce many false positives. In addition, most methods are not suitable for genome-wide scale studies due to their computational complexity.
We propose a new Markov Blanket-based method, DASSO-MB (Detection of ASSOciations using Markov Blanket), to detect epistatic interactions in case-control GWAS. The Markov blanket of a target variable T can completely shield T from all other variables. Thus, we can guarantee that the SNP set detected by DASSO-MB has a strong association with diseases and contains the fewest false positives. Furthermore, DASSO-MB uses a heuristic search strategy that calculates the association between variables to avoid the time-consuming training process used in other machine-learning methods. We apply our algorithm to simulated datasets and a real case-control dataset. We compare DASSO-MB to other commonly-used methods and show that our method significantly outperforms other methods and is capable of finding SNPs strongly associated with diseases.
Our study shows that DASSO-MB can identify a minimal set of causal SNPs associated with diseases, which contains fewer false positives compared to other existing methods. Given the huge size of the genomic datasets produced by GWAS, this is critical in saving the potential costs of biological experiments and in providing an efficient guideline for pathogenesis research.
ABSTRACT: Content-based image search on the Internet is a challenging problem, mostly due to the semantic gap between low-level visual features and high-level content, as well as the excessive computation brought by the huge number of images and high-dimensional features. In this paper, we present iLike, a new approach to truly combine textual features from web pages and visual features from image content for better image search in a vertical search engine. We tackle the first problem by trying to capture the meaning of each text term in the visual feature space, and re-weight visual features according to their significance to the query content. Our experimental results in product search for apparel and accessories demonstrate the effectiveness of iLike and its capability of bridging semantic gaps between visual features and abstract concepts.
Proceedings of the 18th International Conference on Multimedia 2010, Firenze, Italy, October 25-29, 2010; 01/2010
ABSTRACT: The class imbalance problem is encountered in real-world applications of machine learning and results in a classifier's suboptimal performance. Researchers have rigorously studied the resampling, algorithmic, and feature selection approaches to this problem. No systematic studies have been conducted to understand how well these methods combat the class imbalance problem and which of these methods best manage the different challenges posed by imbalanced data sets. In particular, feature selection has rarely been studied outside of text classification problems. Additionally, no studies have looked at the additional problem of learning from small samples. This paper presents a first systematic comparison of the three types of methods developed for imbalanced data classification problems and of seven feature selection metrics evaluated on small sample data sets from different applications. We evaluated the performance of these metrics using the area under the receiver operating characteristic curve (AUC) and the area under the precision-recall curve (PRC). We compared each metric on the average performance across all problems and on the likelihood of a metric yielding the best performance on a specific problem. We examined the performance of these metrics inside each problem domain. Finally, we evaluated the efficacy of these metrics to see which perform best across algorithms. Our results showed that signal-to-noise correlation coefficient (S2N) and Feature Assessment by Sliding Thresholds (FAST) are great candidates for feature selection in most applications, especially when selecting very small numbers of features.
IEEE Trans. Knowl. Data Eng. 01/2010; 22:1388-1400.
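The AUC used as an evaluation measure in the entry above can be sketched with the standard pairwise interpretation: the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counted as one half. This is a generic illustration rather than the paper's evaluation code; the function name `auc` and the toy scores are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve computed by pairwise comparison.

    `labels` holds 1 for positives and 0 for negatives.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced toy set: 2 positives, 3 negatives.
toy_scores = [0.9, 0.4, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 0, 0, 0]
print(auc(toy_scores, toy_labels))
```

Because the value depends only on how positives rank against negatives, it does not change with the class ratio itself, which is one reason it is a common choice for imbalanced data.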
ABSTRACT: Small sample size is one of the biggest challenges in microarray data analysis. With microarray data being dramatically accumulated, integrating data from related studies represents a natural way to increase sample size so that more reliable statistical analysis may be performed. In this paper, we present a simple and effective integration scheme, called Normalised Linear Transform (NLT), to combine data from different microarray platforms. The NLT scheme is compared with three other integration schemes for two tasks: classification analysis and gene marker selection. Our experiments demonstrate that the NLT scheme performs best in terms of classification accuracy, and leads to more biologically significant marker genes.
International Journal of Data Mining and Bioinformatics 01/2010; 4(2):142-57. · 0.39 Impact Factor
ABSTRACT: The sequencing of whole genomes from various species has provided us with a wealth of genetic information. To make use of the vast amounts of data available today, it is necessary to devise computer-based analysis techniques.
We propose a Hidden Markov Model (HMM) based algorithm to detect groups of genes functionally similar to a set of input genes from microarray expression data. A subset of experiments from a microarray is selected based on a set of related input genes. HMMs are trained from the input genes and a group of random gene input sets to provide significance estimates. Every gene in the microarray is scored using all HMMs and significant matches with the input genes are retained. We ran this algorithm on the life cycle of Drosophila microarray data set with KEGG pathways for cell cycle and translation factors as input data sets. Results show high functional similarity in resulting gene sets, increasing our biological insight into gene pathways and KEGG annotations. The algorithm performed very well compared to the Signature Algorithm and a purely correlation-based approach.
Java source code and data sets are available at http://www.ittc.ku.edu/~xwchen/software.htm
ABSTRACT: Identifying genes (biomarkers) and predicting the clinical outcomes with censored survival times are important for cancer prognosis and pathogenesis. In this article, we propose a novel method with L(1) penalized global AUC summary maximization (L(1)GAUCS). The L(1)GAUCS method is developed for simultaneous gene (feature) selection and survival prediction. L(1) penalty shrinks coefficients and produces some coefficients that are exactly zero, and therefore selects a small subset of genes (features). It is a well-known fact that many genes are highly correlated in gene expression data and the highly correlated genes may function together. We, therefore, define a correlation measure to identify those genes such that their expression level may be low but they are highly correlated with the downstream highly expressed genes selected with L(1)GAUCS. Partial pathways associated with the correlated genes are identified with DAVID (http://david.abcc.ncifcrf.gov/). Experimental results with chemotherapy and gene expression data demonstrate that the proposed procedures can be used for identifying important genes and pathways that are related to time to death due to cancer and for building a parsimonious model for predicting the survival of future patients. Software is available upon request from the first author.
Journal of computational biology: a journal of computational molecular cell biology 09/2009; 16(12):1661-70. · 1.69 Impact Factor
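The statement in the entry above that an L(1) penalty produces coefficients that are exactly zero can be illustrated with the soft-thresholding operator, the proximal operator of the L1 norm. This is a generic sketch and not the L(1)GAUCS optimization procedure; the function name `soft_threshold`, the weights, and the penalty value are hypothetical.

```python
def soft_threshold(coefficient, penalty):
    """Shrink a coefficient toward zero by `penalty`, clipping at zero."""
    if coefficient > penalty:
        return coefficient - penalty
    if coefficient < -penalty:
        return coefficient + penalty
    return 0.0  # small coefficients are set exactly to zero


# Hypothetical unpenalized weights for five genes.
weights = [2.5, -0.3, 0.8, -1.7, 0.05]
shrunk = [soft_threshold(w, 1.0) for w in weights]

# Indices of the genes that survive the penalty.
selected = [i for i, w in enumerate(shrunk) if w != 0.0]
print(shrunk)
print(selected)
```

Coefficients whose magnitude does not exceed the penalty become exactly zero rather than merely small, so the surviving indices form the selected gene subset.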
ABSTRACT: Protein-protein interactions (PPIs), though extremely valuable towards a better understanding of protein functions and cellular processes, do not provide any direct information about the regions/domains within the proteins that mediate the interaction. Most often, it is only a fraction of a protein that directly interacts with its biological partners. Thus, understanding interaction at the domain level is a critical step towards (i) thorough understanding of PPI networks; (ii) precise identification of binding sites; (iii) acquisition of insights into the causes of deleterious mutations at interaction sites; and (iv) most importantly, development of drugs to inhibit pathological protein interactions. In addition, knowledge derived from known domain-domain interactions (DDIs) can be used to understand binding interfaces, which in turn can help discover unknown PPIs.
Here, we describe a novel method called K-GIDDI (knowledge-guided inference of DDIs) to narrow down the PPI sites to smaller regions/domains. K-GIDDI constructs an initial DDI network from cross-species PPI networks, and then expands the DDI network by inferring additional DDIs using a divide-and-conquer biclustering algorithm guided by Gene Ontology (GO) information, which identifies partial-complete bipartite sub-networks in the DDI network and makes them complete bipartite sub-networks by adding edges. Our results indicate that K-GIDDI can reliably predict DDIs. Most importantly, K-GIDDI's novel network expansion procedure allows prediction of DDIs that are otherwise not identifiable by methods that rely only on PPI data.
ABSTRACT: Oxidative stress (OS) is an important factor in brain aging and neurodegenerative diseases. Certain neurons in different brain regions exhibit selective vulnerability to OS. Currently little is known about the underlying mechanisms of this selective neuronal vulnerability. The purpose of this study was to identify endogenous factors that predispose vulnerable neurons to OS by employing genomic and biochemical approaches.
In this report, using in vitro neuronal cultures, ex vivo organotypic brain slice cultures and acute brain slice preparations, we established that cerebellar granule (CbG) and hippocampal CA1 neurons were significantly more sensitive to OS (induced by paraquat) than cerebral cortical and hippocampal CA3 neurons. To probe for intrinsic differences between in vivo vulnerable (CA1 and CbG) and resistant (CA3 and cerebral cortex) neurons under basal conditions, these neurons were collected by laser capture microdissection from freshly excised brain sections (no OS treatment), and then subjected to oligonucleotide microarray analysis. GeneChip-based transcriptomic analyses revealed that vulnerable neurons had higher expression of genes related to stress and immune response, and lower expression of energy generation and signal transduction genes in comparison with resistant neurons. Subsequent targeted biochemical analyses confirmed the lower energy levels (in the form of ATP) in primary CbG neurons compared with cortical neurons.
Low energy reserves and high intrinsic stress levels are two underlying factors for neuronal selective vulnerability to OS. These mechanisms can be targeted in the future for the protection of vulnerable neurons.
ABSTRACT: Previous applications of microarray technology for cancer research have mostly focused on identifying genes that are differentially expressed between a particular cancer and normal cells. In a biological system, genes perform different molecular functions and regulate various biological processes via interactions with other genes, thus forming a variety of complex networks. Therefore, it is critical to understand the relationship (e.g., interactions) between genes across different types of cancer in order to gain insights into the molecular mechanisms of cancer. Here we propose an integrative method based on the bootstrapping Kolmogorov-Smirnov test and a large set of microarray data produced with various types of cancer to discover common molecular changes in cells from normal state to cancerous state. We evaluate our method using three key pathways related to cancer and demonstrate that it is capable of finding meaningful alterations in gene relations.
BioMed Research International 02/2009; 2009:707580. · 2.88 Impact Factor
ABSTRACT: Identification of protein interaction sites has significant impact on understanding protein function, elucidating signal transduction networks and drug design studies. With the exponentially growing protein sequence data, predictive methods using sequence information only for protein interaction site prediction have drawn increasing interest. In this article, we propose a predictive model for identifying protein interaction sites. Without using any structure data, the proposed method extracts a wide range of features from protein sequences. A random forest-based integrative model is developed to effectively utilize these features and to deal with the imbalanced data classification problem commonly encountered in binding site predictions.
We evaluate the predictive method using 2829 interface residues and 24,616 non-interface residues extracted from 99 polypeptide chains in the Protein Data Bank. The experimental results show that the proposed method performs significantly better than two other sequence-based predictive methods and can reliably predict residues involved in protein interaction sites. Furthermore, we apply the method to predict interaction sites and to construct three protein complexes: the DnaK molecular chaperone system, 1YUW and 1DKG, which provide new insight into the sequence-function relationship. We show that the predicted interaction sites can be valuable as a first approach for guiding experimental methods investigating protein-protein interactions and localizing the specific interface residues.
Datasets and software are available at http://ittc.ku.edu/~xwchen/bindingsite/prediction.