ABSTRACT: Using transport smart card transaction data to understand the home-work dynamics of a city for urban planning is emerging as an alternative to traditional surveys, which are conducted only once every few years and are no longer effective or efficient for rapidly transforming modern cities. As commuters' travel patterns are highly diverse, existing rule-based methods are not fully adequate. In this paper, we present iVizTRANS, a tool that combines an interactive visual analytics (VA) component, which helps urban planners analyse complex travel patterns and decipher activity locations for individual public transport commuters, with a machine learning component that iteratively learns from the planners' classifications to train a classifier. The classifier is then applied to the city-wide smart card data to derive the dynamics for all public transport commuters. Our evaluation shows it outperforms the rule-based methods of previous work.
ABSTRACT: This paper proposes a novel Integrated Oversampling (INOS) method that can handle highly imbalanced time series classification. We introduce an enhanced structure preserving oversampling (ESPO) technique and synergistically combine it with interpolation-based oversampling. ESPO is used to generate a large percentage of the synthetic minority samples based on a multivariate Gaussian distribution, by estimating the covariance structure of the minority-class samples and regularizing the unreliable eigen spectrum. To protect the key original minority samples, we use an interpolation-based technique to generate the remaining small percentage of the synthetic samples. By preserving the main covariance structure and intelligently creating protective variances in the trivial eigen dimensions, ESPO effectively expands the synthetic samples into the void areas of the data space without tying them too closely to the existing minority-class samples. This also addresses a key challenge in applying oversampling to imbalanced time series classification, i.e., maintaining the correlation between consecutive values by preserving the main covariance structure. Extensive experiments based on seven public time series data sets demonstrate that our INOS approach, used with support vector machines (SVM), achieves better performance than existing oversampling methods as well as state-of-the-art methods in time series classification.
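The core ESPO idea (estimate the minority-class covariance, regularize the unreliable tail of its eigen spectrum, then sample synthetically from the resulting Gaussian) can be sketched as follows. This is a minimal illustration, not the paper's exact procedure; the function name, the regularization floor `reg`, and all parameter choices are assumptions.

```python
import numpy as np

def espo_oversample(X_min, n_synth, reg=0.02, rng=None):
    """Sketch of structure-preserving oversampling: draw synthetic
    minority samples from a multivariate Gaussian whose covariance is
    estimated from the minority class, with the trailing (unreliable)
    eigenvalues raised to a small floor to create protective variance.
    The floor `reg` is an illustrative choice, not the paper's scheme."""
    rng = np.random.default_rng(rng)
    mu = X_min.mean(axis=0)
    cov = np.cov(X_min, rowvar=False)
    w, V = np.linalg.eigh(cov)            # eigenvalues in ascending order
    w = np.maximum(w, reg * w.max())      # regularize the trivial eigen dims
    cov_reg = (V * w) @ V.T               # reassemble V diag(w) V^T
    return rng.multivariate_normal(mu, cov_reg, size=n_synth)
```

Because the leading eigenstructure is kept intact, the synthetic samples preserve the dominant correlations between consecutive time series values while still spreading into the otherwise empty trailing dimensions.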
Full-text · Article · Dec 2013 · IEEE Transactions on Knowledge and Data Engineering
ABSTRACT: While high-throughput technologies are expected to play a critical role in clinical translational research for complex disease diagnosis, the ability to accurately and consistently discriminate disease phenotypes by determining the gene and protein expression patterns that serve as signatures of different clinical conditions remains a challenge in translational bioinformatics. In this study, we propose a novel feature selection algorithm, the Multi-Resolution Test (MRT-test), that can produce significantly accurate and consistent phenotype discrimination across a series of omics data. Our algorithm captures the features contributing to subtle data behaviors rather than those contributing to global data behaviors, which appears to be essential for achieving clinical-level diagnosis on different expression data. Furthermore, as an effective biomarker discovery algorithm, it can achieve linear separation of high-dimensional omics data with few biomarkers. We apply the MRT-test to complex disease phenotype diagnosis by combining it with state-of-the-art classifiers and attain exceptional diagnostic results, which suggests our method's advantage in molecular diagnostics. Experimental evaluation showed that MRT-test-based diagnosis is able to generate consistent and robust clinical-level phenotype separation for various diseases. In addition, based on the seed biomarkers detected by the MRT-test, we design a novel network marker synthesis (NMS) algorithm to decipher the underlying molecular mechanisms of tumorigenesis from a systems viewpoint. Unlike existing top-down gene network building approaches, our network marker synthesis method depends less on the global network, enabling it to capture the gene regulators of different subnetwork markers, which will provide biologically meaningful insights into the genetic basis of complex diseases.
Full-text · Article · Dec 2013 · Journal of Bioinformatics and Computational Biology
ABSTRACT: Background
Many biological processes are carried out by proteins interacting with each other in the form of protein complexes. However, large-scale detection of protein complexes has remained constrained by experimental limitations. As such, computational detection of protein complexes by applying clustering algorithms to the abundantly available protein-protein interaction (PPI) networks is an important alternative. However, many current algorithms overlook the importance of selecting seeds that can be expanded into clusters without excluding important proteins or including many noisy ones, while ensuring a high degree of functional homogeneity amongst the proteins detected for the complexes.
We designed a novel method called Probabilistic Local Walks (PLW), which clusters regions in a PPI network with high functional similarity to find protein complex cores with high precision and efficiency in O(|V| log |V| + |E|) time. A seed selection strategy, which prioritises seeds with dense neighbourhoods, was devised. We defined a topological measure, called common neighbour similarity, to estimate the functional similarity of two proteins given the number of their common neighbours.
Our proposed PLW algorithm achieved the highest F-measure (recall and precision) when compared to 11 state-of-the-art methods on yeast protein interaction data, with an improvement of 16.7% over the next highest score. Our experiments also demonstrated that our seed selection strategy is able to increase algorithm precision when applied to three previous protein complex mining techniques.
The software, datasets and predicted complexes are available at http://wonglkd.github.io/PLW
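A common-neighbour-based similarity between two proteins can be illustrated with a Jaccard-style score over their neighbour sets. This is only an illustrative stand-in: the exact formula PLW uses is not given in the abstract, and the function name and `adj` representation (protein to neighbour set) are assumptions.

```python
def common_neighbour_similarity(adj, u, v):
    """Illustrative common-neighbour similarity between proteins u and v
    in a PPI network, scored Jaccard-style (the actual PLW measure may
    differ).  `adj` maps each protein to the set of its neighbours."""
    common = adj[u] & adj[v]          # proteins interacting with both
    union = adj[u] | adj[v]           # proteins interacting with either
    return len(common) / len(union) if union else 0.0
```

Two proteins sharing many interaction partners score close to 1, matching the intuition that shared neighbourhoods in a PPI network indicate shared function.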
ABSTRACT: Many important biological processes, such as the signaling pathways, require protein-protein interactions (PPIs) that are designed for fast response to stimuli. These interactions are usually transient, easily formed, and disrupted, yet specific. Many of these transient interactions involve the binding of a protein domain to a short stretch (3-10) of amino acid residues, which can be characterized by a sequence pattern, i.e., a short linear motif (SLiM). We call these interacting domains and motifs domain-SLiM interactions. Existing methods have focused on discovering SLiMs in the interacting proteins' sequence data. With the recent increase in protein structures, we have a new opportunity to detect SLiMs directly from the proteins' 3D structures instead of their linear sequences. In this chapter, we describe a computational method called SLiMDIet to directly detect SLiMs on domain interfaces extracted from 3D structures of PPIs. SLiMDIet comprises two steps: (1) interaction interfaces belonging to the same domain are extracted and grouped together using structural clustering and (2) the extracted interaction interfaces in each cluster are structurally aligned to extract the corresponding SLiM. Using SLiMDIet, de novo SLiMs interacting with protein domains can be computationally detected from structurally clustered domain-SLiM interactions for PFAM domains which have available 3D structures in the PDB database.
No preview · Article · Jan 2013 · Methods in molecular biology (Clifton, N.J.)
ABSTRACT: We present a novel elastic system architecture called Plug Cloud, which aims to increase the power of low-compute (and possibly mobile) devices, such as tablets, by distributing high-compute tasks, such as rendering, data analysis and visualization, to a set of Plug Computers [1, 2] that can be added to or removed from the system incrementally. These devices are connected through a wired/wireless connection, and the network is formed seamlessly with zero configuration. Plug Cloud allows users of low-compute devices to acquire more processing power externally on demand by plugging in one or more plug-computers as needed. Furthermore, it allows a user to remove a plug-computer safely at any time without bringing down the whole system. The result is an elastic network that can expand and shrink automatically as plug-computers are added to or removed from the system. This innovative architecture forms a personal cloud infrastructure on demand to support users' computational needs. We have implemented a computer graphics rendering application using the Plug Cloud architecture. We will demo our architecture and prototype system using a tablet and several plug-computers.
ABSTRACT: Living cells are realized by complex gene expression programs that are moderated by regulatory proteins called transcription factors (TFs). The TFs control the differential expression of target genes in the context of transcriptional regulatory networks (TRNs), either individually or in groups. Deciphering the mechanisms by which the TFs control the differential expression of a target gene in a TRN is challenging, especially when multiple TFs collaboratively participate in the transcriptional regulation. To unravel the roles of the TFs in regulatory networks, we model the underlying regulatory interactions in terms of the directions of the TF-target interactions (activation or repression) and their corresponding logical roles (necessary and/or sufficient). We design a set of constraints that relate gene expression patterns to regulatory interaction models, and develop TRIM (Transcriptional Regulatory Interaction Model Inference), a new hidden Markov model, to infer the models of TF-target interactions in large-scale TRNs of complex organisms. Moreover, by training TRIM with wild-type time-series gene expression data, the activation timepoints of each regulatory module can be obtained. To demonstrate the advantages of TRIM, we applied it to the yeast TRN to infer the TF-target interaction models for individual TFs as well as for pairs of TFs in collaborative regulatory modules. By comparing with TF knockout and other gene expression data, we were able to show that the performance of TRIM is clearly higher than that of DREM (the best existing algorithm). In addition, on an individual Arabidopsis binding network, we showed that the target genes' expression correlations can be significantly improved by incorporating the TF-target regulatory interaction models inferred by TRIM into the expression data analysis, which may yield new knowledge about transcriptional dynamics and bioactivation.
Full-text · Article · Oct 2012 · Journal of Bioinformatics and Computational Biology
ABSTRACT: An increasing number of genes have been experimentally confirmed in recent years as causative genes of various human diseases. This newly available knowledge can be exploited by machine learning methods to discover additional unknown genes that are likely to be associated with diseases. In particular, positive unlabeled learning (PU learning) methods, which require only a positive training set P (confirmed disease genes) and an unlabeled set U (the unknown candidate genes) instead of a negative training set N, have been shown to be effective in uncovering new disease genes in the current scenario. However, using only a single source of data for prediction is susceptible to bias due to incompleteness and noise in the genomic data, and a single machine learning predictor is prone to bias caused by the inherent limitations of individual methods. In this paper, we propose an effective PU learning framework that integrates multiple biological data sources and an ensemble of powerful machine learning classifiers for disease gene identification. Our proposed method integrates data from multiple biological sources for training PU learning classifiers. A novel ensemble-based PU learning method, EPU, is then used to integrate multiple PU learning classifiers to achieve accurate and robust disease gene predictions. Our evaluation experiments across six disease groups showed that EPU achieved significantly better results than various state-of-the-art prediction methods as well as ensemble learning classifiers. By integrating multiple biological data sources for training and the outputs of an ensemble of PU learning classifiers for prediction, we are able to minimize the potential bias and errors in individual data sources and machine learning algorithms, and so achieve more accurate and robust disease gene predictions.
Going forward, our EPU method provides an effective framework for integrating additional biological and computational resources for better disease gene predictions.
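The PU learning setting described above, learning from a positive set P and an unlabeled set U without any labeled negatives, is often handled with a two-step scheme: first extract reliable negatives from U, then train an ordinary classifier. The sketch below is a deliberately minimal centroid-based version of that idea, not the paper's EPU ensemble; the function name, the distance-based ranking, and the `neg_frac` parameter are all assumptions for illustration.

```python
import numpy as np

def pu_two_step(P, U, neg_frac=0.3):
    """Minimal two-step PU learning sketch (not the paper's EPU method):
    (1) rank unlabeled samples by distance to the positive centroid and
    treat the farthest fraction as reliable negatives; (2) label all of
    U by nearest centroid.  Returns predicted labels (1 = positive)."""
    mu_p = P.mean(axis=0)
    d = np.linalg.norm(U - mu_p, axis=1)
    n_neg = max(1, int(neg_frac * len(U)))
    rn_idx = np.argsort(d)[-n_neg:]            # farthest from positives
    mu_n = U[rn_idx].mean(axis=0)              # reliable-negative centroid
    labels = (np.linalg.norm(U - mu_p, axis=1)
              < np.linalg.norm(U - mu_n, axis=1)).astype(int)
    return labels
```

An ensemble approach such as EPU would combine many such base PU classifiers, built over different data sources, to reduce the bias of any single predictor.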
ABSTRACT: Many real-world applications in time series classification fall into the class of positive and unlabeled (PU) learning. Furthermore, in many of these applications, not only are the negative examples absent, the positive examples available for learning can also be rather limited. As such, several PU learning algorithms for time series classification have recently been developed to learn from a small set P of labeled seed positive examples augmented with a set U of unlabeled examples. The key to these algorithms is to accurately identify the likely positive and negative examples in U, but this has remained a challenge, especially for the uncertain examples located near the class boundary. This paper presents a novel ensemble-based approach that restarts the detection phase several times to probabilistically label these uncertain examples more robustly, so that a reliable classifier can be built from the limited positive training examples. Experimental results on time series data from different domains demonstrate that the new method outperforms existing state-of-the-art methods significantly.
ABSTRACT: Appetitive operant conditioning of feeding behavior in Aplysia via electrical stimulation of the esophageal nerve contingently reinforces each spontaneous bite during the feeding process. This results in the acquisition of operant memory by the contingently reinforced animals. Analysis of the cellular and molecular mechanisms of the feeding motor circuitry revealed that activity-dependent neuronal modulation occurs at the interneurons that mediate feeding behaviors. This provides evidence that interneurons are possible loci of plasticity and constitute another mechanism for memory storage, in addition to memory storage attributed to activity-dependent synaptic plasticity. In this paper, an associative ambiguity correction-based neuro-fuzzy network, called appetitive reward-based pseudo-outer-product-compositional rule of inference [ARPOP-CRI(S)], is trained with an appetitive reward-based learning algorithm that is biologically inspired by the appetitive operant conditioning of the feeding behavior in Aplysia. A variant of the Hebbian learning rule called Hebbian concomitant learning is proposed as the building block of the neuro-fuzzy network learning algorithm. The proposed algorithm possesses the distinguishing features of a sequential learning algorithm. In addition, the proposed ARPOP-CRI(S) neuro-fuzzy system encodes fuzzy knowledge in the form of linguistic rules that satisfy the semantic criteria for low-level fuzzy model interpretability. ARPOP-CRI(S) is evaluated and compared against other modeling techniques using benchmark time-series datasets. Experimental results are encouraging and show that ARPOP-CRI(S) is a viable modeling technique for time-variant problem domains.
No preview · Article · Feb 2012 · IEEE transactions on neural networks and learning systems
ABSTRACT: We present a system called AssocExplorer to support exploratory data analysis via association rule visualization and exploration. AssocExplorer is designed following the visual information-seeking mantra: overview first, zoom and filter, then details on demand. It effectively uses coloring to deliver information so that users can easily detect things that are interesting to them. If users find a rule interesting, they can explore related rules for further analysis, which allows users to find interesting phenomena that are difficult to detect when rules are examined separately. Our system also allows users to compare rules and inspect rules with similar item composition but different statistics, so that the key factors contributing to the difference can be isolated.
ABSTRACT: Many biologically important protein-protein interactions (PPIs) have been found to be mediated by short linear motifs (SLiMs). These interactions are mediated by the binding of a protein domain, often with a nonlinear interaction interface, to a SLiM. We propose a method called D-SLIMMER to mine for SLiMs in PPI data on the basis of the interaction density between a nonlinear motif (i.e., a protein domain) in one protein and a SLiM in the other protein. Our results on a benchmark of 113 experimentally verified reference SLiMs showed that D-SLIMMER outperformed existing methods notably in discovering domain-SLiM interaction motifs. To illustrate the significance of the SLiMs detected, we highlighted two SLiMs discovered from the PPI data by D-SLIMMER that are variants of known ELM SLiMs, as well as a literature-backed SLiM that is yet to be listed in the reference databases. We also presented a novel SLiM predicted by D-SLIMMER that is strongly supported by the existing biological literature. These examples showed that D-SLIMMER is able to find SLiMs that are biologically relevant.
No preview · Article · Dec 2011 · Journal of Proteome Research
ABSTRACT: This paper presents a novel structure-preserving oversampling (SPO) technique for classifying imbalanced time series data. SPO generates synthetic minority samples based on a multivariate Gaussian distribution by estimating the covariance structure of the minority class and regularizing the unreliable eigen spectrum. By preserving the main covariance structure and intelligently creating protective variances in the trivial eigen feature dimensions, the synthetic samples expand effectively into the void area in the data space without being too closely tied to the existing minority-class samples. Extensive experiments based on several public time series datasets demonstrate that our proposed SPO, in conjunction with support vector machines, can achieve better performance than existing oversampling methods and state-of-the-art methods in time series classification.
ABSTRACT: Phenotypically similar diseases have been found to be caused by functionally related genes, suggesting a modular organization of the genetic landscape of human diseases that mirrors the modularity observed in biological interaction networks. Protein complexes, as molecular machines that integrate multiple gene products to perform biological functions, express the underlying modular organization of protein-protein interaction networks. As such, protein complexes can be useful for interrogating the networks of phenome and interactome to elucidate gene-phenotype associations of diseases.
We proposed a technique called RWPCN (Random Walker on Protein Complex Network) for predicting and prioritizing disease genes. The basis of RWPCN is a protein complex network constructed using existing human protein complexes and a protein interaction network. To prioritize candidate disease genes for the query disease phenotypes, we compute the associations between the protein complexes and the query phenotypes in their respective protein complex and phenotype networks. We tested RWPCN on predicting gene-phenotype associations using leave-one-out cross-validation; our method was observed to outperform existing approaches. We also applied RWPCN to predict novel disease genes for two representative diseases, namely, breast cancer and diabetes.
Guilt-by-association prediction and prioritization of disease genes can be enhanced by fully exploiting the underlying modular organizations of both the disease phenome and the protein interactome. Our RWPCN uses a novel protein complex network as a basis for interrogating the human phenome-interactome network. As the protein complex network can capture the underlying modularity in the biological interaction networks better than simple protein interaction networks, RWPCN was found to be able to detect and prioritize disease genes better than traditional approaches that used only protein-phenotype associations.
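Random-walk-based prioritization of the kind RWPCN builds on is typically implemented as a random walk with restart: a walker repeatedly diffuses over the network but returns to the seed (known disease) nodes with some probability, and the stationary distribution ranks candidates. The sketch below is a generic version of that propagation scheme under the stated assumptions (function name, restart probability, and column-normalization are illustrative, not the paper's exact formulation).

```python
import numpy as np

def random_walk_restart(A, seeds, restart=0.5, tol=1e-8):
    """Random walk with restart on adjacency matrix A (no isolated
    nodes assumed).  `seeds` are the indices of known disease genes;
    the returned stationary probabilities rank all nodes."""
    W = A / A.sum(axis=0, keepdims=True)   # column-normalize to a
                                           # transition matrix
    p0 = np.zeros(len(A))
    p0[seeds] = 1.0 / len(seeds)           # restart distribution
    p = p0.copy()
    while True:
        p_next = (1 - restart) * W @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
```

Running the walk on a protein complex network rather than the raw PPI network is what lets RWPCN exploit the modular organization the abstract describes.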
ABSTRACT: Many cellular functions involve protein complexes that are formed by multiple interacting proteins. Tandem Affinity Purification (TAP) is a popular experimental method for detecting such multi-protein interactions. However, current computational methods that predict protein complexes from TAP data require converting the co-complex relationships in TAP data into binary interactions. The resulting pairwise protein-protein interaction (PPI) network is then mined for densely connected regions that are identified as putative protein complexes. Converting the TAP data into PPI data not only introduces errors but also loses useful information about the underlying multi-protein relationships that can be exploited to detect the internal organization (i.e., core-attachment structures) of protein complexes. In this article, we propose a method called CACHET that detects protein complexes with Core-AttaCHment structures directly from bipartitE TAP data. CACHET models the TAP data as a bipartite graph in which the two vertex sets are the baits and the preys, respectively. The edges between the two vertex sets represent bait-prey relationships. CACHET first focuses on detecting high-quality protein-complex cores from the bipartite graph. To minimize the effects of false positive interactions, the bait-prey relationships are indexed with reliability scores. Only non-redundant, reliable bicliques computed from the TAP bipartite graph are regarded as protein-complex cores. CACHET then constructs protein complexes by including attachment proteins in the cores. We applied CACHET to large-scale TAP datasets and found that CACHET outperformed existing methods in terms of prediction accuracy (i.e., F-measure and functional homogeneity of predicted complexes).
In addition, the protein complexes predicted by CACHET are equipped with core-attachment structures that provide useful biological insights into the inherent functional organization of protein complexes. Our supplementary material can be found at http://www1.i2r.a-star.edu.sg/~xlli/CACHET/CACHET.htm ; binary executables can also be found there. Supplementary Material is also available at www.liebertonline.com/cmb .
Full-text · Article · Jul 2011 · Journal of computational biology: a journal of computational molecular cell biology
ABSTRACT: Hypothesis testing is a well-established tool for scientific discovery. Conventional hypothesis testing is carried out in a hypothesis-driven manner. A scientist must first formulate a hypothesis based on his/her knowledge and experience, and then devise a variety of experiments to test it. Given the rapid growth of data, it has become virtually impossible for a person to manually inspect all the data to find all the interesting hypotheses for testing. In this paper, we propose and develop a data-driven system for automatic hypothesis testing and analysis. We define a hypothesis as a comparison between two or more sub-populations. We find sub-populations for comparison using frequent pattern mining techniques and then pair them up for statistical testing. We also generate additional information for further analysis of the hypotheses that are deemed significant. We conducted a set of experiments to show the efficiency of the proposed algorithms, and the usefulness of the generated hypotheses. The results show that our system can help users (1) identify significant hypotheses; (2) isolate the reasons behind significant hypotheses; and (3) find confounding factors that form Simpson's Paradoxes with discovered significant hypotheses.
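Once two sub-populations have been paired up (in the system, via frequent pattern mining), the statistical test itself can be as simple as a two-proportion z-test on an outcome of interest. The sketch below shows that testing step only; the function name and the choice of a pooled two-proportion z-test are illustrative assumptions, since the abstract does not specify which tests the system applies.

```python
from math import erf, sqrt

def compare_subpopulations(k1, n1, k2, n2):
    """Test one 'hypothesis' in the paper's sense: compare the success
    rates of two sub-populations (k1 of n1 vs k2 of n2) with a pooled
    two-proportion z-test.  Returns (z, two-sided p-value)."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)                      # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))     # pooled std. error
    z = (p1 - p2) / se
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))        # standard normal CDF
    return z, 2 * (1 - phi)
```

A small p-value flags the sub-population pair as a significant hypothesis worth surfacing; checking the same comparison within strata of a third attribute is then how Simpson's Paradoxes are detected.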
ABSTRACT: People regularly attend various social events to interact with other community members. For example, researchers attend conferences to present their work and to network with other researchers. In this paper, we propose an Event-based COmmunity DEtection algorithm, ECODE, to mine the underlying community substructures of social networks from event information. Unlike conventional approaches, ECODE makes use of content similarity-based virtual links, which are found to be more useful for community detection than the physical links. By performing partial computation between an event and its candidate relevant set instead of computing pair-wise similarities between all the events, ECODE is able to achieve significant computational speedup. Extensive experimental results and comparisons with other existing methods showed that our ECODE algorithm is both efficient and effective in detecting communities from social networks.
Keywords: social network mining, community detection, virtual links
ABSTRACT: Question and answer pairs in Community Question Answering (CQA) services are organized into hierarchical structures, or taxonomies, to help users find the answers to their questions conveniently. We observe that different CQA services have their own knowledge focus and use different taxonomies to organize the question and answer pairs in their archives. As there are no simple semantic mappings between the taxonomies of the CQA services, the integration of CQA services is a challenging task. Existing approaches to integrating taxonomies ignore the hierarchical structure of the source taxonomy. In this paper, we propose a novel approach that is capable of incorporating the parent-child and sibling information in the hierarchical structure of the source taxonomy for accurate taxonomy integration. Our experimental results with real-world CQA data demonstrate that the proposed method significantly outperforms state-of-the-art methods.
ABSTRACT: In many real-world applications of the time series classification problem, not only may the negative training instances be missing, the number of positive instances available for learning may also be rather limited. This has motivated the development of new classification algorithms that can learn from a small set P of labeled seed positive instances augmented with a set U of unlabeled instances (i.e., PU learning algorithms). However, existing PU learning algorithms for time series classification have less than satisfactory performance, as they are unable to identify the class boundary between positive and negative instances accurately. In this paper, we propose a novel PU learning algorithm, LCLC (Learning from Common Local Clusters), for time series classification. LCLC is designed to identify the ground-truth boundary between positive and negative instances effectively, resulting in more accurate classifiers than those constructed using existing methods. We have applied LCLC to classify time series data from different application domains; the experimental results demonstrate that LCLC outperforms existing methods significantly.