ABSTRACT: Recommender systems can suffer from data sparsity and cold-start issues. However, social networks, which enable users to build relationships and create different types of items, present an unprecedented opportunity to alleviate these issues. In this paper, we represent a social network as a star-structured hybrid graph centered on a social domain, which connects with other item domains. With this innovative representation, useful knowledge from an auxiliary domain can be transferred through the social domain to a target domain. Various factors of item transferability, including popularity and behavioral consistency, are determined. We propose a novel Hybrid Random Walk (HRW) method, which incorporates such factors, to select transferable items in auxiliary domains, bridge cross-domain knowledge with the social domain, and accurately predict user-item links in a target domain. Extensive experiments on a real social dataset demonstrate that HRW significantly outperforms existing approaches.
IEEE Transactions on Knowledge and Data Engineering 11/2015; 27(11):1-1. DOI:10.1109/TKDE.2015.2432811 · 2.07 Impact Factor
ABSTRACT: Personalized predictive models are customized for an individual patient and trained using information from similar patients. Compared to global models trained on all patients, they have the potential to produce more accurate risk scores and capture more relevant risk factors for individual patients. This paper presents an approach for building personalized predictive models and generating personalized risk factor profiles. A locally supervised metric learning (LSML) similarity measure is trained for diabetes onset and used to find clinically similar patients. Personalized risk profiles are created by analyzing the parameters of the trained personalized logistic regression models. A 15,000-patient data set, derived from electronic health records, is used to evaluate the approach. The predictive results show that the personalized models can outperform the global model. Cluster analysis of the risk profiles shows groups of patients with similar risk factors, differences in the top risk factors for different groups of patients, and differences between the individual and global risk factors.
ABSTRACT: Drug-drug interaction (DDI) is an important topic for public health, and thus attracts attention from both academia and industry. Here we hypothesize that clinical side effects (SEs) provide a human phenotypic profile and can be translated into the development of computational models for predicting adverse DDIs. We propose an integrative label propagation framework to predict DDIs by integrating SEs extracted from package inserts of prescription drugs, SEs extracted from the FDA Adverse Event Reporting System, and chemical structures from PubChem. Experimental results based on hold-out validation demonstrated the effectiveness of the proposed algorithm. In addition, the new algorithm also ranked drug information sources based on their contributions to the prediction, thus not only confirming that SEs are important features for DDI prediction but also paving the way for building more reliable DDI prediction models by prioritizing multiple data sources. By applying the proposed algorithm to 1,626 small-molecule drugs which have one or more SE profiles, we obtained 145,068 predicted DDIs. The predicted DDIs will help clinicians avoid hazardous drug interactions in their prescriptions and will help pharmaceutical companies design large-scale clinical trials by assessing potentially hazardous drug combinations. All data sets and predicted DDIs are available at http://astro.temple.edu/~tua87106/ddi.html.
ABSTRACT: Data mining and healthcare. Healthcare is a field that is closely related to everyone’s daily life. Because of the high complexity of the healthcare industry, a huge amount of money is wasted every year. In recent years, researchers from different areas have entered the healthcare world with the hope of helping to reduce the cost and improve the quality of care delivery. Among all those emerging trends, data-driven technologies have received a lot of attention due to the availability of more and more healthcare data. Data-driven healthcare is at the center of the vision of learning health systems and holds great promise for transforming the current healthcare status. The goal of this special issue is to present novel approaches in the field of data mining that can be applied in different fields of healthcare. The vast majority of the papers presented in this special issue present novel methods that were empirically evaluated on medical datasets that are much larger than datasets we were used to ...
Data Mining and Knowledge Discovery 07/2015; 29(4). DOI:10.1007/s10618-015-0414-1 · 1.99 Impact Factor
ABSTRACT: The Nyström method approximates eigenvectors of a given kernel matrix by randomly sampling a subset of the data. Previous research focuses on good kernel approximation, while the quality of eigenvector approximation is rarely explored. In online eigenvector approximation methods, one can minimize the kernel approximation error to guarantee a good eigenvector approximation. However, in this work we prove, paradoxically, that for batch approximation methods like Nyström this is no longer true. This unexpected discovery raises a question: what criterion should we use in Nyström to generate a decent eigenvector approximation? To address this problem, we propose a novel criterion named the Hilbert Space Embedding (HSE) Nyström criterion, which directly minimizes the eigenvector approximation error. The proposed HSE criterion provides a general framework to approximate eigenvectors within linear time and space complexity. We then show that we can rediscover many successful Nyström methods with the proposed criterion, including K-means Nyström and Density Nyström. To further demonstrate the power of our criterion, we design a novel algorithm, based on the proposed criterion, to approximate eigenvectors of Laplacian matrices with better accuracy than existing linear-complexity methods. We demonstrate the efficiency and efficacy of our proposal in numerical experiments.
ABSTRACT: Cascades are ubiquitous in various network environments. Predicting these cascades is highly nontrivial in several vital applications, such as viral marketing, epidemic prevention and traffic management. Most previous work focuses on predicting the final cascade sizes. As cascades are typical dynamic processes, it is always interesting and important to predict the cascade size at any time, or to predict the time when a cascade will reach a certain size (e.g. a threshold for outbreak). In this paper, we unify all these tasks into a fundamental problem: cascading process prediction. That is, given the early stage of a cascade, how can we predict its cumulative cascade size at any later time? For such a challenging problem, understanding the micro mechanism that drives and generates the macro phenomena (i.e. cascading processes) is essential. Here we introduce behavioral dynamics as the micro mechanism to describe the dynamic process by which a node's neighbors get infected by a cascade after this node gets infected (i.e. one-hop subcascades). Through data-driven analysis, we identify the common principles and patterns underlying behavioral dynamics and propose a novel Networked Weibull Regression model for behavioral dynamics modeling. We then propose a novel method for predicting cascading processes by effectively aggregating behavioral dynamics, and propose a scalable solution to approximate the cascading process with a theoretical guarantee. We extensively evaluate the proposed method on a large-scale social network dataset. The results demonstrate that the proposed method can significantly outperform other state-of-the-art baselines in multiple tasks including cascade size prediction, outbreak time prediction and cascading
ABSTRACT: A care/clinical pathway (CP) is a standardized care process where temporal and data constraints of clinical activities are defined to ensure quality of care. In actual care practice, various situations of compliance and non-compliance with CPs can be observed. Analysis of these CP variation patterns (CPVPs) can help improve care quality and enhance decision support. In this paper, we propose an automatic method to detect CPVPs in electronic medical records (EMR), and statistically examine their correlation with patient outcomes. From each CP constraint, we first derive a CPVP tree, where each pattern is represented using first-order linear temporal logic and translated into a Büchi automaton for pattern detection. Then we identify the CPVPs that are evidently correlated with a patient outcome by examining the odds ratios. The method has been applied to a CP for congestive heart failure and real-world EMR to demonstrate its effectiveness.
Studies in health technology and informatics 05/2015; 210:692-6. DOI:10.3233/978-1-61499-512-8-692
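Illustrative note on the odds-ratio screening mentioned in the abstract above: the following is a minimal sketch, not the authors' implementation. It assumes a 2x2 contingency table (pattern present or absent versus outcome present or absent), a Wald confidence interval on the log scale, and a Haldane-Anscombe correction for empty cells. The function name `odds_ratio` and the example counts are hypothetical.

```python
import math


def odds_ratio(pattern_outcome, pattern_no_outcome,
               no_pattern_outcome, no_pattern_no_outcome, z=1.96):
    """Return (odds ratio, lower bound, upper bound) for a 2x2 table.

    The interval is a Wald interval computed on the log-odds scale;
    z=1.96 corresponds to an approximate 95% confidence level.
    """
    a, b, c, d = (float(x) for x in (pattern_outcome, pattern_no_outcome,
                                     no_pattern_outcome, no_pattern_no_outcome))
    if min(a, b, c, d) == 0:
        # Haldane-Anscombe correction (an assumption of this sketch):
        # add 0.5 to every cell so the ratio and its variance stay finite.
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    ratio = (a * d) / (b * c)
    std_err = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ratio) - z * std_err)
    upper = math.exp(math.log(ratio) + z * std_err)
    return ratio, lower, upper


# Hypothetical counts: 30 of 100 patients showing a variation pattern had the
# outcome, versus 10 of 100 patients without the pattern.
ratio, lower, upper = odds_ratio(30, 70, 10, 90)
print(f"OR={ratio:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

A pattern would typically be flagged as correlated with the outcome when the whole interval lies above (or below) 1; the threshold used in the paper is not stated in the abstract.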
ABSTRACT: Care pathways play significant roles in delivering evidence-based and coordinated care to patients with specific conditions. In order to put care pathways into practice, clinical institutions always need to adapt them based on local care settings so that the best local practices can be incorporated and used to develop refined pathways. However, it is knowledge-intensive and error-prone to incorporate various analytic insights from local data sets. In order to assist care pathway developers in working effectively and efficiently, we propose to automatically synthesize the analytical evidence derived from multiple analysis methods, and recommend modelling operations accordingly to derive a refined care pathway for a specific patient cohort. We validated our method by adapting a Congestive Heart Failure (CHF) Ambulatory Care Pathway for patients with the additional condition of COPD through synthesizing the results of variation analysis and frequent pattern mining against patient records.
Studies in health technology and informatics 05/2015; 210:70-4.
ABSTRACT: Patient similarity is an important analytic operation in healthcare applications. At its core, patient similarity takes an index patient as the input and retrieves a ranked list of similar patients that are relevant in a specific clinical context. It takes patient information such as electronic health records (EHR) as input and computes the distance between a pair of patients based on that information. To construct a clinically valid similarity measure, physician input often needs to be incorporated. However, obtaining physicians' input is difficult and expensive. As a result, typically only limited physician feedback can be obtained on a small portion of patients. How can we leverage all unlabelled patient data and limited supervision information from physicians to construct a clinically meaningful distance metric? In this paper, we present a Patient Similarity Framework (PSF) that unifies and significantly extends existing supervised patient similarity metric learning methods. PSF is a general framework that can learn an appropriate distance metric through supervised and unsupervised information. Within the PSF framework, we propose a novel patient similarity algorithm that uses local spline regression to capture the unsupervised information. To speed up the incorporation of physician feedback or newly available clinical information, we introduce a general online update algorithm for an existing PSF distance metric.
ABSTRACT: Regulations and privacy concerns often hinder the exchange of healthcare data between hospitals or other healthcare providers. Sharing predictive models built on original data and averaging their results offers an alternative path to more efficient prediction of outcomes on new cases. Although one can choose from many techniques to combine outputs from different predictive models, it is difficult to find studies that try to interpret the results obtained from ensemble-learning methods.
We propose a novel approach to classification based on models from different hospitals that allows a high level of performance along with comprehensibility of obtained results. Our approach is based on regularized sparse regression models in two hierarchical levels and exploits the interpretability of obtained regression coefficients to rank the contribution of hospitals in terms of outcome prediction.
The proposed approach was used to predict 30-day all-cause readmissions for pediatric patients in 54 Californian hospitals. Using repeated holdout evaluation, including more than 60,000 hospital discharge records, we compared the proposed approach to alternative approaches. The performance of the two-level classification model was measured using the Area Under the ROC Curve (AUC) with an additional evaluation that uncovered the importance and contribution of each single data source (i.e. hospital) to the final result. The results for the best distributed model (AUC=0.787, 95% CI: 0.780-0.794) demonstrate no significant difference in terms of AUC performance when compared to a single elastic net model built on all available data (AUC=0.789, 95% CI: 0.781-0.796).
This paper presents a novel approach to improved classification with shared predictive models for environments where centralized collection of data is not possible. The significant improvements in classification performance and interpretability of results demonstrate the effectiveness of our approach.
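Illustrative note on the AUC metric used in the abstract above: the sketch below computes the Area Under the ROC Curve from its pairwise (Mann-Whitney) definition, i.e. the fraction of positive-negative pairs in which the positive case receives the higher score, with ties counted as one half. It is a generic illustration, not the evaluation code of the study; the function name `auc` and the toy labels and scores are hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 outcomes (1 = positive, e.g. readmitted).
    scores: iterable of predicted risks, same length as labels.
    Quadratic in the number of cases, which is fine for a small example.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy data: three positive and two negative cases.
print(auc([1, 0, 1, 0, 1], [0.9, 0.8, 0.7, 0.3, 0.2]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for the reported values near 0.79.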
ABSTRACT: Exponential growth of information generated by online social networks demands effective and scalable recommender systems to give useful results. Traditional techniques fall short because they ignore social relation data; existing social recommendation approaches consider social network structure, but social contextual information has not been fully considered. It is significant and challenging to fuse social contextual factors, which are derived from users' motivation for social behaviors, into social recommendation. In this paper, we investigate the social recommendation problem on the basis of psychology and sociology studies, which exhibit two important factors: individual preference and interpersonal influence. We first present the particular importance of these two factors in online behavior prediction. Then we propose a novel probabilistic matrix factorization method to fuse them in latent space. We further provide a scalable algorithm which can incrementally process large-scale data. We conduct experiments on both Facebook-style bidirectional and Twitter-style unidirectional social network data sets. The empirical results and analysis on these two large data sets demonstrate that our method significantly outperforms the existing approaches.
IEEE Transactions on Knowledge and Data Engineering 11/2014; 26(11):2789-2802. DOI:10.1109/TKDE.2014.2300487 · 2.07 Impact Factor
ABSTRACT: Care pathways (CPs) as a means of healthcare quality control are getting increasing attention due to widespread recognition in the healthcare industry of the need for well-coordinated, evidence-based and personalized care. To keep this promise, CPs require continuous refinement in order to stay up to date with regard to both clinical guidelines and data-driven insights from real-world practices. There is therefore a strong demand for a unified platform that allows harmonization of evidence coming from multiple sources. In this paper we describe Care Pathway Workbench, a web-based platform that enables users to build and continuously improve Case Management Model and Notation based CPs by harmonizing evidence from guidelines and patient data. To illustrate the functionalities, we describe how a CHF (Congestive Heart Failure) Ambulatory Care Pathway can be developed using this workbench by first extracting key elements from widely accepted guidelines for CHF management, then incorporating evidence mined from clinical practice data, and finally transforming and exporting the resulting CP model to a care management product.
Studies in health technology and informatics 08/2014; 205:23-27.
ABSTRACT: Inferring phenotypic patterns from population-scale clinical data is a core computational task in the development of personalized medicine. One important source of data on which to conduct this type of research is patient Electronic Medical Records (EMR). However, patient EMRs are typically sparse and noisy, which creates significant challenges if we use them directly to represent patient phenotypes. In this paper, we propose a data-driven phenotyping framework called Pacifier (PAtient reCord densIFIER), where we interpret the longitudinal EMR data of each patient as a sparse matrix with a feature dimension and a time dimension, and derive more robust patient phenotypes by exploring the latent structure of those matrices. Specifically, we assume that each derived phenotype is composed of a subset of the medical features contained in the original patient EMR, whose value evolves smoothly over time. We propose two formulations to achieve this goal. One is the Individual Basis Approach (IBA), which assumes the phenotypes are different for every patient. The other is the Shared Basis Approach (SBA), which assumes the patient population shares a common set of phenotypes. We develop an optimization algorithm that solves both problems efficiently. Finally we validate Pacifier on two real-world EMR cohorts for the tasks of early prediction of Congestive Heart Failure (CHF) and End Stage Renal Disease (ESRD). Our results show that the predictive performance in both tasks can be improved significantly by the proposed algorithms (average AUC score improved from 0.689 to 0.816 on CHF, and from 0.756 to 0.838 on ESRD respectively, on diagnosis group granularity). We also illustrate some interesting phenotypes derived from our data.
ABSTRACT: Logistic regression is one core predictive modeling technique that has been used extensively in health and biomedical problems. Recently a lot of research has focused on enforcing sparsity on the learned model to enhance its effectiveness and interpretability, which results in sparse logistic regression models. However, both the original and the sparse logistic regression require the inputs to be in vector form. This limits the applicability of logistic regression in problems where the data cannot be naturally represented as vectors (e.g., functional magnetic resonance imaging and electroencephalography signals). To handle the cases when the data are in the form of multi-dimensional arrays, we propose MulSLR: Multilinear Sparse Logistic Regression. MulSLR can be viewed as a high-order extension of sparse logistic regression. Instead of solving for one classification vector as in conventional logistic regression, we solve for K classification vectors in MulSLR (K is the number of modes in the data). We propose a block proximal descent approach to solve the problem and prove its convergence. The convergence rate of the proposed algorithm is also analyzed. Finally we validate the efficiency and effectiveness of MulSLR on predicting the onset risk of patients with Alzheimer's disease and heart failure.
ABSTRACT: Behavioral pattern discovery is increasingly being studied to understand human behavior, and the discovered patterns can be used in many real-world applications such as web search, recommender systems and advertisement targeting. Traditional methods usually consider the behaviors as simple user and item connections, or represent them with a static model. In the real world, however, human behaviors are complex and dynamic: they include correlations between users and multiple types of objects and also continuously evolve over time. These characteristics cause severe data sparsity and computational complexity problems, which pose great challenges to human behavioral analysis and prediction. In this paper, we propose a Flexible Evolutionary Multi-faceted Analysis (FEMA) framework for both behavior prediction and pattern mining. FEMA utilizes a flexible and dynamic factorization scheme for analyzing human behavioral data sequences, which can incorporate various knowledge embedded in different object domains to alleviate the sparsity problem. We give approximation algorithms for efficiency, where the bound of approximation loss is theoretically proved. We extensively evaluate the proposed method on two real datasets. For the prediction of human behaviors, the proposed FEMA significantly outperforms other state-of-the-art baseline methods by 17.4%. Moreover, FEMA is able to discover quite a number of interesting multi-faceted temporal patterns in human behaviors with good interpretability. More importantly, it can reduce the run time from hours to minutes, which is significant for industry to serve real-time applications.
ABSTRACT: Social media allow people to express and promote different opinions, and people’s sentiments toward a subject often diverge when their opinions conflict. An intuitive visualization that unfolds the process of sentiment divergence from the rich and massive social media data will have far-reaching impact on various domains including social science, politics and economics. In this paper, we propose a visual analysis system, SocialHelix, to achieve this goal. SocialHelix is a novel visual design which enables users to detect and trace topics and events occurring in social media, and to understand when and why divergences occurred and how they evolved among different social groups. We demonstrate the effectiveness and usefulness of SocialHelix by conducting in-depth case studies on tweets related to the national political debates.
Journal of Visualization 05/2014; 18(2):221-235. DOI:10.1007/s12650-014-0246-x · 0.58 Impact Factor
ABSTRACT: The rapid adoption of electronic health records (EHR) provides a comprehensive source for exploratory and predictive analytics to support clinical decision-making. In this paper, we investigate how to utilize EHR to tailor treatments to individual patients based on their likelihood to respond to a therapy. We construct a heterogeneous graph which includes two domains (patients and drugs) and encodes three relationships (patient similarity, drug similarity, and patient-drug prior associations). We describe a novel approach for performing a label propagation procedure to spread the label information representing the effectiveness of different drugs for different patients over this heterogeneous graph. The proposed method has been applied to a real-world EHR dataset to help identify personalized treatments for hypercholesterolemia. The experimental results demonstrate the effectiveness of the approach and suggest that the combination of appropriate patient similarity and drug similarity analytics could lead to actionable insights for personalized medicine. In particular, by leveraging drug similarity in combination with patient similarity, our method could perform well even on new or rarely used drugs for which there are few records of known past performance.
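Illustrative note on label propagation as mentioned in the abstract above: the sketch below shows the generic iterative scheme on a small homogeneous graph, where each node repeatedly mixes its neighbours' scores with its own initial label. It is a simplification under stated assumptions and not the heterogeneous patient-drug method of the paper; the function name `propagate_labels`, the mixing weight `alpha`, and the four-node chain are all hypothetical.

```python
def propagate_labels(weights, seeds, alpha=0.8, iterations=100):
    """Spread seed labels over a weighted graph.

    weights: square list of lists, weights[i][j] = similarity of i and j.
    seeds:   dict mapping node index to an initial label score.
    alpha:   share of each update taken from neighbours (assumed value);
             the remaining 1 - alpha pulls a node back to its seed label.
    """
    n = len(weights)
    # Row-normalise so each node averages over its neighbours.
    normalised = []
    for row in weights:
        total = sum(row)
        normalised.append([w / total if total else 0.0 for w in row])
    initial = [float(seeds.get(i, 0.0)) for i in range(n)]
    scores = list(initial)
    for _ in range(iterations):
        scores = [
            alpha * sum(normalised[i][j] * scores[j] for j in range(n))
            + (1 - alpha) * initial[i]
            for i in range(n)
        ]
    return scores


# Hypothetical chain 0 - 1 - 2 - 3: node 0 is a known responder (+1),
# node 3 a known non-responder (-1); nodes 1 and 2 are unlabelled.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = propagate_labels(chain, {0: 1.0, 3: -1.0})
print([round(s, 3) for s in scores])
```

Unlabelled nodes end up with the sign of the nearer seed, which is the behaviour that lets similarity links carry information to entities with few direct observations.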