Fei Wang

IBM, Armonk, New York, United States

Publications (146) · 92.52 Total impact

  • Fei Wang, Jimeng Sun
    ABSTRACT: Patient similarity is an important analytic operation in healthcare applications. At the core, patient similarity takes an index patient as the input and retrieves a ranked list of similar patients that are relevant in a specific clinical context. It takes patient information such as their electronic health records (EHR) as input and computes the distance between a pair of patients based on that information. To construct a clinically valid similarity measure, physician input often needs to be incorporated. However, obtaining physicians' input is difficult and expensive. As a result, typically only limited physician feedback can be obtained on a small portion of patients. How can all unlabelled patient data and limited supervision information from physicians be leveraged to construct a clinically meaningful distance metric? In this paper, we present a Patient Similarity Framework (PSF) that unifies and significantly extends existing supervised patient similarity metric learning methods. PSF is a general framework that can learn an appropriate distance metric through supervised and unsupervised information. Within the PSF framework, we propose a novel patient similarity algorithm that uses local spline regression to capture the unsupervised information. To speed up the incorporation of physician feedback or newly available clinical information, we introduce a general online update algorithm for an existing PSF distance metric.
    04/2015; 19(3). DOI:10.1109/JBHI.2015.2425365
  • Fei Wang
    ABSTRACT: With the rapid development of information technologies, a tremendous amount of data has become readily available in various application domains. This big data era presents challenges to many conventional data analytics research directions including data capture, storage, search, sharing, analysis, and visualization. It is no surprise to see that the success of next-generation healthcare systems heavily relies on the effective utilization of gigantic amounts of medical data. The ability to analyze big data in modern healthcare systems plays a vital role in the improvement of the quality of care delivery. Specifically, patient similarity evaluation aims at estimating the clinical affinity and diagnostic proximity of patients. As one of the successful data-driven techniques adopted in healthcare systems, patient similarity evaluation plays a fundamental role in many healthcare research areas such as prognosis, risk assessment, and comparative effectiveness analysis. However, existing algorithms for patient similarity evaluation are inefficient in handling massive patient data. In this paper, we propose an Adaptive Semi-Supervised Recursive Tree Partitioning (ART) framework for large scale patient indexing such that the patients with similar clinical or diagnostic patterns can be correctly and efficiently retrieved. The framework is designed for semi-supervised settings since it is crucial to leverage experts' supervision knowledge in medical scenarios, which is fairly limited compared to the available data. Starting from the proposed ART framework, we will discuss several specific instantiations and validate them on both benchmark and real world healthcare data. Our results show that with the ART framework, the patients can be efficiently and effectively indexed in the sense that (1) similar patients can be retrieved in a very short time; (2) the retrieval performance can beat the state-of-the-art indexing methods. Copyright © 2015. Published by Elsevier Inc.
    Journal of Biomedical Informatics 02/2015; 55. DOI:10.1016/j.jbi.2015.01.009 · 2.48 Impact Factor
  • ABSTRACT: Care pathways play significant roles in delivering evidence-based and coordinated care to patients with specific conditions. In order to put care pathways into practice, clinical institutions always need to adapt them based on local care settings so that the best local practices can be incorporated and used to develop refined pathways. However, it is knowledge-intensive and error-prone to incorporate various analytic insights from local data sets. In order to assist care pathway developers in working effectively and efficiently, we propose to automatically synthesize the analytical evidence derived from multiple analysis methods, and recommend modelling operations accordingly to derive a refined care pathway for a specific patient cohort. We validated our method by adapting a Congestive Heart Failure (CHF) Ambulatory Care Pathway for patients with the additional condition of COPD through synthesizing the results of variation analysis and frequent pattern mining against patient records.
    Studies in health technology and informatics 01/2015; 210:70-4.
  • ABSTRACT: A care/clinical pathway (CP) is a standardized care process where temporal and data constraints of clinical activities are defined to ensure quality of care. In actual care practice, various situations of compliance and non-compliance with CPs can be observed. Analysis of these CP variation patterns (CPVPs) can help improve care quality and enhance decision support. In this paper, we propose an automatic method to detect CPVPs in electronic medical records (EMR), and statistically examine their correlation with patient outcomes. From each CP constraint, we first derive a CPVP tree, where each pattern is represented using first-order linear temporal logic and translated into a Büchi automaton for pattern detection. Then we identify the CPVPs that are evidently correlated with a patient outcome by examining the odds ratios. The method has been applied to a CP for congestive heart failure and real-world EMR data to demonstrate its effectiveness.
    Studies in health technology and informatics 01/2015; 210:692-6.
  • ABSTRACT: Data mining and healthcare. Healthcare is a field that is closely related to everyone’s daily life. Because of the high complexity of the healthcare industry, every year a huge amount of money is wasted. In recent years researchers from different areas went into the healthcare world with the hope of helping to reduce the cost and improve the quality of care delivery. Among all those emerging trends, data-driven technologies have received a lot of attention due to the availability of more and more healthcare data. Data-driven healthcare is at the center of the vision of learning health systems and holds great promise for transforming the current healthcare status. The goal of this special issue is to present novel approaches in the field of data mining that can be applied in different fields of healthcare. The vast majority of the papers presented in this special issue present novel methods that were empirically evaluated on medical datasets that are much larger than datasets we were used to ...
    Data Mining and Knowledge Discovery 01/2015; DOI:10.1007/s10618-015-0414-1 · 1.74 Impact Factor
  • ABSTRACT: Regulations and privacy concerns often hinder exchange of healthcare data between hospitals or other healthcare providers. Sharing predictive models built on original data and averaging their results offers an alternative to more efficient prediction of outcomes on new cases. Although one can choose from many techniques to combine outputs from different predictive models, it is difficult to find studies that try to interpret the results obtained from ensemble-learning methods. We propose a novel approach to classification based on models from different hospitals that allows a high level of performance along with comprehensibility of obtained results. Our approach is based on regularized sparse regression models in two hierarchical levels and exploits the interpretability of obtained regression coefficients to rank the contribution of hospitals in terms of outcome prediction. The proposed approach was used to predict the 30-day all-cause readmissions for pediatric patients in 54 Californian hospitals. Using repeated holdout evaluation, including more than 60,000 hospital discharge records, we compared the proposed approach to alternative approaches. The performance of the two-level classification model was measured using the Area Under the ROC Curve (AUC) with an additional evaluation that uncovered the importance and contribution of each single data source (i.e. hospital) to the final result. The results for the best distributed model (AUC=0.787, 95% CI: 0.780-0.794) demonstrate no significant difference in terms of AUC performance when compared to a single elastic net model built on all available data (AUC=0.789, 95% CI: 0.781-0.796). This paper presents a novel approach to improved classification with shared predictive models for environments where centralized collection of data is not possible. The significant improvements in classification performance and interpretability of results demonstrate the effectiveness of our approach.
    2014 AMIA Annual Symposium, Washington, DC; 11/2014
  • ABSTRACT: Exponential growth of information generated by online social networks demands effective and scalable recommender systems to give useful results. Traditional techniques become unqualified because they ignore social relation data; existing social recommendation approaches consider social network structure, but social contextual information has not been fully considered. It is significant and challenging to fuse social contextual factors which are derived from users' motivation of social behaviors into social recommendation. In this paper, we investigate the social recommendation problem on the basis of psychology and sociology studies, which exhibit two important factors: individual preference and interpersonal influence. We first present the particular importance of these two factors in online behavior prediction. Then we propose a novel probabilistic matrix factorization method to fuse them in latent space. We further provide a scalable algorithm which can incrementally process the large scale data. We conduct experiments on both Facebook-style bidirectional and Twitter-style unidirectional social network data sets. The empirical results and analysis on these two large data sets demonstrate that our method significantly outperforms the existing approaches.
    IEEE Transactions on Knowledge and Data Engineering 11/2014; 26(11):2789-2802. DOI:10.1109/TKDE.2014.2300487 · 1.82 Impact Factor
  • ABSTRACT: Risk stratification is instrumental to modern clinical decision support systems. Comprehensive risk stratification should be able to provide the clinicians with not only the accurate assessment of a patient's risk but also the clinical context to be acted upon. However, existing risk stratification techniques mainly focus on predicting the risk score for individual patients; at the cohort level, they offer little insight beyond a flat score-based segmentation. This essentially reduces a patient to a score and thus removes him/her from his/her clinical context. To address this limitation, in this paper we propose a bilinear model for risk stratification that simultaneously captures the three key aspects of risk stratification: (1) it predicts the risk of each individual patient; (2) it stratifies the patient cohort based on not only the risk score but also the clinical characteristics; and (3) it embeds all patients into clinical contexts with clear interpretation. We apply our model to a cohort of 4977 patients, 1127 of whom were diagnosed with Congestive Heart Failure (CHF). We demonstrate that our model can not only accurately predict the onset risk of CHF but also provide rich and actionable clinical insights into the patient cohort. Copyright © 2014 Elsevier Inc. All rights reserved.
    Journal of Biomedical Informatics 10/2014; 53. DOI:10.1016/j.jbi.2014.10.004 · 2.48 Impact Factor
  • ABSTRACT: Behavioral pattern discovery is increasingly being studied to understand human behavior, and the discovered patterns can be used in many real world applications such as web search, recommender systems and advertisement targeting. Traditional methods usually consider the behaviors as simple user and item connections, or represent them with a static model. In the real world, however, human behaviors are actually complex and dynamic: they include correlations between users and multiple types of objects and also continuously evolve over time. These characteristics cause severe data sparsity and computational complexity problems, which pose great challenges to human behavioral analysis and prediction. In this paper, we propose a Flexible Evolutionary Multi-faceted Analysis (FEMA) framework for both behavior prediction and pattern mining. FEMA utilizes a flexible and dynamic factorization scheme for analyzing human behavioral data sequences, which can incorporate various knowledge embedded in different object domains to alleviate the sparsity problem. We give approximation algorithms for efficiency, where the bound of approximation loss is theoretically proved. We extensively evaluate the proposed method on two real datasets. For the prediction of human behaviors, the proposed FEMA significantly outperforms other state-of-the-art baseline methods by 17.4%. Moreover, FEMA is able to discover quite a number of interesting multi-faceted temporal patterns on human behaviors with good interpretability. More importantly, it can reduce the run time from hours to minutes, which is significant for industry to serve real-time applications.
  • ABSTRACT: Inferring phenotypic patterns from population-scale clinical data is a core computational task in the development of personalized medicine. One important source of data on which to conduct this type of research is patient Electronic Medical Records (EMR). However, the patient EMRs are typically sparse and noisy, which creates significant challenges if we use them directly to represent patient phenotypes. In this paper, we propose a data-driven phenotyping framework called Pacifier (PAtient reCord densIFIER), where we interpret the longitudinal EMR data of each patient as a sparse matrix with a feature dimension and a time dimension, and derive more robust patient phenotypes by exploring the latent structure of those matrices. Specifically, we assume that each derived phenotype is composed of a subset of the medical features contained in the original patient EMR, whose value evolves smoothly over time. We propose two formulations to achieve this goal. One is the Individual Basis Approach (IBA), which assumes the phenotypes are different for every patient. The other is the Shared Basis Approach (SBA), which assumes the patient population shares a common set of phenotypes. We develop an efficient optimization algorithm that is capable of resolving both problems efficiently. Finally we validate Pacifier on two real world EMR cohorts for the tasks of early prediction of Congestive Heart Failure (CHF) and End Stage Renal Disease (ESRD). Our results show that the predictive performance in both tasks can be improved significantly by the proposed algorithms (average AUC score improved from 0.689 to 0.816 on CHF, and from 0.756 to 0.838 on ESRD respectively, on diagnosis group granularity). We also illustrate some interesting phenotypes derived from our data.
  • ABSTRACT: Logistic regression is one core predictive modeling technique that has been used extensively in health and biomedical problems. Recently a lot of research has been focusing on enforcing sparsity on the learned model to enhance its effectiveness and interpretability, which results in the sparse logistic regression model. However, both the original and the sparse logistic regression require the inputs to be in vector form. This limits the applicability of logistic regression in problems where the data cannot be naturally represented as vectors (e.g., functional magnetic resonance imaging and electroencephalography signals). To handle the cases when the data are in the form of multi-dimensional arrays, we propose MulSLR: Multilinear Sparse Logistic Regression. MulSLR can be viewed as a high order extension of sparse logistic regression. Instead of solving one classification vector as in conventional logistic regression, we solve for K classification vectors in MulSLR (K is the number of modes in the data). We propose a block proximal descent approach to solve the problem and prove its convergence. The convergence rate of the proposed algorithm is also analyzed. Finally we validate the efficiency and effectiveness of MulSLR on predicting the onset risk of patients with Alzheimer's disease and heart failure.
  • ABSTRACT: Social media allow people to express and promote different opinions, on which people’s sentiments toward a subject often diverge when their opinions conflict. An intuitive visualization that unfolds the process of sentiment divergence from the rich and massive social media data will have far-reaching impact on various domains including social science, politics and economics. In this paper, we propose a visual analysis system, SocialHelix, to achieve this goal. SocialHelix is a novel visual design which enables the users to detect and trace topics and events occurring in social media, and to understand when and why divergences occurred and how they evolved among different social groups. We demonstrate the effectiveness and usefulness of SocialHelix by conducting in-depth case studies on tweets related to the national political debates.
    Journal of Visualization 05/2014; 18(2):221-235. DOI:10.1007/s12650-014-0246-x · 0.59 Impact Factor
  • ABSTRACT: The rapid adoption of electronic health records (EHR) provides a comprehensive source for exploratory and predictive analytics to support clinical decision-making. In this paper, we investigate how to utilize EHR to tailor treatments to individual patients based on their likelihood to respond to a therapy. We construct a heterogeneous graph which includes two domains (patients and drugs) and encodes three relationships (patient similarity, drug similarity, and patient-drug prior associations). We describe a novel approach for performing a label propagation procedure to spread the label information representing the effectiveness of different drugs for different patients over this heterogeneous graph. The proposed method has been applied on a real-world EHR dataset to help identify personalized treatments for hypercholesterolemia. The experimental results demonstrate the effectiveness of the approach and suggest that the combination of appropriate patient similarity and drug similarity analytics could lead to actionable insights for personalized medicine. Particularly, by leveraging drug similarity in combination with patient similarity, our method could perform well even on new or rarely used drugs for which there are few records of known past performance.
    04/2014; 2014:132-6.
  • David Gotz, Fei Wang, Adam Perer
    ABSTRACT: Patients’ medical conditions often evolve in complex and seemingly unpredictable ways. Even within a relatively narrow and well-defined episode of care, variations between patients in both their progression and eventual outcome can be dramatic. Understanding the patterns of events observed within a population that most correlate with differences in outcome is therefore an important task in many types of studies using retrospective electronic health data. In this paper, we present a method for interactive pattern mining and analysis that supports ad hoc visual exploration of patterns mined from retrospective clinical patient data. Our approach combines (1) visual query capabilities to interactively specify episode definitions, (2) pattern mining techniques to help discover important intermediate events within an episode, and (3) interactive visualization techniques that help uncover event patterns that most impact outcome and how those associations change over time. In addition to presenting our methodology, we describe a prototype implementation and present use cases highlighting the types of insights or hypotheses that our approach can help uncover.
    Journal of Biomedical Informatics 04/2014; 48. DOI:10.1016/j.jbi.2014.01.007 · 2.48 Impact Factor
  • Fei Wang, Jimeng Sun
    ABSTRACT: Distance metric learning is a fundamental problem in data mining and knowledge discovery. Many representative data mining algorithms, such as the k-nearest neighbor classifier, hierarchical clustering and spectral clustering, heavily rely on the underlying distance metric for correctly measuring relations among input data. In recent years, many studies have demonstrated, either theoretically or empirically, that learning a good distance metric can greatly improve the performance of classification, clustering and retrieval tasks. In this survey, we overview existing distance metric learning approaches according to a common framework. Specifically, depending on the available supervision information during the distance metric learning process, we categorize each distance metric learning algorithm as supervised, unsupervised or semi-supervised. We compare those different types of metric learning methods and point out their strengths and limitations. Finally, we summarize open challenges in distance metric learning and propose future directions for distance metric learning.
    Data Mining and Knowledge Discovery 03/2014; 29(2):534-564. DOI:10.1007/s10618-014-0356-z · 1.74 Impact Factor
  • Adam Perer, Fei Wang
    ABSTRACT: Extracting insights from temporal event sequences is an important challenge. In particular, mining frequent patterns from event sequences is a desired capability for many domains. However, most techniques for mining frequent patterns are ineffective for real-world data that may be low-resolution, concurrent, or feature many types of events, or the algorithms may produce results too complex to interpret. To address these challenges, we propose Frequence, an intelligent user interface that integrates data mining and visualization in an interactive hierarchical information exploration system for finding frequent patterns from longitudinal event sequences. Frequence features a novel frequent sequence mining algorithm to handle multiple levels-of-detail, temporal context, concurrency, and outcome analysis. Frequence also features a visual interface designed to support insights and the exploration of patterns at the level-of-detail relevant to users. Frequence's effectiveness is demonstrated with two use cases: medical research mining event sequences from clinical records to understand the progression of a disease, and social network research using frequent sequences from Foursquare to understand the mobility of people in an urban environment.
    Proceedings of the 19th international conference on Intelligent User Interfaces; 02/2014
  • ABSTRACT: On top of an enterprise social platform, we are building a smart social QA system that automatically routes questions to suitable employees who are willing, able, and ready to provide answers. Due to a lack of social QA history (training data) to start with, in this paper, we present an optimization-based approach that recommends both top-matched active (seed) and inactive (prospect) answerers for a given question. Our approach includes three parts. First, it uses a predictive model to find top-ranked seed answerers by their fitness, including their ability and willingness, to answer a question. Second, it uses distance metric learning to discover prospects most similar to the seeds identified in the first step. Third, it uses a constraint-based approach to balance the selection of both seeds and prospects identified in the first two steps. As a result, not only does our solution route questions to top-matched active users, but it also engages inactive users to grow the pool of answerers. Our real-world experiments that routed 114 questions to 684 people identified from 400,000+ employees included 641 prospects (93.7%) and achieved about a 70% answering rate, with 83% of answers receiving a lot/full confidence.
    Proceedings of the 19th international conference on Intelligent User Interfaces; 02/2014
  • ABSTRACT: Diffusion and cascades have been studied for many years in sociology, and different theoretical models have been developed. However, experimental validation has always been carried out on relatively small datasets. In recent years, with the availability of large-scale network and cascade data, research on cascading and diffusion phenomena has aroused considerable interest from various fields in computer science. One of the main goals is to discover different propagation patterns from historical cascade data. In this context, understanding the mechanisms underlying diffusion at both micro- and macro-scale levels and further developing predictive models of diffusion are fundamental problems of crucial importance.
    Proceedings of the 7th ACM international conference on Web search and data mining; 02/2014
  • Ping Zhang, Fei Wang, Jianying Hu
    ABSTRACT: In response to the high cost and high risk associated with traditional de novo drug discovery, investigation of potential additional uses for existing drugs, also known as drug repositioning, has attracted increasing attention from both the pharmaceutical industry and the research community. In this paper, we propose a unified computational framework, called DDR, to predict novel drug-disease associations. DDR formulates the task of hypothesis generation for drug repositioning as a constrained nonlinear optimization problem. It utilizes multiple drug similarity networks, multiple disease similarity networks, and known drug-disease associations to explore potential new associations among drugs and diseases with no known links. A large-scale study was conducted using 799 drugs against 719 diseases. Experimental results demonstrated the effectiveness of the approach. In addition, DDR ranked drug and disease information sources based on their contributions to the prediction, thus paving the way for prioritizing multiple data sources and building more reliable drug repositioning models. Particularly, some of our novel predictions of drug-disease associations were supported by clinical trials databases, showing that DDR could serve as a useful tool in drug discovery to efficiently identify potential novel uses for existing drugs.
    AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium 01/2014; 2014:1258-67.
  • ABSTRACT: Disease risk prediction has been a central topic of medical informatics. Although various risk prediction models have been studied in the literature, the vast majority were designed to be single-task, i.e. they only consider one target disease at a time. This becomes a limitation when in practice we are dealing with two or more diseases that are related to each other in terms of sharing common comorbidities, symptoms, risk factors, etc., because single-task prediction models are not equipped to identify these associations across different tasks. In this paper we address this limitation by exploring the application of a multi-task learning framework to joint disease risk prediction. Specifically, we characterize the disease relatedness by assuming that the risk predictors underlying these diseases overlap. We develop an optimization-based formulation that can simultaneously predict the risk for all diseases and learn the shared predictors. Our model is applied to a real Electronic Health Record (EHR) database with 7,839 patients, among which 1,127 developed Congestive Heart Failure (CHF) and 477 developed Chronic Obstructive Pulmonary Disease (COPD). We demonstrate that a properly designed multi-task learning algorithm is viable for joint disease risk prediction and it can discover clinical insights that single-task models would overlook.
    AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium 01/2014; 2014:1180-7.

Publication Stats

1k Citations
92.52 Total Impact Points


  • 2011–2014
    • IBM
      Armonk, New York, United States
  • 2009–2011
    • Florida International University
      • School of Computing and Information Sciences
      Miami, Florida, United States
  • 2005–2011
    • Tsinghua University
      • Department of Automation
      Beijing, Beijing Shi, China
  • 2010
    • Cornell University
      • Department of Statistical Science
      Ithaca, New York, United States