Fei Wang

University of Connecticut, Storrs, Connecticut, United States

Publications (151) · 99.69 Total Impact Points

  • Adam Perer · Fei Wang · Jianying Hu
    ABSTRACT: To derive data-driven insights from patient EMR data, we develop Care Pathway Explorer, a system that mines and visualizes frequent event sequences. The goal is to use historical EMR data to extract common sequences of medical events, such as diagnoses and treatments, and to investigate how these sequences correlate with patient outcomes. Care Pathway Explorer uses a frequent sequence mining algorithm adapted to the real-world properties of EMR data, with techniques for handling event concurrency, multiple levels of detail, temporal context, and outcome (a minimal sketch of the core support-counting step follows this entry). The mined patterns are then visualized in an interactive user interface consisting of novel overview and flow visualizations. We use the proposed system to analyze the diagnoses and treatments of a cohort of hyperlipidemic patients with hypertension and diabetes pre-conditions, and demonstrate the clinical relevance of the patterns mined from EMR data. The identified patterns corresponded to clinical and published knowledge, some of it unknown to the physician at the time of discovery. Care Pathway Explorer, which combines frequent sequence mining techniques with advanced visualizations, supports the integration of data-driven insights into care pathway discovery.
    Journal of Biomedical Informatics 07/2015; 56. DOI:10.1016/j.jbi.2015.06.020 · 2.48 Impact Factor
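    A minimal Python sketch of the support-counting core of frequent sequence mining; the event names, records, and threshold below are hypothetical, and the paper's algorithm additionally handles concurrency, levels of detail, and outcome:

      from itertools import combinations

      # Each record is one patient's ordered list of medical events (hypothetical).
      records = [
          ["dx:hyperlipidemia", "lab:lipid_panel", "rx:statin"],
          ["dx:hyperlipidemia", "rx:statin", "lab:lipid_panel"],
          ["dx:hypertension", "dx:hyperlipidemia", "rx:statin"],
      ]

      def frequent_pairs(records, min_support=2):
          """Count ordered event pairs (a before b) and keep those that
          occur in at least min_support records."""
          counts = {}
          for record in records:
              # combinations() preserves the record's order, and the set
              # counts each pair at most once per record.
              for pair in set(combinations(record, 2)):
                  counts[pair] = counts.get(pair, 0) + 1
          return {p: c for p, c in counts.items() if c >= min_support}

      print(frequent_pairs(records))
      # {('dx:hyperlipidemia', 'rx:statin'): 3, ('dx:hyperlipidemia', 'lab:lipid_panel'): 2}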
  • ABSTRACT: Healthcare is a field closely related to everyone's daily life. Because of the high complexity of the healthcare industry, a huge amount of money is wasted every year. In recent years, researchers from different areas have entered the healthcare world with the hope of helping to reduce costs and improve the quality of care delivery. Among the emerging trends, data-driven technologies have received a great deal of attention due to the growing availability of healthcare data. Data-driven healthcare is at the center of the vision of learning health systems and holds great promise for transforming the current state of healthcare. The goal of this special issue is to present novel approaches in the field of data mining that can be applied in different fields of healthcare. The vast majority of the papers in this special issue present novel methods that were empirically evaluated on medical datasets much larger than the datasets we were used to ...
    Data Mining and Knowledge Discovery 07/2015; 29(4). DOI:10.1007/s10618-015-0414-1 · 1.74 Impact Factor
  • Ming Lin · Fei Wang · Changshui Zhang
    ABSTRACT: The Nyström method approximates the eigenvectors of a given kernel matrix by randomly sampling a subset of the data (a minimal sketch of the standard construction follows this entry). Previous research focuses on good kernel approximation, while the quality of the eigenvector approximation is rarely explored. In online eigenvector approximation methods, one can minimize the kernel approximation error to guarantee a good eigenvector approximation. In this work, however, we paradoxically prove that for batch approximation methods like Nyström, this is no longer true. This unexpected discovery opens a question: what criterion should we use in Nyström to generate a decent eigenvector approximation? To address this problem, we propose a novel criterion named the Hilbert Space Embedding (HSE) Nyström criterion, which directly minimizes the eigenvector approximation error. The proposed HSE criterion provides a general framework for approximating eigenvectors within linear time and space complexity. We then show that we can rediscover many successful Nyström methods with the proposed criterion, including K-means Nyström and Density Nyström. To further demonstrate the power of our criterion, we design a novel algorithm based on it to approximate the eigenvectors of Laplacian matrices with better accuracy than existing linear-complexity methods. We demonstrate the efficiency and efficacy of our proposal in numerical experiments.
    Pattern Recognition 05/2015; 48(5). DOI:10.1016/j.patcog.2014.11.017 · 2.58 Impact Factor
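    For context, a minimal numpy sketch of the classical uniform-sampling Nyström approximation of kernel eigenvectors (the baseline the HSE criterion is designed to improve on, not the HSE method itself); the RBF kernel and all parameters are illustrative:

      import numpy as np

      def rbf_kernel(A, B, gamma=0.5):
          # Pairwise squared distances, then exp(-gamma * d^2).
          d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
          return np.exp(-gamma * d2)

      def nystrom_eigenvectors(X, m, k, gamma=0.5, seed=0):
          """Approximate the top-k eigenvectors of the n x n kernel matrix
          from m uniformly sampled landmarks (m << n)."""
          rng = np.random.default_rng(seed)
          n = X.shape[0]
          idx = rng.choice(n, size=m, replace=False)
          W = rbf_kernel(X[idx], X[idx], gamma)        # m x m landmark block
          C = rbf_kernel(X, X[idx], gamma)             # n x m cross block
          evals, evecs = np.linalg.eigh(W)             # ascending eigenvalues
          evals = np.maximum(evals[::-1][:k], 1e-12)   # top-k, guarded from zero
          evecs = evecs[:, ::-1][:, :k]
          U = (C @ evecs / evals) * np.sqrt(m / n)     # Nystrom extension to all n points
          return U, evals * (n / m)                    # eigenvector / eigenvalue estimates

      X = np.random.default_rng(1).normal(size=(2000, 10))
      U, lam = nystrom_eigenvectors(X, m=100, k=5)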
  • ABSTRACT: Cascades are ubiquitous in various network environments. Predicting these cascades is highly nontrivial in several vital applications, such as viral marketing, epidemic prevention, and traffic management. Most previous work focuses on predicting final cascade sizes. As cascades are typical dynamic processes, it is always interesting and important to predict the cascade size at any time, or to predict the time when a cascade will reach a certain size (e.g., a threshold for outbreak). In this paper, we unify all these tasks into one fundamental problem: cascading process prediction. That is, given the early stage of a cascade, how do we predict its cumulative cascade size at any later time? For such a challenging problem, understanding the micro mechanism that drives and generates the macro phenomena (i.e., cascading processes) is essential. Here we introduce behavioral dynamics as the micro mechanism describing the dynamic process by which a node's neighbors get infected by a cascade after the node itself gets infected (i.e., one-hop subcascades). Through data-driven analysis, we find common principles and patterns in behavioral dynamics and propose a novel Networked Weibull Regression model for modeling them (a minimal Weibull-fitting sketch follows this entry). We then propose a novel method for predicting cascading processes by effectively aggregating behavioral dynamics, together with a scalable solution that approximates the cascading process with a theoretical guarantee. We extensively evaluate the proposed method on a large-scale social network dataset. The results demonstrate that the proposed method significantly outperforms other state-of-the-art baselines in multiple tasks, including cascade size prediction, outbreak time prediction, and cascading process prediction.
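    A minimal scipy sketch of the Weibull-fitting step behind behavioral dynamics: fit the distribution of delays between a node's infection and its neighbors' infections. The delays are synthetic, and the paper's Networked Weibull Regression further ties the Weibull parameters to node covariates:

      import numpy as np
      from scipy.stats import weibull_min

      # Synthetic delays (hours) between a node's infection and each
      # neighbor's infection, drawn from a Weibull we then try to recover.
      delays = weibull_min.rvs(c=1.5, scale=10.0, size=2000,
                               random_state=np.random.default_rng(0))

      # Maximum-likelihood fit; floc=0 pins the location at zero, the
      # usual convention for waiting-time data.
      shape, loc, scale = weibull_min.fit(delays, floc=0)
      print(f"shape={shape:.2f} scale={scale:.2f}")

      # Survival function: P(neighbor still uninfected at time t), the
      # quantity aggregated over one-hop subcascades for prediction.
      print(weibull_min.sf([5.0, 10.0, 20.0], shape, loc=loc, scale=scale))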
  • Fei Wang · Jimeng Sun
    ABSTRACT: Patient similarity is an important analytic operation in healthcare applications. At its core, patient similarity takes an index patient as input and retrieves a ranked list of similar patients that are relevant in a specific clinical context. It takes patient information, such as electronic health records (EHR), as input and computes the distance between a pair of patients based on that information. To construct a clinically valid similarity measure, physician input often needs to be incorporated. However, obtaining physicians' input is difficult and expensive; as a result, typically only limited physician feedback can be obtained on a small portion of patients. How can we leverage all the unlabelled patient data and the limited supervision from physicians to construct a clinically meaningful distance metric? In this paper, we present a Patient Similarity Framework (PSF) that unifies and significantly extends existing supervised patient similarity metric learning methods. PSF is a general framework that can learn an appropriate distance metric from both supervised and unsupervised information. Within the PSF framework, we propose a novel patient similarity algorithm that uses local spline regression to capture the unsupervised information. To speed up the incorporation of physician feedback or newly available clinical information, we introduce a general online update algorithm for an existing PSF distance metric.
    IEEE Journal of Biomedical and Health Informatics 04/2015; 19(3). DOI:10.1109/JBHI.2015.2425365
  • Fei Wang
    ABSTRACT: With the rapid development of information technologies, tremendous amounts of data have become readily available in various application domains. This big data era presents challenges to many conventional data analytics research directions, including data capture, storage, search, sharing, analysis, and visualization. It is no surprise that the success of next-generation healthcare systems relies heavily on the effective utilization of gigantic amounts of medical data. The ability to analyze big data in modern healthcare systems plays a vital role in improving the quality of care delivery. Specifically, patient similarity evaluation aims at estimating the clinical affinity and diagnostic proximity of patients. As one of the successful data-driven techniques adopted in healthcare systems, patient similarity evaluation plays a fundamental role in many healthcare research areas such as prognosis, risk assessment, and comparative effectiveness analysis. However, existing algorithms for patient similarity evaluation are inefficient in handling massive patient data. In this paper, we propose an Adaptive Semi-Supervised Recursive Tree Partitioning (ART) framework for large-scale patient indexing such that patients with similar clinical or diagnostic patterns can be correctly and efficiently retrieved (an unsupervised tree-partitioning sketch follows this entry). The framework is designed for semi-supervised settings, since it is crucial in medical scenarios to leverage experts' supervision, which is fairly limited compared to the available data. Starting from the proposed ART framework, we discuss several specific instantiations and validate them on both benchmark and real-world healthcare data. Our results show that with the ART framework, patients can be efficiently and effectively indexed, in the sense that (1) similar patients can be retrieved in a very short time, and (2) the retrieval performance beats state-of-the-art indexing methods.
    Journal of Biomedical Informatics 02/2015; 55. DOI:10.1016/j.jbi.2015.01.009 · 2.48 Impact Factor
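    ART itself is semi-supervised; as an illustration of the underlying idea only, here is a purely unsupervised random-projection tree that recursively partitions patients so a query needs to scan just one small leaf. All names and sizes are hypothetical:

      import numpy as np

      def build_rp_tree(X, indices, min_leaf=16, rng=None):
          """Recursively split points at the median of a random projection;
          leaves hold the indices of mutually similar patients."""
          rng = rng or np.random.default_rng(0)
          if len(indices) <= min_leaf:
              return {"leaf": indices}
          direction = rng.normal(size=X.shape[1])
          proj = X[indices] @ direction
          thr = np.median(proj)
          left, right = indices[proj <= thr], indices[proj > thr]
          if len(left) == 0 or len(right) == 0:        # degenerate split
              return {"leaf": indices}
          return {"dir": direction, "thr": thr,
                  "left": build_rp_tree(X, left, min_leaf, rng),
                  "right": build_rp_tree(X, right, min_leaf, rng)}

      def query(tree, x):
          """Descend to x's leaf; its indices are the retrieval candidates."""
          while "leaf" not in tree:
              tree = tree["left"] if x @ tree["dir"] <= tree["thr"] else tree["right"]
          return tree["leaf"]

      X = np.random.default_rng(1).normal(size=(10_000, 20))
      tree = build_rp_tree(X, np.arange(len(X)))
      candidates = query(tree, X[0])    # small candidate set for exact re-ranking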
  • IEEE Transactions on Knowledge and Data Engineering 01/2015; DOI:10.1109/TKDE.2015.2432811 · 2.07 Impact Factor
  • ABSTRACT: Care pathways play a significant role in delivering evidence-based, coordinated care to patients with specific conditions. To put care pathways into practice, clinical institutions typically need to adapt them to local care settings so that the best local practices can be incorporated and used to develop refined pathways. However, incorporating various analytic insights from local datasets is knowledge-intensive and error-prone. To help care pathway developers work effectively and efficiently, we propose to automatically synthesize the analytical evidence derived from multiple analysis methods and to recommend modeling operations accordingly, deriving a refined care pathway for a specific patient cohort. We validated our method by adapting a Congestive Heart Failure (CHF) Ambulatory Care Pathway for patients with the additional condition of COPD, synthesizing the results of variation analysis and frequent pattern mining against patient records.
    Studies in Health Technology and Informatics 01/2015; 210:70-4.
  • ABSTRACT: A care/clinical pathway (CP) is a standardized care process in which temporal and data constraints on clinical activities are defined to ensure quality of care. In actual care practice, various situations of compliance and non-compliance with CPs can be observed. Analysis of these CP variation patterns (CPVPs) can help improve care quality and enhance decision support. In this paper, we propose an automatic method to detect CPVPs in electronic medical records (EMR) and to statistically examine their correlation with patient outcomes. From each CP constraint, we first derive a CPVP tree, where each pattern is represented using first-order linear temporal logic and translated into a Büchi automaton for pattern detection. We then identify the CPVPs that are evidently correlated with a patient outcome by examining odds ratios (a minimal odds-ratio sketch follows this entry). The method has been applied to a CP for congestive heart failure and real-world EMR to demonstrate its effectiveness.
    Studies in Health Technology and Informatics 01/2015; 210:692-6.
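    The outcome-correlation step reduces to a 2x2 contingency table per variation pattern; a minimal scipy sketch with hypothetical counts:

      import numpy as np
      from scipy.stats import fisher_exact

      # Rows: variation pattern present / absent.
      # Columns: poor outcome / good outcome (hypothetical counts).
      table = np.array([[30,  70],
                        [20, 180]])

      odds_ratio, p_value = fisher_exact(table)
      print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
      # OR > 1 with a small p-value flags the pattern as associated with
      # the poor outcome; association alone does not establish causation.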
  • ABSTRACT: Regulations and privacy concerns often hinder the exchange of healthcare data between hospitals and other healthcare providers. Sharing predictive models built on local data and combining their outputs offers an alternative that still enables prediction of outcomes on new cases. Although one can choose from many techniques for combining the outputs of different predictive models, studies that try to interpret the results obtained from ensemble-learning methods are hard to find. We propose a novel approach to classification based on models from different hospitals that achieves a high level of performance along with comprehensible results. Our approach is based on regularized sparse regression models at two hierarchical levels and exploits the interpretability of the resulting regression coefficients to rank the contribution of each hospital to outcome prediction (a two-level sketch follows this entry). The proposed approach was used to predict 30-day all-cause readmissions for pediatric patients in 54 Californian hospitals. Using repeated holdout evaluation on more than 60,000 hospital discharge records, we compared the proposed approach to alternative approaches. The performance of the two-level classification model was measured using the Area Under the ROC Curve (AUC), with an additional evaluation that uncovered the importance and contribution of each single data source (i.e., hospital) to the final result. The results for the best distributed model (AUC=0.787, 95% CI: 0.780-0.794) show no significant difference in AUC when compared to a single elastic net model built on all available data (AUC=0.789, 95% CI: 0.781-0.796). This paper presents a novel approach to classification with shared predictive models for environments where centralized collection of data is not possible. The comparable classification performance and the interpretability of the results demonstrate the effectiveness of our approach.
    2014 AMIA Annual Symposium, Washington, DC; 11/2014
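    A minimal sketch of the two-level idea under stated assumptions (synthetic data, sklearn models, hypothetical hyperparameters): an L1-regularized model per hospital, then a non-negative sparse regression over the local models' predicted probabilities, whose coefficients rank hospital contributions:

      import numpy as np
      from sklearn.linear_model import LogisticRegression, Lasso

      rng = np.random.default_rng(0)
      n_hospitals, n_feat = 5, 30
      # Synthetic stand-in for per-hospital discharge records and outcomes.
      hospitals = [(rng.normal(size=(400, n_feat)),
                    rng.integers(0, 2, size=400)) for _ in range(n_hospitals)]

      # Level 1: a sparse model per hospital; raw data never leaves the site.
      local = [LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
               .fit(X, y) for X, y in hospitals]

      # Level 2: regress the outcome on the local models' predictions;
      # positive=True makes coefficients readable as hospital contributions.
      X_val, y_val = hospitals[0]
      stacked = np.column_stack([m.predict_proba(X_val)[:, 1] for m in local])
      meta = Lasso(alpha=0.01, positive=True).fit(stacked, y_val)
      print("hospital contribution weights:", meta.coef_)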
  • ABSTRACT: The exponential growth of information generated by online social networks demands effective and scalable recommender systems to deliver useful results. Traditional techniques fall short because they ignore social relation data; existing social recommendation approaches consider social network structure, but social contextual information has not been fully considered. It is significant and challenging to fuse social contextual factors, derived from users' motivations for social behavior, into social recommendation. In this paper, we investigate the social recommendation problem on the basis of psychology and sociology studies, which exhibit two important factors: individual preference and interpersonal influence. We first establish the particular importance of these two factors in online behavior prediction. We then propose a novel probabilistic matrix factorization method to fuse them in latent space (a minimal SGD sketch of a socially regularized factorization follows this entry), and further provide a scalable algorithm that can incrementally process large-scale data. We conduct experiments on both a Facebook-style bidirectional and a Twitter-style unidirectional social network dataset. The empirical results and analysis on these two large datasets demonstrate that our method significantly outperforms existing approaches.
    IEEE Transactions on Knowledge and Data Engineering 11/2014; 26(11):2789-2802. DOI:10.1109/TKDE.2014.2300487 · 2.07 Impact Factor
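    A minimal SGD sketch of matrix factorization with a social regularizer that pulls each user's latent preference vector toward the mean of their friends' vectors; the data, friendship structure, and step sizes are all hypothetical, and the paper's probabilistic model is richer than this:

      import numpy as np

      rng = np.random.default_rng(0)
      n_users, n_items, k = 50, 40, 8
      ratings = [(rng.integers(n_users), rng.integers(n_items),
                  rng.uniform(1, 5)) for _ in range(500)]
      friends = {u: rng.choice(n_users, size=3, replace=False)
                 for u in range(n_users)}

      U = rng.normal(scale=0.1, size=(n_users, k))   # individual preference factors
      V = rng.normal(scale=0.1, size=(n_items, k))   # item factors
      lr, reg, social = 0.01, 0.05, 0.1

      for epoch in range(20):
          for u, i, r in ratings:
              err = r - U[u] @ V[i]
              # Interpersonal influence: pull U[u] toward its friends' mean.
              pull = U[u] - U[friends[u]].mean(axis=0)
              U[u] += lr * (err * V[i] - reg * U[u] - social * pull)
              V[i] += lr * (err * U[u] - reg * V[i])

      print("predicted rating for (user 0, item 0):", U[0] @ V[0])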
  • ABSTRACT: Risk stratification is instrumental to modern clinical decision support systems. Comprehensive risk stratification should provide clinicians not only with an accurate assessment of a patient's risk but also with the clinical context to be acted upon. However, existing risk stratification techniques mainly focus on predicting a risk score for individual patients; at the cohort level, they offer little insight beyond a flat score-based segmentation. This essentially reduces a patient to a score and thus removes him or her from the clinical context. To address this limitation, we propose a bilinear model for risk stratification that simultaneously captures the three key aspects of risk stratification: (1) it predicts the risk of each individual patient; (2) it stratifies the patient cohort based not only on the risk score but also on clinical characteristics; and (3) it embeds all patients into clinical contexts with a clear interpretation. We apply our model to a cohort of 4977 patients, 1127 of whom were diagnosed with Congestive Heart Failure (CHF). We demonstrate that our model can not only accurately predict the onset risk of CHF but also provide rich and actionable clinical insights into the patient cohort.
    Journal of Biomedical Informatics 10/2014; 53. DOI:10.1016/j.jbi.2014.10.004 · 2.48 Impact Factor
  • ABSTRACT: Behavioral pattern discovery is increasingly studied as a way to understand human behavior, and the discovered patterns can be used in many real-world applications such as web search, recommender systems, and advertisement targeting. Traditional methods usually treat behaviors as simple user-item connections or represent them with a static model. In the real world, however, human behaviors are complex and dynamic: they involve correlations between users and multiple types of objects, and they continuously evolve over time. These characteristics cause severe data sparsity and computational complexity problems, which pose great challenges to human behavioral analysis and prediction. In this paper, we propose a Flexible Evolutionary Multi-faceted Analysis (FEMA) framework for both behavior prediction and pattern mining. FEMA utilizes a flexible and dynamic factorization scheme for analyzing human behavioral data sequences, which can incorporate various kinds of knowledge embedded in different object domains to alleviate the sparsity problem. We give approximation algorithms for efficiency, with theoretically proven bounds on the approximation loss. We extensively evaluate the proposed method on two real datasets. For the prediction of human behaviors, FEMA significantly outperforms other state-of-the-art baseline methods by 17.4%. Moreover, FEMA is able to discover a number of interesting multi-faceted temporal patterns of human behavior with good interpretability. More importantly, it reduces the run time from hours to minutes, which is significant for industry applications that must serve results in real time.
  • Jiayu Zhou · Fei Wang · Jianying Hu · Jieping Ye
    ABSTRACT: Inferring phenotypic patterns from population-scale clinical data is a core computational task in the development of personalized medicine. One important source of data for this type of research is patient Electronic Medical Records (EMR). However, patient EMRs are typically sparse and noisy, which creates significant challenges if we use them directly to represent patient phenotypes. In this paper, we propose a data-driven phenotyping framework called Pacifier (PAtient reCord densIFIER), in which we interpret the longitudinal EMR data of each patient as a sparse matrix with a feature dimension and a time dimension, and derive more robust patient phenotypes by exploring the latent structure of those matrices. Specifically, we assume that each derived phenotype is composed of a subset of the medical features contained in the original patient EMR, whose values evolve smoothly over time. We propose two formulations to achieve this goal: the Individual Basis Approach (IBA), which assumes the phenotypes are different for every patient, and the Shared Basis Approach (SBA), which assumes the patient population shares a common set of phenotypes. We develop an optimization algorithm that is capable of resolving both problems efficiently. Finally, we validate Pacifier on two real-world EMR cohorts for the tasks of early prediction of Congestive Heart Failure (CHF) and End Stage Renal Disease (ESRD). Our results show that the predictive performance in both tasks can be improved significantly by the proposed algorithms (average AUC score improved from 0.689 to 0.816 on CHF, and from 0.756 to 0.838 on ESRD, at the diagnosis group granularity). We also illustrate some interesting phenotypes derived from our data.
  • ABSTRACT: Logistic regression is a core predictive modeling technique that has been used extensively in health and biomedical problems. Recently, much research has focused on enforcing sparsity on the learned model to enhance its effectiveness and interpretability, resulting in the sparse logistic regression model. However, both the original and the sparse logistic regression require the inputs to be in vector form. This limits the applicability of logistic regression to problems where the data cannot be naturally represented as vectors (e.g., functional magnetic resonance imaging and electroencephalography signals). To handle cases where the data are in the form of multi-dimensional arrays, we propose MulSLR: Multilinear Sparse Logistic Regression. MulSLR can be viewed as a high-order extension of sparse logistic regression. Instead of solving for one classification vector as in conventional logistic regression, we solve for K classification vectors in MulSLR (K is the number of modes in the data). We propose a block proximal descent approach to solve the problem, prove its convergence, and analyze its convergence rate (a sketch of the proximal building block for the vector case follows this entry). Finally, we validate the efficiency and effectiveness of MulSLR on predicting the onset risk of patients with Alzheimer's disease and heart failure.
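    A numpy sketch of the proximal building block for the ordinary vector case: gradient steps on the logistic loss followed by soft-thresholding (the L1 proximal operator). MulSLR alternates updates of this kind over the K mode-wise vectors; the data and step sizes here are illustrative:

      import numpy as np

      def soft_threshold(z, t):
          # Proximal operator of the L1 norm.
          return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

      def sparse_logistic_regression(X, y, lam=0.1, lr=0.1, n_iter=500):
          """ISTA for L1-penalized logistic regression; y in {0, 1}."""
          n, d = X.shape
          w = np.zeros(d)
          for _ in range(n_iter):
              p = 1.0 / (1.0 + np.exp(-X @ w))      # predicted probabilities
              grad = X.T @ (p - y) / n              # gradient of the logistic loss
              w = soft_threshold(w - lr * grad, lr * lam)
          return w

      rng = np.random.default_rng(0)
      X = rng.normal(size=(200, 50))
      w_true = np.zeros(50); w_true[:5] = 2.0       # only 5 informative features
      y = (rng.uniform(size=200) < 1 / (1 + np.exp(-X @ w_true))).astype(float)
      w = sparse_logistic_regression(X, y)
      print("nonzero coefficients:", np.flatnonzero(w))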
  • Nan Cao · Lu Lu · Yu-Ru Lin · Fei Wang · Zhen Wen
    ABSTRACT: Social media allow people to express and promote different opinions, and people's sentiments toward a subject often diverge when their opinions conflict. An intuitive visualization that unfolds the process of sentiment divergence from the rich and massive social media data will have far-reaching impact on various domains, including social science, politics, and economics. In this paper, we propose a visual analysis system, SocialHelix, to achieve this goal. SocialHelix is a novel visual design that enables users to detect and trace topics and events occurring in social media, and to understand when and why divergences occurred and how they evolved among different social groups. We demonstrate the effectiveness and usefulness of SocialHelix by conducting in-depth case studies on tweets related to the national political debates.
    Journal of Visualization 05/2014; 18(2):221-235. DOI:10.1007/s12650-014-0246-x · 0.59 Impact Factor
  • ABSTRACT: The rapid adoption of electronic health records (EHR) provides a comprehensive source for exploratory and predictive analytics to support clinical decision-making. In this paper, we investigate how to utilize EHR data to tailor treatments to individual patients based on their likelihood of responding to a therapy. We construct a heterogeneous graph that includes two domains (patients and drugs) and encodes three relationships (patient similarity, drug similarity, and prior patient-drug associations). We describe a novel approach that performs label propagation over this heterogeneous graph to spread label information representing the effectiveness of different drugs for different patients (a minimal label-propagation sketch follows this entry). The proposed method has been applied to a real-world EHR dataset to help identify personalized treatments for hypercholesterolemia. The experimental results demonstrate the effectiveness of the approach and suggest that the combination of appropriate patient similarity and drug similarity analytics could lead to actionable insights for personalized medicine. In particular, by leveraging drug similarity in combination with patient similarity, our method performs well even on new or rarely used drugs for which there are few records of past performance.
    04/2014; 2014:132-6.
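    A minimal numpy sketch of label propagation on a combined patient-drug graph, iterating Y <- alpha * S @ Y + (1 - alpha) * Y0 in the style of Zhou et al.; the affinities and seed labels are hypothetical, and the paper's heterogeneous propagation is more elaborate:

      import numpy as np

      def label_propagation(W, Y0, alpha=0.8, n_iter=100):
          """Iterate Y <- alpha * S @ Y + (1 - alpha) * Y0, where S is the
          symmetrically normalized affinity matrix."""
          d = W.sum(axis=1)
          d[d == 0] = 1.0
          S = W / np.sqrt(d)[:, None] / np.sqrt(d)[None, :]
          Y = Y0.copy()
          for _ in range(n_iter):
              Y = alpha * S @ Y + (1 - alpha) * Y0
          return Y

      rng = np.random.default_rng(0)
      Xp = rng.normal(size=(6, 5))                             # 6 patients
      Wp = np.exp(-((Xp[:, None] - Xp[None]) ** 2).sum(-1))    # patient similarity
      Wd = np.array([[1.0, 0.8, 0.1],                          # 3 drugs
                     [0.8, 1.0, 0.2],
                     [0.1, 0.2, 1.0]])
      A = np.zeros((6, 3)); A[0, 0] = A[1, 0] = A[2, 1] = 1.0  # known responses
      W = np.block([[Wp, A], [A.T, Wd]])                       # heterogeneous graph

      Y0 = np.zeros((9, 3)); Y0[:6] = A      # one label column per drug
      Y = label_propagation(W, Y0)
      print(Y[:6].round(2))                  # propagated effectiveness scores per patient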
  • David Gotz · Fei Wang · Adam Perer
    ABSTRACT: Patients' medical conditions often evolve in complex and seemingly unpredictable ways. Even within a relatively narrow and well-defined episode of care, variations between patients in both their progression and eventual outcome can be dramatic. Understanding which patterns of events observed within a population most correlate with differences in outcome is therefore an important task in many types of studies using retrospective electronic health data. In this paper, we present a method for interactive pattern mining and analysis that supports ad hoc visual exploration of patterns mined from retrospective clinical patient data. Our approach combines (1) visual query capabilities to interactively specify episode definitions, (2) pattern mining techniques to help discover important intermediate events within an episode, and (3) interactive visualization techniques that help uncover the event patterns that most impact outcome and how those associations change over time. In addition to presenting our methodology, we describe a prototype implementation and present use cases highlighting the types of insights and hypotheses that our approach can help uncover.
    Journal of Biomedical Informatics 04/2014; 48. DOI:10.1016/j.jbi.2014.01.007 · 2.48 Impact Factor
  • Fei Wang · Jimeng Sun
    ABSTRACT: Distance metric learning is a fundamental problem in data mining and knowledge discovery. Many representative data mining algorithms, such as the k-nearest neighbor classifier, hierarchical clustering, and spectral clustering, rely heavily on the underlying distance metric to correctly measure relations among input data. In recent years, many studies have demonstrated, either theoretically or empirically, that learning a good distance metric can greatly improve the performance of classification, clustering, and retrieval tasks. In this survey, we review existing distance metric learning approaches under a common framework. Specifically, depending on the supervision information available during the distance metric learning process, we categorize each distance metric learning algorithm as supervised, unsupervised, or semi-supervised (a worked example of the supervised category follows this entry). We compare these different types of metric learning methods, pointing out their strengths and limitations. Finally, we summarize the open challenges in distance metric learning and propose future directions.
    Data Mining and Knowledge Discovery 03/2014; 29(2):534-564. DOI:10.1007/s10618-014-0356-z · 1.74 Impact Factor
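    A worked example of the supervised category: estimate a Mahalanobis matrix as the inverse of the average within-class covariance (the idea behind Relevant Component Analysis) and use it in place of the Euclidean metric. The two-class synthetic data is illustrative:

      import numpy as np

      def rca_metric(X, y, eps=1e-6):
          """Mahalanobis matrix M = (average within-class covariance)^-1."""
          d = X.shape[1]
          C = np.zeros((d, d))
          for c in np.unique(y):
              Xc = X[y == c] - X[y == c].mean(axis=0)
              C += Xc.T @ Xc
          C /= len(X)
          return np.linalg.inv(C + eps * np.eye(d))

      def mahalanobis(a, b, M):
          diff = a - b
          return np.sqrt(diff @ M @ diff)

      rng = np.random.default_rng(0)
      X = np.vstack([rng.normal(0, [1, 5], size=(50, 2)),    # class 0; dim 2 is noisy
                     rng.normal(3, [1, 5], size=(50, 2))])   # class 1
      y = np.repeat([0, 1], 50)
      M = rca_metric(X, y)
      # The learned metric downweights the high-variance, uninformative dimension.
      print(mahalanobis(X[0], X[1], M), mahalanobis(X[0], X[60], M))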
  • Adam Perer · Fei Wang
    ABSTRACT: Extracting insights from temporal event sequences is an important challenge. In particular, mining frequent patterns from event sequences is a desired capability for many domains. However, most techniques for mining frequent patterns are ineffective for real-world data that may be low-resolution, concurrent, or feature many types of events, or they produce results too complex to interpret. To address these challenges, we propose Frequence, an intelligent user interface that integrates data mining and visualization in an interactive hierarchical information exploration system for finding frequent patterns in longitudinal event sequences. Frequence features a novel frequent sequence mining algorithm that handles multiple levels of detail, temporal context, concurrency, and outcome analysis, together with a visual interface designed to support insight and the exploration of patterns at the level of detail relevant to users. Frequence's effectiveness is demonstrated with two use cases: medical researchers mining event sequences from clinical records to understand the progression of a disease, and social network researchers using frequent sequences from Foursquare to understand the mobility of people in an urban environment.
    Proceedings of the 19th international conference on Intelligent User Interfaces; 02/2014

Publication Stats

1k Citations
99.69 Total Impact Points

Institutions

  • 2015
    • University of Connecticut
      Storrs, Connecticut, United States
  • 2011–2014
    • IBM
      Armonk, New York, United States
  • 2009–2011
    • Florida International University
      • School of Computing and Information Sciences
      Miami, Florida, United States
  • 2005–2011
    • Tsinghua University
      • Department of Automation
      Beijing, Beijing Shi, China
  • 2010
    • Cornell University
      • Department of Statistical Science
      Ithaca, New York, United States