Visual cluster analysis in support of clinical decision intelligence.
ABSTRACT Electronic health records (EHRs) contain a wealth of information about patients. In addition to providing efficient and accurate records for individual patients, large databases of EHRs contain valuable information about overall patient populations. While statistical insights describing an overall population are beneficial, they are often not specific enough to use as the basis for individualized patient-centric decisions. To address this challenge, we describe an approach based on patient similarity which analyzes an EHR database to extract a cohort of patient records most similar to a specific target patient. Clusters of similar patients are then visualized to allow interactive visual refinement by human experts. Statistics are then extracted from the refined patient clusters and displayed to users. The statistical insights taken from these refined clusters provide personalized guidance for complex decisions. This paper focuses on the cluster refinement stage where an expert user must interactively (a) judge the quality and contents of automatically generated similar patient clusters, and (b) refine the clusters based on his/her expertise. We describe the DICON visualization tool which allows users to interactively view and refine multidimensional similar patient clusters. We also present results from a preliminary evaluation where two medical doctors provided feedback on our approach.
Visual Cluster Analysis in Support of Clinical Decision Intelligence
David Gotz, PhD1, Jimeng Sun, PhD1, Nan Cao, MS2, Shahram Ebadollahi, PhD1
1IBM T.J. Watson Research Center, New York, USA;2HKUST, Hong Kong, China
Motivated by several perceived advantages and hastened by government regulation, adoption rates for electronic health
records (EHRs) are increasing across the globe. The primary use case for an EHR is to digitally capture all medical
data for an individual patient and to provide efficient access to the stored data at the point of care. Despite the
financial investments in information technology required for the deployment and maintenance of EHR systems, these
technologies can provide many important benefits, ranging from the reduction of medication errors, to more timely
access to medical records, to improved physician communication with both other providers and patients [1].
While enormously valuable, these benefits to traditional care delivery represent just one aspect of EHR technology.
A number of secondary uses for EHRs are being explored which exploit the large collections of electronic data that
result from EHR adoption. Such applications include, for example, both clinical research [2] and data-driven quality
measures [3]. These applications take advantage of population-wide statistical data that can be extracted by examining
the EHRs for many patients as a group.
Taking this approach one step further are personalized clinical decision intelligence technologies. For a given target
patient, these techniques use data analysis algorithms to dynamically identify cohorts of similar patients from within an
institution’s EHR database. Based on these personalized cohorts of similar patients, the systems then extract statistical
data to drive alerts or provide personalized decision support. For example, similar patient analysis has been shown to
be effective at near-term prognostics for physiological data [4]. Others have used patient similarity for risk assessment [5].
Along these lines, our lab is building a similarity-based decision intelligence system which provides medical professionals
managing complex patients with personalized evidence that is extracted from an institution’s EHR database.
Our approach is to apply statistical cluster analysis algorithms to EHR data to find clusters (which we call cohorts)
of similar patients which are relevant to a target patient. Then, once cohorts have been identified, aggregate historical
statistics are extracted and displayed to users as added input to their decision making process. This workflow is
illustrated in Figure 1.
One critical challenge in this approach is that the similar patient clusters identified by data analysis algorithms are often
difficult to understand semantically. Cluster analysis algorithms group similar patients based on statistical patterns.
However, because these patterns are hidden within the complex information space of EHRs, it can be challenging for
users to understand the semantic differences between statistically significant clusters. Moreover, the clustering may be
imperfect for a given clinical task. Our approach therefore depends on letting users see which patients are in each
cluster and refine the cluster definitions based on their domain expertise.
Figure 1: Patient similarity analytics are used to identify a group of EHR records for patients that are similar to a
target patient. Cluster analytics are then applied to the set of similar records to produce several different similar patient
cohorts. Users can then interactively refine these cohorts based on their expertise using the DICON visual analysis
tool described in this paper. Statistics from the clinician-refined cohorts can be used to inform decisions.
To help meet this challenge, we have developed an interactive visualization system which helps domain experts view
and refine the similar patient cohorts produced by our analytics. The visualization technique, named DICON, uses
treemap-based icons to represent clusters of similar patients. The icons convey multi-dimensional statistical information
at a glance and can be manipulated interactively and intuitively to merge, split, and refine the initial clusters into
task-appropriate cohorts. These cohorts can then be used as the basis for generating statistical evidence. In this paper
we provide an overview of our approach to clinical decision intelligence, describe the DICON visualization which we
developed for cluster analysis, and share feedback we received from physicians who were given access to our software.
The secondary use of EHR data is a topic that has received increasing attention as EHR adoption proliferates. Significant
attention has been given to improving overall health policy and to developing a framework that would open health
data for new applications [6,7]. Such frameworks would significantly lower the barriers for new technology development.
Benefits of the broader use of EHR data have been demonstrated in a number of research projects. Most relevant to the
work presented in this paper are systems that have analyzed large databases of EHR data to find sets of similar records.
Such “patient similarity” approaches have been explored in a variety of practice areas ranging from emergency rooms
to risk scoring. For example, Orthuber and Sommer developed a similarity-based search tool for patient records that is
used for decision support [8]. A slightly different approach was adopted by Wongsuphasawat and Shneiderman who used
visualization-based techniques to interactively identify similar records [9]. Both of these techniques help users identify
individual similar records which can be used anecdotally to inform decision makers.
Another class of algorithms uses aggregate statistics from clusters of similar patients as an added input when making
difficult decisions. For example, Ebadollahi et al. used similar patient records to improve near-term prognosis of
physiological data [4]. For a given patient, their system retrieved a cohort of statistically similar patients and analyzed
aggregate statistics from the cohort’s historical physiological data to accurately predict when adverse events were
likely to occur. Following a related approach, Chattopadhyay et al. used historical data from similar patient records
to calculate suicide risk [5]. While powerful, these techniques rely upon clusters of similar patients which are determined
by complex algorithms. As a result, it can be difficult for doctors to understand the characteristics of patients in a
cluster. In addition, automatically generated clusters often require manual adjustment by domain experts, yet this
capability is typically missing or very limited.
Because of these challenges, which are universal across many application areas that rely on clustering algorithms,
several information visualization techniques have been designed for these tasks. These range from scatter plots [10,11]
to parallel coordinates [12,13] to heat maps [14,15,16]. These techniques can be highly effective under various conditions.
However, they typically do not scale well for large numbers of clusters and can be difficult for users to follow. Most
importantly, these techniques support little or no refinement of the initial clustering structure produced by underlying
analysis algorithms. Unfortunately, these limitations are problematic for the clinical applications that are a focus of
this paper. We therefore use an iconic treemap-based visualization scheme which provides a compact and intuitive
multi-dimensional visual cluster representation that scales easily to large sets of clusters. The resulting visualization
also provides clear well-defined visual objects which can be easily selected by users for interactive manipulation at
A final area of information visualization work related to DICON is in the use of icon-based visual representa-
data, and easy to manipulate via user interaction. A limitation, however, is that icons are often limited in the amount
of information they can convey. DICON embraces many of the benefits of these tools while embedding a large amount
of information about both overall cluster statistics and individual entity properties that are often missing in classic
Clinical Decision Intelligence Using Patient Similarity
Adopting an effective EHR system provides many benefits, such as improved accuracy and information sharing, when
used as a straightforward replacement for traditional paper records. However, as described earlier in this paper, the
databases of medical information produced by such systems can be exploited in many valuable secondary ways. In
particular, as EHR databases grow sufficiently large, they can be mined to extract statistically significant insights about
personalized populations of patients.
Along these lines, we are developing a similarity-based clinical decision intelligence system which provides medical
professionals responsible for complex patients with personalized evidence extracted from an institution’s EHR
database. Our approach is to apply similarity and cluster analysis algorithms to EHR data to find clusters of patients
which are relevant to a medical decision. This workflow is depicted in Figure 1. For a given patient, similarity analysis
produces a set of the most similar EHR records. However, these patients may resemble the target in many different ways.
For instance, a patient with several co-morbidities might have different groups of patients who are relevant to each of
her underlying problems. We apply cluster analysis algorithms to subdivide the overall similar patient cohort into a
number of statistically interesting clusters.
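The two analysis steps described above (retrieve the most similar records, then subdivide them into clusters) can be sketched as follows. This is a minimal illustration only: the paper does not name its similarity measure or clustering algorithm, so Euclidean distance and plain k-means are assumptions here, and the function names `most_similar` and `kmeans` as well as the toy feature vectors are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_similar(target, records, k):
    """Return the ids of the k records closest to the target vector.
    Assumption: similarity is Euclidean distance on normalized features."""
    ranked = sorted(records, key=lambda pid: euclidean(records[pid], target))
    return ranked[:k]

def kmeans(points, n_clusters, iterations=20):
    """Plain Lloyd's k-means over {id: vector}.
    Assumption: seeds are simply the first n_clusters ids in sorted order."""
    ids = sorted(points)
    centroids = [list(points[i]) for i in ids[:n_clusters]]
    assignment = {}
    for _ in range(iterations):
        # Assign every record to its nearest centroid.
        assignment = {
            pid: min(range(n_clusters),
                     key=lambda c: euclidean(points[pid], centroids[c]))
            for pid in ids
        }
        # Move each centroid to the mean of its members.
        for c in range(n_clusters):
            members = [points[pid] for pid in ids if assignment[pid] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    clusters = [[] for _ in range(n_clusters)]
    for pid in ids:
        clusters[assignment[pid]].append(pid)
    return clusters

# Hypothetical two-feature records; p5 is far from the target.
records = {
    "p1": (0.9, 0.1), "p2": (0.8, 0.2),
    "p3": (0.1, 0.9), "p4": (0.2, 0.8),
    "p5": (5.0, 5.0),
}
target = (0.5, 0.5)

similar = most_similar(target, records, k=4)
cohorts = kmeans({pid: records[pid] for pid in similar}, n_clusters=2)
```

In this toy data the four retrieved patients are all similar to the target, but for different reasons, which is why the second step separates them into two groups.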
While the cohorts produced by cluster analysis can be used directly as the basis for clinical intelligence generation,
clinicians often need to explore and refine the cohorts based on their domain expertise. We refer to this stage as cohort
refinement. Refinement is valuable because cluster analysis algorithms detect statistical patterns, often with little or no
a priori semantic knowledge. As a result, these automated algorithms can produce cohorts that are hard for clinicians
to label semantically. However, semantically meaningful cohorts are required if the statistical insights extracted from
the cohorts are to be used clinically.
To enable interactive cohort refinement by domain experts, we have developed a new visualization technique which
we call Dynamic Icons, or DICON [21]. Using DICON, clinicians can interactively explore the clusters produced by
the automated analysis step and judge their quality. In addition, DICON lets users intuitively manipulate clusters of
patients via drag and drop techniques to merge and/or split groups of patients based on domain expertise. We describe
DICON in more detail in the next section.
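At the data level, the merge and split manipulations mentioned above amount to simple set operations on cohort membership. The sketch below is illustrative and is not DICON's implementation: cohorts are assumed to be named sets of patient ids, and the `merge` and `split` helpers, the cohort names, and the example predicate are all hypothetical.

```python
def merge(cohorts, a, b, new_name):
    """Return a new cohort mapping in which cohorts a and b are combined."""
    result = {k: set(v) for k, v in cohorts.items() if k not in (a, b)}
    result[new_name] = set(cohorts[a]) | set(cohorts[b])
    return result

def split(cohorts, name, predicate, kept_name, moved_name):
    """Return a new cohort mapping in which patients of `name` that satisfy
    `predicate` are moved into a separate cohort."""
    members = set(cohorts[name])
    moved = {p for p in members if predicate(p)}
    result = {k: set(v) for k, v in cohorts.items() if k != name}
    result[kept_name] = members - moved
    result[moved_name] = moved
    return result

# Hypothetical cohorts and a hypothetical expert criterion.
cohorts = {"A": {"p1", "p2"}, "B": {"p3"}, "C": {"p4", "p5", "p6"}}
on_drug_x = {"p5", "p6"}

merged = merge(cohorts, "A", "B", "AB")
refined = split(merged, "C", lambda p: p in on_drug_x, "C-other", "C-drug-x")
```

Both helpers return new mappings rather than mutating their input, which makes it straightforward to undo a refinement step.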
Consider a user who is making a medication order decision for a specific cancer patient. Using DICON, this user
can apply his/her domain expertise and contextual knowledge to refine the initial set of algorithmically determined
similar patient clusters into cohorts that are more decision-appropriate. After refinement, historical statistics are then
extracted for each cohort and presented as supporting data to aid in the user’s decision. For example, in our prototype
system we present a target patient’s lab test results in the context of aggregate lab test results for various similar
patient cohorts who have undergone alternative disease-appropriate medication treatments. An example of this display
is shown in Figure 2. Following a similar workflow, such an approach is useful not only for clinicians but also for
other professionals such as medical directors and researchers.
Figure 2: Statistics for each cohort of similar patients are presented using histograms in our web-based prototype
system. This view shows the target patient’s lab results in the context of results for various similar patient cohorts.
DICON: Visualization Support for Cluster Analysis
DICON is an interactive visualization tool designed for cluster analysis. It uses dynamic icons to represent clusters of
data as shown in Figure 1. In this section we first describe the design principles we followed when developing DICON.
We then describe the visual encoding methodology employed by the DICON visualization. Next, we introduce three
key user interactions which enable dynamic user-driven cluster manipulation. Finally, we provide a brief overview of
the DICON system. A formal description and evaluation of DICON from a visualization perspective is beyond the
scope of this paper and is available elsewhere [21].
While exploring solutions for the problem of cohort refinement, we identified four central design principles that guided
the development of DICON. Specifically, we determined that the DICON visualization must provide:
• Multi-Granularity. Multidimensional EHR data contains a wide variety of information. An effective design
should be able to show various types of information distributions, data variances and diversities at different
levels of detail.
• Consistency. A visualization design should apply a uniform visual encoding across data types so that users can
smoothly switch between different information concepts. In particular, our design utilizes the same set of visual
properties and features to represent a range of data from individual patients to patient clusters.
• Stable Spatial Organization. Patient features, patients, and patient clusters should be spatially organized
such that positions encode meaning. Data updates, such as redefining cluster relationships, should be visually
reflected in a stable manner to maintain a user’s mental map as much as possible.
• Rich Interactivity. A rich set of user interactions should be supported to enable intuitive exploratory analysis
and refinement of patient clusters.
Figure 3: DICON uses an icon design that encodes (a) a feature vector for each patient as (b) a series of color
coded regions. This process is (c) repeated for all patients and (d) clusters are represented as interleaved hierarchical
arrangements of the color-coded regions. The overall icon conveys the underlying prominence of each dimension of
the feature vector across the cluster through the total area allocated to each color.
Following the design principles listed above, we designed a Dynamic ICON visualization technique which represents
clusters of multidimensional patient data as compact glyphs. The design uses a combination of spatial size, position,
color, and opacity to convey key cluster properties. The visual encoding for our design is illustrated in Figure 3.
As shown in Figure 3(a), patients are described by a set of numerical attributes. These values are derived from a
patient’s EHR. A subset of these attributes is selected as features to be represented in the visualization. DICON
visually represents each of these patient features using a colored rectangle. The color of the rectangle indicates the
type of feature while the area indicates the feature value. Feature values are normalized to a common scale (e.g.,
between 0 and 1) to allow the visualization of multiple features in the same icon regardless of scale. The rectangles
are packed together to form an iconic representation of the patient as shown in Figure 3(b).
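The normalization and area encoding described above can be illustrated with a short sketch. It assumes min-max scaling of each feature across patients and assumes that a rectangle's share of the patient icon is its normalized value divided by the sum of that patient's values; the paper states only that values are normalized to a common scale and that area indicates value. The function names and sample numbers are hypothetical.

```python
def normalize(values):
    """Min-max scale a list of raw feature values to the range [0, 1].
    Assumption: a constant feature maps to 0.0 for every patient."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def icon_area_shares(patient_features):
    """Fraction of a patient's icon given to each feature rectangle.
    Assumption: area is proportional to the normalized feature value."""
    total = sum(patient_features.values())
    if total == 0:
        return {name: 0.0 for name in patient_features}
    return {name: value / total for name, value in patient_features.items()}

# Hypothetical raw values for one feature across three patients.
scaled = normalize([90, 180, 135])

# Hypothetical normalized features for a single patient.
shares = icon_area_shares({"glucose": 0.5, "age": 0.25, "bmi": 0.25})
```

One consequence of this encoding is that a feature whose normalized value is zero receives no visible area in the icon.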
When a cluster contains more than one patient, the individual patient icons must be combined into a single aggregate
iconic representation. We generate a cluster’s icon by splitting each patient’s icon into the individual feature rectangles
and repacking these rectangles after grouping them by feature type. This is done using a treemap-based layout where a
cluster serves as the top level object, feature types form the second level of the hierarchy, and individual patients make
up the third and final level of the hierarchy. The size of a cluster icon represents the total number of entities in that
cluster. For example, the total area for an icon representing a 20-patient cluster will be twice the size of an icon for a
10-patient cluster. We use rectangular treemaps [22] as the base structure for our icons and apply the squarified treemap
layout algorithm [23] to obtain desirable aspect ratios for the rectangular cells.
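The three-level hierarchy (cluster, then feature type, then patient) can be sketched with a much simpler layout than the one used in DICON. The code below substitutes a basic slice-and-dice layout for the squarified algorithm, so it reproduces the grouping and the area proportions but not the improved aspect ratios. The function name `cluster_icon`, the `area_per_patient` parameter, and the sample data are hypothetical.

```python
def cluster_icon(patients, features, area_per_patient=100.0):
    """Lay out one square cluster icon.

    patients: {patient_id: {feature: normalized value >= 0}}
    Returns (side, cells), where each cell is
    (feature, patient_id, x, y, width, height).
    Assumption: slice-and-dice layout, not the squarified algorithm.
    """
    # Icon area grows linearly with the number of patients in the cluster.
    side = (area_per_patient * len(patients)) ** 0.5
    totals = {f: sum(p[f] for p in patients.values()) for f in features}
    grand = sum(totals.values())
    cells = []
    x = 0.0
    for f in features:                      # second level: feature types
        if totals[f] == 0:
            continue
        width = side * totals[f] / grand    # column width ~ feature prominence
        y = 0.0
        for pid in sorted(patients):        # third level: individual patients
            height = side * patients[pid][f] / totals[f]
            cells.append((f, pid, x, y, width, height))
            y += height
        x += width
    return side, cells

# Hypothetical two-patient cluster with two normalized features.
patients = {
    "p1": {"a": 0.5, "b": 0.5},
    "p2": {"a": 1.0, "b": 0.0},
}
side, cells = cluster_icon(patients, ["a", "b"])
```

Because every cell for a given feature sits in the same column, the total area of that column conveys how prominent the feature is across the cluster, which is the property the icon design relies on.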
Each cell is normally rendered with full opacity, resulting in the same color for all cells for a given feature type.
However, color opacity can also be mapped to one of several statistical measures to highlight various cluster properties.
For example, if a user wants to see a visual representation of cluster consistency, she can set the color opacity for cells
to reflect the difference between a cell’s value and the cluster’s mean value for the given feature. In this way, outliers
can be made to stand out from cells that are close to the mean.
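One possible mapping from deviation to opacity is sketched below. The text above says only that opacity can reflect the difference between a cell's value and the cluster mean; the choice to render larger deviations as more opaque, the linear scaling, and the `min_opacity` floor are assumptions, and `deviation_opacity` is a hypothetical name.

```python
def deviation_opacity(values, min_opacity=0.2):
    """Map each cell's distance from the cluster mean to an opacity in
    [min_opacity, 1.0]. Assumption: larger deviation means more opaque,
    so outliers stand out against cells near the mean."""
    mean = sum(values) / len(values)
    deviations = [abs(v - mean) for v in values]
    largest = max(deviations)
    if largest == 0:
        # Perfectly consistent cluster: every cell is equally faint.
        return [min_opacity for _ in values]
    return [min_opacity + (1.0 - min_opacity) * d / largest for d in deviations]

# Hypothetical normalized values of one feature within a cluster;
# the last patient is an outlier.
opacities = deviation_opacity([0.5, 0.5, 0.5, 0.9])
```

With this mapping the outlier is drawn at full opacity while the three cells near the mean are drawn noticeably fainter.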
This design brings a few key advantages. First, it compresses high dimensional cluster information within relatively
small cluster icons which can be easily embedded within other visualizations. For example, Figure 6 shows the
icons embedded within a scatter plot visualization. In addition, the design provides several visual cues that facilitate
exploratory analysis. Finally, our design scales well to large numbers of clusters as shown in Figure 4. Yet there are
also some limitations to our approach. In particular, the number of feature dimensions that can be visualized at any