Time to CARE: a collaborative engine for practical disease prediction
ABSTRACT The monumental cost of health care, especially for chronic disease treatment, is quickly becoming unmanageable. This crisis
has motivated the drive towards preventative medicine, where the primary concern is recognizing disease risk and taking action
at the earliest signs. However, universal testing is neither time nor cost efficient. We propose CARE, a Collaborative Assessment
and Recommendation Engine, which relies only on a patient’s medical history, recorded as ICD-9-CM codes, to predict future
disease risks. CARE uses collaborative filtering methods to predict each patient’s greatest disease risks based on their own
medical history and that of similar patients. We also describe an Iterative version, ICARE, which incorporates ensemble concepts
for improved performance. Also, we apply time-sensitive modifications which make the CARE framework practical for realistic
long-term use. These novel systems require no specialized information and provide predictions for medical conditions of all
kinds in a single run. We present experimental results on a large Medicare dataset, demonstrating that CARE and ICARE perform
well at capturing future disease risks.
Keywords: Collaborative filtering · Prospective medicine · Disease prediction · Electronic healthcare record
Data Min Knowl Disc (2010) 20:388–415
Darcy A. Davis · Nitesh V. Chawla ·
Nicholas A. Christakis · Albert-László Barabási
Received: 22 January 2009 / Accepted: 8 October 2009 / Published online: 25 November 2009
The Author(s) 2009
Responsible editor: R. Bharat Rao and Romer Rosales.
D. A. Davis · N. V. Chawla (B )
Department of Computer Science and Engineering, Interdisciplinary Center for Network Science
and Applications (iCeNSA), University of Notre Dame, Notre Dame, Indiana, USA
e-mail: email@example.com; firstname.lastname@example.org
D. A. Davis
N. A. Christakis
Harvard Medical School, Boston, MA, USA
Northeastern University, Boston, MA, USA
1 Introduction
Medical care and research are literally the most vital part of science for humans. Health
care expenditure in the U.S. alone is an overwhelming sum, with a strong majority
of this money used for chronic disease treatment. Experts expect the burden on the
system to increase continually in coming years. A Centers for Disease Control and
Prevention (CDC) study estimates that 880.5 million visits were made to physician
offices, about 3.1 visits per patient, in 2001 (Cherry et al. 2001). Since 1992, the average
age increased to 45 years, and the visit rate for persons 45 years of age and over
increased by 17% from 407.3 to 478.2 visits per 100 persons.
Research has shown many conditions to have recognizable indicators before onset
or preventable risk factors. From these discoveries comes the idea of prospective
medicine, aimed at determining and minimizing individual risk, as well as actively
addressing conditions at the earliest indication. In theory, these practices reduce the
number of conditions needing treatment and improve the effectiveness of necessary
interventions. However, the combinatorial problem generated by the different disease
health history and physical examination to approximate the risk of a patient, guiding
laboratory tests to further assess the patient’s stage of health. However, these sporadic
and qualitative ‘risk assessments’ generally focus on only a few diseases and are lim-
care is reactive, stepping in once the symptoms of a disease have emerged, rather than
proactive, treating or eliminating a disease at the earliest signs.
ease–gene associations, offering us detailed information on mutations, SNPs, and the
associated likelihood of developing specific disease phenotypes (Consortium 2007).
The underlying hypothesis behind this line of research is that once we catalogue all
disease-related mutations, we will be able to predict the susceptibility of each individual
to future diseases using various molecular biomarkers, ushering us into an era
of predictive medicine. Yet, these rapid advances have also exposed the limitations
of the genome-based approaches (Loscalzo 2007). Given the weak signals that most
disease-associated SNPs or mutations offer, it is increasingly clear that the promise of
the genome-based approaches may not be realized soon.
genomic approaches sufficiently mature? Our aim here is to show that phenotype and
disease history based approaches offer the promise of rapid advances towards disease prediction.
This research seeks to aid the development of a predictive system by examining the
interdependencies and how they translate into a patient’s future. Due to the common
genetic, molecular, environmental, and lifestyle-based individual risk factors, most
diseases do not occur in isolation (Barabasi 2007; Consortium 2007; Loscalzo et al.
2007). Shared risk and environmental factors have similar consequences, prompting
the co-occurrence of related diseases in the same patient. Therefore, a patient diagnosed
with a combination of diseases and exposed to specific environmental, lifestyle
and genetic risk factors may be at a considerable risk of developing several other
genetically and environmentally related diseases.
diseases a patient may develop? The underlying thesis of our work is to generate a
patient’s prognosis based on the experiences of other similar patients. Our goal is to
provide every patient with a personalized answer to the question: What are my disease risks?
We approach this problem using collaborative filtering methodology. Collaborative
filtering is designed to predict the preferences of one person (active user) based on
the preferences of other similar persons (users). The technique is based on the intuitive
assumption that people will enjoy the same items as their similar peers, or more
mon preferences. Predictions are based on datasets consisting of many user profiles,
each containing information about the individual user’s preferences. This has made a
significant impact on marketing strategies. We draw an analogy between marketing
and medical prediction. Each user is a patient whose profile is a vector of diagnosed
diseases. Using collaborative filtering, we can generate predictions on other diseases
based on a set of other similar patients. However, the ratings in our case are binary; a
patient either has a disease (1) or does not have a disease (0). There is no ordinal set
of ratings as is typically observed in movie or music data. Another difference is that
the users choose to rate movies and music, while the diseases are not a patient choice.
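The analogy above can be illustrated with a minimal sketch. It is not the CARE algorithm itself: it assumes each patient profile is a set of ICD-9-CM codes (equivalent to a binary vector), uses Jaccard similarity as a stand-in similarity measure, and scores each disease the active patient does not yet have by a similarity-weighted vote. The function names, patient identifiers, and toy histories are hypothetical.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two sets of diagnosis codes (assumed metric)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def disease_scores(active: Set[str], others: Dict[str, Set[str]]) -> Dict[str, float]:
    """Score every disease the active patient lacks by a similarity-weighted vote."""
    weights = {pid: jaccard(active, hist) for pid, hist in others.items()}
    total = sum(weights.values())
    scores: Dict[str, float] = {}
    if total == 0:
        return scores  # no similar patients, so no evidence to score with
    for pid, hist in others.items():
        for code in hist - active:  # only diseases not already diagnosed
            scores[code] = scores.get(code, 0.0) + weights[pid] / total
    return scores


# Toy histories: each patient is the set of codes with a binary rating of 1.
patients = {
    "p1": {"250", "401", "272"},
    "p2": {"250", "401", "585"},
    "p3": {"493"},
}
active_history = {"250", "401"}

# Rank candidate diseases from highest to lowest score.
ranked = sorted(disease_scores(active_history, patients).items(),
                key=lambda kv: kv[1], reverse=True)
print(ranked)
```

In this toy data the two patients who share the active patient's history contribute their remaining diagnoses with equal weight, while the dissimilar patient contributes nothing. A binary profile has no ordinal rating to average, so the score here is simply the weighted fraction of similar patients who have the disease.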
The key contributions of this work are listed below. Earlier work on the first two
contributions can also be found in Davis et al. (2008a,b).
1. A novel application of collaborative filtering in the medical domain for advancing
the field of prospective medicine. To our knowledge, collaborative filtering has
not previously been used for disease prediction. Unlike other disease prediction
software, we present a general system which makes predictions on all types of diseases and
medical conditions. Our system makes predictions using only ICD-9-CM (International
Classification of Diseases) codes (NC for Health Statistics 2007), which are
a common standard for insurance and medical databases. We do not require any
other information, such as lab tests, which can be expensive.
2. The collaborative filtering method employed, while building upon prior work,
incorporates new elements of significance testing and ensemble methods within
the CARE framework.
3. A time-sensitive system which uses a best sub-vector matching concept to exploit
the known ordering of disease diagnosis. The time-sensitive improvements to our
framework make it applicable to long-term, diverse data such as public health
records. They also help to automatically differentiate and correctly deal with
chronic versus non-chronic diseases.
4. Analysis of performance trends dependent on the amount of data known and the
length of time between diagnoses. This information provides guidelines for efficient
use in a practical setting.
5. Case studies are provided as a real-world example of the potential benefits of
2 Related work
from the medical community which further support the need for preventative medicine,
and various interdisciplinary efforts which previously led to computer-aided
medical prediction systems. While most of these systems are only loosely comparable
to CARE, they are representative of the same goals. We are not aware of any work
which is directly comparable to CARE.
As mentioned in the introduction, collaborative filtering is a data mining technique
that makes predictions about an active user based on information about other similar
and generate predictions based on their preferences. The first automated collaborative
(Shardanand and Maes 1995), which recommended internet news articles and music,
respectively. These systems are part of the larger class of memory-based algorithms,
which make predictions using the entire user database. This is typically accomplished
by calculating a weight of similarity between the active user and all others, and the
active user’s opinion is determined by the weighted average of the others’ opinions.
In many cases, only a limited number of ‘nearest neighbors’ are included in the calculation.
The most common similarity metrics are the Pearson correlation coefficient
Memory-based algorithms are simple, easily updated, and generally produce good
results. These advantages come at the cost of high resource consumption, since the
(Grcar et al. 2005). The second widely-used class of collaborative filtering algorithms
was preconstructed on the user database. The model-based algorithms are faster and
more scalable, in general. However, model building tends to be expensive, leading
to inflexibility for introducing new data. The quality of predictions for model-based
methods varies widely (Si and Jin 2003). Well-known model-based methods include
Bayesian clustering or models (Breese et al. 1998), Personality Diagnosis (Pennock
and Horvitz 1999), Singular Value Decomposition (Goldberg et al. 2000; Paterek
2007), and the Aspect model (Hofmann 2004; Hofmann and Puzicha 1999). There
are also many content-based recommender systems. However, no appropriate source of
disease ‘content’ is available, so these are of little relevance to our problem.
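The memory-based scheme described above (similarity weights between the active user and all others, a weighted average of opinions, and a limited number of nearest neighbors) can be sketched generically as follows. This is an illustrative sketch under stated assumptions, not the implementation of any cited system: it assumes complete rating vectors, uses the Pearson correlation coefficient as the weight, keeps only positively correlated neighbors, and takes a plain weighted average without mean-centering. The names `pearson` and `predict` and the toy ratings are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def predict(active: Sequence[float], users: List[Sequence[float]],
            item: int, k: int = 2) -> float:
    """Predict the active user's rating of `item` from the k nearest neighbors."""
    # Similarity is computed on every item except the one being predicted.
    others = [j for j in range(len(active)) if j != item]
    neighbors = []
    for u in users:
        w = pearson([active[j] for j in others], [u[j] for j in others])
        if w > 0:  # assumption: ignore uncorrelated and negatively correlated users
            neighbors.append((w, u[item]))
    neighbors.sort(key=lambda t: t[0], reverse=True)
    top = neighbors[:k]
    norm = sum(w for w, _ in top)
    return sum(w * r for w, r in top) / norm if norm else 0.0


# Toy ratings on a 1-5 scale; the last entry of `active` is an unused placeholder.
active = [5, 3, 1, 0]
users = [
    [5, 3, 1, 4],  # same taste as the active user
    [4, 3, 2, 5],  # same ordering of preferences
    [1, 3, 5, 1],  # opposite taste, excluded by the weight filter
]
print(predict(active, users, item=3, k=2))
```

With these toy ratings the two agreeing users receive nearly equal weights, so the prediction is about 4.5, the average of their ratings for the held-out item. Every prediction scans the whole user list, which illustrates the resource cost attributed to memory-based methods above.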
Early treatment (Coyle and Hartung 2002), screening (Institute 2007), lifestyle
change (Hunt et al. 1995), and other interventions (Edelman 2006; Koertge et al.
2003) are common themes in modern medical research, where early intervention is
shown repeatedly to improve disease outcome and quality of life. Nonetheless, these
proactive treatments are far from the norm in our largely reactive health care system.
Snyderman and Williams (2003) provide an outstanding overview of the flaws of
the current system and potential benefits of a prospective health care system. They
suggest that data mining is a “central feature” of prospective health care. Glasgow
(2001) support the feasibility of the preventive approach. They state that much of the
chronic disease burden can be prevented, and further posit that existing management
strategies can also be used to advance prevention.
Many proponents of prospective medicine emphasize genomic studies and other
breakthrough research in human biology. It is undeniable that genomic research is
rapidly advancing (Consortium 2007) and holds great promise for medicine. Unfortunately,
applicability to the general public is still very limited (Loscalzo 2007). Similarly,
Weston and Hood (2004) express excitement with advancements in systems
biology and proteomics, but acknowledge that we still need to learn how to realistically
translate discoveries into health benefits. Also, they recognize that there are still
“enormous challenges” to overcome. Though low-tech in comparison, CARE demonstrates
that existing data and technology can provide immediate advancement toward
prediction due to the known collaborative nature of diseases. A wide variety of studies
on disease comorbidity, i.e., the simultaneous occurrence of two or more distinct
diseases, have shown that multiple risk factors cannot reliably be considered in isolation
(Starfield et al. 2003). Co-occurring factors can have a synergistic effect, leading
to unexpectedly high risk (Loscalzo 2007; Kannel et al. 1961). Van den Akker et
al. (1998) mention that the incidence of comorbid diseases is increasing. They state
that statistical clustering of comorbid diseases was surprisingly strong, even among
young subjects. This result implies likely interaction between many of the coinciding
diseases.
Many different computer-aided methods have been developed for medical prediction.
Most of these systems are designed to make predictions about a single disease
or class of diseases. Usually, the predictions are generated from some combination of
basic data such as demographic information and physical description with additional
condition-specific test results or family history. One well-known system is APACHE III
(Wong and Knaus 1991), a prognostic scoring system for predicting inpatient mortal-
health status. A wide variety of systems have been developed for predicting risk of
individual diseases or complications, such as specific heart conditions (Cordn et al.
2002), Alzheimer’s disease (Liu et al. 2007), and cancer (Mould 2003). While data