
A prediction model for targeting low-cost, high-risk members of managed care organizations



To describe the development and validation of a predictive model designed to identify and target HMO members who are likely to incur high costs. Split-sample multivariate regression analysis. We studied enrollees in a 350 000-member HMO with ≥1 claim in 1998 and 1999. The prediction model uses a combination of clinical and behavioral variables and 1998 and 1999 claims data. The prediction model was applied and used to rank low-cost patients (1998 cost <$2000) according to their estimated probability of incurring costs ≥$2000 in 1999. For prospective testing, we applied our models to data that are not available in advance. The same prediction model was applied to rank a different set of low-cost patients (1999 cost <$2000) according to estimated probability of incurring costs ≥$2000 in 2000. Because the predictions were used for disease management purposes, the outcomes of a randomly selected control group not intervened on for the disease management program were analyzed. The predictive accuracy of the model was tested by comparing the percentages of "targeted" vs all low-cost patients who incurred high costs in the subsequent year. Of the low-cost, top-ranked 1998 patients, 47.8% incurred high (≥$2000) medical expenses in 1999 vs 14.2% of randomly selected patients who were low cost in 1998. Of the top-ranked 1999 patients, 39.7% incurred high costs in 2000 vs 12.2% of the randomly selected low-ranked patients. The prediction model successfully identifies low-cost, high-risk patients who are likely to incur high costs in the next 12 months.
Presently, most managed care organizations
(MCOs) manage population risk by applying
disease management or case management pro-
grams. The targeting of specific events or diseases is
a proxy for risk because members who were hospitalized
or have ≥1 specific (“index”) diseases (such
as asthma, congestive heart failure, or diabetes mel-
litus) experience higher claims costs than the aver-
age for the plan. However, there are many ways of
trying to identify potentially high-cost patients,
including using diseases (such as AIDS or renal fail-
ure) likely to require expensive treatments, previous
hospital or emergency department utilization, or the
level or rate of increase in recent medical expenses.
Different interventions may be applied to each
group depending on the identification method.
Interventions are used on these members with the
objectives of reducing costs and improving outcomes.
This approach has 2 obvious disadvantages. First,
some patients without one of the index diseases are
likely to incur high medical expenses. Second, dis-
eased populations are not homogeneous; thus, mem-
bers whose medical conditions are well controlled
are statistically less likely to experience higher-than-
average costs in the future (and, therefore, are less
likely to benefit from any intervention). The tradi-
tional approach has a third shortcoming: because of
the intensity of resources required to manage the
potentially high-risk members, health plans limit
their interventions either to case management of the
current high-cost patients or to broad, often untar-
geted, disease management programs.
“Population risk management” consists of 3 components:
Identification of high-risk populations or
Use of specific interventions in the high-risk
group to reduce the resource utilization and cost
of the group.
Application of pricing and underwriting tech-
niques to convey financial signals to plan mem-
bers and sponsors. A group’s health insurance
premiums, for example, could be based on their
estimated future utilization rather than on the
traditional underwriting method of a projection of
historical costs.
A Prediction Model for Targeting
Low-Cost, High-Risk Members of
Managed Care Organizations
Henry G. Dove, PhD; Ian Duncan, BPhil, BA; and Arthur Robb, PhD
Objective: To describe the development and validation of a predictive model designed to identify and target HMO members who are likely to incur high costs.
Study Design: Split-sample multivariate regression analysis.
Patients and Methods: We studied enrollees in a 350 000-member HMO with ≥1 claim in 1998 and 1999. The prediction model uses a combination of clinical and behavioral variables and 1998 and 1999 claims data. The prediction model was applied and used to rank low-cost patients (1998 cost <$2000) according to their estimated probability of incurring costs ≥$2000 in 1999. For prospective testing, we applied our models to data that are not available in advance. The same prediction model was applied to rank a different set of low-cost patients (1999 cost <$2000) according to estimated probability of incurring costs ≥$2000 in 2000. Because the predictions were used for disease management purposes, the outcomes of a randomly selected control group not intervened on for the disease management program were analyzed. The predictive accuracy of the model was tested by comparing the percentages of “targeted” vs all low-cost patients who incurred high costs in the subsequent year.
Results: Of the low-cost, top-ranked 1998 patients, 47.8% incurred high (≥$2000) medical expenses in 1999 vs 14.2% of randomly selected patients who were low cost in 1998. Of the top-ranked 1999 patients, 39.7% incurred high costs in 2000 vs 12.2% of the randomly selected low-ranked patients.
Conclusions: The prediction model successfully identifies low-cost, high-risk patients who are likely to incur high costs in the next 12 months.
(Am J Manag Care 2003;9:381-389)
From the Division of Health Policy and Administration,
Department of Epidemiology and Public Health, Yale University,
New Haven, Conn (HGD); Lotter Actuarial Partners, Inc, New York,
NY (ID); and LandaCorp, Inc, Montclair, NJ (AR).
This study was supported by Landacorp, Inc, Atlanta, Ga.
Corresponding author: Henry G. Dove, PhD, Division of Health
Policy and Administration, Department of Epidemiology and Public
Health, Yale University, 60 College St, PO Box 208034, New
Haven, CT 06520-8034. E-mail:
It is well known in healthcare that a small per-
centage of the members of a health plan consume a
significant percentage of its resources; it is assumed,
often incorrectly, that the behavior of this minority
is replicated period after period. This assumption is
refuted by the data in Table 1. These data, repre-
senting approximately 209 000 members of a 350 000-
member health plan, show that the highest-cost
members, those costing ≥$25 000 in incurred claims
in 1998 (and, hence, candidates for traditional case
management), represented 1% of the members but
21% of the cost in 1998; in the following year, this
cohort consumed only 7% of total plan costs.
Conversely, enrollees in the lowest-cost medical
class, with costs <$2000 in 1998, accounted for 58%
of all costs in 1999 (Table 1). The phenomenon
being illustrated here, regression to the mean (in
which the resource consumption of most high-cost
patients generally decreases, even in the absence
of any intervention), is well known in health plans
but seems to be overlooked as plans attempt to
find and manage their high-risk members.
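The size of this regression to the mean can be read directly from Table 1. The following is a minimal Python sketch that uses only the per capita figures reported in Table 1; the dictionary layout and the function name are illustrative assumptions and are not part of the study.

```python
# Average per capita expenses ($) from Table 1, keyed by 1998 expense category:
# (1998 value, 1999 value)
TABLE_1_PER_CAPITA = {
    "Low": (324, 1191),
    "Medium": (5658, 5385),
    "High": (49032, 15800),
}


def percent_change(before: float, after: float) -> float:
    """Percentage change from the base year to the subsequent year."""
    return 100.0 * (after - before) / before


changes = {
    category: percent_change(cost_1998, cost_1999)
    for category, (cost_1998, cost_1999) in TABLE_1_PER_CAPITA.items()
}

for category, change in changes.items():
    print(f"{category}: {change:+.1f}%")
```

On these figures, per capita cost in the 1998 high-cost class falls by roughly two thirds in 1999, while per capita cost in the low-cost class more than triples.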
In population risk management, a variety of clin-
ical and behavioral variables are used to rank each
patient according to his or her estimated probability
of incurring high medical expenses in a subsequent
period. This article describes the design, develop-
ment, and validation of a prediction model targeting
selected members in a large regional HMO in the
southwest United States. The goal is to identify low-
cost patients (<$2000 in the “base year”) who are
likely to become high-cost patients (in the absence
of any intervention) in the subsequent year.
Two other components of a population risk man-
agement strategy are interventions (programs that
aim to change patient behavior, healthcare delivery
processes, and patient outcomes through education
and coaching) and pricing and underwriting (tech-
niques that aim to change behavior through price
signals). Selecting appropriate, cost-effective inter-
vention strategies for targeted patients and reflecting
prospective risk in pricing and underwriting deci-
sions (to the extent allowed by regulatory and ethi-
cal constraints) are related challenges that are topics
for future articles.
The prediction model was developed on patients
who were enrolled in a large HMO in 1998 and 1999
and had at least 1 medical or pharmacy claim in
both years. Patient medical claims and pharmacy
claims incurred in the subsequent year were the
source of outcomes for these patients. Population
risk management aims to reduce the cost of the tar-
geted population; although this may result in
improved health markers, the objective of risk
management is to improve financial outcomes for
the health plan. No reviews of patient medical
records, questionnaires, or special surveys are con-
ducted. Reliance on administrative data is an effi-
cient, low-cost approach that is ideal for population
risk management.
Before creating a prediction model, the demo-
graphic data and medical and pharmacy claims of
MCO patients were checked for completeness,
integrity, and consistency. Data preparation includ-
ed several activities:
Adopting an adequate “run-out” period (in this
case 4 months), determined using standard actu-
arial methods for assessing the completeness of
incurred claims data.
Table 1. Distribution of Enrollees by Expense Categories and Percentage of Total Expenditures in 1998 and 1999

Medical Expense          1998 Average Per     1998 Total      1998 Total        1999 Average Per     1999 Total
Category in 1998         Capita Expenses, $   Enrollment, %   Expenditures, %   Capita Expenses, $   Expenditures, %
Low (<$2000)             324                  87              23                1191                 58
Medium ($2000-$24 999)   5658                 12              56                5385                 35
High (≥$25 000)          49 032               1               21                15 800               7

Separating medical expenses into categories such
as professional services, hospital inpatient servic-
es, hospital outpatient services, laboratory and
diagnostic tests, and pharmaceutical items. Data
checks were used to measure the data's internal
consistency against benchmarks. Data that were
rejected based on the diagnostic reports were
resubmitted and rechecked.
Distinguishing the employee (policyholder) and
his or her dependents by assigning each enrollee
a unique member number. The claims for each
patient were collected, tabulated, and grouped
into a patient-centered database.
Identifying “covered charges” that reflected only
those medical services that the MCO was obligat-
ed to pay.
Using the amount the MCO paid for each claim
rather than the billed or charged amount or the
amount patients paid.
Members of the health plan are subject to differ-
ent plan designs, with variable copays, limits, exclu-
sions, and so on, as set by their employers. It can be
argued that a member’s behavior is influenced by
the specific design of the benefit; however, this is
one of many variables that we do not recognize in
our modeling. The potential for incorrectly assign-
ing a member (as high risk or not high risk) based
solely on plan design is considered to be minor.
The concepts of “patient risk” and “outcomes”
require clarification because these terms are often
unclear or imprecise. Outcomes is a broad term that
can have very different meanings for clinicians, epi-
demiologists, utilization management specialists,
risk managers, actuaries, and quality assurance pro-
fessionals. Outcomes may pertain to functional sta-
tus, patient satisfaction, mortality, hospital
utilization, or cost. Consistent with the objective of
population risk management, our sole focus was on
financial outcomes, as defined by total incurred
medical claims in a 1-year period.
Our analyses focus on a “≥$2000/<$2000” criterion,
which raises 2 issues:
Why a threshold? Why not predict dollar cost?
If we are going to recognize a threshold, why
$2000 and not another dollar amount?
A cutoff value of $2000 for annual patient-
incurred medical (including pharmacy) expenses
was used to establish a binary outcome variable, that
is, enrollees were either low-cost patients (<$2000)
or high-cost patients (≥$2000). A component of our
definition of patient risk, the probability that a
patient will incur paid medical expenses ≥$2000 in
the succeeding 1-year period, is based on a patient’s
characteristics in the base year. The base year in
this article is 1998, and the subsequent year is 1999.
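The binary outcome described above can be sketched in a few lines of Python. The member records below are hypothetical and serve only to illustrate the definition; the $2000 threshold is the one used in this article.

```python
THRESHOLD = 2000.0  # dollars; the cutoff used in this article


def is_high_cost(annual_paid_claims: float, threshold: float = THRESHOLD) -> int:
    """Return 1 if annual paid claims are at or above the threshold, else 0."""
    return 1 if annual_paid_claims >= threshold else 0


# Hypothetical members: (member id, 1998 paid claims, 1999 paid claims)
members = [("A", 324.0, 1191.0), ("B", 1500.0, 6800.0), ("C", 5658.0, 900.0)]

# The model targets only members who were low cost in the base year (1998).
low_cost_base = [m for m in members if is_high_cost(m[1]) == 0]

# Outcome: did the low-cost member become high cost in the subsequent year (1999)?
outcomes = {member_id: is_high_cost(cost_1999) for member_id, _, cost_1999 in low_cost_base}
print(outcomes)
```

Member C is excluded because the base year cost is already at or above the threshold; members A and B remain as candidates, and only B has a positive outcome.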
The $2000 threshold was chosen for practical and
statistical reasons and to differentiate our approach
from the traditional approach used in disease man-
agement and case management. Disease manage-
ment–and case management–based methods for
identifying high-risk members rely on diagnoses
(using International Classification of Diseases, Ninth
Revision, Clinical Modification, codes) and events
(hospitalizations, emergency department visits, etc).
Frequently, members traditionally labeled as high risk
have base year costs ≥$2000 and often considerably
>$2000. Conversely, a population with costs <$2000
in the base year is not traditionally thought of as
being at high risk. Yet, our data and models show
that this population contains a considerable num-
ber of high-risk members. The models are general
and may be applied to any dollar threshold.
Because the subpopulation of low-cost consumers
is large, the targeting process will deliver a large
number of potential high-risk intervention candi-
dates who generate a significant percentage of a
health plan’s total expense. Conversely, if the focus
is on only the “repeaters” from the high-cost sub-
groups, relatively little total cost is identified for
intervention and management. From a business per-
spective, health plans are interested in identifying
subgroups whose management will have a noticeable
effect on the plan’s overall financial results.
Our definition of risk extends beyond cost. In our
work, we used a compound dependent variable
designed to capture 2 important risk factors: the
absolute amount of claims and the incidence or con-
centration of those claims. Thus, a patient whose
claims are more highly aggregated represents a high-
er risk than a patient who has the same absolute
amount of claims spread over 12 months. In general,
healthcare costs are transformed if they are to be
used as a dependent variable. As pointed out in the
recent Society of Actuaries study of risk adjusters,
large claims should be truncated. Owing to the distri-
bution, costs must be logarithmically transformed. In
a sense, the binary variable (≥$2000/<$2000) is the simplest
transform. This form of transform in turn allows
Targeted Care for Low-Cost Members
us to include additional information (on the concen-
tration of claim amount) in the dependent variable.
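For comparison with the binary transform, truncation followed by a logarithmic transform can be sketched as follows. This is an illustration only: the cap of $25 000 and the use of log(1 + x) are assumptions made for the example, and the sketch does not reproduce the compound dependent variable used in our work.

```python
import math


def truncate_and_log(cost: float, cap: float = 25_000.0) -> float:
    """Cap an annual claim total at `cap` (hypothetical value), then apply log(1 + x).

    log1p is used so that a member with zero cost maps to 0 rather than -infinity.
    """
    return math.log1p(min(cost, cap))


costs = [0.0, 324.0, 5658.0, 49_032.0]
transformed = [truncate_and_log(c) for c in costs]

for cost, value in zip(costs, transformed):
    print(f"{cost:>10.2f} -> {value:.3f}")
```

Any cost above the cap maps to the same transformed value, which limits the leverage of very large claims on a fitted model.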
There is a simple reason why categorical variables
are advantageous: raw dollar amounts are not uni-
formly calibrated from patient to patient. In other
words, 2 patients may have been subject to the same
medical services but may incur significantly different
costs due to differences in plan design, choice of
providers, or provider billing practices. Because sig-
nificant cost differences can be due to exogenous fac-
tors, it is preferable to use cost as a relative
approximation of risk rather than as an absolute
measure. In this sense, the $2000 threshold is mere-
ly a simple categorical variable.
The specific threshold of $2000 was chosen for 2
reasons. One is a practical consequence of the fact
that this model was used to select patients for an
intervention. In general, to show positive return on
investment in an intervention program, as well as to
simply have statistically measurable outcomes,
interventions require significant numbers of poten-
tial enrollees. Use of the $2000 threshold casts a
“wide net” for intervention targets. Second, because
higher costs are driven by events, it was deemed
advantageous to choose a threshold that correlated
strongly to the existence of an event as both a posi-
tive and negative proxy indicator. The $2000 thresh-
old fulfills this need.
Few articles have appeared in peer-reviewed jour-
nals that attempt to identify patients likely to incur
high medical expenses in a subsequent year among
patients who incurred modest medical expenses in
the preceding or base year. Meenan and his colleagues
at the Kaiser Permanente Center for Health
Services Research developed and tested 3 models to
identify high-cost risk status in a large sample of
approximately 100 000 HMO members from 3
health plans. LoBianco et al studied high-cost
Medicaid users. Forman and his colleagues and
Lynch et al studied repeaters. We believe that our
analytical technique addresses an important under-
studied group: individuals with no obvious previous
costs who are less likely to be identified in the other
studies referred to previously.
The more common focus of research using
administrative data sets has been for the purpose of
risk adjustment. The statistical methods, data
sources, and variables used for identifying high-cost
patients and creating risk adjusters are similar.
However, the goal of risk adjustment is not to iden-
tify individual patients with high-cost conditions or
to intervene in their care. The main objective of risk
adjustment is to accurately predict the average
annual expenditures for an individual patient to
redistribute premiums to health plans. Thus, the
coefficients or groupings models that researchers
have created for risk adjustment are not designed for
identifying high-cost patients, although these models
also use disease categories, comorbidities, and
demographic variables.
Targeting the Right Patients in
Population Risk Management
The study objective was to develop a prediction
model using variables in medical and pharmacy
claims data sets to identify patients with medical
expenses <$2000 in 1998 who were likely to incur
high medical expenses (≥$2000) in 1999. This study
was the first phase of a population risk manage-
ment project in which these targeted members
were randomly assigned to an intervention consist-
ing of a nurse-based, outbound-telephone survey
with 3 purposes:
To identify gaps in care.
To further stratify the population to identify “false
positives” whose diseases are well controlled.
To help members become more compliant with
the prescribed treatment regimen.
Targeted members were assigned randomly to
intervention and control groups to evaluate the
interventions. The effectiveness of the interventions
is a subject for future articles.
Table 1, which was constructed from data before
the introduction of a targeted intervention program,
shows that a significant number of high-cost patients
in 1998 became low-cost patients in 1999.
Identification of high-risk members for interventions
is just one aspect of population risk management.
Equally important is the identification of high-cost
1998 members whose medical costs will decline
because they represent a group for whom the appli-
cation of resources should be limited.
Modeling for population risk management should
take into account a patient’s characteristics in addi-
tion to his or her base year expenses. On a larger
scale, claims-based prediction models should also
take into account plan composition. Significant dif-
ferences among MCO populations occur because of
plan-specific factors such as plan type (independent
practice association vs staff model, etc), capitation
agreements, copayment/deductible agreements,
Medicare/commercial mix, and level of pharmaceuti-
cal caps. Regional differences in medical expenses
may also reflect cultural differences, variation in clin-
ical practices, and availability of medical resources. If
an MCO has a large number of enrollees in a single
region, we found that it is preferable to develop a pre-
diction model for a single MCO rather than to pool
and then try to make statistical adjustments to
claims data from multiple heterogeneous health
plans. The brunt of the effort is in the process of
preparing data, not in making statistical calculations.
The model was developed for a large HMO with an
average enrollment of approximately 350 000 persons.
The 209 000 patients studied had at least 1 claim in
1998 and 1999 and costs of <$2000 in 1998. The
base year was 1998, and 1999 was the subsequent
period in which financial outcomes were measured
from medical and pharmacy claims. The standard
split-sample technique (model developed on half of
the 1998 patients and tested and validated on the
other half) was used to prevent overfitting.
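A split-sample design of this kind can be sketched as a seeded random partition of member identifiers. The function name, the seed, and the use of integer identifiers are assumptions for illustration; the article does not describe how the two halves were drawn.

```python
import random


def split_sample(member_ids, seed=0):
    """Randomly split member IDs into a development half and a validation half."""
    ids = list(member_ids)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]


# Hypothetical member identifiers 0..999
development, validation = split_sample(range(1000))
print(len(development), len(validation))
```

Coefficients would be estimated on `development` only and then applied, unchanged, to `validation`, so that the reported accuracy is not inflated by fitting and testing on the same patients.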
The prediction model was developed using multi-
ple regression model analysis. Several regression
models were calculated using dependent variables
such as the ≥$2000/<$2000 binary variable, various
transformations for the 1999 cost, and a proprietary
“cost grouper” that measures the degree of cost con-
centration in a given period. The final independent
patient variables included patient age, sex, number
of specific comorbid conditions, number of distinct
drug classes, number of physician visits, and non-
physician/nonhospital medical utilization. Binary
(0=absent and 1=present) variables were created
that flagged diabetes, cardiovascular, respiratory,
and psychiatric diseases.* Other independent vari-
ables that reflect behavioral factors (eg, the primary
treatment regimen for each disease state, the
patient’s prescription compliance pattern, and the
patient’s propensity to keep regular appointments
with physicians) were created through a transforma-
tion of each patient’s medical and pharmacy claims
in the base year and were available for inclusion in
the model. In the Appendix, we show an example of
the specific coefficients and variable for one model
used to develop predictions.
The relationships between input variables and
outcomes (eg, costs <$2000 annually/costs ≥$2000
annually) may not be linear. However, input vari-
ables can often be transformed so that the resulting
relationship between the aggregates of transformed
values of the independent variable and the dependent
variable is linear or nearly linear. For example,
the relationship between the number of comorbid
conditions and the probability of annual costs being
≥$2000 is nearly linear (Figure 1).
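One way to check whether such a grouped relationship is close to linear is to fit a straight line by ordinary least squares and inspect the residuals. The sketch below does this in plain Python. The percentages are hypothetical and were not read from Figure 1; the function name is likewise an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: number of comorbid conditions and the percentage of
# patients in that group whose subsequent-year costs were >= $2000.
comorbidities = [0, 1, 2, 3, 4, 5]
pct_high_cost = [10.0, 18.0, 26.0, 33.0, 42.0, 50.0]

slope, intercept = fit_line(comorbidities, pct_high_cost)
residuals = [y - (intercept + slope * x) for x, y in zip(comorbidities, pct_high_cost)]
print(f"slope={slope:.2f}, intercept={intercept:.2f}, "
      f"largest residual={max(abs(r) for r in residuals):.2f}")
```

Small residuals relative to the range of the outcome suggest that the variable can enter the regression without further transformation.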
By applying the regression coefficients to low-cost
patients’ individual characteristics, a numerical
score was calculated for every patient. The score
directly corresponds to the probability that the
patient will incur medical expenses ≥$2000 in the
subsequent year. For the purpose of designing an
intervention program, the patients who are targeted
for earliest deployment to health management and
nurse intervention are those with the highest scores
(ie, the patients with the highest probability of
incurring costs ≥$2000).
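The scoring and ranking step can be sketched as follows. The coefficient values and the three patients are hypothetical and chosen only to make the arithmetic easy to follow; they are not the coefficients estimated in this study, and only a subset of the model's variables is shown.

```python
# Hypothetical coefficients for a linear score (not the study's estimates).
COEFFICIENTS = {
    "intercept": 0.02,
    "age": 0.002,
    "comorbidities": 0.06,
    "drug_classes": 0.03,
}


def risk_score(patient: dict) -> float:
    """Apply the regression coefficients to one patient's characteristics."""
    score = COEFFICIENTS["intercept"]
    for name, coefficient in COEFFICIENTS.items():
        if name != "intercept":
            score += coefficient * patient[name]
    return score


# Hypothetical low-cost patients in the base year.
patients = [
    {"id": "P1", "age": 30, "comorbidities": 0, "drug_classes": 1},
    {"id": "P2", "age": 62, "comorbidities": 3, "drug_classes": 5},
    {"id": "P3", "age": 45, "comorbidities": 1, "drug_classes": 2},
]

# Highest scores first: these patients would be targeted earliest.
ranked = sorted(patients, key=risk_score, reverse=True)
ranked_ids = [p["id"] for p in ranked]
print(ranked_ids)
```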
The probability of identified members experiencing
costs ≥$2000 declines as the number of identified
members increases. This inverse relationship,
or “yield curve,” suggests that a population risk
management program should focus on high-risk
patients (Figure 2).
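A yield curve of this kind can be computed from the observed outcomes once patients are sorted by score. The sketch below is a minimal illustration with hypothetical 0/1 outcomes; the function name and data are assumptions.

```python
def yield_curve(outcomes_by_rank):
    """Cumulative hit rate at each targeting depth.

    `outcomes_by_rank` holds 0/1 outcomes ordered from the highest to the
    lowest score. Returns (fraction of patients targeted, fraction of those
    targeted who became high cost) for every depth.
    """
    curve = []
    hits = 0
    total = len(outcomes_by_rank)
    for n_targeted, outcome in enumerate(outcomes_by_rank, start=1):
        hits += outcome
        curve.append((n_targeted / total, hits / n_targeted))
    return curve


# Hypothetical outcomes (1 = costs >= $2000 in the subsequent year).
outcomes = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
curve = yield_curve(outcomes)

for fraction_targeted, hit_rate in curve:
    print(f"{fraction_targeted:.0%} targeted -> {hit_rate:.0%} high cost")
```

The final point of the curve is the base rate of the whole low-cost population, which is the natural benchmark for any targeted subgroup.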
Figure 1. Relationship Between the Number of Comorbid Conditions and the Probability of Annual Costs Being ≥$2000
[Figure not reproduced. x-axis: Comorbidities, No.; y-axis: Medical Expenses ≥$2000, %]
*International Classification of Diseases, Ninth Revision, Clinical
Modification, code values of the comorbid conditions or diagnostic
categories are available from the authors. The clinical logic that
determines the presence or absence of a specific condition [based
on pharmacy utilization patterns, demographic characteristics, lab-
oratory test results, and physician visit patterns] is proprietary.
Retrospective Validation: 1999 Claims Expenses
The prediction model was applied to approxi-
mately 209 000 members who were enrolled in 1998
and had claims of <$2000. The members were
ranked from high to low according to the probability
of incurring medical expenses ≥$2000 in 1999.
Results are given in Table 2 and illustrate more
clearly the relationship exhibited in Figure 2.
The inverse relationship between these 2 vari-
ables indicates that the higher the percentage of low-
cost 1998 enrollees targeted, the lower the
proportion of those targeted patients incurring medical
expenses ≥$2000 in 1999. For example, the
highest ranked 1054 low-cost 1998 patients, who
were 0.5% of the 1998 low-cost enrolled population,
had a 51.0% probability of incurring high costs,
approximately 3.6 times the average of the entire
low-cost population. Table 2 also gives (by rank) the
average claims costs incurred by targeted members
who had claims ≥$2000.
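The "3.6 times" figure is the ratio of the targeted group's rate to the rate for the entire low-cost population. A short Python check using three rows of Table 2 (the variable and function names are illustrative assumptions):

```python
# (rank, cumulative patients, probability of costs >= $2000 in 1999, %) from Table 2
TABLE_2_ROWS = [
    (40, 1054, 51.0),
    (34, 4019, 47.8),
    (0, 209069, 14.2),
]

BASELINE_RATE = 14.2  # rank 0 row: the entire 1998 low-cost population


def lift(rate: float, baseline: float = BASELINE_RATE) -> float:
    """Ratio of a targeted group's high-cost rate to the population rate."""
    return rate / baseline


lifts = {rank: round(lift(rate), 1) for rank, _, rate in TABLE_2_ROWS}
print(lifts)
```

The ratio shrinks as the cutoff rank is lowered, which is the same inverse relationship shown by the yield curve.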
Prospective Validation: 2000 Claims Expenses
For the prospective test, the prediction model was
applied to low-cost 1999 patients. Members in 1999
and 2000 were ranked from high to low according to
the estimated probability of incurring costs ≥$2000
in 2000. Because this prediction was done as part of
an intervention program, the highest-ranking high-
risk patients were selected to receive nurse inter-
ventions. This group of 5535 members had a risk
ranking similar to that of the group we reported previously
herein (risk rank ≥34). Eighty percent of
these targeted patients (n=4428) were randomly
selected to receive an intervention and, thus, were
inappropriate for prospective validation because
their behavior was subject to change by the inter-
vention. The remaining 20% of the high-risk patients
(n=1107) composed the control group and received
no intervention.
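The 80/20 random assignment can be sketched as a seeded shuffle of the targeted members. The group sizes below match those reported in the text; the seed, function name, and integer identifiers are assumptions, since the article does not describe the randomization procedure itself.

```python
import random


def assign_groups(member_ids, intervention_fraction=0.8, seed=0):
    """Randomly assign targeted members to intervention and control groups."""
    ids = list(member_ids)
    random.Random(seed).shuffle(ids)  # fixed seed for a reproducible assignment
    n_intervention = round(len(ids) * intervention_fraction)
    return ids[:n_intervention], ids[n_intervention:]


# 5535 targeted members, identified here by hypothetical integer IDs.
intervention, control = assign_groups(range(5535))
print(len(intervention), len(control))
```

Only the control group is usable for prospective validation, because its outcomes reflect what happens to high-scoring members in the absence of any intervention.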
The subpopulation identified by the prediction
model as high risk, on average, was older, was more
likely to be male, and had more comorbid condi-
tions than the nontargeted population (Table 3).
The prevalences of asthma, diabetes, and conges-
tive heart failure for the high-risk population were
also significantly higher than those for the low-risk population.
Table 4 displays results for the 1107 members of
the control group and of the remaining, untargeted
171 071 members who experienced claims <$2000
in 1999 but who were not targeted for intervention
because their risk scores were low (risk rank ≤33).
Pharmaceutical and medical claims incurred in
2000 were used to calculate the relative frequency of
patients with paid claims ≥$2000 and to compare
the average cost of the “control” high-risk patients
with that of the low-risk patients.
Table 4 provides the claims expenses for 2000 for
the targeted and nontargeted patients. Of the target-
ed, highest-ranked patients, 39.7% incurred high
(≥$2000) claims expenses in 2000 compared with 12.2%
of the nontargeted patients.
Although the prediction model was constructed
using data from one large HMO instead of pooled data
from other MCOs, the model can be generalized and
modified to fit other populations.
The key variables used in the regression model
(patient age, sex, number of chronic conditions,
number of distinct drug classes, number of physician
visits, nonphysician/nonhospital medical utilization,
and presence or absence of diabetes, cardiovascular,
respiratory, and psychiatric diseases) have been tested
on several other large HMO data sets and different
periods. The coefficients differ somewhat because of
differences between plan-specific factors such as
plan type, physician incentives, copayment/deductible,
and pharmacy benefit levels. However, the variables
used in the prediction model in the 209 000-mem-
ber study group are identical to those found in
HMOs with different patient and financial charac-
teristics. If 2 years’ data are available for a large
MCO, our preference is to construct MCO-specific
prediction models rather than to make adjustments
for the differences in plan characteristics.
Figure 2. Yield Curve Showing that the Probability of Identified Members Experiencing Costs ≥$2000 Declines as the Number of Identified Members Increases
[Figure not reproduced. x-axis: Patients Targeted, % (0 to 100); y-axis: Targeted Patients with Cost ≥$2000 in 1999, %]
Our analyses shed new light on regression to the mean. We found
that very few patients are consistently high-cost members (data not
shown). Of those members who incurred catastrophic costs in 1999
(≥$25 000), 39% were in the 1998 low-cost category and 43% came
from the previous year’s medium-cost segment. Only 18% of the 1998
high-cost category members were “repeat” high-cost consumers in 1999.
In 2000, a very small percentage of the high-cost patients accounted
for ~20% of total expenditures; our data thus suggest
that expenditure levels in the base year are not a
good predictor of high costs in the subsequent year.
The transient nature of patients in MCOs is well
known in the industry, with turnover rates of 20% to
25% per year. Our prediction models required 2
years’ data for model construction and validation.
Disenrollment of >30 000 patients occurred in 2000.
Additional work is perhaps needed to study the char-
acteristics and utilization patterns of patients who
enroll and disenroll. Patients who are identified as
high-risk patients in the first year but then disenroll
before an intervention can be undertaken and meas-
ured will continue to complicate the evaluation of
population risk management programs.
In our analytical approach, it was not practical to
make any adjustments to reflect possible increases
in provider reimbursement rates, which were mod-
est in 1998-2000. We doubt that price adjustments,
which would require considerable effort, would
change the basic results of our research. One alter-
native worth consideration for future modeling is to
study expenditure patterns in the base and subse-
quent years by quintile or quartile.
The small price increases may have caused a few
patients to shift from low cost (<$2000) to high cost
Table 2. Patients With Claims in Both 1998 and 1999 and 1998 Costs <$2000

        Cumulative      Probability         Cumulative    Cumulative Claims   Average 1999 Cost
Rank    Patients, No.   ≥$2000 in 1999, %   Patients, %   Costs in 1999, $    per Patient, $
40      1054            51.0                0.5           3 899 475           7248
38      1713            50.2                0.8           5 963 129           6934
36      2668            48.7                1.3           9 041 030           6965
34      4019            47.8                1.9           13 106 399          6823
32      5741            45.6                2.7           17 510 712          6694
30      7952            43.7                3.8           23 255 465          6686
28      10 821          41.5                5.2           29 956 931          6678
26      14 487          39.0                6.9           37 678 296          6669
24      18 899          37.0                9.0           45 867 938          6558
22      24 125          35.3                11.5          55 632 741          6539
20      30 194          33.1                14.4          65 010 923          6507
18      37 474          31.2                17.9          75 654 386          6476
16      46 223          29.1                22.1          87 219 404          6474
14      56 369          27.2                27.0          98 565 035          6426
12      67 260          25.5                32.2          110 030 854         6405
10      79 390          24.0                38.0          121 234 656         6371
8       94 913          22.0                45.4          132 390 434         6328
6       114 680         20.1                54.9          144 371 078         6276
4       139 547         18.0                66.7          156 633 721         6247
2       172 071         15.9                82.3          169 449 327         6184
0       209 069         14.2                100.0         183 062 265         6168

Table 3. Characteristics of High-Risk Patients With 1999 Costs <$2000

                                       Sex, %            Average   Average Comorbid   Disease Prevalence, %
                                       Male    Female    Age, y    Diseases, No.      Diabetes   Congestive Heart Failure
High risk, targeted, no intervention
(n = 1107)                             58.9    41.1      58.7      2.9                59.3       2.0
All members                            44.5    55.5      36.4      0.5                3.4        0.2

(≥$2000). Using different cutoff values ($2500,
$3000, etc) did not materially affect our results.
The prediction model is based on historical claims
data and was used to score each member’s risk for
incurring high medical expenses in the second year.
The prediction model successfully identified patients
with low medical expenses in 1998 who were 3.6
times as likely to incur high medical expenses in
1999 as the overall low-cost population. The predic-
tion model was tested prospectively on 1107 patients
who received no intervention and were identified as
likely to incur high medical expenses in 2000. The
1107 patients were more likely to incur costs ≥$2000
(39.7% vs 12.2% for the nontargeted group). Their
average costs were $6602 vs $1108.
The prediction model is only the first step in
developing cost-effective intervention programs.
Much hard work remains:
Creating, evaluating, and implementing new
Adopting objective criteria to evaluate interventions, which may eventually involve measuring clinical outcomes and cost savings;
Testing and adopting enhancements to the model, including increasing the horizon beyond a single year to identify members at risk for longer-term
Forecasting the financial impact of interventions by including cost estimates of interventions and estimated impact on medical expenses;
Collecting data prospectively to assess the cost-effectiveness of new and existing interventions;
Providing tools for making better resource allocation, staffing, and intervention decisions;
Finding ways to better identify and engage members who are not compliant but whose behavior may be changed by an intervention; and
Developing tools to incorporate predictions into pricing and underwriting strategies.
Population risk management depends on the
development of accurate prediction models in which
patients are selected for intervention according to
their predicted risk. The second component of pop-
ulation health management is to devise new inter-
ventions, that is, programs that change healthcare
delivery and hopefully improve patient outcomes.
A third and final step, in the absence of randomiza-
tion, is to use the prediction model to adjust
patients’ outcomes so that actual-to-expected out-
comes can be compared.
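One way to picture the actual-to-expected comparison is sketched below in Python. This is an illustration under stated assumptions, not the authors' procedure: it assumes each member has a calibrated predicted probability of becoming high cost (the model in the appendix is a relative ranking, so its scores would first need calibration), and the function name and sample values are hypothetical.

```python
from typing import Sequence


def actual_to_expected(predicted_probs: Sequence[float],
                       high_cost_flags: Sequence[bool]) -> float:
    """Ratio of observed high-cost members to the number the model expected."""
    if len(predicted_probs) != len(high_cost_flags):
        raise ValueError("inputs must have the same length")
    # Expected count of high-cost members is the sum of predicted probabilities.
    expected = sum(predicted_probs)
    if expected <= 0:
        raise ValueError("expected count must be positive")
    actual = sum(1 for flag in high_cost_flags if flag)
    return actual / expected


# Hypothetical intervention group of five members (made-up values).
probs = [0.5, 0.4, 0.6, 0.3, 0.2]            # expected high-cost members: 2.0
flags = [True, False, False, False, False]    # observed high-cost members: 1
ratio = actual_to_expected(probs, flags)
print(f"Actual-to-expected ratio: {ratio:.2f}")
```

A ratio below 1 would suggest fewer high-cost members than the model predicted for that group, and a ratio above 1 would suggest more.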
1. Cumming RB, Knutson D, Cameron BA, Derrick BA. Compara-
tive Analysis of Claims-based Methods of Health Risk Assessment
for Commercial Populations. Schaumburg, Ill: Society of Actuaries;
2. Meenan RT, O'Keeffe-Rosetti C, Hornbrook MC, et al. The
sensitivity and specificity of forecasting high-cost users of medical
care. Med Care. 1999;37:815-823.
3. LoBianco MS, Mills ME, Moore HW. A model for case manage-
ment of high cost Medicaid users. Nurs Econ. 1996;14:303-
307, 314.
4. Forman SA. Breakthroughs in High Risk Population Health
Management. San Francisco, Calif: Jossey-Bass Publishers; 2000.
5. Forman SA, Kelliher M, Wood G. Clinical improvement with
bottom-line impact: custom care planning for patients with acute
and chronic illnesses in a managed care setting. Am J Manag Care.
6. Lynch JP, Forman SA, Graff S, Gunby MC. High-risk popula-
tion health management: achieving improved patient outcomes
and near-term financial results. Am J Manag Care. 2000;6:
7. Ellis RP, Pope GC, Iezzoni L, et al. Diagnosis-based risk adjust-
ment for Medicare capitation payments. Health Care Financ Rev.
8. Kronick R, Dreyfus T, Lee L, Zhou Z. Diagnostic risk adjust-
ment for Medicaid: the disability payment system. Health Care
Financ Rev. 1996;17:7-33.
9. Pope GC, Ellis RP, Ash AS, et al. Principal inpatient diagnostic
cost group model for Medicare risk adjustment. Health Care
Financ Rev. 2000;21:93-118.
10. Weiner JP, Tucker AM, Collins AM, et al. The development of
a risk-adjusted capitation payment system: the Maryland Medicaid
model. J Ambulatory Care Manage. 1998;21:29-52.
Table 4. Year 2000 Outcomes of Patients With 1999 Cost <$2000
Measure | 1999 | All Members | Members With Claims ≥$2000
High risk, targeted with no intervention (n = 1107)
Average cost, $ | 1278 | 3176 | 6602
Costs >$2000, % | 0 | 39.7 | 100.0
Low risk, not targeted (n = 171 107)
Average cost, $ | 432 | 1108 | 6033
Costs >$2000, % | 0 | 12.2 | 100
Targeted Care for Low-Cost Members
1. Specifics of the Prediction Model
All variables are transformed to be used in the model. Because this is a relative risk stratification, rather than
an absolute determination of cost, there is no need for an intercept variable.
Independent Variable Coefficient
Diabetes (drug and diagnosis based) 0.069
Cardiac diagnosis (drug and diagnosis based) 0.039
Respiratory disease (drug and diagnosis based) 0.0039
Psychiatric diagnosis (drug and diagnosis based) 0.025
Physician visit variable 0.0029
Nonhospital, non–emergency department, nonphysician medical claim variable 0.012
Composite prescription drug variable: measure of prescription drug classes 0.013
Comorbidities (truncated at 4) 0.0076
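The coefficient table can be read as a linear score with no intercept that is used only to rank members. The Python sketch below illustrates that reading. The coefficients are copied from the table, but the variable names, the sample members, and the treatment of inputs are assumptions: the text does not specify how variables are transformed, so the inputs here are treated as already transformed, and "truncated at 4" is interpreted as capping the comorbidity count at 4.

```python
# Coefficients from the appendix table; the dictionary keys are illustrative names.
COEFFICIENTS = {
    "diabetes": 0.069,
    "cardiac": 0.039,
    "respiratory": 0.0039,
    "psychiatric": 0.025,
    "physician_visits": 0.0029,
    "other_medical_claims": 0.012,
    "drug_classes": 0.013,
    "comorbidities": 0.0076,
}


def risk_score(member: dict) -> float:
    """Linear score without an intercept; inputs are assumed already transformed."""
    values = dict(member)
    # Assumed reading of "truncated at 4": cap the comorbidity count at 4.
    values["comorbidities"] = min(values.get("comorbidities", 0), 4)
    return sum(coef * values.get(name, 0) for name, coef in COEFFICIENTS.items())


# Hypothetical members (made-up values, for illustration only).
members = {
    "A": {"diabetes": 1, "cardiac": 1, "physician_visits": 6,
          "drug_classes": 5, "comorbidities": 6},
    "B": {"respiratory": 1, "physician_visits": 2,
          "drug_classes": 1, "comorbidities": 1},
}

# Rank members from highest to lowest score; only the ordering matters.
ranked = sorted(members, key=lambda m: risk_score(members[m]), reverse=True)
print(ranked)
```

Because only the ordering is used, adding a constant intercept to every score would not change the ranking, which is consistent with the note that no intercept is needed.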
2. Models were evaluated using receiver operating
characteristic (ROC) curves. The ROC curve
shown below for 1998-1999 continuous enrollment
has an area of 0.73.
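For readers unfamiliar with the area under an ROC curve, the sketch below computes it in Python as the probability that a randomly chosen high-cost member receives a higher score than a randomly chosen low-cost member, with ties counted as one half. The function name and the sample scores are hypothetical, and the pairwise approach is chosen for clarity rather than speed; it is not taken from the study.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical scores and outcomes (True = became high cost the next year).
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [True, False, True, False, False]
auc = roc_auc(scores, labels)
print(f"AUC = {auc:.3f}")
```

An area of 0.5 corresponds to ranking no better than chance, and 1.0 to perfect separation of the two groups.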
Figure. ROC curve for 1998-1999 continuous enrollment (x-axis: 1-Specificity, %; y-axis: Sensitivity, %).
3. Sample Demographic Statistics
Eligibility, 1998 Cost <$2000
Age | Female | Male | All
Unknown | 10 968 | 10 622 | 21 590
<15 | 31 942 | 33 690 | 65 632
15-24 | 19 224 | 14 659 | 33 883
25-34 | 23 635 | 15 144 | 38 779
35-44 | 25 902 | 18 113 | 44 015
45-54 | 21 729 | 15 952 | 37 681
55-64 | 13 633 | 10 565 | 24 198
65-74 | 13 016 | 10 709 | 23 725
75-84 | 6173 | 4337 | 10 510
85+ | 1709 | 808 | 2517
All Ages | 167 931 | 134 599 | 302 530
Sex unknown: 1647
All: 304 177
Disease Classes: 1998 Cost <$2000
Claimants | Heart Failure | Cardiac Condition | Asthma | Diabetes | Respiratory Condition
304 177 | 756 | 29 812 | 27 910 | 10 412 | 68 579
100.0% | 0.2% | 9.8% | 9.2% | 3.4% | 22.5%
4. Cost Distribution (1998)
All 1998
Cost Group, $ | Claimants | Cumulative Claimants | % Claimants | Cumulative % Claimants
0-999 | 269 489 | 269 489 | 77.3 | 77.3
1000-1999 | 34 688 | 304 177 | 9.9 | 87.2
2000-2999 | 14 450 | 318 627 | 4.1 | 91.4
3000-3999 | 7936 | 326 563 | 2.3 | 93.7
4000-4999 | 4931 | 331 494 | 1.4 | 95.1
5000-9999 | 9639 | 341 133 | 2.8 | 97.8
10 000-19 999 | 4634 | 345 767 | 1.3 | 99.2
20 000-24 999 | 904 | 346 671 | 0.3 | 99.4
25 000+ | 2009 | 348 680 | 0.6 | 100.0