Meaningful Variation in Performance
A Systematic Literature Review
Vicki Fung, PhD,* Julie A. Schmittdiel, PhD,* Bruce Fireman, MA,* Aabed Meer, BA,*
Sean Thomas, MD,† Nancy Smider, PhD,† John Hsu, MD, MBA, MSCE,*
and Joseph V. Selby, MD, MPH*
Background: Recommendations for directing quality improvement initiatives at particular levels (eg, patients, physicians, provider groups) have been made on the basis of empirical components of variance analyses of performance.
Objective: To review the literature on use of multilevel analyses of variability in quality.
Research Design: Systematic literature review of English-language articles (n = 39) examining variability and reliability of performance measures in Medline using PubMed (1949–November 25, 2008).
Results: Variation was most commonly assessed at facility (eg, hospital, medical center) (n = 19) and physician (n = 18) levels; most articles reported variability as the proportion of total variation attributable to given levels (n = 22). Proportions of variability explained by aggregated levels were generally low (eg, ≤19% for physicians), and numerous authors concluded that the proportion of variability at a specific level did not justify targeting quality interventions to that level. Few articles based their recommendations on absolute differences among physicians, hospitals, or other levels. Seven of 12 articles that assessed reliability found that reliability was poor at the physician or hospital level due to low proportional variability and small sample sizes per unit, and cautioned that public reporting or incentives based on these measures may be inappropriate.
Conclusions: The proportion of variability at levels higher than patients is often found to be “low.” Although low proportional variability may lead to poor measurement reliability, a number of authors further suggested that it also indicates a lack of potential for quality improvement. Few studies provided additional information to help determine whether variation was, nevertheless, clinically meaningful.
Key Words: quality improvement, quality measurement, performance measurement, physician profiling, systematic reviews
(Med Care 2010;48:140–148)
Performance measurement is an important component of national and local efforts to improve quality of care, reduce unwanted practice variation, and increase accountability at levels ranging from the individual provider to the geographic delivery area. Profiles generated by these measurement efforts are used in a number of ways, including public report cards that rank entities by level of performance or quality, as well as pay-for-performance initiatives that link quality ratings with financial incentives.
Since the work of Wennberg et al examining small area variation, there has been a longstanding assumption that a high degree of variability in performance suggests a potential for quality improvement.1,2 This principle has led numerous authors to examine variability at different “levels” of health care delivery, such as physicians, provider groups, hospitals, and health plans. Components of variance are analyzed, often using hierarchical models, to apportion the total observed variation in performance measures in a patient population to 1 or more aggregated levels. The findings of these analyses have then been used to make inferences about the appropriate level for profiling and intervention efforts.3–12
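The components-of-variance logic described above can be sketched with a small, purely illustrative simulation (none of the numbers come from the reviewed studies): patients are nested within providers, and a one-way method-of-moments decomposition recovers the share of total variation attributable to the provider level (the ICC).

```python
import random
import statistics

def icc_one_way(groups):
    """Method-of-moments ICC from a balanced one-way layout:
    between-provider vs. within-provider variance components."""
    k = len(groups)               # number of providers
    n = len(groups[0])            # patients per provider
    group_means = [statistics.mean(g) for g in groups]
    grand_mean = statistics.mean(group_means)
    # Mean squares from a one-way ANOVA
    ms_between = n * sum((m - grand_mean) ** 2 for m in group_means) / (k - 1)
    ms_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    ) / (k * (n - 1))
    var_between = max((ms_between - ms_within) / n, 0.0)
    var_within = ms_within
    return var_between / (var_between + var_within)

# Simulate 200 providers x 50 patients where provider effects
# contribute 5% of total variance (true ICC = 0.05, an illustrative value)
random.seed(0)
providers = []
for _ in range(200):
    effect = random.gauss(0, 0.05 ** 0.5)   # provider-level effect
    providers.append([effect + random.gauss(0, 0.95 ** 0.5) for _ in range(50)])

print(round(icc_one_way(providers), 3))     # estimate close to 0.05
```

The decomposition generalizes to more than 2 levels (and to unbalanced data) via the hierarchical models the reviewed articles used; this balanced two-level case is the minimal instance.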
A closely related issue in performance measurement is reliability, which declines as the proportion of total variability (intraclass correlation coefficient [ICC]) at a given level decreases. Reliability is also a function of the number of available patients per unit at the level. Whether units have sufficient sample size to produce reliable measures depends upon the level of analysis (eg, physician, provider group, health plan); the prevalence of the condition; and the type of quality measure selected. Reliability reflects the consistency of the measure, the extent to which it is reproducible rather than random, and is particularly important when performance rankings are being used for public reporting or for determining incentive payments.3,13
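One common way to formalize this relation is the Spearman-Brown formula for the reliability of a unit-level mean; the sketch below uses illustrative numbers (an ICC of 0.05, in the range the review reports for physicians) rather than values from any particular study.

```python
def profile_reliability(icc, n):
    """Reliability of a unit's mean score, given the intraclass
    correlation (icc) at that level and n patients per unit
    (the Spearman-Brown relation)."""
    return n * icc / (1 + (n - 1) * icc)

# With the low unit-level ICCs typical of performance measures,
# small patient panels yield unreliable profiles:
for n in (10, 50, 200):
    print(n, round(profile_reliability(0.05, n), 2))
```

Note the two levers the text identifies: reliability rises with the ICC at the level being profiled and with the number of patients per unit, so a low ICC can be offset only by a large per-unit sample.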
If units cannot be reliably ranked, quality improvement efforts based on such measures may unfairly punish or reward individuals or groups; in addition, unreliable measures may mislead patients who use this information in a predictive manner to make health care decisions.
From the *Division of Research, Kaiser Permanente Medical Care Program, Oakland, CA; and †Epic Systems Corporation, Verona, WI.
Supported by the Council of Accountable Physician Practices (CAPP) and Office of Research in Women’s Health Building Interdisciplinary Careers in Women’s Health K12 Career Development Award (K12HD052163).
Reprints: Vicki Fung, PhD, 2000 Broadway, 3rd Floor, Oakland, CA.
Supplemental digital content is available for this article. Direct URL citations appear in the printed text and are provided in the HTML and PDF versions of this article on the journal’s Web site (www.lww-medicalcare.com).
Copyright © 2010 by Lippincott Williams & Wilkins
Medical Care • Volume 48, Number 2, February 2010
140 | www.lww-medicalcare.com
We examined studies that explicitly assessed variability and/or reliability of performance measures across 2 or more nested levels (eg, the individual patient and the provider). We specifically assessed whether authors linked recommendations for reporting or providing incentives for quality at specific levels to the proportion of variability observed at that level and, if so, what kind of criteria were given for an acceptable amount of variability and/or reliability to justify performance reporting or incentives.
Data Sources and Search Strategy
Our systematic literature search employed a 2-step strategy: a database search for relevant peer-reviewed articles in Medline and a subsequent hand-search of reference lists from articles identified in the first step. We searched English-language articles in Medline using PubMed from inception (1949) through November 25, 2008 using title and abstract keyword searches and Medical Subject Heading Terms. The PubMed keyword searches relied on primary terms (eg, quality measure*, perform* measure*, and profile or profiling) alone and in combination with additional parameters (eg, varia* [with “*” indicating a wild card], multilevel or multi-level). We also searched on Medical Subject Heading Terms that were commonly used to index articles we initially identified as representing the core theme of our literature review3,14: Quality Assurance, Quality Indicators, Physician’s Practice Patterns, and Outcome and Process Assessment (complete list of search terms available in the Appendix, Supplemental Digital Content 1, available online at: http://links.lww.com/MLR/A58).
The selection process involved 3 review stages: title, abstract, and article (Fig. 1). Two researchers independently conducted title and abstract reviews. Articles were selected if they appeared to address issues relating to the statistical modeling of quality measurement or measuring variation or reliability of quality measures at 1 or more aggregated levels (eg, physician, hospital, geographic). Because the related issue of case-mix adjustment has received thorough treatment in the literature,15–17 it was not a primary focus of this review.
Content Abstraction and Synthesis
After excluding articles that did not assess either components of variation or the precision/reliability of health care quality measures when grouped at 1 or more levels above the individual person, we abstracted the following information from the remaining articles selected for inclusion: study design, study time period, setting/population, data source(s), levels of analysis, sample size by level, outcome variables, modeling approach, case-mix adjustors, methods for assessing components of variance, methods for assessing reliability, the proportion or amount of variance found at each level, the degree of reliability reported, and the authors’ interpretations of their findings (eg, whether they considered the amount of variability to be “high” or “low”).
To further examine important issues related to reliability, we conducted sample size calculations and simulations to
FIGURE 1. Article Selection. This figure presents the article selection process. The total citations retrieved includes duplicate titles that were identified both through initial identification and the systematic Medline search using PubMed. The number of articles included in abstract review and article abstraction represent unique titles. Initially identified articles were those identified nonsystematically before the initiation of the systematic review.
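The sample size calculations mentioned above are not reproduced in this excerpt, but their general form can be sketched by inverting the Spearman-Brown relation to find the minimum number of patients per unit needed to reach a target reliability; the ICC and target values below are illustrative assumptions, not figures from the article.

```python
import math

def patients_needed(icc, target_reliability):
    """Minimum patients per unit so that the unit-level mean
    reaches the target reliability (inverted Spearman-Brown:
    n = R(1 - ICC) / ((1 - R) * ICC))."""
    r = target_reliability
    n = r * (1 - icc) / ((1 - r) * icc)
    return math.ceil(n - 1e-9)  # epsilon guards against float round-up

# At an illustrative physician-level ICC of 0.05, a reliability of 0.8
# requires 76 patients per physician; at ICC = 0.5, only 4:
print(patients_needed(0.05, 0.8))  # 76
print(patients_needed(0.5, 0.8))   # 4
```

This makes concrete why the reviewed articles link low proportional variability to unreliable profiles: required panel sizes grow rapidly as the ICC falls, and for rare conditions few units will have that many eligible patients.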
9. Davis P, Gribben B, Lay-Yee R, et al. How much variation in clinical activity is there between general practitioners? A multi-level analysis of decision-making in primary care. J Health Serv Res Policy. 2002;7:202–
10. Cowen ME, Strawderman RL. Quantifying the physician contribution to managed care pharmacy expenses: a random effects approach. Med
11. Baker LC, Hopkins D, Dixon R, et al. Do health plans influence quality of care? Int J Qual Health Care. 2004;16:19–30.
12. Young GJ. Can multi-level research help us design pay-for-performance programs? Med Care. 2008;46:109–111.
13. Fuhlbrigge A, Carey VJ, Finkelstein JA, et al. Are performance measures based on automated medical records valid for physician/practice profiling of asthma care? Med Care. 2008;46:620–626.
14. Greenfield S, Kaplan SH, Kahn R, et al. Profiling care provided by different groups of physicians: effects of patient case-mix (bias) and physician-level clustering on quality assessment results. Ann Intern Med.
15. DeLong ER, Peterson ED, DeLong DM, et al. Comparing risk-adjustment methods for provider profiling. Stat Med. 1997;16:2645–2664.
16. Salem-Schatz S, Moore G, Rucker M, et al. The case for case-mix adjustment in practice profiling. When good apples look bad. JAMA.
17. Landon B, Iezzoni LI, Ash AS, et al. Judging hospitals by severity-adjusted mortality rates: the case of CABG surgery. Inquiry. 1996;33:
18. Snijders TAB, Bosker RJ. Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling. Sage Publications Inc; 1999.
19. Aiello A, Garman A, Morris SB. Patient satisfaction with nursing care: a multilevel analysis. Qual Manag Health Care. 2003;12:187–190.
20. Bjertnaes OA, Garratt A, Ruud T. Family physicians’ experiences with community mental health centers: a multilevel analysis. Psychiatr Serv.
21. Bjorngaard JH, Ruud T, Garratt A, et al. Patients’ experiences and clinicians’ ratings of the quality of outpatient teams in psychiatric care units in Norway. Psychiatr Serv. 2007;58:1102–1107.
22. Degenholtz HB, Kane RA, Kane RL, et al. Predicting nursing facility residents’ quality of life using external indicators. Health Serv Res.
23. D’Errigo P, Tosti ME, Fusco D, et al. Use of hierarchical models to evaluate performance of cardiac surgery centers in the Italian CABG outcome study. BMC Med Res Methodol. 2007;7:29.
24. Dijkstra RF, Braspenning JC, Huijsmans Z, et al. Patients and nurses determine variation in adherence to guidelines at Dutch hospitals more than internists or settings. Diabet Med. 2004;21:586–591.
25. Gifford E, Foster EM. Provider-level effects on psychiatric inpatient length of stay for youth with mental health and substance abuse disorders. Med Care. 2008;46:240–246.
26. Harman JS, Cuffel BJ, Kelleher KJ. Profiling hospitals for length of stay for treatment of psychiatric disorders. J Behav Health Serv Res. 2004;
27. Hawley ST, Hofer TP, Janz NK, et al. Correlates of between-surgeon variation in breast cancer treatments. Med Care. 2006;44:609–616.
28. Normand SL, Wolf RE, Ayanian JZ, et al. Assessing the accuracy of hospital clinical performance measures. Med Decis Making. 2007;27:
29. Normand SL, Glickman ME, Gatsonis CA. Statistical methods for profiling providers of medical care: issues and applications. J Am Stat
30. O’Brien SM, Shahian DM, DeLong ER, et al. Quality measurement in adult cardiac surgery: part 2–statistical considerations in composite measure scoring and provider rating. Ann Thorac Surg. 2007;83(suppl
31. O’Connor PJ, Rush WA, Davidson G, et al. Variation in quality of diabetes care at the levels of patient, physician, and clinic. Prev Chronic
32. Phillips CD, Shen R, Chen M, et al. Evaluating nursing home performance indicators: an illustration exploring the impact of facilities on ADL change. Gerontologist. 2007;47:683–689.
33. Phillips CD, Chen M, Sherman M. To what degree does provider performance affect a quality indicator? The case of nursing homes and ADL change. Gerontologist. 2008;48:330–337.
34. Sa Carvalho M, Henderson R, Shimakura S, et al. Survival of hemodialysis patients: modeling differences in risk of dialysis centers. Int J Qual Health Care. 2003;15:189–196.
35. Safran DG, Karp M, Coltin K, et al. Measuring patients’ experiences with individual primary care physicians. Results of a statewide demonstration project. J Gen Intern Med. 2006;21:13–21.
36. Sjetne IS, Veenstra M, Stavem K. The effect of hospital size and teaching status on patient experiences with hospital care: a multilevel analysis. Med Care. 2007;45:252–258.
37. Solomon LS, Zaslavsky AM, Landon BE, et al. Variation in patient-reported quality among health care organizations. Health Care Financ
38. Sullivan CO, Omar RZ, Ambler G, et al. Case-mix and variation in specialist referrals in general practice. Br J Gen Pract. 2005;55:529–
39. Swinkels IC, Wimmers RH, Groenewegen PP, et al. What factors explain the number of physical therapy treatment sessions in patients referred with low back pain; a multilevel analysis. BMC Health Serv
40. Tan A, Freeman JL, Freeman DH Jr. Evaluating health care performance: strengths and limitations of multilevel analysis. Biom J. 2007;
41. Thomas N, Longford NT, Rolph JE. Empirical Bayes methods for estimating hospital-specific mortality rates. Stat Med. 1994;13:889–
42. Tuerk PW, Mueller M, Egede LE. Estimating physician effects on glycemic control in the treatment of diabetes: methods, effects sizes, and implications for treatment policy. Diabetes Care. 2008;31:869–873.
43. Turenne MN, Hirth RA, Pan Q, et al. Using knowledge of multiple levels of variation in care to target performance incentives to providers. Med Care. 2008;46:120–126.
44. Veenstra M, Hofoss D. Patient experiences with information in a hospital setting: a multilevel approach. Med Care. 2003;41:490–499.
45. Zaslavsky AM, Zaborski LB, Cleary PD. Plan, geographical, and temporal variation of consumer assessments of ambulatory health care. Health Serv Res. 2004;39:1467–1485.
46. Hofer TP, Hayward RA. Can early re-admission rates accurately detect poor-quality hospitals? Med Care. 1995;33:234–245.
47. Epstein A. Performance reports on quality–prototypes, problems, and prospects. N Engl J Med. 1995;333:57–61.
48. Landon BE, Normand SL, Blumenthal D, et al. Physician clinical performance assessment: prospects and barriers. JAMA. 2003;290:1183–
49. Kassirer JP. The use and abuse of practice profiles. N Engl J Med.
50. Bindman AB. Can physician profiles be trusted? JAMA. 1999;281:2142–