Ranking USRDS provider specific SMRs from 1998-2001

Rongheng Lin

Department of Public Health, University of Massachusetts Amherst, Rm 411 Arnold House, 715 N.

Pleasant Rd., Amherst, MA 01003, USA

Thomas A. Louis

Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD

21205, USA e-mail: tlouis@jhsph.edu

Susan M. Paddock

RAND Corporation, Santa Monica, CA 90407, USA e-mail: paddock@rand.org

Greg Ridgeway

e-mail: gregr@rand.org

Abstract

Provider profiling (ranking/percentiling) is prevalent in health services research. Bayesian models

coupled with optimizing a loss function provide an effective framework for computing non-standard

inferences such as ranks. Inferences depend on the posterior distribution and should be guided by

inferential goals. However, even optimal methods might not lead to definitive results and ranks should

be accompanied by valid uncertainty assessments. We outline the Bayesian approach and use

estimated Standardized Mortality Ratios (SMRs) in 1998-2001 from the United States Renal Data

System (USRDS) as a platform to identify issues and demonstrate approaches. Our analyses extend

Liu et al. (2004) by computing estimates developed by Lin et al. (2006) that minimize errors in

classifying providers above or below a percentile cut-point, by combining evidence over multiple

years via a first-order, autoregressive model on log(SMR), and by use of a nonparametric prior.

Results show that ranks/percentiles based on maximum likelihood estimates of the SMRs and those

based on testing whether an SMR = 1 substantially under-perform the optimal estimates. Combining

evidence over the four years using the autoregressive model reduces uncertainty, improving

performance over percentiles based on only one year. Furthermore, percentiles based on posterior

probabilities of exceeding a properly chosen SMR threshold are essentially identical to those

produced by minimizing classification loss. Uncertainty measures effectively calibrate performance,

showing that considerable uncertainty remains even when using optimal methods. Findings highlight

the importance of using loss function guided percentiles and the necessity of accompanying estimates

with uncertainty assessments.

Keywords

Provider profiling; Ranks/percentiles; Bayesian hierarchical model; Uncertainty assessment

Correspondence to: Rongheng Lin.

e-mail: rlin@schoolph.umass.edu.

NIH Public Access

Author Manuscript

Health Serv Outcomes Res Methodol. Author manuscript; available in PMC 2009 April 1.

Published in final edited form as:

Health Serv Outcomes Res Methodol. 2009 March 1; 9(1): 22–38. doi:10.1007/s10742-008-0040-0.

1 Introduction

Research on and application of performance evaluation are steadily increasing, with applications to

evaluating health service providers (Christiansen and Morris 1997; Goldstein and Spiegelhalter

1996; Landrum et al. 2000; Liu et al. 2004; McClellan and Staiger 1999; Grigg et al. 2003;

Zhang et al. 2006; Normand and Shahian 2007; Ohlssen et al. 2007), prioritizing environmental

assessments in small areas (Conlon and Louis 1999; Louis and Shen 1999; Shen and Louis

2000) and ranking teachers and schools (Lockwood et al. 2002). Inferential goals of these

studies include evaluating the population performance, such as the average performance of all

health providers and comparing performance among providers. Performance evaluations

include comparing unit-specific, substantive measures such as death rates, identifying the

group of poorest or best performing units and overall ranking of the units, e.g., profiling or

league tables (Goldstein and Spiegelhalter 1996).

The Standardized Mortality Ratio (SMR), the ratio of observed to expected deaths, is an

important service quality indicator (Zaslavsky 2001). The United States Renal Data System

(USRDS) produces annual estimated SMRs for several thousand dialysis centers and uses these

as a quality screen (Lacson et al. 2001; ESRD 2000; USRDS 2005). Invalid estimation or

inappropriate interpretation can have serious consequences for these dialysis centers and for

their patients. We present an analysis of the information from the United States Renal Data

System (USRDS) for 1998-2001 as a platform for demonstrating and comparing approaches

to ranking health service providers. From the USRDS we obtained observed and expected

deaths for the K = 3173 dialysis centers that contributed information for all four years. The

approach used by USRDS to produce these values can be found in USRDS (2005).

Though estimating SMRs is a standard statistical operation (produce provider-specific

expected deaths based on a statistical model, and then compute the “observed/expected” ratio),

it is important and challenging to deal with complications such as the need to specify a reference

population (providers included, the time period covered, attribution of events), the need to

validate the model used to adjust for important patient attributes (age, gender, diabetes, type

of dialysis, severity of disease), and the need to adjust for potential biases induced when

attributing deaths to providers and accounting for informative censoring.

The multi-level data structure and complicated inferential goals require the use of a hierarchical

Bayesian model that accounts for nesting relations and specifies both population values and

random effects. Correctly specified, the model properly accounts for the sample design,

variance components and other uncertainties, producing valid and efficient estimates of

population parameters, variance components and unit-specific random effects (provider-,

clinician-, or region-specific latent attributes), all accompanied by valid uncertainty

assessments. Importantly, the Bayesian approach provides the necessary structure for

developing scientific and policy-relevant inferences based on the joint posterior distribution

of all unknowns.

As Shen and Louis (1998) show and Gelman and Price (1999) present in detail, no single set

of estimates or assessments can effectively address multiple goals and we provide a suite of

assessments. Guided by a loss function, the Bayesian approach structures non-standard

inferences such as ranking (including identification of extremely poor and good performers)

and estimating the histogram of unit-specific random effects. For example, as Liu et al.

(2004) show, when estimation uncertainty varies over dialysis centers, ranks produced by Z-

scores that test whether a provider's SMR = 1 tend to identify providers with relatively low

variance as extreme because these tests have the highest power; ranks produced from the

provider-specific maximum likelihood estimates (MLEs) are more likely to identify dialysis

centers with relatively high variance as extreme. Effective ranks depend on striking an effective

tradeoff between signal and noise.

Lin et al. (2006) present estimates that minimize errors in classifying providers above or below

a percentile cut-point. Our analyses build on Liu et al. (2004) by extending the application of

Lin et al. (2006)'s estimates to combine evidence over multiple years via a first-order,

autoregressive model on log(SMR), and by use of a nonparametric prior. For single-year

analyses we compare the results from the log-normal prior to those based on the Non-

Parametric, Maximum Likelihood (NPML) prior (Laird 1978).

In the following, Sect. 2 presents our models; Sect. 3 outlines several ranking methods; Sect. 4

gives uncertainty measures; Sect. 5 presents results and Sect. 6 sums up and identifies

additional research. Computing code for all routines is available at

http://people.umass.edu/rlin/jhuwebhost/usrds-ranking.htm.

2 Statistical models

We employ both single-year and longitudinal models for observed deaths and underlying

parameters, with the former a sub-model of the latter. To this end, let (Ykt, mkt) be the observed

and case-mix adjusted, expected deaths for provider k in year t, k = 1,... 3173, t = 0, 1, 2, 3 and

ρkt be the SMR. The USRDS computes the expecteds under the assumption that all providers

give the same quality of care for patients with identical covariates, see USRDS (2005) for

details. We employ the conditional Poisson model,

$$[Y_{kt} \mid m_{kt}, \rho_{kt}] \sim \mathrm{Poisson}(m_{kt}\,\rho_{kt}) \qquad (1)$$

If the provider has “average performance”, ρkt = 1. For both single-year and multiple-year

analyses we model θkt = log(ρkt).
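As a minimal illustration of the conditional Poisson model, the sketch below computes the direct (maximum likelihood) SMR, observed over expected deaths, and a delta-method standard deviation for its logarithm. The function names and the four example centers are hypothetical and are not taken from the USRDS data; the point is only that sampling variability shrinks as the expected deaths grow.

```python
import numpy as np

def smr_mle(observed, expected):
    """MLE of the SMR under Y ~ Poisson(m * rho): observed / expected deaths."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return observed / expected

def log_smr_approx_sd(expected, rho=1.0):
    """Delta-method SD of log(Y/m) when Y ~ Poisson(m * rho): 1 / sqrt(m * rho)."""
    return 1.0 / np.sqrt(np.asarray(expected, dtype=float) * rho)

# Hypothetical centers with the same true SMR (rho = 1) but very different sizes.
rng = np.random.default_rng(0)
expected = np.array([0.5, 5.0, 50.0, 500.0])
observed = rng.poisson(expected * 1.0)
print(smr_mle(observed, expected))      # direct estimates, noisy for small centers
print(log_smr_approx_sd(expected))      # approximate SD of log(SMR) by center size
```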

2.1 Single-year analyses

For single-year analyses, we assume that for year t, $\theta_{kt} \overset{iid}{\sim} G_t$, k = 1,…, 3173. We use a

year-specific, normal prior (see the note after Eq. 2) and for the single-year analyses also use the

non-parametric maximum likelihood (NPML) prior. See Carlin and Louis (2008) and

Paddock et al. (2006) for additional details and Appendix C for the estimation algorithm.

2.2 The longitudinal, AR(1) model

To model longitudinal correlation among (ρk0, ρk1, ρk2, ρk3), let ϕ = cor(θk,t, θk(t+1)), with -1 <

ϕ < 1. Then, use a normal prior on the θkt and a normal prior on Z(ϕ) = 0.5 log {(1 + ϕ)/(1-ϕ)}

in the hierarchical model,

(2)

The notation “iid” means independently and identically distributed and “ind” means

independently distributed. The relation is first-order Markov, because though conditioning is

on all prior θs, only ρk(t-1) appears on the right-hand side of Eq. 2.
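As a sketch only, a hierarchical specification consistent with the description around Eq. 2 (normal priors on the θkt and on Z(ϕ), the hyper-parameters V, α, μ and Vϕ of Sect. 2.3, lag-one correlation ϕ, and a year-specific marginal variance) can be written as follows. The gamma parameterization of the precision and the τt/τt-1 scaling of the autoregressive term are assumptions made here so that the marginal variance in year t is τt² and the lag-one correlation is ϕ; they are not taken from the source.

```latex
\begin{aligned}
\xi_t &\overset{iid}{\sim} N(0, V), \qquad t = 0, 1, 2, 3,\\
\tau_t^{-2} &\overset{iid}{\sim} \mathrm{Gamma}(\alpha, \mu),\\
Z(\phi) &\sim N(0, V_\phi),\\
[\theta_{k0} \mid \xi_0, \tau_0] &\overset{iid}{\sim} N(\xi_0, \tau_0^2),\\
[\theta_{kt} \mid \theta_{k(t-1)}, \ldots, \theta_{k0}, \xi, \tau, \phi]
  &\overset{ind}{\sim} N\!\left(\xi_t + \phi\,\frac{\tau_t}{\tau_{t-1}}\,(\theta_{k(t-1)} - \xi_{t-1}),\; (1-\phi^2)\,\tau_t^2\right),
  \qquad t = 1, 2, 3,\\
[Y_{kt} \mid m_{kt}, \theta_{kt}] &\overset{ind}{\sim} \mathrm{Poisson}\!\left(m_{kt}\, e^{\theta_{kt}}\right).
\end{aligned}
```

Under this sketch, Var(θkt) = ϕ²τt² + (1 - ϕ²)τt² = τt² and cor(θkt, θk(t-1)) = ϕ, which agrees with the marginal statement in the text.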

Marginally, for year t, $\theta_{kt} \overset{iid}{\sim} N(\xi_t, \tau_t^2)$ and setting ϕ = 0 produces four, single-year analyses,

each using the Liu et al. (2004) model with no borrowing of information over time. For ϕ > 0,

we have a standard AR(1) model on the latent log(SMR)s and the posterior distribution

combines evidence across dialysis centers within year and within dialysis center across years.

2.3 Posterior sampling implementation and hyper-prior parameters

We implement a Gibbs sampler for model (2) with WinBUGS via the R package

R2WinBUGS, using the coda package to diagnose convergence (Spiegelhalter et al. 1999;

Gelman et al. 2006; Plummer et al. 2006). We use V = 10, μ = 0.01, α = 0.05, values that

stabilize the simulation while allowing sufficient adaptation to the data. With V = 10, the a

priori, 95% probability interval for ξt is (-6.20, 6.20) [(0.002, 492.75) in the SMR scale]; the

values for α and μ produce a distribution for τ2 with center near 100, inducing large, a priori

variation for the θkt. For the AR(1) model, reported results are based on Vϕ = 0.2. This

produces an a priori 95% probability interval for ϕ of (-0.70, 0.70). In a sensitivity analysis,

we also tried Vϕ = 2, which produced the a priori interval (-0.99, 0.99) and yielded results

virtually identical to those based on the Vϕ = 0.2 hyper-prior. In both cases, the data likelihood

dominated the priors. This can also be seen in the shrinkage of τ2 towards zero, as reported in

Sect. 5.4. No strong posterior correlation is observed between ϕ and the τ2s.
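The a priori intervals quoted above can be checked with a few lines of arithmetic. The sketch below assumes, as the text states, a N(0, V) prior on ξt and a N(0, Vϕ) prior on Fisher's Z(ϕ) = 0.5 log{(1 + ϕ)/(1 - ϕ)}, whose inverse is the hyperbolic tangent; the helper names are hypothetical.

```python
import math

Z975 = 1.959964  # 97.5th percentile of the standard normal distribution

def normal_interval(variance, center=0.0):
    """Central 95% interval of a normal distribution with the given variance."""
    half = Z975 * math.sqrt(variance)
    return center - half, center + half

def phi_interval(v_phi):
    """95% interval for phi when Z(phi) = atanh(phi) ~ N(0, v_phi)."""
    lo, hi = normal_interval(v_phi)
    return math.tanh(lo), math.tanh(hi)

xi_lo, xi_hi = normal_interval(10.0)                 # V = 10
print(round(xi_lo, 2), round(xi_hi, 2))              # interval for xi_t on the log scale
print(math.exp(xi_lo), math.exp(xi_hi))              # the same interval on the SMR scale
print([round(x, 2) for x in phi_interval(0.2)])      # V_phi = 0.2
print([round(x, 2) for x in phi_interval(2.0)])      # V_phi = 2
```

The SMR-scale endpoints reported in the text, (0.002, 492.75), correspond to exponentiating the rounded value 6.20; the unrounded endpoint gives a slightly smaller upper limit.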

3 Loss function based ranking methods

Two general strategies for ranking are available. The preferred strategic approach first

computes the joint posterior distribution of the ranks and then uses it to produce estimates and

uncertainty assessments, generally guided by a loss function that is appropriate for analytic

goals. This approach ensures that estimated ranks have desired properties such as not depending

on a monotone transform of the target parameters. The other approach is based on ordering

estimates of target parameters (MLEs or posterior means) or on ordering statistics testing the

null hypothesis that SMRk ≡ 1. If the posterior distributions of the target parameters are

stochastically ordered, then for a broad class of loss functions (estimation criteria) optimally

estimated ranks will not depend on the strategy. However, Lin et al. (2006) and others have

shown that estimates not derived from the distribution of the ranks can perform very poorly

and may not be invariant under monotone transformation of the target parameters. Producing

the joint posterior distribution of the ranks is computationally intensive, but most estimates

depend only on easily computable features.

We first define ranks and then specify candidate ranking methods. For clarity in defining ranks, we drop the index t and write
$R_k(\rho) = \mathrm{rank}(\rho_k) = \sum_{j=1}^{K} I\{\rho_k \ge \rho_j\}$, with the smallest ρk having rank 1. Rank-based estimates are based on the joint
posterior distribution of the Rk(ρ) and are invariant under monotone transform of the ρk.

3.1 Squared-error loss

Shen and Louis (1998) and Lockwood et al. (2002) study ranks that minimize the posterior
risk induced by squared error loss (SEL): $K^{-1}\sum_{k} (R_k^{est} - R_k(\rho))^2$. It is minimized by the
posterior expected ranks,

$$\bar R_k(Y) = E[R_k(\rho) \mid Y] = \sum_{j=1}^{K} pr(\rho_k \ge \rho_j \mid Y) \qquad (3)$$

where pr(·) stands for probability. The optimal mean squared error (MSE) in estimating the
ranks is equal to the average posterior variance of the ranks. Generally, the $\bar R_k$ are not integers;
for optimal, distinct integer ranks, use $\hat R_k(Y) = \mathrm{rank}(\bar R_k(Y))$.
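Posterior expected ranks are straightforward to approximate from MCMC output: rank the units within each posterior draw, average those ranks over draws, and then re-rank the averages to obtain distinct integers. The sketch below assumes a hypothetical array of posterior draws with one row per draw and one column per provider; the function names are illustrative and do not come from the authors' code.

```python
import numpy as np

def ranks_within_draw(draws):
    """Rank units within each posterior draw (1 = smallest); draws has shape (S, K)."""
    order = np.argsort(draws, axis=1, kind="stable")
    ranks = np.empty_like(order)
    rows = np.arange(draws.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, draws.shape[1] + 1)
    return ranks

def sel_percentiles(draws):
    """Posterior expected ranks, their integer re-ranking, and the implied percentiles."""
    K = draws.shape[1]
    r_bar = ranks_within_draw(draws).mean(axis=0)   # posterior expected ranks
    # Re-rank the expected ranks to get distinct integers (ties broken by index).
    r_hat = np.argsort(np.argsort(r_bar, kind="stable"), kind="stable") + 1
    return r_bar, r_hat, r_hat / (K + 1)

# Hypothetical posterior draws of log(SMR) for four providers.
rng = np.random.default_rng(1)
draws = rng.normal(loc=[-0.3, 0.0, 0.05, 0.4], scale=0.2, size=(2000, 4))
r_bar, r_hat, p_hat = sel_percentiles(draws)
print(r_bar, r_hat, p_hat)
```

Because ranks are computed within each draw, the result is unchanged if the draws are monotonically transformed, for example from log(SMR) to SMR.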

In the notation that follows, generally we drop dependency on ρ (equivalently, on θ) and omit
conditioning on Y. For example, Rk stands for Rk(θ) and $\hat R_k$ stands for $\hat R_k(Y)$. We present
either ranks (Rk) or, equivalently, percentiles [Pk = Rk/(K + 1)] with percentiles providing an
effective normalization. For example, Lockwood et al. (2002) show that MSE for percentiles
rapidly converges to a function of ranking estimator and posterior distributions of parameters
that does not depend on K.

3.2 Optimizing (above γ)/(below γ) classification errors

The USRDS uses percentiles to identify the best and the worst performers. Let γ be the fraction

of top performers among the total that we want to identify, 0 < γ < 1. A loss function designed

to address this inferential goal was proposed by Lin et al. (2006). The loss function (Eq. 4)

penalizes for misclassification and also imposes a distance penalty between estimated

percentiles and the cutoff γ.

(4)

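For ease of presentation, we have assumed that γK is an integer and so γ(K + 1) is not. It is not
necessary to make the distinction between > and ≥. To minimize the posterior risk induced by
(4), let $\pi_k(\gamma) = pr(P_k > \gamma \mid Y)$ and $\tilde R_k(\gamma) = \mathrm{rank}(\pi_k(\gamma))$. The posterior risk
is minimized by:

$$\tilde P_k(\gamma) = \tilde R_k(\gamma)/(K + 1) \qquad (5)$$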

Dominici et al. (1999) use this approach with γ = K/(K + 1), ordering by the probability of a

unit having the largest latent attribute.
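The classification-oriented percentiles can be approximated from posterior draws in two steps: estimate, for each unit, the posterior probability that its true percentile exceeds γ, then rank those probabilities. The sketch below is an illustration under the assumption that a hypothetical array of posterior draws (rows are draws, columns are units) is available; the function name is not from the source.

```python
import numpy as np

def classification_percentiles(draws, gamma):
    """Return pi_k = pr(P_k > gamma | Y) and the percentiles obtained by ranking pi_k."""
    S, K = draws.shape
    # Rank units within each draw (1 = smallest), then convert to percentiles.
    ranks = np.argsort(np.argsort(draws, axis=1), axis=1) + 1
    pi = (ranks / (K + 1) > gamma).mean(axis=0)
    # Rank the exceedance probabilities; ties are broken by unit index.
    r_tilde = np.argsort(np.argsort(pi, kind="stable"), kind="stable") + 1
    return pi, r_tilde / (K + 1)

# Tiny hypothetical example: three draws, four units, gamma = 0.5.
draws = np.array([[0.1, 0.5, 0.3, 0.9],
                  [0.2, 0.9, 0.1, 0.8],
                  [0.0, 0.4, 0.6, 0.7]])
pi, p_tilde = classification_percentiles(draws, gamma=0.5)
print(pi, p_tilde)
```

When γK is an integer, exactly (1 - γ)K units lie above γ in every draw, so the πk sum to (1 - γ)K; this is a useful sanity check on any implementation.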

3.3 Equivalence of the $\tilde P_k(\gamma)$ and ordering posterior exceedance probabilities

Given an SMR threshold t, the ranks/percentiles induced by ordering the posterior probabilities
that an SMR exceeds the threshold, pr(ρk > t|Y), allow us to make a connection between the
$\tilde P_k(\gamma)$ and the substantive scale (in our application, SMR). Normand et al. (1997) rank providers
based on these “exceedance probabilities” and Diggle et al. (2007) use them to identify the
areas with elevated disease rates. Lin et al. (2006) show that exceedance probability based
percentiles are virtually identical to the $\tilde P_k(\gamma)$ by choosing the γth percentile of the average of
posterior cumulative distribution function as the threshold t, i.e., $t = \bar G_Y^{-1}(\gamma)$, where
$\bar G_Y(t) = K^{-1}\sum_{k} pr(\rho_k \le t \mid Y)$. We denote the percentiles based on $pr(\rho_k > \bar G_Y^{-1}(\gamma) \mid Y)$ as $P_k^*(\gamma)$. In
addition to providing a connection to the SMR scale, the $P_k^*(\gamma)$ are far easier to compute than
are the $\tilde P_k(\gamma)$. Note that the $P_k^*(\gamma)$ are invariant under the monotone transform of ρk.
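Exceedance-probability percentiles need only marginal posterior draws. In the sketch below the threshold is taken as the γ-quantile of the pooled draws, which is the empirical counterpart of inverting the average of the unit-specific posterior distribution functions when every unit has the same number of draws. The array layout and function name are assumptions made for illustration.

```python
import numpy as np

def exceedance_percentiles(rho_draws, gamma):
    """Percentiles from ranking pr(rho_k > t | Y), with t the gamma-quantile of pooled draws."""
    S, K = rho_draws.shape
    t = np.quantile(rho_draws, gamma)          # empirical inverse of the averaged posterior CDF
    exceed = (rho_draws > t).mean(axis=0)      # pr(rho_k > t | Y) for each unit
    r_star = np.argsort(np.argsort(exceed, kind="stable"), kind="stable") + 1
    return t, exceed, r_star / (K + 1)

# Hypothetical SMR draws: four draws for each of three units.
rho_draws = np.array([[0.5, 1.0, 2.0],
                      [0.6, 1.2, 1.8],
                      [0.4, 0.9, 2.5],
                      [0.7, 1.1, 2.2]])
t, exceed, p_star = exceedance_percentiles(rho_draws, gamma=0.5)
_, _, p_star_log = exceedance_percentiles(np.log(rho_draws), gamma=0.5)
print(t, exceed, p_star)
```

Running the same function on log(SMR) draws gives the same percentiles, illustrating the invariance under monotone transformation noted in the text.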

4 Performance measures

As for all statistical procedures, estimated ranks/percentiles must be accompanied by

uncertainty statements. A wide variety of univariate and multivariate performance measures are

available and we propose three univariate measures of uncertainty.

4.1 Mean squared error

Using MCMC, the posterior mean squared error of percentiles produced by any method and
95% posterior intervals of dialysis center-specific percentiles can be computed. As a baseline,
if the data are completely uninformative so that the percentiles (1/3174, 2/3174,…,
3173/3174) are randomly assigned to the 3173 dialysis centers, then the expected MSE is
$(K-1)/\{6(K+1)\} \approx 1/6$ on the (0, 1) percentile scale.
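The no-information baseline follows from elementary arithmetic: if the percentiles j/(K + 1) are assigned to units by a random permutation, the expected squared difference between assigned and true percentiles is twice the variance of a percentile, which equals (K - 1)/{6(K + 1)}. The sketch below states this closed form and checks it by simulation; both helper names are hypothetical.

```python
import numpy as np

def random_assignment_mse(K):
    """Expected MSE when the percentiles j/(K+1), j = 1..K, are assigned at random."""
    return (K - 1) / (6 * (K + 1))

def simulated_random_mse(K, n_rep=200, seed=0):
    """Monte Carlo check: average squared error of randomly permuted percentiles."""
    rng = np.random.default_rng(seed)
    p = np.arange(1, K + 1) / (K + 1)
    return float(np.mean([np.mean((rng.permutation(p) - p) ** 2) for _ in range(n_rep)]))

print(random_assignment_mse(3173))     # close to 1/6 for K = 3173
print(simulated_random_mse(200), random_assignment_mse(200))
```

Dividing a method's MSE by this baseline gives a standardized value between 0 (perfect information) and about 1 (no information).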

4.2 Operating characteristic for (above γ)/(below γ) classification

The vector of $P_k^{est}$ from any ranking method can be used to classify units into (above γ)/(below
γ) groups and the posterior classification performance (operating characteristic) can be
computed. Following Lin et al. (2006), and suppressing dependence on Pest,

$$OC(\gamma \mid Y) = ABR(\gamma \mid Y) + BAR(\gamma \mid Y) = BAR(\gamma \mid Y)/\gamma \qquad (6)$$

where
ABR(γ|Y) = pr(percentile > γ|percentile estimated < γ, Y) = pr(P > γ|Pest < γ, Y),
BAR(γ|Y) = pr(percentile < γ|percentile estimated > γ, Y) = pr(P < γ|Pest > γ, Y).

The second equality in (6) results from the identity, γABR(γ|Y) = (1 - γ)BAR(γ|Y). If the goal

is to identify units with the largest percentiles, then BAR(γ|Y) is similar to the False Discovery

Rate (Benjamini and Hochberg 1995; Efron and Tibshirani 2002; Storey 2002; Storey 2003).

ABR(γ|Y) is similar to the False non-Discovery Rate. When the data are completely

uninformative, BAR(γ|Y)/γ ≐ 1 and so OC(γ|Y) produces a standardized comparison across

γ values. Minimizing it produces the most informative cut point for a given Pest.
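Given the posterior probabilities πk(γ|Y) = pr(Pk > γ|Y) and a vector of estimated percentiles, the operating characteristic reduces to two averages. The sketch below is an illustration with hypothetical inputs; it also shows the uninformative case, in which every πk equals 1 - γ and OC equals 1.

```python
import numpy as np

def operating_characteristic(pi, p_est, gamma):
    """ABR, BAR and OC = ABR + BAR from pi_k = pr(P_k > gamma | Y) and estimated percentiles."""
    pi = np.asarray(pi, dtype=float)
    below = np.asarray(p_est, dtype=float) < gamma
    abr = pi[below].mean()             # pr(P > gamma | P_est < gamma, Y)
    bar = (1.0 - pi[~below]).mean()    # pr(P < gamma | P_est > gamma, Y)
    return abr, bar, abr + bar

# Hypothetical example with K = 4 units and gamma = 0.5.
pi = np.array([0.0, 2 / 3, 1 / 3, 1.0])
p_est = np.array([0.2, 0.6, 0.4, 0.8])
abr, bar, oc = operating_characteristic(pi, p_est, gamma=0.5)
print(abr, bar, oc, bar / 0.5)

# Completely uninformative data: every pi_k equals 1 - gamma.
_, _, oc_flat = operating_characteristic(np.full(4, 0.5), p_est, gamma=0.5)
print(oc_flat)
```

The printed OC and BAR/γ agree because the πk in the example sum to (1 - γ)K, the condition behind the identity γABR = (1 - γ)BAR.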

For any percentiling method, OC(γ|Y) provides a data analytic performance evaluation. The
direct computation of it sums πk(γ|Y) = pr(Pk > γ|Y) over a Pest produced set of indices.
Plotting the πk(γ|Y) versus the $P_k^{est}$ (see Fig. 1) displays percentile-specific, classification
performance. For ideal fully informative data, the exceedance probability should be 1 for those
classified as above γ and 0 for those classified as below γ. OC(γ) is the area between the
curve and 1 for j ≥ [γK] + 1 plus the area below the curve for j ≤ [γK]. Using $\tilde P_k(\gamma)$ for the
X-axis produces a monotone plot and the resulting OC(γ|Y) is the minimum attainable. This plot is similar
to that proposed by Pepe et al. (2008).

Computing the πk(γ|Y) is numerically challenging. However, the virtual equivalence between
$\tilde P_k(\gamma)$ and $P_k^*(\gamma)$ justifies replacing these posterior probabilities by the easily computed pr(ρk
> t|Y) with $t = \bar G_Y^{-1}(\gamma)$.

4.3 Longitudinal variation

For most dialysis centers, we expect that percentile estimates for different years are similar
to each other. To measure variation in the rank/percentile estimates within dialysis centers
over the four years, we compute the Longitudinal Variation (LV), a scaled average over dialysis
centers of the within-center squared deviations $\sum_t (P_{kt}^{est} - \bar P_{k\cdot}^{est})^2$,
where $P_{kt}^{est}$ is the estimated percentile for dialysis center k in year t and $\bar P_{k\cdot}^{est}$ is the mean over
the four years. A smaller LV value indicates better consistency in the percentile estimates across
years.
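A minimal sketch of a longitudinal variation measure of this kind is given below. It averages, over centers, the sample variance of each center's yearly percentile estimates. Both the divisor T - 1 and the `scale` multiplier are assumptions made for illustration rather than the source's definition; with independent, uniformly distributed percentiles the average within-center variance is about 1/12, so a scale of 1000 would give a value near 83, in line with the no-information value quoted in Sect. 5.4.

```python
import numpy as np

def longitudinal_variation(p_est, scale=1.0):
    """Scaled average over centers of the within-center sample variance of yearly percentiles.

    p_est has shape (K, T): one row per center, one column per year.
    `scale` is a hypothetical multiplier, not taken from the source.
    """
    p_est = np.asarray(p_est, dtype=float)
    within = p_est.var(axis=1, ddof=1)    # sample variance across the T years
    return scale * within.mean()

# Hypothetical percentiles for two centers over four years.
p_est = np.array([[0.5, 0.5, 0.5, 0.5],     # perfectly stable center
                  [0.1, 0.3, 0.5, 0.7]])    # drifting center
lv = longitudinal_variation(p_est)
print(lv)
```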

4.4 Subset dependency

Unlike estimating individual parameters (where there is individual shrinkage), ranks are highly

correlated and so changing the posterior distribution of some target parameters or removing or

adding units rearranges the order of individual parameters in a complicated manner. Ranks

computed using the posterior distribution of the ranks are thus not subset invariant in that re-

ranking the ranks for a subset of providers will not be the same as ranking only those providers.

Appendix A gives an illustrative numerical example. However, if the prior distribution

is known, ranks based on provider-specific summaries such as the MLEs, PMs, exceedance

probabilities or single-provider hypothesis tests are subset invariant. Of course, in an empirical

Bayes or fully Bayesian analysis with an unknown prior (thus, including a known hyper-prior),

no method is subset invariant because the data are also used to estimate the prior or to update

the hyper-prior. We investigate subset dependence by including/removing providers with small

mkt (high variance MLEs). These providers are generally small dialysis centers with very few

patients. Ranking procedures excluding these centers imply that the centers are first categorized

according to their sizes and rankings are then generated in different categories separately. We

pursue our comparison under model (2).

5 Results

5.1 Simulated performance

We conducted simulation studies comparing ranking/percentiling methods for the Poisson
sampling distribution similar to those reported in Lin et al. (2006) for the Gaussian sampling
distribution. Conclusions were similar with $\hat P_k$ performing well over a broad class of loss
functions, with MLE-based ranks performing poorly and ranks of posterior mean performing
reasonably well but by no means optimally (see Louis and Shen 1999; Gelman and Price
1999). Performances of all methods improved with increasing mkt (reduced sampling variance),
but generally the ranking results are quite inconclusive unless information in the sampling
distribution (e.g., provided by the data) is very high relative to that in the prior.

5.2 Subset dependency and the effect of unstable SMR estimates

We studied the effect of including or excluding providers with small mkt (high-variance MLE
estimates) by running both single-year and multiple-year analyses with and without the 68
providers with expected deaths <0.1 in 1998. Comparisons of the estimated percentiles in a graph similar to
Fig. 2 show that there is almost no change in percentiles for providers ranked either high or
low, but noticeable re-ordering happens in the middle range. This is not surprising in that the
ranks for high-variance providers are shrunken considerably towards the midrank (K + 1)/2
and are not ranked at the extremes. The high variance providers “mix up” with the ranks from
more stably estimated, central region providers, but are not contenders for extreme ranks/
percentiles. Also, there are more providers in a given interval length in the middle of the
distribution of parameters than in the tails. The ranks of these large providers in the middle
range will be more sensitive to the change of the joint posterior distribution caused by including
small providers. Performance measures MSE and OC(γ) were very similar for the two datasets.

5.3 Comparisons using the 1998 data

We computed ranks (formula (7)) based on the MLE and hypothesis testing statistics Z-scores
(testing the hypothesis H0: ρ = 1 for 1998); we computed the Bayesian estimates $\hat P_k$ and $\tilde P_k(0.8)$ and
percentiles based on the posterior means (ρpm) using model (2) with ϕ ≡ 0.

(7)

Globally, if we regard a dialysis center with ρmle greater than 1.5 as “flagged,” then 379 (12%)
of dialysis centers will be identified; if we regard a dialysis center with Z-score greater than
1.645 as “flagged,” then 647 (20.4%) of dialysis centers are identified.

To compare methods, we select the 634 (20%) worst dialysis centers by ranking and selecting

the largest MLEs and Z-scores and compare to those identified by $\tilde P_k(0.8)$. The 80th percentiles

of the ρmle and Z-score are 1.44 and 1.67, respectively, whereas the 80th percentile of the

ρpm is 1.10 (these PMs are closer to 1 than their respective MLEs).

We calculate the kappa statistics between the (above γ)/(below γ) classifications based on
$\tilde P_k(0.8)$ and other estimators. The classifications based on $\hat P_k$ and ρpm have high agreement with
those based on $\tilde P_k(0.8)$, with respective kappa statistics 0.90 and 0.94. The kappa statistic
between the MLE and $\tilde P_k(0.8)$ is 0.78, between the Z-score and $\tilde P_k(0.8)$ is 0.83, and between MLE
and Z-score is the lowest, 0.68.
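For reference, agreement between two (above γ)/(below γ) classifications can be summarized as sketched below, assuming that the kappa statistic used here is Cohen's kappa: observed agreement corrected for the agreement expected by chance from the two marginal flagging rates. The function name and the five-unit example are hypothetical.

```python
import numpy as np

def cohen_kappa(a, b):
    """Cohen's kappa for two binary (above/below) classifications of the same units."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    p_obs = np.mean(a == b)                                          # observed agreement
    p_exp = a.mean() * b.mean() + (1 - a.mean()) * (1 - b.mean())    # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical flags from two methods for five centers.
flags_a = [1, 1, 0, 0, 0]
flags_b = [1, 0, 0, 0, 1]
kappa = cohen_kappa(flags_a, flags_b)
print(kappa)
```

When both methods flag the same fraction of centers, as in the 20% comparison above, the chance-agreement term is the same for every pair of methods, so differences in kappa reflect differences in observed agreement only.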

Figure 3 compares different methods based on their posterior probability of correct
classification pr(Pk > 0.8|Y). The curve for $\tilde P_k(0.8)$ is monotone and optimal because we construct
$\tilde P_k(0.8)$ by ranking these probabilities. The curves for percentiles based on ρpm and on $\hat P_k$ are
very close to that for $\tilde P_k(0.8)$ (not plotted). The curves for MLE-based and Z-score-based
percentiles are far from monotone.

The MLE SMRs for centers with relatively small expected deaths have relatively large
variances. To study the impact of large and small variances on estimated percentiles, Fig. 3
identifies those for the 147 dialysis centers that treated fewer than 5 patients in 1998. These
constitute 4.5% of all centers. Generally, MLE-based percentiles for these centers are at the
extremes whereas Z-score based percentiles tend to be near 0.5. However, because the posterior
distribution of ρk for the high variance centers is concentrated around 1, the $\tilde P_k(0.8)$ for these
centers are near 0.5 and similarly for $\hat P_k$ and percentiles based on ρpm. For dialysis centers with
a large number of patients and thereby a small variance, the optimally estimated percentiles,
$\hat P_k$ and $\tilde P_k(0.8)$, spread out to cover the full range from 0 to 1. There is better agreement between
MLE-based, Z-score-based and optimal percentiles when the small centers are removed from
the dataset and estimates are recomputed.

Figure 4 displays estimates for the 40 providers at the 1/3174, 82/3174, 163/3174,…,
3173/3174 percentiles as determined by $\hat P_k$. For each display, the Y-axis is 100 × $\bar P_k$, the
posterior expected percentile, with its 95% posterior interval. The X-axis for the upper left panel
is $\hat P_k$, for the upper right is percentiles based on ρpm, for the lower left is percentiles based on
ρmle, and for the lower right is percentiles based on Z-scores testing ρk = 1. The hypothesis test
statistic is modified to deal with cases where Ykt = 0. See Conlon and Louis (1999) for a similar
plot based on SMRs of disease rates in small areas.

Note that in the upper left display the $\bar P_k$ do not fill out the (0, 1.0) percentile range; they are
shrunken toward 0.50 by an amount that reflects estimation uncertainty. Also, the posterior
probability intervals are very wide, indicating considerable uncertainty in estimating ranks/
percentiles. The plotted points in the upper left display are monotone because the X-axis is the
percentile based on ranking of Y-axis values. Plotted points in the upper right display, which
are based on posterior mean, are almost monotone and close to the best attainable. The lower
left and lower right panels show considerable departure from monotonicity, indicating that
MLE-based ranks and hypothesis test-based ranks are very far from optimal. Note also that the
pattern of departures is quite different in the two panels, showing that these methods produce
quite different ranks. Similar comparisons for SMRs estimated from the pooled 1998-2001
data would be qualitatively similar, but the departures from monotonicity would be less
extreme.

We divide MSE for different ranking methods by the MSE of randomly assigned ranks (Sect.
4.1) for standardization. The methods based on posterior distributions, Ppm, $\hat P$, $\tilde P(0.8)$ and
P*(0.8), perform close to each other, with standardized MSEs of 44.5%, 44.4%, 46.2% and
46.2%, respectively. Rankings based on MLE and Z-score show less improvement (52.3% and
47.4%) over randomly assigned ranks. The differences in standardized MSE are less substantial and the
wide 95% intervals presented in Fig. 4 indicate that none of the methods can give a conclusive
ranking result.

5.4 Single year and multi-year analyses

Using model (2) we estimated single-year based and AR(1) model based percentiles. Table 1
reports that the ξ are near 0, as should be the case since we have used internal standardization
(the typical log(SMR) = 0). The within year, between provider variation in 100 × log(SMR) is
essentially constant at approximately 100 × τ = 24, producing a 95% a priori interval for the
ρkt of (0.62, 1.60). While we have a prior centering around 1000 for 100 × τ, the data likelihood
dominates the prior information and the posterior 95% credible interval of 100 × τt for all 4
years is (22.8, 26.8). Use of the AR(1) model to combine evidence over years (with the posterior
distribution for ϕ concentrated around 0.90) reduces 100 × OC(0.8) from around 61 to around
48, a twenty percent decrease. Classification performance using the $\hat P_k$, as measured by
100 × OC(0.8), is very close to that for the optimal $\tilde P_k(0.8)$.

Figure 1 displays the details behind the improvement of classification performance. In the
upper range of the estimated percentiles, the curve for the AR(1) model lies above that for the
single year; in the lower range it lies below. For the AR(1) model to dominate the single year at
all values of the estimated percentile, the curves would need to cross at 0.8, but the curves cross
at about 0.7. Appendix B provides some discussion on this phenomenon.

Longitudinal variation in ranks/percentiles, LV(Pest), is dramatically reduced for the AR(1) model,
going from 62 for the year-by-year analysis to 4 for the multi-year. As a basis for comparison,
if ϕ → 1, LV → 0 and if the data provide no information on the SMRs (the τ → ∞), then
LV = 83.

We have not compared fit of the AR(1) model to other correlation structures such as compound

symmetry (constant correlation rather than exponential damping). With only 4 years of data

per center, the power to compare different correlation structures will be low. With ϕ's posterior

mean 0.90 and 95% credible interval (0.88, 0.92), the AR(1) model is well supported by the

data relative to independence. Note that the AR(1) model operates on the θkt = log(ρkt) and not

on the observed estimates. The induced model for these is approximately ARMA(1, 1), a hidden

Markov model.
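The ARMA(1, 1) structure can be seen in a simplified Gaussian version of the model, used here only as an illustration: the observation model in the paper is Poisson and the year means vary, whereas this sketch assumes a constant mean ξ and additive white-noise error. Write the observed log-estimate as y_t = θ_t + e_t, with θ_t - ξ = ϕ(θ_{t-1} - ξ) + u_t, where e_t and u_t are mutually independent white-noise sequences. Then

```latex
y_t - \phi\, y_{t-1}
  = (\theta_t - \phi\,\theta_{t-1}) + e_t - \phi\, e_{t-1}
  = (1-\phi)\,\xi + u_t + e_t - \phi\, e_{t-1}.
```

The right-hand side has non-zero autocovariance only at lags 0 and 1, so it is a first-order moving average, and y_t therefore follows an ARMA(1, 1) model with autoregressive coefficient ϕ.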

5.5 Parametric and non-parametric priors

We compare percentiles based on posterior distributions under the parametric and NPML priors

using 1998 data. Figure 5 displays Gaussian, posterior expected and smoothed NPML

estimated priors for θ = log(ρ). The Gaussian is produced by plugging in the posterior medians

for (μ0, τ0). The posterior expected is a mixture of Gaussians using the posterior distribution

of (μ0, τ0). The posterior distribution of (μ0, τ0) has close to 0 variance, so the two parametric

curves superimpose. The NPML is discrete and was smoothed using the “density” function in

R with adjustment parameter 10 (i.e., the Gaussian kernel bandwidth is ten times the default

value, see Silverman (1986)). We smooth the NPML to graph a smooth curve, but use the

NPML itself to produce ranks/percentiles. Note that the smoothed NPML has at least two

modes with a considerable mass at approximately θ = 0.5; ρ = 1.65. However, this departure

from the Gaussian distribution has little effect on classification performance. Using 1998 data,

for the NPML 100 × OC(0.8) ≈ 67 while for the Gaussian prior the value is 62 (see Table 1).

For performance evaluations of the NPML, see Paddock et al. (2006). Fig. 2 compares the estimated

percentiles under the two priors. The centers at the top or the bottom have less uncertainty in

percentiles (strong signal), and their percentiles are generally the same under the two priors. For the

dialysis centers with larger variance, the percentiles depend on the prior.

5.6 Ranks based on exceedance probabilities

We compute P*(0.8) (see Sect. 3.3) using the Gaussian prior for θ and 1998 data. The θ-threshold
is $\bar G_Y^{-1}(0.8) = \log(1.184) \approx 0.169$ (ρ-threshold = 1.184). Lin et al. (2006) prove the near equivalence
of P*(γ) and $\tilde P(\gamma)$ and Fig. 1 displays this equivalence in that the curve based on P*(0.8) is
nearly identical to that based on $\tilde P(0.8)$ for ϕ = 0.

6 Discussion

Ranks and percentiles are computed to address specific policy or management goals. It is

important to use a procedure that performs well for the primary goals. A structured approach

guided by a Bayesian hierarchical model and a loss function helps clarify goals and produces

ranks/percentiles that outperform other contenders, such as those based on MLEs and Z-scores.

When the uncertainties of the direct estimate vary considerably over providers, the estimates

are very sensitive to the method used. In that situation, a structured approach is especially

important.


Health Serv Outcomes Res Methodol. Author manuscript; available in PMC 2009 April 1.


Our data-analytic assessments support the Lin et al. (2006) finding that the P̂k (general purpose percentiles) perform well over inferential goals addressed by a range of loss functions, but that if a specific percentile cut-point, γ, is identified, P̃k(γ) (or P*(γ)) should be used. Unless the substantive application dictates otherwise, we recommend the use of these. Cost or other considerations can be incorporated to select γ.

Though the loss function guided estimates are the best possible, the ranking results might not be conclusive, as partially indicated by the wide probability intervals shown in Fig. 4.

Therefore, data-analytic performance evaluations are a necessary companion to estimated

ranks. Uncertainty assessments include standard errors and tabulation or display of the

probabilities of correct classification in a (above γ)/(below γ) assessment (our πk(γ|Y)). These

probabilities can be used to temper penalties or rewards. When available, data from multiple years can be combined to reduce the uncertainty in ranking results, as shown in Table 1 and Fig. 1.
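A minimal sketch of how such classification probabilities might be tabulated from posterior draws is given below. It assumes πk(γ|Y) is the posterior probability that center k's percentile exceeds γ, with percentile = rank/(K + 1); the draws are invented and the function names are hypothetical.

```python
def ranks(values):
    """Ranks 1..K of the entries of `values` (1 = smallest)."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    r = [0] * len(values)
    for pos, k in enumerate(order, start=1):
        r[k] = pos
    return r

def prob_above_cut(draws, gamma):
    """Fraction of posterior draws in which each center's percentile exceeds gamma."""
    K = len(draws[0])
    counts = [0] * K
    for theta in draws:
        for k, r in enumerate(ranks(theta)):
            if r / (K + 1) > gamma:   # assumed percentile convention
                counts[k] += 1
    return [c / len(draws) for c in counts]

# Four invented posterior draws of (theta_1, ..., theta_4); real use would take MCMC output.
draws = [
    [0.1, 0.5, -0.2, 0.3],
    [0.2, 0.4, -0.1, 0.6],
    [0.0, 0.7, -0.3, 0.2],
    [0.3, 0.2, -0.2, 0.5],
]
pi = prob_above_cut(draws, gamma=0.7)
```

With K = 4 and γ = 0.7 only the top-ranked center in each draw is above the cut, so the probabilities sum to 1 across centers and show how evenly the top position is contested.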

Robustness of efficiency and validity are important attributes of any statistical procedure. For

sufficiently large K, using a smoothed non-parametric prior is highly efficient relative to a

correct, parametric approach and confers considerable robustness (see Paddock et al. 2006).

Additional study of this strategy is needed.

Percentiles are prima facie relative comparisons in that it is possible that all providers are doing

well or that all are doing poorly; percentiles will not pick this up. Indeed, the SMR is, itself, a

relative measure and so percentiles produced from it are twice removed from a normative

context! In situations where normative values are available (e.g., death rates), percentiles that

have a normative interpretation are attractive and those based on posterior probabilities of

exceeding some threshold (P*(γ)) are essentially identical to a loss function based approach

and so provide an excellent link to a substantively relevant scale. They also confer a considerable computing advantage over using the posterior distribution of the ranks to find the P̃k(γ).

Finally, because percentiles can be very sensitive to the estimation methods and because there

is considerable uncertainty associated with all percentiling methods, stakeholders need to be

informed of the issues in producing percentiles, in interpreting them, in their role in science

and policy, and in insisting on uncertainty assessments.

Acknowledgements

Supported by grant R01-DK61662 from the U.S. NIH, National Institute of Diabetes, Digestive and Kidney Diseases. We thank Chris Forrest for his advice and comments.

Appendices

Appendix A: An illustrative example of subset dependency

In general, ranks depend on the framework for comparison and the list of contenders; they are not necessarily subset invariant. For illustration, consider ranking 3 dialysis centers by […] defined in Eq. 3. We have […]. Let […], i = 1, 2, 3. Let μ1 = 0, μ2 = -0.15, μ3 = 0.05, […], then pr(ρ1 > ρ2) = 0.51 > 0.49 = pr(ρ2 > ρ1); pr(ρ1 > ρ2) + pr(ρ1 > ρ3) = 0.51 < 0.98 = pr(ρ2 > ρ1) + pr(ρ2 > ρ3). Thus if we rank only dialysis centers 1 and 2, center 2 has better rank (smaller θ) than center 1; if we rank all 3 centers, center 2 has worse rank than center 1.
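The subset dependence in Appendix A can be checked numerically. The sketch below assumes independent Gaussian posteriors for the θi with the stated means; the standard deviations (0.01, 6, 0.01) are assumptions chosen here so that the probabilities roughly match the quoted 0.51, 0.49 and 0.98, and are not taken from the text. Since ρ = exp(θ) is monotone, pr(ρa > ρb) = pr(θa > θb).

```python
import math

def pr_greater(m_a, s_a, m_b, s_b):
    """pr(theta_a > theta_b) for independent Gaussians with given means and sds."""
    z = (m_a - m_b) / math.sqrt(s_a ** 2 + s_b ** 2)
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

mu = {1: 0.0, 2: -0.15, 3: 0.05}   # means stated in the appendix
sd = {1: 0.01, 2: 6.0, 3: 0.01}    # ASSUMED standard deviations (illustrative only)

def score(k, subset):
    """Sum over the other centers j in `subset` of pr(theta_k > theta_j)."""
    return sum(pr_greater(mu[k], sd[k], mu[j], sd[j]) for j in subset if j != k)

# Ranking only centers 1 and 2.
pair_1, pair_2 = score(1, [1, 2]), score(2, [1, 2])
# Ranking all three centers.
all_1, all_2 = score(1, [1, 2, 3]), score(2, [1, 2, 3])
```

Under these assumed values center 1 scores higher than center 2 in the pairwise comparison but lower once center 3 is included, which is the reversal described above.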

Appendix B: Crossing points of exceedance probability curves

We start from a simplified scenario where θk's share a common posterior variance. Assume θk ~ N(μk, ν) a posteriori, Φ(t; μk, ν) = pr(θ > t; μk, ν). Let ν0 and […] be two possible values of ν, ν0 > […]. Without loss of generality, let μ1 < μ2 < … < μK.

For any given t,

• If […];

• If […];

By the common variance assumption and μ1 < μ2 < … < μK, the posterior distributions of θk are stochastically ordered and the rank of θk is k. The curves […] and […] are both monotone increasing and cross each other between […] if μi < t < μi+1. The value of the y-coordinate of the crossing point is around 0.5 due to […] and […]. We denote the x-coordinate of the crossing point as […], ignoring at most 1/K difference.
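The common-variance scenario can be illustrated numerically. The sketch below uses invented values for the means, the threshold t and the two variances (named `nu_large` and `nu_small` here, not the paper's notation), and checks that the two exceedance-probability curves cross between the two centers whose means straddle t, at a height near 0.5.

```python
import math

def exceed(t, mu, var):
    """pr(theta > t) for theta ~ N(mu, var)."""
    return 0.5 * math.erfc((t - mu) / math.sqrt(2.0 * var))

# Invented, increasing posterior means and a threshold lying between mus[3] and mus[4].
mus = [-0.4, -0.2, -0.05, 0.1, 0.25, 0.45]
t = 0.17
nu_large, nu_small = 0.09, 0.01   # two candidate common variances (assumed values)

curve_large = [exceed(t, m, nu_large) for m in mus]
curve_small = [exceed(t, m, nu_small) for m in mus]

# Below t the larger variance gives the larger exceedance probability; above t the smaller.
diff = [a - b for a, b in zip(curve_large, curve_small)]
cross_after = max(k for k, d in enumerate(diff) if d > 0)   # 0-based index of last mean below t
```

Both curves increase with the center index, and the sign change in `diff` locates the crossing between the centers on either side of t, where each curve passes through 0.5.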

If t1 and t2 satisfy […] then the curves […] and […] cross each other at […]; the curves […] and […] cross each other at […]; and the crossing-over of curves […] and […] happens between […] and […]. When γ is greater or smaller than both of […] and […], which depend on γ, ν0, and vector (μ1, μ2, …, μK), the x-coordinate of the crossing point cannot be at γ.

Denote the posterior distributions of θk as N(μk, νk) and N([…]) in single year and multiple years analyses. If the dialysis centers perform consistently over years, inference uncertainty of θk should be reduced (assuming […] < νk) by accumulating data over years while the means do not change much (assuming […] = μk). Assuming the N(μk, νk)'s and the N([…])'s are stochastically ordered, the above discussion applies to the crossing point of the curves […] and […].

In Fig. 1, two curves […] and […] are plotted without the stochastically ordered assumption and the location of the crossing point is more complicated. In general, it is not necessary that the x-coordinate of the crossing point will be at γ.

Appendix C: The NPML algorithm

Assume ρk ~ G, k = 1, …, K. G is discrete having at most J mass points u1, …, uJ with probabilities p1, …, pJ. We use the EM algorithm (Dempster et al. 1977) to estimate the u's and p's. Start with […] and […]; for each recursion,

[…]

This recursion converges to a fixed point and, if unique, to the NPML. The recursion is stopped when the maximum relative change in each step for both the […] and the […], j = 1, 2, …, K is smaller than 0.001. At convergence, […] is both prior and the Shen and Louis (1998) histogram estimate […].

Care is needed in programming the recursion. The w-recursion is:

[…]

Since […] can be extremely small ([…] can be extremely large), to stabilize the computations we define,

[…]

and write

[…]

The w-recursion becomes:

[…]
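For orientation, here is a minimal sketch of an EM recursion for a discrete mixing distribution, with the weights computed on the log scale. It assumes a Poisson sampling model, y[k] ~ Poisson(m[k] × ρk), which is an assumption made for this sketch; the function name `npml_em`, the data and the starting values are hypothetical, and the stabilization used (subtracting the largest log weight before exponentiating) is one common device that may differ from the recursion in the paper. The stopping rule follows the 0.001 relative-change criterion described above.

```python
import math

TINY = 1e-300  # floor to keep logarithms and divisions finite

def npml_em(y, m, u, p, tol=1e-3, max_iter=5000):
    """EM for mass points u and probabilities p of a discrete prior G on rho.

    ASSUMED model: y[k] ~ Poisson(m[k] * rho_k), rho_k ~ G.
    """
    K, J = len(y), len(u)
    u, p = list(u), list(p)
    for _ in range(max_iter):
        # E-step: w[k][j] = pr(rho_k = u_j | y_k), via log weights.
        # Terms that do not depend on j cancel in the normalization.
        w = []
        for k in range(K):
            logs = [math.log(max(p[j], TINY)) + y[k] * math.log(max(u[j], TINY))
                    - m[k] * u[j] for j in range(J)]
            top = max(logs)                      # subtract the maximum for stability
            e = [math.exp(v - top) for v in logs]
            tot = sum(e)
            w.append([v / tot for v in e])
        # M-step: update probabilities and mass points.
        new_p = [sum(w[k][j] for k in range(K)) / K for j in range(J)]
        new_u = [sum(w[k][j] * y[k] for k in range(K)) /
                 max(sum(w[k][j] * m[k] for k in range(K)), TINY) for j in range(J)]
        change = max(max(abs(a - b) / max(b, TINY) for a, b in zip(new_u, u)),
                     max(abs(a - b) / max(b, TINY) for a, b in zip(new_p, p)))
        u, p = new_u, new_p
        if change < tol:                         # maximum relative change below tolerance
            break
    return u, p

# Invented data: observed deaths y and expected deaths m for 8 centers.
y = [5, 6, 4, 5, 15, 16, 14, 15]
m = [10.0] * 8
u_hat, p_hat = npml_em(y, m, u=[0.3, 2.0], p=[0.5, 0.5])
```

With these well-separated invented data the two mass points settle near 0.5 and 1.5 with roughly equal probabilities. Working with log weights avoids underflow when m[k] × u is large and overflow when u is raised to a large power y[k].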

References

Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to

multiple testing. J. R. Stat. Soc. Ser. B Stat. Methodol 1995;57:289–300.

Carlin, BP.; Louis, TA. Bayesian Methods for Data Analysis. 3rd edn. Chapman and Hall/CRC

Press; Boca Raton FL: 2008.

Christiansen C, Morris C. Improving the statistical approach to health care provider profiling. Ann. Intern.

Med 1997;127:764–768. [PubMed: 9382395]

Conlon, EM.; Louis, TA. Addressing multiple goals in evaluating region-specific risk using Bayesian

methods. In: Lawson, A.; Biggeri, A.; Böhning, D.; Lesaffre, E.; Viel, J-F.; Bertollini, R., editors.

Disease Mapping and Risk Assessment for Public Health. Wiley; 1999. p. 31-47. chap. 3

Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm

(C/R: P 22-37). J. R. Stat. Soc. Ser. B Stat. Methodol 1977;39:1–22.

Diggle PJ, Thomson MC, Christensen OF, Rowlingson B, Obsomer V, Gardon J, Wanji S, Takougang

I, Enyong P, Kamgno J, Remme JH, Boussinesq M, Molyneux DH. Spatial modelling and the

prediction of Loa loa risk: decision making under uncertainty. Ann. Trop. Med. Parasitol 2007;101

(6):499–509. [PubMed: 17716433]

Dominici F, Parmigiani G, Wolpert RL, Hasselblad V. Meta-analysis of migraine headache treatments:

combining information from heterogeneous designs. J. Am. Stat. Assoc 1999;94:16–28.

Efron B, Tibshirani R. Empirical Bayes methods and false discovery rates for microarrays. Genet.

Epidemiol 2002;23:70–86. [PubMed: 12112249]

ESRD. 1999 Annual Report: ESRD Clinical Performance Measures Project. Health Care Financing

Administration; 2000. Technical Report

Gelman A, Price P. All maps of parameter estimates are misleading. Stat. Med 1999;18:3221–3234.

[PubMed: 10602147]

Gelman, A.; Sturtz, S.; Ligges, U.; Gorjanc, G.; Kerman, J. 2006. The R2WinBUGS package.

http://cran.r-project.org/doc/packages/R2WinBUGS.pdf

Goldstein H, Spiegelhalter D. League tables and their limitations: statistical issues in comparisons of

institutional performance (with discussion). J. R. Stat. Soc. Ser. A 1996;159:385–443.

Grigg OA, Farewell VT, Spiegelhalter DJ. Use of risk-adjusted CUSUM and RSPRT charts for

monitoring in medical contexts. Stat. Methods Med. Res 2003;12(2):147–170. [PubMed: 12665208]

Lacson E, Teng M, Lazarus J, Lew N, Lowrie E, Owen W. Limitations of the facility-specific standardized

mortality ratio for profiling health care quality in dialysis. Am. J. Kidney Dis 2001;37:267–275.

[PubMed: 11157366]

Laird NM. Nonparametric maximum likelihood estimation of a mixing distribution. J. Am. Stat. Assoc

1978;73:805–811.

Landrum M, Bronskill S, Normand S-L. Analytic methods for constructing cross-sectional profiles of

health care providers. Health. Serv. Outcomes Res. Method 2000;1:23–48.

Lin R, Louis TA, Paddock SM, Ridgeway G. Loss function based ranking in two-stages, hierarchical

models. Bayesian Anal 2006;1(4):915–946.

Liu J, Louis TA, Pan W, Ma J, Collins A. Methods for estimating and interpreting provider-specific,

standardized mortality ratios. Health. Serv. Outcomes Res. Method 2004;4:135–149.


Lockwood J, Louis TA, McCaffrey DF. Uncertainty in rank estimation: implications for value-added

modeling accountability systems. J. Edu. Behav. Stat 2002;27(3):255–270.

Louis TA, Shen W. Innovations in Bayes and empirical Bayes methods: estimating parameters,

populations and ranks. Stat. Med 1999;18:2493–2505. [PubMed: 10474155]

Louis TA, Zeger SL. Effective communication of standard errors and confidence intervals. Biostatistics.

2008. http://dx.doi.org/10.1093/biostatistics/kxn014

McClellan, M.; Staiger, D. The Quality of Health Care Providers. National Bureau of Economic Research;

1999. Technical Report 7327, Working Paper

Normand S-LT, Glickman ME, Gatsonis CA. Statistical methods for profiling providers of medical care:

issues and applications. J. Am. Stat. Assoc 1997;92:803–814.

Normand S-LT, Shahian DM. Statistical and clinical aspects of hospital outcomes profiling. Stat. Sci

2007;22:206–226.

Ohlssen DI, Sharples LD, Spiegelhalter DJ. A hierarchical modelling framework for identifying unusual

performance in health care providers. J. R. Stat. Soc. Ser. A Stat. Soc 2007;170(4):865–890.

Paddock S, Ridgeway G, Lin R, Louis TA. Flexible distributions for triple-goal estimates in two-stage

hierarchical models. Comput. Stat. Data Anal 2006;50(11):3243–3262.

Pepe MS, Feng Z, Huang Y, Longton G, Prentice R, Thompson IM, Zheng Y. Integrating the

predictiveness of a marker with its performance as a classifier. Am. J. Epidemiol 2008;167(3):362–

368. http://dx.doi.org/10.1093/aje/kwm305 [PubMed: 17982157]

Plummer, M.; Best, N.; Cowles, K.; Vines, K. The CODA Package. 2006.

Shen W, Louis TA. Triple-goal estimates in two-stage, hierarchical models. J. R. Stat. Soc. Ser. B

1998;60:455–471.

Shen W, Louis TA. Triple-goal estimates for disease mapping. Stat. Med 2000;19:2295–2308. [PubMed:

10960854]

Silverman, BW. Density Estimation for Statistics and Data Analysis. Chapman & Hall Ltd; 1986.

Spiegelhalter, D.; Thomas, A.; Best, N.; Gilks, W. BUGS: Bayesian Inference Using Gibbs Sampling.

Version 0.60. Medical Research Council Biostatistics Unit, Institute of Public Health;

Cambridge University: 1999. Technical Report

Storey JD. A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B Methodol 2002;64(3):479–

498.

Storey JD. The positive false discovery rate: a Bayesian Interpretation and the q-value. Ann. Stat 2003;31

(6):2013–2035.

USRDS. 2005 Annual Data Report: Atlas of end-stage renal disease in the United States. Health Care

Financing Administration; 2005. Technical report

Zaslavsky AM. Statistical issues in reporting quality data: small samples and casemix variation. Int. J.

Qual. Health Care 2001;13(6):481–488. [PubMed: 11769751]

Zhang M, Strawderman RL, Cowen ME, Wells MT. Bayesian inference for a two-part hierarchical model:

an application to profiling providers in managed health care. J. Am. Stat. Assoc 2006;101(475):934–

945.


Fig. 1.

πk(0.8|Y) versus P̃k(0.8) for 1998. Optimal percentiles and posterior probabilities computed with the single year model (ϕ ≡ 0) and the AR(1) model (ϕ = 0.90). Two curves don't cross at γ = 0.8. The line for fully informative data, i.e., when there is no uncertainty associated with ranking results, is given as reference


Fig. 2.

Comparison of 1998 percentiles with NPML and Gaussian prior. Circles represent 40 dialysis centers evenly spread across percentiles estimated with NPML prior. The percentiles of the same center are connected


Fig. 3.

πk(0.8) versus estimated percentiles by three ranking methods using the 1998 data: P̃k(0.8), MLE-based and Z-score-based. For small dialysis centers (fewer than 5 patients in 1998), the symbol “-” represents the MLE-based percentiles, the symbol “1” the Z-score-based percentiles and the symbol “ ^ ” the P̃k(0.8)


Fig. 4.

SEL-based percentiles for 1998. For each display, the Y-axis is 100 × P̂k with its 95% probability interval. The X-axis for the upper left panel is P̂k, for the upper right is percentiles based on ρpm, for the lower left is percentiles based on the ρmle and for the lower right is percentiles based on Z-scores testing ρk = 1


Fig. 5.

Estimated priors for θ = log(ρ) using the 1998 data. The solid curve is a smoothed NPML using

the “density” function in R with adjustment parameter = 10. The dashed curve is Gaussian

using posterior medians for (μ, τ); the dotted curve is a mixture of Gaussians with (μ, τ) sampled

from their MCMC computed joint posterior distribution


Table 1

Results for […] and […]. In the multi-year section, 100 × OC(0.8) is for the indicated year as estimated from the multi-year model and ₈₈90₉₂ is a notation for posterior median 90 and 95% credible interval (88, 92) (Louis and Zeger 2008)

                          Single year (ϕ ≡ 0)            Multi-year (100 × ϕ ∼ ₈₈90₉₂)
Parameter                 1998   1999   2000   2001      1998   1999   2000   2001
100 × ζ                   -2.8   -1.3   -2.3   -0.7      -3.1   -0.8   -1.7   -0.3
100 × τ                   24.1   23.5   23.1   22.2      25.8   25.0   24.9   24.1
100 × OC_{P̃(0.8)}(0.8)    62     61     60     62        49     47     46     50

LV(P̂k)

62

4
