Proportional Multicalibration
William La Cava, Elle Lett, Guangya Wan
Computational Health Informatics Program, Boston Children’s Hospital, Harvard Medical School
Boston, MA, USA
{william.lacava,elle.lett,guangya.wan}@childrens.harvard.edu
Abstract
Multicalibration is a desirable fairness criterion that constrains calibration error among flexibly-defined groups in the data while maintaining overall calibration. However, when outcome probabilities are correlated with group membership, multicalibrated models can exhibit a higher percent calibration error among groups with lower base rates than groups with higher base rates. As a result, it remains possible for a decision-maker to learn to trust or distrust model predictions for specific groups. To alleviate this, we propose proportional multicalibration, a criterion that constrains the percent calibration error among groups and within prediction bins. We prove that satisfying proportional multicalibration bounds a model's multicalibration as well as its differential calibration, a stronger fairness criterion inspired by the fairness notion of sufficiency. We provide an efficient algorithm for post-processing risk prediction models for proportional multicalibration and evaluate it empirically. We conduct simulation studies and investigate a real-world application of PMC-postprocessing to prediction of emergency department patient admissions. We observe that proportional multicalibration is a promising criterion for controlling simultaneous measures of calibration fairness of a model over intersectional groups with virtually no cost in terms of classification performance.
1 Introduction
Today, machine learning (ML) models have an impact on outcome disparities across sectors (health,
lending, criminal justice) due to their widespread use in decision-making. When applied in clinical
decision-making, ML models help care providers decide whom to prioritize to receive finite and
time-sensitive resources among a population of potentially very ill patients. These resources include
hospital beds [6,17], organ transplants [49], specialty treatment programs [31,43], and, recently,
ventilator and other breathing support tools to manage the COVID-19 pandemic [46].
In scenarios like these, decision makers typically rely on risk prediction models to be calibrated. Calibration measures the extent to which a model's risk scores, $R$, match the observed probability of the event, $y$. Perfect calibration implies that $P(y \mid R = r) = r$ for all values of $r$. Calibration allows the risk scores to be used to rank patients in order of priority and informs care providers about the urgency of treatment. However, models that are not equally calibrated among subgroups defined by different sensitive attributes (race, ethnicity, gender, income, etc.) may lead to systematic denial of resources to marginalized groups (e.g. [43, 1, 47, 51, 41]). Just this scenario was observed by Obermeyer et al. [43], who analyzed a large health system algorithm used to enroll high-risk patients into care management programs and showed that, at a given risk score, Black patients exhibited significantly poorer health than white patients.
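For intuition, calibration of this kind can be probed empirically by binning risk scores and comparing the mean outcome to the mean score within each bin. A minimal sketch, assuming synthetic data and an illustrative bin width and helper name:

```python
import numpy as np

def binned_calibration_error(y, r, lam=0.1):
    """Largest gap between mean outcome and mean prediction across
    lambda-width prediction bins: a simple empirical calibration check."""
    y, r = np.asarray(y), np.asarray(r)
    bins = np.floor(r / lam).astype(int)
    return max(abs(y[bins == b].mean() - r[bins == b].mean())
               for b in np.unique(bins))

rng = np.random.default_rng(0)
r = rng.uniform(size=10_000)              # risk scores
y = rng.binomial(1, r)                    # outcomes drawn at rate r: calibrated by construction
print(binned_calibration_error(y, r))     # close to 0
```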
Corresponding author. https://cavalab.org
https://ellelett.com
To address equity in calibration, Hebert-Johnson et al. [28] proposed a fairness measure called multicalibration (MC), which asks that calibration be satisfied simultaneously over many flexibly-defined subgroups. Remarkably, MC can be satisfied efficiently by post-processing risk scores without negatively impacting the generalization error of a model, unlike other fairness concepts like demographic parity [23] and equalized odds [27]. This has motivated the use of MC in practical settings (e.g. Barda et al. [7]) and has spurred several extensions [39, 35, 25, 24]. If we bin our risk predictions, the MC criterion specifies that, for every group within each bin, the absolute difference between the mean observed outcome and the mean of the predictions should be small.
As Barocas, Hardt, and Narayanan [8] note, equity in calibration embeds the fairness notion called sufficiency, which states: for a given risk prediction, the expected outcome should be independent of group membership. Starting from this notion, we can assess the conditions under which MC satisfies sufficiency. In this work, we derive a fairness criterion directly from sufficiency, dubbed differential calibration for its relation to differential fairness [21]. We show that satisfying differential calibration can ensure that a model is equally “trustworthy” among groups in the data. By equally “trustworthy”, we mean that a decision maker cannot reasonably come to distrust the model's risk predictions for specific groups, which may help prevent differences in decision-making between demographic groups, given the same risk prediction.
By relating sufficiency to MC, we describe a shortcoming of MC that can occur when the outcome
probabilities are strongly tied to group membership. Under this condition, the amount of calibration
error relative to the expected outcome can be unequal between groups. This inequality hampers the
ability of MC to (approximately) guarantee sufficiency, and thus guarantee equity in trustworthiness
for the decision maker.
We propose a simple variant of MC called proportional multicalibration (PMC) that ensures that the proportion of calibration error within each bin and group is small. We prove that PMC bounds both multicalibration and differential calibration. We show that PMC can be satisfied with an efficient post-processing method, similarly to MC.
1.1 Our Contributions
In this manuscript, we formally analyze the connection of MC to the fairness notion of sufficiency.
To do so, we introduce differential calibration (DC), a sufficiency measure that constrains ratios
of population risk between pairs of groups within prediction bins. We describe how DC, like
sufficiency, provides a sense of equal trustworthiness from the point of view of the decision maker.
With this definition, we prove the following. First, models that are $(\alpha, \lambda)$-multicalibrated satisfy $(\log\frac{r_{min}+\alpha}{r_{min}-\alpha}, \lambda)$-DC, where $r_{min}$ is the minimum expected risk prediction among categories defined by subgroups and prediction intervals. We illustrate the meaning of this bound, which is that the proportion of calibration error in multicalibrated models may scale inversely with the outcome probability.
Based on these observations, we propose an alternate definition of MC, PMC, that controls the percentage error by group and risk strata (Definition 3). We show that models satisfying $(\alpha, \lambda)$-PMC are $(\frac{\alpha}{1-\alpha}, \lambda)$-multicalibrated and $(\log\frac{1+\alpha}{1-\alpha}, \lambda)$-differentially calibrated. Proportionally multicalibrated models thereby obtain robust fairness guarantees that are independent of population risk categories. Furthermore, we define an efficient algorithm for learning predictors satisfying $\alpha$-PMC.
Finally, we investigate the application of these methods to predicting patient admissions in the
emergency department, a real-world resource allocation task, and show that post-processing for PMC
results in models that are accurate, multicalibrated, and differentially calibrated.
2 Reconciling Multicalibration and Sufficiency
2.1 Preliminaries
We consider the task of training a risk prediction model for a population of individuals with outcomes, $y \in \{0, 1\}$, and features, $x \in \mathcal{X}$. Let $\mathcal{D}$ be the joint distribution from which individual samples $(y, x)$ are drawn. We assume the outcomes $y$ are random samples from underlying independent Bernoulli distributions, denoted as $p(x) \in [0, 1]$. Given an individual's attributes $x = (x_1, \dots, x_d)$, it will be useful to refer to subsets we wish to protect, e.g. demographic identifiers. To do so, we define $A = \{A_1, \dots, A_p\}$, $p \le d$, such that $A_1 = \{x_{1i}, \dots, x_{1k}\}$ is a finite set of values taken by attribute $x_1$. Individuals can be further grouped into collections of subsets, $\mathcal{C} \subseteq 2^{\mathcal{X}}$, such that $S \in \mathcal{C}$ is the subset of individuals belonging to $S$, and $x \in S$ indicates that individual $x$ belongs to group $S$.

We denote our risk prediction model as $R(x) : \mathcal{X} \to [0, 1]$. In order to consider calibration in practice, the risk predictions are typically discretized and considered within intervals. The coarseness of this interval is parameterized by a partitioning parameter, $\lambda \in (0, 1]$. The $\lambda$-discretization of $[0, 1]$ is denoted by a set of intervals, $\Lambda_\lambda = \{I_j\}_{j=0}^{1/\lambda - 1}$, where $I_j = [j\lambda, (j+1)\lambda)$. For brevity, most proofs in the following sections are given in Appendix A.3.
2.2 Multicalibration
MC [29] guarantees that the calibration error for any group from a collection of subsets, $\mathcal{C}$, will not exceed a user-defined threshold over the range of risk scores. In order to work with bins of predictions, we will mostly concern ourselves with the discretized version of MC, defined below. The non-discretized versions are given in Appendix A.2.
Definition 1 ($(\alpha, \lambda)$-multicalibration). Let $\mathcal{C} \subseteq 2^{\mathcal{X}}$ be a collection of subsets of $\mathcal{X}$. For any $\alpha, \lambda > 0$, a predictor $R$ is $(\alpha, \lambda)$-multicalibrated on $\mathcal{C}$ if, for all $I \in \Lambda_\lambda$ and $S \in \mathcal{C}$ where $P_\mathcal{D}(R \in I \mid x \in S) \ge \alpha\lambda$,
$$\left| \mathbb{E}_\mathcal{D}[y \mid R \in I,\, x \in S] - \mathbb{E}_\mathcal{D}[R \mid R \in I,\, x \in S] \right| \le \alpha.$$
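Definition 1 can be checked directly on a finite sample. A sketch of such a check, assuming groups are supplied as boolean masks and using the same $\alpha\lambda$ support threshold (function and variable names are illustrative):

```python
import numpy as np

def mc_violation(y, r, groups, alpha=0.05, lam=0.1):
    """Largest |E[y] - E[R]| over (group, bin) categories holding at least
    an alpha*lambda fraction of their group (cf. Definition 1)."""
    y, r = np.asarray(y), np.asarray(r)
    bins = np.floor(r / lam).astype(int)
    worst = 0.0
    for S in groups:                      # S: boolean mask over individuals
        for b in np.unique(bins[S]):
            cat = S & (bins == b)
            if cat.sum() < alpha * lam * S.sum():
                continue                  # category too small to constrain
            worst = max(worst, abs(y[cat].mean() - r[cat].mean()))
    return worst
```

A model is empirically $(\alpha, \lambda)$-multicalibrated when this value is at most $\alpha$.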
MC is one of few approaches to achieving fairness that does not require a significant trade-off to be made between a model's generalization error and the improvement in fairness it provides [29]. As [29] show, this is because achieving multicalibration is not at odds with achieving accuracy in expectation for the population as a whole. This separates calibration fairness from other fairness constraints like demographic parity and equalized odds [27], both of which may degrade the performance of the model on specific groups [11, 45]. In clinical settings, such trade-offs may be difficult or impossible to justify. In addition to its alignment with accuracy in expectation, Hébert-Johnson et al. [29] propose an efficient post-processing algorithm for MC similar to boosting. We discuss additional extensions to MC in Appendix A.1.
2.3 Sufficiency and Differential Calibration
MC provides a sense of fairness by approximating calibration by group, which is perfectly satisfied when $P_\mathcal{D}(y \mid R = r, x \in S) = r$ for all $S \in \mathcal{C}$. Calibration by group is closely related to the sufficiency fairness criterion [8]. Sufficiency is the condition where the outcome probability is independent of $\mathcal{C}$ conditioned on the risk score. In the binary group setting ($\mathcal{C} = \{S_i, S_j\}$), sufficiency can be expressed as $P_\mathcal{D}(y \mid R, x \in S_i) = P_\mathcal{D}(y \mid R, x \in S_j)$, or
$$\frac{P_\mathcal{D}(y \mid R, x \in S_i)}{P_\mathcal{D}(y \mid R, x \in S_j)} = 1. \qquad (1)$$
Unlike calibration by group, sufficiency does not stipulate that the risk scores be calibrated, yet from a fairness perspective, sufficiency and calibration-by-group are equivalent [8]. Consider that one can easily transform a model satisfying sufficiency into one that is calibrated-by-group with a single function $f(R) \to [0, 1]$, for example with Platt scaling [8]. In both cases, the sense of fairness stems from the desire for the risk scores, $R$, to capture everything about group membership that is relevant to predicting the outcome, $y$.
Under sufficiency, the risk score is equally informative of the outcome, regardless of group membership. In this sense, a model satisfying sufficiency provides equally trustworthy risk predictions to a decision maker, regardless of the groups to which an individual belongs.
Below, we define an approximate measure of sufficiency that constrains pairwise differentials between groups and accommodates binned predictions:
Definition 2 (Differential calibration). Let $\mathcal{C} \subseteq 2^{\mathcal{X}}$ be a collection of subsets of $\mathcal{X}$. A model $R(x)$ is $(\varepsilon, \lambda)$-differentially calibrated with respect to $\mathcal{C}$ if, across prediction intervals $I \in \Lambda_\lambda$, for all pairs $(S_i, S_j) \in \mathcal{C} \times \mathcal{C}$ for which $P_\mathcal{D}(S_i), P_\mathcal{D}(S_j) > 0$,
$$e^{-\varepsilon} \le \frac{\mathbb{E}_\mathcal{D}[y \mid R \in I,\, x \in S_i]}{\mathbb{E}_\mathcal{D}[y \mid R \in I,\, x \in S_j]} \le e^{\varepsilon}. \qquad (2)$$
By inspection we see that $\varepsilon$ in $(\varepsilon, \lambda)$-DC measures the extent to which $R$ satisfies sufficiency. That is, when $P(y \mid R \in I, x \in S_i) \approx P(y \mid R \in I, x \in S_j)$ for all pairs, $\varepsilon \to 0$. $(\varepsilon, \lambda)$-DC says that, within any bin of risk scores, the outcome $y$ is at most $e^{\varepsilon}$ times more likely among one group than another, and at minimum $e^{-\varepsilon}$ times less likely. Definition 2 fits into the general definition of a differential fairness measure proposed by Foulds et al. [22], although previously it was used to define demographic parity criteria [23]. We describe the relation in more detail in Appendix A.1.1, including Eq. (2)'s connection to differential privacy [18] and pufferfish privacy [38].
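For example, $\varepsilon = 0.1$ permits the outcome to be at most $e^{0.1} \approx 1.105$ times more likely in one group than another within a bin. An empirical analogue of Eq. (2) takes a maximum log-ratio over group pairs; a sketch under the same boolean-mask convention as above:

```python
import numpy as np

def dc_epsilon(y, r, groups, lam=0.1):
    """Smallest epsilon for which Eq. (2) holds on a sample: the largest
    log-ratio of mean outcomes between any two groups within each bin."""
    y, r = np.asarray(y), np.asarray(r)
    bins = np.floor(r / lam).astype(int)
    eps = 0.0
    for b in np.unique(bins):
        means = [y[S & (bins == b)].mean()
                 for S in groups if (S & (bins == b)).any()]
        if len(means) >= 2 and min(means) > 0:
            eps = max(eps, float(np.log(max(means) / min(means))))
    return eps
```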
2.4 The differential calibration of multicalibrated models is limited by low-risk groups

At a basic level, the form of MC and sufficiency differ: MC constrains absolute differences between groups across prediction bins, whereas sufficiency constrains pairwise differentials between groups. To reconcile MC and DC/sufficiency more formally, we pose the following question: if a model satisfies $\alpha$-MC, what, if anything, does this imply about the $\varepsilon$-DC of the model? (In Appendix A.4, Theorem 5, we answer the inverse question.) We now show that multicalibrated models have a bounded DC, but that this bound is limited by small values of $R$.
Theorem 1. Let $R(x)$ be a model satisfying $(\alpha, \lambda)$-MC on a collection of subsets $\mathcal{C} \subseteq 2^{\mathcal{X}}$. Let $r_{min} = \min_{(S, I) \in \mathcal{C} \times \Lambda_\lambda} \mathbb{E}_\mathcal{D}[R \mid R \in I, x \in S]$ be the minimum expected risk prediction among categories $(S, I) \in \mathcal{C} \times \Lambda_\lambda$. Then $R(x)$ is $(\log\frac{r_{min}+\alpha}{r_{min}-\alpha}, \lambda)$-differentially calibrated.
Proof. Let $r = \mathbb{E}_\mathcal{D}[R \mid R \in I, x \in S]$ and $p = \mathbb{E}_\mathcal{D}[y \mid R \in I, x \in S]$. $(\alpha, \lambda)$-MC guarantees that $r - \alpha \le p \le r + \alpha$ for all groups $S \in \mathcal{C}$ and prediction intervals $I \in \Lambda_\lambda$. Plugging these lower and upper bounds into Eq. (2) yields $e^{\varepsilon} \le \frac{r+\alpha}{r-\alpha}$. The maximum of this ratio, for a fixed $\alpha$, occurs at the smallest value of $r$; therefore $\varepsilon \le \log\frac{r_{min}+\alpha}{r_{min}-\alpha}$.
Theorem 1 illustrates the important point that, in terms of percentage error, MC does not provide equal protection to groups with different risk profiles. Imagine a model satisfying (0.05, 0.1)-MC for groups $S \in \mathcal{C}$. Consider individuals receiving model predictions in the interval $(0.9, 1]$. MC guarantees that, for any category $\{x : x \in S, R(x) \in I = (0.9, 1]\}$, the expected outcome prevalence ($\mathbb{E}_\mathcal{D}[y \mid x \in S, R \in I]$) is at least $0.9 - \alpha = 0.85$. This bounds the percent error among groups in the $(0.9, 1]$ prediction interval to 6%. In contrast, consider individuals for whom $R(x) \in (0.3, 0.4]$; each group may have a true outcome prevalence as low as 0.25, which is an error of 20%, about 3.4x higher than the percent error in the higher-risk group.
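The arithmetic of this example, and the corresponding DC bound from Theorem 1, can be reproduced directly:

```python
import numpy as np

alpha = 0.05
for r in (0.9, 0.3):                          # lower edges of the two intervals above
    pct_err = alpha / (r - alpha)             # worst-case percent calibration error
    eps = np.log((r + alpha) / (r - alpha))   # Theorem 1's DC bound at this r
    print(f"r={r}: percent error <= {pct_err:.1%}, eps <= {eps:.3f}")
# r=0.9: percent error <= 5.9%, eps <= 0.111
# r=0.3: percent error <= 20.0%, eps <= 0.336
```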
3 Proportional Multicalibration

We are motivated to define a measure that is efficiently learnable like MC (Definition 1) but better aligned with the fundamental fairness notion of sufficiency, like DC (Definition 2). To do so, we define PMC, a variant of MC that constrains the proportional calibration error of a model among subgroups and risk strata. In this section, we show that bounding a model's PMC is enough to meaningfully bound its DC and MC. Furthermore, we provide an efficient algorithm for satisfying PMC based on a simple extension of MC/Multiaccuracy boosting [39].
Definition 3 (Proportional Multicalibration). A model $R(x)$ is $(\alpha, \lambda)$-proportionally multicalibrated with respect to a collection of subsets $\mathcal{C}$ if, for all $S \in \mathcal{C}$ and $I \in \Lambda_\lambda$ satisfying $P_\mathcal{D}(R(x) \in I \mid x \in S) \ge \alpha\lambda$,
$$\frac{\left| \mathbb{E}_\mathcal{D}[y \mid R \in I,\, x \in S] - \mathbb{E}_\mathcal{D}[R \mid R \in I,\, x \in S] \right|}{\mathbb{E}_\mathcal{D}[y \mid R \in I,\, x \in S]} \le \alpha. \qquad (3)$$
Note that, in practice, we must ensure $\mathbb{E}_\mathcal{D}[y \mid R \in I, x \in S] \ne 0$ for Definition 3 to be defined. We handle this by introducing a parameter $\rho > 0$ constraining the lowest expected outcome among categories $(S, I)$. In the remainder of this section, we detail how PMC relates to sufficiency/DC and MC. We provide bounds on the values of MC and DC given a proportionally multicalibrated model, and we illustrate the relationship between these three metrics in Fig. 1.
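Empirically, Definition 3 differs from the MC check only in the division by each category's mean outcome and the $\rho$ floor; a sketch under the same assumptions as the earlier snippets:

```python
import numpy as np

def pmc_violation(y, r, groups, alpha=0.05, lam=0.1, rho=0.01):
    """Largest proportional calibration error over (group, bin) categories
    (cf. Definition 3); categories with mean outcome below rho are skipped."""
    y, r = np.asarray(y), np.asarray(r)
    bins = np.floor(r / lam).astype(int)
    worst = 0.0
    for S in groups:                      # S: boolean mask over individuals
        for b in np.unique(bins[S]):
            cat = S & (bins == b)
            if cat.sum() < alpha * lam * S.sum() or y[cat].mean() < rho:
                continue
            worst = max(worst, abs(y[cat].mean() - r[cat].mean()) / y[cat].mean())
    return worst
```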
Comparison to Differential Calibration Rather than constraining the differentials of prediction- and group-specific outcomes among all pairs of subgroups in $\mathcal{C} \times \mathcal{C}$ as in DC (Definition 2), PMC constrains the relative error of each group in $\mathcal{C}$. In practical terms, this makes PMC more efficient to calculate than DC by a factor of $O(|\mathcal{C}|)$ steps. In addition, PMC does not require additional assumptions about the overall calibration of a model in order to imply guarantees of MC, since PMC directly constrains calibration rather than constraining sufficiency alone.
Theorem 2. Let $R(x)$ be a model satisfying $(\alpha, \lambda)$-PMC on a collection $\mathcal{C}$. Then $R(x)$ is $(\log\frac{1+\alpha}{1-\alpha}, \lambda)$-differentially calibrated.
Proof. Let $r = \mathbb{E}_\mathcal{D}[R \mid R \in I, x \in S]$ and $p = \mathbb{E}_\mathcal{D}[y \mid R \in I, x \in S]$. If $R(x)$ satisfies $\alpha$-PMC (Definition 3), then $r/(1+\alpha) \le p \le r/(1-\alpha)$. Solving for the upper bound on $\varepsilon$-DC, we immediately have $\varepsilon \le \log\frac{r(1+\alpha)}{r(1-\alpha)} = \log\frac{1+\alpha}{1-\alpha}$.
Theorem 2 demonstrates that $\alpha$-proportionally multicalibrated models satisfy a straightforward notion of differential fairness that depends monotonically only on $\alpha$. The relationship between PMC and DC is contrasted with the relationship of MC and DC in Fig. 1, left panel. The figure illustrates how MC's sensitivity to small risk categories limits its DC.
Comparison to Multicalibration Rather than constraining the absolute difference between risk predictions and the outcome as in MC, PMC requires that the calibration error be a small fraction of the expected risk in each category $(S, I)$. In this sense, it provides a stronger protection than MC by requiring calibration error to be a small fraction regardless of the risk group. In many contexts, we would argue that this is also more aligned with the notion of fairness in risk prediction. Under MC, the underlying prevalence of an outcome within a group affects the fairness protection that is received (i.e., the percentage error that Definition 8 allows). Because underlying prevalences of many clinically relevant outcomes vary significantly among subpopulations, multicalibrated models may systematically permit higher percentage error for specific risk groups. The difference in relative calibration error among populations with different risk profiles also translates into weaker sufficiency guarantees, as demonstrated in Theorem 1. In contrast, PMC provides a fairness guarantee that is independent of subpopulation risks. In the following theorem, we show that MC is constrained when a model satisfies PMC.
Theorem 3. Let $R(x)$ be a model satisfying $\alpha$-PMC on a collection $\mathcal{C}$. Then $R(x)$ is $(\frac{\alpha}{1-\alpha})$-multicalibrated on $\mathcal{C}$.
Proof. To distinguish the parameters, let $R(x)$ be a model satisfying $\delta$-PMC. Let $r = \mathbb{E}_\mathcal{D}[R \mid R \in I, x \in S]$ and $p = \mathbb{E}_\mathcal{D}[y \mid R \in I, x \in S]$. Then $r/(1+\delta) \le p \le r/(1-\delta)$. We solve for the upper bound on $\alpha$-MC from Definition 8 for the case when $p > r$. This yields
$$\alpha \le p - r \le \frac{r}{1-\delta} - r = \frac{r\delta}{1-\delta} \le \frac{\delta}{1-\delta}.$$
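The two bounds are simple closed forms in $\alpha$ and can be tabulated directly, mirroring the curves in Fig. 1:

```python
import numpy as np

alphas = np.array([0.05, 0.1, 0.2, 0.3, 0.45])
dc_bound = np.log((1 + alphas) / (1 - alphas))   # Theorem 2: implied eps-DC
mc_bound = alphas / (1 - alphas)                 # Theorem 3: implied alpha-MC
for a, e, m in zip(alphas, dc_bound, mc_bound):
    print(f"alpha-PMC = {a:.2f}  ->  eps-DC <= {e:.3f},  alpha-MC <= {m:.3f}")
```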
The right panel of Fig. 1 illustrates this relation in comparison to the DC-MC relationship described in Appendix A.4, Theorem 5. At small values of $\varepsilon$ and $\alpha$, and when the model is perfectly calibrated overall, $\alpha$-PMC and $\varepsilon$-DC behave similarly. However, given $\delta > 0$, $\varepsilon$-differentially calibrated models suffer from higher MC error than proportionally calibrated models when $\alpha$-PMC $< 0.3$. The right graph also illustrates that the feasible range of $\alpha$ for $\alpha$-PMC is $0 < \alpha < 0.5$, past which it does not provide meaningful $\alpha$-MC. The steeper relation between $\alpha$-PMC and MC may have advantages or disadvantages, depending on context. It suggests that, by optimizing for $\alpha$-PMC, small improvements to this measure can result in relatively large improvements to MC; conversely, $\varepsilon$-DC models that are well calibrated may satisfy a lower value of $\alpha$-MC over a larger range of $\varepsilon$.
Figure 1: A comparison of $\varepsilon$-DC, $\alpha$-MC, and $\alpha$-PMC in terms of their parameters $\alpha$ and $\varepsilon$. In both panes, the x value is a given value of one metric for a model, and the y axis is the implied value of the other metric, according to Theorems 1–3 and Theorem 5. The left filled area denotes the dependence of the privacy/DC of $\alpha$-multicalibrated models on the minimum risk interval, $r_{min} \in [0.01, 1.0]$. The right filled area denotes the dependence of the MC of $\varepsilon$-differentially calibrated models on their overall calibration, $\delta \in [0.0, 0.5]$. $\alpha$-PMC does not have these sensitivities.
3.1 Learning proportionally multicalibrated predictors
So far we have demonstrated that models satisfying PMC exhibit desirable guarantees relative to two previously defined measures of fair calibration, but we have not considered whether PMC is easy to learn. Here, we answer in the affirmative by proposing Algorithm 1 to satisfy PMC and proving that it learns an $(\alpha, \lambda)$-PMC model in a polynomial number of steps.
Theorem 4. Define $\alpha, \lambda, \gamma, \rho > 0$. Let $\mathcal{C} \subseteq 2^{\mathcal{X}}$ be a collection of subsets of $\mathcal{X}$ such that, for all $S \in \mathcal{C}$, $P_\mathcal{D}(S) > \gamma$. Let $R(x)$ be a risk prediction model to be post-processed. For all $(S, I) \in \mathcal{C} \times \Lambda_\lambda$, let $\mathbb{E}[y \mid R \in I, x \in S] > \rho$. There exists an algorithm that satisfies $(\alpha, \lambda)$-PMC with respect to $\mathcal{C}$ in $O\left(\frac{|\mathcal{C}|}{\alpha^3 \lambda^2 \rho^2 \gamma}\right)$ steps.
We analyze Algorithm 1 and show it satisfies Theorem 4 in Appendix A.3. Algorithm 1 directly extends MCBoost [44], but differs in that it does not terminate until $R(x)$ is within $\alpha\bar{y}$ for all categories, as opposed to simply within $\alpha$. This more stringent threshold requires an additional $O(\frac{1}{\rho^2})$ steps, where $\rho > 0$ is a lower bound on the expected outcome within a category $(S, I)$. The parameter $\rho$ also serves to smooth empirical estimates of Eq. (3) in our experiments.
4 Experiments

In our first set of experiments, we study MC and PMC in simulated population data to understand and validate the analysis in previous sections. In the second set, we compare the performance of varied model treatments on a real-world hospital admission task, using an implementation of Algorithm 1. We make use of empirical versions of our fairness definitions, which we refer to as MC loss (Definition 9), PMC loss (Definition 10), and DC loss (Definition 11), defined in Appendix A.2.
Simulation study We simulate data from $\alpha$-multicalibrated models. For simplicity, we specify a data structure with a one-to-one correspondence between subset and model estimated risk, such that for all $x$ in subset $S$, $R(x) = R(x \mid x \in S) = R(S)$. Therefore all information for predicting the outcome based on the features in $x$ is contained in the attributes $A$ that define subgroup $S$. Prevalence is specified as $p_i = P_\mathcal{D}(y \mid x \in S_i) = 0.2 + 0.01(i - 1)$ for $i = 1, \dots, N_s$, where $N_s$ is the number of subsets $S$, defined by $A$ and indexed by $i$ with increasing $p$. For each group, $R_i = R(S_i) = R(x \mid x \in S_i) = p_i - \Delta_i$. We randomly select $\Delta_i$ for one group to be $\pm\alpha$ and, for the remaining groups, $\Delta_i = \pm\delta$, where $\delta \sim \mathrm{Uniform}(\min = 0, \max = \alpha)$. In all cases, the sign of $\Delta_i$ is determined by a random draw from a Bernoulli distribution. For these simulations we set $N_S = 61$ and $\alpha = 0.1$, such that $p_i \in [0.2, 0.8]$ and $R_i \in [0.1, 0.9]$.
Algorithm 1 Proportional Multicalibration Post-processing
Require: Predictor R(x)
1:  C ⊆ 2^X such that for all S ∈ C, P_D(S) ≥ γ
2:  α, λ, γ, ρ > 0
3:  D = {(y, x)_i}_{i=0}^{N} ∼ D
4:  function PMC(R, C, D, α, λ, γ, ρ)
5:    repeat
6:      {(y, x)} ← sample D
7:      for S ∈ C, I ∈ Λ_λ such that P_D(R ∈ I, x ∈ S) ≥ αλγ do
8:        S_r ← S ∩ {x : R(x) ∈ I}
9:        r̄ ← (1/|S_r|) Σ_{x ∈ S_r} R(x)      ▷ average group prediction
10:       ȳ ← (1/|S_r|) Σ_{x ∈ S_r} y(x)       ▷ average subgroup risk
11:       if ȳ ≤ ρ then
12:         continue
13:       ∆r ← ȳ − r̄
14:       if |∆r| ≥ αȳ then
15:         R(x) ← R(x) + ∆r for all x ∈ S_r
16:         R(x) ← squash(R(x), [0, 1])        ▷ squash updates to [0, 1]
17:     if no updates to R(x) then
18:       break
19:   return R
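The released implementation is available at https://github.com/cavalab/pmcboost. For illustration, a compact NumPy sketch of the same update loop under simplifying assumptions (full-batch updates rather than sampling, groups supplied as boolean masks):

```python
import numpy as np

def pmc_postprocess(R, y, groups, alpha=0.05, lam=0.1, rho=0.01, gamma=0.05,
                    max_iters=1000):
    """Sketch of Algorithm 1: shift each (group, bin) category's predictions
    by its mean residual until every category's calibration error is below
    alpha times its mean outcome."""
    R = np.asarray(R, dtype=float).copy()
    R = np.clip(R, 0.0, 1.0)
    y = np.asarray(y, dtype=float)
    N = len(y)
    for _ in range(max_iters):
        updated = False
        for S in groups:
            bins = np.floor(R / lam).astype(int)      # re-bin after updates
            for b in np.unique(bins[S]):
                cat = S & (bins == b)
                if cat.sum() < alpha * lam * gamma * N:
                    continue                          # category too small
                y_bar, r_bar = y[cat].mean(), R[cat].mean()
                if y_bar <= rho:
                    continue                          # outcome rate too low
                dr = y_bar - r_bar
                if abs(dr) >= alpha * y_bar:          # PMC violated here
                    R[cat] = np.clip(R[cat] + dr, 0.0, 1.0)   # update + squash
                    updated = True
        if not updated:                               # all categories satisfied
            break
    return R
```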
We generate $N_{sim} = 1000$ simulated datasets, with $n = 1000$ observations per group, and for each $S_i$, we calculate the ratio of the absolute mean error to $p_i$, i.e. the PMC loss function for this data generating mechanism.
We also simulate three specific scenarios where: 1) $|\Delta_i|$ is equivalent for all groups (Fixed); 2) $|\Delta_i|$ increases with increasing $p_i$; and 3) $|\Delta_i|$ decreases with increasing $p_i$, with $\alpha = 0.1$ in each case. These scenarios compare when $\alpha$ is determined by all groups, the group with the highest outcome prevalence, and the group with the lowest outcome prevalence, respectively.
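A sketch of this generating mechanism (the seeding and sign conventions are illustrative assumptions, not the paper's exact code):

```python
import numpy as np

rng = np.random.default_rng(42)
Ns, n, alpha = 61, 1000, 0.1
p = 0.2 + 0.01 * np.arange(Ns)              # group prevalences in [0.2, 0.8]
delta = rng.uniform(0, alpha, size=Ns)      # per-group calibration errors
delta[rng.integers(Ns)] = alpha             # one group sits at the MC limit
sign = rng.choice([-1, 1], size=Ns)         # Bernoulli draw for error direction
R = p - sign * delta                        # group-level risk predictions
y = rng.binomial(1, p, size=(n, Ns))        # n outcomes per group
pmc_loss = np.abs(y.mean(axis=0) - R) / y.mean(axis=0)   # per-group PMC loss
```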
Hospital admission Next, we test PMC alongside other methods in application to prediction of inpatient hospital admission for patients visiting the emergency department (ED). The burden of overcrowding and long wait times in EDs is significantly higher among non-white, non-Hispanic patients and socio-economically marginalized patients [32, 42]. Recent work has demonstrated risk prediction models that can expedite patient visits by predicting patient admission at an early stage of a visit with a high degree of certainty (AUC $\sim$0.9 across three large care centers) [4, 3, 5, 6]. Our goal is to ensure no group of patients will be over- or under-prioritized relative to another by these models, which could exacerbate the treatment and outcome disparities that currently exist.
We construct a prediction task similar to previous studies but using a new data resource: the MIMIC-IV-ED repository [34]. The overall intersectional demographic statistics for these data are given in Table 1, where we observe stark differences in admission rates by demographic group and gender, suggesting that a proportional measure of calibration could be appropriate for this task. We trained and evaluated logistic regression (LR) and random forest (RF) models of patient admission, with and without post-processing for MC [44] or PMC. We tested a number of parameter settings given in Table 2, running 100 trials with different shuffles of the data. Comparisons are reported on a test set of 20% of the data for each trial. Additional experiment details are available in Appendix A.7, and code for the experiments is available at https://github.com/cavalab/proportional-multicalibration. The PMC post-processing method is available as a package as well: https://github.com/cavalab/pmcboost.
5 Results

Fig. 2 shows the PMC loss of $\alpha$-multicalibrated models under the scenarios described in Section 4. Proportional MC constrains the ratio of the absolute mean error (AME) to the outcome prevalence, for groups defined by a risk interval ($R(x) \in I$) and subset within a collection of subsets ($x \in S$, $S \in \mathcal{C}$). Without the proportionality factor $|\mathbb{E}_\mathcal{D}[y \mid R \in I, x \in S]|^{-1}$, $\alpha$-multicalibrated models permit a dependence between the group prevalence and the error or privacy loss that is unfair to groups with lower outcome prevalence.
Table 1: Admission prevalence (Admissions/Total (%)) among patients in the MIMIC-IV-ED data repository, stratified by the intersection of ethnoracial group and gender.

Ethnoracial Group | F | M | Overall
American Indian/Alaska Native | 70/257 (27%) | 82/170 (48%) | 152/427 (36%)
Asian | 1043/3595 (29%) | 1032/2384 (43%) | 2075/5979 (35%)
Black/African American | 3124/27486 (11%) | 2603/14458 (18%) | 5727/41944 (14%)
Hispanic/Latino | 1063/10262 (10%) | 1168/5795 (20%) | 2231/16057 (14%)
Other | 1232/5163 (24%) | 1479/3849 (38%) | 2711/9012 (30%)
Unknown/Unable to Obtain | 1521/2156 (71%) | 2074/2377 (87%) | 3595/4533 (79%)
White | 18147/50174 (36%) | 18951/45435 (42%) | 37098/95609 (39%)
Overall | 26200/99093 (26%) | 27389/74468 (37%) | 53589/173561 (31%)
Figure 2: The relationship between MC, PMC, and outcome prevalence as illustrated via a simulation study in which the rates of the outcome are associated with group membership. Gray points denote the PMC loss of a (0.1, 0.1)-MC model on 1000 simulated datasets, and colored lines denote three specific scenarios in which each group's calibration error ($|\Delta|$) follows specific rules ($|\Delta|$ fixed across groups; $|\Delta|$ decreases with positivity rate; $|\Delta|$ increases with positivity rate). PMC loss is higher among groups with lower positivity rates in most scenarios unless the groupwise calibration error increases with positivity rate.
Results on the hospital admission prediction task are summarized in Fig. 3 and Table 5. PMC post-processing has a negligible effect on predictive performance ($<0.1\%$ AUROC, LR and RF) while reducing DC loss by 27% for LR and RF models, and reducing PMC loss by 40% and 79%, respectively. In the case of RF models, PMC post-processing reduces MC loss by 23%, a significantly larger improvement than MC post-processing itself (19%, p=9e-26).
Due to normalization by outcome rates, the optimal value of $\alpha$ for PMC is likely to differ from the best value for MC (their relationship is shown in Fig. 1). For both methods, setting $\alpha$ too small may result in over-fitting. To account for this, we quantified the number of trials for which a given method produced the best model according to a given metric, over all parameter configurations in Table 2. PMC post-processing (Algorithm 1) achieves the best fairness the highest percent of the time, according to DC loss (63%), MC loss (70%), and PMC loss (72%), while MC-postprocessed models achieve the best AUROC in 88% of cases. This provides strong evidence that, over a large range of $\alpha$ values, PMC post-processing is beneficial compared to MC post-processing.
We characterize the sensitivity of PMC and MC to $\alpha$ and provide a more detailed breakdown of these results in Appendix A.7.
6 Discussion and Conclusion

In this paper we have analyzed multicalibration through the lens of sufficiency and differential calibration to reveal the sensitivity of this metric to correlations between outcome rates and group membership. We have proposed a measure, PMC, that alleviates this sensitivity and attempts to capture the “best of both worlds” of MC and DC. PMC provides equivalent percentage calibration protections to groups regardless of their risk profiles, and in so doing, bounds a model's differential calibration. We provide an efficient algorithm for learning PMC predictors by post-processing a given risk prediction model. On a real-world and clinically relevant task (admission prediction), we have shown that post-processing LR and RF models with PMC leads to better performance across all three fairness metrics, with little to no impact on predictive performance.
Table 2: Parameters for the hospital admission prediction experiment.

Parameter | Values
α | (0.001, 0.01, 0.05, 0.1)
γ | (0.05, 0.1)
λ | 0.1
ρ | (0.001, 0.01)
Model | LR, RF
Groups | [(race/ethnicity, gender), (race/ethnicity, gender, insurance product)]
Table 3: The number of times each post-processing method achieved the best score among all methods, out of 100 trials.

metric | Base Model | MC | PMC
AUROC | 5 | 88 | 6
DC loss | 0 | 36 | 63
MC loss | 8 | 21 | 70
PMC loss | 0 | 27 | 72
Figure 3: A comparison of LR and RF models, with and without MC and PMC post-processing, on the hospital admission task. From left to right, trained models are compared in terms of test set AUROC, MC loss, PMC loss, and DC loss. Points represent the median performance over 100 shuffled train/test splits with bootstrapped 99% confidence intervals. We test for significant differences between post-processing methods using two-sided Wilcoxon rank-sum tests with Bonferroni correction. ns: p ≤ 1; **: 1e-03 < p ≤ 1e-02; ***: 1e-04 < p ≤ 1e-03; ****: p ≤ 1e-04.
Our preliminary analysis suggests PMC can be a valuable metric for training fair algorithms in resource allocation contexts. Future work could extend this analysis on both the theoretical and practical sides. On the theoretical side, the generalization properties of the PMC measure should be established and its sample complexity quantified, as Rose [48] did with MC. Additional extensions of PMC could establish a bound on the accuracy of PMC-postprocessed models in a similar vein to work by Kim, Ghorbani, and Zou [39] and Hébert-Johnson et al. [30]. On the empirical side, future work should benchmark PMC on a larger set of real-world problems and explore use cases in more depth.
References

[1] Deepshikha Charan Ashana et al. “Equitably Allocating Resources during Crises: Racial Differences in Mortality Prediction Models”. In: American Journal of Respiratory and Critical Care Medicine 204.2 (2021), pp. 178–186.
[2] Nikhil Bansal and Anupam Gupta. “Potential-Function Proofs for Gradient Methods”. In: Theory of Computing 15.1 (2019), pp. 1–32.
[3] Yuval Barak-Corren, Andrew M. Fine, and Ben Y. Reis. “Early Prediction Model of Patient Hospitalization From the Pediatric Emergency Department”. In: Pediatrics 139.5 (May 2017). ISSN: 1098-4275. DOI: 10.1542/peds.2016-2785.
[4] Yuval Barak-Corren, Shlomo Hanan Israelit, and Ben Y. Reis. “Progressive Prediction of Hospitalisation in the Emergency Department: Uncovering Hidden Patterns to Improve Patient Flow”. In: Emergency Medicine Journal 34.5 (May 2017), pp. 308–314. ISSN: 1472-0205, 1472-0213. DOI: 10.1136/emermed-2014-203819.
[5] Yuval Barak-Corren et al. “Prediction across Healthcare Settings: A Case Study in Predicting Emergency Department Disposition”. In: npj Digital Medicine 4.1 (Dec. 2021), pp. 1–7. ISSN: 2398-6352. DOI: 10.1038/s41746-021-00537-x.
[6] Yuval Barak-Corren et al. “Prediction of Patient Disposition: Comparison of Computer and Human Approaches and a Proposed Synthesis”. In: Journal of the American Medical Informatics Association 28.8 (July 2021), pp. 1736–1745. ISSN: 1527-974X. DOI: 10.1093/jamia/ocab076.
[7] Noam Barda et al. “Addressing Bias in Prediction Models by Improving Subpopulation Calibration”. In: Journal of the American Medical Informatics Association 28.3 (Mar. 2021), pp. 549–558. ISSN: 1527-974X. DOI: 10.1093/jamia/ocaa283.
[8] Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning. fairmlbook.org, 2019.
[9] Richard Berk et al. “A Convex Framework for Fair Regression”. In: arXiv preprint arXiv:1706.02409 (2017).
[10] Alessandro Castelnovo et al. “The Zoo of Fairness Metrics in Machine Learning”. In: arXiv:2106.00467 [cs, stat] (June 2021).
[11] Alexandra Chouldechova. “Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments”. In: Big Data 5.2 (June 2017), pp. 153–163. ISSN: 2167-647X. DOI: 10.1089/big.2016.0047.
[12] Alexandra Chouldechova and Aaron Roth. “The Frontiers of Fairness in Machine Learning”. In: arXiv:1810.08810 [cs, stat] (Oct. 2018).
[13] Patricia Hill Collins. Black Feminist Thought: Knowledge, Consciousness, and the Politics of Empowerment. 1st ed. Routledge, Sept. 1990.
[14] Patricia Hill Collins and Sirma Bilge. Intersectionality. John Wiley & Sons, 2020. ISBN: 1-5095-3969-7.
[15] Kimberle Crenshaw. “Demarginalizing the Intersection of Race and Sex: A Black Feminist Critique of Antidiscrimination Doctrine, Feminist Theory and Antiracist Politics”. In: University of Chicago Legal Forum 1989.1 (1989), p. 31.
[16] Kimberle Crenshaw. “Mapping the Margins: Intersectionality, Identity Politics, and Violence against Women of Color”. In: Stanford Law Review 43.6 (July 1991), p. 1241. ISSN: 00389765. DOI: 10.2307/1229039. URL: https://www.jstor.org/stable/1229039?origin=crossref.
[17] Michael M. Dinh and Saartje Berendsen Russell. “Overcrowding Kills: How COVID-19 Could Reshape Emergency Department Patient Flow in the New Normal”. In: Emergency Medicine Australasia 33.1 (2021), pp. 175–177. ISSN: 1742-6723. DOI: 10.1111/1742-6723.13700.
[18] Cynthia Dwork and Jing Lei. “Differential Privacy and Robust Statistics”. In: Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing. 2009, pp. 371–380.
[19] Cynthia Dwork et al. “Fairness Through Awareness”. In: Proceedings of the 3rd Innovations in Theoretical Computer Science Conference. ITCS '12. New York, NY, USA: ACM, 2012, pp. 214–226. ISBN: 978-1-4503-1115-1. DOI: 10.1145/2090236.2090255.
[20] Cynthia Dwork et al. “Learning from Outcomes: Evidence-Based Rankings”. In: 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS). Nov. 2019, pp. 106–125. DOI: 10.1109/FOCS.2019.00016.
[21] James Foulds et al. “An Intersectional Definition of Fairness”. In: arXiv:1807.08362 [cs, stat] (Sept. 2019). URL: http://arxiv.org/abs/1807.08362.
[22] James Foulds et al. “An Intersectional Definition of Fairness”. In: arXiv:1807.08362 [cs, stat] (Sept. 2019).
[23] James R. Foulds and Shimei Pan. “Are Parity-Based Notions of AI Fairness Desirable?” In: Data Engineering (2020), p. 51.
[24] Parikshit Gopalan et al. “Low-Degree Multicalibration”. In: arXiv:2203.01255 [cs] (Mar. 2022).
[25] Varun Gupta et al. Online Multivalid Learning: Means, Moments, and Prediction Intervals. Jan. 2021. DOI: 10.48550/arXiv.2101.01739.
[26] Alex Hanna et al. “Towards a Critical Race Methodology in Algorithmic Fairness”. In: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 2020, pp. 501–512.
[27] Moritz Hardt et al. “Equality of Opportunity in Supervised Learning”. In: Advances in Neural Information Processing Systems 29. Ed. by D. D. Lee et al. Curran Associates, Inc., 2016, pp. 3315–3323.
[28] Ursula Hebert-Johnson et al. “Multicalibration: Calibration for the (Computationally-Identifiable) Masses”. In: Proceedings of the 35th International Conference on Machine Learning. PMLR, July 2018, pp. 1939–1948.
[29] Úrsula Hébert-Johnson et al. “Calibration for the (Computationally-Identifiable) Masses”. In: arXiv:1711.08513 [cs, stat] (Mar. 2018).
[30] Úrsula Hébert-Johnson et al. “Multicalibration: Calibration for the (Computationally-Identifiable) Masses”. In: (), p. 10.
[31] Katharine E. Henry et al. “A Targeted Real-Time Early Warning Score (TREWScore) for Septic Shock”. In: Science Translational Medicine 7.299 (Aug. 2015), 299ra122. ISSN: 1946-6242. DOI: 10.1126/scitranslmed.aab3719.
[32] Catherine A. James, Florence T. Bourgeois, and Michael W. Shannon. “Association of Race/Ethnicity with Emergency Department Wait Times”. In: Pediatrics 115.3 (Mar. 2005), e310–e315. ISSN: 0031-4005. DOI: 10.1542/peds.2004-1541.
[33] Heinrich Jiang and Ofir Nachum. Identifying and Correcting Label Bias in Machine Learning. Jan. 2019. DOI: 10.48550/arXiv.1901.04966. arXiv: 1901.04966 [cs, stat].
[34] Alistair Johnson et al. MIMIC-IV-ED. 2021. DOI: 10.13026/77Z6-9W59.
[35] Christopher Jung et al. “Moment Multicalibration for Uncertainty Estimation”. In: Proceedings of Thirty Fourth Conference on Learning Theory. PMLR, July 2021, pp. 2634–2678.
[36] Nathan Kallus, Xiaojie Mao, and Angela Zhou. “Assessing Algorithmic Fairness with Unobserved Protected Class Using Data Combination”. In: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. FAT* '20. Barcelona, Spain: Association for Computing Machinery, Jan. 2020, p. 110. ISBN: 978-1-4503-6936-7. DOI: 10.1145/3351095.3373154.
[37] Michael Kearns et al. “Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness”. In: arXiv:1711.05144 [cs] (Dec. 2018).
[38] Daniel Kifer and Ashwin Machanavajjhala. “Pufferfish: A Framework for Mathematical Privacy Definitions”. In: ACM Transactions on Database Systems (TODS) 39.1 (2014), pp. 1–36.
[39] Michael P. Kim, Amirata Ghorbani, and James Zou. “Multiaccuracy: Black-box Post-Processing for Fairness in Classification”. In: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society. 2019, pp. 247–254.
[40] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. “Inherent Trade-Offs in the Fair Determination of Risk Scores”. In: arXiv preprint arXiv:1609.05807 (2016).
[41] Elaine Ku et al. “Racial Disparities in Eligibility for Preemptive Waitlisting for Kidney Transplantation and Modification of eGFR Thresholds to Equalize Waitlist Time”. In: Journal of the American Society of Nephrology 32.3 (2021), pp. 677–685.
[42] Erica J. McDonald, Matthew Quick, and Mark Oremus. “Examining the Association between Community-Level Marginalization and Emergency Room Wait Time in Ontario, Canada”. In: Healthcare Policy 15.4 (May 2020), pp. 64–76. ISSN: 1715-6572. DOI: 10.12927/hcpol.2020.26223.
[43] Ziad Obermeyer et al. “Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations”. In: Science 366.6464 (Oct. 2019), pp. 447–453. ISSN: 0036-8075, 1095-9203. DOI: 10.1126/science.aax2342.
[44] Florian Pfisterer et al. “Mcboost: Multi-Calibration Boosting for R”. In: Journal of Open Source Software 6.64 (2021), p. 3453.
[45] Geoff Pleiss et al. “On Fairness and Calibration”. In: Advances in Neural Information Processing Systems 30. Ed. by I. Guyon et al. Curran Associates, Inc., 2017, pp. 5680–5689.
[46] Elisabeth D. Riviello et al. “Assessment of a Crisis Standards of Care Scoring System for Resource Prioritization and Estimated Excess Mortality by Race, Ethnicity, and Socially Vulnerable Area During a Regional Surge in COVID-19”. In: JAMA Network Open 5.3 (Mar. 2022), e221744. ISSN: 2574-3805. DOI: 10.1001/jamanetworkopen.2022.1744.
[47] Dorothy Roberts. Fatal Invention: How Science, Politics, and Big Business Re-create Race in the Twenty-First Century. New Press/ORIM, 2011. ISBN: 1-59558-691-1.
[48] Sherri Rose. “Machine Learning for Prediction in Electronic Health Data”. In: JAMA Network Open 1.4 (2018), e181404.
[49] Erin M. Schnellinger et al. “Mitigating Selection Bias in Organ Allocation Models”. In: BMC Medical Research Methodology 21.1 (2021), pp. 1–9.
[50] Paula Tanabe et al. “Reliability and Validity of Scores on The Emergency Severity Index Version 3”. In: Academic Emergency Medicine 11.1 (Jan. 2004), pp. 59–65. ISSN: 1069-6563. DOI: 10.1197/j.aem.2003.06.013.
[51] Leila R. Zelnick et al. “Association of the Estimated Glomerular Filtration Rate with vs without a Coefficient for Race with Time to Eligibility for Kidney Transplant”. In: JAMA Network Open 4.1 (2021), e2034004.
A Appendix
In this appendix, we include additional comparisons to related work, additional definitions, proofs of the theorems in the main text, and additional experimental details. The code to reproduce the figures and experiments is available at https://github.com/cavalab/proportional-multicalibration.
A.1 Related Work
Definitions of Fairness There are myriad ways to measure fairness, covered in more detail in other works [8, 12, 10]. We briefly review three notions here. The first, demographic parity, requires the model's predictions to be independent of patient demographics ($A$). Although a model satisfying demographic parity can be desirable when the outcome should be unrelated to sensitive attributes [23], it can be unfair if important risk factors for the outcome are associated with those attributes [27]. For example, it may be more fair to admit socially marginalized patients to a hospital at a higher rate if they are assessed as less able to manage their care at home. Furthermore, if the underlying rates of illness vary demographically, requiring demographic parity can result in healthier patients from one group being admitted more often than patients who urgently need care.
When the base rates of admission are expected to differ demographically, we can instead ask that the model's errors be balanced across groups. One such notion is equalized odds, which states that for a given $Y$, the model's predictions should be independent of $A$. Satisfying equalized odds is equivalent to having equal FPR and FNR for every group in $A$.
When the model is used for patient risk stratification, as in the target use case in this paper, it is important to consider a model's calibration for each demographic group in the data. Because risk prediction models influence who is prioritized for care, an unfairly calibrated model can systematically under-predict risk for certain demographic groups and result in under-allocation of patient care to those groups. Thus, guaranteeing group-wise calibration via an approach such as multicalibration also guarantees fair patient prioritization for health care provision. In some contexts, risk predictions are not directly interpreted, but only used to rank patients, which can be sufficient for resource allocation. Authors have proposed various ways of measuring the fairness of model rankings, for example by comparing AUROC between groups [36].
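As a minimal instance of such a ranking audit (group masks are assumed given; this sketch is not a method from [36]):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def groupwise_auroc(y, r, groups):
    """AUROC computed separately for each group mask; groups containing
    a single outcome class are skipped."""
    y, r = np.asarray(y), np.asarray(r)
    return [roc_auc_score(y[S], r[S]) for S in groups
            if len(np.unique(y[S])) == 2]
```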
Approaches to Fairness Many approaches to achieving fairness guarantees according to demographic parity, equalized odds, and their relaxations have been proposed [19, 27, 9, 33, 37]. When choosing an approach, it is important to carefully weigh the relative impact of false positives, false negatives, and miscalibration on patient outcomes, which differ by use case. When group base rates differ (i.e., group-specific positivity rates), equalized odds and calibration by group cannot both be satisfied [40]. Instead, one can often satisfy multicalibration while satisfying relaxations of equalized odds such as equalized accuracy, where $\mathrm{Accuracy} = \mu \cdot TPR + (1 - \mu)(1 - FPR)$ for a group with base rate $\mu$. However, to do so requires degrading the performance of the model on specific groups [11, 45], which is unethical in our context.
As mentioned in the introduction, we are also motivated to utilize approaches to fairness that 1) dovetail well with intersectionality theory, and 2) provide privacy guarantees. Most work in the computer science/machine learning space does not engage with the broader literature on socio-cultural concepts like intersectionality, which we see as a gap that makes adoption in real-world settings difficult [26]. One exception to this statement is differential fairness [22], a measure designed with intersectionality in mind. In addition to being a definition of fairness that provides equal protection to groups defined by intersections of protected attributes, models satisfying $\varepsilon$-differential fairness also satisfy $\varepsilon$-pufferfish privacy. This privacy guarantee is very desirable in risk prediction contexts, because it limits the extent to which the model reveals sensitive information to a decision maker that has the potential to influence their interpretation of the model's recommendation. However, prior work on differential fairness has been limited to using it to control for demographic parity, which is not an appropriate fairness measure for our use case [23].
Multicalibration has inspired several extensions, including relaxations such as multiaccuracy [39], low-degree multicalibration [24], and extensions to conformal prediction and online learning [35, 25]. Noting that multicalibration is a guarantee over mean predictions on a collection of groups $\mathcal{C}$, [35] propose to extend multicalibration to higher-order moments (e.g., variances), which allows one to estimate a confidence interval for the calibration error for each category. [25] extend this idea and generalize it to the online learning context, in which an adversary chooses a sequence of examples for which one wishes to quantify the uncertainty of different statistics of the predictions. Recent work has also utilized higher order moments to “interpolate” between the guarantees provided by multiaccuracy, which only requires accuracy in expectation for groups in $\mathcal{C}$, and multicalibration, which requires accuracy in expectation at each prediction interval [39]. Like proportional multicalibration (Definition 3), definitions of multicalibration for higher order moments provide additional criteria for quantifying model performance over many groups; in general, however, much of the focus in other work is on statistics for uncertainty estimation. Like these works, one may view our proposal for proportional multicalibration as an alternative definition of what it means to be multicalibrated. The key difference is that proportional multicalibration measures the degree to which multicalibration depends on differences in outcome prevalence between groups, and in doing so provides guarantees of pufferfish privacy and differential calibration.

[20] study the relation of fair rankings to multicalibration, and, in a similar vein to differential fairness measures, formulate a fairness measure for group rankings using the relations between pairs of groups. However, these definitions are specific to the ranking relation between the groups, whereas differential calibration cares only about the outcome differential (conditioned on model predictions) between pairs of groups.
A.1.1 Differential Fairness
DF was explicitly defined to be consistent with the social theoretical framework of intersectionality. This framework dates back as early as the social movements of the '60s and '70s [14] and was brought into the academic mainstream by pioneering work from legal scholar Kimberlé Crenshaw [15, 16] and sociologist Patricia Hill Collins [13]. Central to intersectionality is the idea that hierarchies of power and oppression are structural elements fundamental to our society. Through an intersectional lens, these power structures are viewed as interacting and co-constituted, inextricably related to one another. To capture this viewpoint, DF [22] constrains the differential of a general data mechanism among all pairs of groups, where groups are explicitly defined as the intersections of protected attributes in $A$.
Definition 4 ($\varepsilon$-differential fairness [22]). Let $\Theta$ denote a set of distributions and let $x \sim \theta$ for $\theta \in \Theta$. A mechanism $M(x)$ is $\varepsilon$-differentially fair with respect to $(\mathcal{C}, \Theta)$ if, for all $\theta \in \Theta$ with $x \sim \theta$, and $m \in \mathrm{Range}(M)$, for all $(S_i, S_j) \in \mathcal{C} \times \mathcal{C}$ where $P(S_i \mid \theta) > 0$, $P(S_j \mid \theta) > 0$,
$$e^{-\varepsilon} \le \frac{P_{M,\theta}(M(x) = m \mid S_i, \theta)}{P_{M,\theta}(M(x) = m \mid S_j, \theta)} \le e^{\varepsilon}. \qquad (4)$$
Definition 5 (Pufferfish Privacy). Let the collection of subsets $\mathcal{C}$ represent sets of secrets. A mechanism $M(x)$ is $\varepsilon$-pufferfish private [38] with respect to $(\mathcal{C}, \Theta)$ if for all $\theta \in \Theta$ with $x \sim \theta$, for all secret pairs $(S_i, S_j) \in \mathcal{C} \times \mathcal{C}$ and $y \in \mathrm{Range}(M)$,
$$e^{-\varepsilon} \le \frac{P_{M,\theta}(M(x) = y \mid S_i, \theta)}{P_{M,\theta}(M(x) = y \mid S_j, \theta)} \le e^{\varepsilon}, \qquad (5)$$
when $S_i$ and $S_j$ are such that $P(S_i \mid \theta) > 0$, $P(S_j \mid \theta) > 0$.
Note on pufferfish and differential privacy Although Eq. (4) is notable in its similarity to differential privacy [18], they differ in important ways. Differential privacy aims to limit the amount of information learned about any one individual in a database by computations performed on the data (e.g. $M(x)$). Pufferfish privacy only limits information learned about the group membership of individuals as defined by $\mathcal{C}$. [38] describe in detail the conditions under which these privacy frameworks are equivalent.
Efficiency Property [22] also define an interesting property of $\varepsilon$-differential fairness that allows guarantees of higher order (i.e., marginal) groups to be met for free; the property is given in Definition 6.
Definition 6 (Efficiency Property [22]). Let $M(x)$ be an $\varepsilon$-differentially fair mechanism with respect to $(\mathcal{C}, \Theta)$. Let the collection of subsets $\mathcal{C}$ group individuals according to the Cartesian product of attributes $A \in \mathcal{A}$. Let $\mathcal{G}$ be any collection of subsets that groups individuals by the Cartesian product of attributes in $\mathcal{A}'$, where $\mathcal{A}' \subseteq \mathcal{A}$ and $\mathcal{A}' \ne \emptyset$. Then $M(x)$ is $\varepsilon$-differentially fair in $(\mathcal{G}, \Theta)$.
The authors call this the “intersectionality property”, yet its implication is the opposite: if a model satisfies $\varepsilon$-DF for the low level (i.e. intersectional) groups in $\mathcal{C}$, then it satisfies $\varepsilon$-DF for every higher-level (i.e. marginal) group. For example, if a model is $\varepsilon$-differentially fair for intersectional groupings of individuals by race and sex, then it is $\varepsilon$-DF for the higher-level race and sex groupings as well. Whereas the number of intersections grows exponentially as additional attributes are protected [37], the number of total possible subgroupings grows at a larger combinatorial rate: for $p$ protected attributes, we have $\sum_{k=1}^{p} \binom{p}{k} m_a^k$ groups, where $m_a$ is the number of levels of attribute $a$.
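For intuition, this count can be evaluated directly; a hypothetical helper assuming every attribute has the same number of levels $m$:

```python
from math import comb

def n_subgroupings(p, m):
    """Groupings formed by Cartesian products over every non-empty subset
    of p protected attributes, each with m levels: sum_k C(p, k) * m**k."""
    return sum(comb(p, k) * m**k for k in range(1, p + 1))

# 3 binary attributes yield 2**3 = 8 intersectional groups,
# but 26 possible subgroupings in total.
print(n_subgroupings(3, 2))   # -> 26
```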
Limitations To date, analysis of DF for predictive modeling has been limited to defining $R(x)$ as the mechanism, which is akin to asking for demographic parity. Under demographic parity, one requires that model predictions be independent from group membership entirely, and this limits its utility as a fairness notion. Although a model satisfying demographic parity can be desirable when the outcome should be unrelated to $\mathcal{C}$ [23], it can be unfair if important risk factors for the outcome are associated with demographics [27]. For example, if the underlying rates of an illness vary demographically, requiring demographic parity can result in healthier patients from one group being admitted more often than patients who urgently need care.
A.2 Additional Definitions
Definition 7 ($\alpha$-calibration [29]). Let $S \subseteq \mathcal{X}$. For $\alpha \in [0, 1]$, $R$ is $\alpha$-calibrated with respect to $S$ if there exists some $S' \subseteq S$ with $|S'| \ge (1 - \alpha)|S|$ such that for all $r \in [0, 1]$,
$$\left| \mathbb{E}_\mathcal{D}[y \mid R(x) = r,\, x \in S'] - r \right| \le \alpha.$$

Definition 8 ($\alpha$-MC [29]). Let $\mathcal{C} \subseteq 2^{\mathcal{X}}$ be a collection of subsets of $\mathcal{X}$, $\alpha \in [0, 1]$. A predictor $R$ is $\alpha$-multicalibrated on $\mathcal{C}$ if for all $S \in \mathcal{C}$, $R$ is $\alpha$-calibrated with respect to $S$.
We note that, according to Definition 7, a model need only be calibrated over a sufficiently large subset of each group ($S'$) in order to satisfy the definition. This relaxation is used to maintain a satisfactory definition of MC when working with discretized predictions. That is, with Definition 7, [29] show that $(\alpha, \lambda)$-multicalibrated models are at most $2\alpha$-multicalibrated.
A.2.1 Loss functions
The following loss functions are empirical analogs of the definitions of
MC
,
P M C
, and
DC
, and
are used in the experiment section to measure performance.
Definition 9
(MC loss)
.
Let
D={(y, x)i}N
i=0 D
, and let
α, λ, γ > 0
. Define a collection
of subsets
C 2X
such that for all
S C,|S| γN
. Let
SI={x:R(x)I, x S}
for
(S, I) C × Λλ
. Define the collection
S
containing all
SI
satisfying
SIαλN
. The MC loss of a
model R(x)on Dis
max
SI∈S
1
|SI|
X
iSI
yiX
iSI
Ri
Definition 10
(PMC loss)
.
Let
D={(y, x)i}N
i=0 D
, and let
α, λ, γ, ρ > 0
. Define a collection
of subsets
C 2X
such that for all
S C,|S| γN
. Let
SI={x:R(x)I, x S}
for
(S, I)
C × Λλ
. Define the collection
S
containing all
SI
satisfying
SIαλN
. Let
1
|SI|PiSIyiρ
. The
PMC loss of a model R(x)on Dis
max
SI∈S
PiSIyiPiSIRi
PiSIyi
Definition 11 (DC loss). Let $D = \{(y, x)_i\}_{i=0}^{N} \sim \mathcal{D}$, and let $\alpha, \lambda, \gamma > 0$. Define a collection of subsets $\mathcal{C} \subseteq 2^{\mathcal{X}}$ such that for all $S \in \mathcal{C}$, $|S| \geq \gamma N$. Given a risk model $R(x)$ and prediction intervals $I$, let $S_I = \{x : R(x) \in I, x \in S\}$ for $(S, I) \in \mathcal{C} \times \Lambda_\lambda$. Define the collection $\mathcal{S}$ containing all $S_I$ satisfying $|S_I| \geq \alpha \lambda N$. The DC loss of a model $R(x)$ on $D$ is
$$\max_{(S_I^a, S_I^b) \in \mathcal{S} \times \mathcal{S}} \log \left( \frac{\frac{1}{|S_I^a|} \sum_{i \in S_I^a} y_i}{\frac{1}{|S_I^b|} \sum_{j \in S_I^b} y_j} \right).$$
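To make Definitions 9-11 concrete, the following is a minimal sketch of how these losses could be computed empirically on a finite sample. All names are illustrative, and this is not the authors' released implementation (see Appendix A.6 for the code link).

```python
# A minimal sketch of the empirical MC, PMC, and DC losses in
# Definitions 9-11. All names are illustrative; this is not the
# authors' released implementation.
import numpy as np

def _categories(y, r, groups, n_bins, alpha, gamma):
    """Yield (bin, y_SI, r_SI) for each group x prediction-bin
    category S_I that is large enough to be scored."""
    N, lam = len(y), 1.0 / n_bins
    bins = np.minimum((r * n_bins).astype(int), n_bins - 1)  # Lambda_lambda
    for g in np.unique(groups):
        in_g = groups == g
        if in_g.sum() < gamma * N:            # groups must satisfy |S| >= gamma*N
            continue
        for b in range(n_bins):
            idx = in_g & (bins == b)
            if idx.sum() >= alpha * lam * N:  # categories must satisfy |S_I| >= alpha*lambda*N
                yield b, y[idx], r[idx]

def mc_loss(y, r, groups, n_bins=10, alpha=0.01, gamma=0.05):
    # max absolute calibration error over categories (Definition 9)
    return max((abs(ys.mean() - rs.mean())
                for _, ys, rs in _categories(y, r, groups, n_bins, alpha, gamma)),
               default=0.0)

def pmc_loss(y, r, groups, n_bins=10, alpha=0.01, gamma=0.05, rho=0.01):
    # max *percent* calibration error over categories with base rate >= rho (Definition 10)
    return max((abs(ys.sum() - rs.sum()) / ys.sum()
                for _, ys, rs in _categories(y, r, groups, n_bins, alpha, gamma)
                if ys.mean() >= rho),
               default=0.0)

def dc_loss(y, r, groups, n_bins=10, alpha=0.01, gamma=0.05):
    # max log ratio of outcome rates between pairs of categories
    # within the same prediction bin (Definition 11)
    cats = list(_categories(y, r, groups, n_bins, alpha, gamma))
    return max((abs(np.log(ya.mean() / yb.mean()))
                for ba, ya, _ in cats for bb, yb, _ in cats
                if ba == bb and ya.mean() > 0 and yb.mean() > 0),
               default=0.0)
```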
A.3 Theorem Proofs
Theorem 4. Define $\alpha, \lambda, \gamma, \rho > 0$. Let $\mathcal{C} \subseteq 2^{\mathcal{X}}$ be a collection of subsets of $\mathcal{X}$ such that, for all $S \in \mathcal{C}$, $P_{\mathcal{D}}(S) > \gamma$. Let $R(x)$ be a risk prediction model to be post-processed. For all $(S, I) \in \mathcal{C} \times \Lambda_\lambda$, let $\mathbb{E}[y \mid R \in I, x \in S] > \rho$. There exists an algorithm that satisfies $(\alpha, \lambda)$-PMC with respect to $\mathcal{C}$ in $O\!\left(\frac{|\mathcal{C}|}{\alpha^3 \lambda^2 \rho^2 \gamma}\right)$ steps.
Proof. We show that Algorithm 1 converges using a potential function argument [2], similar to the proof techniques for the MC boosting algorithms in [28, 39]. Let $p_i^*$ be the underlying risk, $R_i$ be our initial model, and $R_i'$ be our updated prediction model for individual $i \in S_r$, where $S_r = \{x \mid x \in S, R(x) \in I\}$ and $(S, I) \in \mathcal{C} \times \Lambda_\lambda$. We use $p^*$, $R$, and $R'$ without subscripts to denote these values over $S_r$. We cannot easily construct a potential argument using progress towards $(\alpha, \lambda)$-PMC, since its derivative is undefined at $\mathbb{E}_{\mathcal{D}}[y \mid R \in I, x \in S] = 0$. Instead, we analyze progress in terms of the difference in the squared $\ell_2$ norm at each step.
$$\begin{aligned}
\|p^* - R\| - \|p^* - R'\| &= \sum_{i \in S_r} (p_i^* - R_i)^2 - \sum_{i \in S_r} \big(p_i^* - \mathrm{squash}(R_i + \Delta r)\big)^2 \\
&\geq \sum_{i \in S_r} \Big[ (p_i^* - R_i)^2 - \big(p_i^* - (R_i + \Delta r)\big)^2 \Big] \\
&= \sum_{i \in S_r} \big[ 2 p_i^* \Delta r - 2 R_i \Delta r - \Delta r^2 \big] \\
&= 2 \Delta r \sum_{i \in S_r} (p_i^* - R_i) - |S_r| \Delta r^2 \qquad (6)
\end{aligned}$$
From Algorithm 1 we have
$$\Delta r = \frac{1}{|S_r|} \sum_{i \in S_r} (p_i^* - R_i).$$
Substituting into Eq. (6) gives
$$\|p^* - R\| - \|p^* - R'\| \geq |S_r| \Delta r^2.$$
We know that $|S_r| \geq \alpha \lambda \gamma N$, and that the smallest update $\Delta r$ is $\alpha \rho$. Thus,
$$\|p^* - R\| - \|p^* - R'\| \geq \alpha^3 \rho^2 \lambda \gamma N.$$
Since our initial loss, $\|p^* - R\|$, is at most $N$, Algorithm 1 converges in at most $O\!\left(\frac{1}{\alpha^3 \rho^2 \lambda \gamma}\right)$ updates for category $S_r$.
To understand the total number of steps, including those without updates, we consider the worst case, in which only a single category $S_r$ is updated in a cycle of the for loop (if no updates are made, the algorithm exits). Since each repeat consists of at most $|\mathcal{C}| \cdot |\Lambda_\lambda| = |\mathcal{C}| / \lambda$ loop iterations, this results in $O\!\left(\frac{|\mathcal{C}|}{\alpha^3 \lambda^2 \rho^2 \gamma}\right)$ total steps.
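For intuition, the following is a minimal sketch of a post-processing loop consistent with the update rule $\Delta r$ analyzed above, with observed outcomes standing in for the underlying risk $p^*$. The squash step, the violation test, and all names are assumptions rather than the authors' released Algorithm 1.

```python
# A minimal sketch of PMC post-processing consistent with the update
# Delta_r = mean(p* - R) over S_r analyzed in the proof above, using
# observed outcomes y in place of the underlying risk p*. The squash
# step, the violation test, and all names are assumptions.
import numpy as np

def squash(r, eps=1e-6):
    # Assumed: project updated scores back into (0, 1).
    return np.clip(r, eps, 1.0 - eps)

def pmc_postprocess(y, r, groups, n_bins=10, alpha=0.05,
                    gamma=0.05, rho=0.01, max_repeats=1000):
    r, N, lam = r.copy(), len(y), 1.0 / n_bins
    for _ in range(max_repeats):
        updated = False
        bins = np.minimum((r * n_bins).astype(int), n_bins - 1)
        for g in np.unique(groups):                 # cycle over C x Lambda_lambda
            in_g = groups == g
            if in_g.sum() < gamma * N:
                continue
            for b in range(n_bins):
                idx = in_g & (bins == b)
                if idx.sum() < alpha * lam * N or y[idx].mean() < rho:
                    continue
                dr = y[idx].mean() - r[idx].mean()  # Delta_r for category S_r
                if abs(dr) / y[idx].mean() > alpha:  # percent calibration error > alpha
                    r[idx] = squash(r[idx] + dr)
                    updated = True
        if not updated:  # no violated category: (alpha, lambda)-PMC holds on this data
            return r
    return r
```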
A.4 Additional Theorems
A.4.1 Differentially calibrated models with global calibration are multicalibrated
Here we show that, under the assumption that a model is globally calibrated (satisfies $\delta$-calibration), models satisfying $\varepsilon$-DC are also multicalibrated.
Theorem 5. Let $R(x)$ be a model satisfying $(\varepsilon, \lambda)$-DC and $\delta$-calibration. Then $R(x)$ is $(1 - e^{-\varepsilon} + \delta, \lambda)$-multicalibrated.
Proof. From Eq. (2) we observe that $\varepsilon$ is bounded by the two groups with the largest and smallest group- and prediction-specific probabilities of the outcome. Let $I_M$ be the risk stratum maximizing $(\varepsilon, \lambda)$-DC, and let $p_n = \max_{S \in \mathcal{C}} P_{\mathcal{D}}(y \mid R \in I_M, x \in S)$ and $p_d = \min_{S \in \mathcal{C}} P_{\mathcal{D}}(y \mid R \in I_M, x \in S)$. These groups determine the upper and lower bounds of $\varepsilon$ as $e^{-\varepsilon} \leq p_d / p_n$ and $p_n / p_d \leq e^{\varepsilon}$.

We note that $p_d \leq P_{\mathcal{D}}(y \mid R \in I_M) \leq p_n$, since $P(y \mid R \in I_M) = \frac{1}{N} \sum_{S \in \mathcal{C}} |S| \, P_{\mathcal{D}}(y \mid R \in I_M, x \in S)$, and $p_n$ and $p_d$ are the extreme values of $P(y \mid R \in I_M, x \in S)$ among $S$. So, $\alpha$-MC is bounded by the group outcome that most deviates from the predicted value, which is either $p_n$ or $p_d$. Let $r = P_{\mathcal{D}}(R \mid R \in I_M)$. There are then two scenarios to consider:

1. $\alpha \leq |p_n - r| = p_n - r$ when $r \leq \frac{1}{2}(p_n + p_d)$; and
2. $\alpha \leq |p_d - r| = r - p_d$ when $r \geq \frac{1}{2}(p_n + p_d)$.
We will look at the first case. Let $p_r^* = P_{\mathcal{D}}(y \mid R \in I_M)$. Due to $\delta$-calibration, $p_r^* - \delta \leq r \leq p_r^* + \delta$. Then
$$\alpha \leq p_n - r \leq p_n - (p_r^* - \delta) \leq p_n - p_d + \delta \leq p_n(1 - e^{-\varepsilon}) + \delta$$
$$\alpha \leq 1 - e^{-\varepsilon} + \delta.$$
Above we have used the facts that $r \geq p_r^* - \delta$, $p_r^* \geq p_d$, $p_d \geq e^{-\varepsilon} p_n$, and $p_n \leq 1$. The second scenario is complementary and produces the identical bound.
Theorem 5 formally describes how $\delta$-calibration controls the baseline calibration error contribution to $\alpha$-MC, while $\varepsilon$-DC limits the deviation around this value by constraining the (log) maximum and minimum risk within each category.
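As an illustration of the bound, a model satisfying $(0.1, \lambda)$-DC and $0.02$-calibration is, by Theorem 5, $(1 - e^{-0.1} + 0.02, \lambda) \approx (0.115, \lambda)$-multicalibrated.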
A.5 Multicalibrated models satisfy intersectional guarantees
In contrast to DF, MC [28] was not designed to explicitly incorporate the principles of intersectionality.
However, we show that it provides an identical efficiency property to DF in the theorem below.
Theorem 6. Let the collection of subsets $\mathcal{C} \subseteq 2^{\mathcal{X}}$ define groups of individuals according to the Cartesian product of attributes $A \in \mathcal{A}$. Let $\mathcal{G} \subseteq 2^{\mathcal{X}}$ be any collection of subsets that groups individuals by the Cartesian product of attributes in $\mathcal{A}'$, where $\mathcal{A}' \subseteq \mathcal{A}$ and $\mathcal{A}' \neq \emptyset$. If $R(x)$ satisfies $\alpha$-MC on $\mathcal{C}$, then $R(x)$ is $\alpha$-multicalibrated on $\mathcal{G}$.
In proving Theorem 6, we will make use of the following lemma.
Lemma 7. The $\alpha$-MC criteria can be rewritten as: for a collection of subsets $\mathcal{C} \subseteq 2^{\mathcal{X}}$, $\alpha \in [0,1]$, and $r \in [0,1]$,
$$\max_{c \in \mathcal{C}} \mathbb{E}_{\mathcal{D}}[y \mid R(x) = r, x \in c] \leq r + \alpha$$
and
$$\min_{c \in \mathcal{C}} \mathbb{E}_{\mathcal{D}}[y \mid R(x) = r, x \in c] \geq r - \alpha.$$

Proof. The lemma follows from Definition 8, and simply restates it as a constraint on the maximum and minimum expected risk among groups at each prediction level.
Proof of Theorem 6. We use the same argument as [22] in proving this property for DF. Define $Q$ as the Cartesian product of the protected attributes included in $\mathcal{A}$, but not $\mathcal{A}'$. Then for any $(y, x) \sim \mathcal{D}$,
$$\begin{aligned}
\max_{g \in \mathcal{G}} \mathbb{E}_{\mathcal{D}}[y \mid R(x) = r, x \in g] &= \max_{g \in \mathcal{G}} \sum_{q \in Q} \mathbb{E}_{\mathcal{D}}[y \mid R(x) = r, x \in g \cap q] \, P[x \in q \mid x \in g] && (7) \\
&\leq \max_{g \in \mathcal{G}} \sum_{q \in Q} \max_{q' \in Q} \mathbb{E}_{\mathcal{D}}[y \mid R(x) = r, x \in g \cap q'] \, P[x \in q \mid x \in g] && (8) \\
&= \max_{g \in \mathcal{G}} \max_{q' \in Q} \mathbb{E}_{\mathcal{D}}[y \mid R(x) = r, x \in g \cap q'] && (9) \\
&= \max_{c \in \mathcal{C}} \mathbb{E}_{\mathcal{D}}[y \mid R(x) = r, x \in c]. && (10)
\end{aligned}$$
Moving from (7) to (8) follows from substituting the maximum value of $\mathbb{E}_{\mathcal{D}}[y \mid R(x) = r, x]$ for observations in the intersection of subsets in $\mathcal{G}$ and $Q$, which is an upper bound on the expression in (7). Moving from (8) to (9) follows from recognizing that the sum of $P[x \in q \mid x \in g]$ over all subsets in $Q$ is 1. Finally, moving from (9) to (10) follows from recognizing that the intersections of subsets in $\mathcal{G}$ and $Q$ that satisfy (9) must define a subset of $\mathcal{C}$. Applying the same argument, we can show that
$$\min_{g \in \mathcal{G}} \mathbb{E}_{\mathcal{D}}[y \mid R(x) = r, x \in g] \geq \min_{c \in \mathcal{C}} \mathbb{E}_{\mathcal{D}}[y \mid R(x) = r, x \in c].$$
Substituting into Lemma 7,
$$\max_{g \in \mathcal{G}} \mathbb{E}_{\mathcal{D}}[y \mid R(x) = r, x \in g] \leq r + \alpha$$
and
$$\min_{g \in \mathcal{G}} \mathbb{E}_{\mathcal{D}}[y \mid R(x) = r, x \in g] \geq r - \alpha,$$
or
$$\left| \mathbb{E}_{\mathcal{D}}[y \mid R(x) = r, x \in g] - r \right| \leq \alpha$$
for all $g \in \mathcal{G}$. Therefore $R(x)$ is $\alpha$-multicalibrated with respect to $\mathcal{G}$.
As a concrete example, imagine we have the protected attributes $\mathcal{A} = \{\text{race} \in \{B, W\}, \text{gender} \in \{M, F\}\}$. According to Theorem 6, $\mathcal{C}$ would contain four sets: $\{(B, M), (B, F), (W, M), (W, F)\}$. In contrast, there are eight possible sets in $\mathcal{G}$: $\{(B, M), (B, F), (W, M), (W, F), (B, *), (W, *), (*, M), (*, F)\}$, where the wildcard $*$ indicates a match to either attribute. As noted in Appendix A.1.1, the efficiency property is useful because the number of possible sets in $\mathcal{G}$ grows at a large combinatorial rate as additional attributes are added; meanwhile $\mathcal{C}$ grows at a slower, yet exponential, rate. For an intuition for why this property holds, consider that the maximum calibration error of two subgroups is at least as large as the maximum expected error of those groups combined; e.g., the maximum calibration error in a higher-order group such as $(B, *)$ will be covered by the maximum calibration error in either $(B, M)$ or $(B, F)$. The counting sketch below makes these growth rates concrete.
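The following small sketch counts both collections for protected attributes with given numbers of levels, reproducing the example above; the function names are illustrative only.

```python
# A small illustration of the growth rates discussed above: the number
# of intersectional groups in C versus all marginal and intersectional
# subgroupings in G. Names are illustrative only.
from itertools import combinations
from math import prod

def n_intersections(levels):
    # |C|: one set per cell of the full Cartesian product of attributes
    return prod(levels)

def n_subgroupings(levels):
    # |G|: sum over every nonempty subset A' of the attributes
    return sum(prod(sub)
               for k in range(1, len(levels) + 1)
               for sub in combinations(levels, k))

levels = [2, 2]                    # race in {B, W}, gender in {M, F}
print(n_intersections(levels))     # 4 sets in C
print(n_subgroupings(levels))      # 8 sets in G, matching the example above
```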
A.6 Additional Experiment Details
Models were trained on a heterogeneous computing cluster. Each training instance was limited to a single core and 4 GB of RAM. We conducted a full parameter sweep of the parameters specified in Table 2. A single trial consisted of a method, a parameter setting from Table 2, and a random seed. Over 100 random seeds, the data was shuffled and split 75%/25% into train/test sets. Results in the manuscript are summarized over these test sets.
Code. Code for the experiments is available at https://github.com/by1tTZ4IsQkAO80F/pmc and is licensed under the GNU General Public License v3.0.
Table 4: Features used in the hospital admission task.

Description          Features
Vitals               temperature, heartrate, resprate, o2sat, systolic blood pressure, diastolic blood pressure
Triage Acuity        Emergency Severity Index [50]
Check-in Data        chief complaint, self-reported pain score
Health Record Data   no. previous visits, no. previous admissions
Demographic Data     ethnoracial group, gender, age, marital status, insurance, primary language
Data. We make use of data from the MIMIC-IV-ED repository, version 1.0, to train admission risk prediction models [34]. This resource contains more than 440,000 ED visits to Beth Israel Deaconess Medical Center between 2011 and 2019. We preprocessed these data to construct an admission prediction task in which our model delivers a risk-of-admission estimate for each ED visitor after their first visit to triage, during which vitals are taken. Additional historical data for the patient were also included (e.g., number of previous visits and admissions). A list of features is given in Table 4. A sketch of the label construction appears below.
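The sketch below derives a binary admission outcome from the public MIMIC-IV-ED schema; the file and column names (edstays.csv, triage.csv, stay_id, disposition) follow that schema but may differ across releases, and this is not the authors' preprocessing pipeline.

```python
# A hedged sketch of constructing the admission label from MIMIC-IV-ED.
# File and column names follow the public schema but may differ across
# releases; this is not the authors' preprocessing pipeline.
import pandas as pd

edstays = pd.read_csv("ed/edstays.csv")   # one row per ED stay, with disposition
triage = pd.read_csv("ed/triage.csv")     # vitals, acuity, chief complaint at triage

df = triage.merge(edstays[["stay_id", "disposition"]], on="stay_id")
df["admitted"] = (df["disposition"] == "ADMITTED").astype(int)  # binary outcome y
```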
A.7 Additional Results
Table 2 lists a few parameters that may affect the performance of post-processing for both MC and PMC. Of particular interest when comparing MC versus PMC post-processing is the parameter $\alpha$, which controls how stringent the calibration error must be across categories to terminate, and the group definition ($\mathcal{A}$), which selects which features of the data will be used to assess and optimize fairness. In comparing Definitions 3 and 8, we note PMC's tolerance for error is more "aggressive" for a given value of $\alpha$, since $\mathbb{E}_{\mathcal{D}}[y \mid R \in I, x \in S] \in [0, 1]$. Thus a natural question is whether MC can match the performance of PMC on different fairness measures simply by specifying a smaller $\alpha$.
We shed light on this question in three ways. First, we quantify how often the use of each post-processing algorithm gives the best loss for each metric and trial in Table 5. Next, we look at the performance of MC and PMC postprocessing over values of $\alpha$ and group definitions in Figs. 4 to 6. Finally, we empirically compare MC- and PMC-postprocessing by the number of steps required for each to reach their best performance in Fig. 7 and Table 6.
Table 5 quantifies the number of trials for which the baseline model and the two post-processing variants produce the best model according to a given metric, over all parameter configurations. In pure head-to-head comparisons, we observe that PMC-postprocessing produces models with the lowest fairness loss according to all three metrics (DC loss, MC loss, PMC loss) the majority of the time. This provides strong evidence that, over a large range of $\alpha$ values, PMC post-processing is beneficial compared to MC-postprocessing.
From Fig. 4, it is clear that post-processing has a minimal effect on AUROC in all cases; note that the differences disappear if we round to two decimal places. When post-processing with RF, we do note a relationship between lower values of $\alpha$ and a very slight decrease in performance, particularly for MC-postprocessing.
Figs. 5 and 6 show performance between methods on MC loss and PMC loss, respectively. In terms of MC loss, PMC-postprocessing tends to produce models with the lowest loss at $\alpha$ values greater than 0.01. Lower values of $\alpha$ do not help MC-postprocessing in most cases, suggesting that these smaller updates may be overfitting to the post-processing data. In terms of PMC loss (Fig. 6), we observe that performance of MC-postprocessing is highly sensitive to the value of $\alpha$. For smaller values of $\alpha$, MC-postprocessing is able to achieve decent performance by these metrics, although in all cases, PMC-postprocessing generates a model with a better median loss value at some configuration of $\alpha$.
The ability of MC-postprocessing to perform well in terms of PMC and DC loss for certain values of $\alpha$ makes intuitive sense. If $\alpha$ can be made small enough, the calibration error $\left| \mathbb{E}_{\mathcal{D}}[R \mid R \in I, x \in S] - \mathbb{E}_{\mathcal{D}}[y \mid R \in I, x \in S] \right|$ on all categories will be small compared to the outcome prevalence, $\mathbb{E}_{\mathcal{D}}[y \mid R \in I, x \in S]$. However, achieving this performance by MC-postprocessing may require a large number of unnecessary updates for high-risk intervals, since the DC and PMC of multicalibrated models are limited by low-risk groups (Theorem 1). Furthermore, the number of steps in MC-postprocessing (and PMC-postprocessing) scales as an inverse high-order polynomial of $\alpha$ (cf. Thm. 2 [28]).
Table 5: Across 100 trials of dataset shuffles, we compare the post-processing configurations in terms of the number of times they achieve the best score for the metric shown on the left. PMC post-processing (Algorithm 1) achieves the best fairness the highest percent of the time, according to DC loss (63%), MC loss (70%), and PMC loss (72%), while MC-postprocessed models achieve the best AUROC in 88% of cases.

Metric     Base Model   MC   PMC
AUROC      5            88   6
DC loss    0            36   63
MC loss    8            21   70
PMC loss   0            27   72
Figure 4: AUROC test performance versus $\alpha$ across experiment settings. Rows are different ML base models (LR, RF), and columns are different attributes used to define $\mathcal{C}$ (ethnicity, gender; ethnicity, gender, insurance). The color denotes the post-processing method (Base Model, MC, PMC).
We assess how many steps/updates MC and PMC take for different values of $\alpha$ in Fig. 7, and summarize empirical measures of running time in Table 6. On the figure, we annotate the point for which each post-processing algorithm achieves the lowest median value of PMC loss across trials. Fig. 7 validates that PMC-postprocessing is more efficient than MC-postprocessing at producing models with low PMC loss, on average requiring 4.0x fewer updates to achieve its lowest loss on test. From Table 6 we observe that PMC typically requires a larger number of updates to achieve its best performance on MC loss (about 2x wall clock time and number of updates), whereas MC-postprocessing requires a larger number of updates to achieve its best performance on PMC loss and DC loss, due to its dependence on very small values of $\alpha$. We accompany these results with the caveat that they are based on performance on one real-world task, and wall clock time measurements are influenced by the heterogeneous cluster environment; future work could focus on a larger empirical comparison.
Figure 5: MC loss test performance versus $\alpha$ across experiment settings. Rows are different ML base models (LR, RF), and columns are different attributes used to define $\mathcal{C}$ (ethnicity, gender; ethnicity, gender, insurance). The color denotes the post-processing method.
Figure 6: PMC loss test performance versus $\alpha$ across experiment settings. Rows are different ML base models (LR, RF), and columns are different attributes used to define $\mathcal{C}$ (ethnicity, gender; ethnicity, gender, insurance). The color denotes the post-processing method.
Figure 7: Number of post-processing updates by MC and PMC versus $\alpha$ across experiment settings. Rows are different ML base models, and columns are different attributes used to define $\mathcal{C}$. The color denotes the post-processing method. Each result is annotated with the median PMC loss for that method and parameter combination; the annotated best median PMC losses are 0.36 and 0.29 (LR; ethnicity, gender), 0.31 and 0.29 (LR; ethnicity, gender, insurance), 0.41 and 0.37 (RF; ethnicity, gender), and 0.30 and 0.28 (RF; ethnicity, gender, insurance).
Table 6: For MC- and PMC-postprocessing, we compare the median number of updates and median wall clock time (s) taken to train for the configuration ($\alpha$, $\mathcal{A}$) that achieved the best performance on each metric.

Metric     ML   Postprocessing   Best Loss   # of Updates   Wall Clock Time (s)
MC loss    LR   MC               0.116       0              32.2
MC loss    LR   PMC              0.108       30             58.1
MC loss    RF   MC               0.147       82             44.3
MC loss    RF   PMC              0.135       172            79.6
PMC loss   LR   MC               0.334       376            110.6
PMC loss   LR   PMC              0.287       52             66.7
PMC loss   RF   MC               0.356       504            106.5
PMC loss   RF   PMC              0.325       188            81.7
DC loss    LR   MC               0.491       376            110.8
DC loss    LR   PMC              0.485       334            117.0
DC loss    RF   MC               0.441       195            62.7
DC loss    RF   PMC              0.422       172            79.6