
A Partitioning Deletion/Substitution/Addition Algorithm for

Creating Survival Risk Groups

Karen Lostritto∗, Robert L. Strawderman‡, and Annette M. Molinaro∗

Abstract

Accurately assessing a patient's risk of a given event is essential in making informed treatment decisions. One approach is to stratify patients into two or more distinct risk groups with respect to a specific outcome using both clinical and demographic variables. Outcomes may be categorical or continuous in nature; important examples in cancer studies include level of toxicity and time to recurrence. Recursive partitioning methods are ideal for building such risk groups. Two such methods are Classification and Regression Trees (CART) and a more recent competitor known as the partitioning Deletion/Substitution/Addition (partDSA) algorithm. CART is an established method that can be used for either continuous or categorical outcomes and builds a regression tree using only 'and' statements constructed from partitions of predictor variables. partDSA, a new piecewise constant estimation algorithm, is based on a similar premise but allows for both 'and' and 'or' constructs. CART and partDSA both utilize loss functions (e.g., squared error for a continuous outcome) as the basis for building, selecting, and assessing predictors. Recently, we have shown that partDSA can outperform CART in partitioning data in so-called "full data" (i.e., uncensored) settings. However, when confronted with censored data, where some patients are observed but fail to experience the time-to-event of interest, the loss functions used by both procedures must be modified. There have been several attempts to adapt CART for right-censored data. This article describes two such extensions for partDSA that make use of observed data (i.e., possibly censored) loss functions. These observed data loss functions, constructed using inverse probability of censoring weights, are consistent estimates of their uncensored counterparts provided that the corresponding censoring model is correctly specified. The relative performance of these new methods is evaluated via simulation studies and illustrated through an analysis of publicly available data on breast cancer patients. The implementation of partDSA for censored and uncensored outcomes is publicly available in the R package partDSA.

1 Introduction

Clinicians carefully weigh a patient's prognosis when deciding on the aggressiveness of treatment. In determining a patient's prognosis, clinicians may consider a patient's age, gender, clinical information and, more recently, biological variables such as gene or protein expression. In the particular case of breast cancer patients, pertinent information includes the clinical variables of tumor size, nodal status, tumor grade, age, and metastasis, and biological variables such as estrogen and progesterone receptor status and mutations in the BRCA1 gene. Objective guidelines for predicting a patient's prognosis (i.e., risk) from clinical and biological information can be obtained from risk indices estimated from data collected on an independent sample of patients with known covariates and outcome. Frequently, the outcome of interest is the time to occurrence of a specified event; common examples in cancer include death and recurrence or progression of disease. The use of time-to-event outcomes frequently results in the presence of right-censored outcome data on several patients.

∗ Division of Biostatistics, Yale University School of Medicine, 60 College St., New Haven, 06519
‡ Department of Statistical Science, Cornell University, Ithaca, NY

arXiv:1101.4331v1 [stat.ME] 22 Jan 2011

There are many statistical learning methods that might be used in building predictors of risk for a given outcome using covariate information. An attractive class of methods for building clinically interpretable risk indices is recursive partitioning. Such methods can be used with both categorical and continuous covariates and outcomes and result in prediction rules that take the form of a decision tree. From the perspective of constructing clinical risk indices, a decision tree is useful because it permits a clinician to easily estimate a patient's level of risk for a given outcome by following a specific path down the tree. For example, Molinaro et al. (2004) show that a cohort of breast cancer patients can be separated into three risk groups determined by copy number alterations on chromosomes 3 and 10.

The Classification and Regression Tree algorithm (CART; Breiman et al., 1984) is perhaps the best-known recursive partitioning method. Left unchecked, CART has the capability of placing each subject in their own terminal node. An important consideration therefore lies in the selection of an appropriate number of splits (i.e., nodes). This "pruning" problem involves an inherent but familiar trade-off: construct an accurate estimator while avoiding overfitting the model to the data. CART uses cross-validation in combination with a specified loss function in order to determine where to stop partitioning the covariate space. In the resulting pruned tree, a single predicted value is assigned to each terminal node. For example, with a continuous outcome and an L2 (i.e., squared-error) loss function, the predicted value for each terminal node is the mean outcome for those patients falling into that node.

Molinaro et al. (2010) recently developed partDSA, a new loss-based partitioning method that may be viewed as a direct competitor to CART. Similarly to CART, partDSA divides the covariate space into mutually exclusive and exhaustive regions on the basis of a chosen loss function. The resulting regression models continue to take the form of a decision tree; hence, like CART, partDSA provides an excellent foundation for developing a clinician-friendly tool for accurate risk prediction and stratification. However, this algorithm, described in greater detail below, also differs from CART in that the decision tree may be constructed from both 'and' and 'or' conjunctions of predictors. The advantage of this representation is two-fold: (i) in cases where only 'and' conjunctions of predictors are required to build a parsimonious model, the partDSA and CART representations coincide; and, (ii) in cases where CART requires two or more terminal nodes to represent distinct subpopulations (i.e., defined by covariates) having the same outcome distribution, partDSA can represent these same structures using a single partition. Such increased flexibility has been previously shown to improve prediction accuracy and predictor stability for the case of uncensored continuous outcomes under an L2 loss function (Molinaro et al., 2010).
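The practical difference between the two representations can be sketched in a few lines of code; the variables, cutpoints, and predicted values below are purely hypothetical and serve only to illustrate point (ii):

```python
# Hypothetical rule: patients are high-risk if (age > 60 and grade > 2)
# OR (age <= 60 and tumor_size > 5).  An 'and'-only tree needs two
# distinct terminal nodes carrying the same predicted value to express
# this; an 'and'/'or' model can use a single partition.

def cart_predict(age, grade, tumor_size):
    # 'and'-only tree: two of the four leaves carry the same value 0.8
    if age > 60:
        return 0.8 if grade > 2 else 0.2          # leaf 1 / leaf 2
    else:
        return 0.8 if tumor_size > 5 else 0.2     # leaf 3 / leaf 4

def partdsa_predict(age, grade, tumor_size):
    # single partition formed by an 'or' of two 'and' clauses
    high_risk = (age > 60 and grade > 2) or (age <= 60 and tumor_size > 5)
    return 0.8 if high_risk else 0.2
```

Both rules make identical predictions everywhere, but the 'or' form uses two partitions where the 'and'-only tree uses four terminal nodes (two pairs sharing a predicted value).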

In both CART and partDSA, the choice of loss function plays a key role in each step of the model building process. The default choice in the case of a continuous outcome variable (i.e., L2 loss) requires modification in the presence of censored outcome data. Numerous adaptations of CART have been suggested in the literature for handling right-censored outcomes. Roughly speaking, these approaches may be classified according to whether a given censored-data-specific node splitting criterion is intended to maximize homogeneity within nodes or differences between nodes. Examples in the former category include Gordon and Olshen (1985), who used the Lp Wasserstein or the Hellinger metric as a measure of the variability of the Kaplan-Meier estimate within each node; Therneau et al. (1990), who substituted martingale residuals directly into the CART algorithm run with its usual L2 loss function; and LeBlanc and Crowley (1992), where splits were chosen in order to achieve the maximal reduction in one-step deviance from the parent nodes to the daughter nodes. Examples in the latter category include Ciampi et al. (1986) and Segal (1988), who maximized between-node differences in survival through the use of a two-sample log-rank test statistic; and LeBlanc and Crowley (1993), who selected the best split at a given node by determining the covariate and cutpoint that maximize the log-rank test statistic.

With one exception, each of the aforementioned adaptations of CART replaces the usual L2 loss function with a criterion function that relies on more traditional estimators and measures of fit used with right-censored outcome data. In the absence of censoring, such adaptations typically yield answers that differ from those provided by the default implementation of CART. Adaptations that replace the L2 loss function with other criteria also do not allow one to easily quantify the prediction error of the final model using the same loss function. Molinaro et al. (2004) proposed CART-IPCW, a direct adaptation of CART that replaces the L2 loss function with an Inverse Probability of Censoring Weighted (IPCW; Robins and Rotnitzky, 1992) estimator of the desired "full data" L2 loss function (i.e., that which would be used in the absence of censoring). This allows an otherwise unaltered implementation of CART to be used for splitting, pruning, and estimation. The IPCW-L2 loss function, computable for observed data, is under standard assumptions a consistent estimator of the full data L2 loss function. Molinaro et al. (2004) demonstrate that CART-IPCW generally has lower prediction error, measured as L1 loss using the Kaplan-Meier median as the predicted value within a given node, than the one-step deviance method of LeBlanc and Crowley (1992) derived under a proportional hazards model specification.

In this paper, we extend the loss-based estimation methodology used by partDSA to the setting of right-censored data using an IPCW-L2 loss function. The resulting methodology builds a decision tree based on the (time-restricted) mean response in each node. We also consider an extension of partDSA for predicting a binary outcome of the form Z(t) = I{T > t}, where T is the event time of interest and t is some specified time point. This extension also involves a weighted modification of the L2 loss function and makes use of the Brier score (Graf et al., 1999), a measure of survival prediction inaccuracy. Graf et al. (1999) introduce this measure as a way to compare various estimators of survival experience. Two novel contributions of this paper are to (i) demonstrate that this score has a simple but interesting representation as an IPCW-weighted L2 loss function; and, (ii) make use of the resulting loss function as the basis for constructing a partDSA regression tree.

The remainder of this paper proceeds as follows. First, we describe the relevant "full data" and "observed data" structures and give a general overview of the problem of loss-based estimation in each setting. We then briefly review the partDSA algorithm and describe how these ideas may be used to extend it to right-censored outcomes. Subsequently, the performance of these extensions of partDSA is evaluated via simulations and in an analysis of a publicly available breast cancer data set. Finally, we close the paper with a discussion and comments on further, useful extensions.

Remark: Random Forests (Breiman, 2001; Ishwaran et al., 2008) and other ensemble methods use randomization as a way to improve the predictive performance of regression and classification trees. In the context of regression trees, such methods create many full-sized trees from a given set of predictor variables using a bootstrap sampling procedure. Predicted outcomes for each subject are then obtained by averaging the subject-specific predictions over all randomly generated trees. Such procedures are being increasingly recognized for their impressive ability to reduce prediction error. However, their use in developing clinically interpretable models of risk presents challenges. For example, the unpruned trees built from each bootstrap sample do not necessarily contain a given predictor variable each time; in addition, when chosen, such variables also may not appear at the same level of the tree or even with the same split value. Since an important practical goal of this work is to provide methodology for building clinically useful decision trees with right-censored outcomes, we do not consider ensemble methods further in this paper.

2 Methods

2.1 Relevant Data Structures

2.1.1 Full Data Structure

In the ideal setting, one observes n i.i.d. observations X_1, ..., X_n of a "full data" structure, say X = (T, W), where T denotes a response variable and W ∈ S denotes a (possibly high-dimensional) vector of covariates. Denote the corresponding (unknown) distribution of the full data structure X by F_{X,0}. With time-to-event data, our focus hereafter, T > 0 is a continuous random variable that denotes the event time of interest and W represents a set of baseline covariates. In this case, we may equivalently define the full data as X = (Z, W), where Z = log T. In the setting of breast cancer risk prediction, W may include age as well as various genomic, epidemiologic and histologic measurements that are continuous or categorical in nature. More generally, the actual covariate process may contain both time-dependent and time-independent covariates; however, as with most work on analyzing time-to-event data using regression trees, we shall focus on the setting of time-independent W only.

2.1.2 Observed Data Structure

In practice, missing data can occur in one or more of the X_i; that is, one instead observes n i.i.d. observations O_1, ..., O_n of an observed data structure O having distribution F_{Obs}. The most common form of missing data in the case of time-to-event outcomes is right-censoring of event times. Here, O = (T̃, ∆, W), where T̃ = min(T, C) and ∆ = I(T ≤ C); equivalently, O = (Z̃, ∆, W), where Z̃ = log T̃. Such right-censored outcome data can arise as a result of subject drop out or the end of follow-up. The distribution F_{Obs} of O depends on the distribution of X (i.e., F_{X,0}) and the conditional distribution of the censoring variable C given X, say G_0(·|X).

We assume that censoring satisfies a coarsening at random (CAR) condition (Gill et al., 1997). Because X is assumed to include only time-independent covariates, CAR implies that C is conditionally independent of T given the baseline covariates W; that is, we have Ḡ_0(·|X) = Ḡ_0(·|W), where Ḡ_0(c | X) = Pr(C ≥ c | X) and Ḡ_0(c | W) is defined analogously. If CAR is not satisfied, then Ḡ_0(·|X) must additionally depend on T (or Z); such dependence cannot be identified from the observed data and must be fully specified. For simplicity, we proceed assuming CAR, noting that extensions to dependent censoring are possible in combination with the use of sensitivity analysis (e.g., Scharfstein and Robins, 2002).
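To fix ideas, the observed data structure O = (T̃, ∆, W) is easy to simulate; the sketch below draws event and censoring times from exponential distributions chosen purely for illustration, with C independent of T given W:

```python
import random

random.seed(1)

def draw_observation():
    """Draw one observed-data record O = (T_tilde, Delta, W)."""
    w = random.random()                   # baseline covariate W ~ Uniform(0, 1)
    t = random.expovariate(1.0 + w)       # event time T | W ~ Exp(rate = 1 + W)
    c = random.expovariate(0.5)           # censoring time C ~ Exp(rate = 0.5)
    t_tilde = min(t, c)                   # observed follow-up time T_tilde
    delta = 1 if t <= c else 0            # event indicator Delta = I(T <= C)
    return t_tilde, delta, w

sample = [draw_observation() for _ in range(1000)]
censoring_rate = 1 - sum(d for _, d, _ in sample) / len(sample)
# Under these illustrative rates, roughly a quarter to a third of
# subjects end up censored (Delta = 0).
```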

2.2 Loss-Based Estimation

2.2.1 Estimation with Full Data

Let ψ : S → R be a real-valued function of W, where ψ ∈ Ψ. Let L(X, ψ) denote a full data loss function and define

    E_{F_{X,0}}[L(X, ψ)] = ∫ L(x, ψ) dF_{X,0}(x)

as the corresponding risk of ψ. The parameter of interest, ψ_0, is then defined as a minimizer of the risk: E_{F_{X,0}}[L(X, ψ_0)] = min_{ψ ∈ Ψ} E_{F_{X,0}}[L(X, ψ)]. In practice, and on the basis of the fully observed data X_1, ..., X_n, ψ_0 can be estimated by minimizing the empirical risk (i.e., average loss) n^{-1} Σ_{i=1}^n L(X_i, ψ). Generally speaking, feasible estimation procedures require ψ(·) to be modeled in some fashion, in which case estimation of ψ_0 reduces to estimating the corresponding model parameter(s). We reserve further discussion of this issue until Section 2.4, where piecewise constant regression estimators will be of primary interest in connection with partDSA.

The purpose of the loss function is to quantify a specific measure of performance and, depending on the parameter of interest, there could be numerous loss functions from which to choose. In the case of the continuous outcome Z = log T, a common and important parameter of interest is the conditional mean ψ_0(W) = E[Z | W]; it is well known that ψ_0(W) is, under mild conditions, the unique minimizer of the risk under the squared error loss L(X, ψ) = (Z − ψ(W))². More generally, the conditional mean ψ_0(W) minimizes the risk under the far larger class of Bregman loss functions; see Banerjee et al. (2005) for details. However, the estimators of ψ_0(W) derived from the corresponding average loss may well be different for different choices of loss function within the Bregman family.

The mean is generally regarded as a useful and interpretable quantity. However, in the case of time-to-event data, one is frequently interested in alternative summaries. For example, the conditional median survival time ψ_0(W) = Med[T | W] minimizes the risk under the absolute error loss L(X, ψ) = |T − ψ(W)|. Alternatively, one may be specifically interested in predicting survival status at a given time t, say Z(t) = I(T > t). Specifically, given a predictor ψ̂_t(W) for the status variable Z(t), Graf et al. (1999) define the expected Brier score as the risk under the squared error loss L(X, ψ̂_t) = (Z(t) − ψ̂_t(W))², and establish that ψ_{0t}(W) = P(T > t | W) minimizes risk under the loss function L(X, ψ_t) = (Z(t) − ψ_t(W))² for ψ_t ∈ Ψ, assuming that Ψ contains the true survivor function. Minimizing the corresponding average loss yields an estimator for ψ_{0t}(W).
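The full-data minimizers mentioned above are easy to verify numerically; the following sketch checks, on a small illustrative sample, that a grid search over constant predictors recovers the mean under L2 loss and the median under L1 loss:

```python
# Empirical check that the mean minimizes average L2 loss and the
# median minimizes average L1 loss, over a grid of candidate constants.
z = [1.0, 2.0, 2.5, 4.0, 10.0]      # illustrative outcomes

def avg_l2(psi):
    return sum((zi - psi) ** 2 for zi in z) / len(z)

def avg_l1(psi):
    return sum(abs(zi - psi) for zi in z) / len(z)

grid = [i / 100 for i in range(0, 1101)]   # candidate constants 0.00 .. 11.00
best_l2 = min(grid, key=avg_l2)            # near mean(z) = 3.9
best_l1 = min(grid, key=avg_l1)            # near median(z) = 2.5
```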

2.2.2 Estimation with Observed Data

In the presence of right-censored outcome data, the goal remains to find an estimator for a parameter ψ_0 that is defined in terms of the risk for a full data loss function L(X, ψ), e.g., a predictor of the log survival time Z based on covariates W. An immediate problem is that L(X, ψ) cannot be evaluated for any observation O having a censored survival time (∆ = 0); for example, L(X, ψ) = (Z − ψ(W))² cannot be evaluated if Z = log T is not observed. Risk estimators based on only uncensored observations, such as n^{-1} Σ_i ∆_i L(X_i, ψ), are biased estimators for E_{F_{X,0}}[L(X, ψ)]. One possible solution is to replace the full (uncensored) data loss function by an observed (censored) data loss function having the same expected value, i.e., risk (van der Laan and Robins, 2003).

The notion of an IPCW estimator was introduced by Robins and Rotnitzky (1992). Let n^{-1} Σ_{i=1}^n D(X_i) estimate some parameter of interest in a setting where all event times are fully observed (i.e., no censoring). Then, in the presence of censoring, the corresponding IPCW estimator can be written as:

    (1/n) Σ_{i=1}^n D(X_i) ∆_i / Ḡ_0(T_i | W_i).    (1)

Here, Ḡ_0(c | W) = Pr(C ≥ c | W) is defined as in Section 2.1.2; hence, (1) may be interpreted as a "complete case" estimator of E_{F_{X,0}}[D(X)] in which each complete case (i.e., ∆_i = 1) receives a weight that is inversely proportional to the probability of observing the event time T_i for that subject. Under the assumption that E_{F_{Obs}}[∆ | X] = Ḡ_0(T | W) > 0 a.s., (1) has expectation E_{F_{X,0}}[D(X)], demonstrating unbiasedness for its target. Moreover, in the absence of censoring, (1) reduces to the desired full data loss function.

While IPCW is most frequently used in conjunction with the construction of unbiased estimating equations, the principle applies much more generally. In particular, taking D(X) = L(X, ψ) yields the observed data loss n^{-1} Σ_{i=1}^n ∆_i L(X_i, ψ) / Ḡ_0(T_i | W_i), an unbiased estimator of the desired full data risk E_{F_{X,0}}[L(X, ψ)]. A simple but important example is the IPCW version of the L2 loss function introduced in Section 2.2.1, given by the above formula with L(X_i, ψ) = (Z_i − ψ(W_i))² and corresponding risk E_{F_{X,0}}[(Z − ψ(W))²]. Of course, the censoring probability function Ḡ_0(·|W) is in general unknown and must be estimated from the data. For any consistent estimator Ḡ_n(·|W) of Ḡ_0(·|W),

    (1/n) Σ_{i=1}^n L(X_i, ψ) ∆_i / Ḡ_n(T_i | W_i)    (2)

is under mild conditions a consistent estimator for the desired full data risk E_{F_{X,0}}[L(X, ψ)]. As a result, it is reasonable to consider estimating ψ(W) by minimizing (2). Notably, one can also estimate Ḡ_0(·|·) using covariates other than those used for estimating ψ(W), allowing one to handle the possibility of certain forms of informative censoring and potentially increase estimation efficiency.
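As an illustration of (2), the sketch below estimates the censoring survivor function with a Kaplan-Meier-type estimator (covariate-independent, as used later in our simulations) and evaluates the IPCW-weighted L2 empirical risk; the data and the crude handling of ties are purely illustrative:

```python
def km_censoring_survivor(times, deltas):
    """Kaplan-Meier-type estimate of G(c) = Pr(C >= c), obtained by
    treating the censored observations (delta == 0) as the 'events'.
    Ties are handled crudely; this is an illustrative sketch only."""
    n = len(times)
    order = sorted(range(n), key=lambda i: times[i])
    surv, points = 1.0, []           # (time, survivor value just after time)
    at_risk = n
    for idx in order:
        if deltas[idx] == 0:         # a censoring 'event' occurs here
            surv *= (at_risk - 1) / at_risk
        points.append((times[idx], surv))
        at_risk -= 1

    def G(c):
        s = 1.0
        for t, v in points:
            if t < c:                # Pr(C >= c): step down after times < c
                s = v
            else:
                break
        return s
    return G


def ipcw_l2_risk(times, deltas, z, psi_hat):
    """Average observed-data IPCW-L2 loss, as in (2): only uncensored
    subjects contribute, each weighted by 1 / G_n(T_i)."""
    G = km_censoring_survivor(times, deltas)
    total = 0.0
    for i in range(len(times)):
        if deltas[i] == 1:
            total += (z[i] - psi_hat[i]) ** 2 / G(times[i])
    return total / len(times)
```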

2.3 The Brier Score

As suggested in Section 2.2.1, the prediction of survival status at a given time t can be viewed as a loss minimization problem, in which ψ_t(W) is chosen to minimize the average full data loss function n^{-1} Σ_{i=1}^n L(X_i, ψ), where L(X_i, ψ) = (Z_i(t) − ψ(W_i))² and Z_i(t) = I(T_i > t). In Graf et al. (1999), this average squared error loss is referred to as the Brier score, where it is proposed as a measure of prediction inaccuracy permitting comparisons between competing models for the survival probability ψ_{0t}(W) = P(T > t | W). Graf et al. (1999) also extend the definition of the Brier score to accommodate censored survival data. In particular, given a specific time t, they propose to divide the observations into three groups. Group 1 contains subjects censored before t; here, ∆_i = 0, T̃_i < t, and Z_i(t) cannot be determined. Group 2 contains subjects experiencing an event before t; here, ∆_i = 1, T̃_i ≤ t, and Z_i(t) = 0. Group 3 contains subjects that remain at risk for an event at time t; here, the value of ∆_i becomes irrelevant, for T̃_i > t and thus Z_i(t) = 1.

The corresponding average observed data loss function, generalized here to allow for covariate-dependent censoring, is computed as follows:

    BSc(t) = (1/n) Σ_{i=1}^n [ (0 − ψ̂_t(W_i))² I(T̃_i ≤ t, ∆_i = 1) / Ḡ_n(T̃_i | W_i) + (1 − ψ̂_t(W_i))² I(T̃_i > t) / Ḡ_n(t | W_i) ],    (3)

where I(T̃_i ≤ t, ∆_i = 1) is the indicator function for Group 2, I(T̃_i > t) is the indicator function for Group 3, and Ḡ_n(·|W) is an estimator for the censoring survival function Ḡ_0(·|W) = P(C > · | W). Observe that subjects in Group 1 do not contribute to this loss function, except indirectly through estimation of Ḡ_0(·|W). It is easy to show that BSc(t), with Ḡ_n(·|W) replaced by Ḡ_0(·|W), is an unbiased estimator of the expected (full data) Brier score under mild conditions that include random censoring.

We now demonstrate that BSc(t) has an interesting IPCW-based representation. Proceeding similarly to Strawderman (2000), define ∆_i(t) = I(T_i(t) ≤ C_i) and T̃_i(t) = min{T̃_i, t}, where T_i(t) = min{T_i, t}. Clearly, ∆_i(t) → ∆_i as t → ∞. Moreover, it is evident that any subject with ∆_i = 1 must also have ∆_i(t) = 1 for every t. Thus, the importance of ∆_i(t) only becomes apparent when ∆_i = 0; in particular, for a given t < ∞, it is possible for ∆_i = 0 and ∆_i(t) = 1. Specifically, for a subject in Group 1, since ∆_i = 0 and T̃_i < t, we have both C_i < T_i and C_i ≤ t. It follows that T̃_i(t) = C_i and ∆_i(t) = 0. For a subject in Group 2, since ∆_i = 1 and T̃_i ≤ t, we have T_i ≤ C_i and T_i ≤ t. It follows that I(T̃_i ≤ t, ∆_i = 1) = ∆_i(t) = 1 and that T̃_i(t) = T̃_i = T_i. For a subject in Group 3, we have T̃_i > t regardless of ∆_i and therefore that T_i > t and C_i > t. It follows that I(T̃_i > t) = ∆_i(t) = 1 and that T̃_i(t) = t. It is now easy to show that (3) may be written as:

    BSc(t) = (1/n) Σ_{i=1}^n [∆_i(t) / Ḡ_n(T_i(t) | W_i)] (Z_i(t) − ψ̂_t(W_i))².    (4)

This IPCW-type average loss estimator uses a modified event time and censoring indicator and will be consistent for the expected (full data) Brier score under mild regularity conditions. Such a loss function is also easily extended to the setting of multiple time points; for example, estimators for ψ_0(W) might be obtained by minimizing loss functions of the form Σ_{r=1}^p BSc(t_r), ∫_0^τ BSc(t) w(t) dt, or perhaps even max{BSc(t_1), BSc(t_2), ..., BSc(t_p)}.
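The three-group computation in (3) can be sketched directly; the function below assumes covariate-independent censoring weights (a user-supplied estimate G of the censoring survivor function) and hypothetical data:

```python
def brier_ipcw(t, times, deltas, psi_hat, G):
    """Censored-data Brier score as in (3): Group 2 subjects (event by t)
    are weighted by 1/G(T_i); Group 3 subjects (still at risk at t) are
    weighted by 1/G(t); Group 1 subjects (censored before t) drop out."""
    n = len(times)
    total = 0.0
    for i in range(n):
        if times[i] <= t and deltas[i] == 1:      # Group 2: Z_i(t) = 0
            total += (0.0 - psi_hat[i]) ** 2 / G(times[i])
        elif times[i] > t:                        # Group 3: Z_i(t) = 1
            total += (1.0 - psi_hat[i]) ** 2 / G(t)
        # Group 1 (censored before t) contributes nothing directly
    return total / n
```

With G ≡ 1 (no censoring), the function reduces to the ordinary full-data Brier score.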

2.4 partDSA: recursive partitioning for full data structures and extensions

partDSA (Molinaro et al., 2010) is a statistical methodology for predicting outcomes from a complex covariate space with fully observed data. Similarly to CART, this novel algorithm generates, via piecewise constant estimation, a list of increasingly complex candidate predictors based on an intensive and comprehensive search over the entire covariate space.

The strategy for estimator construction, selection, and performance assessment in partDSA is entirely driven by the specification of a loss function and involves three main steps. Step 1 involves defining the parameter of interest, defined here as the minimizer of the expected loss function (i.e., risk), where the given loss function is chosen to represent some desired measure of performance. In Step 2, candidate estimators are constructed based on the corresponding empirical risk (i.e., average loss) function. In particular, we define a finite collection of candidate estimators for the parameter of interest based on a sieve of increasing dimension that approximates the desired parameter space. For each element of the sieve, a candidate estimator is defined as a minimizer of the empirical risk derived from the chosen loss function. Finally, in Step 3, we use cross-validation to estimate risk and to select an optimal estimator among the candidate estimators obtained in Step 2. This step relies on the unified cross-validation methodology of van der Laan and Dudoit (2003). We comment briefly on Steps 2 and 3 below; further details may be found in Molinaro et al. (2010).

As indicated in Step 1, partDSA relies on the specification of a loss function; as commented in Section 2.2.1, the resulting regression function ψ(W) should also be parameterized in a way that permits one to compute a useful estimator. This latter problem is dealt with in Step 2, where partDSA relies on the specification of a particular sieve, {Ψ_k}, of finite-dimensional subspaces Ψ_k ⊂ Ψ such that

    Ψ_k ≡ { ψ_{I,β}(·) = Σ_{j ∈ I} β_j I{· ∈ R_j} : β ∈ B_I, I ⊂ N, |I| ≤ k }.

Here, N denotes the non-negative integers, I ⊂ N is a finite index set with |I| elements, k is the maximal index set size for Ψ_k, B_I ≡ {β = (β_1, ..., β_{|I|}) : ψ_{I,β} ∈ Ψ} ⊆ R^{|I|}, and the {R_j : j ∈ I} are constructed for any |I| ≤ k in a way that forms an exhaustive and mutually exclusive partition of the entire covariate space S. Such sieves are members of the class of piecewise constant regression models (Härdle, 1989). For each k and a given loss function, the principal goal then reduces to estimating the (finite-dimensional) parameter ψ_{0k} that minimizes the risk for ψ ∈ Ψ_k via minimization of the corresponding average loss function.

For the specified sieve Ψ_k, solving this problem involves determining β̂_I for all possible partitions I with |I| ≤ k; an estimator β̂^{(k)} leading to the minimum empirical risk is then selected as the desired solution for the sieve Ψ_k. In practice, it is usually impossible to search over the set of covariate partitions (i.e., the set of partitions generated by all index sets I such that |I| ≤ k). Tree-based methods, which rely on the construction of piecewise constant estimators as described above, avoid this problem by approximating the desired minimum via recursive binary partitioning. In addition to using "and"-type statements like those employed by CART, partDSA also explores and chooses the best among all possible "or" statements. In this approach, partitionings are generated via three different moves, or step functions: Deletion, Substitution, and Addition. A Deletion move forms a union of two partitions of the covariate space regardless of their spatial location, i.e., the two partitions may not be contiguous. A Substitution move divides two disparate partitions into two subsets each and then forms combinations of the four subsets, resulting in two new partitions. Thus, this step forms unions of partitions (or subsets within the partitions) as well as divides partitions. An Addition move splits one partition into two distinct partitions. The goal is to use these moves to minimize the empirical risk over all of the generated index sets. By doing so, a sieve of increasingly complex predictors is generated, each element representing the best predictor of size k, where k = 1, ..., K. This process completes Step 2; the best partition is then selected using v-fold cross-validation, completing Step 3.

In theory, partDSA can be defined and implemented for an arbitrary choice of loss function, the specific choice being tied to the parameter of interest (i.e., minimizer of the expected loss). In practice, the computationally intensive nature of partDSA necessitates using loss functions in Step 2 that can be minimized very efficiently. In the case of fully observed continuous data, the L2 loss function is a very favorable choice. Specifically, for each partition I considered by partDSA and any fixed set of nonnegative weights α_1, ..., α_n, the minimization problem

    β̂_I = arg min_{β ∈ B_I} Σ_{j ∈ I} Σ_{i=1}^n α_i I(W_i ∈ R_j)(Z_i − β_j)²

is easily solved in closed form:

    β̂_{I,j} = (Σ_{i=1}^n α_i I(W_i ∈ R_j) Z_i) / (Σ_{i=1}^n α_i I(W_i ∈ R_j)),  j ∈ I.

The software for this algorithm, currently implemented for problems without missing data for select loss functions and α_i = 1 for each i, is available from CRAN in the form of an R package (Molinaro et al., 2009).

The results summarized in Section 2.2.2 demonstrate how the problem of estimating a parameter that minimizes the full data risk for a given choice of loss function can be generalized to the setting of right-censored outcome data: replace the desired average full data loss with the corresponding average observed data IPCW-weighted loss. The current L2-loss-based estimation capabilities of partDSA may now be easily extended to right-censored outcomes in the following two ways: (i) partDSA-IPCW, that is, partDSA implemented using the average loss function (2) with L(X_i, ψ) = (Z_i − ψ(W_i))² and α_i = ∆_i/Ḡ_n(T_i | W_i), i = 1, ..., n; and, (ii) partDSA-Brier, that is, partDSA implemented using the weighted L2 loss function (4) for the binary outcome Z_i(t). Notably, the first extension generalizes the Koul-Susarla-van Ryzin estimator (Koul et al., 1981) for the semiparametric accelerated failure time model, allowing covariate-dependent censoring and a tree-based regression function.
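The closed-form solution above is simply a weighted mean of the responses falling in each region, which is what makes the search over candidate partitions computationally feasible; a minimal sketch (region labels and weights are illustrative):

```python
def weighted_partition_means(z, alpha, region):
    """Closed-form minimizer of the weighted L2 loss for a piecewise
    constant model:
        beta_hat[j] = (sum_i alpha_i 1(W_i in R_j) z_i)
                      / (sum_i alpha_i 1(W_i in R_j)).
    `region[i]` gives the index j of the partition containing W_i."""
    num, den = {}, {}
    for zi, ai, j in zip(z, alpha, region):
        num[j] = num.get(j, 0.0) + ai * zi
        den[j] = den.get(j, 0.0) + ai
    return {j: num[j] / den[j] for j in num if den[j] > 0}

# With alpha_i = Delta_i / G_n(T_i | W_i) these are the node predictions
# of the IPCW extension; with alpha_i = 1 they reduce to ordinary node
# means, as in the uncensored algorithm.
```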

3 Simulation Studies

To evaluate the proposed adaptations of partDSA, we ran both univariate and multivariate simulation studies. In addition to the adaptations for partDSA, we considered the performance of CART using the methods of LeBlanc and Crowley (1993; L&C-NLL) and the IPCW-based extension of CART considered in Molinaro et al. (2004; CART-IPCW). Kaplan-Meier estimates were used to create all IPCW and Brier score weights. A brief discussion of tuning parameters and specifications for each method considered is provided in Web Appendix A. The results of the multivariate simulation study are reported below; the results for the univariate study are provided in Web Appendix B. Figures and tables referenced in Section 3.2 below but not found in the main document are provided in Web Appendix C.

The overall structure of each study is the same. First, a training set of 250 observations was generated with a specified overall censoring level (0%, 30%, or 50%). For this training set, a model size (i.e., number of terminal partitions) was determined using the first minimum of the 5-fold cross-validated risk. Then, a model of the indicated size was built using the entire training set. Finally, the performance of the fitted model was assessed with an independent test set of 5000 observations generated from the full distribution (i.e., with no censoring) using the criteria described in Section 3.1. This process was repeated for 1000 independent (training set, test set) combinations.

In total, and for each simulation study, there are two true regression trees, each estimated in two ways. Regarding the trees themselves, the true partDSA regression tree structure has two terminal partitions and the true CART regression tree structure has three terminal nodes. Importantly, both trees correctly describe the regression structure. However, two of the terminal nodes in the CART tree represent different subpopulations of subjects having the same survival distribution;


in contrast, these two subpopulations are combined into a single partition in the

partDSA tree, a more parsimonious representation. Regarding estimation, the partDSA tree is

estimated using the IPCW and Brier score loss functions, whereas the CART tree is estimated

using the methods of LeBlanc and Crowley (1993) and Molinaro et al. (2004). It is useful to

point out here that CART_IPCW and partDSA_IPCW use the exact same loss function; hence,

a direct comparison of these results illustrates the difference between the partDSA and CART

algorithms used for building a prediction rule.

3.1 Test Set Evaluation Criteria

We evaluated performance using three different types of criteria: prediction concordance, pre-

diction error, and proper risk stratification. For brevity of presentation, we use the terminology

“terminal node” for both partDSA and CART in this section, with the understanding that the

meaning in the case of partDSA is really a terminal partition.

Prediction Concordance: The concordance index, or c-index, is a measure of association

between the predicted and observed outcomes (Harrell et al., 1982). The c-index computes the

fraction of pairs in which the observation having the shorter event time also has the shorter

model-predicted event time. A c-index of 0.50 indicates that the model performs no better than

chance while a c-index of 1 indicates a model with perfect predictive ability.

Our specific intent is to evaluate the predictive quality of a model built using censored data

on an independent, uncensored dataset. To ensure comparability across the different methods of

regression tree construction, the predicted outcome used for all methods is the node-specific es-

timated average survival time derived from the training set data. Since outcomes in the training

set are subject to right censoring, each node-specific mean survival time estimate is computed

using an IPCW estimate. Each test set member is then classified into a terminal node of the

resulting tree based only on their covariates and assigned the corresponding estimated average

survival time as their predicted outcome. The c-index compares these observed and predicted

outcomes. In addition to the estimated mean survival time, we also used the corresponding

IPCW-estimated proportion surviving as of the median survival time as the predicted outcome;

since the results were almost identical, they are not reported here.
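Concretely, the node-specific IPCW mean assigns each uncensored subject in the node the weight Δ_i/Ĝ(T_i), where Ĝ is the estimated censoring survivor function, and censored subjects receive weight zero. The sketch below illustrates this; the function name and toy numbers are ours, not part of the partDSA software.

```python
import numpy as np

def ipcw_node_mean(times, deltas, ghat, node_mask):
    """IPCW estimate of the mean survival time within one node R_j.

    times: observed (possibly censored) follow-up times T_i
    deltas: event indicators (1 = event observed, 0 = censored)
    ghat: estimated censoring survivor probabilities Ghat(T_i)
    node_mask: boolean membership indicators I(W_i in R_j)
    The weight Delta_i / Ghat(T_i) is zero for censored subjects.
    """
    w = deltas / ghat * node_mask
    return np.sum(w * times) / np.sum(w)

# Toy node: three observed events and one censored subject.
t = np.array([2.0, 4.0, 6.0, 5.0])
d = np.array([1, 1, 1, 0])
g = np.array([0.9, 0.8, 0.5, 0.7])  # hypothetical Kaplan-Meier values
mask = np.array([True, True, True, True])
```

Note that the censored subject (time 5.0) contributes nothing; subjects with heavier censoring pressure (small Ĝ) are up-weighted to compensate.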

Because each simulated test set only includes fully observed outcomes, no further adjustment

for censoring is required when computing the c-index (e.g., Koziol and Jia, 2009; Liu and Jin,

2009). However, because all tree-based methods use piecewise constant estimators, it is possible

(indeed, expected) that subjects with different outcomes have tied predicted values. We chose

to exclude the ties in order to focus on the concordance of predicted and observed outcomes for

observations assigned to different terminal nodes. In the results tabulated below, the c-indices

are reported as the ratio of each method’s average c-index to a baseline method’s average c-index

for each level of censoring. The baseline method is indicated for each simulation in the text

and the table captions. A ratio above one indicates improved performance over the baseline; a

ratio below one indicates decreased performance.
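A minimal sketch of the c-index computation used here, with pairs having tied event times or tied predicted values excluded; the O(n²) pairwise enumeration and the helper name are illustrative.

```python
from itertools import combinations

def c_index_excluding_ties(event_times, predictions):
    """Concordance index over pairs with distinct predicted values.

    Pairs whose predicted outcomes are tied (i.e., subjects falling in
    the same terminal node) are excluded, as are pairs with tied event
    times.
    """
    concordant = usable = 0
    for i, j in combinations(range(len(event_times)), 2):
        if event_times[i] == event_times[j] or predictions[i] == predictions[j]:
            continue
        usable += 1
        shorter = i if event_times[i] < event_times[j] else j
        longer = j if shorter == i else i
        if predictions[shorter] < predictions[longer]:
            concordant += 1
    return concordant / usable if usable else float("nan")
```

Excluding tied predictions restricts attention to pairs of subjects assigned to different terminal nodes, matching the convention described above.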

Prediction Error: A second method of evaluation compares, for each of the four regression

tree estimation methods, two predicted outcomes computed for each test set member. The first

predicted outcome is determined using the true tree structure. Specifically, all test set subjects

are run down the partDSA tree and also down the corresponding CART tree. The predicted

outcome for subject i, say ψ_i^TT, is taken to be the average survival time of all members assigned


to the same terminal node. The second predicted outcome is determined in exactly the same

fashion, but instead runs each test set member down the tree built using the training set; call the

resulting predicted outcome ψ_i^TS. We measure the distance between the two predicted outcomes

using the empirical loss L* = (1/5000) Σ_{i=1}^{5000} (ψ_i^TT − ψ_i^TS)². Observe that L* = 0 if both trees classify the test set subjects into exactly the same groupings. Since outcomes are assumed to be continuously distributed, L* increases away from zero as the degree of heterogeneity in risk group assignment increases. Hence, the “prediction error,” as implemented here, really intends to measure whether the estimated tree consistently stratifies subjects into the same risk groups as does the true (unknown) tree. Similarly to the c-indices, prediction errors are reported as ratios. Here a ratio above one indicates greater error in comparison to the baseline method and a ratio below one indicates decreased error.

Risk Stratification: The last criterion focuses on the ability of each method to separate patients into groups of differing risk. In particular, for each of the 1000 independent test sets, node-specific Kaplan-Meier (KM) curves are computed; notably, since test set subjects are all uncensored, each KM estimate is equal to the corresponding node-specific empirical survivor function. Then, for each estimation method, and for the subset of the 1000 estimated trees that consist of either two or three terminal nodes, we computed the corresponding average node-specific KM estimate and 0.025 and 0.975 percentiles at each time point. For partDSA, we expect a high proportion of simulated datasets with only two survival curves and small standard errors. For CART, we instead expect to observe a high proportion of cases with three risk groups; however, as explained earlier, two of the three corresponding survival curves should in theory be indistinguishable.
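The empirical loss L* behind the prediction-error criterion is simply a mean squared difference between the two sets of predicted outcomes; a minimal sketch (function name ours):

```python
import numpy as np

def empirical_loss(psi_tt, psi_ts):
    """L* = (1/n) * sum_i (psi_i^TT - psi_i^TS)^2, comparing predictions
    from the true tree structure (psi_tt) with predictions from the tree
    fit on the training set (psi_ts). L* = 0 exactly when every test
    subject receives the same prediction from both trees.
    """
    a = np.asarray(psi_tt, dtype=float)
    b = np.asarray(psi_ts, dtype=float)
    return np.mean((a - b) ** 2)
```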

3.2 Multivariate Simulations

3.2.1 Simulation 1

The first multivariate simulation is an extension of our univariate simulation study, the latter

being modeled after that in LeBlanc and Crowley (1993) and described in Web Appendix B. In

particular, there are five covariates, each given by a discrete uniform variable on the integers

1-100. Only the first two of these variables influence survival times, which are generated from

an exponential distribution with a covariate-dependent mean parameter σ. In particular, σ is

set equal to 5 if W1 > 50 or W2 > 75, and to 0.5 otherwise. Censoring times are independently

generated from a uniform distribution, the parameters being chosen to maintain the levels of

censoring indicated earlier.
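Under the stated design, the full (uncensored) data for this simulation can be generated as below; the function name and seed are illustrative, and the uniform censoring step, whose parameters depend on the target censoring level, is omitted.

```python
import numpy as np

def simulate_study1(n, rng):
    """Draw n full-data observations from the Section 3.2.1 design:
    five discrete-uniform covariates on 1..100, and exponential survival
    times with mean 5 when W1 > 50 or W2 > 75 and mean 0.5 otherwise.
    The uniform censoring step is intentionally omitted.
    """
    W = rng.integers(1, 101, size=(n, 5))           # covariates W1..W5
    sigma = np.where((W[:, 0] > 50) | (W[:, 1] > 75), 5.0, 0.5)
    T = rng.exponential(scale=sigma)                # survival times
    return W, T

W, T = simulate_study1(250, np.random.default_rng(0))  # one training set
```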

The true models for CART (with three terminal nodes) and partDSA (with two terminal

partitions) are shown in Web Figure 6. As seen in Table 1, on average all methods build slightly

larger models. They also select the signal variables consistently. Overall, partDSA_Brier (the baseline method) has the highest c-indices. The relative improvement in performance ranges up to 11%, as seen at the highest censoring level. The differences are smallest between the two partDSA methods. partDSA_Brier (the baseline method) also has the lowest prediction error. As

seen in Table 1, the relative decrease in prediction error compared to the CART methods ranges

from 21 to 83%. Figure 1 includes the KM curves for the four methods at the 30% censoring level

(additional results can be found in Web Figures 7 and 8). We see that partDSA methods build

two distinct groups of patients (80-86% of the time; Web Table 3) with significantly different

survival experiences. In contrast, CART usually selects the correct number of terminal nodes


Table 1: True and fitted model sizes, number of predictors in the fitted model, number of correct predictors selected, c-index ratio (baseline is partDSA_Brier; larger values indicate increased performance), and prediction error ratio (smaller values indicate less error) for the four methods over three censoring levels for the simulation study described in Section 3.2.1.

Censoring   Criteria                    partDSA_Brier   partDSA_IPCW   CART_IPCW   L&C_NLL
            True Model Size                  2.00            2.00          3.00       3.00
0%          Fitted Model Size                2.24            2.15          3.10       3.24
            Number of Predictors             2.21            2.10          2.04       2.10
            Number of W1, W2                 2.00            2.00          1.97       2.00
            C-index Ratio                    1               1.011         0.939      0.932
            Prediction Error Ratio           1               0.991         1.217      1.208
30%         Fitted Model Size                2.18            2.26          2.98       3.21
            Number of Predictors             2.13            2.17          1.90       2.07
            Number of W1, W2                 2.00            2.00          1.81       1.98
            C-index Ratio                    1               0.989         0.910      0.902
            Prediction Error Ratio           1               1.252         1.834      1.411
50%         Fitted Model Size                2.16            2.49          3.17       3.11
            Number of Predictors             2.12            2.27          1.91       1.98
            Number of W1, W2                 2.00            1.96          1.63       1.89
            C-index Ratio                    1               0.955         0.891      0.891
            Prediction Error Ratio           1               1.355         1.723      1.311

(77-84% of the time; Web Table 3); however, the clinical utility of these three risk groups is not

apparent, for two of these groups are estimated to have essentially identical survival experiences

(as expected). This behavior mimics that observed in the univariate simulation study.

3.2.2 Simulation 2

The second multivariate simulation is based on Model B in LeBlanc and Crowley (1992). There

are again five independent uniformly distributed predictor variables. Survival times are now

generated from a shifted exponential distribution with mean parameter equal to e^(−θ), where θ = 4 × I(W1 ≤ 0.5 ∪ W2 > 0.5). In comparison with LeBlanc and Crowley (1992), we have replaced their “and” statement in the definition of θ by an “or” statement and multiplied the

indicator expression by four to increase the signal. As before, censoring times are independently

generated from a uniform distribution.
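The corresponding generator for this design is sketched below; since the shift constant of the shifted-exponential law is not specified here, the sketch draws plain exponential times with mean e^(−θ), and the function name and seed are illustrative. Censoring is again omitted.

```python
import numpy as np

def simulate_study2(n, rng):
    """Draw n full-data observations from the Section 3.2.2 design:
    five independent Uniform(0, 1) covariates and
    theta = 4 * I(W1 <= 0.5 or W2 > 0.5).
    Exponential survival times with mean exp(-theta) are drawn; the
    shift of the shifted-exponential law is unspecified here and is
    omitted, as is the uniform censoring step.
    """
    W = rng.uniform(size=(n, 5))
    theta = 4.0 * ((W[:, 0] <= 0.5) | (W[:, 1] > 0.5))
    T = rng.exponential(scale=np.exp(-theta))
    return W, T
```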

As seen in Table 2, the methods again build slightly larger models than necessary. Interestingly, partDSA_IPCW is the only method which consistently selects the two signal variables. Overall, partDSA_IPCW (the baseline method) has the highest c-indices, with a relative improvement in performance of 16-20% over the CART methods and 5-10% over partDSA_Brier, and

also the lowest prediction error. The relative decrease in prediction error for censored data

compared to the CART methods is observed to range from 9 to 62%. It is apparent in the

top-right panel of Figure 2 that partDSA_Brier has somewhat greater difficulty in identifying


[Figure 1 here: four Kaplan-Meier panels, one per method (partDSA_IPCW, partDSA_Brier, L&C_NLL, CART_IPCW), each plotting Survival Probability (0.0-1.0) against Time (0-20), under the heading “Multivariate Simulation: 30% Censoring”.]

Figure 1: Kaplan-Meier plots for the four methods illustrating the survival experience of the

corresponding risk groups for the simulation study described in Section 3.2.1. There are 2

survival curves and 2 sets of 95% pointwise confidence limits in each top panel; there are 3

survival curves and 3 sets of 95% pointwise confidence limits in each bottom panel. Two of the

survival curves essentially overlap in the bottom panel; see Section 3.1 for discussion.


Table 2: True and fitted model sizes, number of predictors in the fitted model, number of correct predictors selected, c-index ratio (baseline is partDSA_IPCW; larger values indicate increased performance), and prediction error ratio (smaller values indicate less error) for the four methods over three censoring levels for the simulation study described in Section 3.2.2.

Censoring   Criteria                    partDSA_Brier   partDSA_IPCW   CART_IPCW   L&C_NLL
            True Model Size                  2.00            2.00          3.00       3.00
0%          Fitted Model Size                2.30            2.14          3.10       3.08
            Number of Predictors             2.33            2.09          2.03       2.03
            Number of W1, W2                 2.00            2.00          1.94       2.00
            C-index Ratio                    0.945           1             0.813      0.816
            Prediction Error Ratio           2.212           1             1.311      0.755
30%         Fitted Model Size                2.40            2.26          3.00       3.08
            Number of Predictors             2.44            2.15          1.92       2.03
            Number of W1, W2                 1.96            2.00          1.82       1.98
            C-index Ratio                    0.919           1             0.820      0.822
            Prediction Error Ratio           2.228           1             1.615      1.093
50%         Fitted Model Size                2.38            2.38          3.12       2.95
            Number of Predictors             2.40            2.21          2.02       1.87
            Number of W1, W2                 1.87            2.00          1.89       1.79
            C-index Ratio                    0.921           1             0.840      0.830
            Prediction Error Ratio           1.725           1             1.137      1.471

the low risk group (i.e., due to the comparatively wide confidence intervals); this may be a

consequence of the choice of loss function, which uses just a single time point in its definition. As seen in the previous KM plots, the two risk groups identified by the partDSA

methods continue to have distinct survival experiences, whereas this is not the case for the

three risk groups identified under the CART model.

4 Data Analysis

The prognostic model building ability of the four methods is assessed with the use of a German

Breast Cancer Study Group (GBCSG) data set, fully detailed in Schumacher et al. (1994)

and publicly available in the R package ipred (Peters et al., 2002). These data come from a prospective clinical trial on the treatment of node-positive breast cancer patients. There are 686 patients, each with eight measured prognostic variables: menopausal status,

age, hormonal therapy (HT), tumor size (TS), tumor grade (TG), positive lymph nodes (PN),

progesterone receptor (PR), and estrogen receptor (ER). Recurrence-free survival is the outcome

of interest and 56% of the patients are censored.

For all methods, model selection (i.e., number of terminal nodes or partitions) was deter-

mined via the first minimum of the 5-fold cross-validated risk. Running the two IPCW methods

(partDSA_IPCW and CART_IPCW), where splits are based on the L2 loss function (hence, mean


[Figure 2 here: four Kaplan-Meier panels, one per method (partDSA_IPCW, partDSA_Brier, L&C_NLL, CART_IPCW), each plotting Survival Probability (0.0-1.0) against Time (0-4), under the heading “Multivariate Simulation 2: 30% Censoring”.]

Figure 2: Kaplan-Meier plots for the four methods illustrating the survival experience of the

corresponding risk groups for the simulation study described in Section 3.2.2. There are 2

survival curves and 2 sets of 95% pointwise confidence limits in each top panel; there are 3

survival curves and 3 sets of 95% pointwise confidence limits in each bottom panel. Two of the

survival curves essentially overlap in the bottom panel; see Section 3.1 for discussion.
