Comparing Treatments in the Presence of Crossing Survival Curves: An Application to Bone Marrow Transplantation

Brent R. Logan*, John P. Klein, and Mei-Jie Zhang

Division of Biostatistics, Medical College of Wisconsin, 8701 Watertown Plank Road, Milwaukee,

Wisconsin 53226, U.S.A.

Summary

In some clinical studies comparing treatments in terms of their survival curves, researchers may

anticipate that the survival curves will cross at some point, leading to interest in a long-term survival

comparison. However, simple comparison of the survival curves at a fixed point may be inefficient,

and use of a weighted log-rank test may be overly sensitive to early differences in survival. We

formulate the problem as one of testing for differences in survival curves after a prespecified time

point, and propose a variety of techniques for testing this hypothesis. We study these methods using

simulation and illustrate them on a study comparing survival for autologous and allogeneic bone

marrow transplants.

Keywords

Censored data; Crossing hazard functions; Generalized linear models; Log-rank test; Pseudo-value

approach; Weibull distribution; Weighted Kaplan–Meier statistic

1. Introduction

When comparing treatments in terms of their time-to-event distribution, there may be reason

to believe that the survival curves will cross, and standard comparison techniques in such cases

could lead to misleading results. Often researchers in such cases will focus on which treatment

has a better long-term survival probability. In particular, this research is motivated by a

common scenario in hematopoietic stem cell transplantation, illustrated using a study

comparing autologous and allogeneic bone marrow transplants for follicular lymphoma (Van

Besien et al., 2003). The sample contained 175 patients with an HLA-identical sibling

allogeneic transplant and 596 patients with an unpurged autologous transplant. We are

interested in comparing the disease-free survival (DFS) curves (i.e., the probability a patient

is alive and disease free) between the two treatment arms. However, this comparison is

complicated by the likely possibility that the hazard functions from these two treatments will

cross at some point. Allogeneic transplants tend to have a higher mortality early due to the

toxicity of the higher doses of chemotherapy used to ablate the immune system as well as graft-

versus-host disease from the donor cells. However, the donor cells may provide a graft-versus-

lymphoma effect resulting in less relapse of the primary disease in long-term survivors. In

contrast, autologous transplants have lower early toxicity because patients do not experience

© 2008, The International Biometric Society

*email: blogan@mcw.edu.

6. Supplementary Materials

Web Appendix A, referenced in Section 2.3, is available under the Paper Information link at the Biometrics website

http://www.biometrics.tibs.org.

NIH Public Access

Author Manuscript

Biometrics. Author manuscript; available in PMC 2009 September 29.

Published in final edited form as:

Biometrics. 2008 September ; 64(3): 733–740. doi:10.1111/j.1541-0420.2007.00975.x.

NIH-PA Author Manuscript


graft-versus-host disease. However, these patients do not benefit from the protection against

relapse from the graft-versus-lymphoma effect, so they tend to experience more relapses. These

contrasting profiles are illustrated by the Kaplan–Meier curves for this dataset in Figure 1. The

DFS of the allogeneic transplant arm drops quickly early, but then levels off, whereas the DFS

of the autologous transplant arm decreases more slowly but does not plateau. The two curves

appear to cross at about 1 year. The unweighted log-rank test will have poor power to detect

such a difference in survival curves.

Generally, the question of interest is, which if any of the treatments yields better long-term

survival? Several strategies for addressing this question are possible. One may pick a single

long-term time point, and compare the survival estimates between the treatment groups at this

single time point, as is discussed in Klein et al. (2007). However, there are some potential

problems with this approach. First, the results may be sensitive to the time point chosen. Second, this

strategy ignores events occurring after the selected time point. For example, in a clinical trial

comparing two treatments one might select 3-year survival as the primary endpoint. However,

if everyone is followed up for 3 years and accrual occurs over a period of time such as 2 years,

then there is substantial information on later events (between 3 and 5 years) for patients enrolled

early in the trial. Therefore, selecting a single time point may be inefficient.

Another alternative would be to estimate simultaneous confidence bounds for the difference

in survival curves (Parzen, Wei, and Ying, 1997; Zhang and Klein, 2001), which identify time

regions where the two treatments are different. However, because of the large number of time

points being considered and adjusted for, these tend to be quite wide and may be inefficient in

determining late differences between treatments.

Another option would be the weighted log-rank test, with more weight placed on later time

points to reflect interest in late events. For example, Fleming and Harrington (1981) proposed

a class of weighted log-rank tests with a weight function equal to Ŝ(t)^ρ {1 − Ŝ(t)}^γ. Here setting

ρ = 0 and γ = 1 would place more weight on late events and hence late differences in the hazard

rates and/or the survival curves. However, even though the weight is placed appropriately, this

test is still designed to test the null hypothesis that the entire survival curves are equal. If we

are focused instead on late differences rather than the entire survival curve, even the weighted

log-rank test may be overly sensitive to early differences in the survival curves. We will

illustrate this point in simulations presented later in the article.
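To make the weighting concrete, the following is a minimal Python sketch of a Fleming–Harrington weighted log-rank statistic. It is not the authors' code: the function name `fh_weighted_logrank`, the use of the left-continuous pooled Kaplan–Meier estimate Ŝ(t−) in the weight, and the hypergeometric variance are assumptions chosen for illustration.

```python
import math

def fh_weighted_logrank(times, events, group, rho=0.0, gamma=1.0):
    """Standardized Fleming-Harrington weighted log-rank statistic (sketch).

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if censored
    group  : 1 for treatment, 0 for control
    Assumption: the weight at each event time uses the pooled
    Kaplan-Meier estimate just before that time, S(t-).
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    s_pooled = 1.0          # pooled Kaplan-Meier estimate just before t_j
    num, var = 0.0, 0.0
    for tj in event_times:
        y = sum(1 for t in times if t >= tj)                         # at risk, pooled
        y1 = sum(1 for t, g in zip(times, group) if t >= tj and g == 1)
        d = sum(1 for t, e in zip(times, events) if t == tj and e == 1)
        d1 = sum(1 for t, e, g in zip(times, events, group)
                 if t == tj and e == 1 and g == 1)
        w = s_pooled ** rho * (1.0 - s_pooled) ** gamma
        num += w * (d1 - d * y1 / y)                                 # observed - expected
        if y > 1:
            var += w ** 2 * d * (y1 / y) * (1.0 - y1 / y) * (y - d) / (y - 1)
        s_pooled *= 1.0 - d / y
    return num / math.sqrt(var)   # raises ZeroDivisionError if no informative event times

# Toy data (hypothetical): all early events occur in group 1
t = [1, 2, 3, 4, 5, 6, 7, 8]
e = [1] * 8
g = [1, 1, 1, 1, 0, 0, 0, 0]
print(fh_weighted_logrank(t, e, g, rho=0.0, gamma=1.0))
```

With ρ = 0 and γ = 1 the first event time receives weight zero and later event times receive increasing weight; setting ρ = γ = 0 recovers the unweighted log-rank statistic. Either way the statistic still accumulates contributions from every event time, including those before any cutoff of interest.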

We propose a specific formulation of the hypothesis to focus on late differences in the survival

curve. We assume that a time point t0 can be prespecified, so that survival curves are presumed

likely to cross prior to that time point if at all. The null hypothesis is H0 : S1(t) = S0(t), for all

t ≥ t0, where S1(t) and S0(t) denote the survival curves at time t for the treatment and control

groups, respectively, versus the alternative, H1 : S1(t) ≠ S0(t), for some t ≥ t0. This formulation

allows us to specify exactly over what time range the comparison of treatments is of interest,

for example, after t0.

Note that this null hypothesis is equivalent to H0 : {S1(t0) = S0(t0)} ⋂ {λ1(t) = λ0(t), t > t0},

where λk(t) represents the hazard function at time t for group k, k = 0, 1. This formulation allows

us to separate the hypotheses into two sub-hypotheses: the hypothesis of equality of survival

at t0 and the hypothesis of no difference in the hazard function after t0. The composite

hypothesis can then be tested using combinations of test statistics for each of the sub-

hypotheses.
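The equivalence of the two formulations can be seen from the usual relation between the survival and hazard functions. The short derivation below is a sketch that assumes each S_k is absolutely continuous with S_k(t0) > 0; these regularity assumptions are added here for illustration and are not stated in the text.

```latex
% For t >= t_0, the survival function factors at t_0:
S_k(t) = S_k(t_0)\,\exp\left\{-\int_{t_0}^{t} \lambda_k(u)\,du\right\},
\qquad t \ge t_0,\; k = 0, 1.

% (<=) If S_1(t_0) = S_0(t_0) and \lambda_1(u) = \lambda_0(u) for u > t_0,
% both factors agree, so S_1(t) = S_0(t) for all t >= t_0.

% (=>) If S_1(t) = S_0(t) for all t >= t_0, taking t = t_0 gives
% S_1(t_0) = S_0(t_0), and differentiating -\log S_k(t) gives
\lambda_1(t) = -\frac{d}{dt}\log S_1(t) = -\frac{d}{dt}\log S_0(t) = \lambda_0(t),
\qquad t > t_0.
```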

In the next section, we describe possible methods for testing this null hypothesis.


2. Methods

2.1 Notation

The data consist of n1 + n0 = n subjects with event times tj. Let the distinct event times be

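ordered such that t1 < ⋯ < tm. At time tj let dkj denote the number of events and Ykj denote the

number at risk in the kth group, k = 0, 1.

The Kaplan–Meier estimate of survival in group k is given by

$$\hat S_k(t) = \prod_{t_j \le t} \left(1 - \frac{d_{kj}}{Y_{kj}}\right).$$

The variance of the Kaplan–Meier estimate is estimated by Greenwood's formula given by

$$\hat V\{\hat S_k(t)\} = \hat S_k(t)^2\, \hat\sigma_k^2(t),$$

where

$$\hat\sigma_k^2(t) = \sum_{t_j \le t} \frac{d_{kj}}{Y_{kj}(Y_{kj} - d_{kj})}.$$

The Nelson–Aalen estimate of the cumulative hazard function is

$$\hat\Lambda_k(t) = \sum_{t_j \le t} \frac{d_{kj}}{Y_{kj}},$$

with variance estimated by

$$\hat V\{\hat\Lambda_k(t)\} = \sum_{t_j \le t} \frac{d_{kj}}{Y_{kj}^2}.$$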
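The estimators in this subsection can be computed directly from the event and at-risk counts. The following Python sketch handles a single group; the function name `km_greenwood_na` and the toy data are hypothetical, and the variance of the Nelson–Aalen estimate is assumed to take the common form Σ d/Y².

```python
import math

def km_greenwood_na(times, events):
    """Kaplan-Meier, Greenwood variance and Nelson-Aalen for one group (sketch).

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if censored
    Returns one tuple (t_j, S_hat, var_S, Lambda_hat, var_Lambda)
    per distinct event time t_j.
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    surv, sigma2, cumhaz, var_cumhaz = 1.0, 0.0, 0.0, 0.0
    rows = []
    for tj in event_times:
        at_risk = sum(1 for t in times if t >= tj)                            # Y_j
        deaths = sum(1 for t, e in zip(times, events) if t == tj and e == 1)  # d_j
        surv *= 1.0 - deaths / at_risk
        if at_risk > deaths:
            sigma2 += deaths / (at_risk * (at_risk - deaths))
        else:
            sigma2 = math.inf   # curve reaches zero; Greenwood variance becomes NaN
        cumhaz += deaths / at_risk
        var_cumhaz += deaths / at_risk ** 2   # assumed variance form for Nelson-Aalen
        rows.append((tj, surv, surv ** 2 * sigma2, cumhaz, var_cumhaz))
    return rows

# Toy data (hypothetical): five subjects, two of them censored
for row in km_greenwood_na([1, 2, 3, 4, 5], [1, 0, 1, 1, 0]):
    print(row)
```

Applying the function separately to the treatment and control groups gives the group-specific quantities used by the tests in the following subsections.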

2.2 Comparisons Based on a Single Time Point

The simplest method for testing the null hypothesis that the survival curves after time t0 are

equal would be to compare the survival curves at a selected point t′ > t0, using the difference

in Kaplan–Meier estimates of survival at t′. One can also construct a test statistic based on

transformations of the survival probabilities at a fixed point in time, as described in Klein et

al. (2007). Their recommendations were that the complementary log–log transformation of the

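survival probability works the best overall, resulting in the test statistic

$$X^2 = \frac{\left[\log\{-\log \hat S_1(t')\} - \log\{-\log \hat S_0(t')\}\right]^2}{\hat\sigma_1^2(t')/\{\log \hat S_1(t')\}^2 + \hat\sigma_0^2(t')/\{\log \hat S_0(t')\}^2}, \qquad (1)$$

where σ̂k²(t′) = Σ_{tj ≤ t′} dkj/{Ykj(Ykj − dkj)} is the sum appearing in Greenwood's formula.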

Alternatively, one could compare the cumulative hazard functions at a selected time point t′ >

t0, using the Nelson–Aalen estimates at t′, Λ̂k(t′). Tests based on the cumulative hazard function

should behave similarly to those using a log transformation of the survival function.
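A fixed-time-point comparison on the complementary log–log scale can be sketched in Python as follows. This is an illustration rather than the authors' implementation: it assumes the statistic is the squared difference of log{−log Ŝk(t′)} divided by its delta-method variance, and that it is referred to a chi-square distribution with one degree of freedom. The names `km_at` and `cloglog_test` are hypothetical.

```python
import math

def km_at(times, events, t):
    """Kaplan-Meier estimate S_hat(t) and the Greenwood sum sigma2_hat(t)."""
    surv, sigma2 = 1.0, 0.0
    for tj in sorted({u for u, e in zip(times, events) if e == 1 and u <= t}):
        y = sum(1 for u in times if u >= tj)                            # at risk
        d = sum(1 for u, e in zip(times, events) if u == tj and e == 1)  # events
        if d == y:
            return 0.0, math.inf   # curve has dropped to zero
        surv *= 1.0 - d / y
        sigma2 += d / (y * (y - d))
    return surv, sigma2

def cloglog_test(times1, events1, times0, events0, t):
    """Compare two survival curves at time t on the log(-log) scale (sketch)."""
    s1, v1 = km_at(times1, events1, t)
    s0, v0 = km_at(times0, events0, t)
    if not (0.0 < s1 < 1.0 and 0.0 < s0 < 1.0):
        raise ValueError("both Kaplan-Meier estimates must lie strictly in (0, 1)")
    num = (math.log(-math.log(s1)) - math.log(-math.log(s0))) ** 2
    den = v1 / math.log(s1) ** 2 + v0 / math.log(s0) ** 2   # delta-method variance
    stat = num / den
    p_value = math.erfc(math.sqrt(stat / 2.0))   # upper tail of chi-square, 1 df
    return stat, p_value

# Toy data (hypothetical)
stat, p = cloglog_test([1, 2, 3, 4, 5, 6], [1, 1, 1, 0, 1, 0],
                       [2, 4, 6, 8, 10, 12], [1, 0, 0, 1, 0, 0], t=5)
print(stat, p)
```

The transformation is undefined when either estimate equals 0 or 1 at t′, which is why the sketch raises an error in that case instead of returning a value.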


2.3 Weighted Kaplan–Meier Test

One way to compare the entire survival curve after t0 is to consider a modification of the

weighted Kaplan–Meier statistic (Pepe and Fleming, 1989, 1991), where the integral is taken

over the restricted range after t0. The statistic is given by

where ŵ(t) = {n1Ĝ1(t) + n0Ĝ0(t)}−1nĜ1(t)Ĝ0(t) and Ĝk(t) is the Kaplan–Meier estimate of the

censoring distribution. Let ℓ denote the index of the event time such that tℓ−1 ≤ t0 < tℓ. The

(unpooled) variance of this statistic can be estimated by

where

and . A sketch of the derivation of this

variance expression is given in Web Appendix A. Then the standardized weighted K–M

statistic follows a standard normal distribution under the null hypothesis and is given by

(2)

2.4 Tests Based on Pseudo-Value Observations

Another test is based on a pseudo-value regression technique proposed by Andersen, Klein,

and Rosthoj (2003) and Klein and Andersen (2005). Originally applied in the context of

regression models for multistate models and competing risks data, it can also be used in the

simple survival comparison context. For a given time point τj, compute the pooled sample

Kaplan–Meier estimator, Ŝp(τj), based on all n1 + n0 observations and the Kaplan–Meier

estimator based on the sample of size n1 + n0 − 1 with the ith observation removed, Ŝp^(−i)(τj),

for i = 1, …, n. Define the ith pseudo-value at time τj by

$$\hat\theta_{ij} = n\,\hat S_p(\tau_j) - (n - 1)\,\hat S_p^{(-i)}(\tau_j),$$

for i = 1, …, n.
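The leave-one-out construction can be sketched in a few lines of Python. The code assumes the usual jackknife form of the pseudo-value, n times the full-sample Kaplan–Meier estimate minus (n − 1) times the leave-one-out estimate; the function names `km_surv` and `pseudo_values` and the toy data are hypothetical.

```python
def km_surv(times, events, t):
    """Kaplan-Meier estimate of S(t) from lists of times and event indicators."""
    surv = 1.0
    for tj in sorted({u for u, e in zip(times, events) if e == 1 and u <= t}):
        y = sum(1 for u in times if u >= tj)                            # at risk
        d = sum(1 for u, e in zip(times, events) if u == tj and e == 1)  # events
        surv *= 1.0 - d / y
    return surv

def pseudo_values(times, events, tau):
    """Jackknife pseudo-values of the pooled Kaplan-Meier estimate at time tau."""
    n = len(times)
    full = km_surv(times, events, tau)
    values = []
    for i in range(n):
        # Recompute the estimate with the ith observation removed
        loo = km_surv(times[:i] + times[i + 1:], events[:i] + events[i + 1:], tau)
        values.append(n * full - (n - 1) * loo)
    return values

# Without censoring the pseudo-values reduce to the indicators I(T_i > tau),
# so this prints values close to 0, 0, 1, 1 (up to floating-point rounding).
print(pseudo_values([1, 2, 3, 4], [1, 1, 1, 1], tau=2.5))
```

The quadratic cost of recomputing the estimator n times is acceptable for a sketch; in practice one would repeat the call for each of the selected time points τj and stack the results into an n × m′ array.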

To perform inference on survival curves after a fixed time t0, we use the pseudo-values defined

for event times t > t0. Let τ1 correspond to the earliest event time occurring after t0, τ2 correspond

to the next earliest event time after t0, and so forth, so that there are a total of m′ such observed

event times in the dataset. We consider a generalized linear model for the pseudo-values, given

by g(θij) = αj + βZi, for i = 1, …, n; j = 1, …, m′, where Zi is an indicator with value 1 if the

patient is in the treatment group and 0 if they are in the control group. Then given that we are

only considering pseudo-values for times t > t0, the null hypothesis H0 of equal survival curves

after t0 is equivalent to testing H0 : β = 0.

Inference on β may be performed using generalized estimating equations (GEE; Liang and

Zeger, 1986). Let μ(·) = g−1(·) be the mean function. Define dμi (β, α) to be the vector of partial

derivatives of μ(·) with respect to (β, α), where α is an m′-dimensional vector of intercepts at

time τj, j = 1, …, m′. Let Vi be a working covariance matrix. Express the pseudo-values and

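their expectations in vector notation as θ̂i = (θ̂i1, …, θ̂im′) and θi = (θi1, …, θim′). Then the

estimating equations to be solved are of the form

$$U(\beta, \alpha) = \sum_{i=1}^{n} U_i(\beta, \alpha) = \sum_{i=1}^{n} d\mu_i(\beta, \alpha)^{T}\, V_i^{-1}\, (\hat\theta_i - \theta_i) = 0.$$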


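Let (β̂, α̂) be the solution to this equation. Using results of Liang and Zeger (1986), under

standard regularity conditions, it follows that √n{(β̂, α̂) − (β, α)} is asymptotically multivariate

normal with mean 0. The covariance matrix of (β̂, α̂) can be estimated by the sandwich estimator

Σ̂(β̂, α̂) where

$$\hat\Sigma(\hat\beta, \hat\alpha) = I(\hat\beta, \hat\alpha)^{-1}\, \widehat{\mathrm{var}}\{U(\hat\beta, \hat\alpha)\}\, I(\hat\beta, \hat\alpha)^{-1},$$

$$\widehat{\mathrm{var}}\{U(\hat\beta, \hat\alpha)\} = \sum_{i=1}^{n} U_i(\hat\beta, \hat\alpha)\, U_i(\hat\beta, \hat\alpha)^{T}, \qquad U_i(\beta, \alpha) = d\mu_i(\beta, \alpha)^{T}\, V_i^{-1}\, (\hat\theta_i - \theta_i),$$

and

$$I(\beta, \alpha) = \sum_{i=1}^{n} d\mu_i(\beta, \alpha)^{T}\, V_i^{-1}\, d\mu_i(\beta, \alpha)$$

is the model-based equivalent of the information matrix (Andersen et al., 2003).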

When the number of time points or pseudo-values being included for each patient is large, this

can present numerical difficulties in several aspects. Estimation of the parameters can be slow

because there are a large number of parameters, and numerical algorithms must be used.

Furthermore, calculation of the matrix Σ̂ requires the difficult inversion of a high-dimensional

matrix I(β̂, α̂). One option is to consider a limited number of points (say 5 or 10) spread out

equally on an event scale over the time period after t0. An alternative is to use the generalized

score statistic for β (Rotnitzky and Jewell, 1990; Boos, 1992), as considered in Lu (2006) for

the pseudo-value regression context. The generalized score statistic for β when there is a single

dichotomous predictor can be shown to have a closed form, assuming an independent working

correlation matrix and using the complementary log-log link function. Let α̃j = log{−log(θ̄·j)}

be the solution for αj in the estimating equation under the null hypothesis, where

θ̄·j = n^(−1) Σ_{i=1}^{n} θ̂ij, and qj = θ̄·j log θ̄·j. The generalized score statistic for testing H0 :

β = 0 simplifies to

(3)

where the matrix element (·)11 or vector element (·)1 refers to the β component. This statistic

asymptotically follows a χ² distribution with 1 degree of freedom under the null hypothesis that β = 0. Note, however,

that the method can be biased when the censoring distribution depends on covariates.
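The closed-form quantities α̃j and qj used in the score statistic can be recovered from the complementary log–log link. The derivation below is a sketch: it takes θ̄·j to be the average of the pseudo-values θ̂ij over i = 1, …, n, assumes 0 < θ̄·j < 1, and assumes the independence working covariance has the same diagonal entry for every subject under the null hypothesis.

```latex
% Complementary log-log link: g(\theta) = \log\{-\log\theta\}, so the mean function is
\mu(\eta) = g^{-1}(\eta) = \exp\{-\exp(\eta)\}.

% Under H_0: \beta = 0 every subject has linear predictor \alpha_j at time \tau_j.
% With an independence working covariance, the \alpha_j component of the
% estimating equation reduces to
\sum_{i=1}^{n} \left[\hat\theta_{ij} - \exp\{-\exp(\alpha_j)\}\right] = 0
\;\Longrightarrow\;
\exp\{-\exp(\tilde\alpha_j)\} = \bar\theta_{\cdot j}
\;\Longrightarrow\;
\tilde\alpha_j = \log\{-\log \bar\theta_{\cdot j}\}.

% The derivative of the mean function evaluated at \tilde\alpha_j is
\left.\frac{d\mu}{d\eta}\right|_{\eta = \tilde\alpha_j}
  = -\exp(\tilde\alpha_j)\,\exp\{-\exp(\tilde\alpha_j)\}
  = (\log \bar\theta_{\cdot j})\,\bar\theta_{\cdot j}
  = q_j .
```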

The pseudo-value regression technique offers several potential improvements over the other

methods studied in this article. First, it allows for straightforward inclusion of additional

covariates in the generalized linear model. Although other methods discussed here also can be

extended to include additional covariates, the generalized linear model framework makes this

extension very straightforward. Another advantage is that the pseudo-value regression

approach allows one to model the effect of treatment as a time-dependent predictor. Even
