# The additive risk model for estimation of effect of haplotype match in BMT studies.

**ABSTRACT** In this paper we consider a problem from bone marrow transplant (BMT) studies where there is interest on assessing the effect of haplotype match for donor and patient on the overall survival. The BMT study we consider is based on donors and patients that are genotype matched, and this therefore leads to a missing data problem. We show how Aalen's additive risk model can be applied in this setting with the benefit that the time-varying haplo-match effect can be easily studied. This problem has not been considered before, and the standard approach where one would use the EM-algorithm cannot be applied for this model because the likelihood is hard to evaluate without additional assumptions. We suggest an approach based on multivariate estimating equations that are solved using a recursive structure. This approach leads to an estimator where the large sample properties can be developed using product-integration theory. Small sample properties are investigated using simulations in a setting that mimics the motivating haplo-match problem.

**0**Bookmarks

**·**

**132**Views

- [Show abstract] [Hide abstract]

**ABSTRACT:**In this paper we consider a problem from hematopoietic cell transplant (HCT) studies where there is interest on assessing the effect of haplotype match for donor and patient on the cumulative incidence function for a right censored competing risks data. For the HCT study, donor's and patient's genotype are fully observed and matched but their haplotypes are missing. In this paper we describe how to deal with missing covariates of each individual for competing risks data. We suggest a procedure for estimating the cumulative incidence functions for a flexible class of regression models when there are missing data, and establish the large sample properties. Small sample properties are investigated using simulations in a setting that mimics the motivating haplotype matching problem. The proposed approach is then applied to the HCT study.Lifetime Data Analysis 09/2012; · 0.54 Impact Factor

Page 1

The additive risk model for estimation of effect of haplotype

match in BMT studies

Thomas H. Scheike

Department of Biostatistics, University of Copenhagen

Torben Martinussen

Department of Biostatistics, University of Southern Denmark

Mei-Jie Zhang

Division of Biostatistics, Medical College of Wisconsin

Summary

In this paper we consider a problem from bone marrow transplant (BMT) studies where there is

interest on assessing the effect of haplotype match for donor and patient on the overall survival.

The BMT study we consider is based on donors and patients that are genotype matched, and this

therefore leads to a missing data problem. We show how Aalen's additive risk model can be

applied in this setting with the benefit that the time-varying haplo-match effect can be easily

studied. This problem has not been considered before, and the standard approach where one would

use the EM-algorithm cannot be applied for this model because the likelihood is hard to evaluate

without additional assumptions. We suggest an approach based on multivariate estimating

equations that are solved using a recursive structure. This approach leads to an estimator where the

large sample properties can be developed using product-integration theory. Small sample

properties are investigated using simulations in a setting that mimics the motivating haplo-match

problem.

Keywords

Aalen model; additive risk model; estimating equation; flexible modeling; haplo-match effects;

haplo-type effects; missing data; semiparametric modeling; survival data

1 Introduction

Hematopoietic cell transplantation (HCT) is a life saving procedure for many cancer

patients, but it also has side effects, especially for transplants from unrelated donors, for

example when the donor is not from the patient's family. Graft-versus-host disease (GVHD)

complications is one of the major causes for treatment related death. To reduce GVHD,

human leukocyte antigen (HLA) fully matched graft needs to be selected. When a patient

does not have an HLA-identical sibling donor, then an HLA fully matched unrelated

transplant often will be considered. For an HLA fully matched unrelated transplant, the

donor is selected such that the patient and donor have identical HLA genotypes, but they

may not necessarily have matching haplotype pairs. For example, suppose the donor and

patient have identical HLA-A/B/DRB1 alleles of {(A1, A3), (B8, B7), (DR3, DR2)}. Now,

there are four potential haplotypes that are consistent with this genotype, namely {(A1, B8,

Thomas H. Scheike, Department of Biostatistics, University of Copenhagen, Øster Farimagsgade 5B, P.O.B. 2099, DK-1014

Copenhagen K, Denmark ts@biostat.ku.dk.

NIH Public Access

Author Manuscript

Scand Stat Theory Appl. Author manuscript; available in PMC 2012 September 1.

Published in final edited form as:

Scand Stat Theory Appl. 2011 September ; 38(3): 409–423. doi:10.1111/j.1467-9469.2010.00720.x.

NIH-PA Author Manuscript

NIH-PA Author Manuscript

NIH-PA Author Manuscript

Page 2

DR3), (A3, B7, DR2)}, {(A3, B8, DR3), (A1, B7, DR2)}, {(A1, B8, DR2), (A3, B7, DR3)},

and {(A3, B8, DR2), (A1, B7, DR3)}. Here, a donor with {(A1, B8, DR3), (A3, B7, DR2)}

and a patient with {(A1, B8, DR3), (A3, B7, DR2)} are haplotype matched, but a donor with

{(A3, B8, DR3), (A1, B7, DR2)} and a patient with {(A1, B8, DR3), (A3, B7, DR2)} are

haplotype miss-matched, see Petersdorf et al. (2007) for details. Among HLA-genotype

matched transplants with unrelated donors it has not been fully investigated whether there is

an effect of haplotype matching. This is an important question since if a positive beneficial

haplotype matching effect can be identified, then patient and physician need to search for a

haplotype matched donor among those available HLA-genotype fully matched potential

donors. Among a patient and donor with identical HLA genotype, the typical probability of

a HLA haplotype match is around 70% (Petersdorf et al., 2007).

Recently, Petersdorf et al. (2007) studied the effect of major histocompatibility complex

(MHC) haplotype match of donor and patient after HLA-identical unrelated HCT for

leukemia. Their analysis was based on observing the haplotype pairs directly and therefore

was carried out as a Cox regression analysis. Identifying the haplotype pairs is a very costly,

time-consuming and not entirely unproblematic procedure. In this study, we show how

modeling of the missing haplotypes for donor and patient can also be used to address the

issue using only the observed genotype and failure time data. For illustration purpose, we

consider a similar cohort of patients with HLA-identical unrelated HCT for acute leukemia.

The data was selected from the Center for International Blood and Marrow Transplant

Research (CIBMTR). We found a strong time-varying effect for the haplotype matching,

and therefore the use of Cox's proportional hazards model is not appropriate to assess the

effect of haplo-type match for this transplant data. We instead consider Aalen's additive

intensity model (Aalen, 1989). This model allows the haplotype matching effect to be

varying over time and the time-dependent excess risk can be estimated directly. This excess

risk translates directly into a simple relative survival percentage. One particular advantage of

this approach is that the absolute size of the effect is directly accessible and that the model

does not rely on any particular structure for the effect.

Missing data for survival analysis has been considered in detail in many different settings

for Cox's regression model. For example with missing partial covariates for some patients

and in frailty models for unobserved cluster effect. Estimating haplotype effects using Cox's

regression model has been considered in a large number of papers (Schaid et al., 2002;

Zaykin et al., 2002; Schaid, 2004; Zhao et al., 2003; Tregouet & Tiret, 2004; Lin & Zeng,

2006; Epstein & Satten, 2003; Chen et al., 2004; Scheike et al., 2010). One of the attractions

for fitting a Cox's regression model is that the EM-algorithm can be applied in this setting.

This makes it easy to deal with the missing data. Alternative estimating equations have been

considered in Zucker (2005); Gorfine et al. (2005); Scheike (2006); Scheike et al. (2010).

There has been no related statistical work on how to fit an additive intensity model in this

setting where no simple likelihood is available. In this paper, we suggest a new approach

that fits the additive risk model without smoothing or additional assumptions. The proposed

method is simple to implement and use. The basic idea is to solve the estimating equations

recursively to deal with the structure of the model appropriately. It turns out that the large

sample properties of the proposed estimators can be derived based on product integration

theory (Gill & Johansen, 1990).

The model and estimating procedures are presented in section 2. Section 3 gives large

sample properties and also a test for investigating time-invariance. Small sample performs

are assessed by simulation studies that are presented in section 4. In section 5 we consider a

worked example that assesses the effect of haplotype matching for HCT patients. Section 6

concludes with some final remarks. Proofs are collected in the Appendix.

Scheike et al.Page 2

Scand Stat Theory Appl. Author manuscript; available in PMC 2012 September 1.

NIH-PA Author Manuscript

NIH-PA Author Manuscript

NIH-PA Author Manuscript

Page 3

2 The additive risks model for effect of haplo-match

Let be the failure time of interest and let C be a potential censoring time such that we

observe

covariates X and observed haplotype pairs for the donor Hd = (Hd1, Hd2) and the patient Hp

= (Hp1, Hp2), the counting process N(t) has an additive intensity

and = . Let N(t) = I(T ≤ t, δ = 1). We assume that, given

(1)

where R(t) = I(T ≥ t) indicates individual is at risk at time t, and X(Hd, Hp) is a function of

the observed covariates and haplotypes of donors and patients. In this paper, we consider the

design where X(Hd, Hp)T = {XT, I(Hd = Hp)}. This simple design gives the effect of

haplotype matching between donor and patient.

If Hd and Hp are known, a simple standard survival analysis based on the additive model can

be used. However, the problem here is that we only observe the underlying HLA genotypes

related to Hd and Hp. For HLA-identical unrelated transplant data, the patient and donor

have identical genotypes, but may not necessarily have matched haplotype pairs (see section

1).

To estimate the effect of haplo-match we need to model how the haplotype pairs relate to the

observed genotypes. We assume that the probability of a specific haplotype pair for both

donor and patient H = (hi, hj) can be computed under Hardy-Weinberg equilibrium as

(2)

where πi, i = 1, .., K gives the distribution of the considered haplotypes. One may consider

any relevant parametric model for the frequencies of the haplotype pairs. Explicitly, we

parameterize the haplotype frequencies as

with ηj ∈ ] − ∞, ∞[ and with the identifiability restriction ηK = 0 (Σ πi = 1). We denote the

haplotype parameters as = ν = (η1,…, ηK−1). For the HCT data, these parameters are

unknown and must be estimated from the data, but one may also search known data-bases to

get estimates of these parameters. We here treat the haplotype parameters as unknown. In

the HCT data, we have a very rich haplotype space. Thus, a very large number of parameters

need to be estimated and since we only have limited observed data, we will make further

simplifying assumptions about the haplotype frequencies (see section 5).

Based on this model we can infer the conditional distribution of the underlying haplotype

pairs given the observed genotype, P (H = h|G = g). The observed intensity given genotype

can be written as

Scheike et al. Page 3

Scand Stat Theory Appl. Author manuscript; available in PMC 2012 September 1.

NIH-PA Author Manuscript

NIH-PA Author Manuscript

NIH-PA Author Manuscript

Page 4

where

the observed genotype. For unrelated donors and patients, we can assume that P(Hd = hd, Hp

= hp|G) = P(Hd = hd|G) × P(Hp = hp|G). For related donors and patients this independence

assumption does not hold. Here, the observed hazard function has an additive structure and

it also depends on the cumulated regression function evaluated at time t−, i.e. just before

time t.

and Σhd,hp∈G is the sum over all haplotype pairs that are consistent with

The likelihood for a subject is

and the score equation for ν can be partitioned into two parts

where UG is the part of the score that relates to Pν(G) and UN relates to the information

contained in the survival data.

For the nonparametric additive structure of the hazard function, the likelihood is hard to

work with even when haplotype pairs, H, are fully observed. Therefore, we consider an

alternative approach for estimating the nonparametric regression functions. We assume that

n i.i.d. observations are available in the time-interval [0, τ], for some τ < ∞. Let N(t) =

{N1, ..,Nn(t)}T be the n-dimensional counting process of all subjects with corresponding

intensity λ(t) = {λ1(t), …, λn(t)}T.

We first note that estimation can be done recursively. Assuming for the moment that ν is

known and for all t smaller than the first jump-time. The intensity of Ni(t) can be

written as , where

square estimator

. This leads to a least

where the ith row of

of , which is given by

and is the generalized weighted inverse

Scheike et al.Page 4

Scand Stat Theory Appl. Author manuscript; available in PMC 2012 September 1.

NIH-PA Author Manuscript

NIH-PA Author Manuscript

NIH-PA Author Manuscript

Page 5

Here the diagonal elements of the weight function, W are N (t, ν,Gi, Xi)/T(t, ν, Gi, Xi). The

efficient weight is equivalent to the efficient weight in the one-dimensional case, which

involves inverse of X(hd, hp)T α(t), and it can be implemented using a two-stage approach as

for the additive risk model (Huffer & McKeague, 1991).

Next, we look at an estimating equation for ν based on the survival data. We first need some

definitions. Let the ith row of

where

is the derivative of the intensity with respect to ν multiplied by the weight described above.

Here Dνh(x) denotes the derivative with respect to ν evaluated at x.

This leads to an estimating equation for the unknown ν based on the survival data

We now replace A(t) with Â(t) and add the score contribution coming from P (G) to obtain

the following estimating equation for ν:

(3)

Then, ν is estimated by solving U(ν, Â) = 0 and denoted by . Finally, we then estimate A(t)

by least square estimator with .

One may ignore the information about ν contained in the survival data and base estimation

solely on the genotype data. This we refer to as the two-stage approach.

3 Large sample properties

In this Section we outline the large sample properties and derive estimators for the variance

of the suggested estimators. When the estimating function, U, is evaluated in the true

parameter values ν0 and A it becomes a martingale, but the asymptotic analysis is

complicated by the fact that A is estimated using a recursively constructed estimator.

With the proper root of (3) and with additional regularity conditions it follows that it is

consistent and asymptotically normal, we refer to the Appendix for further details. But the

key arguments are based on the one dimensional case as in Bagdonavicius & Nikulin

(1999);Chen et al. (2002);Dabrowska (2005);Gorfine et al. (2005);Zucker (2005). The key

to the extension to the multivariate additive risk model is based on the powerful and elegant

product integration theory given by Gill & Johansen (1990). This also is utilized by

Dabrowska (2005).

It can be established that

Scheike et al.Page 5

Scand Stat Theory Appl. Author manuscript; available in PMC 2012 September 1.

NIH-PA Author Manuscript

NIH-PA Author Manuscript

NIH-PA Author Manuscript

Page 6

where Ui(t, ν0) is defined in the second equation,

martingale and qi(s, ν) is a predictable function given in (9) in the Appendix, and

score for subject i corresponding to the genotype log-likelihood. Based on this i.i.d.

decomposition a simple robust estimator of the variance of n−1/2U(t, ν0) is given as

is the basic

the

(4)

where

by using the estimates and empirical versions of all expressions.

and and

are estimators of qi and Mi obtained

Based on this it can be derived that

with covariance matrix that can be consistently estimated by

converges to a zero-mean normal distribution

where denotes minus of the derivative of U with respect to ν.

It also follows that

establish an i.i.d. decomposition. It turns out, based on a Taylor series expansion, that (up to

an op(1) term)

converges to a Gaussian process. The key is again to

where the Li's are defined in (10) in the Appendix and they are i.i.d. processes.

Let denote the estimator of Li by plugging in the estimates of the unknown quantities, see

Appendix for details. The variance of n1/2{Â(t) – A(t)} may therefore be estimated by

(5)

The asymptotic distribution can be approximated by a resampling scheme

(6)

where G1, …,Gn are independent standard normals and are independent of the counting

processes and their covariates.

By a resampling technique we can construct (1 – α)% confidence bands for each cumulative

regression coefficient function of Aj(t), for t ∈ [0, τ], and derive a supremum test for testing

the hypothesis of H0 : αj(t) = 0, for t ∈ [0, τ]. We omit the details here.

Scheike et al.Page 6

Scand Stat Theory Appl. Author manuscript; available in PMC 2012 September 1.

NIH-PA Author Manuscript

NIH-PA Author Manuscript

NIH-PA Author Manuscript

Page 7

3.1 Testing of constant effects and significant effects

To derive a formal test to see if a given effect varies over time or if it leads to constant

excess risk, we consider a supremum test that we evaluate based on the resampling

technique. Considering for example the jth component of the model, αj(t), we now wish to

test H0 : αj(t) ≡ γj for t ∈ [0, τ]. As in Martinussen & Scheike (2006) a test can be based on a

supremum test statistic

Let where is the kth realization of the resampling process of

. Based on derived large sample property and under H0,

asymptotically approximated by

can be

for κ = 1, …, K. A formal test can be derived by comparing observed and simulated

4 Simulations

We simulated data similar to the BMT example data considered in section 5. The underlying

additive hazard model had a baseline at 0.3 with the effect of haplotype match being −0.2 in

the first 1.5 time units and 0 afterwards. The survival times were right censored at 3 time

units. In the 2054 patients we consider below we found that there where 20, 31 and 13

different allele types for the loci A, B, and DRB1, and that this lead to a total of 2,325

possible haplotypes. These haplotypes where asssumed to have frequencies similar to those

we found in a structured MLE based on the genotype data where we found that 1901 of the

haplotypes had frequency 9.95 * 10−5, 214 had frequency 0.00072, 192 had frequency

0.0019, 14 had frequency 0.0087, 2 had frequency 0.027, 1 had frequency 0.042, and 1 had

frequency 0.083, respectively. We tried various other haplotype distributions that were not

based directly on the underlying structured model, and we will study this issue further in

future work. We then considered sample sizes of 2400, 1200, 600 and 300 that lead to

approximately 1100, 650, 320 and 160 number of deaths on average, respectively. 2000

random samples were generated for each simulation study, and the resampling based tests

were based on 1000 resamples.

We present detailed results for sample size of 600. Figure 1(a) and 1(b) show the mean of

the estimated cumulative intensity (broken line) and the true cumulative coefficient (solid

line) for the baseline and the effect of the haplotype match, respectively. Figure 1(c) and

1(d) give the mean of the estimated standard errors (broken line) and the sample standard

deviation of the realizations (solid line) for baseline and effect of the haplotype match,

respectively. Figure 1 shows that the estimators are essentially unbiased and the standard

errors are well estimated by the derived estimator. There is some slight underestimation of

the variance, that diminishes when the sample sizes increase (figures are not shown here).

Scheike et al. Page 7

Scand Stat Theory Appl. Author manuscript; available in PMC 2012 September 1.

NIH-PA Author Manuscript

NIH-PA Author Manuscript

NIH-PA Author Manuscript

Page 8

The standard errors of the cumulative coefficients are approximately halved if the haplotype

match had been observed. We also considered the effect of estimating the haplotype

parameters and this only had minor impact on the standard errors, estimates and coverage

probabilities.

Figure 2 shows the pointwise coverage probabilities for the baseline intensity and for the

effect of haplotype match, respectively. The slight bias in the variances result in coverage

probabilities that are slightly lower for most of the time-axis. The initial problems with the

coverage probabilities are well known and due restrictions close to 0, these problems may be

remedied by choosing an appropriate transformation.

We also considered the test for constant effects developed earlier as well as the coverage of

the confidence bands. The observed coverage is considered in the time-period from 0.3 to 3

to disregard the initial problematic time period. This resulted in the observed levels reported

in Table 1.

The test for constant effect achieves the nominal 5% power for the constant effects for the

baseline and the power for the time-varying effect of the haplotype match increases with

increasing sample size. The coverage probabilities of confidence bands are in an acceptable

level. Slightly to low for the baseline, but close to the true 95% for the haplo-match effect.

The finite sample properties are well described by the estimators. Both the pointwise and

uniform description of the uncertainty works well. The slight problems are similar to those

found for the additive intensity model with known covariates.

5 Haplotype Match for BMT patients

Recently, Petersdorf et al. (2007) studied the effect of MHC haplotype match based on 246

leukemia patients. All these 246 patients received a HLA-identical unrelated HCT between

1986 and 2003. Their analysis was based on observing the haplotypes directly. This is a very

costly and time-consuming procedure. They identified 191 (78%) transplants that were

haplotype-matched. They demonstrated that the MHC haplotype match of donor and patient

had significant impact on the transplantation outcome of developing acute graft-versus-host

disease (GVHD) and recurrent malignancy, but had no impact on the treatment related

mortality (TRM) which is defined as death in complete remission and overall mortality. For

the illustration purpose, we consider a similar cohort of patients with HLA-identical

unrelated HCT for leukemia and transplanted between 1995 and 2007, and donor-patient

matched on four loci HLA-A/B/C/DRB1 based on low-resolution (2 digits) classification.

The data presented here were obtained from the Statistical Center of the Center for

International Blood and Marrow Transplant Research (CIBMTR) located at the Medical

College of Wisconsin. The analysis has not been reviewed or approved by the CIBMTR.

The CIBMTR is comprised of clinical and basic scientists who share data on their transplant

patients with CIBMTR Statistical Center.

The selected data consists of 5,145 leukemia patients (2,054 for acute myeloid leukemia

(AML), 1,171 for acute lymphoblastic leukemia (ALL), 1,037 for chronic myelogenous

leukemia (CML) and 883 for myelodysplastic syndrome (MDS)). The analysis showed that

a haplotype match only had a significant effect on AML patients. For illustration purposes,

we here focus on the overall survival of the 2,054 AML patients.

The genotype data used in estimation is based on the three-loci HLA-A, HLA-B and HLA-

DRB1. This gives a very rich haplotype space. To reduce the total number of possible

haplotypes, low-resolution (2 digits) classification of alleles at loci {A1, A2}, {B1, B2} and

{DRB11, DRB12} were used to estimated haplotype frequencies. There were 20, 31 and 13

Scheike et al. Page 8

Scand Stat Theory Appl. Author manuscript; available in PMC 2012 September 1.

NIH-PA Author Manuscript

NIH-PA Author Manuscript

NIH-PA Author Manuscript

Page 9

different allele types for the loci A, B, and DRB1, respectively. This leads to a total of 2,325

possible haplotypes. Some additional structure is needed to be able to estimate the

parameters in this setting, and we grouped rare haplotypes into groups with a common

haplotype frequency. We ended up with a model with 10 parameters where the 5 largest

frequencies have their own parameters and other frequencies where allowed to be in 5

different groups. We tried modifying the models in many different ways but the results were

quite robust to these choices. In the regression analysis below we adjust for disease stage at

pre-transplant (high risk patient (36%) versus early or intermediate risk patient (64%)), age

of patient (> 45 (50%) versus 19 – 45 (38%) versus ≤ 18 (12%) years), GVHD prophylaxis

(Cyclosporine-A (CsA) ± other (44%) versus others (56%)), graft source (bone marrow

(BM) (45%) vs peripheral blood (PB) (55%)) and year of transplant (1995–2001: early,

(25%) versus 2002–2007 (75%)). We considered the survival for the first 90 months (τ =

90). There were 1,206 patients died within this study period.

We fitted an additive model to evaluate the effect of haplo-match by XT(HP, HD) = {XT,

I(HP = HD)}, where X consists of a baseline, disease stage, age of patient, GVHD

prophylaxis, graft source and transplant year. The estimated cumulative coefficients and

95% pointwise confidence intervals (solid lines) and with 95% confidence bands (broken

lines) for covariates and haplo-match are given in Figure 3. Note that it appears that all

covariates had time-varying effects and that the haplo-match effect was also timevarying.

The formal tests for a constant excess risk lead to p-values for disease stage, age group 19 –

45, age group > 45, GVHD prophylaxis, graft source and year of transplant that are <0.001,

0.051, <0.001, 0.002, 0.014 and 0.002, respectively. The haplo-match effect lead to a p-

value of p = 0.033. Figure 3 and these tests results indicate that a similar Cox regression

analyses would not be appropriate here.

Figure 3 indicates the overall effect of matching haplotypes is significant even though the

significance test only leads to a p-value of 0.055. Note, also that the test for constant effect

also provides an alternative test for significance.

Also, the negative cumulative hazards for the haplo-matching indicates that patient with a

haplo-matched HCT has a higher survival probability. Figure 4 gives the predicted survival

probabilities of haplo-mismatch (panel a) and haplo-matched (panel b) for a young patient

(≤ 18) with early or intermediate disease at pre-transplant, received other than CsA ± others

for GVHD prophylaxis and received a PB transplant between 2002–2007. The survival

predictions are provided with 95% pointwise confidence intervals (dotted lines) and with

95% confidence bands (broken lines). The predicted 5 year survival probabilities for haplo-

mismatched and haplo-matched HCT are 55% (95% CI: 38%–72%) and 83% (64%–100%),

respectively.

This BMT example showed a positive beneficial effect in overall survival for haplotype

matched transplant. As we discussed in section 1, this positive effect could lead to a new

decision in choosing a best matched donor.

6 Discussion

The popularity of the additive hazards model risk model has grown as it is well suited to

study time-dynamics, that is whether or not effects of covariates change with time. In this

paper we have demonstrated how the additive risk model can be applied also when there is

missing data, and in particular how one can estimate effects of haplo-type match in BMT

studies.

Scheike et al.Page 9

Scand Stat Theory Appl. Author manuscript; available in PMC 2012 September 1.

NIH-PA Author Manuscript

NIH-PA Author Manuscript

NIH-PA Author Manuscript

Page 10

The large sample properties of the suggested estimators were derived using product

integration theory, and these seem to give an accurate description of the variability even for

very small samples. Furthermore, the formal test of whether or not covariate effects vary

with time worked quite well in the simulations.

In the context of the the BMT example, we studied the effect of haplotype match, and here

specific modeling of the haplotype frequencies were needed. It would be interesting to

further consider how to make effects robust to misspecification of the haplotype modeling,

and one may pursue ideas like Allen et al. (2005). Further, one may model the haplotypes

with additional complexity to allow for lacking Hardy-Weinberg Equilibrium and

recombination that is know to present on the HLA gene for the three loci we consider.

The approach is implemented in the R-package HaploSurvival, that is being developed by

the first author.

We appreciate helpful and constructive suggestions by the Editors and a referee.

This research was supported by NIH-grant 2 R01 CA54706-10, and a grant from the Danish

Research Council on “point process modeling and statistical inference”.

7 Appendix

We start by making the observation that (up to op(n−1))

where M(s) is the (vector) counting process martingale, and where the time arguments were

omitted for notational simplicity. Let

this quantity be denoted as s0(t). Note that the first term, Vn, by the martingale CLT, can be

and let the limit of

shown to converge to a Gaussian martingale process V(t) with variance

Note also that Vn(t) can be written as

.

Also, a Taylor expansion using matrix derivatives yields (up to op(n−1/2))

where the ith row of is DAjfi(ν, A(t−)), and therefore with

Scheike et al.Page 10

Scand Stat Theory Appl. Author manuscript; available in PMC 2012 September 1.

NIH-PA Author Manuscript

NIH-PA Author Manuscript

NIH-PA Author Manuscript

Page 11

This implies that

W that satisfies the integral equation

converges in distribution to a p-dimensional process

Denote the elements of ej(t) as ej,km(t). With the p × p matrices F(t) with elements

the solution can be written as

Using the product integration formulae (Gill & Johansen, 1990) leads to the solution

(7)

and where F(s, t) is the product integral of F. In the one-dimensional (commutative) case

this equals exp

the product of (the atoms)

. Note that the matrix F(s, t) is estimated consistently by

(8)

where

in the one-dimensional case yields an estimator with the same asymptotic properties.

. Using the product integral rather than the exponential

We now consider the estimating function for ν based on the counting process data,

By the martingale decomposition

.

By a Taylor expansion, as before, it may be shown that the difference between the last two

terms in the previous equation (up to op(n−1/2)) equals

With the matrices in each of the terms converging to .

The last two term therefore equals (op(n−1/2))

Scheike et al. Page 11

Scand Stat Theory Appl. Author manuscript; available in PMC 2012 September 1.

NIH-PA Author Manuscript

NIH-PA Author Manuscript

NIH-PA Author Manuscript

Page 12

that can be written as

with G(t) is an q × p matrix with elements

of integration and using the structure of W (t) one gets

. Now changing the order

where

Define

(9)

Then one can write the estimating function as follows (up to an op(1) term)

This is a sum of i.i.d. terms (or a martingale) and therefore converges to a normal

distribution with variance that is estimated by the robust estimator given in (4). The total

score for ν is UN(ν0) + UG(ν0)

Based on a Taylor series expansion (up to an op(1) term) we find that

where

(10)

and with P(t, ν0) the limit of

and with the product integral F(s, t) defined in (7).

defined in (11), with I(τ) the limit of n−1I(τ, ν0) (12),

The derivative ofÃ(t) with respect to ν is

Scheike et al.Page 12

Scand Stat Theory Appl. Author manuscript; available in PMC 2012 September 1.

NIH-PA Author Manuscript

NIH-PA Author Manuscript

NIH-PA Author Manuscript

Page 13

(11)

The derivative with respect to ν of the estimating function, U, is

(12)

Based on the derivation of the estimating function it straightforward to do a Newton-

Raphson iteration to find a solution to the estimating equations. Note that the first term is a

martingale when evaluated in the true parameter, and therefore will converge towards 0 in

probability.

Note also that an estimator of F(s, t) is given in (8).

References

Aalen OO. A linear regression model for the analysis of life times. Statist. Med. 1989; 8:907–925.

Allen AS, Satten GA, Tsiatis AA. Locally-efficient robust estimation of haplotype-disease association

in family-based studies. Biometrika. 2005; 92:559–571.

Bagdonavicius V, Nikulin M. Generalised proportional hazards model based on modified parital

likelihood. Lifetime Data Anal. 1999; 5:329–350. [PubMed: 10650741]

Chen J, Peters U, Foster C, Chatterjee N. A haplotype-based test of association using data from cohort

and nested case-control epidemiologic studies. Human Heredity. 2004; 58:18–29. [PubMed:

15604561]

Chen K, Jin Z, Ying Z. Semiparametric analysis of transformation models with censored data.

Biometrika. 2002; 89:659–668.

Dabrowska DM. Quantile regession in transfomation models. arXiv:math.ST. 2005; 05115082v1:1–

34.

Epstein MP, Satten GA. Inference on haplotype effects in case-control studies using unphased

genotype data. American Journal of Human Genetics. 2003; 73:1316–1329. [PubMed: 14631556]

Gill RD, Johansen S. A survey of product-integration with a view towards application in survival

analysis. Ann. Statist. 1990; 18:1501–1555.

Gorfine M, Zucker D, Hsu L. Prospective survival analysis with a general semiparametric shared

frailty model: A pseudo full likelihood approach. Biometrika. 2005; 93:735–741.

Huffer FW, McKeague IW. Weighted least squares estimation for Aalen's additive risk model. J.

Amer. Statist. Assoc. 1991; 86:114–129.

Lin D, Zeng D. Likelihood based inference on haplotype effects in genetic association studies. J.

Amer. Statist. Assoc. 2006; 101:89–110.

Martinussen, T.; Scheike, T. Dynamic Regression Models for Survival Data. Springer; Verlag New

York: 2006.

Petersdorf E, Malkki M, Gooley T, Martin P, Guo Z. MHC haplotype matcing for unrelated

hematopoietic cell transplantation. PLOS medicine. 2007; 4:59–68.

Schaid DJ. Evaluating associations of haplotypes with traits. Genetic Epidemiology. 2004; 27:348–

364. [PubMed: 15543638]

Scheike et al.Page 13

Scand Stat Theory Appl. Author manuscript; available in PMC 2012 September 1.

NIH-PA Author Manuscript

NIH-PA Author Manuscript

NIH-PA Author Manuscript

Page 14

Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA. Score tests for association between

traits and haplotypes when the linkage phase is ambiguous. American Journal of Human Genetics.

2002; 70:425–434. [PubMed: 11791212]

Scheike T, Martinussen T, Silver J. Estimating haplotype effects for survival data. Biometrics. 2010

DOI: 10.1111/j.1541-0420.2009.01329.x.

Scheike TH. A flexible semiparametric transformation model for survival data. Lifetime Data Anal.

2006; 12:461–80. [PubMed: 17031497]

Tregouet D, Tiret L. Cox proportional hazards survival regression in haplotype-based association

analysis using the stochastic-em algorithm. European Journal of Human Genetics. 2004; 12:971–

974. [PubMed: 15241485]

Zaykin DV, Westfall PH, Young SS, Karnoub MA, Wagner MJ, Ehm MG. Testing association of

statistically inferred haplotypes with discrete and continuous traits in samples of unrelated

individuals. Human Heredity. 2002; 53:79–91. [PubMed: 12037407]

Zhao LP, Li SS, Khalid N. A method for the assessment of disease associations with single-nucleotide

polymorphism haplotypes and environmental variables in case-control studies. American Journal

of Human Genetics. 2003; 72:1231–1250. [PubMed: 12704570]

Zucker DM. A pseudo-partial likelihood method for semiparametric survival regression with covariate

errors. J. Amer. Statist. Assoc. 2005; 100:1264–1277.

Scheike et al.Page 14

Scand Stat Theory Appl. Author manuscript; available in PMC 2012 September 1.

NIH-PA Author Manuscript

NIH-PA Author Manuscript

NIH-PA Author Manuscript

Page 15

Figure 1.

Mean of estimated cumulative hazards (broken line) versus true cumulative hazards (solid

line) for the baseline and haplomatch effect (Panels a and b), and observed pointwise

standard deviation of estimates (solid line) versus mean of estimated standard errors (broken

line) (Panels c and d) based on sample size 600

Scheike et al.Page 15

Scand Stat Theory Appl. Author manuscript; available in PMC 2012 September 1.

NIH-PA Author Manuscript

NIH-PA Author Manuscript

NIH-PA Author Manuscript