
Journal of Statistical Planning and Inference 139 (2009) 3--15


Robust designs for misspecified logistic models

Adeniyi J. Adewalea, Douglas P. Wiensb,∗

aMerck Research Laboratories, North Wales, Pennsylvania 19454, United States

bDepartment of Mathematical and Statistical Sciences, University of Alberta, Edmonton, Alberta, Canada T6G 2G1


Available online 24 May 2008

MSC: primary 62K05; 62F35; secondary 62J05

Keywords: Fisher information; Logistic regression; Linear predictor; Monte Carlo sample; Polynomial; Random walk; Simulated annealing

We develop criteria that generate robust designs and use such criteria for the construction of designs that insure against possible misspecifications in logistic regression models. The design criteria we propose are different from the classical in that we do not focus on sampling error alone. Instead we use design criteria that account as well for error due to bias engendered by the model misspecification. Our robust designs optimize the average of a function of the sampling error and bias error over a specified misspecification neighbourhood. Examples of robust designs for logistic models are presented, including a case study implementing the methodologies using beetle mortality data.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

Experimental designs have been treated extensively in the statistical literature, starting with designs for linear models and

extending to nonlinear models. A large volume of literature is devoted to designs assuming the exact correctness of the relation-

ship between the response variable and the design (explanatory) variables. Box and Draper (1959) added another dimension to

the theory by investigating the impact of model misspecification. Following the work of Box and Draper the literature has since

been replete with regression designs which are robust against violations of various model assumptions: linearity of the response,

independence and homoscedasticity of the errors, etc. Authors who have considered designs with an eye on the approximate

nature of the assumed linear models include Marcus and Sacks (1976), Li and Notz (1982), Wiens (1992) and Wiens and Zhou

(1999), to mention but a few.

For nonlinear designs, Fedorov (1972), Ford and Silvey (1980), Chaloner and Larntz (1989) and Chaudhuri and Mykland (1993)

have explored the construction of optimal designs while assuming that the nonlinear model of interest is correctly specified.

Still others have investigated designs for generalized linear models, a class of possibly nonlinear models in which the response

follows a distribution from the exponential family such as the normal, binomial, Poisson or gamma (McCullagh and Nelder,

1989). The expository article (Ford et al., 1989) hinted that in the context of nonlinear models, as in the case of linear models, the

misspecification of the model itself is of serious concern. They asserted that “indeed, if the model is seriously in doubt, the forms

of design that we have considered may be completely inappropriate.” Sinha and Wiens (2002) have explored some designs for

nonlinear models with due consideration for the approximate nature of the assumed model. In this work we consider designs for

misspecified logistic regression models.

For the logistic model, the mean response E(Y) = μ depends on the parameters, β, and the vector of explanatory variables, x, through the nonlinear function μ = e^η/(1 + e^η), where η = z^T(x)β. The function η is termed the linear predictor, with regressors z_1(x), ..., z_p(x) depending on the q-dimensional independent variable x. The variance of the response, written var(Y|x), is a

∗Corresponding author. Tel.: +17804924406; fax: +17804926826.

E-mail addresses: adeniyiadewale@hotmail.com (A.J. Adewale), doug.wiens@ualberta.ca (D.P. Wiens).


doi:10.1016/j.jspi.2008.05.022


nonlinear function of the linear predictor. Abdelbasit and Plackett (1983), Minkin (1987), Ford et al. (1992), Chaudhuri and

Mykland (1993), Burridge and Sebastiani (1994), Atkinson and Haines (1996) and King and Wong (2000) have investigated

designs for binary data, and in particular for logistic regression. As illustrated in these papers, the general approach to optimal

design is to seek a design that optimizes certain functions of the information matrix of the model parameters. The information

matrix for β from a design consisting of the points x_1, ..., x_n is given by

Σ_{i=1}^{n} w(x_i, β) z(x_i) z^T(x_i) = Z^T W Z,  (1)

where Z = (z(x_1), z(x_2), ..., z(x_n))^T and

W = diag(w(x_1, β), w(x_2, β), ..., w(x_n, β))

for weights w(x_i, β) = (dμ/dη_i)²/var(Y|x_i). Thus, as with nonlinear experiments, the information matrix depends on the unknown parameters β. Designing an experiment for the estimation of these parameters would then seem to require that these parameters be known. The following are some of the approaches that have been explored in the literature for handling the dependency of the information matrix on β.

(1) Locally optimal designs: A traditional approach in designing a nonlinear experiment is to aim for maximum efficiency at a

best guess (initial estimate) of the parameter values (Chernoff, 1953). Designs that are optimal for given parameter values

are dubbed locally optimal designs. These designs may be stable over a range of parameter values. However, if unstable, a

design which is optimal for a best guess may not be efficient for parameter values in even a small neighbourhood of this

guess.

(2) Bayesian optimal designs: A natural generalization of locally optimal designs is to use a prior distribution on the unknown

parameters rather than a single guess. The approach which assumes such a prior and incorporates this distribution into the

appropriate design criteria is termed Bayesian optimal design—see Chaloner and Larntz (1989) and Dette and Wong (1996).

(3) Minimax optimal designs: Rather than assume a prior distribution, this approach assumes a range of plausible values for the

parameters. The minimax optimal design is the design with the least loss when the parameters take the worst possible value

within their respective ranges. These least favourable parameter values are those that maximize the loss (King and Wong,

2000; Dette et al., 2003).

(4) Sequential designs: In sequential designs, the experiment is done in stages. Parameter estimates from a previous stage are

used as initial estimates in the current stage. The process continues until optimal designs are obtained (Abdelbasit and

Plackett, 1983; Sinha and Wiens, 2002).
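As a minimal numerical sketch of the information matrix (1) (our illustration, not the authors' code), the matrix for a two-parameter logistic model can be assembled as follows; the design grid and the parameter guess β = (1, 3)^T are illustrative values borrowed from the examples later in the paper:

```python
import numpy as np

def information_matrix(x, beta):
    # rows z^T(x_i) = (1, x_i) for the simple linear predictor eta = b0 + b1*x
    Z = np.column_stack([np.ones_like(x), x])
    eta = Z @ beta                        # linear predictor
    mu = 1.0 / (1.0 + np.exp(-eta))      # mean response mu = e^eta/(1 + e^eta)
    w = mu * (1.0 - mu)                  # logistic weights: (dmu/deta)^2 / var(Y|x) = mu(1-mu)
    return Z.T @ (w[:, None] * Z)        # sum_i w_i z(x_i) z^T(x_i) = Z^T W Z

x = np.linspace(-1.0, 1.0, 5)            # illustrative design points
M = information_matrix(x, np.array([1.0, 3.0]))
```

Because the weights w_i depend on β, so does M; this is exactly the dependency that the four approaches above are designed to handle.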

Suppose an experimenter is faced with a set S = {x_i}_{i=1}^{N} of possible design points from which he is interested in choosing n, not necessarily distinct, points at which to observe a binary response Y. The experimenter makes n_i ≥ 0 observations at x_i such that Σ_{i=1}^{N} n_i = n. The design problem is to choose n_1, ..., n_N in an optimal manner. The objective then is to choose a probability distribution {p_i}_{i=1}^{N}, with p_i = n_i/n, on the design space S. The commonality in the work of the authors who have considered logistic design is the salient assumption that the assumed model form is exactly correct. In this work, we propose the construction of robust designs for logistic models with due consideration for possible misspecification in the assumed form of the systematic component, the linear predictor. The linear predictor could be said to be misspecified when it does not reflect the influence of the covariates correctly, possibly due to omitted covariates or to omission of some transformation of existing covariates in the model. In this section we formalize our notion of model misspecification.

We suppose that the experimenter fits a logistic model with the mean response

μ_i = μ(η_i), i = 1, ..., n,  (2)

for η_i = z^T(x_i)β_0, when in fact the true mean response is represented by

μ_{T,i} = μ(η_i + f(x_i)).  (3)

The target parameter β_0 is defined by

β_0 = arg min_β (1/N) Σ_{i=1}^{N} {E(Y|x_i) − μ(z^T(x_i)β)}².

Thus the target parameter is that which guarantees the least sum of squares of discrepancies, over all points in the design space,

between the assumed mean response and the true mean response. The contamination function f(x) may or may not be known. It

would be known, for example, when an experimenter decides to fit the more parsimonious model (2) despite the knowledge of a

more appropriate model (3) with a specified f(x). For instance, the simplified model might be required if the number of support


points is not sufficient to handle a more complicated but more appropriate model. Knowing that the parsimonious model might

result in an inferior analysis, the experimenter may seek a design that remedies the anticipated model inadequacy.

The contamination function would be unknown in a situation where the experimenter is aware of the possible uncertainties

in the assumed model form and might have clues about the properties of the possible misspecification, without knowing its exact

structure. When f(x) is unknown, some knowledge about its properties or conditions it satisfies would be required to construct

any appropriate design. This is so because no single design which takes a finite number of observations can protect against all

possible forms of bias. Thus, we must impose some conditions on the contamination function when its precise form is unknown.

To bound the bias of an estimator, we assume that

(1/N) Σ_{i=1}^{N} f²(x_i) ≤ τ²,  (4)

with τ² = O(n^{−1}). This latter requirement is analogous to the notion of contiguity in the asymptotic theory of hypothesis testing, and is justified in the same manner. The choice of τ is discussed following Theorem 3 in the next section. In order to ensure identifiability of the model parameters β and the contamination function f(x) we require that the vector of regressors and the contamination be orthogonal. That is,

(1/N) Σ_{i=1}^{N} z(x_i) f(x_i) = 0.  (5)

Let F denote the class of contamination functions f(x) satisfying (4) and (5).
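Membership in F is easy to check numerically on the finite design space. The sketch below is illustrative (not from the paper); the quadratic contaminant is centred and scaled so that the bound (4) holds with equality at τ = 1:

```python
import numpy as np

def in_neighbourhood(Z, f, tau, tol=1e-8):
    # (4): (1/N) sum f^2(x_i) <= tau^2;  (5): (1/N) sum z(x_i) f(x_i) = 0
    N = len(f)
    bound_ok = np.mean(f ** 2) <= tau ** 2 + tol
    orth_ok = np.allclose(Z.T @ f / N, 0.0, atol=tol)
    return bound_ok and orth_ok

x = np.linspace(-1.0, 1.0, 40)
Z = np.column_stack([np.ones_like(x), x])
g = x ** 2 - np.mean(x ** 2)        # centred quadratic: orthogonal to (1, x) on this grid
f = g / np.sqrt(np.mean(g ** 2))    # scaled so that (1/N) sum f^2 = 1
```

On the symmetric grid the centred quadratic is automatically orthogonal to both regressors, which is the same device used for the contaminant of Example 2 below.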

2. Loss functions: estimated and averaged mean squared errors of prediction

The basis for the construction of classical designs for logistic regression models has typically been the minimization of

(a function of) the inverse of Fisher's information matrix (1)—see Atkinson and Haines (1996). However, in the face of model

misspecification the asymptotic covariance, cov(β̂), of the maximum likelihood estimator of the model parameters no longer

equals the inverse of Fisher information—see White (1982) and also Fahrmeir (1990), who discusses the asymptotic properties

of MLEs under a misspecified likelihood.

Suppose that data {x_i, y_i} are given, where the x_i are the design points chosen from S with n_i observations at x_i such that Σ_{i=1}^{N} n_i = n, and y_i is the proportion of successes at location x_i. The asymptotic bias and covariance of the MLE β̂ are given in Theorem 1; see the Appendix for details of this and other proofs. The expressions for the asymptotic bias and covariance of the MLE β̂ are used in the derivation of the loss function in Corollary 2.

Theorem 1. Define

w_i = dμ_i/dη_i = μ_i(1 − μ_i) = (1/4) sech²(z^T(x_i)β_0/2),  (6)

and let Z be the N × p matrix with rows z^T(x_i). Recall (2) and (3); let c and c_T be the N × 1 vectors with elements μ_i and μ_{T,i}, respectively. Let P, W and W_T be the N × N diagonal matrices with diagonal elements n_i/n, w_i and w_{T,i} = μ_{T,i}(1 − μ_{T,i}), respectively. Finally, define b = Z^T P(c_T − c), H_n = Z^T P W Z, H̃_n = Z^T P W_T Z. The asymptotic bias and asymptotic covariance matrix of the maximum likelihood estimator β̂ of the model parameter vector β from the misspecified model are

bias(β̂) = E(β̂ − β_0) = H_n^{−1} b + o(n^{−1/2}),
cov(√n(β̂ − β_0)) = H_n^{−1} H̃_n H_n^{−1} + o(1),

respectively.

Since the typical focus of logistic designs is prediction, we take as loss function the normalized average mean squared error (AMSE) I of the response prediction μ(η̂_i), with η̂_i = z^T(x_i)β̂. This is given by

I = (n/N) Σ_{i=1}^{N} E[{μ(η̂_i) − μ(η_i + f(x_i))}²].

Corollary 2. The AMSE has the asymptotic approximation I = L_I(P, f) + o(1), where

L_I(P, f) = (1/N){tr[W Z H_n^{−1} H̃_n H_n^{−1} Z^T W] + n ‖W(Z H_n^{−1} b − f)‖²}  (7)

for f = (f(x_1), ..., f(x_N))^T.


By using the expressions for asymptotic bias and covariance given in Theorem 1, Corollary 2 expresses the AMSE as an explicit function of the design matrix Z and contamination vector f. The first term in the loss function L_I corresponds to the average variance of the predictions and it depends on the contamination function f(x) through the matrix H̃_n. The second term in the expression for L_I is the average squared bias of the predictions, which depends on the contamination f(x) through the contamination vector f and implicitly through the vector b. Thus a design cannot minimize (7) directly without certain assumptions about the contamination f(x).
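The loss (7) can be evaluated directly from its ingredients in Theorem 1. The sketch below uses our notation, not the authors' code; the uniform design and the parameter guess are illustrative:

```python
import numpy as np

def mu_fn(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def loss_LI(Z, p_wts, beta0, f, n):
    # Matrices of Theorem 1 on the N-point design space
    eta = Z @ beta0
    mu, muT = mu_fn(eta), mu_fn(eta + f)          # fitted (2) and true (3) mean responses
    w, wT = mu * (1 - mu), muT * (1 - muT)
    P, W, WT = np.diag(p_wts), np.diag(w), np.diag(wT)
    Hn, Htn = Z.T @ P @ W @ Z, Z.T @ P @ WT @ Z
    b = Z.T @ P @ (muT - mu)
    Hinv = np.linalg.inv(Hn)
    N = Z.shape[0]
    var_term = np.trace(W @ Z @ Hinv @ Htn @ Hinv @ Z.T @ W)   # average variance
    bias_term = n * np.sum((W @ (Z @ Hinv @ b - f)) ** 2)      # n * ||W(Z Hn^-1 b - f)||^2
    return (var_term + bias_term) / N

x = np.linspace(-1.0, 1.0, 40)
Z = np.column_stack([np.ones_like(x), x])
p_wts = np.full(40, 1.0 / 40)                     # uniform design, for illustration
L0 = loss_LI(Z, p_wts, np.array([1.0, 3.0]), np.zeros(40), 200)
```

When f = 0 the bias vector b vanishes, so the normalized loss is free of n; under contamination the bias term grows linearly in n, which is why the neighbourhood (4) is scaled with τ² = O(n⁻¹).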

Fang and Wiens (2000) constructed integer-valued designs for linear models, in the case of an unknown f, using a minimax

approach. Their minimax criterion minimizes the maximum value of the loss function over f. They solve the design problem by

minimizing the loss when the misspecification is the worst possible in the neighbourhood of interest.

Here, we take one of two approaches, depending on whether or not there are initial data. If we have initial data we represent the discrepancy between the true response and the assumed response, at a sampled location x, by

d(x) = μ(z^T(x)β_0 + f(x)) − μ(z^T(x)β_0),

and estimate this by the residual d̂(x) = y(x) − μ(z^T(x)β̂). A first order approximation is d(x) ≈ (dμ/dη)f(x), leading to f̂(x) = d̂(x)/(dμ/dη|_{β=β̂}). We smooth this estimated contamination over the entire design space; see Example 3 of Section 5 for an illustration. The resulting estimate f̂, together with β̂, is then substituted into the terms in (7), and we compute a design minimizing L_I(P, f̂) using the techniques outlined in Section 3.
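The residual-based estimate f̂ can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: a Gaussian-kernel local average replaces the loess smoother the authors use, and the parameter values in the test are made up:

```python
import numpy as np

def mu_fn(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def estimate_f(x, y, beta_hat):
    # residual d_hat(x) = y(x) - mu(z^T(x) beta_hat), then f_hat = d_hat / (dmu/deta)
    eta = beta_hat[0] + beta_hat[1] * x
    mu = mu_fn(eta)
    return (y - mu) / (mu * (1.0 - mu))

def smooth(x, fhat, bandwidth=0.4):
    # stand-in for loess: normalized Gaussian-kernel local averaging
    out = np.empty_like(fhat)
    for i, xi in enumerate(x):
        k = np.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
        out[i] = np.sum(k * fhat) / np.sum(k)
    return out
```

With noiseless proportions from a quadratically contaminated predictor, the first-order estimate recovers the shape of f well; in practice the observed proportions are noisy, which is why the smoothing step matters.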

If there are no initial data we propose to instead average L_I over the class F defined by (4) and (5). Our optimal design minimizes this average value. This criterion is in the spirit of Läuter (1974, 1976). Läuter's criterion optimizes the weighted average of the loss over a finite set of plausible models. Here we are instead faced with an infinite set of models indexed by f ∈ F.

To carry out the averaging we begin as in Fang and Wiens (2000), with the singular value decomposition

Z = U_{N×p} K_{p×p} V^T_{p×p},  (8)

with U^T U = V^T V = I_p and K diagonal and invertible. We augment U by Ũ_{N×(N−p)} such that [U ⋮ Ũ]_{N×N} is orthogonal. Then by (4) and (5), we have that there is an (N − p) × 1 vector t, with ‖t‖ ≤ 1, satisfying

f (= f_t) = τ√N Ũt.  (9)

The average loss is taken to be the expected value of (7), as a function of t, with respect to the uniform measure on the unit sphere and its interior in R^{N−p}. This measure has density p(t) = (1/ν_{N,p}) I(‖t‖ ≤ 1), where ν_{N,p} = π^{(N−p)/2}/Γ((N − p)/2 + 1) is the volume of the unit sphere. Theorem 3 handles the averaging of L_I. The importance of this theorem is in its elimination of the dependency of our design criterion on the unknown contamination function.
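Concretely, Ũ and a contaminant f_t of (9) can be generated from a full singular value decomposition; the values of N, τ and the direction t below are illustrative:

```python
import numpy as np

N = 40
x = np.linspace(-1.0, 1.0, N)
Z = np.column_stack([np.ones_like(x), x])              # N x p with p = 2
Ufull, K, Vt = np.linalg.svd(Z, full_matrices=True)    # (8), with U augmented to N x N
p = Z.shape[1]
U, Utilde = Ufull[:, :p], Ufull[:, p:]                 # [U : U~] is orthogonal

rng = np.random.default_rng(0)
t = rng.normal(size=N - p)
t /= np.linalg.norm(t)                                 # ||t|| = 1, boundary of the unit ball
tau = 0.1
f_t = tau * np.sqrt(N) * (Utilde @ t)                  # contaminant (9)
```

By construction f_t satisfies the orthogonality condition (5) exactly, and attains the bound (4) with equality since ‖t‖ = 1.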

Theorem 3. The average loss over the misspecification neighbourhood F is, apart from terms which are o(1), given by

L_{I,ave}(P, ρ) = ∫ L_I(P, f_t) p(t) dt = (1/N) tr[(U^T P W U)^{−1}(U^T W² U)] + (ρ/(N − p + 2)) tr[W(R − I_N)(R^T − I_N)W],  (10)

where ρ = nτ² and R_{N×N} = U(U^T P W U)^{−1} U^T P W. For numerical work it is more efficient to compute the second trace in (10) as

tr[W(R − I_N)(R^T − I_N)W] = Σ_{i=1}^{N} w_i² ‖r̃_i‖²,

where r̃_i^T is the ith row of R − I_N.
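A direct transcription of (10), using the row-wise form of the second trace; the uniform design and the parameter guess generating the weights w_i are illustrative:

```python
import numpy as np

def loss_ave(Z, p_wts, w, rho):
    N, p = Z.shape
    U = np.linalg.svd(Z, full_matrices=True)[0][:, :p]     # orthonormal basis of col(Z)
    P, W = np.diag(p_wts), np.diag(w)
    A = np.linalg.inv(U.T @ P @ W @ U)
    var_term = np.trace(A @ (U.T @ W @ W @ U)) / N         # first term of (10)
    R = U @ A @ U.T @ P @ W
    rows = R - np.eye(N)                                   # rows r~_i^T of R - I_N
    trace2 = np.sum(w ** 2 * np.sum(rows ** 2, axis=1))    # sum_i w_i^2 ||r~_i||^2
    return var_term + rho / (N - p + 2) * trace2

N = 40
x = np.linspace(-1.0, 1.0, N)
Z = np.column_stack([np.ones_like(x), x])
mu = 1.0 / (1.0 + np.exp(-(1.0 + 3.0 * x)))
w = mu * (1 - mu)                                          # weights (6) at beta0 = (1, 3)
p_uniform = np.full(N, 1.0 / N)
```

The criterion is linear in ρ, which makes the trade-off between the variance term and the bias term explicit.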

The dependency of the design criterion on the unknown contamination has now been represented by a design parameter ρ, which can be chosen by the experimenter. This parameter can be interpreted as a measure of departure of the true model from the fitted model. In other words, it is a measure of the experimenter's lack of confidence in the validity of the model that he fits. If he believes that this assumed model is exactly correct, he chooses ρ = 0, corresponding to the classical I-optimal design. On the other hand, if the experimenter believes that the assumed model is highly uncertain, he chooses a large value of ρ for his design. Designs corresponding to a large value of ρ are dominated by the bias component of the loss.

Our design criterion (10) remains dependent on the model parameter vector β_0, as is the case in general nonlinear design problems, through the weights, as at (6). In the examples of the next section we handle this dependency by either taking a guess (locally optimal designs) or assuming a prior distribution, say π(β_0), on β_0 (Bayesian designs). The loss function L_{I,ave} is modified as ∫ L_{I,ave}(β) π(β) dβ in the case of a Bayesian construction.


3. Designs—algorithm and examples

3.1. Simulated annealing

We consider problems with polynomial predictors, viz. η = z^T(x)β with z(x) = (1, x, x², ..., x^{p−1})^T. We take equally spaced design points {x_i}_{i=1}^{N} in the interval S. Our design minimizes the relevant loss function through the matrix P = diag(n_1/n, ..., n_N/n). This is a nonlinear integer optimization problem for which there is no analytic solution, and for which we employ simulated annealing to search for the optimal design.

The simulated annealing algorithm is a direct search random walk optimization algorithm which has been quite successful at finding global extrema of non-smooth functions and/or functions with many local extrema. The algorithm consists of three steps, each of which must be well adapted to the problem of interest for the algorithm to be successful. The first step is a specification of the initial state of the process. In this step an initial design has to be specified, say P_0. The second is a specification of a scheme by which a new design P_1 is chosen from the optimization space. The last step is a prescription of the basis of acceptance or rejection: an acceptance with probability 1 if L_{I,ave}(P_1) < L_{I,ave}(P_0), otherwise acceptance with probability exp{−(L_{I,ave}(P_1) − L_{I,ave}(P_0))/T}, where T is a tuning parameter. The tuning parameter is usually decreased as the iterations proceed. After a large number of iterations between the second and third steps the loss function is expected to converge to its (near) minimum value. Simulated annealing has been used for design problems by, among others, Meyer and Nachtsheim (1988), Fang and Wiens (2000) and Adewale and Wiens (2006).

A very simple and general approach that we considered for choosing the initial design is to randomly select p points from {x_i}_{i=1}^{N} and randomly allocate the observations to these points such that the total number of observations is n. Fang and Wiens (2000) used a different approach which assumes that one of (n, N) is a multiple of the other. For any (n, N) combination they chose the initial design to be as uniform as possible. We applied this approach as well but found that the two approaches are equally efficient. For generating a new design we adopted the perturbation scheme of Fang and Wiens (2000). The tuning parameter in the third step was chosen initially such that the acceptance rate was between 70% and 95%. We decrease T by a factor of .95 after each 20 iterations. In the examples below we run the algorithm several times with varying tuning parameter specifications and reduction rates in order to satisfy ourselves that the resulting design has the least loss possible under the relevant circumstances of each example. In Fig. 1 we present the simulated annealing trajectory for one of the cases presented in Example 1. It took 83 s for the algorithm to complete the preset maximum number of iterations (12,000, for this case) and the minimum loss was attained just before the 9000th iteration.
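The three annealing steps can be sketched as follows. This is an illustrative stand-in, not the authors' implementation: the toy loss, grid, and cooling schedule are ours, and the perturbation simply moves one observation from one design point to another:

```python
import numpy as np

def anneal(loss, n_alloc, T=1.0, cool=0.95, iters=3000, seed=0):
    rng = np.random.default_rng(seed)
    cur = n_alloc.copy()                   # step 1: initial design
    cur_loss = best_loss = loss(cur)
    best = cur.copy()
    for it in range(iters):
        # step 2: perturb - move one observation between design points
        prop = cur.copy()
        i = rng.choice(np.flatnonzero(prop > 0))
        j = rng.integers(len(prop))
        prop[i] -= 1
        prop[j] += 1
        new_loss = loss(prop)
        # step 3: accept if better, else with probability exp(-increase / T)
        if new_loss < cur_loss or rng.random() < np.exp(-(new_loss - cur_loss) / T):
            cur, cur_loss = prop, new_loss
            if cur_loss < best_loss:
                best, best_loss = cur.copy(), cur_loss
        if (it + 1) % 20 == 0:
            T *= cool                      # decrease the tuning parameter
    return best, best_loss

# toy criterion (not the paper's loss): weighted second moment on [-1, 1]
x = np.linspace(-1.0, 1.0, 11)
toy_loss = lambda m: float(np.sum((m / m.sum()) * x ** 2))
n0 = np.full(11, 2)                        # uniform initial design, n = 22
best, best_loss = anneal(toy_loss, n0)
```

Replacing `toy_loss` with an evaluation of L_{I,ave} over the allocation (n_1, ..., n_N) gives the search actually used for the designs below.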

3.2. Examples

Example 1 (No contamination). As a benchmark we first consider the logistic regression model with a single predictor: p = 2, z(x) = (1, x)^T, x ∈ S = [−1,1], β = (1, 3)^T, and no contamination: ρ = 0. We initially took n = 20, N = 40 and considered designs minimizing L_I. The annealing algorithm converged to the design placing 10 of the 20 observations at each of the points −.744

and .128. This design is therefore the classical I-optimal design minimizing the integrated variance of the predictions over S.

There is evidently no previous theory that applies to this case. However, using a model that is a reparameterization of ours, and

a continuous design space [−1,1], King and Wong (2000) showed the locally D-optimal design to be the design that is equally

supported at −.848 and .181. For the sake of comparison, we sought an equivalent design using our finite design space and the

algorithm described above. The resulting design places 10 of 20 observations at each of −.846 and .180. Thus, our algorithm

attains the closest approximation to King and Wong's solution in that the points −.846 and .180 are nearest, in our design space,

to −.848 and .181. Unlike designs for linear models, the optimal designs in this case do not necessarily place observations at the

extreme points of the design space. This phenomenon is due to the curvature introduced by the link function and the resulting

nonlinear relationship between the mean response and x.

Fig. 1. Simulated annealing trajectory for logistic design with η = 1 + 3x, x ∈ S = [−1,1], ρ = 0 and (N = 40, n = 200).


Fig. 2. Locally optimal designs minimizing L_{I,ave} when ρ = 0 (no contamination) with (a) N = 20, n = 20 (loss = .2496); (b) N = 20, n = 200 (loss = .2491); (c) N = 40, n = 20 (loss = .2527); (d) N = 40, n = 200 (loss = .2524).

Table 1
Comparing restricted designs with unrestricted design; ρ = 0

(N, n)      Restricted design (two-point)            Unrestricted design^a
(20, 20)    −.789(9), .053(11); loss .250            −.789(7), −.684(3), .0526(2), .158(8); loss .250
(20, 200)   −.789(94), .053(106); loss .250          −.789(95), .0526(63), .158(42); loss .249
(40, 20)    −.744(10), .128(10); loss .2527          −.744(10), .128(10); loss .2527
(40, 200)   −.744(97), .128(103); loss .2525         −.795(49), −.744(47), .0769(39), .128(65); loss .2524

^a Number of observations in parentheses.

Our numerical results further revealed that the designs depend on the number of points in the design space and the number

of observations the experimenter is willing to take. For this “no-contamination” case, we investigated designs for various

combinations of N and n. Some of these designs are presented in Fig. 2. The number of distinct design points varies from 2

to 4. We found this somewhat surprising, in light of the fact that all D-optimum designs for the two parameter logistic model

in the literature are two-point designs. Presumably this is explained through our use of a finite design space, and/or our use of

average loss rather than that based on the determinant of the information matrix.

To check that this phenomenon was not merely an artefact due to a lack of convergence, we modified our algorithm to obtain

“restricted” designs—restricted to two support points only. The results for the same values of N and n as in Fig. 2 are presented

in Table 1. The loss for the unrestricted design is less than or equal to that for the corresponding restricted design in all cases

considered.

In the examples that follow we limit discussion to the case N = 40, n = 200.

Example 2 (Example 1 continued). In this example, which we include largely for illustrative purposes, the form of the con-

tamination is known. Suppose that the experimenter anticipates fitting a simple logistic model, while wishing protection

against a range of logistic models with quadratic predictor: η(x) = z^T(x)β + f(x), where z^T(x) and β are as in Example 1, and

f(x) = θ_2(x² − γ_2)/√(γ_4 − γ_2²), for γ_k = N^{−1} Σ_i x_i^k (= 0 if k is odd).

The contaminant f(x) is an omitted quadratic term, translated and scaled to ensure the orthogonality condition (5); (4) becomes |θ_2| ≤ τ. We obtained optimal designs for various values of the quadratic coefficient θ_2. The resulting designs and the corresponding values of the loss function are presented in Table 2. In the range of values of θ_2 considered, we found that the number of distinct points varied from 3 to 6. The spread of the design over the design space tended to increase as the magnitude of the omitted quadratic term increases. We computed the premium paid for robustness and the gain due to robustness for each design presented as

Premium = (L_I(P_opt, f = 0)/L_I(P_classical, f = 0) − 1) × 100%  (11)

and

Gain = (1 − L_I(P_opt, f)/L_I(P_classical, f)) × 100%.  (12)
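In code, (11) and (12) are one-liners; the losses in the test are made-up values for illustration, not taken from Table 2:

```python
def premium(loss_robust_f0, loss_classical_f0):
    # (11): % increase in loss from the robust design when the model is in fact correct
    return (loss_robust_f0 / loss_classical_f0 - 1.0) * 100.0

def gain(loss_robust_f, loss_classical_f):
    # (12): % reduction in loss from the robust design under contamination
    return (1.0 - loss_robust_f / loss_classical_f) * 100.0
```

For instance, hypothetical losses of 1.25 (robust) against 1.0 (classical) at f = 0 correspond to a premium of 25%.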


Table 2
Designs for simple logistic model when the true model has a quadratic term

θ_2    Design points (number of observations)                          L_I(P, f)   Premium   Gain
−10    −1(42), −.180(42), −.128(96), −.077(12), −.026(2), .077(6)      3.500       35.0%     34.8%
−3     −1(48), −.282(26), −.231(64), .282(62)                          .5020       10.9%     17.0%
−1     −.949(42), −.590(34), −.539(29), .128(95)                       .2756       2.1%      .5%
0      −.795(49), −.744(47), .077(39), .128(65)                        .2524       0         0
1      −.641(100), .128(22), .180(78)                                  .3080       1.9%      4.0%
3      −.692(57), −.641(29), −.077(39), −.026(44), .795(31)            .6073       11.9%     19.9%
10     −1(11), −.590(51), −.539(27), −.231(30), −.180(40), .949(41)    3.679       39.0%     42.9%

Table 3
Experimental design and response values

Design point           −1    −7/9   −5/9   −3/9   −1/9   1/9   3/9   5/9   7/9   1
No. of observations    20    20     20     20     20     20    20    20    20    20
No. of successes       6     7      13     17     18     20    18    20    19    20

The gain measure is the percentage reduction in loss due to the use of a robust design as opposed to a (non-robust) classical

design which assumes the fitted model to be exactly correct. The premium measure is the percentage increase in loss as a result

of not using the classical design if in actual fact the assumed model is correct. The application of the premium and/or gain

measure depends on the amount of confidence the experimenter has in his knowledge of the true model. In this example, since

the assumption is that the experimenter knows that the model with a linear predictor involving the quadratic term is a more

appropriate model, the relevant measure would be the gain. Nevertheless, both measures are reported in Table 2. The value of

a design from our robust procedure increases with increasing magnitude of the quadratic parameter. On the other hand, the

experimenter has to be aware of the increasing premium when his knowledge of the true model is not accurate. The premium

paid for robustness also increases with the magnitude of the quadratic parameter.

Of course this example is artificial, assuming as it does that the true form of the predictor is known to be quadratic, with parameter θ_2 = 3, say. If one did indeed possess this knowledge then the classically optimal, i.e., variance minimizing, design would be −1(49), −.282(2), −.231(91), .436(9), .487(49). The premium figures in Table 2 would rise appreciably, to typical values of 100% or more, since the robust design would be protecting against bias, known not to be present.

Example 3 (Designing when there are initial data to estimate contamination). Table 3 shows simulated data (“# of successes”) from a logistic regression model with the predictor η(x) = 1 + 3x + f(x), the model of the previous example; the quadratic parameter was θ_2 = 3. The data were simulated using a uniform design over equally spaced points in [−1,1]. Having simulated the data, we suppose the contamination function f(x) to be unknown. We proceed using the procedure described in Section 2 for estimation and eventual smoothing of the contamination. A plot of the estimated contamination with its loess smooth f̂(x) over the design space is presented in Fig. 3.

We plugged the smoothed contamination values into the loss function (7), and used simulated annealing to obtain the design.

Our design places 34, 82, and 84 of the 200 observations at −.641,−.590, and .180, respectively. For this design the premium for

robustness is 5.0% and the gain is 60.0%. This example indicates that when there are initial data, it is expedient to incorporate the

information from the data into the design procedure. The resulting design can lead to substantial gain at a reduced premium.

Example 4 (Unknown contamination). Consider the logistic model with predictor

η(x) = θ_0 + θ_1 x + f(x).  (13)

In this example, as in Example 3, we assume that f is an unknown member of the class F defined by (4) and (5). In Fig. 4 we exhibit designs minimizing the averaged loss (10) for various values of ρ, θ_0 and θ_1. We observed a progression of the dispersion of the design points over the design space with increasing ρ. The pattern of the dispersion is, however, modified by the curvature indexed by θ_0 and θ_1 through the nonlinear mean response. For small ρ our robust designs can be described as taking clusters of observations at neighbouring locations rather than replications at only a few distinct sites; this was noticed for linear models by Fang and Wiens (2000). However, here there is always a pattern to the clusters of observations to be taken, depending on the values of the model parameters. Large values of ρ denote large departures from the assumed model and an extremely large ρ value corresponds to the all-bias design. Even though the all-bias design is spread over the entire design space, the frequencies of observations are different and these frequencies are prescribed by the curvature of the mean response as determined by the


Fig. 3. Estimated contamination for Example 3, with its loess smooth superimposed. True (but unknown) form of contamination is quadratic.

Fig. 4. Locally optimal designs in Example 4, for ρ = 0, 10, 100, 10000 within each row: (a)–(d) (θ_0, θ_1) = (1,1), losses .29, .68, 3.98, 363.19; (e)–(h) (θ_0, θ_1) = (1,3), losses .25, .51, 2.72, 244.75; (i)–(l) (θ_0, θ_1) = (3,1), losses .08, .11, .39, 30.15; (m)–(p) (θ_0, θ_1) = (3,3), losses .14, .28, 1.42, 125.88.


Table 4
Design for unknown contamination with θ_0 = 1 and θ_1 = 3

ρ         L_{I,ave}(P, ρ)   Premium   Gain
0         .2524             0         0
1         .2809             2.01%     .62%
10        .5090             14.33%    3.06%
100       2.7204            25.87%    7.44%
1000      24.7294           28.16%    11.29%
10 000    244.7545          28.43%    11.97%

Fig. 5. Robust Bayesian optimal design in Example 5 with ρ = .25 and parameters θ_0 and θ_1 having independent uniform priors over (a) [.5,1.5] × [2.5,3.5], loss = .2630; (b) [.5,1.5] × [1,5], loss = .2836; (c) [−1,3] × [2.5,3.5], loss = .2748; (d) [−1,3] × [1,5], loss = .2840.

model parameters. In Table 4 we present the values of the premium paid and the gain in robustness for designs corresponding to different values of ρ for the particular case of (θ0, θ1) = (1, 3). The gain in robustness, measured by (12), exceeds the premium paid, as measured by (11), for each design. Increasing robustness, however, comes with an increasing premium; the experimenter would thus have to choose his level of comfort.

Thus far, the examples we have presented have been locally optimal, and hence have assumed good guesses for the unknown model parameters. In the absence of a reliable best guess for the model parameters, Sitter (1992) and King and Wong (2000) considered minimax D-optimality, a procedure which assumes knowledge of a prior range for each of the parameters. We consider a Bayesian paradigm to be in the same spirit as averaging the contamination function over the specified misspecification neighbourhood, and take independent uniform prior distributions over the range of each model parameter. Our design criterion then becomes the expected loss, E(LI,ave(P, ρ)), with the expectation taken with respect to these priors. The dependence of our design criterion on the model parameters is through the weights wi, and we do not have analytic expressions for the resulting integrals. In the examples that follow we employ number-theoretic methods for the numerical evaluation of multiple integrals, as discussed in Fang and Wang (1994). This approach is based on generating quasi-random points in the domain of definition of the integrand, and averaging the values of the loss over the sample of points.
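The quasi-random averaging step can be sketched in a few lines. The Halton sequence below stands in for the good-lattice points of Fang and Wang (1994), and the integrand passed to `expected_loss` is a toy stand-in for LI,ave(P, ρ); the function names are ours, not the paper's.

```python
import numpy as np

def halton(n, dims, primes=(2, 3)):
    """Generate n quasi-random points in [0,1)^dims via the Halton sequence."""
    def vdc(i, base):
        x, f = 0.0, 1.0 / base
        while i > 0:
            i, r = divmod(i, base)
            x += r * f
            f /= base
        return x
    return np.array([[vdc(i, primes[d]) for d in range(dims)]
                     for i in range(1, n + 1)])

def expected_loss(loss, lo, hi, n_points=1024):
    """Average loss(theta) over independent uniform priors on the box [lo, hi]."""
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    pts = lo + halton(n_points, len(lo)) * (hi - lo)  # map [0,1)^d onto the prior's box
    return np.mean([loss(t) for t in pts])

# Toy integrand: the mean of theta0*theta1 over [.5,1.5] x [2.5,3.5] is 1*3 = 3.
approx = expected_loss(lambda t: t[0] * t[1], [.5, 2.5], [1.5, 3.5])
```

The same routine applies to the actual design loss by replacing the toy integrand with an evaluation of the averaged loss at the sampled parameter point.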

Example 5 (Robust Bayesian design). In this example we consider the following ranges of parameter values: (a) [.5,1.5] × [2.5,3.5], (b) [.5,1.5] × [1,5], (c) [−1,3] × [2.5,3.5], (d) [−1,3] × [1,5], all with centre point (1,3) but with coverage areas 1, 4, 4, and 16, respectively. As described above, the robust design for each range of parameter values is the design that minimizes the expected average loss with respect to uniform distributions on the specified ranges of parameter values. For each of the designs—see Fig. 5—we take ρ = .25. We observe an increasing spread over the design space with increasing uncertainty in the model parameters, as measured by the coverage area of the priors. This is consistent with previous work on optimal Bayesian design—see, for example, Chaloner and Larntz (1989)—which suggests an increasing number of distinct design points with increasing uncertainty in the specified prior distributions. Comparing the design plots in panels (b) and (c) of Fig. 5, we see that there is more sensitivity to uncertainty in the intercept parameter than in the slope parameter.
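Integer-valued designs such as these are found by stochastic search; simulated annealing (cf. Meyer and Nachtsheim, 1988, and the keywords above) is one such scheme. The sketch below anneals an allocation of n observations over a grid of candidate points, with a local D-type criterion for the logistic model standing in for the robust loss; the criterion and all names are ours, not the authors'.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 21)                         # candidate design points
Z = np.column_stack([np.ones_like(x), x])          # regressors z(x) = (1, x)^T
theta = np.array([1.0, 3.0])                       # local guess (theta0, theta1)
w = (lambda e: e / (1 + e) ** 2)(np.exp(Z @ theta))  # logistic weights mu(1 - mu)

def loss(alloc):
    """Stand-in criterion: -log det of the Fisher information of the allocation."""
    M = Z.T @ (Z * (alloc * w)[:, None])
    sign, logdet = np.linalg.slogdet(M)
    return np.inf if sign <= 0 else -logdet

n = 100
alloc = rng.multinomial(n, np.ones(len(x)) / len(x))  # random integer-valued start
best, best_loss = alloc.copy(), loss(alloc)
init_loss = best_loss
T = 1.0
for _ in range(5000):
    new = alloc.copy()
    i = rng.choice(np.flatnonzero(new > 0))  # move one observation ...
    j = rng.integers(len(x))                 # ... to a random grid point
    new[i] -= 1
    new[j] += 1
    delta = loss(new) - loss(alloc)
    if delta < 0 or rng.random() < np.exp(-delta / T):  # Metropolis acceptance
        alloc = new
        if loss(alloc) < best_loss:
            best, best_loss = alloc.copy(), loss(alloc)
    T *= 0.999  # geometric cooling schedule
```

Replacing `loss` with the robust (or Bayesian-averaged) criterion recovers the kind of search used to produce the designs shown in the figures.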

4. Case study: beetle mortality data

Bliss (1935) reported the numbers of beetles dead after 5 h exposure to gaseous carbon disulphide at various concentrations. The doses are given in Table 5; to facilitate our discussion we have linearly transformed these to the range [0,1]. Note that the original design is then very nearly uniform on the eight equally spaced points 0(1/7)1.


Table 5
Beetle mortality data

Dose, xi (log10 CS2 mg l−1)   Number of beetles, ni   Number killed, ni yi
1.69                          59                       6
1.72                          60                      13
1.75                          62                      18
1.78                          56                      28
1.81                          63                      52
1.84                          59                      53
1.86                          62                      61
1.88                          60                      60

[Figure 6 here: two histograms of the numbers of observations over the design space [0, 1].]

Fig. 6. (a) Prediction design when contamination is estimated from initial data. (b) Robust Bayesian prediction design with multivariate normal prior and ρ = 5.

We first fitted the logistic model with the linear predictor η(1) = θ0(1) + θ1(1) x, and obtained the estimates θ̂0(1) = −2.777 and θ̂1(1) = 6.621 with the estimated variance–covariance matrix

Σ(1) = (  .082   −.144 )
       ( −.144    .317 )

and deviance = 11.232 (df = 6). The corresponding estimates for the logistic model with the linear predictor η(2) = θ0(2) + θ1(2) x + θ2(2) x² are θ̂0(2) = −2.00, θ̂1(2) = 1.60, θ̂2(2) = 5.84 and

Σ(2) = (  .124   −.522    .489  )
       ( −.522   3.252   −3.690 )
       (  .489  −3.690    4.665 )

with deviance = 3.195 (df = 5). The deviances and a plot (not presented here) of the proportions of beetles killed against dose levels, with the estimated proportions from each model superimposed, suggest that the model with the quadratic term is a significantly better fit for these data. Suppose the experimenter is inclined to use the simple logistic fit for future data, for ease of interpretation and model simplicity, or that the adequacy of the model with the quadratic term is itself in doubt. We proceed by estimating the contamination and then smoothing over the design space as discussed in Section 2. The resulting design, obtained using the parameter estimates θ̂0(1) and θ̂1(1) as initial guesses, with total number of observations n = 481 over an equally spaced grid of N = 40 points in [0,1], is presented in panel (a) of Fig. 6. This would be the design of choice if the experimenter were interested in prediction but contemplated the superiority of the model with the quadratic term. However, the experimenter can ensure robustness against a broader set of alternatives by taking the contamination to belong to the class F while assuming an initial multivariate normal prior on the parameter, with mean vector (θ̂0(1), θ̂1(1))^T and variance–covariance matrix Σ(1), and then using the Bayesian paradigm as in Example 5. The loss function becomes the expected value of (10), with the expectation taken with respect to the multivariate normal prior. The numerical implementation of the expectation is done using a quasi-Monte Carlo sampling approach. The design plot is given in Fig. 6(b).

5. Conclusions

We have investigated integer-valued designs for logistic regression models, using polynomial predictors as specific examples.

Our designs are robust against misspecification in the predictor. We have addressed both known and unknown contamination.

Previous robustness work done for logistic models has concentrated on the uncertainty of model parameters; in this contribution

we have gone further to investigate specific violations in the form of the assumed linear (in the parameters) predictor.

Designs for a specific alternative, for example quadratic versus linear in the independent variable, are quite different from

those for broad classes of alternatives. The number of distinct design points is usually not as large in the former case as in the

latter. In fact, when the magnitude of the misspecification is minimal, the resulting robust design could have about the same number of distinct observation points as its classical counterpart. Nevertheless, the gain in robustness often exceeds the premium paid for robustness—see Table 1.


Designs for a very specific alternative may, however, suffer the same fate as designs assuming the correctness of the fitted

model when the alternative itself is not valid. Both take replicates of observations at only a few distinct points, especially when

the magnitude of the departure is small. However, when there is a higher degree of certainty in the alternative, these designs

could result in substantial gain in robustness. An example of this would be when the experimenter is aware of a more appropriate

model but seeks a design that allows for the fitting of a more parsimonious model. Also, designs when there are data to estimate

model contamination are quite similar to designs when the exact form of the contamination is known (single alternative). When

the information in the initial data is incorporated into the design procedure, as seen in Example 3 above, the robustness of the

resulting design could come at a very reduced premium.

In general, we have found there to be increasing numbers of distinct observation sites with increasing model uncertainty. The

overall message is consistent with that reported in the model robust design literature for linear models—robust designs can be

approximated by placing clusters of observations about the support points for classical designs. However, the nonlinearity of the

mean response in logistic design adds a slight twist to the overall message, in that the clusters of observations come with patterns

that are determined by the curvature prescribed by the model parameters. More striking is the fact that the all-bias design is

non-uniform in logistic regression models—even though the recommended design points are spread over the entire design space,

the frequencies of observations vary due to the curvature.

Overall, the design that protects against uncertainty in model parameters (via a Bayesian paradigm) and that which protects

against uncertainty in assumed model form could be described as taking observations in clusters. These clusters often come in

interesting patterns of curvature prescribed by the nonlinearity of the model—see examples in the previous section. Further work

would be required to obtain analytic descriptions of the effect of curvature in this robust approach, or even for the all-bias designs for logistic models. While the focus of the model misspecification reported here is exclusively on linear predictor misspecification, we are currently investigating other forms of misspecification in designing for the broader class of generalized linear models, of which the logistic model is but a special case.

Acknowledgements

The research of both authors is supported by the Natural Sciences and Engineering Research Council of Canada. We appreciate

helpful comments from an anonymous referee.

Appendix A. Derivations

Proof of Theorem 1. Under conditions as in Fahrmeir (1990) the maximum likelihood estimate b̂ exists and is consistent, and ∂l(b̂)/∂b is o_p(n^{−1/2}). The log-likelihood l, the score function and −1 times the second derivative according to the assumed model are

l(b) = Σ_{i=1}^N { n_i [ y_i log( μ_i / (1 − μ_i) ) + log(1 − μ_i) ] + log( n_i choose n_i y_i ) },

∂l(b)/∂b = Σ_{i=1}^N n_i (y_i − μ_i) z(x_i),

−∂²l(b)/∂b ∂b^T = Σ_{i=1}^N n_i w_i z(x_i) z^T(x_i).

An expansion of ∂l(b)/∂θ_j around b_0 gives

∂l(b)/∂θ_j = ∂l(b_0)/∂θ_j + Σ_k (θ_k − θ_{0,k}) ∂²l(b_0)/∂θ_j ∂θ_k + (1/2) Σ_k Σ_l (θ_k − θ_{0,k})(θ_l − θ_{0,l}) ∂³l(b*)/∂θ_j ∂θ_k ∂θ_l,

where θ_j and θ_{0,j} are the jth terms of the vectors b and b_0, respectively, and b* is a point on the line segment connecting b and b_0. If we replace b by b̂ in this expansion, we obtain

√n Σ_k (θ̂_k − θ_{0,k}) [ (1/n) ∂²l(b_0)/∂θ_j ∂θ_k + (1/2n) Σ_l (θ̂_l − θ_{0,l}) ∂³l(b*)/∂θ_j ∂θ_k ∂θ_l ] = −(1/√n) ∂l(b_0)/∂θ_j.

For the logistic likelihood the ∂³l(b*)/∂θ_j ∂θ_k ∂θ_l are bounded, and so, using the consistency of b̂, we have that

(1/n) ∂²l(b_0)/∂θ_j ∂θ_k + (1/2n) Σ_l (θ̂_l − θ_{0,l}) ∂³l(b*)/∂θ_j ∂θ_k ∂θ_l →_p −H_{jk},

where H_{jk} is the (j,k)th element of the matrix H_n = −(1/n) ∂²l(b_0)/∂b ∂b^T = Z^T P W Z. Thus the limit distribution of √n(b̂ − b_0) is that of the solution of the equations Σ_k H_{jk} √n(θ̂_k − θ_{0,k}) = (1/√n) ∂l(b_0)/∂θ_j, i.e., is the limit distribution of H_n^{−1} (1/√n) ∂l(b_0)/∂b.

Using the central limit theorem for independent, not identically distributed random variables we have that (1/√n) ∂l(b_0)/∂b has a multivariate normal limit distribution with asymptotic mean (1/√n) Σ_{i=1}^N n_i E[y_i − μ_i(b_0)] z(x_i) = √n b and asymptotic covariance matrix H̃_n = Z^T P W_T Z. From this it follows that √n(b̂ − b_0) is AN(√n H_n^{−1} b, H_n^{−1} H̃_n H_n^{−1}), as required. □
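The sandwich form H_n^{−1} H̃_n H_n^{−1} can be checked informally by simulation: generate grouped binomial data whose true probabilities come from a contaminated linear predictor, fit the (misspecified) simple logistic model by maximum likelihood, and compare the empirical covariance of the estimates with the sandwich evaluated at the pseudo-true parameter (at which the bias term √n H_n^{−1} b vanishes). Everything below is our sketch, with an arbitrary quadratic contamination.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 5)
Z = np.column_stack([np.ones_like(x), x])
m = np.full(5, 400)                        # n_i observations per design point
n = m.sum()
expit = lambda e: 1 / (1 + np.exp(-e))
mu_true = expit(0.5 + x + 0.5 * x**2)      # contaminated ("true") probabilities

def irls(y_prop, iters=20):
    """Fit the simple logistic model to grouped proportions by IRLS."""
    b = np.zeros(2)
    for _ in range(iters):
        mu = expit(Z @ b)
        W = m * mu * (1 - mu)
        z = Z @ b + (y_prop - mu) / (mu * (1 - mu))
        b = np.linalg.solve(Z.T @ (W[:, None] * Z), Z.T @ (W * z))
    return b

b0 = irls(mu_true)                         # pseudo-true parameter (KL projection)
mu0 = expit(Z @ b0)
H = Z.T @ ((m * mu0 * (1 - mu0))[:, None] * Z) / n            # H_n (assumed model)
Ht = Z.T @ ((m * mu_true * (1 - mu_true))[:, None] * Z) / n   # H~_n (true variances)
V = np.linalg.inv(H) @ Ht @ np.linalg.inv(H) / n              # sandwich cov of b-hat

# Monte Carlo replications of the misspecified fit
fits = np.array([irls(rng.binomial(m, mu_true) / m) for _ in range(3000)])
emp = np.cov(fits.T)
rel_err = np.linalg.norm(emp - V) / np.linalg.norm(V)
```

With these sample sizes the empirical covariance should agree with the sandwich to within Monte Carlo error, and the average estimate should sit close to the pseudo-true value.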

Proof of Corollary 2. First write

I = (1/N) Σ_{i=1}^N var[√n μ(η̂_i)] + (1/N) Σ_{i=1}^N { E[√n μ(η̂_i)] − √n μ(η_i + f(x_i)) }².

By the δ-method, the first sum is, up to terms which are o(1),

(1/N) Σ_{i=1}^N (dμ_i/dη_i)² var[√n η̂_i] = (1/N) Σ_{i=1}^N (dμ_i/dη_i)² z^T(x_i) H_n^{−1} H̃_n H_n^{−1} z(x_i) = (1/N) tr[ W Z H_n^{−1} H̃_n H_n^{−1} Z^T W ].

Also, on expanding μ(η̂_i) and μ(η_i + f(x_i)) around η_i, we have

E[√n μ(η̂_i)] = √n μ(η_i) + E[ √n (dμ_i/dη_i)(η̂_i − η_i) + o(√n (η̂_i − η_i)) ],

and

√n μ(η_i + f(x_i)) = √n μ(η_i) + √n (dμ_i/dη_i) f(x_i) + o(√n f(x_i)).

Using an argument similar to that in the proof of Theorem 1, we have

E[√n μ(η̂_i)] = √n μ(η_i) + √n (dμ_i/dη_i) E(η̂_i − η_i) + o(1).

Thus, the second sum in the expression for I is, up to terms which are o(1),

(1/N) Σ_{i=1}^N { E[√n μ(η̂_i)] − √n μ(η_i + f(x_i)) }² = (1/N) Σ_{i=1}^N (dμ_i/dη_i)² { n · bias^T(b̂) z(x_i) z^T(x_i) bias(b̂) − 2n f(x_i) z^T(x_i) bias(b̂) + n f²(x_i) }

= (1/N) { n · b^T H_n^{−1} Z^T W² Z H_n^{−1} b − 2n f^T W² Z H_n^{−1} b + n · f^T W² f },

reducing to (n/N) ‖ W (Z H_n^{−1} b − f) ‖². □

Proof of Theorem 3. Here and elsewhere, in the averaging we will use the identity ∫ t^T t p(t) dt = (N − p)/(N − p + 2), which implies that

∫ t t^T p(t) dt = ( 1/(N − p + 2) ) I_{N−p}.

First use (8) and (9) to write (7) explicitly in terms of t:

L_I(P, f) = (1/N) { tr[ (U^T P W U)^{−1} (U^T P W_T(t) U) (U^T P W U)^{−1} U^T W² U ] + n ‖ W ( U (U^T P W U)^{−1} U^T P (c_T(t) − c) − ε√N Ũ t ) ‖² }.  (A.1)

Using (9) again we have W_T(t) = W + Ẇ ε√N Ũ t + O(ε²), where Ẇ = diag(w′(η_1), ..., w′(η_N)). Since ε² = O(n^{−1}) we obtain ∫ W_T(t) p(t) dt = W + O(n^{−1}), and so

∫ tr[ (U^T P W U)^{−1} (U^T P W_T(t) U) (U^T P W U)^{−1} U^T W² U ] p(t) dt = tr[ (U^T P W U)^{−1} (U^T W² U) ].

Similarly, we have c_T(t) − c = ε√N W Ũ t + O(ε²), and so

n ‖ W ( U (U^T P W U)^{−1} U^T P (c_T(t) − c) − ε√N Ũ t ) ‖² = n ε² N ‖ W (R − I) Ũ t ‖² + O(n^{−1/2}),

with

∫ n ‖ W ( U (U^T P W U)^{−1} U^T P (c_T(t) − c) − ε√N Ũ t ) ‖² p(t) dt = ( n ε² N / (N − p + 2) ) tr[ W (R − I) Ũ Ũ^T (R − I)^T W ].

The result follows upon noting that Ũ Ũ^T = I − U U^T and (R − I) U = 0, and then substituting these integrals into (A.1) and simplifying. □
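The averaging identity ∫ t^T t p(t) dt = (N − p)/(N − p + 2) is the second moment of a uniform distribution on the unit ball in R^{N−p} (our reading of the density p(t)), for which E[t t^T] = I/(N − p + 2). A quick Monte Carlo sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5  # plays the role of N - p

# Sample uniformly from the unit ball in R^d: directions from a spherical
# Gaussian, radii drawn as U^(1/d) so the ball is filled uniformly.
g = rng.standard_normal((200_000, d))
t = g / np.linalg.norm(g, axis=1, keepdims=True)
t *= rng.random((200_000, 1)) ** (1 / d)

second_moment = np.mean(np.sum(t * t, axis=1))  # estimates d / (d + 2)
cross_moments = (t.T @ t) / len(t)              # estimates I / (d + 2)
```

With d = 5 the estimates should be close to 5/7 and I/7, matching the identities used in the proof.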

References

Abdelbasit, K.M., Plackett, R.L., 1983. Experimental design for binary data. J. Amer. Statist. Assoc. 78, 90–98.

Adewale, A., Wiens, D.P., 2006. New criteria for robust integer-valued designs in linear models. Comput. Statist. Data Anal. 51, 723–736.

Atkinson, A.C., Haines, L.M., 1996. Designs for nonlinear and generalized linear models. In: Ghosh, S., Rao, C.R. (Eds.), Handbook of Statistics, vol. 13. pp. 437–475.

Bliss, C.I., 1935. The calculation of the dose-mortality curve. Ann. Appl. Biol. 22, 134–167.

Box, G.E.P., Draper, N.R., 1959. A basis for the selection of a response surface design. J. Amer. Statist. Assoc. 54, 622–654.

Burridge, J., Sebastiani, P., 1994. D-optimal designs for generalized linear models with variance proportional to the square of the mean. Biometrika 81, 295–304.

Chaloner, K., Larntz, K., 1989. Optimal Bayesian design applied to logistic regression experiments. J. Statist. Plann. Inference 21, 191–208.

Chaudhuri, P., Mykland, P., 1993. Nonlinear experiments: optimal design and inference based on likelihood. J. Amer. Statist. Assoc. 88, 538–546.

Chernoff, H., 1953. Locally optimal designs for estimating parameters. Ann. Math. Statist. 24, 586–602.

Dette, H., Wong, W.K., 1996. Optimal Bayesian designs for models with partially specified heteroscedastic structure. Ann. Statist. 24, 2108–2127.

Dette, H., Haines, L., Imhof, L., 2003. Bayesian and maximin optimal designs for heteroscedastic regression models. Canad. J. Statist. 33, 221–241.

Fahrmeir, L., 1990. Maximum likelihood estimation in misspecified generalized linear models. Statistics 21, 487–502.

Fang, K.-T., Wang, Y., 1994. Number-Theoretic Methods in Statistics. Chapman & Hall, London.

Fang, Z., Wiens, D.P., 2000. Integer-valued, minimax robust designs for estimation and extrapolation in heteroscedastic, approximately linear models. J. Amer.

Statist. Assoc. 95, 807–818.

Fedorov, V.V., 1972. Theory of Optimal Experiments. Academic Press, New York.

Ford, I., Silvey, S.D., 1980. A sequentially constructed design for estimating a nonlinear parametric function. Biometrika 67, 381–388.

Ford, I., Titterington, D.M., Kitsos, C.P., 1989. Recent advances in nonlinear experimental design. Technometrics 31, 49–60.

Ford, I., Torsney, B., Wu, C.F.J., 1992. The use of a canonical form in the construction of locally optimal designs for nonlinear problems. J. Roy. Statist. Soc. B 54,

569–583.

King, J., Wong, W.-K., 2000. Minimax D-Optimal designs for the logistic model. Biometrics 56, 1263–1267.

Läuter, E., 1974. Experimental design in a class of models. Math. Operationsforschung Statist. 5, 379–396.

Läuter, E., 1976. Optimal multipurpose designs for regression models. Math. Operationsforschung Statist. 7, 51–68.

Li, K.C., Notz, W., 1982. Robust designs for nearly linear regression. J. Statist. Plann. Inference 6, 135–151.

Marcus, M.B., Sacks, J., 1976. Robust designs for regression problems. In: Gupta, S.S., Moore, D.S. (Eds.), Statistical Theory and Related Topics II. Academic Press,

New York, pp. 245–268.

McCullagh, P., Nelder, J.A., 1989. Generalized Linear Models. Chapman & Hall, CRC, London, Boca Raton, FL.

Meyer, R.K., Nachtsheim, C.J., 1988. Constructing exact D-optimal experimental designs by simulated annealing. Amer. J. Math. Management Sci. 8, 329–359.

Minkin, S., 1987. Optimal design for binary data. J. Amer. Statist. Assoc. 82, 1098–1103.

Sinha, S., Wiens, D.P., 2002. Robust sequential designs for nonlinear regression. Canad. J. Statist. 30, 601–618.

Sitter, R.R., 1992. Robust designs for binary data. Biometrics 48, 1145–1155.

White, H., 1982. Maximum likelihood estimation of misspecified models. Econometrica 50, 1–25.

Wiens, D.P., 1992. Minimax designs for approximately linear regression. J. Statist. Plann. Inference 31, 353–371.

Wiens, D.P., Zhou, J., 1999. Minimax designs for approximately linear models with AR(1) errors. Canad. J. Statist. 27, 781–794.