
Robust penalized logistic regression with truncated loss functions

Seo Young Park1 and Yufeng Liu2,*

1Department of Health Studies, Chicago, IL 60615, USA

2Department of Statistics and Operations Research, Carolina Center for Genome Sciences,

Chapel Hill, NC 27599, USA

Abstract

The penalized logistic regression (PLR) is a powerful statistical tool for classification. It has been

commonly used in many practical problems. Despite its success, since the loss function of the PLR

is unbounded, resulting classifiers can be sensitive to outliers. To build more robust classifiers, we

propose the robust PLR (RPLR) which uses truncated logistic loss functions, and suggest three

schemes to estimate conditional class probabilities. Connections of the RPLR with some other

existing work on robust logistic regression are discussed. Our theoretical results indicate

that the RPLR is Fisher consistent and more robust to outliers. Moreover, we develop estimated

generalized approximate cross validation (EGACV) for the tuning parameter selection. Through

numerical examples, we demonstrate that truncating the loss function indeed yields better

performance in terms of classification accuracy and class probability estimation.

Key words and phrases

Classification; logistic regression; probability estimation; robustness; truncation

1. INTRODUCTION

The penalized logistic regression (PLR) is a commonly used classification method in

practice. It is a generalization of the standard logistic regression with a penalty term on the

coefficients. It is now known that the PLR can be fit in the regularization framework with

loss + penalty (Wahba, 1999; Lin et al., 2000). The loss function controls goodness of fit of

the model, and the penalization term helps avoid overfitting so that good generalization can

be obtained.

The PLR uses the unbounded logistic loss. As a result, the resulting classifier can be

sensitive to outliers. In this article, we propose the robust penalized logistic regression

(RPLR), which uses a truncated logistic loss function. Because truncation reduces the impact

of misclassified outliers, the RPLR is more robust and accurate than the standard PLR.

Connections of the proposed RPLR with some other existing robust logistic regression

methods are also discussed.

One important aspect of classification is class probability estimation. Good class probability

estimation can reflect the strength of classification. Thus, it is desirable in many

applications. In the PLR, one can use the estimated classification function, that is, the

© 2011 Statistical Society of Canada

*Author to whom correspondence may be addressed. yfliu@email.unc.edu.

NIH Public Access

Author Manuscript

Can J Stat. Author manuscript; available in PMC 2011 December 7.

Published in final edited form as:

Can J Stat. 2011 June 1; 39(2): 300–323. doi:10.1002/cjs.10105.


estimated logit function, to derive the corresponding probability estimation. When we

replace the logistic loss by its truncated version, properties of the corresponding

classification function may not preserve all class probability information any more. To solve

this problem, we propose three different schemes for class probability estimation. Properties

and performance of these three schemes are explored as well.

Although the original logistic loss function is convex, its truncated version becomes non-

convex. Consequently, the corresponding minimization problem involves difficult non-

convex optimization. To implement the RPLR, we decompose the non-convex truncated

logistic loss function into the difference of two convex functions. Then, using this

decomposition, we apply the difference of convex functions (d.c.) algorithm to obtain the solution of the

RPLR through iterative convex minimization.

The tuning parameter plays an important role in the RPLR implementation. To select an

efficient tuning parameter, we develop the estimated generalized approximate cross

validation (EGACV) procedure and compare its performance with the cross validation

method.

In the following sections, we describe the newly proposed method in more detail with

theoretical justification and numerical examples. Section 2 reviews the PLR and gives a

maximum likelihood interpretation. In Section 3 we review some related robust logistic

regression methods in the literature. In Section 4 we describe the RPLR and explore its

theoretical properties. The methods for class probability estimation are also introduced.

Section 5 develops the d.c. algorithm to solve the non-convex minimization problem for the

RPLR. In Section 6 we discuss the issue of the tuning parameter selection. Numerical results

are presented in Section 7 and Section 8 provides some discussion. The proofs of theorems

and the detailed derivation of the tuning procedure are included in the Appendix Section.

2. PENALIZED LOGISTIC REGRESSION

In binary classification, we want to build a classifier based on a training sample {(xi, yi)|i =

1, 2, …, n}, where xi ∈ R^d is a vector of predictors, and yi ∈ {+1, −1} is its class label.

Typically it is assumed that the training data are distributed according to an unknown

probability distribution P(x, y). The goal is to find a classifier which minimizes the

misclassification rate. Moreover, besides good classification accuracy, it is also desirable to

estimate the class conditional probability.

For discussion, we first briefly review the PLR and its likelihood interpretation. In the

standard logistic regression model for binary classification, one assumes that the logit can be

modeled as a linear function in covariates. Specifically, the model can be written as follows:

$$\log\frac{P(Y = +1 \mid X = x)}{P(Y = -1 \mid X = x)} = w^{T}x + b, \tag{1}$$

where X and Y denote the vector of explanatory variables and the class label, respectively.

The coefficients of logistic regression (w, b) can be estimated by the method of maximum

likelihood (McCullagh & Nelder, 1989). As one way of smoothing, le Cessie & van

Houwelingen (1992) proposed PLR, which maximizes the log-likelihood subject to a

constraint on the L2 norm of the coefficients. Wahba (1999) showed the linear PLR is

equivalent to finding b and w which solves


(2)

where ℱ = {f : f(x) = w^T x + b}, l(u) = log(1 + e^{−u}), J(f) is the penalty on the L2 norm of w,

and λ > 0 is a tuning parameter. Once the classification function f is obtained, one can use

sign(f(x)) to estimate the label of x, that is, ŷ = +1 if f(x) ≥ 0, and ŷ = −1 otherwise.

For a nonlinear problem, the theory of reproducing kernel Hilbert spaces (RKHS) can be applied, and

then the kernel PLR has ℱ = {f : f(x) = r(x) + b, r(x) ∈ ℋK} and J(f) = ‖r‖²ℋK, where ℋK is the

RKHS induced by K and K is the kernel function (Wahba, 1999). Properties of the

reproducing kernel and the representer theorem imply that r(x) = Σ_{i=1}^n υi K(x, xi) and

‖r‖²ℋK = υ^T K υ, where υ = (υ1, …, υn)^T and K is an n × n positive definite matrix with its

(i1, i2)th element K(xi1, xi2) (Kimeldorf & Wahba, 1971).

Notice that the loss function l(u) in (2) is a decreasing function as shown in the left panel of

Figure 1 and in particular, its value grows rapidly as u goes to negative infinity. This causes

high impact of outliers with very small (negative) value of yi f(xi). As a result, the

coefficient estimates of the PLR can be affected by outliers far from their own classes. To

further illustrate the effect of outliers on the PLR, we randomly generate two-dimensional

separable data and apply the PLR to obtain a classification boundary. As shown in the left

panel of Figure 2, the PLR works very well without outliers. However, if we randomly

select one of the observations and move it away from its own class, then the classification

boundary of the PLR is pulled towards that outlier, as shown in the right panel of Figure

2. As a result, the corresponding misclassification rate will become higher. In contrast, our

newly proposed method is much more robust to the outlier so that its classification boundary is

more accurate.

The effect of outliers on the PLR can also be interpreted using maximum likelihood. The

likelihood function of unpenalized logistic regression can be written as

$$L(b, w) = \prod_{i=1}^{n} P(x_i)^{(1+y_i)/2}\,[1 - P(x_i)]^{(1-y_i)/2}, \tag{3}$$

where P(x) = Pr(y = +1|x). Then, we can plug in the logit function (1) into (3), and the

corresponding maximizer of L(b, w) is the solution of the logistic regression. Note that the

ith term of the product in the likelihood is P(xi) when yi = +1, and 1 − P(xi) otherwise.

Therefore, to maximize the likelihood, one needs to find (w, b) to make P(xi) big when yi =

1 and small when yi = −1. However, this could be sensitive to outliers. To illustrate this

further, assume there is one data point xi with yi = +1 which locates far from the other data

points of class +1 but closer to data of class −1 as illustrated in the right panel of Figure 2.

Using the solution (w, b) without the outlier, the corresponding P(xi) for the outlier will be

very small because xi is closer to the data of class −1. Consequently, the ML method would

select (w, b) which will make P(xi) bigger to obtain larger likelihood at the expense of other

entries’ classification accuracy. This results in the boundary moving towards the outlier.
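The link between this likelihood argument and the loss l(u) = log(1 + e^{−u}) can be made explicit. The short derivation below is a sketch added for illustration; it assumes only the logistic model with f(x) = w^T x + b, so that P(x) = 1/(1 + e^{−f(x)}).

```latex
% Assumption: logistic model, P(x) = 1/(1 + e^{-f(x)}), hence 1 - P(x) = 1/(1 + e^{f(x)}).
\begin{align*}
-\log\Big[P(x_i)^{(1+y_i)/2}\{1-P(x_i)\}^{(1-y_i)/2}\Big]
  &= \begin{cases}
       \log\big(1+e^{-f(x_i)}\big), & y_i = +1,\\
       \log\big(1+e^{f(x_i)}\big),  & y_i = -1,
     \end{cases}\\
  &= \log\big(1+e^{-y_i f(x_i)}\big) = l\big(y_i f(x_i)\big).
\end{align*}
```

Since l(u) behaves like −u for very negative u, a single observation with a very negative yi f(xi) contributes an arbitrarily large term to the negative log-likelihood, which is the mechanism behind the boundary shift described above.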

In the next section, we discuss some literature on robust logistic regression.


3. LITERATURE ON ROBUST LOGISTIC REGRESSION

There is a large literature on the robustness issue of the logistic regression. Most of the

existing methods attempt to achieve robustness by downweighting observations which are

far from the majority of the data, that is, outliers (Krasker & Welsch, 1982; Pregibon, 1982;

Stefanski, Carroll & Ruppert, 1986; Copas, 1988; Künsch, Stefanski & Carroll, 1989;

Morgenthaler, 1992; Carroll & Pederson, 1993; Bianco & Yohai, 1996; Bondell, 2005).

Stefanski, Carroll & Ruppert (1986) and Künsch, Stefanski & Carroll (1989) modified the

original score function of the logistic regression to obtain bounded sensitivity, which is a

concept introduced by Krasker & Welsch (1982). Morgenthaler (1992) used L1-norm instead

of L2-norm in the likelihood, resulting in a weighted score function of the original score

function. Cantoni & Ronchetti (2001b) focused on robustness of inference rather than the

model.

Pregibon (1982) suggested resistant fitting methods which taper the standard likelihood to

reduce the influence of extreme observations. In particular, he proposed to estimate (w, b) by

solving

(4)

where ρ(u) is a tapering function, h(x) is a factor which controls leverage of each

observation, and di is the negative log-likelihood, that is, di = −[((1 + yi)/2) log P(xi) + ((1 − yi)/

2) log(1 − P(xi))]. Note that this reduces to standard maximum likelihood estimation of the

logistic regression when h(x) ≡ 1 and ρ(u) = u. The particular tapering function Pregibon

(1982) proposed is Huber’s loss function

(5)

where H is a prespecified constant. In order to compare with our new method, we provide a

new view of the method by Pregibon (1982) in the loss function framework. In particular,

with ρ in (5) and h(x) ≡ 1, we can reduce (4) to

(6)

where

(7)

The estimate in (6) was shown to have approximately 95% asymptotic relative efficiency

when H = 1.345². The loss function in (7) with H = 1.345² is plotted in the right panel of

Figure 1 for comparison. As shown in the plot, lPregibon(u) grows as u goes to negative

infinity, but less rapidly than the loss function of the original logistic regression l(u).

Consequently, the resulting coefficient estimates become less sensitive to extreme

observations. However, lPregibon(u) remains unbounded, hence the impact

of outliers can still be large.


Bianco & Yohai (1996) proposed a consistent and more robust version of Pregibon’s

estimator, by adding a bias correction term. More specifically, they suggested solving

(8)

with the di previously defined and the bias correction term Ci, where Ci = G(P(xi)) + G(1 −

P(xi)) − G(1), , and

(9)

where c is a constant. Croux & Haesbroeck (2003) pointed out that the minimizer of (8) with

ρ(t) in (9) often does not exist; in particular, the minimizer tends to infinity. To

overcome this problem, they suggested using

(10)

and

(11)

where d is a constant and Φ is the normal cumulative distribution function. To view the

method by Croux & Haesbroeck (2003) in the loss function framework, we show that the

problem (8) with ρ(t) in (10) is equivalent to solving

(12)

where


(13)

The loss function (13) is plotted in Figure 1.

Another attempt to achieve robustness was made by Copas (1988), who modeled

contamination of class labels in the training data. Specifically, it is assumed that the class

label y ∈ {1, −1} was transposed with a small probability γ. As a result, the response y can

be 1 with probability P*(x), where

$$P^{*}(x) = (1-\gamma)\,P(x) + \gamma\,[1 - P(x)]. \tag{14}$$

Using (1) and (14), the log-likelihood with P*(x) becomes

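$$\sum_{i=1}^{n}\left\{\frac{1+y_i}{2}\log P^{*}(x_i) + \frac{1-y_i}{2}\log\,[1 - P^{*}(x_i)]\right\}. \tag{15}$$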

To view this in the loss framework, we get the equivalent problem of log-likelihood

maximization in (15) as follows

$$\min_{b,\,w}\ \sum_{i=1}^{n} l_{\mathrm{Copas}}\big(y_i f(x_i)\big), \tag{16}$$

where lCopas(u) = log[(1 + e^{−u})/(1 + γ(e^{−u} − 1))], which is plotted with γ = 0.02 in the right

panel of Figure 1. With any γ smaller than 0.5, lCopas(u) is decreasing in u, and bounded by


−log γ. Though it reduces the impact of outliers, it heavily depends on the misclassification

rate γ, which is often unknown and needs to be tuned.
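The boundedness of the Copas loss can be checked numerically. The snippet below is a minimal sketch, not code from the original work: the function name `copas_loss` is hypothetical, and the loss is evaluated through the contaminated probability (1 − γ)P + γ(1 − P) with P = 1/(1 + e^{−u}), which is algebraically the same as lCopas(u) but avoids overflow for very negative u.

```python
import math

def copas_loss(u, gamma):
    """l_Copas(u) = log[(1 + e^{-u}) / (1 + gamma * (e^{-u} - 1))], 0 < gamma < 0.5."""
    # Work with p = 1 / (1 + e^{-u}) so that very negative u does not overflow.
    if u >= 0:
        p = 1.0 / (1.0 + math.exp(-u))
    else:
        e = math.exp(u)
        p = e / (1.0 + e)
    p_star = (1.0 - gamma) * p + gamma * (1.0 - p)  # contaminated class probability
    return -math.log(p_star)

gamma = 0.02  # illustrative value, matching the one used for the plot
for u in (5.0, 0.0, -5.0, -50.0):
    print(u, copas_loss(u, gamma))
print("upper bound -log(gamma):", -math.log(gamma))
```

The printed values increase as u decreases and level off at −log γ, so a smaller assumed γ gives a higher ceiling and therefore less protection against far-away outliers.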

Overall, despite progress on several variants of PLR to achieve robustness, there is still

room for improvement as discussed earlier. In the next section, we propose a new classifier

which effectively reduces the influence of outliers by truncating the logistic loss function.

4. ROBUST PENALIZED LOGISTIC REGRESSION

4.1. Truncated Loss for Robustness

Although most of the previous methods of robust logistic regression use the likelihood point

of view, they can be transformed into the loss function framework as shown in the right

panel of Figure 1. In this article, we propose a different approach to achieve robustness for

the logistic regression. In particular, we develop a new classifier by truncating the loss

function directly rather than modifying the log-likelihood function.

Our focus here is on outliers that are far from their own classes. Due to the unboundedness

of the logistic loss function, it assigns large loss values for those outliers. Consequently, the

resulting classifiers will be affected by them (Shen et al., 2003; Liu & Shen, 2006). To

reduce the effect of outliers, we propose a novel robust version of the PLR (RPLR), which

truncates the loss function of the PLR. Specifically, we propose to use the truncated logistic

loss function gs(u) = min(l(u), l(s)) instead of l(u). Here s ≤ 0 represents the location of

truncation. As illustrated on the left panel of Figure 1, gs(yf(x)) increases as yf(x) decreases,

but once yf(x) is less than s, gs(yf(x)) becomes a constant. This implies that the loss grows

as a misclassified observation gets further away from the classification boundary, but only up to

the upper bound l(s). For outliers located further away from the boundary satisfying yf(x) ≤ s, the

loss stays at a constant l(s) so that the outliers cannot further influence the classification

boundary. This is in contrast to the untruncated version whose impact grows to infinity.

Furthermore, it differs from other methods discussed in the previous section in the sense that

the effect of extreme observations stays the same once yf(x) becomes less than s, while that

of others keeps increasing. Note that s determines the level of truncation. When s = −∞, no

truncation occurs, thus the loss is the same as the original logistic loss. As s gets closer to 0,

we have more truncation on the loss which may further reduce the effect of outliers.

Therefore, gs(u) defines a family of loss functions indexed by s.
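The truncated loss is simple to evaluate. The following is a minimal sketch, assuming only the definitions l(u) = log(1 + e^{−u}) and gs(u) = min(l(u), l(s)) given above; the function names are hypothetical and chosen for illustration.

```python
import math

def logistic_loss(u):
    """l(u) = log(1 + exp(-u)), computed stably for large |u|."""
    if u >= 0:
        return math.log1p(math.exp(-u))
    return -u + math.log1p(math.exp(u))

def truncated_logistic_loss(u, s):
    """g_s(u) = min(l(u), l(s)) with truncation location s <= 0."""
    return min(logistic_loss(u), logistic_loss(s))

s = -math.log(3.0)  # truncation location for which l(s) = 2 log 2
for u in (2.0, 0.0, -1.0, -5.0, -50.0):
    # The untruncated loss keeps growing as u decreases; g_s stays at l(s) once u <= s.
    print(u, logistic_loss(u), truncated_logistic_loss(u, s))

# Passing s = -infinity recovers the untruncated logistic loss.
print(truncated_logistic_loss(-5.0, float("-inf")) == logistic_loss(-5.0))
```

For u = −50 the logistic loss is about 50, whereas the truncated loss stays at l(s), which is why a single extreme point cannot dominate the fit.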

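From the likelihood point of view, minimizing Σ_{i=1}^n gs(yi f(xi)) is equivalent to maximizing

$$\prod_{i=1}^{n} Q_{+}(x_i)^{(1+y_i)/2}\,[1 - Q_{-}(x_i)]^{(1-y_i)/2}, \tag{17}$$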

where Q+(x) = max(P(x), 1/(1 + e^{−s})) and Q−(x) = min(P(x), 1/(1 + e^{s})). Interestingly, (17)

has a similar form as that of the logistic regression in (3). The difference is that the ith factor

is Q+(xi) or 1 − Q−(xi), instead of P(xi) or 1 − P(xi), depending on yi. Hence, maximizing

(17) is equivalent to finding (w, b) which gives big Q+(x) when y = +1 and small Q−(x)

when y = −1. By definition, Q+(x) cannot get extremely small because it is lower bounded

by (1 + e^{−s})^{−1}. Similarly, Q−(x) cannot get extremely big. Therefore, outliers may not

influence (17) as much compared to (3). As a result, the maximizer of (17) can be less

sensitive to outliers. For the toy example illustrated in Figure 2, the classification boundary

of the original PLR deteriorates dramatically when there exists an extreme outlier in the

dataset. In contrast, the RPLR boundary is very stable whether there is an outlier or not.
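The expressions for Q+(x) and Q−(x) follow directly from the definition of the truncated loss. The derivation below is a sketch added for illustration, assuming P(x) = 1/(1 + e^{−f(x)}) so that e^{−l(f(x))} = P(x) and e^{−l(−f(x))} = 1 − P(x).

```latex
% Assumption: P(x) = 1/(1 + e^{-f(x)}), and g_s(u) = min(l(u), l(s)) with l(u) = log(1 + e^{-u}).
\begin{align*}
e^{-g_s(f(x))}  &= \max\big(e^{-l(f(x))},\, e^{-l(s)}\big)
                 = \max\Big(P(x),\, \frac{1}{1+e^{-s}}\Big) = Q_{+}(x),\\
e^{-g_s(-f(x))} &= \max\Big(1-P(x),\, \frac{1}{1+e^{-s}}\Big)
                 = 1-\min\Big(P(x),\, \frac{1}{1+e^{s}}\Big) = 1-Q_{-}(x).
\end{align*}
```

For example, with s = −log 3 the bounds are Q+(x) ≥ 1/4 and Q−(x) ≤ 3/4, so no single factor of the product can fall below 1/4.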


4.2. Fisher Consistency

In this section, we study Fisher consistency of robust logistic regression and its weighted

version. Fisher consistency, also known as classification-calibration (Bartlett, Jordan &

McAuliffe, 2006), requires that the population minimizer of a binary loss function has the

same sign as P(x) − 1/2 (Lin, 2004). Wu & Liu (2007) established the conditions of a

truncated loss for Fisher consistency. In particular, the binary truncated logistic loss function

gs(u) = min(l(u), l(s)) is Fisher consistent for any s ≤ 0. For the multicategory case with k ≥

3 classes, gs(u) is Fisher consistent for s ∈ [−log(2^{k/(k−1)} − 1), 0]. Setting k = 2 in this

interval gives s ∈ [−log 3, 0]. In this article, we consider three different truncation

locations s = 0, −log 3, and −log 7 for the RPLR. The corresponding values of the logistic

loss are l(0), 2l(0), and 3l(0), respectively. Our numerical results suggest that s = −log 3 with

l(s) = 2l(0) = 2 log 2 gives the best performance. This matches the Fisher consistency result

for multicategory classification.
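The three loss values quoted above follow from l(u) = log(1 + e^{−u}) by direct substitution; the lines below simply spell out that arithmetic.

```latex
\begin{align*}
l(0)        &= \log(1 + e^{0}) = \log 2,\\
l(-\log 3)  &= \log(1 + e^{\log 3}) = \log 4 = 2\log 2 = 2\,l(0),\\
l(-\log 7)  &= \log(1 + e^{\log 7}) = \log 8 = 3\log 2 = 3\,l(0).
\end{align*}
```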

So far, we have focused on the standard case, that is, treating different types of

misclassification equally. Sometimes, it can be natural to impose different costs for different

types of misclassification. For example, it can be more severe to misclassify an observation

of class +1 to class −1 than that of class −1 to +1. Then it is sensible to put a bigger cost on

the first kind of misclassification than on the second. Lin, Lee & Wahba (2002) discussed

the weighted SVM to deal with non-standard situations such as different misclassification

costs for different classes. Recently, Wang, Shen & Liu (2007) applied weighted learning to

large margin classifiers for probability estimation. In addition to Fisher consistency of non-

weighted robust logistic regression, we investigate similar properties of the weighted robust

logistic regression.

Let (1 − π, π) with 0 < π < 1 be the weights for class +1 and class −1, respectively, then the

weighted version of the RPLR becomes

(18)

where λ > 0 balances the goodness of fit, measured by the loss function, and the smoothness

of f. If λ = 0, the objective function in (18) reduces to the unpenalized robust logistic

regression. Note that the expectation of the weighted loss part in (18) is E[hπ(Y)gs(Yf(X))],

where hπ(1) = 1 − π and hπ(−1) = π.

To understand the RPLR further, we need to explore properties of weighted robust logistic

regression. The following theorem discusses the theoretical minimizer of the truncated

logistic loss.

Theorem 1. The minimizer of E[hπ(Y)gs(Yf(X))] has the same sign as P(x) − π.

Theorem 1 indicates that the sign of the minimizer of E[hπ(Y)gs(Yf(X))] is the same as sign(P(x) − π).

Thus, sign(f̂π(x)), where f̂π is the solution of (18), provides a natural estimate of sign(P(x) − π).

In particular, if f̂π(x) > 0, then P(x) > π, otherwise P(x) ≤ π.

This offers a natural procedure for class probability estimation. More generally, one can

estimate sign(P(x) − π) for many different π's ∈ (0, 1) to obtain further information about P(·). Thus, it

can be used for class probability estimation, as discussed further in Section 4.3.
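The sign property in Theorem 1 can be illustrated numerically at a single point x. The sketch below is an illustration under stated assumptions, not part of the original proof: it evaluates the conditional risk (1 − π)P(x)gs(f) + πP(x̄)-style expression, that is (1 − π)P(x)gs(f) + π(1 − P(x))gs(−f), on a finite grid of f values and reports the grid minimizer. The grid range and the helper names are hypothetical choices.

```python
import math

def l(u):
    """Logistic loss log(1 + exp(-u)), computed stably."""
    return math.log1p(math.exp(-u)) if u >= 0 else -u + math.log1p(math.exp(u))

def g(u, s):
    """Truncated logistic loss g_s(u) = min(l(u), l(s))."""
    return min(l(u), l(s))

def conditional_risk(f, p, pi, s):
    """E[h_pi(Y) g_s(Y f) | X = x] when P(x) = p, h_pi(+1) = 1 - pi, h_pi(-1) = pi."""
    return (1.0 - pi) * p * g(f, s) + pi * (1.0 - p) * g(-f, s)

def grid_minimizer(p, pi, s, half_width=20.0, steps=2001):
    """Minimizer of the conditional risk over an equally spaced grid on [-20, 20]."""
    grid = [-half_width + 2.0 * half_width * k / (steps - 1) for k in range(steps)]
    return min(grid, key=lambda f: conditional_risk(f, p, pi, s))

for s in (0.0, -math.log(3.0), -math.log(7.0)):
    for pi in (0.2, 0.5, 0.8):
        for p in (0.1, 0.3, 0.55, 0.7, 0.9):
            f_star = grid_minimizer(p, pi, s)
            print(f"s={s:.3f} pi={pi} P(x)={p} grid minimizer={f_star:+.2f}")
```

In several of these settings the grid minimizer sits at the edge of the grid, which reflects the fact that the population minimizer can be infinite under truncation; only its sign carries information in those cases.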

4.3. Probability Estimation

Lin (2002) showed that under certain conditions the solution f̂π of (18) approaches

the minimizer of E[hπ(Y)gs(Yf(X))]. Therefore, we can use the property of this minimizer to design estimators


of class probabilities P̂(x). In the simplest scenario where π = 1/2 and s = −∞, we use the

regular logistic loss and (18) reduces to the ordinary PLR. In that case, it is well known that

the minimizer of E[l(Yf(X))] is f = log[P(X)/(1 − P(X))]. Then a natural estimator of P(x) is

e^{f̂}/(1 + e^{f̂}).

When we use the truncated loss function, the minimizer of E[hπ(Y)gs(Yf(X))] does not

always maintain enough information to obtain class probability estimation. The following

theorem establishes the minimizer of E[hπ(Y)gs(Yf(X))].

Theorem 2. Define H1(π, P(x)) = log[1 + 1/τ(P(x), π)] + [1/τ(P(x), π)] log[1 + τ(P(x), π)],

H2(π, P(x)) = τ(P(x), π) log[1 + 1/τ(P(x), π)] + log[1 + τ(P(x), π)], and τ(P(x), π) = ((1 −

π)P(x))/(π(1 − P(x))). Then, for t = gs(s),

Theorem 2 implies that we can use to express class probability only when

. Otherwise we cannot reconstruct P(x)

using . To further illustrate the relationship between and P(x), we consider H1 and H2

in the case that π = 1/2 in Figure 3. When P(x) ∈ [p1, p2] with t = H1(π, p1) and t = H2(π,

p2), then . However, when P(x) ∉ [p1, p2], is either ∞

or −∞, which does not have enough information to recover P(x). For this reason, we need to

explore other schemes to estimate P(x).

To estimate the class probability, we propose the following three schemes.

Scheme 1: Since the RPLR works only for estimation of P(x) ∈ [p1, p2], we can consider

utilizing it for those p, and using the ordinary PLR for P(x) ∉ [p1, p2]. Notice that this

scheme is valid only for t > 2 log 2, because if t ≤ 2 log 2, p1 = p2 and t is smaller than H1

and H2 for any P(x) as shown in Figure 3. Thus, by Theorem 2, the RPLR does not work for

estimation of any P(x) when t ≤ 2 log 2.

This scheme is a valid approach in the sense that estimation of P(x) ∈ [p1, p2] is more

critical than that of P(x) ∉ [p1, p2]. Usually the data points with very small P(x) or very big

P(x) are easier to classify and we are more certain about the class membership of those

points. However, class membership prediction for data points with P(x) near 1/2 is not only

difficult, but also highly affected by outliers. Thus, estimation of class probability becomes

more important for those points. Therefore, we use the RPLR for estimation of P(x) ∈ [p1,

p2], and use the ordinary PLR for P(x) ∉ [p1, p2].

Scheme 2: The second scheme is motivated by the idea that we can shift p1 and p2 by

changing π. Because H1 and H2 in Theorem 2 depend on π, different π’s bring different

estimable regions [p1, p2]. Hence, we can cover most of the P(x) ∈ [0, 1] using many

different π’s. Note that this method is applicable only when t > 2 log 2, and here we

illustrate the case with t = 3 log 2. More specifically, we use seven different π’s such as π1 =

1/2, π2 = 1/5, π3 = 4/5, π4 = 1/20, π5 = 19/20, π6 = 1/91, and π7 = 90/91, which give different

estimable regions for P(x), [0.310, 0.690], [0.105, 0.358], [0.642, 0.899], [0.024, 0.101],

[0.895, 0.976], [0.005, 0.024], and [0.976, 0.995]. Using f̂j which denotes the solution from

the RPLR with πj, we can construct the estimator P̂j(x) = e^{f̂j}/(1 + e^{f̂j}); j = 1, …, 7, to

estimate P(x) in the corresponding region.


There are some drawbacks of the second scheme. First, there are overlaps between the

estimable regions. Moreover, the RPLR with different π’s can give contradictory inference

about P(x). To solve this, for given P̂j(x), we consider P̂1(x) first. If P̂1(x) ∈ [0.310, 0.690],

then take P̂1(x) as P̂(x). Otherwise, we consider P̂2(x) or P̂3(x) depending on whether P̂1(x)

is <0.310 or >0.690. Then take P̂2(x) or P̂3(x) as P̂(x) if it falls in the estimable region,

otherwise, take P̂4(x) or P̂5(x) in the same manner as P̂(x) or use P̂6(x) or P̂7(x) likewise. If

the RPLR with P̂j(x) gives contradictory inference about P(x) or none of them gives the

estimate of P(x) in the estimable region, then we use the PLR to estimate P(x).
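The selection rule of the second scheme can be written as a short cascade. The sketch below is one possible reading of the description above, not the authors' implementation: it assumes that the estimators are visited in the order 1, 2, 4, 6 when P̂1(x) < 0.310 and 1, 3, 5, 7 when P̂1(x) > 0.690, and that an estimate pointing back toward the middle counts as contradictory. The function and variable names are hypothetical.

```python
# Estimable regions for pi_1, ..., pi_7 with t = 3 log 2, as listed in the text.
REGIONS = {1: (0.310, 0.690), 2: (0.105, 0.358), 3: (0.642, 0.899),
           4: (0.024, 0.101), 5: (0.895, 0.976), 6: (0.005, 0.024),
           7: (0.976, 0.995)}
LOWER_CHAIN = (2, 4, 6)   # visited when the estimates point to small P(x)
UPPER_CHAIN = (3, 5, 7)   # visited when the estimates point to large P(x)

def scheme2_estimate(p_hat, p_plr):
    """p_hat maps j to P_hat_j(x); p_plr is the ordinary PLR estimate (fallback)."""
    lo, hi = REGIONS[1]
    if lo <= p_hat[1] <= hi:
        return p_hat[1]
    going_down = p_hat[1] < lo
    for j in (LOWER_CHAIN if going_down else UPPER_CHAIN):
        lo, hi = REGIONS[j]
        if lo <= p_hat[j] <= hi:
            return p_hat[j]
        # Assumed meaning of "contradictory": the estimate points back toward 1/2.
        if (going_down and p_hat[j] > hi) or (not going_down and p_hat[j] < lo):
            return p_plr
    return p_plr  # no estimate fell inside its own estimable region

print(scheme2_estimate({1: 0.20, 2: 0.05, 4: 0.06}, p_plr=0.08))
print(scheme2_estimate({1: 0.80, 3: 0.50}, p_plr=0.77))
```

In the first call the cascade moves from P̂1 to P̂2 to P̂4 and accepts P̂4; in the second, P̂3 contradicts P̂1, so the PLR estimate is returned.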

Scheme 3: Wang, Shen & Liu (2007) suggested estimating the class probability for large

margin classifiers by bracketing the probability using multiple weighted classifiers. We

apply a similar idea to the RPLR. First, we make an equally spaced partition of

[0, 1], that is, 0 = π0 < π1 < … < πm < πm+1 = 1 such that πj+1 − πj is constant for j = 0,

…, m. Then we can obtain f̂j from the RPLR with πj, j = 1, …, m. By Theorem 1, f̂j estimates

whether the class probability is greater than πj or not. Therefore, if we make the partition fine

enough, then we can achieve probability estimation with the desired level of accuracy. To be

more specific, we define π_* = max{πj : f̂j(x) > 0} and π^* = min{πj : f̂j(x) < 0}; then p̂(x) is

obtained as (π_* + π^*)/2.

This method is not restricted by the truncation location, that is, we can use this method for

any t ≥ log 2, corresponding to s ≤ 0. The larger m we use, the finer estimate we can get.

However, larger m’s require higher computational costs. As discussed in Wang, Shen & Liu

(2007), this scheme provides consistent estimators for the class probability. Our numerical

examples demonstrate that the third scheme works the best among the three schemes.
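The bracketing step of the third scheme reduces to a few lines once the m weighted fits are available. The sketch below assumes the reading in which the estimate is the midpoint between the largest πj with f̂j(x) > 0 and the smallest πj with f̂j(x) < 0, with the endpoints 0 and 1 used when one of the two sets is empty (an assumption for the boundary case). The fitted values are replaced by a hypothetical stand-in that has the correct sign, so no RPLR fit is performed here.

```python
def bracket_probability(pis, f_values):
    """Midpoint of the bracket implied by the signs of f_hat_j(x), j = 1, ..., m.

    pis      : interior grid pi_1 < ... < pi_m in (0, 1)
    f_values : f_hat_j(x) from the weighted fit with weight pi_j
    """
    lower = max((p for p, f in zip(pis, f_values) if f > 0), default=0.0)
    upper = min((p for p, f in zip(pis, f_values) if f < 0), default=1.0)
    return 0.5 * (lower + upper)

m = 9
pis = [j / (m + 1) for j in range(1, m + 1)]   # 0.1, 0.2, ..., 0.9
true_p = 0.37                                   # hypothetical P(x) for illustration
f_values = [true_p - pi for pi in pis]          # stand-in with the sign of P(x) - pi_j
print(bracket_probability(pis, f_values))       # midpoint of the bracket [0.3, 0.4]
```

With real fits the signs need not be monotone in πj, in which case the lower end can exceed the upper end; how to resolve such conflicts is a design choice not covered by this sketch. The resolution of the estimate is half the grid spacing, which is the trade-off between m and computational cost noted above.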

5. COMPUTATIONAL ALGORITHMS

Since the loss function gs is not convex, the RPLR requires non-convex minimization. Note

that gs can be written as the difference of two convex functions as gs(u) = l(u) − ls(u) as

shown in the left panel of Figure 1. With this decomposition, we can solve the non-convex

minimization via the d.c. algorithm (An & Tao, 1997; Liu, Shen & Doss, 2005). For each

iteration, ls is replaced by its linear approximation using the current solution. Then the

problem becomes convex minimization. We iterate this until the objective function

converges.

In the literature, Fan & Li (2001) introduced local quadratic approximation (LQA) to solve

penalized likelihood optimization problems. Hunter & Li (2005) studied convergence of

LQA as an instance of minorize–maximize or majorize–minimize (MM) algorithm.

Considering a linear approximation of ls as the affine minorization, the d.c. algorithm for

RPLR is also a special case of the MM algorithm. Since the objective function in (18) is

positive, our d.c. algorithm converges to an ε-local minimizer in finite iterations (An & Tao,

1997; Liu, Shen & Doss, 2005). In this section, we discuss the d.c. algorithm for the RPLR.

In linear learning with f(x) = w^T x + b, (18) can be reduced to

(19)

Using the fact that gs(u) = l(u) − ls(u) with l(u) = log(1 + e^{−u}) and ls(u) = [log(1 + e^{−u})

− log(1 + e^{−s})]+, (19) can be written as


(20)

where Θ = (b, w),

. Then, at the

(m + 1)th iteration, the d.c. algorithm minimizes

(21)

where

and βi = 1 if yi = 1 and f(xi) < s, −1 if yi = −1 and f(xi) > −s, and 0

otherwise. Problem (21) can then be solved using nonlinear convex minimization

techniques.

The algorithm can be extended to nonlinear learning directly. Specifically, for kernel

learning, (18) becomes

(22)

where and υ = (υ1, …, υn). Notice that

. Using Θ = (b, υ) in (20) leads to a similar

algorithm for the nonlinear kernel learning case.
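The d.c. iteration for the linear case can be sketched in a few dozen lines. The code below is an illustration under stated assumptions rather than the authors' implementation: the objective is taken to be (1/n) Σ gs(yi f(xi)) + (λ/2)‖w‖² with equal class weights (the scaling of the loss and penalty is an assumption), each convex subproblem is solved approximately by plain gradient descent with a fixed step (a stand-in for the unspecified convex solver), and the toy data are hypothetical. All function names are chosen for illustration.

```python
import math
import random

def l(u):
    """Logistic loss l(u) = log(1 + exp(-u)), computed stably."""
    return math.log1p(math.exp(-u)) if u >= 0 else -u + math.log1p(math.exp(u))

def dl(u):
    """Derivative l'(u) = -1 / (1 + exp(u)), computed stably."""
    if u <= 0:
        return -1.0 / (1.0 + math.exp(u))
    e = math.exp(-u)
    return -e / (1.0 + e)

def f_value(theta, x):
    """Linear decision function f(x) = b + w^T x with theta = (b, w)."""
    return theta[0] + sum(wj * xj for wj, xj in zip(theta[1:], x))

def objective(theta, X, y, s, lam):
    """(1/n) sum_i g_s(y_i f(x_i)) + (lam/2) ||w||^2 (assumed scaling)."""
    loss = sum(min(l(yi * f_value(theta, xi)), l(s)) for xi, yi in zip(X, y))
    return loss / len(y) + 0.5 * lam * sum(wj * wj for wj in theta[1:])

def dc_rplr(X, y, s, lam, outer=20, inner=100, tol=1e-8):
    n, d = len(y), len(X[0])
    theta = [0.0] * (d + 1)
    # Fixed step 1/L, where L bounds the curvature of the convex part l + penalty.
    step = 1.0 / (sum(1.0 + sum(v * v for v in xi) for xi in X) / (4.0 * n) + lam)
    history = [objective(theta, X, y, s, lam)]
    for _outer in range(outer):
        # Slope of l_s(y_i f_i) with respect to f_i at the current iterate
        # (zero unless y_i f(x_i) < s, i.e. unless the point is truncated).
        slopes = []
        for xi, yi in zip(X, y):
            u = yi * f_value(theta, xi)
            slopes.append(yi * dl(u) if u < s else 0.0)
        # Convex subproblem: l_s replaced by its linearization, gradient descent.
        for _inner in range(inner):
            grad = [0.0] * (d + 1)
            for xi, yi, ci in zip(X, y, slopes):
                r = (yi * dl(yi * f_value(theta, xi)) - ci) / n
                grad[0] += r
                for j in range(d):
                    grad[j + 1] += r * xi[j]
            for j in range(d):
                grad[j + 1] += lam * theta[j + 1]
            theta = [t - step * gr for t, gr in zip(theta, grad)]
        history.append(objective(theta, X, y, s, lam))
        if history[-2] - history[-1] < tol:
            break
    return theta, history

# Toy data (hypothetical): two Gaussian clusters plus one far-away point of class +1.
random.seed(0)
X = [[random.gauss(1.5, 1.0), random.gauss(1.5, 1.0)] for _ in range(20)]
X += [[random.gauss(-1.5, 1.0), random.gauss(-1.5, 1.0)] for _ in range(20)]
y = [1] * 20 + [-1] * 20
X.append([-8.0, -8.0])
y.append(1)
theta, history = dc_rplr(X, y, s=-math.log(3.0), lam=0.01)
print("theta =", theta)
print("objective by d.c. iteration:", history)
```

Starting from θ = 0 with s < 0, no point is truncated in the first pass, so the first subproblem is an ordinary PLR fit; later passes remove the pull of points with yi f(xi) < s. Because every inner step decreases the convex surrogate, the recorded objective values do not increase from one d.c. iteration to the next.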

6. TUNING PARAMETER SELECTION

The tuning parameter λ in (19) and (22) plays an important role for the RPLR. In this

section, we explore various ways to tune λ. We use the penalty term which measures

smoothness of the model to avoid overfitting the data, and the tuning parameter λ decides

how smooth our model will be. Thus, the choice of λ has a big impact on the resulting

model.

There are numerous ways proposed to tune λ in the penalized likelihood literature and we

employ some of those here for the RPLR. Some well known ones include the cross

validation, AIC, and BIC. Among them, cross validation is probably one of the most

commonly used methods. Cantoni & Ronchetti (2001a) pointed out that the choice of λ could be

influenced by outliers. They proposed robust versions of cross validation and Mallows’ Cp,

which are essentially equivalent to modifying the loss function by imposing weights. In

contrast, our RPLR automatically chooses robust λ without employing weights, because the

loss function itself is already designed to reduce the effect of outliers. Since cross validation

requires intensive computation, generalized approximate cross validation (GACV) can be a


good approximation. In this section, we explore how to generalize GACV to the RPLR

problem.

Xiang & Wahba (1996) proposed GACV for the PLR, which estimates comparative

Kullback–Leibler distance between the true linear predictor f(x) and the estimated one for a

particular λ. It starts with a leave-one-out version, then uses Taylor expansion to get an

estimate. This idea can be generalized here to get GACV of the RPLR. The details are as

follows.

Let fλ(x) be the solution of the RPLR for a particular value of λ. The Kullback–Leibler

distance KL(f, fλ) is

(23)

where L̃(yi, f(xi)) = P(xi)^{(1+yi)/2}(1 − P(xi))^{(1−yi)/2} for the PLR and L̃(yi, f(xi)) =

Q+(xi)^{(1+yi)/2}(1 − Q−(xi))^{(1−yi)/2} for the RPLR. Since the true f(x) is unknown and does not

depend on λ, we define the comparative KL loss,

(24)

to compare models with different λ. It can be shown that

for the PLR, and

for the RPLR, with zi = (1 + yi)/2.

Then the remaining issue is how to estimate the CKL. After some derivation (the details are

included in the Appendix Section), we define GACV for the RPLR as follows

(25)

where H = {W(fλ) + nλΣ}^{−1} with Σ such that f^TΣf, hii is the ith diagonal entry of H, Pλ(x) =

1/(1 + e^{−fλ(x)}), and

(26)

with ai = −zi fλ(xi) + log(1 + e^{fλ(xi)}) and bi = zi(fλ(xi) − fλ^{(−i)}(xi)), where fλ^{(−i)}(·) is the

solution of the RPLR with the ith data point omitted. Using the fact that 0 < di < 1, we can

bound GACV(λ). We use the average of the upper and lower bound of GACV. In particular,

we define the EGACV

(27)


We use simulated data to illustrate the performance of EGACV(λ). The training set consists

of 50 data points sampled from the uniform distribution over a unit disk {(x1, x2) : x1² + x2² ≤ 1}

and labeled as y = 1 if x1 ≥ x2, y = −1 otherwise. The testing set has 10^5 data points which

are sampled and labeled in the same manner as the training set. Using these datasets, we

build a model using the RPLR with t = 2 log 2 based on the training set and calculate

CKL(λ) of the testing set for each λ such that log10 λ ∈ {−3.0, −2.9, …, 2.0}. Then we

calculate EGACV(λ) using the training set only and plot it with CKL(λ) to see how close

they are. We repeat this 100 times with a different training set each time and take the average of

EGACV(λ) and CKL(λ) and plot them. The left panel of Figure 4 illustrates typical curves

of EGACV(λ) and CKL(λ) from one example, and the average curves of the 100 repetitions

are plotted in the right panel. The solid line shows CKL(λ), the dashed line shows

EGACV(λ), and the dotted lines show the upper and lower bounds of GACV(λ). As shown

in Figure 4, EGACV(λ) tracks the variation of CKL(λ) quite well, so EGACV(λ) can be a

useful tool to tune λ.

7. NUMERICAL EXAMPLES

In this section, we examine performance of the RPLR. Using two simulated examples and

two real data examples, we compute the PLR and RPLR to compare their classification

errors as well as accuracy of class probability estimation.

7.1. Simulation

In the simulated examples, data are generated with training, tuning, and testing sets of sizes

100, 100, and $10^6$, respectively. The training data sets are used to build

classifiers, and λ is chosen in two different ways: by a grid search based on the tuning sets,

and by a grid search based on the EGACV calculated from the training set. The testing

errors and probability estimation errors are evaluated using the testing sets.

Example 1: The data are generated as follows. First, (x1, x2) is sampled from the uniform distribution over the unit disk $\{(x_1, x_2) : x_1^2 + x_2^2 \le 1\}$. Then, set y = 1 if x1 ≥ x2, y = −1 otherwise.

To demonstrate robustness of the RPLR, we randomly select v percent of the observations and change their class labels to the other class, where v = 0%, 5%, 10%, and 20%. For each value of v, we repeat the classification procedure 100 times to capture variation of the results. Since the true boundary is linear, we focus on linear learning in this example. For the RPLR, we use s = 0, −log 3, and −log 7, which correspond to t = log 2, 2 log 2, and 3 log 2, respectively. We also report the misclassification rate of the RPLR when we tune s along with λ, as well as results of another version of logistic regression proposed by Croux & Haesbroeck (2003) for comparison. For class probability estimation, we apply Scheme 3 to each t, but Scheme 1 and Scheme 2 are used only for t = 3 log 2 because they are valid only if t > 2 log 2. To evaluate accuracy of probability estimation, we use

to measure the probability estimation error, where n′ is the size of the testing set.
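The data-generating mechanism of Example 1 can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the rejection sampler, and the seed handling are assumptions introduced here, and only the unit-disk sampling, the labeling rule y = 1 if x1 ≥ x2, and the flipping of v percent of the labels come from the text.

```python
import numpy as np

def sample_unit_disk(n, rng):
    """Draw n points uniformly from the unit disk by rejection sampling."""
    points = np.empty((0, 2))
    while len(points) < n:
        # Propose points in the square [-1, 1]^2 and keep those inside the disk.
        cand = rng.uniform(-1.0, 1.0, size=(2 * n, 2))
        cand = cand[(cand ** 2).sum(axis=1) <= 1.0]
        points = np.vstack([points, cand])
    return points[:n]

def make_example1(n, v_percent, seed=0):
    """Return inputs, clean labels, and labels with v_percent of them flipped."""
    rng = np.random.default_rng(seed)
    x = sample_unit_disk(n, rng)
    y = np.where(x[:, 0] >= x[:, 1], 1, -1)        # y = 1 if x1 >= x2, else -1
    n_flip = int(round(n * v_percent / 100.0))      # number of contaminated labels
    flip_idx = rng.choice(n, size=n_flip, replace=False)
    y_noisy = y.copy()
    y_noisy[flip_idx] *= -1                         # move to the other class
    return x, y, y_noisy

x, y_clean, y_noisy = make_example1(n=100, v_percent=10)
```

With n = 100 and v = 10, exactly ten labels differ between `y_clean` and `y_noisy`, which mirrors the contamination scheme described above.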

Results are summarized in Tables 1 and 2. With no contamination, the RPLR and the PLR

perform very similarly. As we increase the percent of contamination, the RPLR performs

better than the PLR because the truncated loss is more robust against outliers.

The truncation location is an important issue. If the loss function is not truncated, it can be

sensitive to outliers. If the loss function is truncated too much, we may underuse the

information of those data points close to the decision boundary. The performance of the

RPLR with t = log 2 corresponding to the most truncation is indeed suboptimal as shown in


Tables 1 and 2. The RPLR with t = 3 log 2 works the best for the cases v = 0 and 5, but as

the proportion of contamination grows, performance of the RPLR with t = 2 log 2 becomes

the best. This is reasonable because more truncation helps for data with more outliers. In general, we recommend using t = 2 log 2 as the truncation location for binary problems. This choice also has good theoretical justification in terms of Fisher consistency, as mentioned in Section 4.2. The results with t = 2 log 2 are comparable to those using the tuned t, but a fixed t can be more efficient to compute.
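The relation between the truncation point s and the truncation level t can be checked numerically. The sketch below assumes the logistic loss l(u) = log(1 + e^{−u}) and the truncated loss g_s(u) = min(t, l(u)) with t = log(1 + e^{−s}), as used in the proof of Theorem 2; the function names are illustrative only.

```python
import numpy as np

def logistic_loss(u):
    """l(u) = log(1 + exp(-u)), computed in a numerically stable way."""
    return np.logaddexp(0.0, -np.asarray(u, dtype=float))

def truncation_level(s):
    """t = log(1 + exp(-s)): the value of the logistic loss at u = s."""
    return float(logistic_loss(s))

def truncated_logistic_loss(u, s):
    """g_s(u) = min(t, l(u)): the logistic loss capped at level t."""
    return np.minimum(truncation_level(s), logistic_loss(u))

# s = 0, -log 3, -log 7 give t = log 2, 2 log 2, 3 log 2 (up to rounding).
for s in (0.0, -np.log(3.0), -np.log(7.0)):
    print(s, truncation_level(s) / np.log(2.0))
```

A badly misclassified point (very negative u) contributes at most t to the objective, which is why a smaller t limits the influence of outliers more aggressively.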

Regarding the choice of λ, the one chosen based on the tuning set performs better than the

one by the EGACV. This may not be surprising because the first approach uses information

from both the training set and the tuning set to choose λ, while the EGACV approach uses

the training set only. Hence a direct comparison may not be fair considering the difference

in the amount of information used between the two approaches. Nevertheless we can see that

the EGACV approach works reasonably well in this example.

Note that the overall performance of the robust estimator of the logistic regression by Croux

& Haesbroeck (2003) is not as good as that of the RPLR, especially when the data are highly

contaminated.

As to the issue of class probability estimation, the RPLR with t = 3 log 2 works the best for

non-contaminated data, but t = 2 log 2 becomes better as the rate of contamination increases.

This agrees with the results of classification errors. In general, better classification

performance can be translated into better class probability estimation. Thus, the RPLR

yields more accurate class probability estimation than that of the PLR. Among three

different schemes, Scheme 3 seems to perform the best overall.

To visualize the classification boundaries, we select a typical dataset and plot the

corresponding boundaries yielded by the PLR and the RPLR on the left panel of Figure 5.

Clearly, the RPLR is much less sensitive to outliers and delivers a more accurate classification

boundary than the PLR.

Example 2: We generate (x1, x2) uniformly from the unit disk $\{(x_1, x_2) : x_1^2 + x_2^2 \le 1\}$, with y being 1 if (x1 − x2)(x1 + x2) < 0, and −1 otherwise. Then we flip the class labels using the same strategy as in Example 1. Linear learning does not work here because the true boundary is nonlinear. We use nonlinear learning with the Gaussian kernel $K(x_1, x_2) = \exp(-\|x_1 - x_2\|^2/(2\sigma^2))$. We tune σ among the first quartile, the median, and the third quartile of the between-class pairwise Euclidean distances of training inputs (Wu & Liu, 2007). We use the same truncation locations, class probability estimation schemes, and measure of probability estimation errors as in Example 1. Results are similar to Example 1 and not included to save space. The RPLR with t = 2 log 2 works the best overall. When outliers exist in the data, truncation indeed improves both classification accuracy and class probability estimation. We also plot the results of one typical example on the right panel of Figure 5. Again, the RPLR is more robust and consequently its classification boundary is closer to the Bayes decision boundary.
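The choice of candidate values for σ in Example 2 can be illustrated with a short sketch. It assumes labels coded as ±1 and NumPy's default linear-interpolation quantiles; the function names are hypothetical, and the kernel follows the Gaussian form stated in the text.

```python
import numpy as np

def between_class_distances(x, y):
    """Euclidean distances between every (positive, negative) pair of inputs."""
    pos, neg = x[y == 1], x[y == -1]
    diff = pos[:, None, :] - neg[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2)).ravel()

def sigma_candidates(x, y):
    """First quartile, median, and third quartile of the between-class distances."""
    return np.percentile(between_class_distances(x, y), [25, 50, 75])

def gaussian_kernel(a, b, sigma):
    """Kernel matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))
```

Each of the three candidates would then be paired with the grid of λ values, and the combination with the best tuning criterion retained.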

Overall, based on these examples, we can conclude that the RPLR works better than the

original PLR and is also competitive compared with the method of Croux & Haesbroeck

(2003). We also explored the case when the logistic model is the true underlying model. In

that case, the PLR works slightly better than the RPLR. When we contaminate the

data with outliers, the RPLR works better than the PLR as expected.

7.2. Real Data

7.2.1. Leukaemia data—In this section, we apply the PLR and the RPLR to the

leukaemia data set described in Golub et al. (1999). This data set is publicly available at:


www.broad.mit.edu/cgi-bin/cancer/datasets.cgi. It contains 72 samples with 7,129 gene

expression values. The goal is to classify the patients into two types of leukaemia: acute

myeloid leukaemia (AML) and acute lymphoblastic leukaemia (ALL). Since the number of

genes is much higher than the sample size, we performed prescreening to choose a subset of

genes. In particular, we used the ratios of between-groups to within-groups sum of squares

of the genes to sort them and chose the top 40 genes. A similar procedure was used by Dudoit,

Fridlyand & Speed (2002).
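The gene prescreening step can be sketched as below. The code assumes a samples-by-genes matrix `x` and a label vector `y`, and it assumes the usual definition of the ratio as the between-group sum of squares of class means around the overall mean divided by the within-group sum of squares around the class means; the function names are illustrative.

```python
import numpy as np

def bw_ratio(x, y):
    """Between-group to within-group sum-of-squares ratio for each gene (column)."""
    overall = x.mean(axis=0)
    bss = np.zeros(x.shape[1])
    wss = np.zeros(x.shape[1])
    for label in np.unique(y):
        group = x[y == label]
        center = group.mean(axis=0)
        bss += len(group) * (center - overall) ** 2   # spread of class means
        wss += ((group - center) ** 2).sum(axis=0)    # spread within each class
    return bss / wss

def top_genes(x, y, k=40):
    """Indices of the k genes with the largest ratio, best first."""
    return np.argsort(bw_ratio(x, y))[::-1][:k]
```

Genes with a large ratio separate the class means well relative to their within-class variability, which is the rationale for keeping them.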

This data set includes a training set with 38 instances and a testing set with 34 instances.

Heatmaps in Figure 6 are drawn for good visualization of the data sets. From the heatmap of

the testing set, we can identify some observations that are difficult to classify. Indeed, there

are two subjects that the PLR and the RPLR fail to classify to the correct classes. The

training set is used for model building, then performance of the model is evaluated on the

testing set. More specifically, the tuning parameter λ is chosen by fivefold cross validation

on the training set. We also used EGACV and it gives very similar results. Using the RPLR

coefficients estimated from the training set with the selected λ, class probability of each

instance in the testing set is estimated. Both linear and nonlinear learning with Gaussian

kernel have been performed. The results show that linear learning works better for this

problem.

Figure 7 shows the results of the PLR and the RPLR with t = 2 log 2. The results when t =

log 2 and t = 3 log 2 are not reported because they are barely different from the case when t

= 2 log 2. The horizontal axis stands for the estimated value of linear predictor f(x) = wTx +

b, and the vertical axis stands for the estimated probability. The observations of the classes

ALL and AML are plotted as circles and squares, respectively, with a color scheme of blue

for the training set (larger symbols) and red for the testing set (smaller symbols) for the

online version of the plot. The solid and dashed lines are the estimated density curves of the

values of linear predictors for the ALL and AML classes, respectively. Here, the class

probabilities for the PLR were estimated by P̂(x) = ef̂/(1 + ef̂). For the RPLR, we use

Scheme 3 to estimate the class probabilities. In both procedures of probability estimation, f̂

(x) > 0 implies P̂(x) > 0.5, hence the sign(f̂(x)) gives class prediction. As shown in Figure 7,

there are two common misclassified observations by the PLR and RPLR. This is not

surprising considering the nature of the data revealed by the heatmaps. Besides the two

misclassified observations, the PLR and the RPLR show different patterns in class

probability estimation. The estimated class probabilities by the RPLR are either very close

to 1 or 0, while estimated probabilities by the PLR have more variability. This is because

these two classifiers have different sensitivity to outliers: since the PLR is sensitive to

those two misclassified observations, the estimated probabilities of other observations are

affected so that we lose some certainty about the class memberships for some of the other

observations despite the clear pattern of the data. On the other hand, those two misclassified

observations do not influence the RPLR as much, hence all the other class probabilities

remain close to 0 or 1, which reflects the nature of the data better.

7.2.2. Lung cancer data—In this section, we apply the RPLR to the lung cancer data set

described in Liu et al. (2008). The data set we use here has expression values of 12,625 genes for 188 lung cancer patients in five categories: Adeno, Carcinoid, Colon, SmallCell, and Squamous, with 128, 20, 13, 6, and 21 patients, respectively. First, we calculate

the ratio of the standard deviation and the sample mean of each gene, and choose 316 genes

with the highest ratios. Then we standardize the genes so that each gene has sample mean 0

and sample standard deviation 1. Figure 8 is the biplot of the data after filtering and

standardization on principal component analysis (PCA). Out of all five types of cancer, the

Adeno group has the broadest spectrum and overlaps substantially with the other types. This

matches the biological knowledge that Adeno is a very heterogeneous lung cancer subtype


(Bhattacharjee et al., 2001). For that reason, we perform the RPLR to classify Adeno

patients versus all other cancer patients.
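The filtering and standardization described for the lung cancer data can be sketched as follows. This is an illustration under stated assumptions: `x` is a samples-by-genes matrix with positive gene means, the sample standard deviation uses the n − 1 denominator, and the function name is hypothetical.

```python
import numpy as np

def filter_and_standardize(x, n_keep=316):
    """Keep the n_keep genes with the largest sd/mean ratio, then standardize them."""
    mean = x.mean(axis=0)
    sd = x.std(axis=0, ddof=1)
    ratio = sd / mean                                  # assumes positive gene means
    keep = np.argsort(ratio)[::-1][:n_keep]            # largest ratios first
    kept = x[:, keep]
    # Center and scale each retained gene to sample mean 0 and sample sd 1.
    z = (kept - kept.mean(axis=0)) / kept.std(axis=0, ddof=1)
    return z, keep
```

The returned index vector `keep` records which genes survived the filter so that the same columns can be extracted from any held-out samples.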

Since there are 188 cancer patients in total, we randomly divide patients into training,

tuning, and testing sets with sample sizes 63, 63, 62, respectively. Then we build a model for

each value of λ and choose the λ that gives the smallest misclassification rate on the tuning

set. Using the model with the selected λ, the misclassification rate on the testing set is

calculated. This whole procedure is repeated 10 times.
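One repetition of this split-and-tune procedure can be sketched as below. The callable `fit_predict` is a hypothetical placeholder for any classifier that is trained with a given λ and returns predicted labels for the tuning inputs; it is not part of the original method description, and the helper names are assumptions.

```python
import numpy as np

def three_way_split(n, sizes, rng):
    """Randomly partition indices 0..n-1 into groups of the given sizes."""
    if sum(sizes) != n:
        raise ValueError("sizes must add up to n")
    perm = rng.permutation(n)
    cuts = np.cumsum(sizes)[:-1]
    return np.split(perm, cuts)

def select_lambda(fit_predict, lambdas, x_tr, y_tr, x_tu, y_tu):
    """Return the lambda with the smallest tuning misclassification rate."""
    errors = [np.mean(fit_predict(x_tr, y_tr, x_tu, lam) != y_tu) for lam in lambdas]
    best = int(np.argmin(errors))
    return lambdas[best], float(errors[best])

rng = np.random.default_rng(1)
train_idx, tune_idx, test_idx = three_way_split(188, (63, 63, 62), rng)
```

Ties in the tuning error are broken in favor of the first λ in the grid, which is one possible convention among several.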

The results are reported in Table 3. We can see that although the difference is not very big,

truncation indeed improves performance, and the truncation location that we suggest, t = 2

log 2, gives the best result.

Overall, we can see that the RPLR yields competitive performance when the data are noisy

with potential outliers. In practice, one may not know whether it is advantageous to use robust

methods for a particular application. Based on our experience, even when there are no

outliers, the RPLR gives similar performance to that of the PLR. Thus, one may try both

methods and compare the results.

8. DISCUSSION

In this article, we have proposed the RPLR, using the truncated logistic loss function to

produce more robust classifiers to outliers than the standard PLR. Moreover, we have

proposed three schemes of class probability estimation for the RPLR. Our theoretical

investigation shows that the proposed RPLR is Fisher consistent and more robust than the

original PLR. Numerical results demonstrate that truncation of the loss function indeed

reduces the effect of outliers so that more accurate classification and class probability

estimation can be obtained.

Our current study focuses on the loss function framework. It will be interesting to perform a theoretical comparison of the proposed method with other existing robust logistic regression methods from the likelihood point of view. Future work includes the study of robustness versus efficiency as well as comparisons using influence functions and sensitivity curves.

We have used the L2 penalty for the regularization term J(f). It is now well known that one

can use some other penalty functions to achieve variable selection. Examples of such

penalty functions include the L1 penalty (Tibshirani, 1996), the SCAD penalty (Fan & Li,

2001), the COSSO penalty (Lin & Zhang, 2006), etc. A natural extension of the RPLR is to

use different penalty functions to achieve simultaneous variable selection and robust

classification. Moreover, although we have focused on the binary case in this article, the

truncated logistic loss is applicable for multicategory classification problems as well. The

work of Zhu & Hastie (2005) can be useful here.

Acknowledgments

The authors are indebted to the editor, the associate editor, and two referees, whose helpful comments and

suggestions led to a much improved presentation. This research was supported in part by National Science

Foundation grant (DMS-0747575) and National Institutes of Health grant (NIH/NCI R01-CA149569).

APPENDIX

Proof of Theorem 1. Since E[hπ(Y)gs(Yf(X))] = E[E[hπ(Y)gs(Yf(X))|X = x]], we can minimize

E[hπ(Y)gs(Yf(X))] by minimizing E[hπ(Y)gs(Yf(X))|X = x] for every x. Note that


E[hπ(Y)gs(Yf(X))|X = x] = P(x)(1 − π)gs(f(x)) + (1 − P(x))πgs(−f(x)). Because gs is a non-

increasing function, the minimizer should satisfy that

otherwise. Note that P(x)(1 − π) > (1 − P(x))π is equivalent to P(x) > π. Hence, it is

sufficient to show that f = 0 is not a minimizer. We can assume P(x) > π without loss of

generality. For s = 0, E[hπ(Y)gs(0)|X = x] = P(x)(1 − π)gs(0) + (1 − P(x))πgs(0), and

E[hπ(Y)gs(1)|X = x] = P(x)(1 − π)gs(1) + (1 − P(x))πgs(−1). Hence E[hπ(Y)gs(0)|X = x] >

E[hπ(Y)gs(1)|X = x] because gs(0) > gs(1) and gs(0) = gs(−1). Thus, f = 0 is not a minimizer

in this case. For s < 0,

if P(x)(1 − π) > (1 − P(x))π,

because . Thus, f = 0 is not a minimizer. Hence, has the same sign as P(x) − π.

Proof of Theorem 2. Define A(f) = E[hπ(Y)gs(Yf(X))|X = x]. Observe that $A(f) = P(x)(1 - \pi)\min(t, \log(1 + e^{-f(x)})) + (1 - P(x))\pi\min(t, \log(1 + e^{f(x)}))$, where $t = \log(1 + e^{-s})$. We consider three cases, s ≤ f ≤ −s, f < s, and f > −s.

First, when s ≤ f ≤ −s, $A'(f) = \{-P(x)(1 - \pi) + (1 - P(x))\pi e^{f}\}/(1 + e^{f})$ and $A''(f) = \{P(x)(1 - \pi) + (1 - P(x))\pi\}e^{f}/(1 + e^{f})^2$. Note that A″(f) > 0 for any f ∈ [s, −s], and A′(f̃) = 0 when $\tilde{f} = \log\{(1 - \pi)P(x)/(\pi(1 - P(x)))\} = \log\tau(P(x), \pi)$. Hence, f̃ is the minimizer of A(f) for f ∈ [s, −s]. Note that $A(\tilde{f}) = \{P(x)(1 - \pi) + (1 - P(x))\pi\}\log\{P(x)(1 - \pi) + (1 - P(x))\pi\} - P(x)(1 - \pi)\log\{P(x)(1 - \pi)\} - (1 - P(x))\pi\log\{(1 - P(x))\pi\}$.
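The interior case can be checked numerically. The sketch below evaluates A(f) on a grid over [s, −s] for one illustrative choice of P(x), π, and t (these values are assumptions made here, not taken from the paper) and compares the grid minimizer and minimum with the closed forms above.

```python
import numpy as np

def conditional_risk(f, p, pi, t):
    """A(f) = p(1-pi) min(t, log(1+e^{-f})) + (1-p) pi min(t, log(1+e^{f}))."""
    a = p * (1.0 - pi)
    b = (1.0 - p) * pi
    return (a * np.minimum(t, np.logaddexp(0.0, -f))
            + b * np.minimum(t, np.logaddexp(0.0, f)))

def interior_minimizer(p, pi):
    """Stationary point of A on [s, -s]: log{(1-pi) p / (pi (1-p))}."""
    return np.log(p * (1.0 - pi) / (pi * (1.0 - p)))

def risk_at_interior_minimizer(p, pi):
    """Closed form of A at the stationary point."""
    a = p * (1.0 - pi)
    b = (1.0 - p) * pi
    return (a + b) * np.log(a + b) - a * np.log(a) - b * np.log(b)

p, pi, t = 0.7, 0.5, 2.0 * np.log(2.0)      # illustrative values only
s = -np.log(np.exp(t) - 1.0)                # inverts t = log(1 + e^{-s})
grid = np.linspace(s, -s, 20001)
f_grid = grid[np.argmin(conditional_risk(grid, p, pi, t))]

limit_minus_inf = p * (1.0 - pi) * t        # limit of A(f) as f -> -infinity
limit_plus_inf = (1.0 - p) * pi * t         # limit of A(f) as f -> +infinity
```

For these particular values the interior minimum lies between the two limits, so the comparison with the limits carried out in the remainder of the proof is what determines the global minimizer.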

Second, when f < s, note that $A(f) = P(x)(1 - \pi)t + (1 - P(x))\pi\log(1 + e^{f(x)})$, which is an increasing function of f. Thus, the minimum of A(f) in this case is $\lim_{f\to-\infty}A(f) = P(x)(1 - \pi)t$.

Similarly, when f > −s, $A(f) = P(x)(1 - \pi)\log(1 + e^{-f(x)}) + (1 - P(x))\pi t$, which is a decreasing function of f. Likewise, the minimum of A(f) in this case is $\lim_{f\to\infty}A(f) = (1 - P(x))\pi t$.

Hence, f̃ is the minimizer of A(f) if $A(\tilde{f}) < \lim_{f\to-\infty}A(f) = P(x)(1 - \pi)t$ and $A(\tilde{f}) < \lim_{f\to\infty}A(f) = (1 - P(x))\pi t$. If $A(\tilde{f}) > \lim_{f\to-\infty}A(f) = P(x)(1 - \pi)t$ and $\lim_{f\to\infty}A(f) = (1 - P(x))\pi t > \lim_{f\to-\infty}A(f) = P(x)(1 - \pi)t$, then f = −∞ is the minimizer of A(f). Similarly, f = ∞ is the minimizer of A(f) if $A(\tilde{f}) > \lim_{f\to\infty}A(f) = (1 - P(x))\pi t$ and $\lim_{f\to\infty}A(f) = (1 - P(x))\pi t < \lim_{f\to-\infty}A(f) = P(x)(1 - \pi)t$. Finally, if $A(\tilde{f}) > \lim_{f\to\infty}A(f) = (1 - P(x))\pi t = \lim_{f\to-\infty}A(f) = P(x)(1 - \pi)t$, then both f = −∞ and f = ∞ are minimizers of A(f). The desired results follow by noting that $H_1(\pi, P(x)) = tA(\tilde{f})/\lim_{f\to-\infty}A(f)$ and $H_2(\pi, P(x)) = tA(\tilde{f})/\lim_{f\to\infty}A(f)$.

Derivation of the GACV for the RPLR: First, let $f_\lambda^{(-i)}(\cdot)$ be the solution of the RPLR with the

ith data point omitted. Adopting the leaving-out-one cross validation function

for data from general exponential family in

Xiang & Wahba (1996), we define CV(λ) for the RPLR

(28)

Since it is computationally expensive to calculate fλ(−i)(xi), we approximate CV(λ) using

formulae introduced in Xiang & Wahba (1996), and Liu (1995). Specifically, from (28), we

have

(29)

where $a_i = -z_i f_\lambda(x_i) + \log(1 + e^{f_\lambda(x_i)})$ and $b_i = z_i(f_\lambda(x_i) - f_\lambda^{(-i)}(x_i))$. Define

(30)

Note that 0 < di < 1. Now (29) becomes

(31)

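where $P_\lambda(x_i) = 1/(1 + e^{-f_\lambda(x_i)})$ and $P_\lambda^{(-i)}(x_i) = 1/(1 + e^{-f_\lambda^{(-i)}(x_i)})$. Let $b(f_\lambda(x_i)) = \log(1 + e^{f_\lambda(x_i)})$. Since $b'(f_\lambda(x_i)) = P_\lambda(x_i)$ and $b''(f_\lambda(x_i)) = P_\lambda(x_i)(1 - P_\lambda(x_i))$,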

(32)


and (31) becomes

(33)

Now what is left is the calculation of $(z_i - P_\lambda^{(-i)}(x_i))/(f_\lambda(x_i) - f_\lambda^{(-i)}(x_i))$. We modify the

leaving-out-one lemma of Xiang & Wahba (1996), which is a generalized version of the

leaving-out-one lemma of Craven & Wahba (1979).

Lemma 1 (leaving-out-one lemma). Let $-\tilde{l}(z_i, f(x_i)) = \min\{t, -z_i f(x_i) + \log(1 + e^{f(x_i)})\}$ and

. Suppose f*(i, z*, ·) is the minimizer in ℱ of Iλ(f, z*),

where z* = (z1, …, zi−1, z*, zi+1, …, zn). Then,

where $f_\lambda^{(-i)}(\cdot)$ is the minimizer of $-\sum_{j\ne i}\tilde{l}(z_j, f(x_j)) + n\lambda J(f)$, and $P_\lambda^{(-i)}(x) = 1/(1 + e^{-f_\lambda^{(-i)}(x)})$.

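Proof of Lemma 1. Let $z^{(-i)} = (z_1, \ldots, z_{i-1}, P_\lambda^{(-i)}(x_i), z_{i+1}, \ldots, z_n)^T$, and $-\tilde{l}^*(z, \tau) = -z\tau + \log(1 + e^{\tau})$. Since $-\partial\tilde{l}^*(z, \tau)/\partial\tau = -z + 1/(1 + e^{-\tau})$ and $-\partial^2\tilde{l}^*(z, \tau)/\partial\tau^2 = e^{\tau}/(1 + e^{\tau})^2 > 0$, for any fixed z, the minimizer of $-\tilde{l}^*(z, \tau)$ is the τ which satisfies $z = 1/(1 + e^{-\tau})$. Therefore, using $P_\lambda^{(-i)}(x_i) = 1/(1 + e^{-f_\lambda^{(-i)}(x_i)})$, we have $-\tilde{l}^*(P_\lambda^{(-i)}(x_i), f_\lambda^{(-i)}(x_i)) \le -\tilde{l}^*(P_\lambda^{(-i)}(x_i), f_\lambda(x_i))$. This implies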

(34)

since −l̃(zi, f(xi)) = min{t, −l̃*(zi, f(xi))}. Hence, for any f, we have


using (34) and the definition of $f_\lambda^{(-i)}$. Therefore, we have $f^*(i, P_\lambda^{(-i)}(x_i), \cdot) = f_\lambda^{(-i)}(\cdot)$.

Now let $f_\lambda = (f_\lambda(x_1), \ldots, f_\lambda(x_n))^T$, $f_\lambda^{(-i)} = (f_\lambda^{(-i)}(x_1), \ldots, f_\lambda^{(-i)}(x_n))^T$, $z = (z_1, \ldots, z_n)^T$, and $z^{(-i)} = (z_1, \ldots, z_{i-1}, P_\lambda^{(-i)}(x_i), z_{i+1}, \ldots, z_n)^T$. By the definition of $f_\lambda$, $(f_\lambda, z)$ is a local minimizer of $I_\lambda(f, z^*)$. Also, $(f_\lambda^{(-i)}, z^{(-i)})$ is a local minimizer of $I_\lambda(f, z^*)$ by Lemma 1. Therefore, $\partial I_\lambda(f, z^*)/\partial f = 0$ at $(f_\lambda, z)$ and at $(f_\lambda^{(-i)}, z^{(-i)})$. Writing $J(f) = f^T\Sigma f$ gives $I_\lambda = \sum_{i=1}^{n}\min\{t, -z_i f(x_i) + \log(1 + e^{f(x_i)})\} + n\lambda f^T\Sigma f$ (see Section 3.1 of Xiang & Wahba (1996) for computation of Σ). Since $I_\lambda$ is not differentiable, we approximate it with a differentiable function

(36)

with

(37)

where g** is a quadratic function of f which makes g* differentiable in f. Note that

as ε → 0. Let σij be the ijth element of Σ. Then,

(38)

and


(39)

Therefore, defining

we have

Using Taylor expansion,

(40)

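where $(f_\lambda^{**}, z^{**})$ is a point somewhere between $(f_\lambda, z)$ and $(f_\lambda^{(-i)}, z^{(-i)})$. Approximating $W(f_\lambda^{**})$ by $W(f_\lambda)$ and letting ε → 0 gives $f_\lambda - f_\lambda^{(-i)} = \{W(f_\lambda) + n\lambda\Sigma\}^{-1}(z - z^{(-i)})$, that is,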

(41)

Let $H = \{W(f_\lambda) + n\lambda\Sigma\}^{-1}$ and $h_{ii}$ be the ith diagonal entry of H. Then (41) implies


(42)

Using (42), (33) becomes

(43)

Replacing $h_{ii}$ by tr(H)/n and replacing $h_{ii}P_\lambda(x_i)(1 - P_\lambda(x_i))$ by $\mathrm{tr}(W^{*1/2}HW^{*1/2})/n$ with

we define

(44)

BIBLIOGRAPHY

An LTH, Tao PD. Solving a class of linearly constrained indefinite quadratic problems by d.c.

algorithms. Journal of Global Optimization. 1997; 11:253–285.

Bartlett P, Jordan M, McAuliffe J. Convexity, classification, and risk bounds. Journal of the American

Statistical Association. 2006; 101:138–156.

Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R,

Gillette M, Loda M, Weber G, Mark EJ, Lander ES, Wong W, Johnson BE, Golub TR, Sugarbaker

DJ, Meyerson M. Classification of human lung carcinomas by mRNA expression profiling reveals

distinct adenocarcinoma subclasses. Proceedings of the National Academy of Sciences of the

United States of America. 2001; 98:13790–13795. [PubMed: 11707567]

Bianco, AM.; Yohai, VJ. Robust estimation in the logistic regression model. In: Rieder, H., editor.

Robust Statistics, Data Analysis, and Computer Intensive Methods, Volume 109 of Lecture Notes in

Statistics. New York: Springer-Verlag; 1996.

Bondell H. Minimum distance estimation for the logistic regression model. Biometrika. 2005; 92:724–

731.

Cantoni E, Ronchetti E. Resistant selection of the smoothing parameter for smoothing splines.

Statistics and Computing. 2001a; 11:141–146.

Cantoni E, Ronchetti E. Robust inference for generalized linear models. Journal of the American

Statistical Association. 2001b; 96:1022–1030.

Carroll RJ, Pederson S. On robustness in the logistic regression model. Journal of the Royal Statistical

Society, Series B (Methodological). 1993; 55:693–706.

Copas JB. Binary regression models for contaminated data (with discussion). Journal of the Royal

Statistical Society, Series B (Methodological). 1988; 50:225–265.

Craven P, Wahba G. Smoothing noisy data with spline functions: Estimating the correct degree of

smoothing by the method of generalized cross-validation. Numerische Mathematik. 1979; 31:377–

403.

Croux C, Haesbroeck G. Implementing the Bianco and Yohai estimator for logistic regression.

Computational Statistics and Data Analysis. 2003; 44:273–295.


Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors

using gene expression data. Journal of the American Statistical Association. 2002; 97:77–87.

Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of

the American Statistical Association. 2001; 96:1348–1360.

Golub T, Slonim D, Tamayo P, Huard C, Gaasenbeek M, Mesirov J, Coller H, Loh M, Downing J,

Caligiuri M. Molecular classification of cancer: Class discovery and class prediction by gene

expression monitoring. Science. 1999; 286:531–537. [PubMed: 10521349]

Hunter D, Li R. Variable selection using MM algorithms. The Annals of Statistics. 2005; 33:1617–

1642.

Kimeldorf G, Wahba G. Some results on Tchebycheffian spline functions. Journal of Mathematical

Analysis and Applications. 1971; 33:82–95.

Krasker WS, Welsch RE. Efficient bounded-influence regression estimation. Journal of the American

Statistical Association. 1982; 77:595–604.

Künsch HR, Stefanski LA, Carroll RJ. Conditionally unbiased bounded-influence estimation in general

regression models, with applications to generalized linear models. Journal of the American

Statistical Association. 1989; 84:460–466.

le Cessie S, van Houwelingen JC. Ridge estimators in logistic regression. Applied Statistics. 1992;

41:191–201.

Lin X, Wahba G, Xiang D, Gao F, Klein R, Klein B. Smoothing spline ANOVA models for large data

sets with Bernoulli observations and the randomized GACV. The Annals of Statistics. 2000;

28:1570–1600.

Lin Y. Support vector machines and the Bayes rule in classification. Data Mining and Knowledge

Discovery. 2002; 6:259–275.

Lin Y. A note on margin-based loss functions in classification. Statistics and Probability Letters. 2004;

68:73–82.

Lin Y, Lee Y, Wahba G. Support vector machines for classification in nonstandard situations. Machine

Learning. 2002; 46:191–202.

Lin Y, Zhang HH. Component selection and smoothing in smoothing spline analysis of variance

models—Cosso. Annals of Statistics. 2006; 34:2272–2297.

Liu Y. Unbiased estimate of generalization error and model selection in neural network. Neural

Networks. 1995; 8(2):215–219.

Liu Y, Hayes DN, Nobel A, Marron JS. Statistical significance of clustering for high dimension low

sample size data. Journal of the American Statistical Association. 2008; 103:1281–1293.

Liu Y, Shen X. Multicategory ψ-learning. Journal of the American Statistical Association. 2006;

101:500–509.

Liu Y, Shen X, Doss H. Multicategory ψ-learning and support vector machine: Computational tools.

Journal of Computational and Graphical Statistics. 2005; 14:219–236.

McCullagh, P.; Nelder, J. Generalized Linear Models. London: Chapman & Hall/CRC; 1989.

Morgenthaler S. Least-absolute-deviations fits for generalized linear models. Biometrika. 1992;

79:747–754.

Pregibon D. Resistant fits for some commonly used logistic models with medical applications.

Biometrics. 1982; 38:485–498. [PubMed: 7115876]

Shen X, Tseng G, Zhang X, Wong W. On ψ-learning. Journal of the American Statistical Association.

2003; 98:724–734.

Stefanski LA, Carroll RJ, Ruppert D. Optimally bounded score functions for generalized linear models

with applications to logistic regression. Biometrika. 1986; 73:413–424.

Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society,

Series B. 1996; 58:267–288.

Wahba, G. Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV. In:

Schölkopf, B.; Burges, CJS.; Smola, AJ., editors. Advances in Kernel Methods: Support Vector

Learning. Cambridge, MA: MIT Press; 1999. p. 69-88.

Wang J, Shen X, Liu Y. Probability estimation for large margin classifiers. Biometrika. 2007; 95:149–

167.


Wu Y, Liu Y. Robust truncated hinge loss support vector machines. Journal of the American Statistical

Association. 2007; 102:974–983.

Xiang D, Wahba G. A generalized approximate cross validation for smoothing splines with non-

gaussian data. Statistica Sinica. 1996; 6:675–692.

Zhu J, Hastie T. Kernel logistic regression and the import vector machine. Journal of Computational

and Graphical Statistics. 2005; 14:185–205.


Figure 1.

Left: Plot of the functions l(u), ls(u), and gs(u) with ls(u) = [l(u) − l(s)]+ and gs(u) = l(u) −

ls(u). Right: Plot of the loss functions of the original logistic regression, Pregibon’s resistant

fitting model, Copas’ misclassification model, Bianco and Yohai’s robust logistic

regression, Croux and Haesbroeck’s robust logistic regression and the proposed RPLR.


Figure 2.

Illustration plot of the effect of outliers with an outlier far away from its own class. The

RPLR boundary is more robust than that of the original PLR. Note that on the left panel, the

decision boundaries of the PLR and RPLR are identical.


Figure 3.

Plot of H1 and H2 for Theorem 2 in Section 4.3. The conditions t > H1(π, p) and t > H2(π, p)

only hold when p∈[p1, p2].


Figure 4.

Left: An illustration plot of CKL(λ) and EGACV(λ) from the example in Section 6. Right:

Average curves of CKL(λ) and EGACV(λ) based on 100 replications.


Figure 5.

Plot of typical training sets for Example 1 (the left panel) and Example 2 (the right panel) as

well as the corresponding decision boundaries.


Figure 6.

Heat maps of the leukaemia data in Section 7.2.1. The left panel is for the training set and

the right panel is for the testing set. The red and green colors of the online version represent

high and low expression values, respectively. [Color figure can be viewed in the online

issue, which is available at wileyonlinelibrary.com]


Figure 7.

Plot of the estimated class probabilities against the estimated values of the linear predictor

f(x) = wTx + b for the PLR and the RPLR with t = 2 log 2. The solid and the dashed lines are

the estimated density curves of the values of linear predictor for ALL and AML class,

respectively. [Color figure can be viewed in the online issue, which is available at

wileyonlinelibrary.com]


Figure 8.

Biplot on PCA of the lung cancer data in Section 7.2.2. [Color figure can be viewed in the

online issue, which is available at wileyonlinelibrary.com]

Table 1

Testing errors of the simulated linear example (Example 1) in Section 7.1.

Method                  v = 0             v = 5             v = 10            v = 20
PLR                     0.0090 (0.0006)   0.0726 (0.0014)   0.1348 (0.0021)   0.2371 (0.0022)
RPLR (I)
  t = 3 log 2           0.0061 (0.0005)   0.0606 (0.0009)   0.1172 (0.0015)   0.2271 (0.0022)
  t = 2 log 2           0.0090 (0.0006)   0.0613 (0.0008)   0.1161 (0.0012)   0.2198 (0.0017)
  t = log 2             0.0120 (0.0008)   0.0663 (0.0011)   0.1215 (0.0015)   0.2248 (0.0018)
  Tuned                 0.0097 (0.0007)   0.0612 (0.0008)   0.1150 (0.0011)   0.2205 (0.0016)
RPLR (II)
  t = 3 log 2           0.0187 (0.0011)   0.0714 (0.0012)   0.1280 (0.0018)   0.2378 (0.0033)
  t = 2 log 2           0.0188 (0.0012)   0.0688 (0.0013)   0.1222 (0.0015)   0.2288 (0.0033)
  t = log 2             0.0306 (0.0019)   0.0782 (0.0046)   0.1301 (0.0042)   0.2447 (0.0067)
Croux and Haesbroeck    0.0104 (0.0009)   0.0658 (0.0010)   0.1286 (0.0019)   0.2335 (0.0021)
Bayes error             0.00              0.05              0.10              0.20

Here RPLR (I) and RPLR (II) refer to the RPLR results using the tuning set and EGACV for tuning parameter selection, respectively.

Table 2

Class probability estimation errors of the simulated linear example (Example 1) in Section 7.1.

Method                  Scheme   v = 0             v = 5             v = 10            v = 20
PLR                              0.0464 (0.0060)   0.1342 (0.0062)   0.1487 (0.0046)   0.1350 (0.0035)
RPLR (I)
  t = 3 log 2           1        0.0207 (0.0049)   0.1101 (0.0029)   0.1350 (0.0029)   0.1289 (0.0030)
                        2        0.0173 (0.0039)   0.0994 (0.0027)   0.1236 (0.0033)   0.1270 (0.0032)
                        3        0.0438 (0.0037)   0.0686 (0.0034)   0.1022 (0.0041)   0.1184 (0.0041)
  t = 2 log 2           3        0.0614 (0.0050)   0.0676 (0.0032)   0.0934 (0.0035)   0.1053 (0.0041)
  t = log 2             3        0.0758 (0.0073)   0.0887 (0.0079)   0.1057 (0.0059)   0.1185 (0.0040)
RPLR (II)
  t = 3 log 2           1        0.1152 (0.0008)   0.1248 (0.0015)   0.1323 (0.0015)   0.1279 (0.0026)
                        2        0.0861 (0.0007)   0.1034 (0.0017)   0.1208 (0.0021)   0.1254 (0.0027)
                        3        0.1053 (0.0010)   0.0975 (0.0019)   0.1084 (0.0028)   0.1230 (0.0040)
  t = 2 log 2           3        0.1193 (0.0011)   0.0982 (0.0028)   0.1054 (0.0026)   0.1053 (0.0034)
  t = log 2             3        0.1707 (0.0028)   0.1127 (0.0065)   0.1096 (0.0046)   0.1251 (0.0056)
Croux and Haesbroeck             0.0104 (0.0009)   0.0865 (0.0015)   0.1208 (0.0012)   0.1238 (0.0015)

Table 3

Testing errors of the lung cancer data example in Section 7.2.2.

Method          Testing error
PLR             0.1274 (0.0052)
RPLR
  t = 3 log 2   0.1242 (0.0051)
  t = 2 log 2   0.1210 (0.0046)
  t = log 2     0.1226 (0.0054)
