Robust penalized logistic regression with truncated loss functions

Seo Young Park1 and Yufeng Liu2,*

1Department of Health Studies, Chicago, IL 60615, USA

2Department of Statistics and Operations Research, Carolina Center for Genome Sciences,

Chapel Hill, NC 27599, USA

Abstract

The penalized logistic regression (PLR) is a powerful statistical tool for classification. It has been

commonly used in many practical problems. Despite its success, the loss function of the PLR
is unbounded, and hence the resulting classifiers can be sensitive to outliers. To build more
robust classifiers, we propose the robust PLR (RPLR), which uses truncated logistic loss
functions, and suggest three schemes to estimate conditional class probabilities. Connections
of the RPLR with other existing work on robust logistic regression are discussed. Our
theoretical results indicate

that the RPLR is Fisher consistent and more robust to outliers. Moreover, we develop the
estimated generalized approximate cross validation (EGACV) for tuning parameter selection. Through

numerical examples, we demonstrate that truncating the loss function indeed yields better

performance in terms of classification accuracy and class probability estimation.

Key words and phrases

Classification; logistic regression; probability estimation; robustness; truncation

1. INTRODUCTION

The penalized logistic regression (PLR) is a commonly used classification method in

practice. It is a generalization of the standard logistic regression with a penalty term on the

coefficients. It is now known that the PLR can be fit in the regularization framework with

loss + penalty (Wahba, 1999; Lin et al., 2000). The loss function controls the goodness of fit of
the model, and the penalty term helps avoid overfitting so that good generalization can

be obtained.

The PLR uses the unbounded logistic loss. As a result, the resulting classifier can be

sensitive to outliers. In this article, we propose the robust penalized logistic regression

(RPLR), which uses a truncated logistic loss function. Because truncation reduces the impact

of misclassified outliers, the RPLR is more robust and accurate than the standard PLR.

Connections of the proposed RPLR with some other existing robust logistic regression

methods are also discussed.

© 2011 Statistical Society of Canada. Published in final edited form as: Can J Stat. 2011 June 1; 39(2): 300–323. doi:10.1002/cjs.10105. Author manuscript available in PMC 2011 December 7.

*Author to whom correspondence may be addressed. yfliu@email.unc.edu.

One important aspect of classification is class probability estimation. Good class probability
estimation can reflect the strength of classification, and is therefore desirable in many
applications. In the PLR, one can use the estimated classification function, that is, the
estimated logit function, to derive the corresponding probability estimates. When we

replace the logistic loss by its truncated version, the corresponding classification function
may no longer preserve all class probability information. To solve

this problem, we propose three different schemes for class probability estimation. Properties

and performance of these three schemes are explored as well.

Although the original logistic loss function is convex, its truncated version becomes non-

convex. Consequently, the corresponding minimization problem involves difficult non-

convex optimization. To implement the RPLR, we decompose the non-convex truncated

logistic loss function into the difference of two convex functions. Then, using this

decomposition, we apply the difference convex (d.c.) algorithm to obtain the solution of the

RPLR through iterative convex minimization.
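The specific decomposition is deferred to Section 5. As a sketch, one natural choice (our illustration; the paper's exact form may differ) truncates the loss at a ceiling T and writes min(l(u), T) = l(u) − max(l(u) − T, 0), a difference of two convex functions. A quick numerical check of this identity:

```python
import numpy as np

def logistic_loss(u):
    # l(u) = log(1 + exp(-u)), computed stably
    return np.logaddexp(0.0, -u)

T = logistic_loss(-1.0)  # truncation ceiling; here l(-1), an illustrative choice

def truncated_loss(u):
    # bounded loss: never exceeds the ceiling T
    return np.minimum(logistic_loss(u), T)

def convex1(u):
    # first convex component: the original logistic loss
    return logistic_loss(u)

def convex2(u):
    # second convex component: the part cut off by truncation
    return np.maximum(logistic_loss(u) - T, 0.0)

u = np.linspace(-10.0, 5.0, 1001)
# d.c. identity: truncated loss = convex1 - convex2 everywhere
assert np.allclose(truncated_loss(u), convex1(u) - convex2(u))
```

The d.c. algorithm then linearizes the subtracted convex component at the current iterate, leaving a convex subproblem at each step.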

The tuning parameter plays an important role in the RPLR implementation. To select the
tuning parameter effectively, we develop the estimated generalized approximate cross

validation (EGACV) procedure and compare its performance with the cross validation

method.

In the following sections, we describe the proposed method in more detail, with

theoretical justification and numerical examples. Section 2 reviews the PLR and gives a

maximum likelihood interpretation. In Section 3 we review some related robust logistic

regression methods in the literature. In Section 4 we describe the RPLR and explore its

theoretical properties. The methods for class probability estimation are also introduced.

Section 5 develops the d.c. algorithm to solve the non-convex minimization problem for the

RPLR. In Section 6 we discuss the issue of the tuning parameter selection. Numerical results

are presented in Section 7, and Section 8 provides some discussion. The proofs of the theorems
and the detailed derivation of the tuning procedure are included in the Appendix.

2. PENALIZED LOGISTIC REGRESSION

In binary classification, we want to build a classifier based on a training sample {(xi, yi)|i =

1, 2, …, n}, where xi ∈ Rd is a vector of predictors, and yi ∈ {+1, −1} is its class label.

Typically it is assumed that the training data are distributed according to an unknown

probability distribution P(x, y). The goal is to find a classifier which minimizes the

misclassification rate. Besides good classification accuracy, it is often also desirable to
estimate the class conditional probability.

We first briefly review the PLR and its likelihood interpretation. In the
standard logistic regression model for binary classification, one assumes that the logit can be
modeled as a linear function of the covariates. Specifically, the model can be written as follows:

$$\log \frac{\Pr(Y = +1 \mid X = x)}{\Pr(Y = -1 \mid X = x)} = w^{T} x + b \tag{1}$$

where X and Y denote the vector of explanatory variables and the class label, respectively.

The coefficients of logistic regression (w, b) can be estimated by the method of maximum

likelihood (McCullagh & Nelder, 1989). As one way of smoothing, le Cessie & van
Houwelingen (1992) proposed the PLR, which maximizes the log-likelihood subject to a

constraint on the L2 norm of the coefficients. Wahba (1999) showed that the linear PLR is
equivalent to finding b and w that solve

$$\min_{f \in \mathcal{F}} \ \frac{1}{n} \sum_{i=1}^{n} l\big(y_i f(x_i)\big) + \lambda J(f) \tag{2}$$

where ℱ = {f : f(x) = w^T x + b}, l(u) = log(1 + e^{−u}), J(f) = ‖w‖²/2, and λ > 0 is a tuning
parameter. Once the classification function f is obtained, one can use sign(f(x)) to estimate
the label of x, that is, ŷ = +1 if f(x) ≥ 0, and ŷ = −1 otherwise.
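As a concrete illustration of the linear PLR in (2), the following sketch fits the model by plain gradient descent on toy data (our own minimal code, taking J(f) = ‖w‖²/2; not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy two-class data: class +1 around (1.5, 1.5), class -1 around (-1.5, -1.5)
n = 100
X = np.vstack([rng.normal(1.5, 0.8, (n // 2, 2)),
               rng.normal(-1.5, 0.8, (n // 2, 2))])
y = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])
lam = 0.1  # tuning parameter lambda

def objective(w, b):
    f = X @ w + b
    # (1/n) sum_i l(y_i f(x_i)) + (lambda/2) ||w||^2
    return np.logaddexp(0.0, -y * f).mean() + 0.5 * lam * w @ w

def gradients(w, b):
    f = X @ w + b
    s = -1.0 / (1.0 + np.exp(y * f))   # l'(u) evaluated at u = y_i f(x_i)
    gw = (X * (s * y)[:, None]).mean(axis=0) + lam * w
    gb = (s * y).mean()
    return gw, gb

w, b = np.zeros(2), 0.0
for _ in range(2000):
    gw, gb = gradients(w, b)
    w, b = w - 0.5 * gw, b - 0.5 * gb

accuracy = np.mean(np.sign(X @ w + b) == y)  # classify with sign(f(x))
```

On this well-separated toy sample the fitted boundary classifies nearly all training points correctly; the point of the example is only the loss + penalty structure of (2).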

For a nonlinear problem, the theory of reproducing kernel Hilbert spaces can be applied; the
kernel PLR then has ℱ = {f : f(x) = r(x) + b, r(x) ∈ ℋ_K} and J(f) = ‖r‖_{ℋ_K}, where ℋ_K is
the reproducing kernel Hilbert space induced by the kernel function K (Wahba, 1999). Properties
of the reproducing kernel and the representer theorem imply that

$$f(x) = \sum_{i=1}^{n} \upsilon_i K(x_i, x) + b,$$

where υ = (υ_1, …, υ_n)^T and K is an n × n positive definite matrix with its i_1 i_2-th element
K(x_{i_1}, x_{i_2}) (Kimeldorf & Wahba, 1971).

Notice that the loss function l(u) in (2) is a decreasing function as shown in the left panel of

Figure 1; in particular, its value grows rapidly as u goes to negative infinity. This gives a
high impact to outliers with very small (negative) values of yi f(xi). As a result, the

coefficient estimates of the PLR can be affected by outliers far from their own classes. To

further illustrate the effect of outliers on the PLR, we randomly generate two-dimensional

separable data and apply the PLR to obtain a classification boundary. As shown in the left

panel of Figure 2, the PLR works very well without outliers. However, if we randomly

select one of the observations and move it away from its own class, then the classification

boundary of the PLR is pulled towards that outlier, as shown in the right panel of Figure
2. As a result, the corresponding misclassification rate will become higher. In contrast, our

new proposed method is much more robust to the outlier so that its classification boundary is

more accurate.

The effect of outliers on the PLR can also be interpreted using maximum likelihood. The

likelihood function of unpenalized logistic regression can be written as

$$L(b, w) = \prod_{i=1}^{n} P(x_i)^{(1 + y_i)/2} \big(1 - P(x_i)\big)^{(1 - y_i)/2} \tag{3}$$

where P(x) = Pr(y = +1|x). Then, we can plug in the logit function (1) into (3), and the

corresponding maximizer of L(b, w) is the solution of the logistic regression. Note that the

ith term of the product in the likelihood is P(xi) when yi = +1, and 1 − P(xi) otherwise.

Therefore, to maximize the likelihood, one needs to find (w, b) that makes P(xi) large when yi =
+1 and small when yi = −1. However, this can be sensitive to outliers. To illustrate this

further, assume there is one data point xi with yi = +1 which is located far from the other data

points of class +1 but closer to data of class −1 as illustrated in the right panel of Figure 2.

Using the solution (w, b) without the outlier, the corresponding P(xi) for the outlier will be

very small because xi is closer to the data of class −1. Consequently, the maximum likelihood
method would select (w, b) that makes P(xi) larger to obtain a larger likelihood, at the expense
of the classification accuracy of the other observations. This results in the boundary moving
towards the outlier.
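The loss-function view in (2) and the likelihood view in (3) coincide: writing P(x) = 1/(1 + e^{−f(x)}) for the logit model, the negative log of the ith likelihood term equals the logistic loss l(y_i f(x_i)). A quick numerical check of this identity (our code):

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.normal(size=20)               # arbitrary logit values f(x_i)
y = rng.choice([-1.0, 1.0], size=20)  # class labels in {+1, -1}

P = 1.0 / (1.0 + np.exp(-f))          # P(x_i) = Pr(y = +1 | x_i)
# negative log-likelihood contribution of observation i
nll = -(((1 + y) / 2) * np.log(P) + ((1 - y) / 2) * np.log(1 - P))
# logistic loss l(y_i f(x_i)) = log(1 + exp(-y_i f(x_i)))
loss = np.logaddexp(0.0, -y * f)

assert np.allclose(nll, loss)
```

So maximizing the likelihood and minimizing the total logistic loss are the same problem, which is why an outlier inflates both views in the same way.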

In the next section, we discuss some literature on robust logistic regression.


3. LITERATURE ON ROBUST LOGISTIC REGRESSION

There is a large literature on the robustness issue of the logistic regression. Most of the

existing methods attempt to achieve robustness by downweighting observations which are

far from the majority of the data, that is, outliers (Krasker & Welsch, 1982; Pregibon, 1982;
Stefanski, Carroll & Ruppert, 1986; Copas, 1988; Künsch, Stefanski & Carroll, 1989;
Morgenthaler, 1992; Carroll & Pederson, 1993; Bianco & Yohai, 1996; Bondell, 2005).
Stefanski, Carroll & Ruppert (1986) and Künsch, Stefanski & Carroll (1989) modified the
original score function of the logistic regression to obtain bounded sensitivity, a
concept introduced by Krasker & Welsch (1982). Morgenthaler (1992) used the L1-norm instead

of the L2-norm in the likelihood, resulting in a weighted version of the original score
function. Cantoni & Ronchetti (2001b) focused on robustness of inference rather than the

model.

Pregibon (1982) suggested resistant fitting methods which taper the standard likelihood to

reduce the influence of extreme observations. In particular, he proposed to estimate (w, b) by

solving

$$\min_{w, b} \ \sum_{i=1}^{n} h(x_i)\, \rho(d_i) \tag{4}$$

where ρ(u) is a tapering function, h(x) is a factor which controls the leverage of each
observation, and di is the negative log-likelihood, that is, di = −[((1 + yi)/2) log P(xi) + ((1 −
yi)/2) log(1 − P(xi))]. Note that this reduces to standard maximum likelihood estimation of the
logistic regression when h(x) ≡ 1 and ρ(u) = u. The particular tapering function Pregibon
(1982) proposed is Huber's loss function

$$\rho(u) = \begin{cases} u, & u \le H, \\ 2\sqrt{uH} - H, & u > H, \end{cases} \tag{5}$$

where H is a prespecified constant. In order to compare with our new method, we provide a

new view of the method by Pregibon (1982) in the loss function framework. In particular,

with ρ in (5) and h(x) ≡ 1, we can reduce (4) to

$$\min_{w, b} \ \sum_{i=1}^{n} l_{\text{Pregibon}}\big(y_i f(x_i)\big) \tag{6}$$

where

$$l_{\text{Pregibon}}(u) = \begin{cases} l(u), & l(u) \le H, \\ 2\sqrt{H\, l(u)} - H, & l(u) > H. \end{cases} \tag{7}$$

The estimate in (6) was shown to have approximately 95% asymptotic relative efficiency

when H = 1.3452. The loss function in (7) with H = 1.3452 is plotted in the right panel of

Figure 1 for comparison. As shown in the plot, lPregibon(u) grows as u goes to negative

infinity, but less rapidly than the loss function of the original logistic regression l(u).

Consequently, the resulting coefficient estimates become less sensitive to extreme

observations. However, the value of lPregibon(u) remains unbounded, and hence the impact
of outliers can still be large.
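To make the growth comparison concrete: for large negative u, l(u) ≈ −u, so a Huber-type tapering at level H grows only like 2√(H|u|). A small sketch (our reconstruction of the tapered loss; the paper's exact parametrization may differ), using the constant H = 1.3452 quoted above:

```python
import numpy as np

H = 1.3452  # constant giving roughly 95% asymptotic relative efficiency

def l(u):
    # logistic loss log(1 + exp(-u)), stable form
    return np.logaddexp(0.0, -u)

def l_pregibon(u):
    # Huber-type tapering: identity below H, square-root growth above
    d = l(u)
    return np.where(d <= H, d, 2.0 * np.sqrt(H * d) - H)

u = np.array([0.0, -1.0, -10.0, -100.0])
print(l(u))           # grows roughly linearly in |u|
print(l_pregibon(u))  # grows roughly like 2*sqrt(H*|u|), much more slowly
```

Both losses still diverge as u → −∞, which is why the text notes that the impact of outliers can remain large under tapering alone.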


Bianco & Yohai (1996) proposed a consistent and more robust version of Pregibon's
estimator by adding a bias correction term. More specifically, they proposed to solve

$$\min_{w, b} \ \sum_{i=1}^{n} \big[\rho(d_i) + C_i\big] \tag{8}$$

with the di previously defined and the bias correction term Ci, where Ci = G(P(xi)) + G(1 −
P(xi)) − G(1), G(t) = ∫₀ᵗ ρ′(−log u) du, and

$$\rho(t) = \begin{cases} t - \dfrac{t^2}{2c}, & t \le c, \\[4pt] \dfrac{c}{2}, & t > c, \end{cases} \tag{9}$$

where c is a constant. Croux & Haesbroeck (2003) pointed out that the minimizer of (8) with
ρ(t) in (9) often fails to exist; in particular, the minimizer tends to infinity. To
overcome this problem, they suggested using

$$\rho(t) = \begin{cases} t\, e^{-\sqrt{d}}, & t \le d, \\[4pt] -2 e^{-\sqrt{t}}\big(1 + \sqrt{t}\big) + e^{-\sqrt{d}}\big(2(1 + \sqrt{d}) + d\big), & t > d, \end{cases} \tag{10}$$

and

$$G(t) = \begin{cases} t\, e^{-\sqrt{-\log t}} + e^{1/4}\sqrt{\pi}\,\Big[\Phi\big(\sqrt{2}\,\big(\tfrac{1}{2} + \sqrt{-\log t}\,\big)\big) - 1\Big], & t \le e^{-d}, \\[4pt] G(e^{-d}) + e^{-\sqrt{d}}\,\big(t - e^{-d}\big), & t > e^{-d}, \end{cases} \tag{11}$$

where d is a constant and Φ is the normal cumulative distribution function. To view the
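As a sanity check on the shape of the Croux & Haesbroeck tapering ρ in (10) (our sketch of the function with constant d, taking d = 0.5 as an assumed default; the paper's exact parametrization may differ), it is continuous at the changeover point t = d and, unlike the unbounded Pregibon taper, levels off to a finite bound:

```python
import numpy as np

d = 0.5  # tuning constant (an assumed default)

def rho(t):
    # bounded, increasing transformation of the deviance t >= 0
    t = np.asarray(t, dtype=float)
    below = t * np.exp(-np.sqrt(d))
    above = (-2.0 * np.exp(-np.sqrt(t)) * (1.0 + np.sqrt(t))
             + np.exp(-np.sqrt(d)) * (2.0 * (1.0 + np.sqrt(d)) + d))
    return np.where(t <= d, below, above)

# continuous at the changeover point t = d
assert abs(rho(d - 1e-8) - rho(d + 1e-8)) < 1e-6
# bounded: rho approaches a finite limit as t grows
limit = np.exp(-np.sqrt(d)) * (2.0 * (1.0 + np.sqrt(d)) + d)
assert rho(100.0) < limit < rho(100.0) + 1e-2
```

Boundedness of ρ is what caps the influence of any single observation on the estimating equations.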

method by Croux & Haesbroeck (2003) in the loss function framework, we show that the

problem (8) with ρ(t) in (10) is equivalent to solving

$$\min_{w, b} \ \sum_{i=1}^{n} l_{\text{CH}}\big(y_i f(x_i)\big) \tag{12}$$

where
