Robust Sparse Coding for Face Recognition

Meng Yang
Hong Kong Polytechnic Univ.

Lei Zhang∗
Hong Kong Polytechnic Univ.

Jian Yang
Nanjing Univ. of Sci. & Tech.

David Zhang
Hong Kong Polytechnic Univ.

Abstract

Recently the sparse representation (or coding) based

classification (SRC) has been successfully used in face

recognition. In SRC, the testing image is represented as

a sparse linear combination of the training samples, and

the representation fidelity is measured by the 푙2-norm or

푙1-norm of coding residual. Such a sparse coding model

actually assumes that the coding residual follows Gaus-

sian or Laplacian distribution, which may not be accurate

enough to describe the coding errors in practice. In this

paper, we propose a new scheme, namely the robust sparse

coding (RSC), by modeling the sparse coding as a sparsity-

constrained robust regression problem. The RSC seeks for

the MLE (maximum likelihood estimation) solution of the

sparse coding problem, and it is much more robust to outliers (e.g., occlusions, corruptions, etc.) than SRC. An efficient iteratively reweighted sparse coding algorithm is proposed to solve the RSC model. Extensive experiments on representative face databases demonstrate that the RSC scheme is much more effective than state-of-the-art methods in dealing with face occlusion, corruption, lighting and expression changes, etc.

1. Introduction

As a powerful tool for statistical signal modeling, sparse

representation (or sparse coding) has been successfully used

in image processing applications [16], and recently has led

to promising results in face recognition [24, 25, 27] and

texture classification [15]. Based on the findings that nat-

ural images can be generally coded by structural primitives

(e.g., edges and line segments) that are qualitatively similar

in form to simple cell receptive fields [18], sparse coding

techniques represent a natural image using a small number

of atoms parsimoniously chosen out of an over-complete

dictionary. Intuitively, the sparsity of the coding coefficient

vector can be measured by its 푙0-norm (the 푙0-norm counts

the number of nonzero entries in a vector). Since the com-

binatorial 푙0-norm minimization is an NP-hard problem, the

∗Corresponding author. This research is supported by the Hong Kong

General Research Fund (PolyU 5351/08E).

푙1-norm minimization, as the closest convex function to 푙0-

norm minimization, is widely employed in sparse coding,

and it was shown that 푙0-norm and 푙1-norm minimizations

are equivalent if the solution is sufficiently sparse [3]. In

general, the sparse coding problem can be formulated as

min_휶 ∥휶∥₁   s.t.   ∥풚 − 퐷휶∥₂² ≤ 휀,   (1)

where 풚 is a given signal, 퐷 is the dictionary of coding

atoms, 휶 is the coding vector of 풚 over 퐷, and 휀 > 0 is a

constant.
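For illustration, the Lagrangian form of Eq. (1), min_휶 ∥풚 − 퐷휶∥₂² + 휆∥휶∥₁, can be minimized with a basic iterative soft-thresholding (ISTA) sketch. The dictionary, signal, and solver below are illustrative stand-ins, not the solver used later in the paper (푙1-ls):

```python
import numpy as np

def soft_threshold(x, t):
    # Proximal operator of the l1-norm: shrink each entry toward zero by t.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sparse_code_ista(y, D, lam=1e-3, n_iter=2000):
    """Minimize ||y - D a||_2^2 + lam * ||a||_1 by iterative soft-thresholding."""
    L = 2.0 * np.linalg.norm(D, 2) ** 2     # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ a - y)      # gradient of the quadratic fidelity term
        a = soft_threshold(a - grad / L, lam / L)
    return a

rng = np.random.default_rng(0)
D = rng.standard_normal((50, 100))
D /= np.linalg.norm(D, axis=0)              # unit l2-norm atoms
a_true = np.zeros(100)
a_true[[3, 40, 77]] = [1.0, -0.5, 0.8]      # a 3-sparse code
y = D @ a_true
a_hat = sparse_code_ista(y, D)
print(sorted(np.argsort(-np.abs(a_hat))[:3]))
```

With a sufficiently small 휆, this Lagrangian sketch approximates the constrained form in Eq. (1).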

Face recognition (FR) is among the most visible and

challenging research topics in computer vision and pattern

recognition [29], and many methods, such as Eigenfaces

[21], Fisherfaces [2] and SVM [7], have been proposed in

the past two decades. Recently, Wright et al. [25] applied

sparse coding to FR and proposed the sparse representation

based classification (SRC) scheme, which achieves impres-

sive FR performance. By coding a query image 풚 as a

sparse linear combination of the training samples via the

푙1-norm minimization in Eq. (1), SRC classifies the query

image 풚 by evaluating which class of training samples could

result in the minimal reconstruction error of it with the as-

sociated coding coefficients. In addition, by introducing an

identity matrix 퐼 as a dictionary to code the outlier pixels

(e.g., corrupted or occluded pixels):

min_{휶,휷} ∥[휶;휷]∥₁   s.t.   풚 = [퐷, 퐼] ⋅ [휶;휷],   (2)

the SRC method shows high robustness to face occlusion

and corruption. In [9], Huang et al. proposed a sparse rep-

resentation recovery method which is invariant to image-

plane transformation to deal with the misalignment and

pose variation in FR, while in [22] Wagner et al. proposed

a sparse representation based method that could deal with

face misalignment and illumination variation. Instead of di-

rectly using original facial features, Yang and Zhang [27]

used Gabor features in SRC to greatly reduce the size of the occlusion dictionary and considerably improve the FR accuracy.
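As a small sketch of how the occlusion dictionary in Eq. (2) is assembled (the dimensions here are hypothetical and the 푙1 solver is omitted):

```python
import numpy as np

n, m = 50, 20                        # feature dimension and number of atoms (hypothetical)
rng = np.random.default_rng(1)
D = rng.standard_normal((n, m))
D /= np.linalg.norm(D, axis=0)       # training dictionary with unit-norm atoms

# Augment with the identity so outlier pixels can be coded separately,
# as in Eq. (2): y = [D, I][alpha; beta].
B = np.hstack([D, np.eye(n)])
print(B.shape)                       # (50, 70): m + n atoms
```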

The sparse coding model in Eq. (1) is widely used in

literature. There are mainly two issues in this model. The first one is whether the 푙1-norm constraint ∥휶∥₁ is good enough to characterize the signal sparsity. The second one is whether the 푙2-norm term ∥풚 − 퐷휶∥₂² ≤ 휀 is effective enough to characterize the signal fidelity, especially when the observation 풚 is noisy or has many outliers. Many works have been done for the first issue by modifying the sparsity constraint. For example, Liu et al. [14] added a nonnegative constraint to the sparse coefficient 휶; Gao et al. [4] introduced a Laplacian term of the coefficient in sparse coding; Wang et al. [23] used the weighted 푙2-norm for the sparsity constraint. In addition, Ramirez et al. [19] proposed a framework of universal sparse modeling to design sparsity regularization terms. Bayesian methods were also used for designing the sparsity regularization terms [11].

The above developments of sparsity regularization term

in Eq. (1) improve the sparse representation in different aspects; however, to the best of our knowledge, little work has been done on improving the fidelity term ∥풚 − 퐷휶∥₂², except that in [24, 25] the 푙1-norm was used to define the coding fidelity (i.e., ∥풚 − 퐷휶∥₁). In fact, the fidelity term has a high impact on the final coding results because it ensures that the given signal 풚 can be faithfully represented by the dictionary 퐷. From the viewpoint of maximum likelihood estimation (MLE), defining the fidelity term with the 푙2- or 푙1-norm actually assumes that the coding residual 풆 = 풚 − 퐷휶 follows the Gaussian or Laplacian distribution. But in practice this assumption may not hold well, especially when occlusions, corruptions and expression variations occur in the query face images. So the conventional 푙2- or 푙1-norm based fidelity term in the sparse coding model of Eq. (1) may not be robust enough in these cases. Meanwhile, these problems cannot be well solved by modifying the sparsity regularization term.

To improve the robustness and effectiveness of sparse

representation, we propose a so-called robust sparse cod-

ing (RSC) model in this paper. Inspired by the robust re-

gression theory [1, 10], we design the signal fidelity term

as an MLE-like estimator, which minimizes some function

(associated with the distribution of the coding residuals) of

the coding residuals. The proposed RSC scheme utilizes

the MLE principle to robustly regress the given signal with

sparse regression coefficients, and we transform the mini-

mization problem into an iteratively reweighted sparse cod-

ing problem. A reasonable weight function is designed for

applying RSC to FR. Our extensive experiments in bench-

mark face databases show that RSC achieves much better

performance than existing sparse coding based FR methods,

especially when there are complicated variations of face im-

ages, such as occlusions, corruptions and expressions, etc.

The rest of this paper is organized as follows. Section 2

presents the proposed RSC model. Section 3 presents the

algorithm of RSC and some analyses, such as convergence

and complexity. Section 4 conducts the experiments, and

Section 5 concludes the paper.

2. Robust Sparse Coding (RSC)

2.1. The RSC model

The traditional sparse coding model in Eq. (1) is equiva-

lent to the so-called LASSO problem [20]:

min_휶 ∥풚 − 퐷휶∥₂²   s.t.   ∥휶∥₁ ≤ 휎,   (3)

where 휎 > 0 is a constant, 풚 = [푦1; 푦2; ⋯; 푦푛] ∈ ℝ푛 is the signal to be coded, 퐷 = [풅1, 풅2, ⋯, 풅푚] ∈ ℝ푛×푚 is the dictionary with column vector 풅푗 being the 푗th atom, and 휶 is the coding coefficient vector. In our problem of FR, the atom 풅푗 is a training face sample (or its dimensionality-reduced feature) and hence the dictionary 퐷 is the training dataset.

We can see that the sparse coding problem in Eq. (3)

is essentially a sparsity-constrained least square estima-

tion problem. It is known that only when the residual

풆 = 풚 − 퐷휶 follows the Gaussian distribution, the least

square solution is the MLE solution. If 풆 follows the Lapla-

cian distribution, the MLE solution will be

min_휶 ∥풚 − 퐷휶∥₁   s.t.   ∥휶∥₁ ≤ 휎.   (4)

Actually Eq. (4) is essentially another expression of Eq. (2), because both of them have the following Lagrangian formulation: min_휶 {∥풚 − 퐷휶∥₁ + 휆∥휶∥₁} [26].

In practice, however, the distribution of the residual 풆 may be far from the Gaussian or Laplacian distribution, especially when there are occlusions, corruptions and/or other variations. Hence, the conventional sparse coding models in Eq. (3) (or Eq. (1)) and Eq. (4) (or Eq. (2)) may not be robust and effective enough for face image representation.

In order to construct a more robust model for sparse coding of face images, in this paper we propose to find an MLE solution of the coding coefficients. We rewrite the dictionary 퐷 as 퐷 = [풓1; 풓2; ⋯; 풓푛], where the row vector 풓푖 is the 푖th row of 퐷. Denote by 풆 = 풚 − 퐷휶 = [푒1; 푒2; ⋯; 푒푛] the coding residual. Then each element of 풆 is 푒푖 = 푦푖 − 풓푖휶, 푖 = 1, 2, ⋯, 푛. Assume that 푒1, 푒2, ⋯, 푒푛 are independently and identically distributed according to some probability density function (PDF) 푓휽(푒푖), where 휽 denotes the parameter set that characterizes the distribution. Without considering the sparsity constraint of 휶, the likelihood of the estimator is 퐿휽(푒1, 푒2, ⋯, 푒푛) = ∏ᵢ₌₁ⁿ 푓휽(푒푖), and MLE aims to maximize this likelihood function or, equivalently, minimize the objective function −ln 퐿휽 = ∑ᵢ₌₁ⁿ 휌휽(푒푖), where 휌휽(푒푖) = −ln 푓휽(푒푖). With consideration of the sparsity constraint of 휶, the MLE of 휶, namely the robust sparse coding (RSC), can be formulated as the following minimization:

min_휶 ∑ᵢ₌₁ⁿ 휌휽(푦푖 − 풓푖휶)   s.t.   ∥휶∥₁ ≤ 휎.   (5)


In general, we assume that the unknown PDF 푓휽(푒푖) is sym-

metric, and 푓휽(푒푖) < 푓휽(푒푗) if ∣푒푖∣ > ∣푒푗∣. So 휌휽(푒푖) has the following properties: 휌휽(0) is the global minimum of 휌휽(푒푖); 휌휽(푒푖) = 휌휽(−푒푖); and 휌휽(푒푖) > 휌휽(푒푗) if ∣푒푖∣ > ∣푒푗∣. Without loss of generality, we let 휌휽(0) = 0.

From Eq. (5), we can see that the proposed RSC model

is essentially a sparsity-constrained MLE problem. In other

words, it is a more general sparse coding model, while the

conventional sparse coding models in Eq. (3) and Eq. (4)

are special cases of it when the coding residual follows

Gaussian and Laplacian distributions, respectively.

By solving Eq. (5), we can get the MLE solution to 휶

with sparsity constraint. Clearly, one key problem is how

to determine the distribution 휌휽 (or 푓휽). Explicitly taking 푓휽 as a Gaussian or Laplacian distribution is simple but not effective enough. In this paper, we do not determine 휌휽 directly to solve Eq. (5). Instead, with the above-mentioned general assumptions on 휌휽, we transform the minimization problem in Eq. (5) into an iteratively reweighted sparse coding problem, and the resulting weights have a clear physical meaning, i.e., outliers will have low weight values. By iteratively computing the weights, the MLE solution of RSC can be obtained efficiently.

2.2. The distribution induced weights

Let 퐹휽(풆) = ∑ᵢ₌₁ⁿ 휌휽(푒푖). We can approximate 퐹휽(풆) by its first-order Taylor expansion in the neighborhood of 풆0: 퐹̃휽(풆) = 퐹휽(풆0) + (풆 − 풆0)ᵀ퐹′휽(풆0) + 푹1(풆), where 푹1(풆) is the high-order residual term, and 퐹′휽(풆) is the derivative of 퐹휽(풆). Denote by 휌′휽 the derivative of 휌휽; then 퐹′휽(풆0) = [휌′휽(푒0,1); 휌′휽(푒0,2); ⋯; 휌′휽(푒0,푛)], where 푒0,푖 is the 푖th element of 풆0.

In sparse coding, it is usually expected that the fidelity term is strictly convex. So we approximate the residual term as 푹1(풆) = 0.5(풆 − 풆0)ᵀ푊(풆 − 풆0), where 푊 is a diagonal matrix because the elements in 풆 are independent and there is no cross term between 푒푖 and 푒푗, 푖 ≠ 푗, in 퐹휽(풆). Since 퐹휽(풆) reaches its minimum value (i.e., 0) at 풆 = 0, we also require that 퐹̃휽(풆) has its minimum value at 풆 = 0. Letting 퐹̃′휽(0) = 0, we have the diagonal elements of 푊 as

푊푖,푖 = 휔휽(푒0,푖) = 휌′휽(푒0,푖)/푒0,푖.   (6)

According to the properties of 휌휽(푒푖), 휌′휽(푒푖) will have the same sign as 푒푖. So each 푊푖,푖 is a non-negative scalar. Then 퐹̃휽(풆) can be written as 퐹̃휽(풆) = ½∥푊^{1/2}풆∥₂² + 푏, where 푏 is a scalar value determined by 풆0. Since 풆 = 풚 − 퐷휶, the RSC model in Eq. (5) can be approximated by

min_휶 ∥푊^{1/2}(풚 − 퐷휶)∥₂²   s.t.   ∥휶∥₁ ≤ 휎,   (7)

which is clearly a weighted LASSO problem. Because the

weight matrix 푊 needs to be estimated using Eq. (6), Eq.

(7) is a local approximation of the RSC in Eq. (5) at 풆0, and

the minimization procedure of RSC can be transformed into

an iteratively reweighted sparse coding problem with 푊 be-

ing updated using the residuals in previous iteration via Eq.

(6). Each 푊푖,푖is a non-negative scalar, so the weighted

LASSO in each iteration is a convex problem, which could

be solved easily by methods such as 푙1-ls [12].
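Because 푊 is diagonal with non-negative entries, the weighted LASSO in Eq. (7) reduces to an ordinary LASSO on rescaled data, so any standard 푙1 solver can be reused. A minimal numerical check of this rescaling identity (random data for illustration only):

```python
import numpy as np

def reweight_problem(y, D, w):
    # Multiply rows by sqrt(w): a standard LASSO on (y_w, D_w) then solves
    # the weighted LASSO of Eq. (7), since W = diag(w).
    s = np.sqrt(w)
    return s * y, s[:, None] * D

rng = np.random.default_rng(2)
y = rng.standard_normal(6)
D = rng.standard_normal((6, 4))
w = rng.uniform(0.0, 1.0, size=6)             # per-pixel weights in [0, 1]
a = rng.standard_normal(4)

y_w, D_w = reweight_problem(y, D, w)
lhs = np.sum(w * (y - D @ a) ** 2)            # ||W^{1/2}(y - D a)||_2^2
rhs = np.linalg.norm(y_w - D_w @ a) ** 2      # plain l2 fidelity after rescaling
print(bool(np.isclose(lhs, rhs)))             # the two objectives coincide
```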

Since 푊 is a diagonal matrix, its element 푊푖,푖 (i.e.,

휔휽(푒푖)) is the weight assigned to each pixel of the query

image 풚. Intuitively, in FR the outlier pixels (e.g. occluded

or corrupted pixels) should have low weight values. Thus,

with Eq. (7) the determination of the distribution 휌휽 is trans-

formed into the determination of weight 푊. Considering

the logistic function has properties similar to the hinge loss function in SVM [28], we choose it as the weight function:

휔휽(푒푖) = exp(휇훿 − 휇푒푖²) / (1 + exp(휇훿 − 휇푒푖²)),   (8)

where 휇 and 훿 are positive scalars. Parameter 휇 controls the decreasing rate from 1 to 0, and 훿 controls the location of the demarcation point. With Eq. (8), Eq. (6) and 휌휽(0) = 0, we could get

휌휽(푒푖) = −(1/(2휇)) (ln(1 + exp(휇훿 − 휇푒푖²)) − ln(1 + exp(휇훿))).   (9)

The original sparse coding models in Eqs. (3) and (4) can be interpreted by Eq. (7). The model in Eq. (3) is the case of letting 휔휽(푒푖) = 2, and the model in Eq. (4) is the case of letting 휔휽(푒푖) = 1/∣푒푖∣. Compared with the models in Eqs. (3) and (4), the proposed weighted LASSO in Eq. (7) has the following advantage: outliers (usually the pixels with big residuals) will be adaptively assigned low weights to reduce their effects on the regression estimation, so that the sensitivity to outliers can be greatly reduced. The weight function of Eq. (8) is bounded in [0, 1]. Although the model in Eq. (4) also assigns low weights to outliers, its weight function is not bounded: the weights of pixels with very small residuals will have nearly infinite values, which reduces the stability of the coding process.

The convexity of the RSC model (Eq. (5)) depends on the form of 휌휽(푒푖), or equivalently the weight function 휔휽(푒푖). If we simply let 휔휽(푒푖) = 2, the RSC degenerates to the original sparse coding problem (Eq. (3)), which is convex but not effective. The RSC model is not convex with the weight function defined in Eq. (8). However, for FR, a good initialization can always be obtained, and our RSC algorithm described in the next section can always find a locally optimal solution, which has very good FR performance, as validated in the experiments in Section 4.
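The behavior of the weight function in Eq. (8) is easy to verify numerically; 휇 and 훿 below are arbitrary illustrative values rather than those estimated by the algorithm:

```python
import numpy as np

def omega(e, mu, delta):
    # Eq. (8): logistic weight, bounded in (0, 1) and decreasing in e^2.
    z = mu * delta - mu * e ** 2
    return np.exp(z) / (1.0 + np.exp(z))

mu, delta = 8.0, 1.0                     # illustrative values (see Section 4.1)
e = np.array([0.0, 0.5, 1.0, 3.0])
w = omega(e, mu, delta)
print(np.round(w, 3))                    # weight is exactly 0.5 at e^2 = delta
```

Small residuals get weights near 1 and large residuals (outliers) get weights near 0, as the text describes.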

3. Algorithm of RSC

As discussed in Section 2.2, the implementation of RSC

can be an iterative process, and in each iteration it is a con-

vex 푙1-minimization problem. In this section we propose


such an iteratively reweighted sparse coding (IRSC) algo-

rithm to solve the RSC minimization.

3.1. Iteratively reweighted sparse coding (IRSC)

Although in general the RSC model can only have a

locally optimal solution, fortunately in FR we are able to

have a very reasonable initialization to achieve good per-

formance. When a testing face image 풚 comes, in order to

initialize the weight, we should firstly estimate the coding

residual 풆 of 풚. We can initialize 풆 as 풆 = 풚 − 풚푖푛푖, where 풚푖푛푖 is some initial estimation of the true face from the observation 풚. Because we do not know which class the testing face image 풚 belongs to, a reasonable 풚푖푛푖 can be set as the mean image of all training images. In this paper, we simply compute 풚푖푛푖 as

풚푖푛푖 = 풎퐷,   (10)

where 풎퐷 is the mean image of all training samples.

With the initialized 풚푖푛푖, our algorithm to solve the

RSC model, namely Iteratively Reweighted Sparse Coding

(IRSC), is summarized in Algorithm 1.

When RSC converges, we use the same classification

strategy as in SRC [25] to classify the face image 풚.

3.2. The convergence of IRSC

The weighted sparse coding in Eq. (7) is a local ap-

proximation of RSC in Eq. (5), and in each iteration the

objective function value of Eq. (5) decreases by the IRSC

algorithm. Since the original cost function of Eq. (5) is

lower bounded (≥0), the iterative minimization procedure

in IRSC will converge.

The convergence is achieved when the difference of the weights between adjacent iterations is small enough. Specifically, we stop the iteration if the following holds:

∥푊(푡) − 푊(푡−1)∥₂ / ∥푊(푡−1)∥₂ < 훾,   (12)

where 훾 is a small positive scalar.
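The stopping rule of Eq. (12) can be sketched as follows (the weight vectors and 훾 here are illustrative):

```python
import numpy as np

def converged(w_t, w_prev, gamma=0.01):
    # Eq. (12): relative change of the diagonal weights between iterations.
    return np.linalg.norm(w_t - w_prev) / np.linalg.norm(w_prev) < gamma

w_prev = np.array([1.0, 0.9, 0.1, 0.8])
print(bool(converged(w_prev * 1.001, w_prev)))                  # tiny change: stop
print(bool(converged(np.array([1.0, 0.2, 0.9, 0.1]), w_prev)))  # large change: continue
```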

3.3. Complexity analysis

The complexity of both SRC and the proposed IRSC

mainly lies in the sparse coding process, i.e., Eq. (3) and

Eq. (7). Supposing that the dimensionality 푛 of the face feature is fixed, the complexity of the sparse coding model in Eq. (3) basically depends on the number of dictionary atoms, i.e., 푚. The empirical complexity of commonly used 푙1-regularized sparse coding methods (such as 푙1-ls [12]) to solve Eq. (3) or Eq. (7) is 푂(푚^휀) with 휀 ≈ 1.5 [12].

occlusion, SRC [25] performs sparse coding once and then

uses the residuals associated with each class to classify the

face image, while RSC needs several iterations (usually 2

iterations) to finish the coding. Thus in this case, RSC’s

complexity is higher than SRC.

Algorithm 1 Iteratively Reweighted Sparse Coding (IRSC)

Input: Normalized test sample 풚 with unit 푙2-norm, dictionary 퐷 (each column of 퐷 has unit 푙2-norm) and 풚(1)푟푒푐 initialized as 풚푖푛푖.
Output: 휶
Start from 푡 = 1:

1: Compute the residual 풆(푡) = 풚 − 풚(푡)푟푒푐.

2: Estimate the weights as

휔휽(푒푖(푡)) = exp(휇(푡)훿(푡) − 휇(푡)(푒푖(푡))²) / (1 + exp(휇(푡)훿(푡) − 휇(푡)(푒푖(푡))²)),   (11)

where 휇(푡) and 훿(푡) are parameters estimated in the 푡th iteration (please refer to Section 4.1 for their setting).

3: Sparse coding:

휶∗ = min_휶 ∥(푊(푡))^{1/2}(풚 − 퐷휶)∥₂²   s.t.   ∥휶∥₁ ≤ 휎,

where 푊(푡) is the estimated diagonal weight matrix with 푊(푡)푖,푖 = 휔휽(푒푖(푡)).

4: Update the sparse coding coefficients:
If 푡 = 1, 휶(푡) = 휶∗;
If 푡 > 1, 휶(푡) = 휶(푡−1) + 휂(푡)(휶∗ − 휶(푡−1)), where 0 < 휂(푡) < 1 is the step size, and a suitable 휂(푡) should make ∑ᵢ₌₁ⁿ 휌휽(푒푖(푡)) < ∑ᵢ₌₁ⁿ 휌휽(푒푖(푡−1)). 휂(푡) can be searched from 1 to 0 by the standard line-search process [8]. (Since both 휶(푡−1) and 휶∗ belong to the convex set 푄 = {∥휶∥₁ ≤ 휎}, 휶(푡) will also belong to 푄.)

5: Compute the reconstructed test sample: 풚(푡)푟푒푐 = 퐷휶(푡), and let 푡 = 푡 + 1.

6: Go back to step 1 until the condition of convergence (described in Section 3.2) is met, or the maximal number of iterations is reached.
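Algorithm 1 can be sketched end to end as follows. This is a toy illustration under simplifying assumptions (fixed 휇 and 훿 instead of the adaptive rule of Section 4.1, a plain ISTA solver for step 3, no step-size search in step 4, and no input normalization), not the authors' implementation:

```python
import numpy as np

def soft_threshold(x, t):
    # Proximal operator of the l1-norm.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def weighted_lasso(y, D, w, lam=1e-3, n_iter=300):
    # Step 3: solve min ||W^{1/2}(y - D a)||_2^2 + lam*||a||_1 by
    # rescaling rows with sqrt(w) and running plain ISTA.
    s = np.sqrt(w)
    yw, Dw = s * y, s[:, None] * D
    L = 2.0 * np.linalg.norm(Dw, 2) ** 2 + 1e-12  # gradient Lipschitz constant
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = soft_threshold(a - 2.0 * Dw.T @ (Dw @ a - yw) / L, lam / L)
    return a

def irsc(y, D, mu=8.0, delta=0.5, n_outer=5):
    """Toy sketch of Algorithm 1 (IRSC)."""
    y_rec = D.mean(axis=1)                 # y_ini: mean of training samples, Eq. (10)
    a = np.zeros(D.shape[1])
    for _ in range(n_outer):
        e = y - y_rec                      # step 1: coding residual
        z = mu * delta - mu * e ** 2
        w = np.exp(z) / (1.0 + np.exp(z))  # step 2: logistic weights, Eq. (11)
        a = weighted_lasso(y, D, w)        # step 3: weighted sparse coding
        y_rec = D @ a                      # step 5: reconstruction
    return a

rng = np.random.default_rng(3)
D = rng.standard_normal((40, 15))
D /= np.linalg.norm(D, axis=0)             # unit-norm atoms (training samples)
y = D[:, 4] + 0.01 * rng.standard_normal(40)
y[:5] = 1.0                                # simulate occluded pixels (outliers)
a = irsc(y, D)
print(int(np.argmax(np.abs(a))))           # atom 4 should carry the largest coefficient
```

Despite the gross corruption of the first five pixels, the down-weighting drives the code toward the correct atom, which is the behavior Section 4.3 demonstrates on real data.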

For FR with occlusion or corruption, SRC needs to use

an identity matrix to code the occluded or corrupted pix-

els, as shown in Eq. (2). In this case SRC’s complexity is

푂((푚 + 푛)^휀). Considering the fact that 푛 is often much greater than 푚 in sparse coding based FR (e.g. 푛 = 8086, 푚 = 717 in the experiments with pixel corruption and block occlusion in [25]), the complexity of SRC becomes very

high when dealing with occlusion and corruption.

The computational complexity of our proposed RSC is 푂(푘⋅푚^휀), where 푘 is the number of iterations. Note that 푘 depends on the percentage of outliers in the face image. In our experience, when there is a small percentage of outliers, RSC will converge in only two iterations; if there is a large percentage of outliers (e.g. occlusion, corruption, etc.), RSC could converge in about 10 iterations. So for FR with occlusion, the complexity of RSC is generally much lower than

SRC. In addition, in the iterations of IRSC we can delete the elements 푦푖 that have very small weights, because this implies that 푦푖 is an outlier. Thus the complexity of RSC can be further reduced (e.g., in FR with real disguise on the AR database, about 30% of the pixels could be deleted in each iteration on average).

4. Experimental Results

In this section, we perform experiments on benchmark

face databases to demonstrate the performance of RSC

(source codes accompanying this work are available at http://www.comp.polyu.edu.hk/~cslzhang/code.htm). We first discuss the parameter selection of

RSC in Section 4.1; in Section 4.2, we test RSC for FR

without occlusion on three face databases (Extended Yale

B [5, 13], AR [17], and Multi-PIE [6]). In Section 4.3,

we demonstrate the robustness of RSC to random pixel

corruption, random block occlusion and real disguise.

All the face images are cropped and aligned by using the

locations of eyes, which are provided by the face databases

(except for Multi-PIE, for which we manually locate the

positions of eyes). For all methods, the training samples

are used as the dictionary 퐷 in sparse coding.

4.1. Parameter selection

In the weight function Eq. (8), there are two parameters,

훿 and 휇, which need to be calculated in Step 2 of IRSC. 훿

is the parameter of the demarcation point: when the squared residual is larger than 훿, the weight value is less than 0.5. In

order to make the model robust to outliers, we compute the

value of 훿 as follows.

Denote by 흍 = [(푒1)², (푒2)², ⋯, (푒푛)²]. By sorting 흍 in ascending order, we get the re-ordered array 흍푎. Let 푘 = ⌊휏푛⌋, where the scalar 휏 ∈ (0, 1] and ⌊휏푛⌋ outputs the largest integer not greater than 휏푛. We set 훿 as

훿 = 흍푎(푘).   (13)
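The adaptive 훿 of Eq. (13) is simply an order statistic of the squared residuals; a minimal sketch (the residual vector here is illustrative):

```python
import numpy as np

def estimate_delta(e, tau=0.8):
    """Eq. (13): delta = psi_a(k), the k-th smallest squared residual,
    with k = floor(tau * n)."""
    psi_a = np.sort(e ** 2)            # squared residuals in ascending order
    k = int(np.floor(tau * e.size))
    return psi_a[k - 1]                # the paper indexes psi_a from 1

e = np.array([0.1, -0.2, 0.05, 1.5, -0.3, 0.15, 2.0, 0.08, -0.12, 0.25])
print(estimate_delta(e, tau=0.8))      # the two large residuals are excluded
```

With 휏 = 0.8, roughly 80% of the pixels receive a weight of at least 0.5, so a minority of gross outliers cannot inflate 훿.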

Parameter 휇 controls the decreasing rate of the weight value from 1 to 0. Here we simply let 휇 = 푐/훿, where 푐 is a constant. In the experiments, unless otherwise specified, 푐 is set as 8, and 휏 is set as 0.8 for FR without occlusion and 0.5 for FR with occlusion. In addition, in our experiments, we solve the (weighted) sparse coding (in Eq. (2), Eq. (3) or Eq. (7)) by its unconstrained Lagrangian formulation. Take Eq. (3) as an example: its Lagrangian form is min_휶 {∥풚 − 퐷휶∥₂² + 휆∥휶∥₁}, and the default value for the multiplier 휆 is 0.001.

4.2. Face recognition without occlusion

We first validate the performance of RSC in FR with variations such as illumination and expression changes but without occlusion. We compare RSC with the popular methods nearest neighbor (NN), nearest subspace (NS), linear support vector machine (SVM), and the recently developed SRC [25].

Dim   NN      NS      SVM     SRC [25]   RSC
30    66.3%   63.6%   92.4%   90.9%      91.3%
84    85.8%   94.5%   94.9%   95.5%      98.1%
150   90.0%   95.1%   96.4%   96.8%      98.4%
300   91.6%   96.0%   97.0%   98.3%      99.4%

Table 1. Face recognition rates on the Extended Yale B database without occlusion.

In the experiments, PCA (i.e., Eigenfaces [21]) is used

to reduce the dimensionality of original face features, and

the Eigenface features are used for all the competing meth-

ods. Denote by 푃 the subspace projection matrix com-

puted by applying PCA to the training data. Then in RSC,

the sparse coding in step 3 of IRSC becomes: 휶∗ = min_휶 ∥푃(푊(푡))^{1/2}(풚 − 퐷휶)∥₂²  s.t.  ∥휶∥₁ ≤ 휎.

1) Extended Yale B Database: The Extended Yale B

[5, 13] database contains about 2,414 frontal face images

of 38 individuals. We used the cropped and normalized

54×48 face images, which were taken under varying illumi-

nation conditions. We randomly split the database into two

halves. One half (about 32 images per person) was used as

the dictionary, and the other half for testing. Table 1 shows

the recognition rates versus feature dimension by NN, NS,

SVM, SRC and RSC. It can be seen that RSC achieves bet-

ter results than the other methods in all dimensions except

that RSC is slightly worse than SVM when the dimension

is 30. When the dimension is 84, RSC achieves about 3%

improvement of recognition rate over SRC. The best recog-

nition rate of RSC is 99.4%, compared to 91.6% for NN,

96.0% for NS, 97.0% for SVM, and 98.3% for SRC.

2) AR database: As in [25], a subset (with only illumina-

tion and expression changes) that contains 50 males and 50

females was chosen from the AR dataset [17]. For each subject, the seven images from Session 1 were used for training, and the other seven images from Session 2 for testing. The images were cropped to 60×43. The comparison of RSC and its competing methods is given in Table 2. Again, we can see that RSC performs much better than all the other four methods in all dimensions except that RSC is slightly worse than SRC when the dimension is 30. Nevertheless, when the

high recognition rate. On other dimensions, RSC outper-

forms SRC by about 3%. SVM does not give good results in

this experiment because there are not enough training sam-

ples (7 samples per class here) and there are high variations

between the training set and testing set. The maximal recognition rates of RSC, SRC, SVM, NS and NN are 96.0%, 93.3%, 75.4%, 76.0% and 71.3%, respectively.

Dim   NN      NS      SVM     SRC [25]   RSC
30    62.5%   66.1%   66.1%   73.5%      71.4%
54    68.0%   70.1%   69.4%   83.3%      86.8%
120   70.1%   75.4%   74.5%   90.1%      94.0%
300   71.3%   76.0%   75.4%   93.3%      96.0%

Table 2. Face recognition rates on the AR database.

         NN      NS      SVM     SRC [25]   RSC
Smi-S1   88.7%   89.6%   88.9%   93.7%      97.8%
Smi-S3   47.3%   48.8%   46.3%   60.3%      75.0%
Sur-S2   40.1%   39.6%   25.6%   51.4%      68.8%
Sqi-S2   49.6%   51.2%   47.7%   58.1%      64.6%

Table 3. Face recognition rates on the Multi-PIE database. ('Smi-S1' ('Smi-S3'): set with smile in Session 1 (3); 'Sur-S2' ('Sqi-S2'): set with surprise (squint) in Session 2.)

3) Multi PIE database: The CMU Multi-PIE database

[6] contains images of 337 subjects captured in four ses-

sions with simultaneous variations in pose, expression, and

illumination. Among these 337 subjects, all the 249 sub-

jects in Session 1 were used as training set. To make the FR

more challenging, four subsets with both illumination and

expression variations in Sessions 1, 2 and 3, were used for

testing. For the training set, as in [22] we used the 7 frontal

images with extreme illuminations {0,1,7,13,14,16,18}

and neutral expression (refer to Fig. 1(a) for examples). For

the testing set, 4 typical frontal images with illuminations

{0,2,7,13} and different expressions (smile in Sessions 1

and 3, squint and surprise in Session 2) are used (refer to

Fig. 1(b) for examples with surprise in Session 2, Fig. 1(c)

for examples with smile in Session 1, and Fig. 1(d) for ex-

amples with smile in Session 3). Here we used the Eigen-

face with dimensionality 300 as the face feature for sparse

coding. Table 3 lists the recognition rates in four testing sets

by the competing methods.

From Table 3, we can see that RSC achieves the best

performance in all tests, and SRC performs the second best.

In addition, all the methods achieve their best results when

Smi-S1 is used for testing because the training set is also

from Session 1. The highest recognition rate of RSC on

Smi-S1 is 97.8%, more than 4% improvement over SRC.

From testing set Smi-S1 to set Smi-S3, the variations in-

crease because of the longer data acquisition time interval

(refer to Fig. 1(c) and Fig. 1(d)). The recognition rate of

RSC drops by 22.8%, while those of NN, NS, SVM and

SRC drop by 41.4%, 40.8%, 42.6% and 33.4%. This vali-

dates that RSC is much more robust to face variations than

the other methods. For the testing sets Sur-S2 and Sqi-S2,


Figure 1. A subject in Multi-PIE database. (a) Training samples

with only illumination variations. (b) Testing samples with sur-

prise expression and illumination variations. (c) and (d) show the

testing samples with smile expression and illumination variations

in Session 1 and Session 3, respectively.


Figure 2. The recognition rate curves of RSC and SRC versus dif-

ferent percentage of corruption.

RSC’s recognition rates are 17.4% and 6.5% higher than

those of SRC, respectively. Meanwhile, we could also see

that FR with surprise expression change is much more dif-

ficult than FR with the other two expression changes.

4.3. Face recognition with occlusion

One of the most interesting features of sparse coding

based FR in [25] is its robustness to face occlusion by

adding an occlusion dictionary (an identity matrix). Thus,

in this subsection we test the robustness of RSC to different

kinds of occlusions, such as random pixel corruption, ran-

dom block occlusion and real disguise. In the experiments

of random corruption, we compare our proposed RSC with

SRC [25]. In the experiments of block occlusion and real

disguise, we compare RSC with SRC and the recently de-

veloped Gabor-SRC (GSRC) [27].

1) FR with pixel corruption: To be identical to the ex-

perimental settings in [25], we used Subsets 1 and 2 (717

images, normal-to-moderate lighting conditions) of the Ex-

tended Yale B database for training, and used Subset 3 (453

images, more extreme lighting conditions) for testing. The

images were resized to 96×84 pixels. For each testing im-

age, we replaced a certain percentage of its pixels by uni-

formly distributed random values within [0,255]. The cor-

rupted pixels were randomly chosen for each test image and

the locations are unknown to the algorithm.
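This corruption protocol can be sketched as follows (the image content and the corruption fraction are illustrative):

```python
import numpy as np

def corrupt(img, fraction, rng):
    """Replace `fraction` of the pixels with uniform random values in [0, 255]."""
    out = img.astype(np.float64).copy()
    idx = rng.choice(out.size, size=int(fraction * out.size), replace=False)
    out.flat[idx] = rng.uniform(0, 255, size=idx.size)  # locations stay unknown
    return out

rng = np.random.default_rng(0)
img = np.full((96, 84), 128.0)      # stand-in for a 96x84 face image
bad = corrupt(img, 0.5, rng)
changed = float(np.mean(bad != img))
print(round(changed, 2))            # close to the requested fraction
```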

Fig. 2 shows the results of RSC and SRC under the per-

centage of corrupted pixels from 0% to 90%. It can be seen


Occlusion   SRC [25]   GSRC [27]   RSC
0%          1          1           1
10%         1          1           1
20%         0.998      1           1
30%         0.985      0.998       0.998
40%         0.903      0.965       0.969
50%         0.653      0.874       0.839

Table 4. The recognition rates of RSC, SRC and GSRC under different levels of block occlusion.


Figure 3. An example of face recognition with disguise using RSC. (a) A test image with sunglasses. (b) The initialized weight map (binarized). (c) The weight map (binarized) when IRSC converges. (d) A template image of the identified subject. (e) The convergence curve of IRSC. (f) The residuals of each class by RSC.

that when the percentage of corrupted pixels is between

10% and 60%, both RSC and SRC correctly classify all the

testing images. However, when the percentage of corrupted

pixels is more than 60%, the advantage of RSC over SRC

is clear. Especially, RSC can still have a recognition rate of

98.9% when 80% pixels are corrupted, while SRC only has

a recognition rate of 37.5%.

2) FR with block occlusion: In this part, we test the ro-

bustness of RSC to block occlusion. We also used the same

experimental settings as in [25], i.e. Subsets 1 and 2 of Ex-

tended Yale B for training and Subset 3 for testing. The

images were resized to 96×84. Here we set 휏 = 0.7. Table

4 lists the results of RSC, SRC and GSRC. We see that RSC

achieves much higher recognition rates than SRC when the occlusion percentage is larger than 30% (more than 18% (6%) improvement at 50% (40%) occlusion). Compared to GSRC, RSC still achieves competitive results without using Gabor features.

3) FR with real face disguise: A subset from the AR

database is used in this experiment. This subset consists

of 2,599 images from 100 subjects (about 26 samples per

class), 50 males and 50 females. We do two tests: one fol-

lows the experimental setting in [25], while the other one is

more challenging. The images were resized to 42×30. (For

simplicity, we let 훿 = 120, 휇 = 0.1, 휆 = 100, and did not

normalize the face images to have unit 푙2-norm).

In the first test, 799 images (about 8 samples per subject) of non-occluded frontal views with various facial expressions in Sessions 1 and 2 were used for training, while two separate subsets of 200 images each (one with sunglasses and one with a scarf; 1 sample per subject per session, with neutral expression) were used for testing. Fig. 3 illustrates the classification process of RSC with an example. Fig. 3(a) shows a test image with sunglasses; Figs. 3(b) and 3(c) show the initialized and converged weight maps (binarized for better illustration), respectively; Fig. 3(d) shows a template image of the identified subject. The convergence of the IRSC process is shown in Fig. 3(e), and Fig. 3(f) plots the residuals of each class. The detailed FR results of RSC, SRC and GSRC are listed in Table 5. RSC achieves a recognition rate of 99% on the sunglasses set, 6% and 12% higher than GSRC and SRC, respectively, while on the scarf set the improvement is even larger (18% and 37.5% higher than GSRC and SRC, respectively).
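The weight maps of Figs. 3(b)-(c) can be illustrated with a small sketch, assuming the logistic form w(e) = 1/(1 + exp(μ(e² − δ))) for the RSC weight function with the δ = 120, μ = 0.1 quoted above; the residual values below are synthetic, chosen only to contrast well-coded pixels with occluded ones.

```python
import numpy as np

def rsc_weight(residual, mu=0.1, delta=120.0):
    """Logistic weight: near 1 where the squared residual is well
    below delta (inlier pixels), near 0 for large residuals (outliers)."""
    z = np.clip(mu * (delta - residual**2), -50.0, 50.0)  # clip to avoid overflow
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic residuals: small on well-coded pixels, large on occluded ones.
inlier_w = rsc_weight(np.array([0.0, 2.0, 5.0]))     # e^2 << delta -> w near 1
outlier_w = rsc_weight(np.array([20.0, 50.0]))       # e^2 >> delta -> w near 0
binarized = rsc_weight(np.array([1.0, 30.0])) > 0.5  # cf. the maps in Fig. 3
```

Thresholding the weights at 0.5, as in the last line, produces the kind of binarized occlusion mask shown in Figs. 3(b)-(c), where the sunglasses region is driven to zero weight as IRSC converges.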

In the second test, we conduct FR with more complex disguise (disguise with variations of illumination and a longer data acquisition interval). 400 images (4 neutral images with different illuminations per subject) of non-occluded frontal views in Session 1 were used for training, while the disguised images (3 images with various illuminations and sunglasses or scarf per subject per session) in Sessions 1 and 2 were used for testing. Table 6 shows the results of RSC, GSRC and SRC. Clearly, RSC achieves the best results in all cases. Compared to SRC, RSC improves greatly on the testing sets with scarf, by about 60% in each session. Compared to GSRC, over 6% improvement is achieved by RSC for scarf disguise. For the testing set with sunglasses in Session 2, the recognition rate of RSC is about 35% higher than that of GSRC. Surprisingly, GSRC has lower recognition rates than SRC on the testing sets with sunglasses. This is possibly because Gabor analysis needs relatively high-resolution images and favors regions that are rich in local features, e.g., the eyes. In addition, the average drop of RSC's recognition rate from Session 1 to Session 2 is about 16%, compared to 25% for SRC and 30% for GSRC. We also compared the running times of SRC and RSC (both performing sparse coding by l1-ls [12] in Matlab on a machine with a 3.16 GHz CPU and 3.25 GB RAM), which are 60.08s (SRC) and 21.30s (RSC) on average, validating that RSC has lower computational cost than SRC in this case.
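The IRSC pipeline used throughout these experiments can be sketched as follows: alternate between weighted sparse coding and residual-driven weight updates, then assign the class with the smallest weighted reconstruction residual. Everything below is illustrative rather than the paper's implementation: the data are synthetic, a plain ISTA loop stands in for the l1-ls solver [12], the function names are hypothetical, and μ and the percentile-based δ are ad hoc choices for this toy scale (not the δ = 120, μ = 0.1 of the AR experiment).

```python
import numpy as np

def soft(x, t):  # soft-thresholding operator
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def weighted_l1_coding(D, y, w, lam=0.01, n_iter=500):
    """min_a 0.5*||W^(1/2)(y - D a)||_2^2 + lam*||a||_1 via plain ISTA
    (a stand-in for the l1-ls solver used in the paper)."""
    sw = np.sqrt(w)
    Dw, yw = D * sw[:, None], y * sw
    step = 1.0 / (np.linalg.norm(Dw, 2) ** 2 + 1e-12)  # 1 / Lipschitz const.
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = soft(a - step * (Dw.T @ (Dw @ a - yw)), step * lam)
    return a

def irsc_classify(D, y, labels, mu=0.1, n_outer=4):
    """Iteratively reweighted sparse coding, then nearest class
    by weighted reconstruction residual."""
    w = np.ones(len(y))
    for _ in range(n_outer):
        a = weighted_l1_coding(D, y, w)
        e2 = (y - D @ a) ** 2
        delta = np.percentile(e2, 70)  # ad hoc outlier threshold for toy data
        w = 1.0 / (1.0 + np.exp(np.clip(mu * (e2 - delta), -50, 50)))
    residuals = [np.linalg.norm(np.sqrt(w) * (y - D[:, labels == c] @ a[labels == c]))
                 for c in np.unique(labels)]
    return int(np.argmin(residuals))

# Toy problem: y is a combination of class-0 atoms plus 15% gross outliers.
rng = np.random.default_rng(1)
D = rng.standard_normal((100, 20))
D /= np.linalg.norm(D, axis=0)        # unit-norm atoms
labels = np.repeat([0, 1], 10)        # two classes, 10 atoms each
y = D[:, :10] @ rng.uniform(1.0, 2.0, 10)
y[rng.choice(100, 15, replace=False)] += rng.uniform(5.0, 10.0, 15)
pred = irsc_classify(D, y, labels)
```

Because the outlier pixels are driven toward zero weight over the outer iterations, the weighted residual of the correct class stays small while the wrong class must also explain the clean signal, so the sketch recovers class 0 despite the gross corruption.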

5. Conclusion

This paper presented a novel robust sparse coding (RSC) model and an effective iteratively reweighted sparse coding (IRSC) algorithm to solve it. One important advantage of RSC is its robustness to various types of outliers (e.g., occlusion, corruption and expression changes) because RSC seeks



Algorithms    Sunglasses    Scarf
SRC [25]      87.0%         59.5%
GSRC [27]     93%           79%
RSC           99%           97%

Table 5. Recognition rates of RSC, GSRC and SRC on the AR database with disguise occlusion.

Algorithms    sg-1     sc-1     sg-2     sc-2
SRC [25]      89.3%    32.3%    57.3%    12.7%
GSRC [27]     87.3%    85%      45%      66%
RSC           94.7%    91.0%    80.3%    72.7%

Table 6. Recognition rates of RSC, GSRC and SRC on the AR database with sunglasses (sg-X) or scarf (sc-X) in Session X.

an MLE (maximum likelihood estimation) solution of the sparse coding problem. Its associated IRSC algorithm is essentially a sparsity-constrained robust regression process. We evaluated the proposed method under different conditions, including variations of illumination, expression, occlusion and corruption, as well as combinations of them. The extensive experimental results clearly demonstrated that RSC significantly outperforms previous methods, such as SRC and GSRC, while its computational cost is comparable to or lower than that of SRC.

References

[1] R. Andersen. Modern Methods for Robust Regression. Series: Quantitative Applications in the Social Sciences. SAGE Publications, 2008.
[2] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman. Eigenfaces vs. fisherfaces: recognition using class specific linear projection. IEEE PAMI, 19(7):711–720, 1997.
[3] D. Donoho. For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. Comm. Pure and Applied Math., 59(6):797–829, 2006.
[4] S. H. Gao, I. W. H. Tsang, L. T. Chia, and P. L. Zhao. Local features are not lonely: Laplacian sparse coding for image classification. In CVPR, 2010.
[5] A. Georghiades, P. Belhumeur, and D. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE PAMI, 23(6):643–660, 2001.
[6] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image and Vision Computing, 28:807–813, 2010.
[7] B. Heisele, P. Ho, and T. Poggio. Face recognition with support vector machines: Global versus component-based approach. In ICCV, 2001.
[8] J. Hiriart-Urruty and C. Lemarechal. Convex Analysis and Minimization Algorithms. Springer-Verlag, 1996.
[9] J. Z. Huang, X. L. Huang, and D. Metaxas. Simultaneous image transformation and sparse representation recovery. In CVPR, 2008.
[10] P. J. Huber. Robust regression: Asymptotics, conjectures and Monte Carlo. Ann. Stat., 1(5):799–821, 1973.
[11] S. H. Ji, Y. Xue, and L. Carin. Bayesian compressive sensing. IEEE SP, 56(6):2346–2356, 2008.
[12] S. J. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky. An interior-point method for large-scale l1-regularized least squares. IEEE Journal of Selected Topics in Signal Processing, 1(4):606–617, 2007.
[13] K. Lee, J. Ho, and D. Kriegman. Acquiring linear subspaces for face recognition under variable lighting. IEEE PAMI, 27(5):684–698, 2005.
[14] Y. N. Liu, F. Wu, Z. H. Zhang, Y. T. Zhuang, and S. C. Yan. Sparse representation using nonnegative curds and whey. In CVPR, 2010.
[15] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Supervised dictionary learning. In NIPS, 2009.
[16] J. Mairal, M. Elad, and G. Sapiro. Sparse representation for color image restoration. IEEE IP, 17(1):53–69, 2008.
[17] A. Martinez and R. Benavente. The AR face database. Technical Report 24, CVC, 1998.
[18] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.
[19] I. Ramirez and G. Sapiro. Universal sparse modeling. Technical report, arXiv:1003.2941v1 [cs.IT], University of Minnesota, 2010.
[20] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B, 58(1):267–288, 1996.
[21] M. Turk and A. Pentland. Eigenfaces for recognition. J. Cognitive Neuroscience, 3(1):71–86, 1991.
[22] A. Wagner, J. Wright, A. Ganesh, Z. H. Zhou, and Y. Ma. Towards a practical face recognition system: Robust registration and illumination by sparse representation. In CVPR, 2009.
[23] J. J. Wang, J. C. Yang, K. Yu, F. J. Lv, T. Huang, and Y. H. Gong. Locality-constrained linear coding for image classification. In CVPR, 2010.
[24] J. Wright and Y. Ma. Dense error correction via l1-minimization. IEEE Transactions on Information Theory, 56(7):3540–3560, 2010.
[25] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE PAMI, 31(2):210–227, 2009.
[26] J. Yang and Y. Zhang. Alternating direction algorithms for l1-problems in compressive sensing. Technical report, Rice University, 2009.
[27] M. Yang and L. Zhang. Gabor feature based sparse representation for face recognition with Gabor occlusion dictionary. In ECCV, 2010.
[28] J. Zhang, R. Jin, Y. M. Yang, and A. G. Hauptmann. Modified logistic regression: An approximation to SVM and its applications in large-scale text categorization. In ICML, 2003.
[29] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld. Face recognition: A literature survey. ACM Computing Surveys, 35(4):399–458, 2003.
