A Pure L1-norm Principal Component Analysis
JP Brooks¹, JH Dulá², and EL Boone¹
¹Department of Statistical Sciences and Operations Research, Virginia Commonwealth University,
Richmond, VA 23284, jpbrooks@vcu.edu
²Department of Management, Virginia Commonwealth University, Richmond, VA 23284
July 2, 2010
1 Abstract
The L1 norm has been applied in numerous variations of principal component analysis (PCA). L1-norm
PCA is an attractive alternative to traditional L2-based PCA because it can impart robustness in
the presence of outliers and is indicated for models where standard Gaussian assumptions about the
noise may not apply. Of all the previously proposed PCA schemes that recast PCA as an optimization
problem involving the L1 norm, none provides globally optimal solutions in polynomial time. This
paper proposes an L1-norm PCA procedure based on the efficient calculation of the optimal solution
of the L1-norm best-fit hyperplane problem. We present a procedure called L1-PCA∗, based on the
application of this idea, that fits data to hyperplanes of successively smaller dimension, the
orthogonal vectors of which are successive directions of minimum variation. The procedure is
implemented and tested on a diverse problem suite. Our tests establish conditions for which
L1-PCA∗ is suitable.
2 Keywords
principal component analysis, linear programming, L1 regression
3 Introduction
Principal component analysis (PCA) is a data analysis technique with various uses including di-
mensionality reduction, quality control, extraction of interpretable derived variables, and outlier
detection [15]. PCA can be viewed in several different but equivalent ways: as the decomposition
of a covariance matrix based on its eigenvectors and eigenvalues, as a method for finding successive
directions of maximum variation in data, and as a method for linear subspace estimation [15]. The
last view is the one taken in this work.
Traditional PCA, hereafter referred to as L2-PCA, is based on the L2 norm. Procedures based on
this norm are sensitive to outlier observations. This sensitivity is the principal reason for
exploring alternative norms. Procedures for PCA that involve the L1 norm have been developed to
increase robustness [12, 4, 16, 1, 7, 10, 13, 18]. Galpin and Hawkins [12] develop a robust L1
covariance estimation procedure. Others have considered robust measures of dispersion for finding
directions of maximum variation [8, 7, 18]. The approaches in [8] and [7] are based on the
projection pursuit method introduced in [19].
Several previous works involve the L1 norm in subspace estimation with PCA. Baccini et al. [4] and
Ke and Kanade [16] consider the problem of finding a subspace such that the sum of L1 distances of
points to the subspace is minimized, and propose heuristic schemes that approximate the subspace.
In light of the camera resectioning problem from computer vision, Agarwal et al. [1] formulate
the problem of L1 projection as a fractional program and give a branch-and-bound algorithm for
finding solutions. Gao [13] proposes a probabilistic Bayesian approach to estimating the best L1
subspace under the assumption that the noise follows a Laplacian distribution.
Measuring error for PCA using the L1 norm can impart desirable properties besides providing
robustness to outlier observations. For example, the L1 norm is the indicated measure in a noise
model where the Laplace distribution defines the error [1, 13]. In special applications such as
cellular automata models, where translations can only occur along unit directions, the fit of a
subspace can be measured using the L1 norm [5, 17].
In this paper, we propose a new approach for robust PCA, L1-PCA∗, based on a “pure” application
of the L1 norm in the sense that it uses globally optimal hyperplanes that minimize the sum of L1
distances of points to their projections. L1-PCA∗ generates a sequence of hyperplanes whose normal
vectors are successive directions of minimum variation in the data. The procedure makes use of a
polynomial-time algorithm for projecting m-dimensional data into an L1 best-fit
(m − 1)-dimensional subspace. The algorithm is based on results concerning properties of L1
projection and best-fit hyperplanes [25, 21, 20, 5]. The provable optimality of the projected
subspace ensures that several interesting properties are inherited. The polynomiality of the
algorithm makes it practical. We experimentally establish conditions under which the proposed
procedure is preferred over L2-PCA and another previously investigated scheme for robust PCA.
4 Geometry of the L1 norm and best-fit hyperplanes
The projection of a point $x \in \mathbb{R}^k$ on a set S is the set of points P ⊆ S whose
distance to x is the smallest among all points in S. Distance is usually measured using the L2
norm. The L2 norm of a vector x is
\[ \|x\|_2 = \sqrt{\sum_{i=1}^{k} x_i^2}. \]
When S is a subspace, the L2 projection of x is a unique point and the direction from P to x is
orthogonal to S.
The L1 norm of a vector x is
\[ \|x\|_1 = \sum_{i=1}^{k} |x_i|. \]
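As a quick numerical check of the two definitions, the following minimal Python sketch evaluates
both norms for a small vector. The function names `l1_norm` and `l2_norm` are illustrative and are
not taken from the paper.

```python
import math

def l1_norm(x):
    """Sum of the absolute values of the coordinates."""
    return sum(abs(v) for v in x)

def l2_norm(x):
    """Square root of the sum of the squared coordinates."""
    return math.sqrt(sum(v * v for v in x))

x = [3.0, -4.0]
print(l1_norm(x))  # 7.0
print(l2_norm(x))  # 5.0
```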
Using the L1 norm for measuring distance results in different projections. Figure 1 illustrates
this difference for the case when $x \in \mathbb{R}^2$ and S is a line. The figure represents the
level sets using the L1 and L2 norms. The L2 norm level sets in two dimensions are circles; the L1
norm level sets are diamonds. Notice that the L1 projection occurs along a horizontal direction.

Figure 1: Level sets for the L1 and L2 norms.

The property that projections occur along a single unit direction generalizes to multiple
dimensions [20, 5]. Further, the direction of a projection depends only on the orientation of the
subspace and not on the location of the point [20]. These two properties lead to the following
result about L1 best-fit hyperplanes; that is, hyperplanes for which the sum of L1 distances of
points to their projections is minimized.
Proposition 1. Given a set of points $x_i \in \mathbb{R}^k$, i = 1,...,n, the projections into an
L1 best-fit (k − 1)-dimensional subspace occur along the same unit direction for all of the points.
Proof. See [20].
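The projection behaviour described above can be sketched in a few lines. The sketch relies on a
standard fact that is not stated in this paper and is taken here as an assumption: the L1 distance
from a point x to the hyperplane {y : β^T y = 0} equals |β^T x| / max_j |β_j|, and it is attained
by changing only a coordinate j with the largest |β_j|. The function name `l1_project` is
hypothetical.

```python
def l1_project(x, beta):
    """Project x onto {y : beta . y = 0} under the L1 norm (assumes beta is not all zeros).

    Only the coordinate j with the largest |beta_j| is changed, so the
    direction of projection depends on beta alone, not on x.
    Returns (projected point, index j, L1 distance).
    """
    j = max(range(len(beta)), key=lambda i: abs(beta[i]))
    r = sum(b * v for b, v in zip(beta, x))   # signed residual beta . x
    y = list(x)
    y[j] = x[j] - r / beta[j]                 # move along the unit direction e_j
    return y, j, abs(r / beta[j])

beta = [2.0, -1.0]            # the line 2*x1 - x2 = 0 in the plane
for point in ([1.0, 5.0], [-3.0, 0.0]):
    y, j, dist = l1_project(point, beta)
    print(point, "->", y, "along axis", j, "distance", dist)
```

Both points move along the first axis only, which illustrates that the direction of projection is
determined by β and not by the location of the point.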
Proposition 1 implies that an L1 best-fit hyperplane is found by computing the k hyperplanes that
minimize the errors along each of the k dimensions and selecting the one with the smallest error.
Identifying the hyperplane that minimizes the errors along a given dimension is the well-known L1
linear regression problem [23] where the dependent variable corresponds to the dimension along
which measurements are made.
Therefore, a best-fit hyperplane is found by computing k L1 linear regressions, where each variable
takes a turn as the dependent variable, and selecting the regression hyperplane with the smallest
total error. L1 linear regression is efficiently solved using linear programming (LP) [6, 26]. An
LP formulation for L1 linear regression is used in the following theorem.
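Theorem 1. Given a data matrix $X \in \mathbb{R}^{n \times k}$, an L1-norm best-fit hyperplane is
given by $\{x \in \mathbb{R}^k \mid (\beta^*)^T x = 0\}$, where $\beta^*$ solves
\[
\min_{\beta;\; y_i,\, i=1,\dots,n} \; \sum_{i=1}^{n} \|x_i - y_i\|_1
\quad \text{subject to} \quad \beta^T y_i = 0, \quad i = 1,\dots,n. \tag{1}
\]
A solution to the program in (1) can be found by identifying $j^*$ where
\[
j^* = \operatorname*{argmin}_{j=1,\dots,k} R_j(X), \qquad
R_j(X) = \min_{\beta,\, e^+,\, e^-} \; \sum_{i=1}^{n} \left(e_i^+ + e_i^-\right) \tag{2}
\]
subject to
\[
\begin{aligned}
\beta^T x_i + e_i^+ - e_i^- &= 0, & i &= 1,\dots,n,\\
\beta_j &= -1, & &\\
e_i^+,\, e_i^- &\geq 0, & i &= 1,\dots,n.
\end{aligned}
\]
The coefficients of the best-fit hyperplane, $\beta^*$, are in the vector β that generates
$R_{j^*}(X)$.
Proof. See [5].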
Expression (1) in Theorem 1 is the mathematical program formulated to identify the L1 best-fit
hyperplane to the points in a data matrix X. The problem has a piecewise linear objective function
and nonconvex constraints. Each of the k regressions in Expression (2) generates a hyperplane, and
the hyperplane for which the total error is smallest provides a globally optimal solution.
The fitted hyperplane is an L1 regression hyperplane and therefore inherits the following
well-known properties [3, 1]:
1. At least k − 1 of the points lie in the fitted hyperplane;
2. The projection is maximum likelihood for errors following a joint distribution of k
independent, identically distributed Laplace random variables.
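For very small data sets, Theorem 1 can be illustrated without an LP solver. The sketch below is
not the authors' implementation, which solves the regressions in (2) as linear programs. It swaps
in a brute-force search that relies on property 1 above: each candidate β is obtained by forcing
k − 1 chosen points to lie in the hyperplane with β_j = −1, and the candidate with the smallest
total error is kept. All function names are hypothetical, and the cost grows combinatorially with
n, so this is only a sanity check.

```python
from itertools import combinations

def solve(A, b):
    """Solve the square system A z = b by Gauss-Jordan elimination with partial pivoting.
    Returns None if A is (numerically) singular."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-12:
            return None
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * piv for a, piv in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def l1_regression(X, j):
    """Minimize sum_i |beta . x_i| subject to beta_j = -1 by enumerating (k-1)-point subsets."""
    k = len(X[0])
    others = [c for c in range(k) if c != j]
    best_err, best_beta = float("inf"), None
    for subset in combinations(X, k - 1):
        coef = solve([[p[c] for c in others] for p in subset], [p[j] for p in subset])
        if coef is None:
            continue
        beta = [0.0] * k
        beta[j] = -1.0
        for c, v in zip(others, coef):
            beta[c] = v
        err = sum(abs(sum(b * v for b, v in zip(beta, p))) for p in X)
        if err < best_err:
            best_err, best_beta = err, beta
    return best_err, best_beta

def l1_best_fit_hyperplane(X):
    """Return (j_star, beta_star, total L1 error) over the k candidate regressions."""
    results = [(l1_regression(X, j), j) for j in range(len(X[0]))]
    (err, beta), j_star = min(results, key=lambda t: t[0][0])
    return j_star, beta, err

# Four points on the line x2 = 2*x1 plus one outlier (toy data, not from the paper).
X = [[1, 2], [2, 4], [-1, -2], [3, 6], [1, 5]]
print(l1_best_fit_hyperplane(X))
```

In this toy example the regression with the first coordinate as the dependent variable has the
smaller total error, so the projections occur along the first axis.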
In the development below, the normal vector of a best-fit hyperplane for points in a k-dimensional
space is given by $\beta^k$. We will employ the notation $(I^{j^*})^k$ to denote the k × k identity
matrix modified so that row $j^*$ has entries
\[
(I^{j^*}_{j^*\ell})^k = \beta^k_\ell / \|\beta^k\|_2, \quad \ell \neq j^*, \qquad
(I^{j^*}_{j^*j^*})^k = 0.
\]
Theorem 1 provides the elements for a new PCA procedure, L1-PCA∗, based on globally optimal
L1 projections.
5 The L1-PCA∗ Algorithm
Theorem 1 motivates Algorithm L1-PCA∗ where points are iteratively projected down from the
initial space $\mathbb{R}^m$ of the data, to an (m − 1)-dimensional subspace, then to an
(m − 2)-dimensional subspace, and so on. This approach corresponds to finding the successive
directions of least variation based on the L1 norm. These directions are given by the vectors
orthogonal to each subspace. The pseudocode for the L1-PCA∗ algorithm is given next.
Algorithm L1-PCA∗
Given a data matrix $X \in \mathbb{R}^{n \times m}$ with full column rank.
1: Set $X^m = X$; set $V^{m+1} = I$; set $(I^{j^*})^{m+1} = I$.
2: for (k = m; k > 1; k = k − 1) do
3:    Set $j^* = \operatorname{argmin}_j R_j(X^k)$ and $\beta^k = \beta^*$ according to Theorem 1.
4:    Set $Z^k = (X^k)((I^{j^*})^k)^T$.
5:    Calculate the singular value decomposition of $Z^k$, $Z^k = U \Lambda V^T$, and set $V^k$ to
      be equal to the k − 1 columns of V corresponding to the largest values in the diagonal
      matrix Λ.
6:    Calculate the kth principal component
      $\alpha^k = \left(\prod_{\ell=m+1}^{k+1} V^\ell\right) \beta^k / \|\beta^k\|_2$.
7:    Set $X^{k-1} = Z^k V^k$.
8: end for
9: Set $\alpha^1 = \prod_{\ell=m+1}^{2} V^\ell$.
At each iteration k, the rows of $X^k \in \mathbb{R}^{n \times k}$ represent the projected
observations in the k-dimensional subspace given by
$S^k = \{x \in \mathbb{R}^k \mid x = V^{k+1} y \text{ for some } y \in \mathbb{R}^{k+1}\}$, where
$V^{k+1} \in \mathbb{R}^{(k+1) \times k}$. The projection into the best (k − 1)-dimensional
subspace is determined by finding the best of k L1 regressions by solving (2) (Step 3). A
coordinate system for the (k − 1)-dimensional subspace must be defined (Step 5) so that Theorem 1
can be applied again to find the projection into a (k − 2)-dimensional space. The kth direction of
least variation is stored as $\alpha^k$ (Step 6). The projected points are then expressed in terms
of the new coordinate system (Step 7). The first principal component corresponds to the single
column of $V^2$ and is orthogonal to $\alpha^2,\dots,\alpha^m$ (Step 9).
Notes on L1-PCA∗
1. The algorithm generates a sequence of matrices $X^m,\dots,X^1$. Each of these matrices contains
n rows, each row corresponding to a point. All of the points in $X^k$ are in a k-dimensional
space. The successive subspaces where the points in $X^k$ reside are mutually orthogonal.
2. The sequence of $\alpha^k$ vectors in Step 6 represents the principal components. The vector
$\alpha^k$ is orthogonal to the subspace
$\{x \in \mathbb{R}^m \mid x = V^k y \text{ for some } y \in \mathbb{R}^k\}$.
3. In Step 5 we are defining a new coordinate system for the projected points by singular value
decomposition. In fact, any k − 1 vectors spanning the projected points can form the columns
of $V^k$. Different choices for this set will lead to different projections at successive
iterations because the L1 norm is not rotationally invariant [10].
4. The solution of linear programs is the most computationally intensive step in each iteration.
A total of $\sum_{k=1}^{m} k = \frac{m(m+1)}{2}$ linear programs are solved. Each linear program
has 2n + k variables and n constraints. The algorithm has a worst-case running time of
\[ O\left(\sum_{k=1}^{m} k\, P(2n + k, n)\right), \]
where P(r,s) is the complexity of solving a linear program with r variables and s constraints.
Since the complexity of linear programming is polynomial, the complexity of L1-PCA∗ is
polynomial.
5. The algorithm produces an L1 best-fit hyperplane at each iteration. Accordingly, this
procedure performs well for outlier-contaminated data so long as the accrued effect of the L1
distances of the outliers does not force the optimal solution to fit them directly. As testing
shows, L1-PCA∗ performs well in the presence of outliers if they are reasonably well-contained.
The following three-dimensional example will help to illustrate L1-PCA∗. Consider the data matrix
\[
(X^3)^T = \begin{bmatrix}
-1.17 & 0.53 & -1.02 & 1.12 & 2.08 & -1.61 & 1.17 & 2.00 & 3.00 & 3.00 \\
1.20 & 0.24 & 0.40 & 1.36 & -1.82 & 0.53 & -1.52 & -1.03 & -2.00 & 3.00 \\
-0.30 & -1.00 & 1.11 & -1.69 & -0.76 & 0.99 & 0.71 & -1.44 & -1.00 & 3.00
\end{bmatrix}.
\]
These points are displayed in Figure 2(a). The L1 best-fit plane obtained in Step 3 by applying
Theorem 1 and after solving three LPs is defined by $(\beta^3)^T x = 0$ where
$\beta^3 = (-0.80, -1.00, -0.39)$. The plane can be seen in Figure 2(b). This plane minimizes the
total sum of L1 distances of the points to their projections at 9.75. Notice that all projections
occur along $j^* = 2$ (y-axis). The score matrix $Z^3$ from Steps 4 and 5 is
\[
(Z^3)^T = \begin{bmatrix}
-1.17 & 0.53 & -1.02 & 1.12 & 2.08 & -1.61 & 1.17 & 2.00 & 3.00 & 3.00 \\
1.05 & -0.03 & 0.38 & -0.22 & -1.36 & 0.90 & -1.21 & -1.03 & -2.00 & -3.58 \\
-0.30 & -1.00 & 1.11 & -1.69 & -0.76 & 0.99 & 0.71 & -1.44 & -1.00 & 3.00
\end{bmatrix}.
\]
Notice that only the second component of each point changes. The singular value decomposition
of $Z^3$ is $U \Lambda V^T$, where
\[
U^T = \begin{bmatrix}
-0.212 & 0.051 & -0.130 & 0.123 & 0.325 & -0.237 & 0.228 & 0.285 & 0.474 & 0.633 \\
0.052 & 0.232 & -0.264 & 0.397 & 0.200 & -0.245 & -0.144 & 0.352 & 0.266 & -0.633 \\
0.858 & -0.267 & -0.021 & 0.124 & -0.092 & 0.200 & -0.084 & 0.198 & 0.225 & 0.175
\end{bmatrix},
\]
\[
\Lambda = \begin{bmatrix}
7.46 & 0.00 & 0.00 \\
0.00 & 4.59 & 0.00 \\
0.00 & 0.00 & 0.00
\end{bmatrix}, \qquad
V = \begin{bmatrix}
0.77 & 0.22 & 0.59 \\
-0.63 & 0.20 & 0.75 \\
0.05 & -0.95 & 0.29
\end{bmatrix}.
\]
The first two columns of V span the plane and comprise the 3 by 2 matrix $V^3$. The third principal
component is $\alpha^3 = (-0.59, -0.75, -0.29)$ using the formula in Step 6. Notice that
$\alpha^3$ is orthogonal to the best-fit hyperplane, and that for this first iteration, $\alpha^3$
points in the same direction as $\beta^3$.
The next iteration will use the data set $X^2 = Z^3 V^3$ which is calculated in Step 7 and
corresponds to a re-orientation of the axes so that the plane becomes the two-dimensional space
where the data resides:
\[
(X^2)^T = \begin{bmatrix}
-1.58 & 0.38 & -0.97 & 0.92 & 2.43 & -1.77 & 1.70 & 2.13 & 3.54 & 4.73 \\
0.24 & 1.07 & -1.21 & 1.82 & 0.92 & -1.13 & -0.66 & 1.61 & 1.22 & -2.91
\end{bmatrix}.
\]
Figure 2(c) is a plot of the data points in the matrix $X^2$, embedded in the two-dimensional
best-fit hyperplane. The figure also shows the resulting best-fit hyperplane found in the next
iteration in terms of the new axes. The line in the $v_1$–$v_2$ plane perpendicular to the line
labeled $\beta^2$ is the L1 best-fit hyperplane for the two-dimensional problem solved in the next
iteration and represents the projection of the first principal component in this plane. Notice
that the presence of the outlier in the second quadrant does not significantly affect the fit. The
line labeled $\beta^2$ is the projection of the second L1 principal component in this plane.
Figure 2(d) depicts the two best-fit hyperplanes and the three principal component axes in terms
of the original coordinates of the data. The two hyperplanes are the red plane and the line labeled
$\alpha^1$. The three principal component axes are the lines labeled $\alpha^1$, $\alpha^2$, and
$\alpha^3$.
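The first iteration of the example can be retraced with a short script. The sketch below takes the
reported β^3 and j∗ = 2 as given (it does not solve the linear programs), replaces the second
coordinate of each point so that (β^3)^T z = 0, which follows from the constraint β_{j∗} = −1 in
(2), and then normalizes β^3. The helper name `project_along` is hypothetical, and because the
inputs are the rounded values printed in the text, the results agree with the reported ones only
up to rounding.

```python
import math

# Data of the three-dimensional example, one point per row (rounded values as printed).
X3 = [[-1.17, 1.20, -0.30], [0.53, 0.24, -1.00], [-1.02, 0.40, 1.11],
      [1.12, 1.36, -1.69], [2.08, -1.82, -0.76], [-1.61, 0.53, 0.99],
      [1.17, -1.52, 0.71], [2.00, -1.03, -1.44], [3.00, -2.00, -1.00],
      [3.00, 3.00, 3.00]]
beta3 = [-0.80, -1.00, -0.39]   # reported normal vector, with beta_j = -1 for j* = 2
j_star = 1                      # zero-based index of the second coordinate

def project_along(x, beta, j):
    """Replace coordinate j so that beta . z = 0 (valid because beta[j] == -1)."""
    z = list(x)
    z[j] = sum(b * v for c, (b, v) in enumerate(zip(beta, x)) if c != j)
    return z

Z3 = [project_along(x, beta3, j_star) for x in X3]
total_error = sum(abs(x[j_star] - z[j_star]) for x, z in zip(X3, Z3))
norm = math.sqrt(sum(b * b for b in beta3))
alpha3 = [b / norm for b in beta3]

print([round(v, 2) for v in Z3[0]])   # first projected point
print(round(total_error, 2))          # total L1 distance to the plane
print([round(a, 2) for a in alpha3])  # normalized direction of least variation
```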
Figure 2: L1-PCA∗ implementation for a 3-dimensional example. (a) The point set in 3-D. (b) The
L1 best-fit plane. (c) The projection in the best-fit plane with best-fit line. (d) L1-PCA∗
results. (e) Comparison of L1-PCA∗ versus L2-PCA.
Table 1: Correspondences between L1-PCA∗ and L2-PCA for estimating the k-dimensional best-fit
subspace

No.  Concept                                               Formula
1    kth principal component $\alpha^k$, k = 2,...,m       $\left(\prod_{\ell=m+1}^{k+1} V^\ell\right) \beta^k / \|\beta^k\|_2$
     (set $\alpha^1$ orthogonal to $\alpha^2,\dots,\alpha^m$)
2    Score of observation i                                $x_i^k$
     (from Step 4 of Algorithm L1-PCA∗)
3    Projection of point $x_i$ for observation i           $\left(\prod_{\ell=m+1}^{k+1} V^\ell\right) x_i^k$
     (in terms of original coordinates)
4    Score of a new point $x_{n+1}$                        $\left(\prod_{\ell=k+1}^{m+1} (V^\ell)^T (I^{j^*})^\ell\right) x_{n+1}$
5    Projection of a new point $x_{n+1}$                   $\left(\prod_{r=m+1}^{k+1} V^r\right) \left(\prod_{\ell=k+1}^{m+1} (V^\ell)^T (I^{j^*})^\ell\right) x_{n+1}$
     (in terms of original coordinates)
6    Cumulative proportion of dispersion                   $\dfrac{\sum_{j=1}^{k} \sum_{i=1}^{n} |x^k_{ij} - \tilde{x}^k_j|}{\sum_{j=1}^{m} \sum_{i=1}^{n} |x_{ij} - \tilde{x}_j|}$
     in the k-dimensional best-fit subspace
Figure 2(e) compares the L1-norm best-fit hyperplane with the L2-norm best-fit hyperplane. The
view is oriented so that the L1 hyperplane is orthogonal to the plane of the page. The L1
hyperplane is a good fit for most of the data points in spite of the presence of the outlier
$x_{10} = (3.0, 3.0, 3.0)$. The outlier has clearly affected the location of the L2 hyperplane,
which does not fit the rest of the data well. This comparison between L1- and L2-based methods for
PCA is formalized in the experiments below.
6 Correspondences between L1-PCA∗ and L2-PCA
When Algorithm L1-PCA∗ is applied to a data set, values analogous to those used in an application
of L2-PCA are obtained. These values permit an analysis with all the functionality of L2-PCA.
Table 1 collects these results along with their explicit formulas in L1-PCA∗. Below, we explain how
to obtain robust principal components and scores, project new points into a fitted subspace, and
estimate the proportion of dispersion explained by fitted subspaces.
Extracting Principal Components and Scores. As Algorithm L1-PCA∗ iteratively projects
points into lower-dimensional spaces, we can collect the successive directions of least variation
as principal components. The vector that is orthogonal to projected points is unique at each
iteration k. This vector is precisely $\beta^k$. Also, when the singular value decomposition of the
projected points is calculated in Step 5, $Z^k = U \Lambda V^T$, the normalized direction of least
variation $\alpha^k = \beta^k / \|\beta^k\|_2$ is the column of V corresponding to the smallest
value in the diagonal matrix Λ. This direction defined by $\beta^k$ is a k-dimensional vector in
the current subspace. Formula 1 in Table 1 presents $\alpha^k$, the kth L1 principal component in
terms of the original m-dimensional space.
The rows of the matrix $X^k$, $x_i^k$ in Table 1, are the principal component scores. These are the
projected points in the projected coordinate system. For an observation i, the projection into the
k-dimensional subspace in terms of the original coordinates is calculated using Formula 3.
Projecting New Points. The principal components obtained with Formula 1 of Table 1 define
the subspace(s) into which observations are projected. In L2-PCA, the matrix whose columns
are the first k principal components is the rotation matrix and is used to project points into the
k-dimensional fitted subspace. The projection of a point using L1-PCA∗ depends on the sequence
of intermediate subspaces, so the matrix of principal components should not be used as a
projection matrix. L1-PCA∗ projects optimally one dimension at a time in a unit direction that
may not coincide with the direction of least variation $\beta^k$.
For a new point $x_{n+1}$, the projection into the best-fit (m − 1)-dimensional subspace is given
by $(I^{j^*})^m x_{n+1}$. The projected point in terms of the original coordinates is given by
$V^m (I^{j^*})^m x_{n+1}$. To project the point into a k-dimensional subspace, use Formulas 4
and 5.

Table 2: Calculated Values for Numerical Example using Formulas from Table 1
($x_{n+1} = (-2.0, 3.0, 1.0)$)

Formula 1, Principal Components
  k = 3:  $\alpha^3 = (-0.59, -0.75, -0.29)$
  k = 2:  $\alpha^2 = (0.04, -0.40, 0.92)$
  k = 1:  $\alpha^1 = (0.80, -0.53, -0.27)$
Formula 2, Scores
  k = 3:
\[
(X^3)^T = \begin{bmatrix}
-1.17 & 0.53 & -1.02 & 1.12 & 2.08 & -1.61 & 1.17 & 2.00 & 3.00 & 3.00 \\
1.20 & 0.24 & 0.40 & 1.36 & -1.82 & 0.53 & -1.52 & -1.03 & -2.00 & 3.00 \\
-0.30 & -1.00 & 1.11 & -1.69 & -0.76 & 0.99 & 0.71 & -1.44 & -1.00 & 3.00
\end{bmatrix}
\]
  k = 2:
\[
(X^2)^T = \begin{bmatrix}
-1.58 & 0.38 & -0.97 & 0.92 & 2.43 & -1.77 & 1.70 & 2.13 & 3.54 & 4.73 \\
0.24 & 1.07 & -1.21 & 1.82 & 0.92 & -1.13 & -0.66 & 1.61 & 1.22 & -2.91
\end{bmatrix}
\]
  k = 1:
\[
(X^1)^T = \begin{bmatrix}
-1.67 & 0.40 & -1.03 & 0.98 & 2.57 & -1.87 & 1.80 & 2.25 & 3.74 & 5.00
\end{bmatrix}
\]
Formula 3, Projections (one column per observation, original coordinates)
  k = 3:
\[
\begin{bmatrix}
-1.17 & 0.53 & -1.02 & 1.12 & 2.08 & -1.61 & 1.17 & 2.00 & 3.00 & 3.00 \\
1.20 & 0.24 & 0.40 & 1.36 & -1.82 & 0.53 & -1.52 & -1.03 & -2.00 & 3.00 \\
-0.30 & -1.00 & 1.11 & -1.69 & -0.76 & 0.99 & 0.71 & -1.44 & -1.00 & 3.00
\end{bmatrix}
\]
  k = 2:
\[
\begin{bmatrix}
-1.17 & 0.53 & -1.02 & 1.12 & 2.08 & -1.61 & 1.17 & 2.00 & 3.00 & 3.00 \\
1.05 & -0.03 & 0.38 & -0.22 & -1.36 & 0.90 & -1.21 & -1.03 & -2.00 & -3.58 \\
-0.30 & -1.00 & 1.11 & -1.69 & -0.76 & 0.99 & 0.71 & -1.44 & -1.00 & 3.00
\end{bmatrix}
\]
  k = 1:
\[
\begin{bmatrix}
-1.34 & 0.32 & -0.83 & 0.78 & 2.06 & -1.50 & 1.45 & 1.81 & 3.00 & 4.01 \\
0.89 & -0.22 & 0.55 & -0.52 & -1.37 & 1.00 & -0.96 & -1.20 & -2.00 & -2.67 \\
0.45 & -0.11 & 0.28 & -0.26 & -0.69 & 0.50 & -0.48 & -0.60 & -1.00 & -1.34
\end{bmatrix}
\]
Formula 4, Scores, New Point
  k = 3:  (−2.00, 3.00, 1.00)
  k = 2:  (−2.26, −1.16)
  k = 1:  −2.39
Formula 5, Projections, New Point
  k = 3:  (−2.00, 3.00, 1.00)
  k = 2:  (−2.00, 1.20, 1.00)
  k = 1:  (−1.92, 1.28, 0.64)
Formula 6, Dispersion Explained
  k = 3:  1.00
  k = 2:  0.80
  k = 1:  0.50
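For k = 2, Formulas 4 and 5 reduce to one projection along j∗ followed by a change of coordinates
with V^3. The following sketch applies these two steps to the new point of Table 2, using the
rounded β^3 and V^3 reported for the example. The helper names are hypothetical, the projection
step is written directly from the constraint β_{j∗} = −1 in (2), and the results match Table 2
only up to rounding of the inputs.

```python
# Quantities reported for the example (rounded); V3 holds the first two columns of V.
beta3 = [-0.80, -1.00, -0.39]
j_star = 1                                   # zero-based index of j* = 2
V3 = [[0.77, 0.22], [-0.63, 0.20], [0.05, -0.95]]
x_new = [-2.0, 3.0, 1.0]

def project_along(x, beta, j):
    """Replace coordinate j so that beta . z = 0 (valid because beta[j] == -1)."""
    z = list(x)
    z[j] = sum(b * v for c, (b, v) in enumerate(zip(beta, x)) if c != j)
    return z

def score(z, V):
    """Coordinates of z in the basis given by the columns of V, i.e. V^T z."""
    return [sum(V[r][c] * z[r] for r in range(len(z))) for c in range(len(V[0]))]

def to_original(s, V):
    """Map scores back to the original coordinates, i.e. V s."""
    return [sum(V[r][c] * s[c] for c in range(len(s))) for r in range(len(V))]

z = project_along(x_new, beta3, j_star)      # projection onto the fitted plane along j*
s = score(z, V3)                             # Formula 4 with k = 2
p = to_original(s, V3)                       # Formula 5 with k = 2

print([round(v, 2) for v in s])
print([round(v, 2) for v in p])
```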
Measuring Dispersion Explained. The dispersion in a data set explained by the fitted subspaces
of L1-PCA∗ is measured using the L1 norm directly. Formula 6 applies this notion by comparing the
sum of L1 distances to the medians in the projected point sets with that in the original data. For
variable j, the median among points i = 1,...,n is $\tilde{x}_j$. In the projected spaces, we
measure dispersion along the axes of the defined coordinate system; accordingly, the median score
for coordinate j is denoted by $\tilde{x}^k_j$.
Table 2 applies the formulas from Table 1 to the numerical example from the previous section for
three subspaces: 1) for k = 3, the space $\mathbb{R}^3$ where the original data reside, 2) for
k = 2, the fitted plane in Figure 2(b), and 3) for k = 1, the line labeled $\alpha^1$ in
Figure 2(d). Applying Formula 6, we see that the fitted plane explains 80% and the fitted line 50%
of the dispersion in the original data.
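Formula 6 can be checked against the rounded scores in Table 2. The sketch below is a minimal
illustration with the hypothetical function name `l1_dispersion`; it uses the coordinate-wise
median from Python's `statistics` module and prints the cumulative proportions for k = 3, 2, 1.

```python
from statistics import median

def l1_dispersion(rows):
    """Sum over coordinates of the L1 distances to the coordinate-wise median."""
    total = 0.0
    for column in zip(*rows):
        m = median(column)
        total += sum(abs(v - m) for v in column)
    return total

# Scores from Table 2, one observation per row (rounded values as printed).
X3 = [[-1.17, 1.20, -0.30], [0.53, 0.24, -1.00], [-1.02, 0.40, 1.11],
      [1.12, 1.36, -1.69], [2.08, -1.82, -0.76], [-1.61, 0.53, 0.99],
      [1.17, -1.52, 0.71], [2.00, -1.03, -1.44], [3.00, -2.00, -1.00],
      [3.00, 3.00, 3.00]]
X2 = [[-1.58, 0.24], [0.38, 1.07], [-0.97, -1.21], [0.92, 1.82], [2.43, 0.92],
      [-1.77, -1.13], [1.70, -0.66], [2.13, 1.61], [3.54, 1.22], [4.73, -2.91]]
X1 = [[-1.67], [0.40], [-1.03], [0.98], [2.57], [-1.87], [1.80], [2.25], [3.74], [5.00]]

base = l1_dispersion(X3)                     # dispersion of the original data
proportions = {k: l1_dispersion(scores) / base for k, scores in ((3, X3), (2, X2), (1, X1))}
for k in (3, 2, 1):
    print(k, round(proportions[k], 2))
```

With these rounded inputs the proportions round to 1.0, 0.8, and 0.5, in line with the Formula 6
entries of Table 2.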
7 Computational Results
L1-PCA∗ is implemented and its performance on simulated and real-world data is compared to
L2-PCA and pcaPP, a publicly available implementation of the L1-norm-based PCA procedure
developed by Croux and Ruiz-Gazen [8]. The approach implemented in pcaPP maximizes an L1
measure of dispersion in the data to find successive locally optimal directions of maximum
variation. L1-PCA∗ is implemented in a C program that uses ILOG CPLEX 11.1 Callable Library [14]
for the solution of the linear programs required for Step 3. The singular value decomposition in
Step 5 is calculated using the function dgesvd in LAPACK [2]. The L2-PCA and pcaPP implementations
are publicly available in the stats [24] and pcaPP [11] packages for the R language for statistical
computing [24]. The function used for L2-PCA is prcomp. All experiments are performed on
machines with 2.6GHz Opteron processors and 4GB RAM.
Tests with simulated data. The implementations are tested on simulated data. Simulated data
provide a controlled setting for comparison of data analysis algorithms and reveal trends that help
make generalizable conclusions. The objectives for the data are to explore the impact of outliers on
the procedures by varying the dimensionality and magnitude of outlier contamination. The data
are generated such that a predetermined subspace contains most of the dispersion. The dimension
of this “true” subspace is varied to assess the dependence on this data characteristic.
Each data set consists of n = 1000 observations with m = 10 dimensions. The first q dimensions
define the predetermined subspace and the remaining dimensions contain noise. The first q dimen-
sions are sampled from a U(−10,10) distribution; the remaining dimensions are sampled from a
Laplace(0,0.1) distribution. Outliers are introduced in the data set by generating additional points
where the first q dimensions are sampled from a U(−10,10) distribution, the next p dimensions are
sampled from a Laplace(µ,0.01) distribution so that the outliers are on the same side of the true
subspace, and the remaining 10−p−q dimensions are sampled from a Laplace(0,0.1) distribution.
The problem suite also includes control data sets (p = 0, µ = 0) without outlier observations. Out-
lier observations comprise ten percent of each data set. We refer to the parameter p as the number of
outlier-contaminated dimensions and the parameter µ as the outlier magnitude. Data sets are also
generated by replacing distribution Laplace(0,0.1) with N(0,0.1) and replacing Laplace(µ,0.01)
with N(µ,0.01).
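One way to generate a data set of the kind described above is sketched below. Several details are
assumptions rather than statements from the paper: n is taken to count all observations including
the outliers, the second argument of Laplace(·,·) is treated as a scale parameter, outliers are
stored in the last rows, and the function names `laplace` and `simulate` are hypothetical. Laplace
variates are drawn as the difference of two exponential variates.

```python
import random

def laplace(rng, loc, scale):
    """Laplace(loc, scale) draw as the scaled difference of two unit exponentials."""
    return loc + scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def simulate(q, p, mu, n=1000, m=10, outlier_fraction=0.10, seed=0):
    """Return n rows of length m; the last outlier_fraction * n rows are outliers.

    Assumption: n includes the outliers, so 900 regular points and 100 outliers by default.
    """
    rng = random.Random(seed)
    n_out = int(round(outlier_fraction * n))
    rows = []
    for i in range(n):
        # The first q coordinates span the predetermined ("true") subspace.
        row = [rng.uniform(-10.0, 10.0) for _ in range(q)]
        if i < n - n_out:
            row += [laplace(rng, 0.0, 0.1) for _ in range(m - q)]
        else:
            # Outliers: p contaminated coordinates centred at mu, the rest plain noise.
            row += [laplace(rng, mu, 0.01) for _ in range(p)]
            row += [laplace(rng, 0.0, 0.1) for _ in range(m - q - p)]
        rows.append(row)
    return rows

data = simulate(q=5, p=2, mu=10.0)
print(len(data), len(data[0]))
```

Setting p = 0 and mu = 0 yields a control data set, since the last rows are then drawn from the
same distributions as the regular observations.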
Tests are conducted for the configurations that result when q = 2, 5; p = 1, 2, 3; and
µ = 1, 5, 10, 25; in addition to the control data sets; for a total of 26 configurations. We define
the error for an observation as the distance from its projected point in the best-fitting
q-dimensional subspace determined by each of the three methods to the predetermined q-dimensional
subspace. Errors are measured using the L1 distance.

Figure 3: The error, the sum of L1 distances of projected points in a 5-dimensional subspace to the
“true” 5-dimensional subspace of the data, versus outlier magnitude with Laplacian noise for (a)
p = 1, (b) p = 2, and (c) p = 3. The average error over 100 iterations is plotted. The parameter p
is the number of outlier-contaminated dimensions. [Plots of L1 distance against outlier magnitude
µ = 1, 5, 10, 25 for L1-PCA∗, L2-PCA, and pcaPP.]
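Under the simulation design described in this section, the true subspace is spanned by the first q
coordinate axes, so the L1 distance from a point to it is the sum of the absolute values of the
remaining coordinates. A minimal sketch of the error computation follows, assuming the projected
points are expressed in the original coordinates; the function names are hypothetical.

```python
def l1_error_to_true_subspace(point, q):
    """L1 distance from `point` to the subspace spanned by the first q coordinate axes."""
    return sum(abs(v) for v in point[q:])

def total_error(projected_points, q):
    """Sum of the per-observation errors over a set of projected points."""
    return sum(l1_error_to_true_subspace(x, q) for x in projected_points)

projected = [[1.0, 2.0, 0.5, -0.25], [0.0, 0.0, -1.0, 0.0]]
print(total_error(projected, q=2))  # 1.75
```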
The problem suite is processed by the L1-PCA∗, L2-PCA, and pcaPP implementations. The results
for Laplacian noise are reported in Table 3 and the results for Gaussian noise are reported in
Table 4. These tables contain the mean and standard deviation of errors for 100 replications of
each configuration.
Our experiments include four controls with µ = 0 and p = 0: one for each value of q under each
distribution of noise. In these data sets, there are no outliers. L2-PCA outperforms L1-PCA∗ in 3
of the 4 control experiments. As expected, L2-PCA outperforms L1-PCA∗ in the control experiments
when the noise is Gaussian. In the presence of Laplacian noise, L1-PCA∗ performs better than
L2-PCA when the underlying subspace has a higher dimension (q = 5). The explanation is that the
fitted hyperplanes of successively smaller dimension derived by L1-PCA∗ are optimal with respect
to the projected data at each iteration and do not necessarily coincide with the L1 best-fit
subspace with respect to the original data points. Therefore, there is a degradation in the
performance of L1-PCA∗ as the dimension of the true subspace decreases. For the control
experiments, pcaPP is not competitive.
In the presence of outliers, L1-PCA∗ has a clear advantage over the other two methods. Figure 3
illustrates the results for q = 5 and Laplacian noise. The figure compares the performance of the
three implementations with respect to outlier magnitude µ and number of outlier-contaminated
dimensions p. As the outlier magnitude increases, L1-PCA∗ outperforms L2-PCA and pcaPP for
each value of p. When q = 2, the advantage of using L1-PCA∗ is not as dramatic. L1-PCA∗ is
also better in the presence of outliers when the noise is Gaussian. This confirms that L1-PCA∗
performs better in the presence of outliers.
All of the methods have slightly better performance when the noise is Gaussian than when the noise
is Laplacian. This effect is simply because there is less dispersion in Normal(µ, σ) noise compared
to Laplace(µ, σ) noise.
For each method, as µ and p are increased, the breakdown point is reached where the methods
begin to fit the outlier observations better than the non-contaminated data. For p = 1, 2, L1-PCA∗
does not break down, even as µ is increased to 25, while significant increases in error are seen
for L2-PCA and pcaPP. The approach of the breakdown point for L1-PCA∗ and L2-PCA as p and
µ increase is signaled in Table 3 by an increase in the standard deviations of the errors. For the
configurations with large standard deviations, such as q = 2, p = 3, and µ = 10, the methods fit
the non-contaminated data for some samples and fit the outlier observations for other samples,
which results in drastically different errors. The standard deviations of the errors for pcaPP are
larger than those for L1-PCA∗ and L2-PCA for almost every configuration, an indication that pcaPP
is sensitive to small changes in the data. For q = 5, we can see in Figure 3 that L1-PCA∗ is less
susceptible to breakdown when p = 1, 2 than L2-PCA and pcaPP and breaks down at a higher
outlier magnitude when p = 3.

Figure 4: Sum of residuals of non-outlier observations versus the dimension of the fitted subspace
for (a) the Milk data set and (b) the McDonald and Schwing data set. [Plots of sum of residuals
against dimension of projected subspace for L1-PCA∗, L2-PCA, and pcaPP.]
Our experiments show that L1-PCA∗ is the indicated procedure in the presence of unbalanced
outlier contamination. These experiments validate the intuition that L1-PCA∗ is robust to outliers
because of the underlying reliance on optimally fitted L1 hyperplanes.
Tests with Real Data. The three implementations, L1-PCA∗, L2-PCA, and pcaPP, are applied
to real-world data that are known to contain outliers. The “Milk” data set is introduced by Daudin
[9] and used by Choulakian [7] for tests with an L1 projection-pursuit algorithm for PCA. The
“McDonald and Schwing” data set is introduced by McDonald and Schwing [22] and is used by
Croux and Ruiz-Gazen [8] for tests with an L1 projection-pursuit algorithm for PCA. For each
data set, the data are centered by subtracting the attribute medians.
Figure 4 contains plots of the sum of residuals of non-outlier observations against the dimension
of the fitted subspace for the two data sets. The residual for an observation is measured as the
L1 distance from the original point to the projected point in the fitted subspace. If a method is
properly ignoring the outlier observations and fitting the non-contaminated data, then the plotted
values should be small.
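The preprocessing and the residual measure can be written compactly. The sketch below is
illustrative only: the function names are hypothetical, the projections are assumed to be supplied
by whichever PCA method is being evaluated, and outlier observations are identified by zero-based
row index.

```python
from statistics import median

def center_by_medians(rows):
    """Subtract each attribute's median from the corresponding column."""
    medians = [median(col) for col in zip(*rows)]
    return [[v - m for v, m in zip(row, medians)] for row in rows], medians

def l1_residual(point, projection):
    """L1 distance from an original point to its projection in the fitted subspace."""
    return sum(abs(a - b) for a, b in zip(point, projection))

def sum_of_residuals(points, projections, outlier_rows=()):
    """Total residual over the observations whose row index is not listed as an outlier."""
    skip = set(outlier_rows)
    return sum(l1_residual(x, y)
               for i, (x, y) in enumerate(zip(points, projections))
               if i not in skip)

centered, medians = center_by_medians([[1.0, 10.0], [2.0, 30.0], [6.0, 20.0]])
print(medians)   # [2.0, 20.0]
print(centered)  # [[-1.0, -10.0], [0.0, 10.0], [4.0, 0.0]]
```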
Figure 4(a) contains the results for the Milk data set. Observations 17, 47, and 70 are identified
as outliers in previous analyses (see [7]). The sums of residuals for the non-outlier observations
for L1-PCA∗ and pcaPP are almost identical and are less than that for L2-PCA for two and three
dimensions.
The residuals for the non-outliers in the McDonald and Schwing data set are depicted in Figure
4(b). Observations 29 and 48 are identified as outliers in previous analyses (see [8]). The sum of
residuals for non-outlier observations for pcaPP is larger for smaller-dimensional fitted subspaces
when compared to L1-PCA∗ and L2-PCA. Procedure pcaPP appears to be pushing the fitted
subspace away from the outlier observations at the expense of the fit of non-outlier observations.
The expectation that L1-PCA∗ performs well in the presence of outliers is validated for these
data sets. For the two real data sets, Figure 4 shows that L1-PCA∗ generates subspaces that are
competitive in terms of having low residuals for non-outliers. The diminished advantage of using
L1-PCA∗ for these real data sets when compared to the simulated data can be attributed to the
fact that the outliers in the real data are more balanced in that they are not located on one side
of the best-fitting subspace for the non-contaminated data. Real data with outlier contamination
are likely to have some imbalance in the location of the outliers. L1-PCA∗ is well-suited in these
situations.
8 Conclusions
This paper proposes a new procedure for PCA based on finding the L1-norm best-fit hyperplane.
The result is a pure L1 PCA procedure in the sense that it uses the globally optimal solution of
an optimization problem where the sum of L1 distances is minimized. We describe a complete
PCA procedure, L1-PCA∗, based on this result, and include formulas for correspondences with
familiar L2-PCA outputs. The procedure is tested on simulated and real data. The results for
simulated data indicate that the procedure can be more robust than L2-PCA and a competing
L1-based procedure, pcaPP, for unbalanced outlier contamination and for a wide range of outlier
magnitudes. Experiments with real data confirm that L1-PCA∗ is competitive in data sets with
outliers. L1-PCA∗ represents an alternative tool for the numerous applications of PCA.
9Supplementary Information
An R [24] script for L1-PCA∗, data for the numerical example of Sections 5 and 6, and simulated
data for the experiments in Section 7 are available at
http://www.people.vcu.edu/~jpbrooks/l1pcastar.
10 Acknowledgments
The first author was supported in part by NIH award 1UH2AI083263-01 and NASA award NNX09AR44A.
The authors would also like to acknowledge The Center for High Performance Computing at VCU
for providing computational infrastructure and support.
References
[1] S. Agarwal, M.K. Chandraker, F. Kahl, D. Kriegman, and S. Belongie. Practical global opti-
mization for multiview geometry. Lecture Notes in Computer Science, 3951:592–605, 2006.
[2] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz,
A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users’ Guide. Society for
Industrial and Applied Mathematics, Philadelphia, PA, third edition, 1999.
[3] G. Appa and C. Smith. On L1 and Chebyshev estimation. Mathematical Programming,
5:73–87, 1973.
[4] A. Baccini, P. Besse, and A. de Falguerolles. A L1-norm PCA and heuristic approach. In
Proceedings of the International Conference on Ordinal and Symbolic Data Analysis, pages
359–368, 1996.
[5] J.P. Brooks and J.H. Dulá. The L1-norm best-fit hyperplane problem. Submitted; available at
http://www.optimization-online.org/DB_HTML/2009/05/2291.html, 2009.
[6] A. Charnes, W.W. Cooper, and R.O. Ferguson. Optimal estimation of executive compensation
by linear programming. Management Science, 1:138–150, 1955.
[7] V. Choulakian. L1-norm projection pursuit principal component analysis. Computational
Statistics and Data Analysis, 50:1441–1451, 2006.
[8] C. Croux and A. Ruiz-Gazen. High breakdown estimators for principal components: the
projection-pursuit approach revisited. Journal of Multivariate Analysis, 95:206–226, 2005.
[9] J.J. Daudin, C. Duby, and P. Trecourt. Stability of principal component analysis studied by
the bootstrap method. Statistics, 19:241–258, 1988.
[10] C. Ding, D. Zhou, X. He, and H. Zha. R1-PCA: Rotational invariant L1-norm principal
component analysis for robust subspace factorization. In Proceedings of the 23rd International
Conference on Machine Learning, pages 281–288, 2006.
[11] P. Filzmoser, H. Fritz, and K. Kalcher. pcaPP: Robust PCA by projection pursuit, 2009.
[12] J.S. Galpin and D.M. Hawkins. Methods of L1 estimation of a covariance matrix.
Computational Statistics and Data Analysis, 5:305–319, 1987.
[13] J. Gao. Robust L1 principal component analysis and its Bayesian variational inference. Neural
Computation, 20:555–572, 2008.
[14] ILOG. ILOG CPLEX Division. 889 Alder Avenue, Incline Village, Nevada, 2009.
[15] I.T. Jolliffe. Principal Component Analysis. Springer, 2nd edition, 2002.
[16] Q. Ke and T. Kanade. Robust subspace computation using L1 norm. Technical Report CMU-
CS-03-172, Carnegie Mellon University, Pittsburgh, PA, 2003.
[17] L.B. Kier, P.G. Seybold, and C.-K. Cheng. Cellular automata modeling of chemical systems:
a textbook and laboratory manual. Springer, 2005.
[18] N. Kwak. Principal component analysis based on L1-norm maximization. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 30:1672–1680, 2008.
[19] G. Li and Z. Chen. Projection-pursuit approach to robust dispersion matrices and principal
components: Primary theory and Monte Carlo. Journal of the American Statistical Association,
80:759–766, 1985.
[20] O.L. Mangasarian. Arbitrary-norm separating plane. Operations Research Letters, 24:15–23,
1999.
[21] H. Martini and A. Schöbel. Median hyperplanes in normed spaces - a survey. Discrete Applied
Mathematics, 89:181–195, 1998.
[22] G.C. McDonald and R.C. Schwing. Instabilities of regression estimates relating air pollution
to mortality. Technometrics, 15:463–481, 1973.
[23] S.C. Narula and J.F. Wellington. The minimum sum of absolute errors regression: A state of
the art survey. International Statistical Review, 50:317–326, 1982.
[24] R Development Core Team. R: A Language and Environment for Statistical Computing. R
Foundation for Statistical Computing, Vienna, Austria, 2008. ISBN 3-900051-07-0.
[25] H. Späth and G.A. Watson. On orthogonal linear l1 approximation. Numerische Mathematik,
51:531–543, 1987.
[26] W.H. Wagner. Linear programming techniques for regression analysis. Journal of the American
Statistical Association, 54:206–212, 1959.