The L1-norm best-fit hyperplane problem
JP Brooks^1 and JH Dulá^2
^1 Corresponding Author, Department of Statistical Sciences and Operations Research, Virginia Commonwealth University, Richmond, VA 23284, jpbrooks@vcu.edu
^2 Department of Management, Virginia Commonwealth University, Richmond, VA 23284
August 12, 2009
1 Abstract
We present a simple and efficient algorithm for solving the L1-norm best-fit hyperplane problem derived using first principles and intuitive geometric insights about L1 projections. The problem is easy to solve because the procedure relies on the solution of a small number of linear programs. We provide a simple proof that global optimality is achieved. The procedure is implemented for validation and testing. The result can be the basis for an L1 principal component analysis method.
2 Introduction
Given points xi ∈ R^m, i = 1, ..., n, consider the Lp-norm best-fit hyperplane problem for the case when the hyperplane is an (m−1)-dimensional subspace:

    min_{αi, i=1,...,n; V}  Σ_{i=1}^{n} ||xi − V αi||_p,                             (1)

where ||·||_p is the Lp-norm of the argument, V ∈ R^{m×(m−1)}, αi ∈ R^{m−1}, and p ≥ 1. A solution to this nonconvex mathematical program, (V*, A* = [α*_1, α*_2, ..., α*_n]), defines a subspace in R^m, {x ∈ R^m | x = V*α for some α ∈ R^{m−1}}, that minimizes the sum of the p-norm distances of the points to the subspace. Our results extend directly to the general hyperplane case. This representation of an affine set in terms of linear combinations of vectors in V has several specialized applications, such as providing information about the directions of dispersion in a point set with regard to the Lp-norm.

The case when p = 2 is a well-studied problem dating back to Pearson [21]. The optimal solution V* consists of the m−1 eigenvectors of X^T X, where X^T = [x1, x2, ..., xn], corresponding to the m−1 largest eigenvalues [13]. The solution minimizes the sum of Euclidean distances of points to their orthogonal projections in the fitted hyperplane. The problem when p = 2 is a basis for traditional principal component analysis (PCA). The columns of V* define the first m−1 principal components; the last principal component is the normal vector to the optimal hyperplane [13].

This paper deals with the case when p = 1. The solution to this problem minimizes the sum of L1 distances of points to their L1 projections in the fitted hyperplane. The problem under consideration is not the orthogonal linear L1 approximation problem (see, e.g., [23]). As we will see, the optimal solution is based on the residuals in only a single unit direction rather than the distances between points and their orthogonal projections in the fitted subspace. Our result consolidates and synthesizes more general results about projections [17] and the problem of fitting hyperplanes to data [18].
The L1-norm best-fit hyperplane problem has several applications. Ke and Kanade [14, 15] apply a more general form of (1) with p = 1 where the subspace defined by V can have fewer than m−1 dimensions. They use the formulation in the context of subspace estimation for image analysis using the affine camera model [15]. Agarwal et al. [1] solve the more general perspective camera model. Kwak [16] treats a problem closely related to the L1-norm best-fit hyperplane problem by finding successive directions of maximum variation, maximizing the L1 lengths of projected points along a fitted vector. The approach is the basis of an L1 PCA method that he applies to face recognition data. In each of these three works, there is no guarantee of global optimality in polynomial time for the respective L1 best-fit optimization problem formulations. Ke and Kanade [14] provide an exact solution when m = 2 and a heuristic approach for generating locally optimal solutions when m > 2. Agarwal et al. [1] reformulate the problem as a fractional program and give a branch-and-bound algorithm for finding globally optimal solutions; the algorithm has an exponential worst-case running time. Kwak [16] provides an algorithm with only local optimality of solutions guaranteed.
Various schemes for PCA have been proposed that involve the L1-norm to impart robustness. Previous approaches include using the L1-norm for robust covariance matrix estimation [5, 8], specifying a fixed-effects model based on a multivariate Laplace distribution and applying heuristics for parameter estimation [3, 9], and employing heuristics for finding successive directions of maximum variation based on the L1-norm [7, 16].

Access to the L1-norm best-fit hyperplane obtained by solving (1) when p = 1 is the first step towards a future pure L1-based PCA procedure. The procedure will project points down one dimension at each iteration until the projected points lie in a one-dimensional subspace. Each iteration will provide the direction of least dispersion in the respective subspace. This procedure will benefit from well-known L1 properties such as robustness to outliers.
3 Notation, Assumptions, Definitions

We assume the span of the points has dimension at least m−1, so that there exists a matrix V* of full column rank that is optimal for (1). Boldface uppercase letters represent matrices, and boldface lowercase letters represent vectors. A unit direction is a direction along one of the 2m unit vectors ±uj, j = 1, ..., m. The external representation [22] of a subspace S ⊆ R^m is the set {x ∈ R^m | Ax = 0} for an appropriately-defined matrix A ∈ R^{q×m}. The internal representation [22] of the same subspace is {x ∈ R^m | x = V^T α for some α} for a matrix V ∈ R^{q×m} where the rows of V span the subspace. The projection of a point x onto a set S is the set of points P such that the distance between x and points in P is minimum among all points in S; we will call elements of P projections.
4 L1 Projection
Suppose we are given a point x̂ ∈ R^m and a matrix V ∈ R^{m×(m−1)} of full column rank. The projection of x̂ onto S = {x ∈ R^m | x = V α for some α} can be found by solving the following optimization problem:

    min_α ||x̂ − V α||_1 = min_α Σ_{j=1}^{m} |x̂_j − (V α)_j|.                          (2)

For non-negative variables λ+ = [λ+_1, ..., λ+_m]^T and λ− = [λ−_1, ..., λ−_m]^T, let λ+ − λ− = x̂ − V α. The mathematical program in (2) can be reformulated as a linear program (LP) that leads to important geometric insights:

    min_{α, λ+, λ−} Σ_{j=1}^{m} |λ+_j − λ−_j|  =  min_{α, λ+, λ−} Σ_{j=1}^{m} (λ+_j + λ−_j)
                                          s.t.  V α + λ+ − λ− = x̂           ≡ LP(V, x̂)   (3)
                                                λ+, λ− ≥ 0

An optimal solution to LP(V, x̂) provides the magnitudes λ+_j and λ−_j for the unit directions for an L1 projection of x̂ onto S. Optimal values for α are scaling factors that locate the projection in terms of a linear combination of the columns of V. The following result states that there exists a projection of x̂ ∈ R^m onto an (m−1)-dimensional subspace that is located along a single unit direction from x̂.
Result 1. Given a subspace S = {x ∈ R^m | x = V α for some α} of dimension m−1 and a point x̂ ∈ R^m, x̂ ∉ S, there exists a solution to (3) with exactly one component from (λ+, λ−) positive.

Proof. Because the variables in α are unbounded, they will never leave the basis in a simplex pivot (see [19], p. 170). Therefore, there exists an optimal basic feasible solution with all of the variables in α and one component of (λ+, λ−) basic.
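To make the geometry concrete, LP(V, x̂) in (3) can be assembled directly for a small instance. The following is a minimal sketch using SciPy's linprog; the solver choice and the names l1_project, V, and x_hat are ours for illustration, not the paper's (the authors' implementation uses CPLEX):

```python
import numpy as np
from scipy.optimize import linprog

def l1_project(V, x_hat):
    """Solve LP(V, x_hat) from (3): an L1 projection of x_hat onto span(V)."""
    m, k = V.shape  # k = m - 1 for a hyperplane through the origin
    # Variables are stacked as [alpha (k, free), lam_plus (m), lam_minus (m)];
    # the objective charges only the lambda magnitudes.
    c = np.concatenate([np.zeros(k), np.ones(2 * m)])
    # Equality constraints: V alpha + lam_plus - lam_minus = x_hat
    A_eq = np.hstack([V, np.eye(m), -np.eye(m)])
    bounds = [(None, None)] * k + [(0, None)] * (2 * m)
    res = linprog(c, A_eq=A_eq, b_eq=x_hat, bounds=bounds)
    alpha = res.x[:k]
    return V @ alpha, res.fun  # the projection and its L1 distance

# The line spanned by (2, 1) has slope 1/2, so by Result 1 the projection
# of (1, 3) is along a single (here vertical) unit direction.
V = np.array([[2.0], [1.0]])
proj, dist = l1_project(V, np.array([1.0, 3.0]))  # proj = (1, 0.5), dist = 2.5
```

Consistent with Result 1, an optimal basic solution leaves exactly one component of (λ+, λ−) positive: here the residual (0, 2.5) lies entirely along the vertical unit direction.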
Result 1 can also be proved using Corollary 2.2 in [17], after applying a correction (Σ_{i=1}^{n} νi = 1 is replaced with Σ_{i : |wi| = ||w||∞} νi = 1; see [17]).

Figure 1 illustrates Result 1 in two dimensions. In Figure 1(a), the unique projection of x̂ onto S is along the negative vertical unit direction. The subspace S is defined by the solitary vector in V; therefore, the value for α* in LP(V, x̂) is positive and less than 1. Figure 1(b) illustrates the situation when the projection is along the horizontal direction. Because of the orientation of the vector in V, the value for α* will be negative. Figure 1(c) illustrates the special case when S is a line that makes a 45 degree angle with the coordinate axes. In this case, the projection is a segment of S. There exist optimal solutions to LP(V, x̂) corresponding to projections along both of the horizontal and vertical unit directions. As we will see, the projection direction depends on the orientation of S and not on the location of x̂.

Next consider the projection of a set of points xi ∈ R^m, i = 1, ..., n onto an (m−1)-dimensional subspace. The following result establishes that each point projects onto the subspace along the same unit direction.
Result 2. Given a set of points xi ∈ R^m, i = 1, ..., n and a subspace S = {x ∈ R^m | x = V α for some α} of dimension m−1, there exists an optimal solution to LP(V, xi) with either λ+_{j′} ≥ 0 or λ−_{j′} ≥ 0 for some j′, and λ+_j = λ−_j = 0 for j ≠ j′.
Figure 1: The L1 projection, P, of a point x̂ onto a subspace S depends on the orientation of the subspace. In 2D, when the angle θ is different from 45°, the projection is unique and lies directly along either (a) the y-axis or (b) the x-axis. When (c) θ = 45°, the projection P is a segment and it includes the points along both unit directions.
Proof. When S has an external representation, the result follows from Theorem 2.1 in [17] and
Result 1.
Results 1 and 2 apply to general hyperplanes in R^m. These properties of L1 projection are the basis for a new procedure for finding the (m−1)-dimensional subspace of best fit. We can find an L1-norm best-fit hyperplane by considering the residuals along each of the m unit directions.

L1 regression is a well-understood procedure for analyzing the dependence of one variable on other variables in a point set [20]. The L1 regression problem is to find a hyperplane that minimizes the sum of L1-norm distances from the points in a point set along the unit direction corresponding to the dependent variable. The designation of the dependent variable in a general point set is effectively arbitrary. Since an L1-norm best-fit hyperplane for a point set will have the property that all projections occur along the same unit direction, the problem reduces to finding the best L1 regression in each of the m unit directions. Charnes et al. [6] and Wagner [24] show that L1 linear regression can be solved by finding an optimal solution to a linear program. This realization about the relationship between L1 projection and L1 regression is the basis for a new procedure for solving the L1-norm best-fit hyperplane problem. Theorem 1 formalizes this result.
Theorem 1. Given a set of points xi ∈ R^m, i = 1, ..., n, an optimal solution to the L1-norm best-fit hyperplane problem is the hyperplane given by {x ∈ R^m | β*_0 + β*^T x = 0}, where (β*_0, β*) is an optimal solution to the LP defining R_j* and

    j* = argmin_{j=1,...,m} R_j(x1, ..., xn),   where

    R_j(x1, ..., xn) = min_{β0, β, e+, e−}  Σ_{i=1}^{n} (e+_i + e−_i)                  (4)

    subject to   β0 + β^T xi + e+_i − e−_i = 0,    i = 1, ..., n
                 βj = −1
                 e+_i, e−_i ≥ 0,                   i = 1, ..., n
Proof. Suppose that for a point set a different hyperplane attains a better L1 fit. By Result 2, we know that all points will project onto this hyperplane along a single unit direction corresponding to j′. The contradiction if j′ = j* is immediate by the optimality of (β*_0, β*). Similarly, j′ ≠ j* leads to a contradiction because R_j* would not have been minimal.
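The LP defining R_j in (4) translates line-for-line into code. This is an illustrative sketch assuming SciPy's linprog as the LP solver (in place of the CPLEX library the authors use); the function name l1_regression and the test data are ours:

```python
import numpy as np
from scipy.optimize import linprog

def l1_regression(X, j):
    """Solve (4): the best L1 regression hyperplane with column j dependent.

    Returns (beta0, beta, R_j); beta[j] == -1 by the normalization in (4).
    """
    n, m = X.shape
    # Variables are stacked as [beta0, beta (m), e_plus (n), e_minus (n)].
    c = np.concatenate([np.zeros(1 + m), np.ones(2 * n)])
    # beta0 + beta^T x_i + e+_i - e-_i = 0 for every point i
    A_eq = np.hstack([np.ones((n, 1)), X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * (1 + m) + [(0, None)] * (2 * n)
    bounds[1 + j] = (-1.0, -1.0)  # the normalization beta_j = -1
    res = linprog(c, A_eq=A_eq, b_eq=np.zeros(n), bounds=bounds)
    return res.x[0], res.x[1:1 + m], res.fun

# Three collinear points on y = 2x: the L1 fit is exact, so R = 0.
X = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]])
beta0, beta, R = l1_regression(X, 1)  # beta0 = 0, beta = (2, -1), R = 0
```

The LP has 2n + m + 1 variables and n equality constraints, matching the count used in the complexity discussion of Section 5.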
Theorem 1 is an instance of a more general result about hyperplane fitting using general norms in Minkowski spaces [18]. The idea for the L1 case is suggested by Zemel [25], but no formal proof is provided. Neither of these works implements and tests a procedure based on this result.

The proof of Theorem 1 implies that there exists a projection into S that has all of the properties of an optimal L1 regression hyperplane. Some of these properties are summarized in the following corollary.

Corollary 1. Given a set of points xi ∈ R^m, i = 1, ..., n, there exists a projection into an (m−1)-dimensional subspace S such that

1. the sum of L1 distances of points to S is minimized among all (m−1)-dimensional subspaces,
2. at least m−1 of the points lie in S [2],
3. the difference between the number of points on each side of S is at most m [2], and
4. the projection of points into S is maximum likelihood for errors following a joint distribution of m independent, identically distributed Laplace random variables [1].
Problem (1) is stated in terms of an internal representation of an affine set. Theorem 1 provides an externally-defined best-fit hyperplane. In order to satisfy the original requirements of the problem, we must calculate m−1 linearly independent vectors that span the optimal hyperplane. We can find an optimal matrix V by applying an orthogonalization procedure to (β*_0, β*) and m−1 additional linearly independent vectors in R^m.
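For the subspace case (β*_0 = 0), this orthogonalization amounts to computing a basis of the null space of β*^T. A minimal sketch, assuming SciPy's null_space as the numerical routine (the paper does not prescribe a particular one), with an illustrative normal vector:

```python
import numpy as np
from scipy.linalg import null_space

beta = np.array([1.0, -2.0, 1.0])    # hypothetical fitted normal vector in R^3
V = null_space(beta.reshape(1, -1))  # (3, 2): columns span {x | beta^T x = 0}
# The columns of V are orthonormal and orthogonal to beta, as required.
assert np.allclose(beta @ V, 0.0)
```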
5 Optimal L1 projection procedure

Theorem 1 motivates Algorithm 1 for calculating the L1 subspace of best fit.

The input to Algorithm 1 is n observations of m dimensions. The main loop in Steps 2-7 solves m LPs, each of which has 2n + m + 1 variables. Step 8 involves n(m−1) multiplications. In Step 9, the matrix V can be found by performing a singular value decomposition on the matrix whose rows are comprised of zi, i = 1, ..., n, which has complexity O(m²n + m³) [10]. Therefore, since LPs can be solved in polynomial time, the overall complexity of Algorithm 1 is polynomial.
6 Numerical Validation

Algorithm 1 is validated by comparing the results obtained for four instances to the results obtained with an industry-standard nonlinear programming solver. The formulation in (1) for p = 1 is recast as a mathematical program with a linear objective and 2mn nonlinear constraints and is submitted to KNITRO [4] via an algebraic modeling language (AMPL). The four point sets have dimensions m = 3, 5, 10, 20 with n = 10, 25, 50, and 25000 observations, respectively. The point sets are available online at http://www.people.vcu.edu/~jpbrooks/ProjEl/index.html.
Algorithm 1 Calculating the L1-norm best-fit hyperplane
Given points xi ∈ R^m, i = 1, ..., n.
1: Let R* = ∞
2: for j in 1, ..., m do
3:   Solve the LP in (4) to find the L1 regression hyperplane with the jth column representing the dependent variable and the remaining columns representing the independent variables. The optimal hyperplane has coefficients (β0, β) and error R_j.
4:   if R_j < R* then
5:     R* = R_j, j* = j, β*_0 = β0, β* = β
6:   end if
7: end for
8: For each xi, the optimal projection onto S is given by zi, i = 1, ..., n, where zij = xij for j ≠ j* and z_{ij*} = β*_0 + β*^T_{(j*)} x_{i(j*)}, where for a vector y, y_{(j)} is the vector created by removing the jth element.
9: S is defined by {x | V α = x for some α}, where the columns of V are vectors that span the zi's.
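Putting the pieces together, Algorithm 1 admits a compact end-to-end sketch. This is our illustrative reconstruction, not the authors' implementation: SciPy's linprog stands in for the CPLEX Callable Library, and all function and variable names are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def best_fit_hyperplane(X):
    """Algorithm 1: L1-norm best-fit (m-1)-dim subspace for the rows of X."""
    n, m = X.shape
    best = (np.inf, None, None)
    for j in range(m):  # Steps 2-7: one L1 regression LP per unit direction
        c = np.concatenate([np.zeros(1 + m), np.ones(2 * n)])
        A_eq = np.hstack([np.ones((n, 1)), X, np.eye(n), -np.eye(n)])
        bounds = [(None, None)] * (1 + m) + [(0, None)] * (2 * n)
        bounds[1 + j] = (-1.0, -1.0)  # column j is the dependent variable
        res = linprog(c, A_eq=A_eq, b_eq=np.zeros(n), bounds=bounds)
        if res.fun < best[0]:
            best = (res.fun, j, res.x[:1 + m])
    R_star, j_star, coef = best
    beta0, beta = coef[0], coef[1:]
    # Step 8: replace coordinate j* of each point; since beta[j*] = -1,
    # beta0 + X @ beta + X[:, j*] equals beta0 + sum over k != j* of beta_k x_ik.
    Z = X.copy()
    Z[:, j_star] = beta0 + X @ beta + X[:, j_star]
    # Step 9: spanning vectors of the z_i's via a singular value decomposition.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Vt[:m - 1].T, R_star  # columns span the fitted subspace

# Points lying exactly on the plane z = x + y are fitted with zero error.
X = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 2.0],
              [2.0, 0.0, 2.0], [-1.0, 1.0, 0.0]])
V, R = best_fit_hyperplane(X)
```

The returned V has orthonormal columns, so V V^T projects onto the fitted subspace; the loop cost is m LPs with 2n + m + 1 variables each, as in the complexity discussion of Section 5.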
Algorithm 1 is implemented using the ILOG CPLEX 11.1 Callable Library [11] for the solution
of LPs. The instances are solved on a machine with 3.2 GHz Intel Pentium D processors and 4
GB RAM. The first two instances are solved using the student version of KNITRO on the same
architecture. The remaining instances are solved with KNITRO via the NEOS Server [12] because of the limitations of the student version.
Table 1 summarizes the results of applying Algorithm 1 and KNITRO to the point sets. The objective
of this exercise is to verify that the procedure in Algorithm 1 produces a solution with the same
objective function value as an optimal solution to the original nonlinear best-fit problem formulated
directly from expression (1). The procedures obtain solutions with identical objective function values for the first three point sets. KNITRO was unable to solve the problem for the fourth point set
due to insufficient memory available at the host site.
Table 1: Performance of Algorithm 1 and KNITRO on Synthetic Point Sets

                  Algorithm 1              KNITRO
   m      n       Best        Solution     Optimal     Solution
                  Objective   Time (s)     Objective   Time (s)
   3      10      79.0        <1           79.0        <1
   5      25      27.1        <1           27.1        3.0
  10      50      51.4        <1           51.4        57.0
  20   25000      1987.3      2939.5       *           *

*No solution due to insufficient memory
Figure 2 depicts different plots of planes for the first point set of Table 1. Panels 2(a) and 2(b) display the regression planes when the dependent variables are x and y, respectively. The sums of the L1 deviations for the planes in 2(a) and 2(b) are 517.8 and 332.1, respectively. Panel 2(c) is the regression plane when the third variable is the dependent variable. The sum of the L1 deviations along this axis is 79.0 for this plane; therefore, by Theorem 1, it is an L1 best-fit plane for the points. Panel 2(d) shows the plane obtained when a nonlinear formulation based on (1) is solved using KNITRO. The planes in 2(c) and 2(d) are essentially the same; the differences can be attributed to numerical error. For Panels 2(a), 2(b), and 2(c), Properties 2 and 3 of Corollary 1 are verified.
7 Conclusion
In spite of all that is known about general projection theory and the identification of best-fit hyperplanes, the L1-norm best-fit hyperplane problem is still being treated as a difficult nonlinear optimization problem in application areas such as computer vision and statistics. With insights about L1 projection and L1 regression, the problem is surprisingly simple to solve. Two key insights into the geometry of L1 projections onto a hyperplane immediately suggest an algorithm for solving this problem: (1) an L1 projection occurs along a single unit direction, and (2) the direction of projection is independent of the location of the point. The algorithm calculates the L1 regression hyperplanes for each of the m dimensions in which the points reside and selects the one that minimizes the sum of the L1 distances. The algorithm is implemented and numerically validated. With this new algorithm, large-scale instances arising from multivariate statistics and computer vision are now easily solvable, and the elements are in place for a pure L1-based PCA procedure.
[Figure 2 panels: (a) L1 regression plane w.r.t. first coordinate; (b) L1 regression plane w.r.t. second coordinate; (c) L1 regression plane w.r.t. third coordinate (the optimal L1 fit); (d) the optimal KNITRO plane.]

Figure 2: Point set, fitted planes, and projection directions for the m = 3, n = 10 point set in Table 1.
References
[1] S. Agarwal, M.K. Chandraker, F. Kahl, D. Kriegman, and S. Belongie. Practical global optimization for multiview geometry. Lecture Notes in Computer Science, 3951:592–605, 2006.
[2] G. Appa and C. Smith. On L1 and Chebyshev estimation. Mathematical Programming, 5:73–87, 1973.
[3] A. Baccini, P. Besse, and A. de Falguerolles. A L1-norm PCA and heuristic approach. In
Proceedings of the International Conference on Ordinal and Symbolic Data Analysis, pages
359–368, 1996.
[4] R.H. Byrd, J. Nocedal, and R.A. Waltz. Large-Scale Nonlinear Optimization, Eds. G. di Pillo
and M. Roma, chapter KNITRO: An integrated package for nonlinear optimization, pages
35–59. Springer Verlag, 2006.
[5] N.A. Campbell. Robust procedures in multivariate analysis I: Robust covariance estimation.
Applied Statistics, 29:231–237, 1980.
[6] A. Charnes, W.W. Cooper, and R.O. Ferguson. Optimal estimation of executive compensation
by linear programming. Management Science, 1:138–150, 1955.
[7] V. Choulakian. L1-norm projection pursuit principal component analysis. Computational
Statistics and Data Analysis, 50:1441–1451, 2006.
[8] J.S. Galpin and D.M. Hawkins. Methods of L1 estimation of a covariance matrix. Computational
Statistics and Data Analysis, 5:305–319, 1987.
[9] J. Gao. Robust L1 principal component analysis and its Bayesian variational inference. Neural
Computation, 20:555–572, 2008.
[10] G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins University Press,
Baltimore, MD, 1983.
[11] ILOG. ILOG CPLEX Division. 889 Alder Avenue, Incline Village, Nevada, 2009.
[12] J. Czyzyk, M. Mesnier, and J. Moré. The NEOS Server. IEEE Computational Science and Engineering, 5:68–75, 1998.
[13] I.T. Jolliffe. Principal Component Analysis. Springer, 2nd edition, 2002.
[14] Q. Ke and T. Kanade. Robust subspace computation using L1 norm. Technical Report CMU-
CS-03-172, Carnegie Mellon University, Pittsburgh, PA, 2003.
[15] Q. Ke and T. Kanade. Robust L1 norm factorization in the presence of outliers and missing data
by alternative convex programming. In IEEE Conference on Computer Vision and Pattern
Recognition, 2005.
[16] N. Kwak. Principal component analysis based on L1-norm maximization. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 30:1672–1680, 2008.
[17] O.L. Mangasarian. Arbitrary-norm separating plane. Operations Research Letters, 24:15–23,
1999.
[18] H. Martini and A. Schöbel. Median hyperplanes in normed spaces - a survey. Discrete Applied
Mathematics, 89:181–195, 1998.
[19] K.G. Murty. Linear Programming. Wiley, 1983.
[20] S.C. Narula and J.F. Wellington. The minimum sum of absolute errors regression: A state of
the art survey. International Statistical Review, 50:317–326, 1982.
[21] K. Pearson. On lines and planes of closest fit to systems of points in space. Philosophical
Magazine, 2:559–572, 1901.
[22] R.T. Rockafellar. Convex analysis. Princeton University Press, 1970.
[23] H. Späth and G.A. Watson. On orthogonal linear L1 approximation. Numerische Mathematik,
51:531–543, 1987.
[24] W.H. Wagner. Linear programming techniques for regression analysis. Journal of the American
Statistical Association, 54:206–212, 1959.
[25] E. Zemel. An O(n) algorithm for the linear multiple choice knapsack problem and related
problems. Information Processing Letters, 18:123–128, 1984.