-
ABSTRACT: We provide fast algorithms for overconstrained $\ell_p$ regression and
related problems: for an $n\times d$ input matrix $A$ and vector $b\in\R^n$, in
$O(nd\log n)$ time we reduce the problem $\min_{x\in\R^d} \norm{Ax-b}_p$ to the
same problem with input matrix $\tilde A$ of dimension $s \times d$ and
corresponding $\tilde b$ of dimension $s\times 1$. Here, $\tilde A$ and $\tilde
b$ are a coreset for the problem, consisting of sampled and rescaled rows of
$A$ and $b$; and $s$ is independent of $n$ and polynomial in $d$. Our results
improve on the best previous algorithms when $n\gg d$, for all $p\in
[1,\infty)$ except $p=2$. We also provide a suite of improved results for
finding well-conditioned bases via ellipsoidal rounding, illustrating tradeoffs
between running time and conditioning quality, including a one-pass
conditioning algorithm for general $\ell_p$ problems.
We also provide an empirical evaluation of implementations of our algorithms
for $p=1$, comparing them with related algorithms. Our empirical results
clearly show that, in the asymptotic regime, the theory is a very good guide to
the practical performance of these algorithms. Our algorithms use our faster
constructions of well-conditioned bases for $\ell_p$ spaces and, for $p=1$, a
fast subspace embedding of independent interest that we call the Fast Cauchy
Transform: a distribution over matrices $\Pi: \R^n\mapsto \R^{O(d\log d)}$,
found obliviously to $A$, that approximately preserves the $\ell_1$ norms: that
is, with large probability, simultaneously for all $x$, $\norm{Ax}_1 \approx
\norm{\Pi Ax}_1$, with distortion $O(d^{2+\eta})$, for an arbitrarily small
constant $\eta>0$; and, moreover, $\Pi A$ can be computed in $O(nd\log d)$
time. The techniques underlying our Fast Cauchy Transform include fast
Johnson-Lindenstrauss transforms, low-coherence matrices, and rescaling by
Cauchy random variables.
07/2012;
-
ABSTRACT: We study the topic of dimensionality reduction methods for k-means
clustering. Dimensionality reduction encompasses the union of two approaches:
feature selection and feature extraction. First, feature selection selects a
small subset of actual features from the data and then runs the clustering
algorithm only on the selected features. Second, feature extraction constructs
a small set of new artificial features and then runs the clustering algorithm
only on the constructed features. Despite the significance of the problem as
well as the wealth of heuristic methods addressing it, there exist no provably
accurate feature selection methods. On the other hand, two provably accurate
feature extraction methods for k-means exist: the first one is randomized and
is based on Random Projections; the other is deterministic and is based on
the Singular Value Decomposition.
This paper addresses this shortcoming by presenting the first provably
accurate feature selection method for k-means clustering. We also present two
novel feature extraction methods: the first one is based on Random Projections
and improves the existing result in terms of speed and number of features
needed to be extracted; the other is based on fast approximate SVD
factorizations and improves the existing result in terms of speed. All three
methods of our work are randomized and, with constant probability, provide
constant-factor approximation guarantees with respect to the optimal k-means
objective value.
10/2011;
-
ABSTRACT: The statistical leverage scores of a matrix $A$ are the squared row-norms of
the matrix containing its (top) left singular vectors and the coherence is the
largest leverage score. These quantities are of interest in recently-popular
problems such as matrix completion and Nystr\"{o}m-based low-rank matrix
approximation as well as in large-scale statistical data analysis applications
more generally; moreover, they are of interest since they define the key
structural nonuniformity that must be dealt with in developing fast randomized
matrix algorithms. Our main result is a randomized algorithm that takes as
input an arbitrary $n \times d$ matrix $A$, with $n \gg d$, and that returns as
output relative-error approximations to all $n$ of the statistical leverage
scores. The proposed algorithm runs (under assumptions on the precise values of
$n$ and $d$) in $O(n d \log n)$ time, as opposed to the $O(nd^2)$ time required
by the na\"{i}ve algorithm that involves computing an orthogonal basis for the
range of $A$. Our analysis may be viewed in terms of computing a relative-error
approximation to an underconstrained least-squares approximation problem, or,
relatedly, it may be viewed as an application of Johnson-Lindenstrauss type
ideas. Several practically-important extensions of our basic result are also
described, including the approximation of so-called cross-leverage scores, the
extension of these ideas to matrices with $n \approx d$, and the extension to
streaming environments.
09/2011;
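For orientation, the naive baseline that the abstract above compares against can be sketched in a few lines of NumPy. This only illustrates the definition (squared row norms of an orthonormal basis for the range of $A$); it is not the randomized $O(nd\log n)$ algorithm of the paper. The function name `leverage_scores` and the random test matrix are assumptions made for the example.

```python
import numpy as np

def leverage_scores(A):
    """Exact leverage scores of a tall matrix A with full column rank.

    Naive O(n d^2) baseline: orthogonalize A with a thin QR factorization
    and take the squared Euclidean norm of each row of Q.
    """
    Q, _ = np.linalg.qr(A, mode="reduced")
    return np.sum(Q * Q, axis=1)

# Hypothetical test matrix (n = 200 rows, d = 5 columns).
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 5))

scores = leverage_scores(A)
coherence = scores.max()  # the coherence is the largest leverage score
```

For a matrix with full column rank the scores lie in $[0,1]$ and sum to $d$, which gives a quick sanity check on any approximation scheme.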
-
ABSTRACT: The linkage disequilibrium structure of the human genome allows identification of small sets of single nucleotide polymorphisms (SNPs) (tSNPs) that efficiently represent dense sets of markers. This structure can be translated into linear algebraic terms as evidenced by the well documented principal components analysis (PCA)-based methods. Here we apply, for the first time, PCA-based methodology for efficient genomewide tSNP selection; and explore the linear algebraic structure of the human genome. Our algorithm divides the genome into contiguous nonoverlapping windows of high linear structure. Coupling this novel window definition with a PCA-based tSNP selection method, we analyze 2.5 million SNPs from the HapMap phase 2 dataset. We show that 10-25% of these SNPs suffice to predict the remaining genotypes with over 95% accuracy. A comparison with other popular methods in the ENCODE regions indicates significant genotyping savings. We evaluate the portability of genome-wide tSNPs across a diverse set of populations (HapMap phase 3 dataset). Interestingly, African populations are good reference populations for the rest of the world. Finally, we demonstrate the applicability of our approach in a real genome-wide disease association study. The chosen tSNP panels can be used toward genotype imputation using either a simple regression-based algorithm or more sophisticated genotype imputation methods.
Annals of Human Genetics 09/2011; 75(6):707-22. · 2.57 Impact Factor
-
ABSTRACT: Recent work in theoretical computer science and scientific computing has focused on nearly-linear-time algorithms for solving systems of linear equations. While introducing several novel theoretical perspectives, this work has yet to lead to practical algorithms. In an effort to bridge this gap, we describe in this paper two related results. Our first and main result is a simple algorithm to approximate the solution to a set of linear equations defined by a Laplacian (for a graph $G$ with $n$ nodes and $m \le n^2$ edges) constraint matrix. The algorithm is a non-recursive algorithm; even though it runs in $O(n^2 \cdot \mathrm{polylog}(n))$ time rather than $O(m \cdot \mathrm{polylog}(n))$ time (given an oracle for the so-called statistical leverage scores), it is extremely simple; and it can be used to compute an approximate solution with a direct solver. In light of this result, our second result is a straightforward connection between the concept of graph resistance (which has proven useful in recent algorithms for linear equation solvers) and the concept of statistical leverage (which has proven useful in numerically-implementable randomized algorithms for large matrix problems and which has a natural data-analytic interpretation). Comment: 16 pages
05/2010;
-
ABSTRACT: Detecting clusters or communities in large real-world graphs such as large social or information networks is a problem of considerable interest. In practice, one typically chooses an objective function that captures the intuition of a network cluster as a set of nodes with better internal connectivity than external connectivity, and then one applies approximation algorithms or heuristics to extract sets of nodes that are related to the objective function and that "look like" good communities for the application of interest. In this paper, we explore a range of network community detection methods in order to compare them and to understand their relative performance and the systematic biases in the clusters they identify. We evaluate several common objective functions that are used to formalize the notion of a network community, and we examine several different classes of approximation algorithms that aim to optimize such objective functions. In addition, rather than simply fixing an objective and asking for an approximation to the best cluster of any size, we consider a size-resolved version of the optimization problem. Considering community quality as a function of its size provides a much finer lens with which to examine community detection algorithms, since objective functions and approximation algorithms often have non-obvious size-dependent behavior.
04/2010;
-
ABSTRACT: Principal components analysis and, more generally, the Singular Value Decomposition are fundamental data analysis tools that express a data matrix in terms of a sequence of orthogonal or uncorrelated vectors of decreasing importance. Unfortunately, being linear combinations of up to all the data points, these vectors are notoriously difficult to interpret in terms of the data and processes generating the data. In this article, we develop CUR matrix decompositions for improved data analysis. CUR decompositions are low-rank matrix decompositions that are explicitly expressed in terms of a small number of actual columns and/or actual rows of the data matrix. Because they are constructed from actual data elements, CUR decompositions are interpretable by practitioners of the field from which the data are drawn (to the extent that the original data are). We present an algorithm that preferentially chooses columns and rows that exhibit high "statistical leverage" and, thus, in a very precise statistical sense, exert a disproportionately large "influence" on the best low-rank fit of the data matrix. By selecting columns and rows in this manner, we obtain improved relative-error and constant-factor approximation guarantees in worst-case analysis, as opposed to the much coarser additive-error guarantees of prior work. In addition, since the construction involves computing quantities with a natural and widely studied statistical interpretation, we can leverage ideas from diagnostic regression analysis to employ these matrix decompositions for exploratory data analysis.
Proceedings of the National Academy of Sciences 02/2009; 106(3):697-702. · 9.68 Impact Factor
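The leverage-based selection described in the abstract above can be illustrated with a simplified sketch. The assumptions are loud ones: columns and rows are drawn without replacement with probabilities proportional to their leverage scores relative to the top-$k$ singular subspace, and $U$ is taken as $C^+ A R^+$, one common choice. The function names, the sample sizes, and the synthetic rank-3 matrix are hypothetical, and the sampling rule may differ from the one analyzed in the article.

```python
import numpy as np

def column_leverage(A, k):
    """Normalized leverage scores of the columns of A relative to the
    top-k right singular subspace (they sum to 1 when rank(A) >= k)."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return np.sum(Vt[:k] ** 2, axis=0) / k

def cur(A, k, c, r, rng):
    """Pick c columns and r rows by leverage sampling and return C, U, R."""
    col_p = column_leverage(A, k)
    row_p = column_leverage(A.T, k)  # row scores come from the transpose
    cols = rng.choice(A.shape[1], size=c, replace=False, p=col_p)
    rows = rng.choice(A.shape[0], size=r, replace=False, p=row_p)
    C, R = A[:, cols], A[rows, :]
    # Explicit cutoff so that numerically zero singular values are dropped.
    U = np.linalg.pinv(C, rcond=1e-10) @ A @ np.linalg.pinv(R, rcond=1e-10)
    return C, U, R

# Hypothetical exactly rank-3 data matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 40))

C, U, R = cur(A, k=3, c=10, r=10, rng=rng)
rel_err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
```

On an exactly rank-$k$ matrix the product $CUR$ reproduces $A$ up to rounding whenever the chosen columns and rows span its column and row spaces; on noisy data the error depends on how many columns and rows are kept.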
-
SIAM J. Comput. 01/2009; 38:2060-2078.
-
Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009. Proceedings of a meeting held 7-10 December 2009, Vancouver, British Columbia, Canada.; 01/2009
-
ABSTRACT: We consider the problem of selecting the best subset of exactly $k$ columns from an $m \times n$ matrix $A$. We present and analyze a novel two-stage algorithm that runs in $O(\min\{mn^2,m^2n\})$ time and returns as output an $m \times k$ matrix $C$ consisting of exactly $k$ columns of $A$. In the first (randomized) stage, the algorithm randomly selects $\Theta(k \log k)$ columns according to a judiciously-chosen probability distribution that depends on information in the top-$k$ right singular subspace of $A$. In the second (deterministic) stage, the algorithm applies a deterministic column-selection procedure to select and return exactly $k$ columns from the set of columns selected in the first stage. Let $C$ be the $m \times k$ matrix containing those $k$ columns, let $P_C$ denote the projection matrix onto the span of those columns, and let $A_k$ denote the best rank-$k$ approximation to the matrix $A$. Then, we prove that, with probability at least 0.8, $$ \FNorm{A - P_CA} \leq \Theta(k \log^{1/2} k) \FNorm{A-A_k}. $$ This Frobenius norm bound is only a factor of $\sqrt{k \log k}$ worse than the best previously existing existential result and is roughly $O(\sqrt{k!})$ better than the best previous algorithmic result for the Frobenius norm version of this Column Subset Selection Problem (CSSP). We also prove that, with probability at least 0.8, $$ \TNorm{A - P_CA} \leq \Theta(k \log^{1/2} k)\TNorm{A-A_k} + \Theta(k^{3/4}\log^{1/4}k)\FNorm{A-A_k}. $$ This spectral norm bound is not directly comparable to the best previously existing bounds for the spectral norm version of this CSSP. Our bound depends on $\FNorm{A-A_k}$, whereas previous results depend on $\sqrt{n-k}\TNorm{A-A_k}$; if these two quantities are comparable, then our bound is asymptotically worse by a $(k \log k)^{1/4}$ factor. Comment: 17 pages; corrected a bug in the spectral norm bound of the previous version
12/2008;
-
ABSTRACT: A large body of work has been devoted to defining and identifying clusters or communities in social and information networks. We explore from a novel perspective several questions related to identifying meaningful communities in large social and information networks, and we come to several striking conclusions. We employ approximation algorithms for the graph partitioning problem to characterize as a function of size the statistical and structural properties of partitions of graphs that could plausibly be interpreted as communities. In particular, we define the network community profile plot, which characterizes the "best" possible community--according to the conductance measure--over a wide range of size scales. We study over 100 large real-world social and information networks. Our results suggest a significantly more refined picture of community structure in large networks than has been appreciated previously. In particular, we observe tight communities that are barely connected to the rest of the network at very small size scales; and communities of larger size scales gradually "blend into" the expander-like core of the network and thus become less "community-like." This behavior is not explained, even at a qualitative level, by any of the commonly-used network generation models. Moreover, it is exactly the opposite of what one would expect based on intuition from expander graphs, low-dimensional or manifold-like graphs, and from small social networks that have served as testbeds of community detection algorithms. We have found that a generative graph model, in which new edges are added via an iterative "forest fire" burning process, is able to produce graphs exhibiting a network community profile plot similar to what we observe in our network datasets.
11/2008;
-
Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24-27, 2008; 01/2008
-
Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2008, San Francisco, California, USA, January 20-22, 2008; 01/2008
-
Random Struct. Algorithms. 01/2008; 32:307-333.
-
SIAM J. Matrix Analysis Applications. 01/2008; 30:957-987.
-
ABSTRACT: We consider, both theoretically and empirically, the problem of selecting the "best" subset of exactly k columns from an m × n data matrix A. From a theoretical perspective, we present and analyze a novel two-stage algorithm. In the first phase, the algorithm randomly selects O(k log k) columns according to a judiciously-chosen probability distribution that depends on information in the top-k right singular subspace of A. In the second stage the algorithm applies a deterministic column-selection procedure to select and return exactly k columns from the set of columns selected in the first phase. Let C be the m × k matrix containing those k columns, let $P_C$ denote the projection matrix onto the span of those columns, and let $A_k$ denote the "best" rank-k approximation to the matrix A as computed with the singular value decomposition. Then, we prove that $\|A - P_C A\|_2 \le O\left(k^{3/4} \log^{1/2}(k)\,(n-k)^{1/4}\right) \|A - A_k\|_2$, with probability at least $1 - 10^{-20}$. For small to moderate values of k, this improves upon the best previously-existing result (of Gu and Eisenstat [20]) for this Column Subset Selection Problem. From an empirical perspective, we evaluate this algorithm as an unsupervised feature selection strategy in three application domains of modern statistical data analysis: finance, document-term data, and genetics. We pay particular attention to how this algorithm may be used to select representative or landmark features from an object-feature matrix in an unsupervised manner. In all three application domains, we are able to identify k landmark features, i.e., columns of the data matrix, that capture nearly the same amount of information as does the subspace that is spanned by the top k "eigenfeatures." Moreover, in cases where the original data matrix clusters well in the best k-dimensional space, e.g., as measured by k-means clustering, we also find that it clusters well in the space spanned by the features our algorithm chooses.
11/2007;
-
ABSTRACT: Least squares approximation is a technique to find an approximate solution to a system of linear equations that has no exact solution. In a typical setting, one lets $n$ be the number of constraints and $d$ be the number of variables, with $n \gg d$. Then, existing exact methods find a solution vector in $O(nd^2)$ time. We present two randomized algorithms that provide very accurate relative-error approximations to the optimal value and the solution vector of a least squares approximation problem more rapidly than existing exact algorithms. Both of our algorithms preprocess the data with the Randomized Hadamard Transform. One then uniformly randomly samples constraints and solves the smaller problem on those constraints, and the other performs a sparse random projection and solves the smaller problem on those projected coordinates. In both cases, solving the smaller problem provides relative-error approximations, and, if $n$ is sufficiently larger than $d$, the approximate solution can be computed in $O(nd \log d)$ time. Comment: 25 pages; minor changes from previous version; this version will appear in Numerische Mathematik
10/2007;
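The sampling variant described in the abstract above (randomized Hadamard preprocessing followed by uniform row sampling) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the number of constraints is assumed to be a power of two, the sample size `r` is picked by hand rather than from the analysis, and the names `fwht` and `sketched_lstsq` are hypothetical.

```python
import numpy as np

def fwht(X):
    """Orthonormal fast Walsh-Hadamard transform along axis 0.
    The length of axis 0 must be a power of two."""
    X = np.array(X, dtype=float)  # work on a copy
    n = X.shape[0]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = X[i:i + h].copy()
            b = X[i + h:i + 2 * h].copy()
            X[i:i + h] = a + b
            X[i + h:i + 2 * h] = a - b
        h *= 2
    return X / np.sqrt(n)

def sketched_lstsq(A, b, r, rng):
    """Approximate least squares: random sign flips, Hadamard transform,
    then an ordinary solve on r uniformly sampled constraints."""
    n = A.shape[0]
    signs = rng.choice([-1.0, 1.0], size=n)
    HDA = fwht(signs[:, None] * A)
    HDb = fwht(signs * b)
    rows = rng.choice(n, size=r)  # uniform sampling with replacement
    x, *_ = np.linalg.lstsq(HDA[rows], HDb[rows], rcond=None)
    return x

# Hypothetical overconstrained problem with n = 1024 and d = 5.
rng = np.random.default_rng(0)
n, d = 1024, 5
A = rng.standard_normal((n, d))
b = A @ np.ones(d) + 0.1 * rng.standard_normal(n)

x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)
x_hat = sketched_lstsq(A, b, r=200, rng=rng)
ratio = np.linalg.norm(A @ x_hat - b) / np.linalg.norm(A @ x_opt - b)
```

The quantity `ratio` compares the residual of the sketched solution with the optimal residual; it is never below 1, and a relative-error guarantee corresponds to it being at most $1+\epsilon$. The second variant in the abstract, which uses a sparse random projection instead of sampling, is not shown here.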
-
ABSTRACT: Existing methods to ascertain small sets of markers for the identification of human population structure require prior knowledge of individual ancestry. Based on Principal Components Analysis (PCA), and recent results in theoretical computer science, we present a novel algorithm that, applied on genomewide data, selects small subsets of SNPs (PCA-correlated SNPs) to reproduce the structure found by PCA on the complete dataset, without use of ancestry information. Evaluating our method on a previously described dataset (10,805 SNPs, 11 populations), we demonstrate that a very small set of PCA-correlated SNPs can be effectively employed to assign individuals to particular continents or populations, using a simple clustering algorithm. We validate our methods on the HapMap populations and achieve perfect intercontinental differentiation with 14 PCA-correlated SNPs. The Chinese and Japanese populations can be easily differentiated using less than 100 PCA-correlated SNPs ascertained after evaluating 1.7 million SNPs from HapMap. We show that, in general, structure informative SNPs are not portable across geographic regions. However, we manage to identify a general set of 50 PCA-correlated SNPs that effectively assigns individuals to one of nine different populations. Compared to analysis with the measure of informativeness, our methods, although unsupervised, achieved similar results. We proceed to demonstrate that our algorithm can be effectively used for the analysis of admixed populations without having to trace the origin of individuals. Analyzing a Puerto Rican dataset (192 individuals, 7,257 SNPs), we show that PCA-correlated SNPs can be used to successfully predict structure and ancestry proportions. We subsequently validate these SNPs for structure identification in an independent Puerto Rican dataset. 
The algorithm that we introduce runs in seconds and can be easily applied on large genome-wide datasets, facilitating the identification of population substructure, stratification assessment in multi-stage whole-genome association studies, and the study of demographic history in human populations.
PLoS Genetics 10/2007; 3(9):1672-86. · 8.69 Impact Factor
-
ABSTRACT: Many data analysis applications deal with large matrices and involve approximating the matrix using a small number of ``components.'' Typically, these components are linear combinations of the rows and columns of the matrix, and are thus difficult to interpret in terms of the original features of the input data. In this paper, we propose and study matrix approximations that are explicitly expressed in terms of a small number of columns and/or rows of the data matrix, and thereby more amenable to interpretation in terms of the original data. Our main algorithmic results are two randomized algorithms which take as input an $m \times n$ matrix $A$ and a rank parameter $k$. In our first algorithm, $C$ is chosen, and we let $A'=CC^+A$, where $C^+$ is the Moore-Penrose generalized inverse of $C$. In our second algorithm $C$, $U$, $R$ are chosen, and we let $A'=CUR$. ($C$ and $R$ are matrices that consist of actual columns and rows, respectively, of $A$, and $U$ is a generalized inverse of their intersection.) For each algorithm, we show that with probability at least $1-\delta$: $$ ||A-A'||_F \leq (1+\epsilon) ||A-A_k||_F, $$ where $A_k$ is the ``best'' rank-$k$ approximation provided by truncating the singular value decomposition (SVD) of $A$. The number of columns of $C$ and rows of $R$ is a low-degree polynomial in $k$, $1/\epsilon$, and $\log(1/\delta)$. Our two algorithms are the first polynomial time algorithms for such low-rank matrix approximations that come with relative-error guarantees; previously, in some cases, it was not even known whether such matrix decompositions exist. Both of our algorithms are simple, they take time of the order needed to approximately compute the top $k$ singular vectors of $A$, and they use a novel, intuitive sampling method called ``subspace sampling.''
09/2007;
-
ABSTRACT: The Lp regression problem takes as input a matrix $A \in \Real^{n \times d}$, a vector $b \in \Real^n$, and a number $p \in [1,\infty)$, and it returns as output a number ${\cal Z}$ and a vector $x_{opt} \in \Real^d$ such that ${\cal Z} = \min_{x \in \Real^d} ||Ax -b||_p = ||Ax_{opt}-b||_p$. In this paper, we construct coresets and obtain an efficient two-stage sampling-based approximation algorithm for the very overconstrained ($n \gg d$) version of this classical problem, for all $p \in [1, \infty)$. The first stage of our algorithm non-uniformly samples $\hat{r}_1 = O(36^p d^{\max\{p/2+1, p\}+1})$ rows of $A$ and the corresponding elements of $b$, and then it solves the Lp regression problem on the sample; we prove this is an 8-approximation. The second stage of our algorithm uses the output of the first stage to resample $\hat{r}_1/\epsilon^2$ constraints, and then it solves the Lp regression problem on the new sample; we prove this is a $(1+\epsilon)$-approximation. Our algorithm unifies, improves upon, and extends the existing algorithms for special cases of Lp regression, namely $p = 1,2$. In the course of proving our result, we develop two concepts--well-conditioned bases and subspace-preserving sampling--that are of independent interest.
08/2007;