Michael W. Mahoney

Stanford University, Stanford, CA, USA

Publications (41) · 34.55 Total impact

  • Article: The Fast Cauchy Transform and Faster Robust Linear Regression
    ABSTRACT: We provide fast algorithms for overconstrained $\ell_p$ regression and related problems: for an $n\times d$ input matrix $A$ and vector $b\in\R^n$, in $O(nd\log n)$ time we reduce the problem $\min_{x\in\R^d} \norm{Ax-b}_p$ to the same problem with input matrix $\tilde A$ of dimension $s \times d$ and corresponding $\tilde b$ of dimension $s\times 1$. Here, $\tilde A$ and $\tilde b$ are a coreset for the problem, consisting of sampled and rescaled rows of $A$ and $b$; and $s$ is independent of $n$ and polynomial in $d$. Our results improve on the best previous algorithms when $n\gg d$, for all $p\in [1,\infty)$ except $p=2$. We also provide a suite of improved results for finding well-conditioned bases via ellipsoidal rounding, illustrating tradeoffs between running time and conditioning quality, including a one-pass conditioning algorithm for general $\ell_p$ problems. We also provide an empirical evaluation of implementations of our algorithms for $p=1$, comparing them with related algorithms. Our empirical results clearly show that, in the asymptotic regime, the theory is a very good guide to the practical performance of these algorithms. Our algorithms use our faster constructions of well-conditioned bases for $\ell_p$ spaces and, for $p=1$, a fast subspace embedding of independent interest that we call the Fast Cauchy Transform: a distribution over matrices $\Pi: \R^n\mapsto \R^{O(d\log d)}$, found obliviously to $A$, that approximately preserves the $\ell_1$ norms: that is, with large probability, simultaneously for all $x$, $\norm{Ax}_1 \approx \norm{\Pi Ax}_1$, with distortion $O(d^{2+\eta})$, for an arbitrarily small constant $\eta>0$; and, moreover, $\Pi A$ can be computed in $O(nd\log d)$ time. The techniques underlying our Fast Cauchy Transform include fast Johnson-Lindenstrauss transforms, low-coherence matrices, and rescaling by Cauchy random variables.
    07/2012;
  • Article: Stochastic Dimensionality Reduction for K-means Clustering
    ABSTRACT: We study dimensionality reduction methods for k-means clustering. Dimensionality reduction encompasses two approaches: feature selection and feature extraction. Feature selection selects a small subset of actual features from the data and then runs the clustering algorithm only on the selected features. Feature extraction constructs a small set of new artificial features and then runs the clustering algorithm only on the constructed features. Despite the significance of the problem and the wealth of heuristic methods addressing it, there exist no provably accurate feature selection methods. On the other hand, two provably accurate feature extraction methods for k-means exist: the first is randomized and is based on Random Projections; the other is deterministic and is based on the Singular Value Decomposition. This paper addresses this shortcoming by presenting the first provably accurate feature selection method for k-means clustering. We also present two novel feature extraction methods: the first is based on Random Projections and improves the existing result in terms of speed and the number of features that need to be extracted; the other is based on fast approximate SVD factorizations and improves the existing result in terms of speed. All three of our methods are randomized and, with constant probability, provide constant-factor approximation guarantees with respect to the optimal k-means objective value.
    10/2011;
  • Article: Fast approximation of matrix coherence and statistical leverage
    ABSTRACT: The statistical leverage scores of a matrix $A$ are the squared row-norms of the matrix containing its (top) left singular vectors and the coherence is the largest leverage score. These quantities are of interest in recently-popular problems such as matrix completion and Nyström-based low-rank matrix approximation as well as in large-scale statistical data analysis applications more generally; moreover, they are of interest since they define the key structural nonuniformity that must be dealt with in developing fast randomized matrix algorithms. Our main result is a randomized algorithm that takes as input an arbitrary $n \times d$ matrix $A$, with $n \gg d$, and that returns as output relative-error approximations to all $n$ of the statistical leverage scores. The proposed algorithm runs (under assumptions on the precise values of $n$ and $d$) in $O(n d \log n)$ time, as opposed to the $O(nd^2)$ time required by the naïve algorithm that involves computing an orthogonal basis for the range of $A$. Our analysis may be viewed in terms of computing a relative-error approximation to an underconstrained least-squares approximation problem, or, relatedly, it may be viewed as an application of Johnson-Lindenstrauss type ideas. Several practically-important extensions of our basic result are also described, including the approximation of so-called cross-leverage scores, the extension of these ideas to matrices with $n \approx d$, and the extension to streaming environments.
    09/2011;
  • Article: Efficient genomewide selection of PCA-correlated tSNPs for genotype imputation.
    ABSTRACT: The linkage disequilibrium structure of the human genome allows identification of small sets of single nucleotide polymorphisms (SNPs) (tSNPs) that efficiently represent dense sets of markers. This structure can be translated into linear algebraic terms as evidenced by the well documented principal components analysis (PCA)-based methods. Here we apply, for the first time, PCA-based methodology for efficient genomewide tSNP selection; and explore the linear algebraic structure of the human genome. Our algorithm divides the genome into contiguous nonoverlapping windows of high linear structure. Coupling this novel window definition with a PCA-based tSNP selection method, we analyze 2.5 million SNPs from the HapMap phase 2 dataset. We show that 10-25% of these SNPs suffice to predict the remaining genotypes with over 95% accuracy. A comparison with other popular methods in the ENCODE regions indicates significant genotyping savings. We evaluate the portability of genome-wide tSNPs across a diverse set of populations (HapMap phase 3 dataset). Interestingly, African populations are good reference populations for the rest of the world. Finally, we demonstrate the applicability of our approach in a real genome-wide disease association study. The chosen tSNP panels can be used toward genotype imputation using either a simple regression-based algorithm or more sophisticated genotype imputation methods.
    Annals of Human Genetics 09/2011; 75(6):707-22. · 2.57 Impact Factor
  • Article: Effective Resistances, Statistical Leverage, and Applications to Linear Equation Solving
    Petros Drineas, Michael W. Mahoney
    ABSTRACT: Recent work in theoretical computer science and scientific computing has focused on nearly-linear-time algorithms for solving systems of linear equations. While introducing several novel theoretical perspectives, this work has yet to lead to practical algorithms. In an effort to bridge this gap, we describe in this paper two related results. Our first and main result is a simple algorithm to approximate the solution to a set of linear equations defined by a Laplacian (for a graph $G$ with $n$ nodes and $m \le n^2$ edges) constraint matrix. The algorithm is a non-recursive algorithm; even though it runs in $O(n^2 \cdot \mathrm{polylog}(n))$ time rather than $O(m \cdot \mathrm{polylog}(n))$ time (given an oracle for the so-called statistical leverage scores), it is extremely simple; and it can be used to compute an approximate solution with a direct solver. In light of this result, our second result is a straightforward connection between the concept of graph resistance (which has proven useful in recent algorithms for linear equation solvers) and the concept of statistical leverage (which has proven useful in numerically-implementable randomized algorithms for large matrix problems and which has a natural data-analytic interpretation). Comment: 16 pages
    05/2010;
  • Article: Empirical Comparison of Algorithms for Network Community Detection
    Jure Leskovec, Kevin J. Lang, Michael W. Mahoney
    ABSTRACT: Detecting clusters or communities in large real-world graphs such as large social or information networks is a problem of considerable interest. In practice, one typically chooses an objective function that captures the intuition of a network cluster as a set of nodes with better internal connectivity than external connectivity, and then one applies approximation algorithms or heuristics to extract sets of nodes that are related to the objective function and that "look like" good communities for the application of interest. In this paper, we explore a range of network community detection methods in order to compare them and to understand their relative performance and the systematic biases in the clusters they identify. We evaluate several common objective functions that are used to formalize the notion of a network community, and we examine several different classes of approximation algorithms that aim to optimize such objective functions. In addition, rather than simply fixing an objective and asking for an approximation to the best cluster of any size, we consider a size-resolved version of the optimization problem. Considering community quality as a function of its size provides a much finer lens with which to examine community detection algorithms, since objective functions and approximation algorithms often have non-obvious size-dependent behavior.
    04/2010;
  • Article: CUR matrix decompositions for improved data analysis.
    Michael W Mahoney, Petros Drineas
    ABSTRACT: Principal components analysis and, more generally, the Singular Value Decomposition are fundamental data analysis tools that express a data matrix in terms of a sequence of orthogonal or uncorrelated vectors of decreasing importance. Unfortunately, being linear combinations of up to all the data points, these vectors are notoriously difficult to interpret in terms of the data and processes generating the data. In this article, we develop CUR matrix decompositions for improved data analysis. CUR decompositions are low-rank matrix decompositions that are explicitly expressed in terms of a small number of actual columns and/or actual rows of the data matrix. Because they are constructed from actual data elements, CUR decompositions are interpretable by practitioners of the field from which the data are drawn (to the extent that the original data are). We present an algorithm that preferentially chooses columns and rows that exhibit high "statistical leverage" and, thus, in a very precise statistical sense, exert a disproportionately large "influence" on the best low-rank fit of the data matrix. By selecting columns and rows in this manner, we obtain improved relative-error and constant-factor approximation guarantees in worst-case analysis, as opposed to the much coarser additive-error guarantees of prior work. In addition, since the construction involves computing quantities with a natural and widely studied statistical interpretation, we can leverage ideas from diagnostic regression analysis to employ these matrix decompositions for exploratory data analysis.
    Proceedings of the National Academy of Sciences 02/2009; 106(3):697-702. · 9.68 Impact Factor
  • Article: Sampling Algorithms and Coresets for $\ell_p$ Regression
    SIAM J. Comput. 01/2009; 38:2060-2078.
  • Conference Proceeding: Unsupervised Feature Selection for the $k$-means Clustering Problem.
    Christos Boutsidis, Michael W. Mahoney, Petros Drineas
    Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009. Proceedings of a meeting held 7-10 December 2009, Vancouver, British Columbia, Canada.; 01/2009
  • Article: An Improved Approximation Algorithm for the Column Subset Selection Problem
    Christos Boutsidis, Michael W. Mahoney, Petros Drineas
    ABSTRACT: We consider the problem of selecting the best subset of exactly $k$ columns from an $m \times n$ matrix $A$. We present and analyze a novel two-stage algorithm that runs in $O(\min\{mn^2,m^2n\})$ time and returns as output an $m \times k$ matrix $C$ consisting of exactly $k$ columns of $A$. In the first (randomized) stage, the algorithm randomly selects $\Theta(k \log k)$ columns according to a judiciously-chosen probability distribution that depends on information in the top-$k$ right singular subspace of $A$. In the second (deterministic) stage, the algorithm applies a deterministic column-selection procedure to select and return exactly $k$ columns from the set of columns selected in the first stage. Let $C$ be the $m \times k$ matrix containing those $k$ columns, let $P_C$ denote the projection matrix onto the span of those columns, and let $A_k$ denote the best rank-$k$ approximation to the matrix $A$. Then, we prove that, with probability at least 0.8, $$ \FNorm{A - P_CA} \leq \Theta(k \log^{1/2} k) \FNorm{A-A_k}. $$ This Frobenius norm bound is only a factor of $\sqrt{k \log k}$ worse than the best previously existing existential result and is roughly $O(\sqrt{k!})$ better than the best previous algorithmic result for the Frobenius norm version of this Column Subset Selection Problem (CSSP). We also prove that, with probability at least 0.8, $$ \TNorm{A - P_CA} \leq \Theta(k \log^{1/2} k)\TNorm{A-A_k} + \Theta(k^{3/4}\log^{1/4}k)\FNorm{A-A_k}. $$ This spectral norm bound is not directly comparable to the best previously existing bounds for the spectral norm version of this CSSP. Our bound depends on $\FNorm{A-A_k}$, whereas previous results depend on $\sqrt{n-k}\TNorm{A-A_k}$; if these two quantities are comparable, then our bound is asymptotically worse by a $(k \log k)^{1/4}$ factor. Comment: 17 pages; corrected a bug in the spectral norm bound of the previous version
    12/2008;
  • Article: Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters
    ABSTRACT: A large body of work has been devoted to defining and identifying clusters or communities in social and information networks. We explore from a novel perspective several questions related to identifying meaningful communities in large social and information networks, and we come to several striking conclusions. We employ approximation algorithms for the graph partitioning problem to characterize as a function of size the statistical and structural properties of partitions of graphs that could plausibly be interpreted as communities. In particular, we define the network community profile plot, which characterizes the "best" possible community--according to the conductance measure--over a wide range of size scales. We study over 100 large real-world social and information networks. Our results suggest a significantly more refined picture of community structure in large networks than has been appreciated previously. In particular, we observe tight communities that are barely connected to the rest of the network at very small size scales; and communities of larger size scales gradually "blend into" the expander-like core of the network and thus become less "community-like." This behavior is not explained, even at a qualitative level, by any of the commonly-used network generation models. Moreover, it is exactly the opposite of what one would expect based on intuition from expander graphs, low-dimensional or manifold-like graphs, and from small social networks that have served as testbeds of community detection algorithms. We have found that a generative graph model, in which new edges are added via an iterative "forest fire" burning process, is able to produce graphs exhibiting a network community profile plot similar to what we observe in our network datasets.
    11/2008;
  • Conference Proceeding: Unsupervised feature selection for principal components analysis.
    Christos Boutsidis, Michael W. Mahoney, Petros Drineas
    Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24-27, 2008; 01/2008
  • Conference Proceeding: Sampling algorithms and coresets for ℓp regression
    Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2008, San Francisco, California, USA, January 20-22, 2008; 01/2008
  • Article: Sampling subproblems of heterogeneous Max-Cut problems and approximation algorithms.
    Petros Drineas, Ravi Kannan, Michael W. Mahoney
    Random Struct. Algorithms. 01/2008; 32:307-333.
  • Article: Tensor-CUR Decompositions for Tensor-Based Data.
    Michael W. Mahoney, Mauro Maggioni, Petros Drineas
    SIAM J. Matrix Analysis Applications. 01/2008; 30:957-987.
  • Article: Column Subset Selection for Unsupervised Feature Selection
    Christos Boutsidis, Michael W Mahoney, Petros Drineas
    ABSTRACT: We consider, both theoretically and empirically, the problem of selecting the "best" subset of exactly k columns from an m × n data matrix A. From a theoretical perspective, we present and analyze a novel two-stage algorithm. In the first stage, the algorithm randomly selects O(k log k) columns according to a judiciously-chosen probability distribution that depends on information in the top-k right singular subspace of A. In the second stage, the algorithm applies a deterministic column-selection procedure to select and return exactly k columns from the set of columns selected in the first stage. Let C be the m × k matrix containing those k columns, let $P_C$ denote the projection matrix onto the span of those columns, and let $A_k$ denote the "best" rank-k approximation to the matrix A as computed with the singular value decomposition. Then, we prove that $\|A - P_C A\|_2 \le O\left(k^{3/4} \log^{1/2}(k)\, (n-k)^{1/4}\right) \|A - A_k\|_2$, with probability at least $1 - 10^{-20}$. For small to moderate values of k, this improves upon the best previously-existing result (of Gu and Eisenstat [20]) for this Column Subset Selection Problem. From an empirical perspective, we evaluate this algorithm as an unsupervised feature selection strategy in three application domains of modern statistical data analysis: finance, document-term data, and genetics. We pay particular attention to how this algorithm may be used to select representative or landmark features from an object-feature matrix in an unsupervised manner. In all three application domains, we are able to identify k landmark features, i.e., columns of the data matrix, that capture nearly the same amount of information as does the subspace that is spanned by the top k "eigenfeatures." Moreover, in cases where the original data matrix clusters well in the best k-dimensional space, e.g., as measured by k-means clustering, we also find that it clusters well in the space spanned by the features our algorithm chooses.
    11/2007;
  • Article: Faster Least Squares Approximation
    ABSTRACT: Least squares approximation is a technique to find an approximate solution to a system of linear equations that has no exact solution. In a typical setting, one lets $n$ be the number of constraints and $d$ be the number of variables, with $n \gg d$. Then, existing exact methods find a solution vector in $O(nd^2)$ time. We present two randomized algorithms that provide very accurate relative-error approximations to the optimal value and the solution vector of a least squares approximation problem more rapidly than existing exact algorithms. Both of our algorithms preprocess the data with the Randomized Hadamard Transform. One then uniformly randomly samples constraints and solves the smaller problem on those constraints, and the other performs a sparse random projection and solves the smaller problem on those projected coordinates. In both cases, solving the smaller problem provides relative-error approximations, and, if $n$ is sufficiently larger than $d$, the approximate solution can be computed in $O(nd \log d)$ time. Comment: 25 pages; minor changes from previous version; this version will appear in Numerische Mathematik
    10/2007;
  • Article: PCA-correlated SNPs for structure identification in worldwide human populations.
    ABSTRACT: Existing methods to ascertain small sets of markers for the identification of human population structure require prior knowledge of individual ancestry. Based on Principal Components Analysis (PCA), and recent results in theoretical computer science, we present a novel algorithm that, applied on genomewide data, selects small subsets of SNPs (PCA-correlated SNPs) to reproduce the structure found by PCA on the complete dataset, without use of ancestry information. Evaluating our method on a previously described dataset (10,805 SNPs, 11 populations), we demonstrate that a very small set of PCA-correlated SNPs can be effectively employed to assign individuals to particular continents or populations, using a simple clustering algorithm. We validate our methods on the HapMap populations and achieve perfect intercontinental differentiation with 14 PCA-correlated SNPs. The Chinese and Japanese populations can be easily differentiated using fewer than 100 PCA-correlated SNPs ascertained after evaluating 1.7 million SNPs from HapMap. We show that, in general, structure informative SNPs are not portable across geographic regions. However, we manage to identify a general set of 50 PCA-correlated SNPs that effectively assigns individuals to one of nine different populations. Compared to analysis with the measure of informativeness, our methods, although unsupervised, achieved similar results. We proceed to demonstrate that our algorithm can be effectively used for the analysis of admixed populations without having to trace the origin of individuals. Analyzing a Puerto Rican dataset (192 individuals, 7,257 SNPs), we show that PCA-correlated SNPs can be used to successfully predict structure and ancestry proportions. We subsequently validate these SNPs for structure identification in an independent Puerto Rican dataset. The algorithm that we introduce runs in seconds and can be easily applied on large genome-wide datasets, facilitating the identification of population substructure, stratification assessment in multi-stage whole-genome association studies, and the study of demographic history in human populations.
    PLoS Genetics 10/2007; 3(9):1672-86. · 8.69 Impact Factor
  • Article: Relative-Error CUR Matrix Decompositions
    Petros Drineas, Michael W. Mahoney, S. Muthukrishnan
    ABSTRACT: Many data analysis applications deal with large matrices and involve approximating the matrix using a small number of ``components.'' Typically, these components are linear combinations of the rows and columns of the matrix, and are thus difficult to interpret in terms of the original features of the input data. In this paper, we propose and study matrix approximations that are explicitly expressed in terms of a small number of columns and/or rows of the data matrix, and thereby more amenable to interpretation in terms of the original data. Our main algorithmic results are two randomized algorithms which take as input an $m \times n$ matrix $A$ and a rank parameter $k$. In our first algorithm, $C$ is chosen, and we let $A'=CC^+A$, where $C^+$ is the Moore-Penrose generalized inverse of $C$. In our second algorithm $C$, $U$, $R$ are chosen, and we let $A'=CUR$. ($C$ and $R$ are matrices that consist of actual columns and rows, respectively, of $A$, and $U$ is a generalized inverse of their intersection.) For each algorithm, we show that with probability at least $1-\delta$: $$ ||A-A'||_F \leq (1+\epsilon) ||A-A_k||_F, $$ where $A_k$ is the ``best'' rank-$k$ approximation provided by truncating the singular value decomposition (SVD) of $A$. The number of columns of $C$ and rows of $R$ is a low-degree polynomial in $k$, $1/\epsilon$, and $\log(1/\delta)$. Our two algorithms are the first polynomial time algorithms for such low-rank matrix approximations that come with relative-error guarantees; previously, in some cases, it was not even known whether such matrix decompositions exist. Both of our algorithms are simple, they take time of the order needed to approximately compute the top $k$ singular vectors of $A$, and they use a novel, intuitive sampling method called ``subspace sampling.''
    09/2007;
  • Article: Sampling Algorithms and Coresets for Lp Regression
    ABSTRACT: The Lp regression problem takes as input a matrix $A \in \Real^{n \times d}$, a vector $b \in \Real^n$, and a number $p \in [1,\infty)$, and it returns as output a number ${\cal Z}$ and a vector $x_{opt} \in \Real^d$ such that ${\cal Z} = \min_{x \in \Real^d} ||Ax -b||_p = ||Ax_{opt}-b||_p$. In this paper, we construct coresets and obtain an efficient two-stage sampling-based approximation algorithm for the very overconstrained ($n \gg d$) version of this classical problem, for all $p \in [1, \infty)$. The first stage of our algorithm non-uniformly samples $\hat{r}_1 = O(36^p d^{\max\{p/2+1, p\}+1})$ rows of $A$ and the corresponding elements of $b$, and then it solves the Lp regression problem on the sample; we prove this is an 8-approximation. The second stage of our algorithm uses the output of the first stage to resample $\hat{r}_1/\epsilon^2$ constraints, and then it solves the Lp regression problem on the new sample; we prove this is a $(1+\epsilon)$-approximation. Our algorithm unifies, improves upon, and extends the existing algorithms for special cases of Lp regression, namely $p = 1,2$. In the course of proving our result, we develop two concepts--well-conditioned bases and subspace-preserving sampling--that are of independent interest.
    08/2007;
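
Several of the abstracts above (for example "Fast approximation of matrix coherence and statistical leverage") define the statistical leverage scores of a tall matrix as the squared row norms of an orthonormal basis for its column space, and the coherence as the largest such score. The following is a minimal sketch of the naïve exact computation only, not of the fast randomized algorithm those papers propose. It assumes a full-column-rank input given as a list of rows; the function names `leverage_scores` and `coherence` are illustrative, and Gram-Schmidt is used here as one convenient way to obtain an orthonormal basis (the scores do not depend on which orthonormal basis of the column space is used).

```python
import math


def leverage_scores(A):
    """Exact leverage scores of a tall matrix A (a list of n rows of length d).

    Orthonormalizes the columns of A with modified Gram-Schmidt and returns
    the squared row norms of the resulting basis. Assumes full column rank.
    """
    n, d = len(A), len(A[0])
    columns = [[A[i][j] for i in range(n)] for j in range(d)]
    basis = []  # orthonormal vectors spanning the column space of A
    for col in columns:
        w = list(col)
        for q in basis:
            # Remove the component of w along the already-built direction q.
            proj = sum(qi * wi for qi, wi in zip(q, w))
            w = [wi - proj * qi for qi, wi in zip(q, w)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm < 1e-12:
            raise ValueError("columns are numerically linearly dependent")
        basis.append([wi / norm for wi in w])
    # Leverage score of row i = squared norm of row i of the basis matrix.
    return [sum(q[i] ** 2 for q in basis) for i in range(n)]


def coherence(A):
    """Coherence of A: the largest leverage score."""
    return max(leverage_scores(A))


# Illustrative data: an intercept column plus one feature with an outlier.
A = [[1.0, 0.0],
     [1.0, 1.0],
     [1.0, 2.0],
     [1.0, 10.0]]
scores = leverage_scores(A)
print(scores, coherence(A))
```

In this toy example the scores sum to the number of columns, and the row with the outlying feature value receives the largest score, which is the nonuniformity that leverage-based sampling is meant to capture.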
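
The community-structure abstracts above rate candidate communities "according to the conductance measure". As a rough illustration, the sketch below computes conductance for a node set in an undirected, unweighted graph, assuming the common definition: the number of edges leaving the set divided by the smaller of the two volumes (sums of degrees). The helper names `build_adjacency` and `conductance` and the toy graph are hypothetical and are not taken from the papers.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbours)} from edge pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def conductance(adj, nodes):
    """Conductance of `nodes`: cut edges / min(volume inside, volume outside)."""
    inside = set(nodes)
    # Each edge crossing the boundary is counted once, from its inside endpoint.
    cut = sum(1 for u in inside for v in adj[u] if v not in inside)
    vol_inside = sum(len(adj[u]) for u in inside)
    vol_outside = sum(len(adj[u]) for u in adj if u not in inside)
    denom = min(vol_inside, vol_outside)
    if denom == 0:
        raise ValueError("conductance is undefined for an empty or full node set")
    return cut / denom


# Toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = build_adjacency(edges)
print(conductance(adj, {0, 1, 2}))  # one triangle: a well-separated set
print(conductance(adj, {0}))        # a single node: every edge leaves the set
```

Lower values indicate a set that is better separated from the rest of the graph. Evaluating the best value found at each set size is the idea behind the size-resolved view described in the abstracts.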

Institutions

  • 2008–2009
    • Stanford University
      • Department of Mathematics
      Stanford, CA, USA
  • 2006–2008
    • Yahoo! Labs
      Sunnyvale, CA, USA
  • 2007
    • Democritus University of Thrace
      • Department of Molecular Biology and Genetics
      Komotiní, Anatoliki Makedonia kai Thraki, Greece
  • 1970–2007
    • Yale University
      • Department of Genetics
      • Department of Mathematics
      New Haven, CT, USA