Alex J. Smola

Monmouth University, West Long Branch, New Jersey, United States

Publications (196) · 61.7 Total Impact

  • ABSTRACT: We introduce a family of adaptive estimators on graphs, based on penalizing the $\ell_1$ norm of discrete graph differences. This generalizes the idea of trend filtering [Kim et al. (2009), Tibshirani (2014)], used for univariate nonparametric regression, to graphs. Analogous to the univariate case, graph trend filtering exhibits a level of local adaptivity unmatched by the usual $\ell_2$-based graph smoothers. It is also defined by a convex minimization problem that is readily solved (e.g., by fast ADMM or Newton algorithms). We demonstrate the merits of graph trend filtering through examples and theory.
    10/2014; (see the graph trend filtering sketch after this list)
  • ABSTRACT: We study a novel spline-like basis, which we name the {\it falling factorial basis}, bearing many similarities to the classic truncated power basis. The advantage of the falling factorial basis is that it enables rapid, linear-time computations in basis matrix multiplication and basis matrix inversion. The falling factorial functions are not actually splines, but are close enough to splines that they provably retain some of the favorable properties of the latter functions. We examine their application in two problems: trend filtering over arbitrary input points, and a higher-order variant of the two-sample Kolmogorov-Smirnov test.
    05/2014;
  • ABSTRACT: Classical techniques such as Principal Component Analysis (PCA) and Canonical Correlation Analysis (CCA) are ubiquitous in statistics. However, these techniques only reveal linear relationships in data. Although nonlinear variants of PCA and CCA have been proposed, they are computationally prohibitive at large scale. In a separate strand of recent research, randomized methods have been proposed to construct features that help reveal nonlinear patterns in data. For basic tasks such as regression or classification, random features exhibit little or no loss in performance, while achieving dramatic savings in computational requirements. In this paper we leverage randomness to design scalable new variants of nonlinear PCA and CCA; our ideas also extend to key multivariate analysis tools such as spectral clustering or LDA. We demonstrate our algorithms through experiments on real-world data, on which we compare against the state-of-the-art. Code in R implementing our methods is provided in the Appendix.
    02/2014; (see the random-features PCA sketch after this list)
  • Amr Ahmed, Alex Smola
    ABSTRACT: Large amounts of data arise in a multitude of situations, ranging from bioinformatics to astronomy, manufacturing, and medical applications. For concreteness our tutorial focuses on data obtained in the context of the internet, such as user-generated content (microblogs, e-mails, messages), behavioral data (locations, interactions, clicks, queries), and graphs. Due to its magnitude, much of the challenge is to extract structure and interpretable models without the need for additional labels, i.e. to design effective unsupervised techniques. We present design patterns for hierarchical nonparametric Bayesian models, efficient inference algorithms, and modeling tools to describe salient aspects of the data.
    Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining; 08/2013
  • ABSTRACT: As search pages are becoming increasingly complex, with images and nonlinear page layouts, understanding how users examine the page is important. We present a lab study on the effect of a rich informational panel, placed to the right of the search result column, on eye and mouse behavior. Using eye and mouse data, we show that the flow of user attention on nonlinear page layouts is different from the widely believed top-down linear examination order of search results. We further demonstrate that the mouse, like the eye, is sensitive to two key attributes of page elements -- their position (layout), and their relevance to the user's task. We identify mouse measures that are strongly correlated with eye movements, and develop models to predict user attention (eye gaze) from mouse activity. These findings show that mouse tracking can be used to infer user attention and information flow patterns on search pages. Potential applications include ranking, search page optimization, and UI evaluation.
    Proceedings of the 22nd international conference on World Wide Web; 05/2013
  • ABSTRACT: A typical behavioral targeting system optimizing purchase activities, called conversions, faces two main challenges: the web-scale amounts of user histories to process on a daily basis, and the relative sparsity of conversions. In this paper, we try to address these challenges through feature selection. We formulate a multi-task (or group) feature-selection problem among a set of related tasks (sharing a common set of features), namely advertising campaigns. We apply a group-sparse penalty consisting of a combination of an $\ell_1$ and $\ell_2$ penalty and an associated fast optimization algorithm for distributed parameter estimation. Our algorithm relies on a variant of the well-known Fast Iterative Shrinkage-Thresholding Algorithm (FISTA), a closed-form solution for mixed-norm programming and a distributed subgradient oracle. To efficiently handle web-scale user histories, we present a distributed inference algorithm for the problem that scales to billions of instances and millions of attributes. We show the superiority of our algorithm in terms of both sparsity and ROC performance over baseline feature selection methods (both single-task $\ell_1$-regularization and multi-task mutual-information gain).
    The 21st ACM International Conference on Information and Knowledge Management (CIKM); 10/2012 (see the group-sparse proximal step sketch after this list)
  • ABSTRACT: In this paper we define conditional random fields in reproducing kernel Hilbert spaces and show connections to Gaussian Process classification. More specifically, we prove decomposition results for undirected graphical models and we give constructions for kernels. Finally, we present efficient means of solving the optimization problem using reduced rank decompositions and we show how stationarity can be exploited efficiently in the optimization process.
    07/2012;
  • ABSTRACT: This paper analyzes the problem of Gaussian process (GP) bandits with deterministic observations. The analysis uses a branch and bound algorithm that is related to the UCB algorithm of (Srinivas et al., 2010). For GPs with Gaussian observation noise, with variance strictly greater than zero, Srinivas et al. proved that the regret vanishes at the approximate rate of $O(1/\sqrt{t})$, where $t$ is the number of observations. To complement their result, we attack the deterministic case and attain a much faster exponential convergence rate. Under some regularity assumptions, we show that the regret decreases asymptotically according to $O(e^{-\frac{\tau t}{(\ln t)^{d/4}}})$ with high probability. Here, $d$ is the dimension of the search space and $\tau$ is a constant that depends on the behaviour of the objective function near its global maximum.
    06/2012; (see the GP-UCB sketch after this list)
  • Journal of Machine Learning Research 01/2012; 13:723-773. · 3.42 Impact Factor
  • ABSTRACT: We describe Hokusai, a real-time system which is able to capture frequency information for streams of arbitrary sequences of symbols. The algorithm uses the Count-Min sketch as its basis and exploits the fact that sketching is linear. It provides real-time statistics of arbitrary events, e.g. streams of queries as a function of time. We use a factorizing approximation to provide point estimates at arbitrary (time, item) combinations. Queries can be answered in constant time.
    Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence (UAI), 2012; 01/2012 (see the Count-Min sketch example after this list)
  • The 26th Conference on Neural Information Processing Systems (NIPS); 01/2012
  • ABSTRACT: Relevance, diversity and personalization are key issues when presenting content which is apt to pique a user's interest. This is particularly true when presenting an engaging set of news stories. In this paper we propose an efficient algorithm for selecting a small subset of relevant articles from a streaming news corpus. It offers three key improvements over past work: 1) It is based on a detailed model of a user's viewing behavior which does not require explicit feedback. 2) We use the notion of submodularity to estimate the propensity of interacting with content. This improves over the classical context-independent relevance ranking algorithms. Unlike existing methods, we learn the submodular function from the data. 3) We present an efficient online algorithm which can be adapted for personalization, story adaptation, and factorization models. Experiments show that our system yields a significant improvement over a retrieval system deployed in production.
    Proceedings of the Fifth International Conference on Web Search and Web Data Mining, WSDM 2012, Seattle, WA, USA, February 8-12, 2012; 01/2012 (see the greedy submodular selection sketch after this list)
  • ABSTRACT: In this work we study parallelization of online learning, a core primitive in machine learning. In a parallel environment all known approaches for parallel online learning lead to delayed updates, where the model is updated using out-of-date information. In the worst case, or when examples are temporally correlated, delay can have a very adverse effect on the learning algorithm. Here, we analyze and present preliminary empirical results on a set of learning architectures based on a feature sharding approach that present various tradeoffs between delay, degree of parallelism, representation power and empirical performance.
    Computing Research Repository (CoRR); 03/2011
  • Qinfeng Shi, Li Cheng, Li Wang, Alex J. Smola
    ABSTRACT: A challenging problem in human action understanding is to jointly segment and recognize human actions from an unseen video sequence, where one person performs a sequence of continuous actions. In this paper, we propose a discriminative semi-Markov model approach, and define a set of features over boundary frames, segments, as well as neighboring segments. This enables us to conveniently capture a combination of local and global features that best represent a specific action type. To efficiently solve the inference problem of simultaneous segmentation and recognition, we devise a Viterbi-like dynamic programming algorithm, which is able to process 20 frames per second in practice. Moreover, the model is discriminatively learned from the large margin principle, and is formulated as an optimization problem with exponentially many constraints. To solve it efficiently, we present two different optimization algorithms, namely the cutting plane method and the bundle method, and demonstrate that each can be alternatively deployed in a "plug and play" fashion. From its theoretical aspect, we also analyze the generalization error of the proposed approach and provide a PAC-Bayes bound.
    International Journal of Computer Vision 01/2011; 93:22-32. · 3.62 Impact Factor
  • ABSTRACT: Sponsored search is a three-way interaction between advertisers, users, and the search engine. The basic ad selection in sponsored search lets the advertiser choose the exact queries where the ad is to be shown. To increase advertising volume, many advertisers opt into advanced match, where the search engine can select additional queries that are deemed relevant for the advertiser's ad. In advanced match, the search engine is effectively bidding on behalf of the advertisers. While advanced match has been extensively studied in the literature from the ad relevance perspective, there is little work that discusses how to infer the appropriate bid value for a given advanced match. The bid value is crucial as it affects both the ad placement in revenue reordering and the amount advertisers are charged in case of a click. We propose a statistical approach to solve the bid generation problem and examine two information sources: the bidding behavior of advertisers, and the conversion data. Our key finding suggests that sophisticated advertisers' bids are driven by many factors beyond clicks and immediate measurable conversions, likely capturing the value chain of an ad display ranging from views and clicks to profit margins, and representing the total ROI from the advertising.
    Proceedings of the Forth International Conference on Web Search and Web Data Mining, WSDM 2011, Hong Kong, China, February 9-12, 2011; 01/2011
  • ABSTRACT: Object matching is a fundamental operation in data analysis. It typically requires the definition of a similarity measure between the classes of objects to be matched. Instead, we develop an approach which is able to perform matching by requiring a similarity measure only within each of the classes. This is achieved by maximizing the dependency between matched pairs of observations by means of the Hilbert-Schmidt Independence Criterion. This can be cast as a quadratic assignment problem with special structure, and we present a simple algorithm for finding a locally optimal solution.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 11/2010; · 4.80 Impact Factor (see the HSIC sketch after this list)
  • ABSTRACT: The goal of frequent subgraph mining is to detect subgraphs that frequently occur in a dataset of graphs. In classification settings, one is often interested in discovering discriminative frequent subgraphs, whose presence or absence is indicative of the class membership of a graph. In this article, we propose an approach to feature selection on frequent subgraphs, called CORK, that combines two central advantages. First, it optimizes a submodular quality criterion, which means that we can yield a near-optimal solution using greedy feature selection. Second, our submodular quality criterion can be integrated into gSpan, the state-of-the-art tool for frequent subgraph mining, and helps to prune the search space for discriminative frequent subgraphs even during frequent subgraph mining.
    Statistical Analysis and Data Mining 09/2010; 3(5):302 - 318.
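
Illustrative code sketches for selected entries above (hedged sketches, not the authors' implementations):

For the graph trend filtering entry (10/2014): a minimal sketch of the first-order objective, minimizing $\frac{1}{2}\|y - \beta\|_2^2 + \lambda \|D\beta\|_1$ with $D$ the graph difference operator, assuming cvxpy is available. The toy chain graph, signal, and penalty weight are illustrative placeholders; the paper's fast ADMM and Newton solvers are not reproduced.

```python
# Minimal first-order graph trend filtering sketch (assumes cvxpy is installed).
# minimize_beta  0.5 * ||y - beta||_2^2 + lam * ||D beta||_1,
# where D is the oriented edge incidence (graph difference) operator.
import numpy as np
import cvxpy as cp

def graph_trend_filter(y, edges, lam):
    """y: node observations, edges: list of (i, j) pairs, lam: penalty weight."""
    n = len(y)
    D = np.zeros((len(edges), n))
    for row, (i, j) in enumerate(edges):
        D[row, i], D[row, j] = 1.0, -1.0          # discrete difference across edge (i, j)
    beta = cp.Variable(n)
    objective = 0.5 * cp.sum_squares(y - beta) + lam * cp.norm1(D @ beta)
    cp.Problem(cp.Minimize(objective)).solve()
    return beta.value

# Toy chain graph: the estimate becomes piecewise constant as lam grows.
y = np.array([1.0, 1.2, 0.9, 5.1, 5.0, 4.8])
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
print(graph_trend_filter(y, edges, lam=1.0))
```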
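For the randomized nonlinear PCA/CCA entry (02/2014): the paper ships R code in its appendix, which is not reproduced here. Below is a hedged NumPy sketch of the general idea of combining random Fourier features with ordinary linear PCA; the feature dimension D, bandwidth sigma, and data are illustrative assumptions.

```python
# Hedged sketch: approximate nonlinear (RBF-kernel) PCA by mapping data through
# random Fourier features and then running plain linear PCA on the features.
import numpy as np

def random_fourier_features(X, D=200, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=1.0 / sigma, size=(d, D))    # random frequencies for the RBF kernel
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)         # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

def randomized_kernel_pca(X, n_components=2, D=200, sigma=1.0):
    Z = random_fourier_features(X, D=D, sigma=sigma)
    Z -= Z.mean(axis=0)                               # center in the random feature space
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)  # linear PCA on the features
    return Z @ Vt[:n_components].T                    # scores of the leading components

X = np.random.default_rng(1).normal(size=(500, 5))
scores = randomized_kernel_pca(X, n_components=2)
print(scores.shape)  # (500, 2)
```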
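For the multi-task feature selection entry (CIKM 2012): a hedged sketch of the block soft-thresholding operator that FISTA-style methods apply for an $\ell_1$/$\ell_2$ group penalty. The group structure and threshold are illustrative; the distributed machinery described in the paper is omitted.

```python
# Block soft-thresholding: the proximal operator of lam * sum_g ||w_g||_2,
# the mixed-norm (group-sparse) penalty applied inside a FISTA-style iteration.
import numpy as np

def prox_group_l2(w, groups, lam):
    """w: parameter vector, groups: list of index arrays, lam: threshold (step size times penalty)."""
    out = w.copy()
    for g in groups:
        norm = np.linalg.norm(w[g])
        out[g] = 0.0 if norm <= lam else (1.0 - lam / norm) * w[g]  # shrink, or zero the whole group
    return out

w = np.array([0.3, -0.2, 2.0, 1.5, 0.05])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4])]
print(prox_group_l2(w, groups, lam=0.5))  # the first and last groups are driven to zero
```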
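For the GP bandits entry (06/2012): the paper's branch and bound analysis is not reproduced; this is a hedged sketch of a plain GP-UCB loop (in the spirit of Srinivas et al., 2010) with near-noiseless observations, assuming scikit-learn is available. The objective, candidate grid, and exploration weight are illustrative.

```python
# Hedged GP-UCB sketch with (near-)deterministic observations: fit a GP to the
# evaluations so far and query the candidate with the highest upper confidence bound.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def f(x):                        # unknown objective (illustrative)
    return -(x - 0.3) ** 2

candidates = np.linspace(0.0, 1.0, 201).reshape(-1, 1)
X, y = [[0.0]], [f(0.0)]         # start from one arbitrary evaluation
beta = 2.0                       # exploration weight

for t in range(20):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-8)  # tiny alpha ~ noiseless
    gp.fit(np.array(X), np.array(y))
    mu, sigma = gp.predict(candidates, return_std=True)
    x_next = candidates[np.argmax(mu + beta * sigma)]                        # UCB acquisition
    X.append(list(x_next))
    y.append(f(x_next[0]))

print("best point found:", X[int(np.argmax(y))])
```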
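For the Hokusai entry (UAI 2012): a hedged sketch of the underlying Count-Min sketch, including the linearity (two sketches add element-wise) that Hokusai exploits for aggregation over time. The width, depth, and hashing scheme are illustrative assumptions; Hokusai's factorized (time, item) estimates are not implemented here.

```python
# Minimal Count-Min sketch: depth hash rows of a given width; counts are over-estimates,
# and sketches are linear (adding two sketches yields the sketch of the combined stream).
import hashlib
import numpy as np

class CountMinSketch:
    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = np.zeros((depth, width), dtype=np.int64)

    def _index(self, row, item):
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def update(self, item, count=1):
        for row in range(self.depth):
            self.table[row, self._index(row, item)] += count

    def query(self, item):
        # Minimum over rows gives the tightest (still upward-biased) estimate.
        return min(self.table[row, self._index(row, item)] for row in range(self.depth))

    def merge(self, other):
        self.table += other.table   # linearity: merging streams = adding their sketches

cms = CountMinSketch()
for q in ["cats", "dogs", "cats", "news"]:
    cms.update(q)
print(cms.query("cats"))            # at least 2, and typically exactly 2
```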
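For the WSDM 2012 news selection entry: the paper learns its submodular function from data, which is not reproduced here. Below is a hedged sketch of the standard greedy algorithm for maximizing a monotone submodular, coverage-style objective over candidate articles; the topic sets and budget are illustrative assumptions.

```python
# Greedy maximization of a monotone submodular objective: repeatedly add the
# article with the largest marginal gain (here, a simple topic-coverage function).
def coverage(selected, topics_by_article):
    covered = set()
    for a in selected:
        covered |= topics_by_article[a]
    return len(covered)

def greedy_select(topics_by_article, budget):
    selected = []
    while len(selected) < budget:
        gains = {a: coverage(selected + [a], topics_by_article) - coverage(selected, topics_by_article)
                 for a in topics_by_article if a not in selected}
        best = max(gains, key=gains.get)
        if gains[best] == 0:               # no remaining article adds new coverage
            break
        selected.append(best)
    return selected

articles = {"a1": {"politics", "economy"}, "a2": {"sports"},
            "a3": {"economy"}, "a4": {"politics", "sports", "tech"}}
print(greedy_select(articles, budget=2))   # ['a4', 'a1'], covering all four topics
```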
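For the object matching entry (TPAMI 11/2010): a hedged NumPy sketch of the biased Hilbert-Schmidt Independence Criterion estimator, $\mathrm{HSIC} = \mathrm{tr}(KHLH)/(n-1)^2$ with $H = I - \mathbf{1}\mathbf{1}^\top/n$, evaluated for a candidate matching. The RBF kernels and the toy permutation are illustrative; the paper's local search over the quadratic assignment problem is not implemented.

```python
# Biased HSIC estimator: HSIC(K, L) = trace(K H L H) / (n - 1)^2 with H = I - (1/n) * ones.
# Object matching scores a permutation pi by the HSIC between K and the permuted L.
import numpy as np

def rbf_kernel(X, sigma=1.0):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def hsic(K, L):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = X[:, :2] + 0.1 * rng.normal(size=(50, 2))      # Y depends on X, so the identity matching is "correct"
K, L = rbf_kernel(X), rbf_kernel(Y)

pi = rng.permutation(50)                            # a random (wrong) matching of Y to X
print("HSIC, correct matching:", hsic(K, L))
print("HSIC, random matching: ", hsic(K, L[pi][:, pi]))  # typically much smaller
```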

Publication Stats

14k Citations
61.70 Total Impact Points

Institutions

  • 2014
    • Monmouth University
      West Long Branch, New Jersey, United States
  • 2013
    • Carnegie Mellon University
      • Computer Science Department
      Pittsburgh, Pennsylvania, United States
  • 2007–2012
    • National ICT Australia Ltd
      Sydney, New South Wales, Australia
  • 2010
    • Yahoo! Labs
      Sunnyvale, California, United States
  • 2000–2006
    • Australian National University
      • Research School of Earth Sciences
      Canberra, Australian Capital Territory, Australia
  • 2005
    • ICT University
      Baton Rouge, Louisiana, United States
  • 2000–2005
    • Ruhr-Universität Bochum
      Bochum, North Rhine-Westphalia, Germany