Alex J. Smola

Monmouth University, West Long Branch, New Jersey, United States

Publications (194) · 61.7 Total Impact Points

  • Source
    ABSTRACT: We study a novel spline-like basis, which we name the "falling factorial basis", bearing many similarities to the classic truncated power basis. The advantage of the falling factorial basis is that it enables rapid, linear-time computations in basis matrix multiplication and basis matrix inversion. The falling factorial functions are not actually splines, but are close enough to splines that they provably retain some of the favorable properties of the latter functions. We examine their application in two problems: trend filtering over arbitrary input points, and a higher-order variant of the two-sample Kolmogorov-Smirnov test.
    05/2014;
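    To make the construction concrete, here is a hedged sketch of one plausible indexing of the falling factorial basis, evaluated naively in O(n^2); the paper's contribution is that multiplication by this matrix and its inverse can instead be done in linear time. The function name, the 0-based knot placement, and the strict inequality in the indicator are assumptions of this illustration, not the paper's exact conventions.

```python
import numpy as np

def falling_factorial_basis(x, k):
    """Naive O(n^2) evaluation of an order-k falling factorial basis at
    sorted inputs x. Boundary/indexing conventions are assumptions."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    H = np.empty((n, n))
    # Polynomial part: h_j(t) = prod_{l < j} (t - x_l), j = 0, ..., k.
    for j in range(k + 1):
        H[:, j] = np.prod(x[:, None] - x[None, :j], axis=1)
    # Piecewise part: k-term products switched on past the knot x_{j+k}.
    for j in range(n - k - 1):
        H[:, k + 1 + j] = (np.prod(x[:, None] - x[None, j + 1:j + k + 1], axis=1)
                           * (x > x[j + k]))
    return H
```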
  • Source
    ABSTRACT: Classical techniques such as Principal Component Analysis (PCA) and Canonical Correlation Analysis (CCA) are ubiquitous in statistics. However, these techniques only reveal linear relationships in data. Although nonlinear variants of PCA and CCA have been proposed, they are computationally prohibitive at large scale. In a separate strand of recent research, randomized methods have been proposed to construct features that help reveal nonlinear patterns in data. For basic tasks such as regression or classification, random features exhibit little or no loss in performance, while achieving dramatic savings in computational requirements. In this paper we leverage randomness to design scalable new variants of nonlinear PCA and CCA; our ideas also extend to key multivariate analysis tools such as spectral clustering or LDA. We demonstrate our algorithms through experiments on real-world data, on which we compare against the state-of-the-art. Code in R implementing our methods is provided in the Appendix.
    02/2014;
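    A minimal sketch of the random-feature recipe the abstract describes: random Fourier features approximating an RBF kernel, followed by ordinary PCA in the feature space. The paper ships R code; this Python version, and every name and default in it, is illustrative rather than the authors' implementation.

```python
import numpy as np

def rff(X, D, gamma, rng):
    """Random Fourier features approximating the RBF kernel
    exp(-gamma * ||x - y||^2) (Rahimi & Recht style)."""
    n, d = X.shape
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

def randomized_nonlinear_pca(X, D=500, gamma=1.0, n_components=2, seed=0):
    """Sketch: map data through random features, then run linear PCA
    in the feature space to obtain nonlinear principal scores."""
    rng = np.random.default_rng(seed)
    Z = rff(X, D, gamma, rng)
    Z -= Z.mean(axis=0)                      # center in feature space
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:n_components].T
```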
  • Amr Ahmed, Alex Smola
    ABSTRACT: Large amounts of data arise in a multitude of situations, ranging from bioinformatics to astronomy, manufacturing, and medical applications. For concreteness our tutorial focuses on data obtained in the context of the internet, such as user generated content (microblogs, e-mails, messages), behavioral data (locations, interactions, clicks, queries), and graphs. Due to its magnitude, much of the challenge lies in extracting structure and interpretable models without the need for additional labels, i.e., in designing effective unsupervised techniques. We present design patterns for hierarchical nonparametric Bayesian models, efficient inference algorithms, and modeling tools to describe salient aspects of the data.
    Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining; 08/2013
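    The nonparametric building block behind many of the models such a tutorial covers is short to state in code. Below is a hedged sketch of the Chinese restaurant process, one standard ingredient of hierarchical Bayesian nonparametrics; it is generic background, not code from the tutorial.

```python
import random

def crp(n, alpha, seed=0):
    """Chinese restaurant process: customer i joins an existing table
    with probability proportional to its occupancy, or opens a new
    table with probability alpha / (i + alpha)."""
    rng = random.Random(seed)
    counts, assignments = [], []
    for i in range(n):
        r = rng.random() * (i + alpha)
        if r < alpha:                      # open a new table
            assignments.append(len(counts))
            counts.append(1)
        else:                              # join table t w.p. counts[t]/(i+alpha)
            acc = alpha
            for t, c in enumerate(counts):
                acc += c
                if r < acc:
                    assignments.append(t)
                    counts[t] += 1
                    break
    return assignments
```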
  • ABSTRACT: As search pages are becoming increasingly complex, with images and nonlinear page layouts, understanding how users examine the page is important. We present a lab study on the effect that a rich informational panel, placed to the right of the search result column, has on eye and mouse behavior. Using eye and mouse data, we show that the flow of user attention on nonlinear page layouts is different from the widely believed top-down linear examination order of search results. We further demonstrate that the mouse, like the eye, is sensitive to two key attributes of page elements -- their position (layout), and their relevance to the user's task. We identify mouse measures that are strongly correlated with eye movements, and develop models to predict user attention (eye gaze) from mouse activity. These findings show that mouse tracking can be used to infer user attention and information flow patterns on search pages. Potential applications include ranking, search page optimization, and UI evaluation.
    Proceedings of the 22nd international conference on World Wide Web; 05/2013
  • Source
    ABSTRACT: In this paper we define conditional random fields in reproducing kernel Hilbert spaces and show connections to Gaussian Process classification. More specifically, we prove decomposition results for undirected graphical models and we give constructions for kernels. Finally we present efficient means of solving the optimization problem using reduced rank decompositions and we show how stationarity can be exploited efficiently in the optimization process.
    07/2012;
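    As a pointer to what such a decomposition looks like, the standard form in the kernel CRF literature lets the joint kernel factor additively over the cliques C of the underlying undirected graph. The display below is generic background in that spirit, not the paper's precise statement:

```latex
% For labels y factoring over cliques c \in C of the graph, the joint
% kernel is assumed to decompose clique by clique:
k\bigl((x, y), (x', y')\bigr) \;=\; \sum_{c \in C} k_c\bigl((x_c, y_c), (x'_c, y'_c)\bigr)
```

    With a kernel of this form, the estimated potential functions inherit the graph structure, which is what makes per-clique reduced rank expansions feasible.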
  • Source
    ABSTRACT: This paper analyzes the problem of Gaussian process (GP) bandits with deterministic observations. The analysis uses a branch and bound algorithm that is related to the UCB algorithm of Srinivas et al. (2010). For GPs with Gaussian observation noise, with variance strictly greater than zero, Srinivas et al. proved that the regret vanishes at the approximate rate of $O(1/\sqrt{t})$, where $t$ is the number of observations. To complement their result, we attack the deterministic case and attain a much faster exponential convergence rate. Under some regularity assumptions, we show that the regret decreases asymptotically according to $O(e^{-\frac{\tau t}{(\ln t)^{d/4}}})$ with high probability. Here, $d$ is the dimension of the search space and $\tau$ is a constant that depends on the behaviour of the objective function near its global maximum.
    06/2012;
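    The setting is easiest to see in code. The sketch below runs a plain upper-confidence-bound loop on a noiseless GP over a grid; it illustrates the deterministic-observation regime but is not the paper's branch and bound algorithm, and the toy objective, kernel, and constants are all assumptions.

```python
import numpy as np

def rbf(A, B, length=0.2):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * length ** 2))

def gp_posterior(X, y, Xs, jitter=1e-8):
    """Posterior mean/std of a noise-free GP (jitter only for stability)."""
    K = rbf(X, X) + jitter * np.eye(len(X))
    L = np.linalg.cholesky(K)
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))
    V = np.linalg.solve(L, rbf(X, Xs))
    mu = rbf(X, Xs).T @ a
    sd = np.sqrt(np.clip(np.diag(rbf(Xs, Xs)) - (V * V).sum(0), 0, None))
    return mu, sd

f = lambda X: np.sin(3 * X[:, 0]) * X[:, 0]   # toy deterministic objective
grid = np.linspace(0.0, 2.0, 200)[:, None]
X, y = grid[[0]], f(grid[[0]])
for _ in range(15):                           # evaluate argmax of mu + 2 sd
    mu, sd = gp_posterior(X, y, grid)
    nxt = grid[[int(np.argmax(mu + 2.0 * sd))]]
    X, y = np.vstack([X, nxt]), np.append(y, f(nxt))
```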
  • Source
    Journal of Machine Learning Research 01/2012; 13:723-773. · 3.42 Impact Factor
  • Source
    ABSTRACT: We describe Hokusai, a real-time system which is able to capture frequency information for streams of arbitrary sequences of symbols. The algorithm uses the Count-Min sketch as its basis and exploits the fact that sketching is linear. It provides real-time statistics of arbitrary events, e.g. streams of queries as a function of time. We use a factorizing approximation to provide point estimates at arbitrary (time, item) combinations. Queries can be answered in constant time.
    Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence (UAI), 2012; 01/2012
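    The data structure at Hokusai's core is compact enough to show directly. Below is a minimal Count-Min sketch; Hokusai's time-aggregation and factorized (time, item) estimates sit on top of it and are not reproduced here, and the class and parameter names are illustrative.

```python
import numpy as np

class CountMinSketch:
    """Approximate counts with one-sided error: query() never
    underestimates, and overestimates shrink as width/depth grow."""
    def __init__(self, width=2048, depth=4, seed=0):
        rng = np.random.default_rng(seed)
        self.width, self.depth = width, depth
        self.salts = [int(s) for s in rng.integers(0, 2**32, size=depth)]
        self.table = np.zeros((depth, width), dtype=np.int64)

    def _cols(self, item):
        # One hashed column per row, varied by the row's salt.
        return [hash((salt, item)) % self.width for salt in self.salts]

    def add(self, item, count=1):
        for row, col in enumerate(self._cols(item)):
            self.table[row, col] += count

    def query(self, item):
        return min(self.table[row, col]
                   for row, col in enumerate(self._cols(item)))
```

    Linearity, the property the paper exploits, means the sketch of a merged stream is just the entrywise sum of the per-interval tables (`a.table + b.table`), so sketches for different time intervals can be aggregated or down-weighted without touching the raw stream.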
  • Source
    ABSTRACT: A typical behavioral targeting system optimizing purchase activities, called conversions, faces two main challenges: the web-scale amounts of user histories to process on a daily basis, and the relative sparsity of conversions. In this paper, we try to address these challenges through feature selection. We formulate a multi-task (or group) feature-selection problem among a set of related tasks (sharing a common set of features), namely advertising campaigns. We apply a group-sparse penalty consisting of a combination of an ℓ1 and ℓ2 penalty and an associated fast optimization algorithm for distributed parameter estimation. Our algorithm relies on a variant of the well known Fast Iterative Shrinkage-Thresholding Algorithm (FISTA), a closed-form solution for mixed norm programming and a distributed subgradient oracle. To efficiently handle web-scale user histories, we present a distributed inference algorithm for the problem that scales to billions of instances and millions of attributes. We show the superiority of our algorithm in terms of both sparsity and ROC performance over baseline feature selection methods (both single-task L1-regularization and multi-task mutual-information gain).
    The 21st ACM International Conference on Information and Knowledge Management (CIKM); 01/2012
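    The proximal step behind such a mixed ℓ1/ℓ2 penalty has a simple closed form, known from the sparse group lasso: per-coordinate soft-thresholding followed by blockwise shrinkage. The sketch below shows that step only; the distributed FISTA machinery and subgradient oracle from the paper are not reproduced, and the function name and signature are assumptions.

```python
import numpy as np

def sparse_group_prox(w, groups, lam1, lam2):
    """prox of lam1*||w||_1 + lam2*sum_g ||w_g||_2: soft-threshold each
    coordinate, then shrink (or zero out) each feature group as a block."""
    w = np.sign(w) * np.maximum(np.abs(w) - lam1, 0.0)
    for g in groups:                     # g: index array for one group
        norm = np.linalg.norm(w[g])
        if norm > 0.0:
            w[g] *= max(0.0, 1.0 - lam2 / norm)
    return w

# Inside a FISTA loop the update would read, at the extrapolated point v:
#   w = sparse_group_prox(v - step * grad(v), groups, step*lam1, step*lam2)
```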
  • Source
    The 26th Conference on Neural Information Processing Systems (NIPS); 01/2012
  • Source
    ABSTRACT: Relevance, diversity and personalization are key issues when presenting content which is apt to pique a user's interest. This is particularly true when presenting an engaging set of news stories. In this paper we propose an efficient algorithm for selecting a small subset of relevant articles from a streaming news corpus. It offers three key improvements over past work: 1) It is based on a detailed model of a user's viewing behavior which does not require explicit feedback. 2) We use the notion of submodularity to estimate the propensity of interacting with content. This improves over classical context-independent relevance ranking algorithms. Unlike existing methods, we learn the submodular function from the data. 3) We present an efficient online algorithm which can be adapted for personalization, story adaptation, and factorization models. Experiments show that our system yields a significant improvement over a retrieval system deployed in production.
    Proceedings of the Fifth International Conference on Web Search and Web Data Mining, WSDM 2012, Seattle, WA, USA, February 8-12, 2012; 01/2012
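    Since the paper learns the submodular function from data, the selection step itself reduces to the classic greedy rule, sketched below under an assumed marginal-gain callback; the learned propensity model and personalization logic are not reproduced.

```python
def greedy_select(articles, k, gain):
    """Greedy maximization of a monotone submodular set function:
    repeatedly add the article with the largest marginal gain.
    gain(S, a) is an assumed callback scoring the addition of a to S."""
    S = []
    for _ in range(k):
        candidates = [a for a in articles if a not in S]
        if not candidates:
            break
        S.append(max(candidates, key=lambda a: gain(S, a)))
    return S
```

    For monotone submodular objectives this greedy rule carries the standard (1 - 1/e) approximation guarantee, which is why submodularity is worth establishing in the first place.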
  • Source
    ABSTRACT: In this work we study parallelization of online learning, a core primitive in machine learning. In a parallel environment all known approaches for parallel online learning lead to delayed updates, where the model is updated using out-of-date information. In the worst case, or when examples are temporally correlated, delay can have a very adverse effect on the learning algorithm. Here, we analyze and present preliminary empirical results on a set of learning architectures based on a feature sharding approach that present various tradeoffs between delay, degree of parallelism, representation power and empirical performance.
    Computing Research Repository (CoRR); 03/2011
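    The delay phenomenon the abstract analyzes can be simulated in a few lines: gradients are computed against the current model but applied only after sitting in a pipeline. A hedged sketch, with `grad_oracle`, the horizon, and the constant delay all assumptions of this illustration:

```python
import numpy as np
from collections import deque

def delayed_sgd(grad_oracle, w0, lr=0.1, delay=4, steps=100):
    """SGD where the gradient computed at step t is applied at t + delay,
    mimicking updates that travel through a parallel pipeline."""
    w = w0.copy()
    pipeline = deque(np.zeros_like(w0) for _ in range(delay))
    for t in range(steps):
        pipeline.append(grad_oracle(w, t))  # fresh gradient enters...
        w -= lr * pipeline.popleft()        # ...a stale one is applied
    return w
```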
  • Source
    Qinfeng Shi, Li Cheng, Li Wang, Alex J. Smola
    ABSTRACT: A challenging problem in human action understanding is to jointly segment and recognize human actions from an unseen video sequence, where one person performs a sequence of continuous actions. In this paper, we propose a discriminative semi-Markov model approach, and define a set of features over boundary frames, segments, as well as neighboring segments. This enables us to conveniently capture a combination of local and global features that best represent a specific action type. To efficiently solve the inference problem of simultaneous segmentation and recognition, we devise a Viterbi-like dynamic programming algorithm, which is able to process 20 frames per second in practice. Moreover, the model is discriminatively learned from the large margin principle, and is formulated as an optimization problem with exponentially many constraints. To solve it efficiently, we present two different optimization algorithms, namely the cutting plane method and the bundle method, and demonstrate that each can be alternatively deployed in a "plug and play" fashion. On the theoretical side, we also analyze the generalization error of the proposed approach and provide a PAC-Bayes bound.
    International Journal of Computer Vision 01/2011; 93:22-32. · 3.62 Impact Factor
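    The Viterbi-like dynamic program is worth seeing explicitly: instead of per-frame states it scores whole segments, so the recursion ranges over segment lengths as well as labels. A hedged sketch, with segment-to-segment transition scores omitted and `score` an assumed segment scorer:

```python
import numpy as np

def semi_markov_viterbi(T, labels, max_len, score):
    """Best segmentation of frames [0, T) into labeled segments.
    score(y, s, t) is an assumed scorer for segment [s, t) with label y.
    Runs in O(T * max_len * |labels|)."""
    best = np.full(T + 1, -np.inf); best[0] = 0.0
    back = [None] * (T + 1)
    for t in range(1, T + 1):
        for length in range(1, min(max_len, t) + 1):
            s = t - length
            for y in labels:
                v = best[s] + score(y, s, t)
                if v > best[t]:
                    best[t], back[t] = v, (s, y)
    segments, t = [], T                   # trace back the best path
    while t > 0:
        s, y = back[t]
        segments.append((s, t, y))
        t = s
    return list(reversed(segments))
```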
  • Source
    ABSTRACT: Sponsored search is a three-way interaction between advertisers, users, and the search engine. The basic ad selection mechanism in sponsored search lets the advertiser choose the exact queries where the ad is to be shown. To increase advertising volume, many advertisers opt into advanced match, where the search engine can select additional queries that are deemed relevant for the advertiser's ad. In advanced match, the search engine is effectively bidding on behalf of the advertisers. While advanced match has been extensively studied in the literature from the ad relevance perspective, there is little work that discusses how to infer the appropriate bid value for a given advanced match. The bid value is crucial as it affects both the ad placement in revenue reordering and the amount advertisers are charged in case of a click. We propose a statistical approach to the bid generation problem and examine two information sources: the bidding behavior of advertisers, and the conversion data. Our key finding suggests that sophisticated advertisers' bids are driven by many factors beyond clicks and immediately measurable conversions, likely capturing the value chain of an ad display (views, clicks, profit margins, and so on), i.e., the total ROI from the advertising.
    Proceedings of the Fourth International Conference on Web Search and Web Data Mining, WSDM 2011, Hong Kong, China, February 9-12, 2011; 01/2011
  • Source
    ABSTRACT: Object matching is a fundamental operation in data analysis. It typically requires the definition of a similarity measure between the classes of objects to be matched. Instead, we develop an approach which is able to perform matching by requiring a similarity measure only within each of the classes. This is achieved by maximizing the dependency between matched pairs of observations by means of the Hilbert-Schmidt Independence Criterion. This problem can be cast as one of maximizing a quadratic assignment problem with special structure and we present a simple algorithm for finding a locally optimal solution.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 11/2010. · 4.80 Impact Factor
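    The dependence measure at the heart of the method is short to compute. Below, the (biased) empirical HSIC from two kernel matrices, plus a brute-force matcher over permutations to make the objective concrete; the paper instead attacks the structured quadratic assignment problem with a local algorithm, so the brute force is illustrative only.

```python
import numpy as np
from itertools import permutations

def hsic(K, L):
    """Biased empirical Hilbert-Schmidt Independence Criterion."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def match_bruteforce(K, L):
    """Permutation maximizing HSIC(K, L[pi, pi]); feasible only for tiny n."""
    n = K.shape[0]
    return max(permutations(range(n)),
               key=lambda p: hsic(K, L[np.ix_(p, p)]))
```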
  • Source
    ABSTRACT: The goal of frequent subgraph mining is to detect subgraphs that frequently occur in a dataset of graphs. In classification settings, one is often interested in discovering discriminative frequent subgraphs, whose presence or absence is indicative of the class membership of a graph. In this article, we propose an approach to feature selection on frequent subgraphs, called CORK, that combines two central advantages. First, it optimizes a submodular quality criterion, which means that we can yield a near-optimal solution using greedy feature selection. Second, our submodular quality criterion can be integrated into gSpan, the state-of-the-art tool for frequent subgraph mining, and help to prune the search space for discriminative frequent subgraphs even during frequent subgraph mining.
    Statistical Analysis and Data Mining 09/2010; 3(5):302 - 318.
  • ABSTRACT: Background / Purpose: There is a growing need for adapting data analysis and machine learning methods to graphs, as expressive graph representations are becoming increasingly popular in areas as diverse as chemo- and bioinformatics, natural language processing, or program flow analysis. Defining a graph kernel makes it possible to apply a whole spectrum of kernel machine learning algorithms (Smola 2002) to graphs. These kernels have to respect the structure and node/edge labels of graphs and, importantly, they have to be efficient to compute in order to be applicable to large graphs. While the lack of efficiency of graph kernels remained the computational bottleneck for several years, Shervashidze and Borgwardt recently proposed a highly scalable graph kernel, called the Weisfeiler-Lehman kernel. The key idea of this kernel is to count nodes whose sets of neighbors match exactly, and to iteratively apply this procedure to the neighbors of the neighbors of nodes and so on. The Weisfeiler-Lehman kernel was shown to be superior to state-of-the-art graph kernels in terms of runtime and competitive in terms of accuracy on several chemo- and bioinformatics classification benchmark datasets. However, the fact that the Weisfeiler-Lehman kernel counts only exactly-matching node neighborhoods may affect its performance in certain applications. In fact, in application domains where edges and nodes and their labels are noisy or false measurements occur, hardly any neighborhoods might match exactly. Main conclusion: Based on the Weisfeiler-Lehman algorithm (Weisfeiler and Lehman 1968), the recent Weisfeiler-Lehman kernel (Shervashidze and Borgwardt 2009), and an efficient approximation of the Jaccard coefficient (Gibson 2005), we propose a scalable subtree kernel that allows approximate neighborhood matching.
    Intelligent Systems for Molecular Biology 2010 meeting; 08/2010
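    The relabeling loop this work builds on is compact. Below, a hedged sketch of exact Weisfeiler-Lehman label refinement; the kernel then counts matching labels between two graphs, and the approximate (Jaccard-based) neighborhood matching proposed here would relax the exact signature comparison. Data-structure choices are illustrative.

```python
def wl_refine(adj, labels, iterations):
    """Iterated WL relabeling. adj: node -> list of neighbors;
    labels: node -> initial (e.g. atom-type) label. Returns the label
    dictionary after each iteration."""
    history = [dict(labels)]
    for _ in range(iterations):
        # Signature = own label plus sorted multiset of neighbor labels.
        sigs = {v: (labels[v], tuple(sorted(labels[u] for u in adj[v])))
                for v in adj}
        # Compress signatures into fresh integer labels.
        codebook = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        labels = {v: codebook[s] for v, s in sigs.items()}
        history.append(dict(labels))
    return history
```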
  • Source
    ABSTRACT: Ranking a set of retrieved documents according to their relevance to a given query has become a popular problem at the intersection of web search, machine learning, and information retrieval. Recent work on ranking focused on a number of different paradigms, namely, pointwise, pairwise, and list-wise approaches. Each of those paradigms focuses on a different aspect of the dataset while largely ignoring others. The current paper shows how a combination of them can lead to improved ranking performance and, moreover, how it can be implemented in log-linear time. The basic idea of the algorithm is to use isotonic regression with adaptive bandwidth selection per relevance grade. This results in an implicitly-defined loss function which can be minimized efficiently by a subgradient descent procedure. Experimental results show that the resulting algorithm is competitive on both commercial search engine data and publicly available LETOR data sets.
    Proceedings of the Third International Conference on Web Search and Web Data Mining, WSDM 2010, New York, NY, USA, February 4-6, 2010; 01/2010
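    The isotonic regression at the algorithm's core is solvable in linear time by pool adjacent violators, sketched below; the adaptive bandwidth per relevance grade and the subgradient outer loop from the paper are not reproduced.

```python
def pav(y):
    """Pool Adjacent Violators: returns the nondecreasing sequence
    minimizing squared distance to y, in O(n)."""
    blocks = []                     # each block: (sum, count)
    for v in y:
        s, c = float(v), 1
        # Merge while the previous block's mean exceeds the new one's.
        while blocks and blocks[-1][0] / blocks[-1][1] > s / c:
            ps, pc = blocks.pop()
            s, c = s + ps, c + pc
        blocks.append((s, c))
    fit = []
    for s, c in blocks:             # expand block means back out
        fit.extend([s / c] * c)
    return fit
```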

Publication Stats

13k Citations
61.70 Total Impact Points

Institutions

  • 2014
    • Monmouth University
      West Long Branch, New Jersey, United States
  • 2013
    • Carnegie Mellon University
      • Computer Science Department
      Pittsburgh, Pennsylvania, United States
  • 2007–2012
    • National ICT Australia Ltd
      Sydney, New South Wales, Australia
  • 2010
    • Yahoo! Labs
      Sunnyvale, California, United States
  • 2000–2006
    • Australian National University
      • Research School of Earth Sciences
      Canberra, Australian Capital Territory, Australia
  • 2005
    • ICT University
      Baton Rouge, Louisiana, United States
  • 2000–2005
    • Ruhr-Universität Bochum
      Bochum, North Rhine-Westphalia, Germany