Conference Paper

Expected Similarity Estimation for Large Scale Anomaly Detection


Abstract

We propose a new algorithm named EXPected Similarity Estimation (EXPoSE) to approach the problem of anomaly detection (also known as one-class learning or outlier detection), which is based on the similarity between data points and the distribution of non-anomalous data. We formulate the problem as an inner product in a reproducing kernel Hilbert space, for which we present approximations that allow its application to very large-scale datasets. More precisely, given a dataset with n instances, our proposed method requires O(n) training time, O(1) time to make a prediction, and only O(1) memory to store the learned model. Despite its abstract derivation, our algorithm is simple and parameter-free. We show on seven real datasets that our approach can compete with state-of-the-art algorithms for anomaly detection.
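For intuition only, the following minimal sketch mirrors the complexity claims above with an explicit approximate feature map; random Fourier features for a Gaussian kernel stand in for φ, and all function names and parameter values are illustrative rather than the authors' implementation.

```python
# Hedged sketch of the EXPoSE idea with an explicit approximate feature map.
import numpy as np

def make_phi(d, n_features=1000, sigma=1.0, seed=0):
    # Random Fourier features approximating a Gaussian kernel (illustrative choice).
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return lambda X: np.sqrt(2.0 / n_features) * np.cos(np.atleast_2d(X) @ W + b)

def fit(X_train, phi):
    # O(n) training: the model is the mean of the mapped points (the empirical embedding).
    return phi(X_train).mean(axis=0)

def score(Y, phi, mu):
    # O(1) memory for the model and O(1) time per query: a single dot product.
    return phi(Y) @ mu

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 8))                 # "normal" training data
phi = make_phi(d=8)
mu = fit(X, phi)
print(score(np.vstack([X[:3], X[:3] + 10]), phi, mu))   # shifted rows score lower
```

Higher scores indicate greater similarity to the distribution of regular data.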


... The EXPected Similarity Estimation (EXPoSE) algorithm was proposed to solve the problem of large-scale anomaly detection, where the dataset sizes are too immense to be processed by standard techniques [26]. As explained later in detail, the EXPoSE anomaly detection classifier η(y) = ⟨φ(y), μ[P]⟩ calculates a score (the likelihood of y belonging to the class of normal data) using the inner product between a feature map φ and the kernel mean map μ[P] of the distribution P (Fig. 1). ...
... The following datasets, which all have purposely very different feature characteristics, are used to evaluate anomaly detection algorithms. We refer to [26] for a more detailed description of the datasets and their feature characteristic. ...
... • KDDCUP is an intrusion detection dataset which contains 4 898 431 connection records of network traffic. As in [26] we rescale the 34 continuous features to [0, 1] and apply a binary encoding for the 7 symbolic features. • The third dataset contains 600 000 instances of the Google Street View House Numbers (SVHN) [19] where we use the Histogram of Oriented Gradients (HOG) with a cell size of 3 to get a 2592-dimensional feature vector. ...
Conference Paper
Full-text available
A new algorithm named EXPected Similarity Estimation (EXPoSE) was recently proposed to solve the problem of large-scale anomaly detection. It is a non-parametric and distribution-free kernel method based on the Hilbert space embedding of probability measures. Given a dataset of n samples, EXPoSE takes O(n) time to build a model and O(1) time per prediction. In this work we describe and analyze a simple and effective stochastic optimization algorithm which allows us to drastically reduce the learning time of EXPoSE from linear to constant. It is crucial that this approach allows us to determine the number of iterations based on a desired accuracy, independent of the dataset size n. We will show that the proposed stochastic gradient descent algorithm works in general (possibly infinite-dimensional) Hilbert spaces, is easy to implement and requires no additional step-size parameters.
... EXPected Similarity Estimation (EXPoSE) was recently proposed to solve the problem of large-scale anomaly detection, where the number of training samples n and the dimension of the data d are too high for most other algorithms [SEP15]. Here, "anomaly detection refers to the problem of finding patterns in data that do not conform to expected behavior. ...
... where k : X × X → R is a reproducing kernel. Intuitively speaking, the query point y is compared to all other points of the distribution P. It can be shown [SEP15] that this equation can be rewritten as an inner product between the feature map φ(y) and the kernel embedding µ[P] of P as ...
... For sufficiently large sample sizes n we can expect µ[P_n] to be a good proxy for µ[P] by the law of large numbers. Besides the convergence of the model w_t → µ[P], we will examine and compare the anomaly detection scores η_n(y) = ⟨φ(y), µ[P_n]⟩ and η_t(y) = ⟨φ(y), w_t⟩ calculated by the empirical distribution (which is the original EXPoSE predictor proposed in [SEP15]) and the stochastic optimization approximation, respectively. ...
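For reference, writing out the two scores quoted above with the empirical embedding gives (a restatement in the excerpt's own notation, not new material):

$$\eta_n(y) = \big\langle \phi(y), \mu[\mathbb{P}_n] \big\rangle = \frac{1}{n}\sum_{i=1}^{n} k(y, x_i),
\qquad
\eta_t(y) = \big\langle \phi(y), w_t \big\rangle .$$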
Article
Full-text available
A new algorithm named EXPected Similarity Estimation (EXPoSE) was recently proposed to solve the problem of large-scale anomaly detection. It is a non-parametric and distribution-free kernel method based on the Hilbert space embedding of probability measures. Given a dataset of n samples, EXPoSE needs only O(n) (linear time) to build a model and O(1) (constant time) to make a prediction. In this work we improve the linear computational complexity and show that an ε-accurate model can be estimated in constant time, which has significant implications for large-scale learning problems. To achieve this goal, we cast the original EXPoSE formulation into a stochastic optimization problem. It is crucial that this approach allows us to determine the number of iterations based on a desired accuracy ε, independent of the dataset size n. We will show that the proposed stochastic gradient descent algorithm works in general (possibly infinite-dimensional) Hilbert spaces, is easy to implement and requires no additional step-size parameters.
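A minimal sketch of the constant-time idea in a finite-dimensional (approximate) feature space, under the assumption that the stochastic update amounts to a running average toward the embedding; the papers' exact algorithm and analysis cover general Hilbert spaces and are not reproduced here.

```python
# Hedged sketch: iterating toward the kernel mean embedding without a step-size parameter.
# With weight 1/t the iterate w_t is the running mean of phi(x_1), ..., phi(x_t), so the
# number of iterations is set by the desired accuracy rather than by the dataset size n.
import numpy as np

def stochastic_embedding(draw_phi_x, n_iter):
    w = None
    for t in range(1, n_iter + 1):
        z = draw_phi_x()                      # phi(x_t) for a randomly drawn sample x_t
        w = z if w is None else w + (z - w) / t
    return w

rng = np.random.default_rng(0)
phi_samples = rng.normal(loc=0.5, size=(100000, 50))     # stand-ins for mapped samples
w = stochastic_embedding(lambda: phi_samples[rng.integers(len(phi_samples))], n_iter=2000)
print(np.linalg.norm(w - phi_samples.mean(axis=0)))       # already close to the empirical mean
```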
... SSPs induce a sinc kernel function [20], and sinc kernels have been found to be efficient kernels for kernel density estimators [21], [22]. Vector representations that induce kernels can be used to make memory- and time-efficient kernel density estimators, as in the EXPoSE algorithm [23]. But where Schneider et al. [23] made a KDE, we are combining SSPs with Bayesian Linear Regression to approximate Gaussian Process regression. ...
Preprint
Full-text available
Mutual information (MI) is a standard objective function for driving exploration. The use of Gaussian processes to compute information gain is limited by time and memory complexity that grows with the number of observations collected. We present an efficient implementation of MI-driven exploration by combining vector symbolic architectures with Bayesian Linear Regression. We demonstrate equivalent regret performance to a GP-based approach with memory and time complexity that is constant in the number of samples collected, as opposed to t^2 and t^3, respectively, enabling long-term exploration.
... In this section, we construct a two-layer prediction framework which transforms X_t into the classification of whether the node will fail within time interval (t, t + H], where H is the prespecified length of time, as is shown in Figure 1. The two-layer architecture consists of one anomaly detection layer and one classification layer. The anomaly detection layer contains a real-time anomaly detector, EXPoSE [32]. For each monitoring sequence, the detector generates an anomaly score vector S_i = (s_{i1}, s_{i2}, . . . ...
... (3) The anomaly score quantifying the anomalous level is the base of interpretability in the second layer. The EXPected Similarity Estimation (EXPoSE) computes the similarity between new data and the distribution of historical normal data [32]. Anomaly scores are represented by the opposite side of similarity. ...
Article
Full-text available
In recent years, the distributed architecture has been widely adopted by security companies with the rapid expansion of their business. A distributed system is comprised of many computing nodes of different components which are connected by high-speed communication networks. With the increasing functionality and complexity of the systems, failures of nodes are inevitable which may result in considerable loss. In order to identify anomalies of the possible failures and enable DevOps engineers to operate in advance, this paper proposes a two-layer prediction architecture based on the monitoring sequences of nodes status. Generally speaking, in the first layer, we make use of EXPoSE anomaly detection technique to derive anomaly scores in constant time which are then used as input data for ensemble learning in the second layer. Experiments are conducted on the data provided by one of the largest security companies, and the results demonstrate the predictability of the proposed approach.
... Because the dot product distributes over the summation, we can rewrite it as a sum of dot products. EXPoSE (Schneider et al., 2016; Schneider, 2017), for example, uses RFFs for fast anomaly detection in large data streams in finite memory. Fourier features have also been applied to Gaussian Process Regression (Rahimi et al., 2007; Mutnỳ and Krause, 2019). ...
Article
Full-text available
Distributed vector representations are a key bridging point between connectionist and symbolic representations in cognition. It is unclear how uncertainty should be modelled in systems using such representations. In this paper we discuss how bundles of symbols in certain Vector Symbolic Architectures (VSAs) can be understood as defining an object that has a relationship to a probability distribution, and how statements in VSAs can be understood as being analogous to probabilistic statements. The aim of this paper is to show how (spiking) neural implementations of VSAs can be used to implement probabilistic operations that are useful in building cognitive models. We show how similarity operators between continuous values represented as Spatial Semantic Pointers (SSPs), an example of a technique known as fractional binding, induce a quasi-kernel function that can be used in density estimation. Further, we sketch novel designs for networks that compute entropy and mutual information of VSA-represented distributions and demonstrate their performance when implemented as networks of spiking neurons. We also discuss the relationship between our technique and quantum probability, another technique proposed for modelling uncertainty in cognition. While we restrict ourselves to operators proposed for Holographic Reduced Representations and to representing real-valued data, we suggest that the methods presented in this paper should translate to any VSA where the dot product between fractionally bound symbols induces a valid kernel.
... 7 we empirically compare EXPoSE with several state-of-the-art anomaly detectors. This is an extended and revised version of a preliminary conference report that was presented at the International Joint Conference on Neural Networks 2015 (Schneider et al. 2015). This work reviews the EXPoSE anomaly detection algorithm and provides a derivation that makes fewer assumptions on the input space and kernel function. ...
Article
Full-text available
We present a novel algorithm for anomaly detection on very large datasets and data streams. The method, named EXPected Similarity Estimation (EXPoSE), is kernel-based and able to efficiently compute the similarity between new data points and the distribution of regular data. The estimator is formulated as an inner product with a reproducing kernel Hilbert space embedding and makes no assumption about the type or shape of the underlying data distribution. We show that offline (batch) learning with EXPoSE can be done in linear time and online (incremental) learning takes constant time per instance and model update. Furthermore, EXPoSE can make predictions in constant time, while it requires only constant memory. In addition we propose different methodologies for concept drift adaptation on evolving data streams. On several real datasets we demonstrate that our approach can compete with state-of-the-art algorithms for anomaly detection while being significantly faster than techniques with the same discriminant power.
Chapter
Machine learning dominates research and applications in AI today. In this introductory chapter we present some simple but important learning algorithms, together with key concepts and methods.
Article
The development of new data analytical methods remains a crucial factor in the combat against insurance fraud. Methods rooted in the research field of anomaly detection are considered promising candidates for this purpose. Commonly, a fraud data set contains both numeric and nominal attributes, where, due to the ease of expressiveness, the latter often encodes valuable expert knowledge. For this reason, an anomaly detection method should be able to handle a mixture of different data types, returning an anomaly score meaningful in the context of the business application. We propose the iForestCAD approach that computes conditional anomaly scores, useful for fraud detection. More specifically, anomaly detection is performed conditionally on well-defined data partitions that are created on the basis of selected numeric attributes and distinct combinations of values of selected nominal attributes. In this way, the resulting anomaly scores are computed with respect to a reference group of interest, thus representing a meaningful score for domain experts. Given that anomaly detection is performed conditionally, this approach allows detecting anomalies that would otherwise remain undiscovered in unconditional anomaly detection. Moreover, we present a case study in which we demonstrate the usefulness of our proposed approach on real-world workers' compensation claims received from a large European insurance organization. The iForestCAD approach was well received by domain experts for its effective detection of fraudulent claims.
Article
Full-text available
Advances in computation and the fast and cheap computational facilities now available to statisticians have had a significant impact upon statistical research, and especially the development of nonparametric data analysis procedures. In particular, theoretical and applied research on nonparametric density estimation has had a noticeable influence on related topics, such as nonparametric regression, nonparametric discrimination, and nonparametric pattern recognition. This article reviews recent developments in nonparametric density estimation and includes topics that have been omitted from review articles and books on the subject. The early density estimation methods, such as the histogram, kernel estimators, and orthogonal series estimators are still very popular, and recent research on them is described. Different types of restricted maximum likelihood density estimators, including order-restricted estimators, maximum penalized likelihood estimators, and sieve estimators, are discussed, where restrictions are imposed upon the class of densities or on the form of the likelihood function. Nonparametric density estimators that are data-adaptive and lead to locally smoothed estimators are also discussed; these include variable partition histograms, estimators based on statistically equivalent blocks, nearest-neighbor estimators, variable kernel estimators, and adaptive kernel estimators. For the multivariate case, extensions of methods of univariate density estimation are usually straightforward but can be computationally expensive. A method of multivariate density estimation that did not spring from a univariate generalization is described, namely, projection pursuit density estimation, in which both dimensionality reduction and density estimation can be pursued at the same time. Finally, some areas of related research are mentioned, such as nonparametric estimation of functionals of a density, robust parametric estimation, semiparametric models, and density estimation for censored and incomplete data, directional and spherical data, and density estimation for dependent sequences of observations.
Article
Full-text available
Data domain description concerns the characterization of a data set. A good description covers all target data but includes no superfluous space. The boundary of a dataset can be used to detect novel data or outliers. We will present the Support Vector Data Description (SVDD) which is inspired by the Support Vector Classifier. It obtains a spherically shaped boundary around a dataset and analogous to the Support Vector Classifier it can be made flexible by using other kernel functions. The method is made robust against outliers in the training set and is capable of tightening the description by using negative examples. We show characteristics of the Support Vector Data Descriptions using artificial and real data.
Conference Paper
Full-text available
For many KDD applications, such as detecting criminal activities in E-commerce, finding the rare instances or the outliers can be more interesting than finding the common patterns. Existing work in outlier detection regards being an outlier as a binary property. In this paper, we contend that for many scenarios, it is more meaningful to assign to each object a degree of being an outlier. This degree is called the local outlier factor (LOF) of an object. It is local in that the degree depends on how isolated the object is with respect to the surrounding neighborhood. We give a detailed formal analysis showing that LOF enjoys many desirable properties. Using real-world datasets, we demonstrate that LOF can be used to find outliers which appear to be meaningful but can otherwise not be identified with existing approaches. Finally, a careful performance evaluation of our algorithm confirms that our approach of finding local outliers can be practical.
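For orientation only, a hedged usage example with the scikit-learn implementation of LOF (not code from this paper; the neighborhood size and data are arbitrary).

```python
# Hedged example: local outlier factor scores via scikit-learn.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)), [[6.0, 6.0]]])   # one clearly isolated point

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                    # -1 flags outliers, +1 inliers
degree = -lof.negative_outlier_factor_         # larger values = higher degree of outlierness
print(labels[-1], degree[-1])
```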
Conference Paper
Full-text available
If a simple and fast solution for one-class classification is required, the most common approach is to assume a Gaussian distribution for the patterns of the single class. Bayesian classification then leads to a simple template matching. In this paper we show for two very different applications that the classification performance can be improved significantly if a more uniform subgaussian instead of a Gaussian class distribution is assumed. One application is face detection, the other is the detection of transcription factor binding sites on a genome. As for the Gaussian, the distance from a template, i.e., the distribution center, determines a pattern’s class assignment. However, depending on the distribution assumed, maximum likelihood learning leads to different templates from the training data. These new templates lead to significant improvements of the classification performance.
Article
Full-text available
Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. Outliers arise due to mechanical faults, changes in system behaviour, fraudulent behaviour, human error, instrument error or simply through natural deviations in populations. Their detection can identify system faults and fraud before they escalate with potentially catastrophic consequences. It can identify errors and remove their contaminating effect on the data set and as such to purify the data for processing. The original outlier detection methods were arbitrary but now, principled and systematic techniques are used, drawn from the full gamut of Computer Science and Statistics. In this paper, we introduce a survey of contemporary techniques for outlier detection. We identify their respective motivations and distinguish their advantages and disadvantages in a comparative review.
Article
Full-text available
to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.
Article
Full-text available
Suppose you are given some data set drawn from an underlying probability distribution P and you want to estimate a “simple” subset S of input space such that the probability that a test point drawn from P lies outside of S equals some a priori specified value between 0 and 1. We propose a method to approach this problem by trying to estimate a function f that is positive on S and negative on the complement. The functional form of f is given by a kernel expansion in terms of a potentially small subset of the training data; it is regularized by controlling the length of the weight vector in an associated feature space. The expansion coefficients are found by solving a quadratic programming problem, which we do by carrying out sequential optimization over pairs of input patterns. We also provide a theoretical analysis of the statistical performance of our algorithm. The algorithm is a natural extension of the support vector algorithm to the case of unlabeled data.
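A hedged usage example of this type of estimator as implemented in scikit-learn (the parameter choices and data are illustrative only, not taken from the paper).

```python
# Hedged example: one-class SVM with an RBF kernel via scikit-learn.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                        # unlabeled training data

f = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X)
# The decision function is positive on the estimated region S and negative on its complement.
print(f.decision_function([[0.0, 0.0], [5.0, 5.0]]))
```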
Article
Full-text available
Support vector machines (SVMs) construct decision functions that are linear combinations of kernel evaluations on the training set. The samples with non-vanishing coefficients are called support vectors. In this work we establish lower (asymptotical) bounds on the number of support vectors. On our way we prove several results which are of great importance for the understanding of SVMs. In particular, we describe to which "limit" SVM decision functions tend, discuss the corresponding notion of convergence and provide some results on the stability of SVMs using subdifferential calculus in the associated reproducing kernel Hilbert space.
Article
Full-text available
Various consistency proofs for the kernel density estimator have been developed over the last few decades. Important milestones are the pointwise consistency and almost sure uniform convergence with a fixed bandwidth on the one hand and the rate of convergence with a fixed or even a variable bandwidth on the other hand. While considering global properties of the empirical distribution functions is sufficient for strong consistency, proofs of exact convergence rates use deeper information about the underlying empirical processes. A unifying character, however, is that earlier and more recent proofs use bounds on the probability that a sum of random variables deviates from its mean.
Article
Full-text available
A representation and interpretation of the area under a receiver operating characteristic (ROC) curve obtained by the "rating" method, or by mathematical predictions based on patient characteristics, is presented. It is shown that in such a setting the area represents the probability that a randomly chosen diseased subject is (correctly) rated or ranked with greater suspicion than a randomly chosen non-diseased subject. Moreover, this probability of a correct ranking is the same quantity that is estimated by the already well-studied nonparametric Wilcoxon statistic. These two relationships are exploited to (a) provide rapid closed-form expressions for the approximate magnitude of the sampling variability, i.e., standard error that one uses to accompany the area under a smoothed ROC curve, (b) guide in determining the size of the sample required to provide a sufficiently reliable estimate of this area, and (c) determine how large sample sizes should be to ensure that one can statistically detect differences in the accuracy of diagnostic techniques.
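The ranking interpretation above can be checked numerically; a small sketch (with made-up scores) that estimates the area as the fraction of correctly ordered diseased/non-diseased pairs, ties counted as one half.

```python
# AUC as the probability that a randomly chosen positive outranks a randomly chosen negative.
import numpy as np

def auc_by_ranking(scores_pos, scores_neg):
    diff = np.subtract.outer(scores_pos, scores_neg)    # all (positive, negative) pairs
    return np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

pos = np.array([0.9, 0.8, 0.6, 0.55])                   # made-up ratings, diseased subjects
neg = np.array([0.7, 0.5, 0.4, 0.3, 0.2])               # made-up ratings, non-diseased subjects
print(auc_by_ranking(pos, neg))                         # matches the Wilcoxon-based estimate
```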
Article
Full-text available
The Minnesota Intrusion Detection System (MINDS), a data mining based system, can detect sophisticated cyberattacks on large-scale networks that cannot be detected using signature-based systems. At MINDS' core is a behavioral-anomaly detection module based on a novel data-driven technique for calculating the distance between points in high-dimensional space, enabling meaningful calculation of the similarity between records containing a mixture of categorical and numerical attributes. MINDS uses the shared nearest neighbor clustering algorithm, which works particularly well when data is high-dimensional and noisy. Its ability to summarize large amounts of network traffic can be highly valuable for network security analysts who must deal with large amounts of data.
Article
Full-text available
In this article we study the generalization abilities of several classifiers of support vector machine (SVM) type using a certain class of kernels that we call universal. It is shown that the soft margin algorithms with universal kernels are consistent for a large class of classification problems including some kind of noisy tasks provided that the regularization parameter is chosen well. In particular we derive a simple sufficient condition for this parameter in the case of Gaussian RBF kernels. On the one hand our considerations are based on an investigation of an approximation property---the so-called universality---of the used kernels that ensures that all continuous functions can be approximated by certain kernel expressions. This approximation property also gives a new insight into the role of kernels in these and other algorithms. On the other hand the results are achieved by a precise study of the underlying optimization problems of the classifiers. Furthermore, we show consistency for the maximal margin classifier as well as for the soft margin SVMs in the presence of large margins. In this case it turns out that also constant regularization parameters ensure consistency for the soft margin SVMs. Finally we prove that even for simple, noise-free classification problems SVMs with polynomial kernels can behave arbitrarily badly.
Article
Full-text available
When highly accurate and/or assumption-free density estimation is needed, nonparametric methods are often called upon - most notably the popular kernel density estimation (KDE) method. However, the practitioner is instantly faced with the formidable computational cost of KDE for appreciable dataset sizes, which becomes even more prohibitive when many models with different kernel scales (bandwidths) must be evaluated -- this is necessary for finding the optimal model, among other reasons. In previous work we presented an algorithm for fast KDE which addresses large dataset sizes and large dimensionalities, but assumes only a single bandwidth. In this paper we present a generalization of that algorithm allowing multiple models with different bandwidths to be computed simultaneously, in substantially less time than either running the single-bandwidth algorithm for each model independently, or running the standard exhaustive method. We show examples of computing the likelihood curve for 100,000 data points and 100 models ranging across 3 orders of magnitude in scale, in minutes or seconds.
Article
Full-text available
We develop a probability model over image spaces and demonstrate its broad utility in mammographic image analysis. The model employs a pyramid representation to factor images across scale and a tree-structured set of hidden variables to capture long-range spatial dependencies. This factoring makes the computation of the density functions local and tractable. The result is a hierarchical mixture of conditional probabilities, similar to a hidden Markov model on a tree. The model parameters are found with maximum likelihood estimation using the EM algorithm. The utility of the model is demonstrated for three applications: 1) detection of mammographic masses in computer-aided diagnosis, 2) qualitative assessment of model structure through mammographic synthesis, and 3) lossless compression of mammographic regions of interest.
Article
For many KDD applications, such as detecting criminal activities in E-commerce, finding the rare instances or the outliers can be more interesting than finding the common patterns. Existing work in outlier detection regards being an outlier as a binary property. In this paper, we contend that for many scenarios, it is more meaningful to assign to each object a degree of being an outlier. This degree is called the local outlier factor (LOF) of an object. It is local in that the degree depends on how isolated the object is with respect to the surrounding neighborhood. We give a detailed formal analysis showing that LOF enjoys many desirable properties. Using real-world datasets, we demonstrate that LOF can be used to find outliers which appear to be meaningful but can otherwise not be identified with existing approaches. Finally, a careful performance evaluation of our algorithm confirms that our approach of finding local outliers can be practical.
Article
LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users to easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
Book
With the increasing advances in hardware technology for data collection, and advances in software technology (databases) for data organization, computer scientists have increasingly participated in the latest advancements of the outlier analysis field. Computer scientists, specifically, approach this field based on their practical experiences in managing large amounts of data, and with far fewer assumptions: the data can be of any type, structured or unstructured, and may be extremely large. Outlier Analysis is a comprehensive exposition, as understood by data mining experts, statisticians and computer scientists. The book has been organized carefully, and emphasis was placed on simplifying the content, so that students and practitioners can also benefit. Chapters typically cover one of three areas: methods and techniques commonly used in outlier analysis, such as linear methods, proximity-based methods, subspace methods, and supervised methods; data domains, such as text, categorical, mixed-attribute, time-series, streaming, discrete sequence, spatial and network data; and key applications of these methods in diverse domains such as credit card fraud detection, intrusion detection, medical diagnosis, earth science, web log analytics, and social network analysis. © 2013 Springer Science+Business Media New York.
Article
Detecting and reading text from natural images is a hard computer vision task that is central to a variety of emerging applications. Related problems like document character recognition have been widely studied by computer vision and machine learning researchers and are virtually solved for practical applications like reading handwritten digits. Reliably recognizing characters in more complex scenes like photographs, however, is far more difficult: the best existing methods lag well behind human performance on the same tasks. In this paper we attack the problem of recognizing digits in a real application using unsupervised feature learning methods: reading house numbers from street level photos. To this end, we introduce a new benchmark dataset for research use containing over 600,000 labeled digits cropped from Street View images. We then demonstrate the difficulty of recognizing these digits when the problem is approached with hand-designed features. Finally, we employ variants of two recently proposed unsupervised feature learning methods and find that they are convincingly superior on our benchmarks.
Article
Despite their successes, what makes kernel methods difficult to use in many large scale problems is the fact that storing and computing the decision function is typically expensive, especially at prediction time. In this paper, we overcome this difficulty by proposing Fastfood, an approximation that accelerates such computation significantly. Key to Fastfood is the observation that Hadamard matrices, when combined with diagonal Gaussian matrices, exhibit properties similar to dense Gaussian random matrices. Yet unlike the latter, Hadamard and diagonal matrices are inexpensive to multiply and store. These two matrices can be used in lieu of Gaussian matrices in Random Kitchen Sinks proposed by Rahimi and Recht (2009) and thereby speeding up the computation for a large range of kernel functions. Specifically, Fastfood requires O(n log d) time and O(n) storage to compute n non-linear basis functions in d dimensions, a significant improvement from O(nd) computation and storage, without sacrificing accuracy. Our method applies to any translation invariant and any dot-product kernel, such as the popular RBF kernels and polynomial kernels. We prove that the approximation is unbiased and has low variance. Experiments show that we achieve similar accuracy to full kernel expansions and Random Kitchen Sinks while being 100x faster and using 1000x less memory. These improvements, especially in terms of memory usage, make kernel methods more practical for applications that have large training sets and/or require real-time prediction.
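A heavily simplified sketch of the structured projection described above (sign flips, Hadamard mixing, a permutation, and a Gaussian diagonal). It omits the paper's norm-correcting scaling matrix S and uses a dense Hadamard matrix instead of a fast transform, so it conveys only the structure of the construction, not its exact form or guarantees.

```python
# Hedged, simplified Fastfood-style projection: roughly (H G Pi H B) x / (sigma * sqrt(d)).
import numpy as np
from scipy.linalg import hadamard

def fastfood_like(X, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    d_pad = 1 << (d - 1).bit_length()                 # pad dimension to a power of two
    Xp = np.pad(X, ((0, 0), (0, d_pad - d)))
    H = hadamard(d_pad).astype(float)
    B = rng.choice([-1.0, 1.0], size=d_pad)           # random sign flips (diagonal B)
    P = rng.permutation(d_pad)                        # random permutation Pi
    G = rng.normal(size=d_pad)                        # Gaussian diagonal G
    V = (Xp * B) @ H                                  # H B x
    V = V[:, P] * G                                   # G Pi H B x
    V = V @ H                                         # H G Pi H B x
    return V / (sigma * np.sqrt(d_pad))

Z = fastfood_like(np.random.default_rng(1).normal(size=(4, 10)))
print(Z.shape)                                        # one block of d_pad projections per input
```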
Conference Paper
We introduce algorithms to visualize feature spaces used by object detectors. Our method works by inverting a visual feature back to multiple natural images. We found that these visualizations allow us to analyze object detection systems in new ways and gain new insight into the detector's failures. For example, when we visualize the features for high scoring false alarms, we discovered that, although they are clearly wrong in image space, they do look deceptively similar to true positives in feature space. This result suggests that many of these false alarms are caused by our choice of feature space, and supports that creating a better learning algorithm or building bigger datasets is unlikely to correct these errors. By visualizing feature spaces, we can gain a more intuitive understanding of recognition systems.
Article
The comparison of two treatments generally falls into one of the following two categories: (a) we may have a number of replications for each of the two treatments, which are unpaired, or (b) we may have a number of paired comparisons leading to a series of differences, some of which may be positive and some negative. The appropriate methods for testing the significance of the differences of the means in these two cases are described in most of the textbooks on statistical methods.
Article
LIBSVM is a library for support vector machines (SVM). Its goal is to help users to easily use SVM as a tool. In this document, we present all its implementation details. For the use of LIBSVM, the README file included in the package and the LIBSVM FAQ provide the information.
Article
Upper bounds are derived for the probability that the sum S of n independent random variables exceeds its mean ES by a positive number nt. It is assumed that the range of each summand of S is bounded or bounded above. The bounds for Pr {S – ES ≥ nt} depend only on the endpoints of the ranges of the summands and the mean, or the mean and the variance of S. These results are then used to obtain analogous inequalities for certain sums of dependent random variables such as U statistics and the sum of a random sample without replacement from a finite population.
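For convenience, the mean-only bound described here takes the familiar form, with each summand $X_i$ ranging in $[a_i, b_i]$ (a standard statement included for reference):

$$\Pr\{S - ES \ge nt\} \;\le\; \exp\!\left(-\,\frac{2 n^2 t^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right),$$

which, for identically bounded summands $X_i \in [a, b]$, reduces to $\exp\!\big(-2 n t^2 / (b - a)^2\big)$.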
Conference Paper
In this paper we unify divergence minimization and statistical inference by means of convex duality. In the process of doing so, we prove that the dual of approximate maximum entropy estimation is maximum a posteriori estimation as a special case. Moreover, our treatment leads to stability and convergence bounds for many statistical learning problems. Finally, we show how an algorithm by Zhang can be used to solve this class of optimization problems efficiently.
Article
We propose a framework for analyzing and comparing distributions, which we use to construct statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS), and is called the maximum mean discrepancy (MMD). We present two distribution-free tests based on large deviation bounds for the MMD, and a third test based on the asymptotic distribution of this statistic. The MMD can be computed in quadratic time, although efficient linear time approximations are available. Our statistic is an instance of an integral probability metric, and various classical metrics on distributions are obtained when alternative function classes are used in place of an RKHS. We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests. © 2012 Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf and Alexander Smola.
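A short, hedged sketch of the quadratic-time (biased) MMD² estimate with a Gaussian kernel; the kernel and bandwidth choices are illustrative only.

```python
# Hedged sketch: biased quadratic-time estimate of MMD^2 between two samples.
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_biased(X, Y, sigma=1.0):
    # MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)], estimated by plug-in means.
    return (rbf_gram(X, X, sigma).mean()
            + rbf_gram(Y, Y, sigma).mean()
            - 2 * rbf_gram(X, Y, sigma).mean())

rng = np.random.default_rng(0)
print(mmd2_biased(rng.normal(size=(300, 2)), rng.normal(size=(300, 2))))       # near zero
print(mmd2_biased(rng.normal(size=(300, 2)), rng.normal(size=(300, 2)) + 1))   # clearly positive
```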
Conference Paper
Support vector machines (SVMs) have been promising methods for classification and regression analysis because of their solid mathematical foundations, which convey several salient properties that other methods hardly provide. However, despite the prominent properties of SVMs, they are not as favored for large-scale data mining as for pattern recognition or machine learning because the training complexity of SVMs is highly dependent on the size of a data set. Many real-world data mining applications involve millions or billions of data records where even multiple scans of the entire data are too expensive to perform. This paper presents a new method, Clustering-Based SVM (CB-SVM), which is specifically designed for handling very large data sets. CB-SVM applies a hierarchical micro-clustering algorithm that scans the entire data set only once to provide an SVM with high quality samples that carry the statistical summaries of the data such that the summaries maximize the benefit of learning the SVM. CB-SVM tries to generate the best SVM boundary for very large data sets given a limited amount of resources. Our experiments on synthetic and real data sets show that CB-SVM is highly scalable for very large data sets while also generating high classification accuracy.
Conference Paper
To accelerate the training of kernel machines, we propose to map the input data to a randomized low-dimensional feature space and then apply existing fast linear methods. The features are designed so that the inner products of the transformed data are approximately equal to those in the feature space of a user-specified shift-invariant kernel. We explore two sets of random features, provide convergence bounds on their ability to approximate various radial basis kernels, and show that in large-scale classification and regression tasks linear machine learning algorithms applied to these features outperform state-of-the-art large-scale kernel machines.
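A minimal sketch of such randomized features for the Gaussian (RBF) kernel, where the inner product of the transformed points approximates the kernel value; the bandwidth and feature count are arbitrary illustrative choices.

```python
# Hedged sketch: random Fourier features so that z(x).z(y) ~ exp(-||x - y||^2 / (2 sigma^2)).
import numpy as np

def random_fourier_features(X, n_features=2000, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], n_features))   # spectral samples
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)                 # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

X = np.random.default_rng(1).normal(size=(2, 5))
Z = random_fourier_features(X)
print(Z[0] @ Z[1], np.exp(-np.sum((X[0] - X[1])**2) / 2.0))   # approximation vs. exact kernel
```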
Conference Paper
Randomized neural networks are immortalized in this well-known AI Koan: In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6. "What are you doing?" asked Minsky. "I am training a randomly wired neural net to play tic-tac-toe," Sussman replied. "Why is the net wired randomly?" asked Minsky. Sussman replied, "I do not want it to have any preconceptions of how to play." Minsky then shut his eyes. "Why do you close your eyes?" Sussman asked his teacher. "So that the room will be empty," replied Minsky. At that moment, Sussman was enlightened. We analyze shallow random networks with the help of concentration of measure inequalities. Specifically, we consider architectures that compute a weighted sum of their inputs after passing them through a bank of arbitrary randomized nonlinearities. We identify conditions under which these networks exhibit good classification performance, and bound their test error in terms of the size of the dataset and the number of random nonlinearities.
Conference Paper
We describe a technique for comparing distributions without the need for density estimation as an intermediate step. Our approach relies on mapping the distributions into a reproducing kernel Hilbert space. Applications of this technique can be found in two-sample tests, which are used for determining whether two sets of observations arise from the same distribution, covariate shift correction, local learning, measures of independence, and density estimation. Kernel methods are widely used in supervised learning [1, 2, 3, 4], however they are much less established in the areas of testing, estimation, and analysis of probability distributions, where information theoretic approaches [5, 6] have long been dominant. Recent examples include [7] in the context of construction of graphical models, [8] in the context of feature extraction, and [9] in the context of independent component analysis. These methods have by and large a common issue: to compute quantities such as the mutual information, entropy, or Kullback-Leibler divergence, we require sophisticated space partitioning and/or
Article
In this paper, we propose a computational model of the recognition of real world scenes that bypasses the segmentation and the processing of individual objects or regions. The procedure is based on a very low dimensional representation of the scene, that we term the Spatial Envelope. We propose a set of perceptual dimensions (naturalness, openness, roughness, expansion, ruggedness) that represent the dominant spatial structure of a scene. Then, we show that these dimensions may be reliably estimated using spectral and coarsely localized information. The model generates a multidimensional space in which scenes sharing membership in semantic categories (e.g., streets, highways, coasts) are projected close together. The performance of the spatial envelope model shows that specific information about object shape or identity is not a requirement for scene categorization and that modeling a holistic representation of the scene informs about its probable semantic category.
Article
A problem for many kernel-based methods is that the amount of computation required to find the solution scales as O(n^3), where n is the number of training examples. We develop and analyze an algorithm to compute an easily interpretable low-rank approximation to an n × n Gram matrix G such that computations of interest may be performed more rapidly. The approximation is of the form G̃_k = C W_k^+ C^T, where C is a matrix consisting of a small number c of columns of G and W_k is the best rank-k approximation to W, the matrix formed by the intersection between those c columns of G and the corresponding c rows of G. An important aspect of the algorithm is the probability distribution used to randomly sample the columns; we will use a judiciously chosen and data-dependent nonuniform probability distribution. Let ||·||_2 and ||·||_F denote the spectral norm and the Frobenius norm, respectively, of a matrix, and let G_k be the best rank-k approximation to G. We prove that by choosing O(k/ε^4) columns
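A hedged sketch of the approximation G̃_k = C W_k^+ C^T described above. Here columns are sampled without replacement with probabilities proportional to the squared diagonal entries, which stands in for (and may differ from) the paper's exact data-dependent sampling and rescaling scheme.

```python
# Hedged sketch: low-rank approximation of a Gram matrix from a few sampled columns.
import numpy as np

def low_rank_from_columns(G, c, k, seed=0):
    rng = np.random.default_rng(seed)
    p = np.diag(G)**2
    p = p / p.sum()                                    # data-dependent column probabilities
    idx = rng.choice(G.shape[0], size=c, replace=False, p=p)
    C = G[:, idx]                                      # c sampled columns of G
    W = G[np.ix_(idx, idx)]                            # intersection block
    U, s, Vt = np.linalg.svd(W)                        # W is symmetric PSD here
    s_inv = np.zeros_like(s)
    s_inv[:k] = np.divide(1.0, s[:k], out=np.zeros(k), where=s[:k] > 1e-12)
    Wk_pinv = (Vt.T * s_inv) @ U.T                     # pseudoinverse of the rank-k part W_k
    return C @ Wk_pinv @ C.T                           # G~_k = C W_k^+ C^T

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
G = np.exp(-((X[:, None, :] - X[None, :, :])**2).sum(-1))    # an RBF Gram matrix
print(np.linalg.norm(G - low_rank_from_columns(G, c=40, k=10)) / np.linalg.norm(G))
```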
Article
Large scale nonlinear support vector machines (SVMs) can be approximated by linear ones using a suitable feature map. The linear SVMs are in general much faster to learn and evaluate (test) than the original nonlinear SVMs. This work introduces explicit feature maps for the additive class of kernels, such as the intersection, Hellinger's, and χ2 kernels, commonly used in computer vision, and enables their use in large scale problems. In particular, we: 1) provide explicit feature maps for all additive homogeneous kernels along with closed form expression for all common kernels; 2) derive corresponding approximate finite-dimensional feature maps based on a spectral analysis; and 3) quantify the error of the approximation, showing that the error is independent of the data dimension and decays exponentially fast with the approximation order for selected kernels such as χ2. We demonstrate that the approximations have indistinguishable performance from the full kernels yet greatly reduce the train/test times of SVMs. We also compare with two other approximation methods: Nystrom's approximation of Perronnin et al., which is data dependent, and the explicit map of Maji and Berg for the intersection kernel, which, as in the case of our approximations, is data independent. The approximations are evaluated on a number of standard data sets, including Caltech-101, Daimler-Chrysler pedestrians, and INRIA pedestrians.
Conference Paper
CARDWATCH, a database mining system used for credit card fraud detection, is presented. The system is based on a neural network learning module, provides an interface to a variety of commercial databases and has a comfortable graphical user interface. Test results obtained for synthetically generated credit card data and an autoassociative neural network model show very successful fraud detection rates