Christopher J. C. Burges

Microsoft

About

96
Publications
98,175
Reads
41,968
Citations

Publications (96)
Patent
Aspects of the subject matter described herein relate to predicting and using search engine switching behavior. In aspects, switching components receive a representation of user interactions with at least one browser. The switching components derive information from the representation that is useful in predicting whether a user will switch search e...
Patent
Full-text available
The subject disclosure is directed towards automated processes for generating sentence completion questions based at least in part on a language model. Using the language model, a sentence is located, and alternates for a focus word (or words) in the sentence are automatically provided. Also described is automated filtering candidate sentences to l...
Patent
Full-text available
Described is a technology for identifying sample data items (e.g., documents corresponding to query-URL pairs) having the greatest likelihood of being mislabeled when previously judged, and selecting those data items for re-judging. In one aspect, lambda gradient scores (information associated with ranked sample data items that indicates a relative...
Article
Full-text available
We explore the idea of using a "possibilistic graphical model" as the basis for a world model that drives a dialog system. As a first step we have developed a system that uses text-based dialog to derive a model of the user's family relations. The system leverages its world model to infer relational triples, to learn to recover from upstream corefe...
Patent
A spam detection system is disclosed. The system includes a classifier training component that receives a first set of training pages labeled as normal pages and a second set of training pages labeled as spam pages. The training component trains a web page classifier based on both the first set of training pages and the second set of training pages...
Patent
The claimed subject matter provides a system and/or a method that facilitates generating sorted search results for a query. An interface component can receive a query in a first language. A first ranker can be trained from a portion of data related to a second language. A second ranker can correspond to the first language, wherein the second ranker...
Patent
Full-text available
Described herein is a system that includes a receiver component that receives first scores for training points and second scores for the training points, wherein the first scores are individually assigned to the training points by a first ranker component and the second scores are individually assigned to the training points by a second ranker comp...
Article
Full-text available
This paper studies the problem of sentence-level semantic coherence by answering SAT-style sentence completion questions. These questions test the ability of algorithms to distinguish sense from nonsense based on a variety of sentence-level phenomena. We tackle the problem with two approaches: methods that use local lexical information, such as the...
Article
We describe a method, “Shortest Path Segmentation” (SPS), which combines dynamic programming and a neural net recognizer for segmenting and recognizing character strings. We describe the application of this method to two problems: recognition of handwritten ZIP Codes, and recognition of handwritten words. For the ZIP Codes, we also used the method...
Conference Paper
Full-text available
We investigate the problem of learning to rank with document retrieval from the perspective of learning for multiple objective functions. We present solutions to two open problems in learning to rank: first, we show how multiple measures can be combined into a single graded measure that can be learned. This solves the problem of learning from a 'sc...
Article
Full-text available
We describe the system that won Track 1 of the Yahoo! Learning to Rank Challenge.
Article
Work on modeling semantics in text is progressing quickly, yet there are few existing public datasets which authors can use to measure and compare their systems. This work takes a step towards addressing this issue. We present the MSR Sentence Completion Challenge Data, which consists of 1,040 sentences, each of which has four impostor sentences, i...
Article
Full-text available
We propose a discriminative approach for automatically training chatbots to provide relevant and interesting responses. In contrast to most prior work, our approach is not based on hard-wiring response rules, but rather relies on machine learning. We set ourselves the task of ranking a repository of responses to find the most suitable response. Th...
Chapter
Full-text available
We give a tutorial overview of several geometric methods for feature extraction and dimensional reduction. We divide the methods into projective methods and methods that model the manifold on which the data lies. For projective methods, we review projection pursuit, principal component analysis (PCA), kernel PCA, probabilistic PCA, and oriented PCA; an...
Article
Full-text available
We present a new ranking algorithm that combines the strengths of two previous methods: boosted tree classification, and LambdaRank, which has been shown to be empirically optimal for a widely used information retrieval measure. Our algorithm is based on boosted regression trees, although the ideas apply to any weak learners, and it is significantl...
Article
Full-text available
We give a tutorial overview of several geometric methods for dimension reduction. We divide the methods into projective methods and methods that model the manifold on which the data lies. For projective methods, we review projection pursuit, principal component analysis (PCA), kernel PCA, probabilistic PCA, canonical correlation analysis, oriente...
Article
Full-text available
LambdaMART is the boosted tree version of LambdaRank, which is based on RankNet. RankNet, LambdaRank, and LambdaMART have proven to be very successful algorithms for solving real-world ranking problems: for example, an ensemble of LambdaMART rankers won Track 1 of the 2010 Yahoo! Learning To Rank Challenge. The details of these algorithms are spre...
Conference Paper
Full-text available
Despite the widespread use of BM25, there have been few studies examining its effectiveness on a document description over single and multiple field combinations. We determine the effectiveness of BM25 on various document fields. We find that BM25 models relevance on popularity fields such as anchor text and query click information no better th...
Conference Paper
Full-text available
A machine learning approach to rank learning trains a model to optimize a target evaluation measure with respect to training data. Currently, existing information retrieval measures are impossible to optimize directly except for models with a trivial number of parameters. The IR community thus faces a major challenge: how to optimize IR measures of...
Conference Paper
Full-text available
As the amount of information on the Web grows, the ability to retrieve relevant information quickly and easily is necessary. The combination of ample news sources on the Web, little time to browse news, and smaller mobile devices motivates the development of automatic highlight extraction from single news articles. Our system, NetSum, is the fi...
Article
Full-text available
We cast the ranking problem as (1) multiple classification (2) multiple ordinal classification, which lead to computationally tractable learning algorithms for relevance ranking in Web search. We consider the DCG criterion (discounted cumulative gain), a standard quality measure in information retrieval. Our approach is motivated by the fact that...
Article
We present a new ranking algorithm that combines the strengths of two previous methods: boosted tree classification, and LambdaRank, which has been shown to be empirically optimal for a widely used information retrieval measure. The algorithm is based on boosted regression trees, although the ideas apply to any weak learners, and it is significantl...
Chapter
Papers from the 2006 flagship meeting on neural computation, with contributions from physicists, neuroscientists, mathematicians, statisticians, and computer scientists. The annual Neural Information Processing Systems (NIPS) conference is the flagship meeting on neural computation and machine learning. It draws a diverse group of attendees—physici...
Conference Paper
Full-text available
We consider spectral clustering and transductive inference for data with multiple views. A typical example is the web, which can be described by either the hyperlinks between web pages or the words occurring in web pages. When each view is represented as a graph, one may convexly combine the weight matrices or the discrete Laplacians for each gra...
Conference Paper
Full-text available
We present a new approach to automatic summarization based on neural nets, called NetSum. We extract a set of features from each sentence that helps identify its importance in the document. We apply novel features based on news search query logs and Wikipedia entities. Using the RankNet learning algorithm, we train a pair-based sentence ranker to...
Conference Paper
Full-text available
We cast the ranking problem as (1) multiple classification (“Mc”) (2) multiple ordinal classification, which lead to computationally tractable learning algorithms for relevance ranking in Web search. We consider the DCG criterion (discounted cumulative gain), a standard quality measure in information retrieval. Our approach is motivated by the fact...
Chapter
Full-text available
An overview of the problem of learning to rank data is given. Some current machine learning approaches to the problem are described. The cost functions used to assess the quality of a ranking algorithm present particular difficulties: they are non-differentiable (as a function of the scores output by the ranker) and multivariate (in the sense that th...
Chapter
A comprehensive review of an area of machine learning that deals with the use of unlabeled data in classification problems: state-of-the-art algorithms, a taxonomy of the field, applications, benchmark experiments, and directions for future research. In the field of machine learning, semi-supervised learning (SSL) occupies the middle ground, betwee...
Conference Paper
Full-text available
The quality measures used in information retrieval are particularly difficult to optimize directly, since they depend on the model scores only through the sorted order of the documents returned for a given query. Thus, the derivatives of the cost with respect to the model parameters are either zero, or are undefined. In this paper, we propose a cla...
Conference Paper
Full-text available
The growing libraries of multimedia objects have increased the need for applications that facilitate search, browsing, discovery, recommendation and playlist construction. Many of these applications in turn require some notion of distance between, or similarity of, such objects. The lack of a reliable proxy for similarity of entities is a serious o...
Chapter
Full-text available
Applications such as audio fingerprinting require search in high dimensions: find an item in a database that is similar to a query. An important property of this search task is that negative answers are very frequent: much of the time, a query does not correspond to any database item. We propose Redundant Bit Vectors (RBVs): a novel method for qui...
Conference Paper
Full-text available
Audio fingerprinting is a powerful tool for identifying file-based or streaming audio, using a database of fingerprints. The paper presents two new applications of audio fingerprinting: duplicate detection, whose goal is to identify duplicate audio clips in a set, even if they differ in compression quality or duration, and thumbnail generation, whi...
Conference Paper
Full-text available
We investigate using gradient descent methods for learning ranking functions; we propose a simple probabilistic cost function, and we introduce RankNet, an implementation of these ideas using a neural network to model the underlying ranking function. We present test results on toy data and on data from a commercial internet search engine.
Article
Full-text available
Recently graph-based algorithms, in which nodes represent data points and links encode similarities, have become popular for semi-supervised learning. In this chapter we introduce a general probabilistic formulation called 'Conditional Harmonic Mixing', in which the links are directed, a conditional probability matrix is associated with each li...
Conference Paper
Full-text available
In this paper, we introduce a new framework for speech detection using convolutional networks. We propose a network architecture that can incorporate long and short-term temporal and spectral correlations of speech in the detection process. The proposed design is able to address many shortcomings of existing speech detectors in a unified new fram...
Article
Full-text available
We describe a prototype system capable of extracting machine print addresses from fax images of English language business letters and fax cover sheets. The system automatically orients incoming page images, locates and parses machine printed addresses, and classifies each address as one of {sender, recipient, other}. We present results of preliminary p...
Conference Paper
Full-text available
Applications such as audio fingerprinting require search in high dimensions: find an item in a database that is similar to a query. An important property of this search task is that negative answers are very frequent: much of the time, a query does not correspond to any database item. We propose Redundant Bit Vectors (RBVs): a novel method for quic...
Article
Full-text available
This paper addresses the problem of quickly performing point queries against high-dimensional regions.
Article
Full-text available
Images typically contain smooth regions, which are easily compressed by linear transforms, and high activity regions (edges, textures), which are harder to compress. To compress the first kind, we use a "zero" encoder that has infinite context, very low capacity, and which adapts very quickly to the content. For the second, we use an "interpolation...
Article
In this paper, we describe RARE (Robust Audio Recognition Engine): a system for identifying audio streams and files. RARE can be used in a variety of applications: from enhancing the consumer listening experience to cleaning large audio databases. RARE was designed with two key qualities in mind: robustness to distortion of the audio, and lookup sp...
Article
This paper collects some ideas targeted at advancing our understanding of the feature spaces associated with Support Vector (SV) kernel functions. We first discuss the geometry of feature space. In particular, we review what is known about the shape of the image of input space under the feature space map, and how this influences the capacity of SV meth...
Article
This paper has three goals: to find uniqueness theorems for some well-known kernel methods, and thereby to better understand their behavior; to derive theorems which apply to more general families of kernel methods; and to collect results which are hoped to be useful to workers who would like to prove uniqueness theorems for their own algorithms. K...
Article
Mapping audio data to feature vectors for the classification, retrieval or identification tasks presents four principal challenges. The dimensionality of the input must be significantly reduced; the resulting features must be robust to likely distortions of the input; the features must be informative for the task at hand; and the feature extraction...
Article
Mapping audio data to feature vectors for the classification, retrieval or identification tasks presents four principal challenges. The dimensionality of the input must be significantly reduced; the resulting features must be robust to likely distortions of the input; the features must be informative for the task at hand; and the feature extraction...
Article
In this work we present a novel method for global optimization, exploiting the mathematics of quantum mechanics, and in particular the tunnelling phenomenon.
Conference Paper
In this paper, we describe RARE (Robust Audio Recognition Engine): a system for identifying audio streams and files. RARE can be used in a variety of applications: from enhancing the consumer listening experience to cleaning large audio databases. RARE was designed with two key qualities in mind: robustness to distortion of the audio, and lookup sp...
Conference Paper
Full-text available
This chapter describes Lagrange multipliers and some selected subtopics from matrix analysis from a machine learning perspective. The goal is to give a detailed description of a number of mathematical constructions that are widely used in applied machine learning.
Article
The Support Vector (SV) machine is a novel type of learning machine, based on statistical learning theory, which contains polynomial classifiers, neural networks, and radial basis function (RBF) networks as special cases. In the RBF case, the SV algorithm automatically determines centers, weights and threshold such as to minimize an upper bound on...
Article
Full-text available
This paper presents AutoDJ: a system for automatically generating music playlists based on one or more seed songs selected by a user. AutoDJ uses Gaussian Process Regression to learn a user preference function over songs. This function takes music metadata as inputs. This paper further introduces Kernel Meta-Training, which is a method of learning...
Article
Full-text available
this paper presents a system that reduces the input dimensionality by a factor of approximately 8,000. Second, the resulting features must be robust to likely distortions of the input: for example, most radio stations introduce nonlinear distortions into the signal before broadcasting. Third, the resulting features must be informative: for audio id...
Article
Full-text available
We explore the use of neural networks to predict wavelet coefficients for image compression. We show that by reducing the variance of the residual coefficients, the nonlinear prediction can be used to reduce the length of the compressed bitstream. We report results on several network architectures and training methodologies; some pitfalls of the ap...
Conference Paper
Full-text available
This paper presents AutoDJ: a system for automatically generating music playlists based on one or more seed songs selected by a user. AutoDJ uses Gaussian Process Regression to learn a user preference function over songs. This function takes music metadata as inputs. This paper further introduces Kernel Meta-Training, which is a method of learnin...
Article
We give necessary and sufficient conditions for uniqueness of the support vector solution for the problems of pattern recognition and regression estimation, for a general class of cost functions. We show that if the solution is not unique, all support vectors are necessarily at bound, and we give some simple examples of non-unique solutions. We not...
Article
Full-text available
We show that the recently proposed variant of the Support Vector machine (SVM) algorithm, known as ν-SVM, can be interpreted as a maximal separation between subsets of the convex hulls of the data, which we call soft convex hulls. The soft convex hulls are controlled by choice of the parameter ν. If the intersection of the convex hulls is empty, the...
Article
Kernel-based learning methods provide their solutions as expansions in terms of a kernel. We consider the problem of reducing the computational complexity of evaluating these expansions by approximating them using fewer terms. As a by-product, we point out a connection between clustering and approximation in reproducing kernel Hilbert spaces gene...
Article
Full-text available
This paper collects some ideas targeted at advancing our understanding of the feature spaces associated with support vector (SV) kernel functions. We first discuss the geometry of feature space. In particular, we review what is known about the shape of the image of input space under the feature space map, and how this influences the capacity of SV...
Article
Two view-based object recognition algorithms are compared: (1) a heuristic algorithm based on oriented filters, and (2) a support vector learning machine trained on low-resolution images of the objects. Classification performance is assessed using a high number of images generated by a computer graphics system under precisely controlled condition...
Conference Paper
An important aspect of distinctive feature based approaches to automatic speech recognition is the formulation of a framework for robust detection of these features. We discuss the application of the support vector machines (SVM) that arise when the structural risk minimization principle is applied to such feature detection problems. In particular,...
Article
We show that the recently proposed variant of the Support Vector machine (SVM) algorithm, known as ν-SVM, can be interpreted as a maximal separation between subsets of the convex hulls of the data, which we call soft convex hulls. The soft convex hulls are controlled by choice of the parameter ν. If the intersection of the convex hulls is empty, th...
Article
Full-text available
this paper should not be used as an indication of the quality of the method. The primary weakness of the MPM approaches is that they have not been guided by statistical learning theory. In the problems investigated in this chapter, altering MPM methods to include principles of statistical learning theory almost always improved generalization. Many...
Article
The tutorial starts with an overview of the concepts of VC dimension and structural risk minimization. We then describe linear Support Vector Machines (SVMs) for separable and non-separable data, working through a non-trivial example in detail. We describe a mechanical analogy, and discuss when SVM solutions are unique and when they are global. We...
Article
Full-text available
The tutorial starts with an overview of the concepts of VC dimension and structural risk minimization. We then describe linear Support Vector Machines (SVMs) for separable and non-separable data, working through a non-trivial example in detail. We describe a mechanical analogy, and discuss when SVM solutions are unique and when they are global. We...
Article
Full-text available
The support vector (SV) machine is a novel type of learning machine, based on statistical learning theory, which contains polynomial classifiers, neural networks, and radial basis function (RBF) networks as special cases. In the RBF case, the SV algorithm automatically determines centers, weights, and threshold that minimize an upper bound on the e...
Article
Full-text available
Support Vector Learning Machines (SVM) are finding application in pattern recognition, regression estimation, and operator inversion for ill-posed problems. Against this very general backdrop, any methods for improving the generalization performance, or for improving the speed in test phase, of SVMs are of increasing interest. In this paper we comb...
Article
Full-text available
A Support Vector Machine (SVM) is a universal learning machine whose decision surface is parameterized by a set of support vectors, and by a set of corresponding weights. An SVM is also characterized by a kernel function. Choice of the kernel determines whether the resulting SVM is a polynomial classifier, a two-layer neural network, a radial basis...
Conference Paper
Full-text available
Two view-based object recognition algorithms are compared: (1) a heuristic algorithm based on oriented filters, and (2) a support vector learning machine trained on low-resolution images of the objects. Classification performance is assessed using a high number of images generated by a computer graphics system under precisely controlled conditions....
Conference Paper
A Support Vector Machine (SVM) is a universal learning machine whose decision surface is parameterized by a set of support vectors and by a set of corresponding weights. An SVM is also characterized by a kernel function. Choice of the kernel determines whether the resulting SVM is a polynomial classifier, a two-layer neural network, a radial basis fun...
Conference Paper
Full-text available
A new regression technique based on Vapnik's concept of support vectors is introduced. We compare support vector regression (SVR) with a committee regression technique (bagging) based on regression trees and ridge regression done in feature space. On the basis of these experiments, it is expected that SVR will have advantages in high dimensionality...
Article
Full-text available
We introduce a new approach for on-line recognition of handwritten words written in unconstrained mixed style. The preprocessor performs a word-level normalization by fitting a model of the word structure using the EM algorithm. Words are then coded into low resolution "annotated images" where each pixel contains information about trajectory direct...
Conference Paper
Full-text available
We describe an image analysis system for handling complex and noisy images of forms and bank documents, such as business checks, personal checks, or bank deposits. Some of these document types have no standardized layout, requiring a careful analysis of the whole image, to find out where the relevant information, for example the courtesy amount, is...
Article
Full-text available
We present a feed-forward network architecture for recognizing an unconstrained handwritten multi-digit string. This is an extension of previous work on recognizing isolated digits. In this architecture a single digit recognizer is replicated over the input. The output layer of the network is coupled to a Viterbi alignment module that chooses the b...
Conference Paper
Full-text available
Character Recognition has served as one of the principal proving grounds for neural-net methods and has emerged as one of the most successful applications of this technology. This chapter outlines optical character recognition document analysis systems developed at AT&T Bell Labs that combine the strengths of machine-learning algorithms with high-s...
Article
Full-text available
We have constructed a system for recognizing multi-character images. This is a nontrivial extension of our previous work on single-character images. It is somewhat surprising that a very good single-character recognizer does not in general form a good basis for a multi-character recognizer. The correct solution depends on three key ideas: 1) A...
Article
We describe a method, “Shortest Path Segmentation” (SPS), which combines dynamic programming and a neural net recognizer for segmenting and recognizing character strings. We describe the application of this method to two problems: recognition of handwritten ZIP Codes, and recognition of handwritten words. For the ZIP Codes, we also used the method...
Article
Full-text available
A neural network algorithm-based system that reads handwritten ZIP codes appearing on real US mail is described. The system uses a recognition-based segmenter, that is a hybrid of connected-components analysis (CCA), vertical cuts, and a neural network recognizer. Connected components that are single digits are handled by CCA. CCs that are combined...
Conference Paper
Full-text available
The authors describe a method which combines dynamic programming and a neural network recognizer for segmenting and recognizing character strings. The method selects the optimal consistent combination of cuts from a set of candidate cuts generated using heuristics. The optimal segmentation is found by representing the image, the candidate segments,...
Article
We present a feed-forward network architecture for recognizing an unconstrained handwritten multi-digit string. This is an extension of previous work on recognizing isolated digits. In this architecture a single digit recognizer is replicated over the input. The output layer of the network is coupled to a Viterbi alignment module that chooses the best...
Conference Paper
Full-text available
The authors outline OCR (optical character recognition) technology developed at AT&T Bell Laboratories, including a recognition network that learns feature extraction kernels and a custom VLSI chip that is designed for neural-net image processing. It is concluded that both high speed and high accuracy can be obtained using neural-net methods for ch...
Article
We consider the Wess-Zumino model in a background anti-de Sitter space-time in four dimensions (AdS4). We show that the naive generators of the O(3, 2) isometry group, obtained by the Noether method, must be improved by adding surface terms. The improved generators have a manifestly positive definite energy density, are conserved, and have vanishi...
Article
This paper collects some ideas targeted at advancing our understanding of the feature spaces associated with Support Vector (SV) kernel functions. We first discuss the geometry of feature space. In particular, we review what is known about the shape of the image of input space under the feature space map, and how this influences the capacity of SV me...
Article
Full-text available
One shortfall of existing machine learning (ML) methods when applied to information retrieval (IR) is the inability to directly optimize for typical IR performance measures. This is in part due to the discrete nature, and thus non-differentiability, of these measures. When cast as an optimization problem, many methods require computing the gradie...
Article
Full-text available
One shortfall of existing machine learning (ML) methods when applied to information retrieval (IR) is the inability to directly optimize for typical IR performance measures. This is in part due to the discrete nature, and thus non-differentiability, of these measures. When cast as an optimization problem, many methods require computing the gradient...
Article
Full-text available
The Laplace-Beltrami operator for graphs has been widely used in many machine learning issues, such as spectral clustering and transductive inference. Functions on the nodes of a graph with vanishing Laplacian are called harmonic functions. In differential geometry, the Laplace-de Rham operator generalizes the Laplace-Beltrami operator. It is...
Article
The quality measures used in information retrieval are particularly difficult to optimize directly, since they depend on the model scores only through the sorted order of the documents returned for a given query. Thus, the derivatives of the cost with respect to the model parameters are either zero, or are undefined. In this paper, we propose a cla...
Article
The factoring of biprimes is proposed as a framework for exploring unconstrained optimization algorithms. A mapping from a given factoring problem to a positive degree four polynomial F is described. F has the properties that the global minima, which specify the factors, are at F = 0, and that all coefficients can be chosen to be of order unity. Th...