Conference Paper

Indexing for Fast Categorisation.

Abstract

Automatic categorisation is an important technique for the management of large document collections. Categorisation can be used to store or locate documents that satisfy an information need when the need cannot be expressed as a concise list of query terms. Inverted indexes are used in all query-based retrieval systems to allow efficient query processing. In this paper, we propose the application of inverted indexes to categorisation with the aim of developing a fast, scalable, and accurate approach. Specifically, we propose successful variants of inverted indexing to reduce index size: first, quantisation of term-category weights; second, compression of the quantised weights; and, last, storing only those weights that significantly impact the categorisation process. We show that our techniques permit fast, accurate categorisation: index size is reduced by orders of magnitude compared to conventional inverted indexing and the accuracy of categorisation is preserved.
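
The size reductions named in the abstract lend themselves to a short illustration. The sketch below is not the authors' implementation; the weight source, the significance threshold, and the choice of 256 bins are assumptions made for the example. It builds an inverted index mapping each term to a postings list of (category, weight) pairs, drops weights below the threshold, and quantises the survivors into equal-sized bins so that each weight fits in a single 8-bit integer.

```python
# Hedged sketch only: prune term-category weights with little impact, then
# quantise the survivors into 256 equal-sized bins so each fits in one byte.
# The weight matrix, threshold, and bin count are illustrative assumptions.
from collections import defaultdict

def build_pruned_quantised_index(term_category_weights, threshold=0.01, bins=256):
    """term_category_weights: dict term -> dict category -> float weight."""
    weights = [w for cats in term_category_weights.values() for w in cats.values()]
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / bins or 1.0                 # width of one quantisation bin

    index = defaultdict(list)                      # term -> [(category, 8-bit code), ...]
    for term, cats in term_category_weights.items():
        for category, weight in sorted(cats.items()):
            if abs(weight) < threshold:            # prune: weight barely affects scores
                continue
            code = min(int((weight - lo) / step), bins - 1)
            index[term].append((category, code))
    return index, lo, step

def dequantise(code, lo, step):
    """Recover an approximate weight (the bin midpoint) from its 8-bit code."""
    return lo + (code + 0.5) * step
```

A document can then be categorised by traversing the postings of its terms and accumulating the dequantised weights per category, much as a ranked query would be evaluated.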

... Sparse inference: Earlier research has applied inverted indices for reducing the classification times for K-nearest Neighbours [Yang, 1994] and Centroid [Shanks et al., 2003]. The same reductions are gained for computing posterior probabilities for linearly interpolated language models in information retrieval [Hiemstra, 1998, Zhai and Lafferty, 2001b]. ...
... Representing a collection of high-dimensional sparse data can be done with an inverted index [Zobel and Moffat, 2006], enabling scalable retrieval of documents as well as other types of inference [Yang, 1994, Shanks et al., 2003, Kudo and Matsumoto, 2003, Puurula, 2012a]. The scalability of modern web search engines is largely due to the representation of web pages using inverted indices [Witten et al., 1994, Zobel and Moffat, 2006]. ...
... However, the time complexity for posterior inference can be substantially reduced by taking into account sparsity in the parameters. Earlier work with inverted indices has shown that classifier scores for the Centroid classifier [Shanks et al., 2003] and K-nearest Neighbours [Yang, 1994] can be computed as a function of sparsity. The posteriors for MNB with uniform priors can be computed similarly [Hiemstra, 1998, Zhai and Lafferty, 2001b], for both Jelinek-Mercer [Hiemstra, 1998] and other basic smoothing methods [Zhai and Lafferty, 2001b]. ...
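
The excerpts above describe the core of sparse inference: when category (or centroid) weight vectors are stored in an inverted index, the cost of scoring a document depends only on the terms it actually contains and on the postings those terms carry, not on the product of vocabulary size and number of categories. The following is a minimal illustration of that idea for a linear, centroid-style classifier; the toy index and weights are invented for the example.

```python
# Toy illustration of sparse scoring over an inverted index: categories whose
# postings are never touched by the document's terms cost nothing to score.
from collections import defaultdict

# term -> list of (category, weight); only non-zero weights are stored
inverted_index = {
    "inverted": [("information retrieval", 0.9), ("databases", 0.4)],
    "kernel":   [("machine learning", 0.8), ("operating systems", 0.7)],
    "index":    [("information retrieval", 0.7), ("databases", 0.6)],
}

def score_document(term_frequencies, index):
    """Dot product of the document with every category vector, computed by
    walking only the postings lists of the document's terms."""
    scores = defaultdict(float)
    for term, tf in term_frequencies.items():
        for category, weight in index.get(term, ()):
            scores[category] += tf * weight
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(score_document({"inverted": 2, "index": 1}, inverted_index))
# [('information retrieval', 2.5), ('databases', 1.4)] -- other categories never visited
```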
Article
Full-text available
The information age has brought a deluge of data. Much of this is in text form, insurmountable in scope for humans and incomprehensible in structure for computers. Text mining is an expanding field of research that seeks to utilize the information contained in vast document collections. General data mining methods based on machine learning face challenges with the scale of text data, posing a need for scalable text mining methods. This thesis proposes a solution to scalable text mining: generative models combined with sparse computation. A unifying formalization for generative text models is defined, bringing together research traditions that have used formally equivalent models, but ignored parallel developments. This framework allows the use of methods developed in different processing tasks such as retrieval and classification, yielding effective solutions across different text mining tasks. Sparse computation using inverted indices is proposed for inference on probabilistic models. This reduces the computational complexity of the common text mining operations according to sparsity, yielding probabilistic models with the scalability of modern search engines. The proposed combination provides sparse generative models: a solution for text mining that is general, effective, and scalable. Extensive experimentation on text classification and ranked retrieval datasets is conducted, showing that the proposed solution matches or outperforms the leading task-specific methods in effectiveness, with an order of magnitude decrease in classification times for Wikipedia article categorization with a million classes. The developed methods were further applied in two 2014 Kaggle data mining prize competitions with over a hundred competing teams, earning first and second places.
... We have previously considered efficiency and scalability issues for categorisation. In particular, we have investigated how category information can be compactly represented in the inverted index structure that is used in all practical query-based text retrieval systems [24]. In this previous work, we have shown that indexes can be used for fast categorisation. ...
... In practice, indexes are stored compactly using integer compression techniques [22, 30, 32]. We have previously proposed the adaptation of inverted indexes to permit fast categorisation [24] using the Rocchio weight learning technique [17]. Other approaches to implementing fast categorisation store category vectors that are derived during the training process [9, 14, 31]. ...
... In addition, we compress both category number differences and weights using integer compression schemes [6]. The result of this approach is a compact index that permits fast, accurate categorisation: when floating point values are quantised to 256 equal-sized bins and represented as 8-bit integers, the index for the 312 MB collection is just over 5 MB in size, the categorisation accuracy is the same as that of the floating-point approach over three accuracy measures, and categorisation speed is around 20 times faster than that of the floating-point-based index [24]. However, despite these excellent results, the index construction process would require that the entire collection be held in main-memory during training with an SVM. ...
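
The excerpt above mentions storing category-number differences and quantised weights with integer compression schemes. As a hedged illustration of that general technique (not necessarily the scheme used in [6]), the sketch below gap-encodes sorted category identifiers and writes both streams with a standard variable-byte code; a real index would also record the postings-list length so the two streams can be separated again.

```python
# Hedged sketch of postings compression: sorted category ids become small gaps,
# and gaps plus 8-bit quantised weights are written with a variable-byte code.

def vbyte_encode(numbers):
    """Variable-byte encode a sequence of non-negative integers."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)       # 7 payload bits, continuation implied
            n >>= 7
        out.append(n | 0x80)           # final byte carries the stop bit
    return bytes(out)

def vbyte_decode(data):
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:                # stop bit: the current number is complete
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= (byte & 0x7F) << shift
            shift += 7
    return numbers

def compress_postings(postings):
    """postings: list of (category_id, quantised_weight) pairs, sorted by category_id."""
    gaps, weights, previous = [], [], 0
    for category, weight in postings:
        gaps.append(category - previous)   # small differences compress well
        weights.append(weight)
        previous = category
    return vbyte_encode(gaps) + vbyte_encode(weights)
```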
Conference Paper
Categorisation is a useful method for organising documents into subcollections that can be browsed or searched to more accurately and quickly meet information needs. On the Web, category-based portals such as Yahoo! and DMOZ are extremely popular: DMOZ is maintained by over 56,000 volunteers, is used as the basis of the popular Google directory, and is perhaps used by millions of users each day. Support Vector Machines (SVM) is a machine-learning algorithm which has been shown to be highly effective for automatic text categorisation. However, a problem with iterative training techniques such as SVM is that during their learning or training phase, they require the entire training collection to be held in main-memory; this is infeasible for large training collections such as DMOZ or large news wire feeds. In this paper, we show how inverted indexes can be used for scalable training in categorisation, and propose novel heuristics for a fast, accurate, and memory efficient approach. Our results show that an index can be constructed on a desktop workstation with little effect on categorisation accuracy compared to a memory-based approach. We conclude that our techniques permit automatic categorisation using very large training collections, vocabularies, and numbers of categories.
... The dictionary of keys is an efficient representation for model training, but for classification an even more efficient sparse representation exists. The inverted index forms the core technique of modern information retrieval, but surprisingly it has been proposed for classification use only recently [6]. An inverted index can be used to access the multinomials, consisting of a vector κ of label lists called postings lists κ_n. ...
Conference Paper
Full-text available
Machine learning technology faces challenges in handling "Big Data": vast volumes of online data such as web pages, news stories and articles. A dominant solution has been parallelization, but this does not make the tasks less challenging. An alternative solution is using sparse computation methods to fundamentally change the complexity of the processing tasks themselves. This can be done by using both the sparsity found in natural data and sparsified models. In this paper we show that sparse representations can be used to reduce the time complexity of generative classifiers to build fundamentally more scalable classifiers. We reduce the time complexity of Multinomial Naive Bayes classification with sparsity and show how to extend these findings into three multi-label extensions: Binary Relevance, Label Powerset and Multi-label Mixture Models. To provide competitive performance we provide the methods with smoothing and pruning modifications and optimize model meta-parameters using direct search optimization. We report on classification experiments on 5 publicly available datasets for large-scale multi-label classification. All three methods scale easily to the largest available tasks, with training times measured in seconds and classification times in milliseconds, even with millions of training documents, features and classes. The presented sparse modeling techniques should be applicable to many other classifiers, providing the same types of fundamental complexity reductions when applied to large scale tasks.
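
The inverted index over multinomial parameters described in the excerpt above, together with the sparse Multinomial Naive Bayes classification in this abstract, can be illustrated with Jelinek-Mercer smoothing: the smoothed background part of p(w|c) is identical for every class, so only terms with a non-zero maximum-likelihood estimate in a class contribute a class-specific correction, and only those corrections need to be stored in the index. The sketch below is a simplified reading of that idea, not the paper's exact algorithm; the parameter names and the smoothing weight are assumptions.

```python
# Hedged sketch of sparse MNB scoring with Jelinek-Mercer smoothing:
# p(w|c) = (1 - lam) * p_ml(w|c) + lam * p_bg(w). The background part is
# class-independent, so ranking classes only needs the sparse corrections.
import math
from collections import defaultdict

def build_sparse_mnb_index(p_ml, p_bg, lam=0.5):
    """p_ml: dict term -> dict class -> maximum-likelihood p(w|c), non-zero entries only.
    p_bg: dict term -> background probability p(w)."""
    index = defaultdict(list)              # term -> [(class, log correction), ...]
    for term, per_class in p_ml.items():
        for c, p in per_class.items():
            correction = math.log(1.0 + (1.0 - lam) * p / (lam * p_bg[term]))
            index[term].append((c, correction))
    return index

def rank_classes(term_frequencies, index, log_priors):
    """Rank classes by log prior plus the corrections gathered from the postings
    of the document's terms; the shared background term is omitted because it
    does not change the ranking."""
    scores = dict(log_priors)
    for term, tf in term_frequencies.items():
        for c, correction in index.get(term, ()):
            scores[c] += tf * correction
    return sorted(scores.items(), key=lambda kv: -kv[1])
```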
Article
Entity relationship search at Web scale depends on adding dozens of entity annotations to each of billions of crawled pages and indexing the annotations at rates comparable to regular text indexing. Even small entity search benchmarks from TREC and INEX suggest that the entity catalog support thousands of entity types and tens to hundreds of millions of entities. The above targets raise many challenges, major ones being the design of highly compressed data structures in RAM for spotting and disambiguating entity mentions, and highly compressed disk-based annotation indices. These data structures cannot be readily built upon standard inverted indices. Here we present a Web scale entity annotator and annotation index. Using a new workload-sensitive compressed multilevel map, we fit statistical disambiguation models for millions of entities within 1.15GB of RAM, and spend about 0.6 core-milliseconds per disambiguation. In contrast, DBPedia Spotlight spends 158 milliseconds, Wikipedia Miner spends 21 milliseconds, and Zemanta spends 9.5 milliseconds. Our annotation indices use ideas from vertical databases to reduce storage by 30%. On 40x8 cores with 40x3 disk spindles, we can annotate and index, in about a day, a billion Web pages with two million entities and 200,000 types from Wikipedia. Index decompression and scan speed are comparable to MG4J.
Article
Full-text available
This article describes the porting and optimization of an explicit, time-dependent, computational fluid dynamics code on an 8,192-node MasPar MP-1. The MasPar is a very fine-grained, single instruction, multiple data parallel computer. The code uses the flux-corrected transport algorithm. We describe the techniques used to port and optimize the code, and the behavior of a test problem. The test problem used to benchmark the flux-corrected transport code on the MasPar was a two-dimensional exploding shock with periodic boundary conditions. We discuss the performance that our code achieved on the MasPar, and compare its performance on the MasPar with its performance on other architectures. The comparisons show that the performance of the code on the MasPar is slightly better than on a CRAY Y-MP for a functionally equivalent, optimized two-dimensional code.
Article
Full-text available
The Waikato Environment for Knowledge Analysis (Weka) is a comprehensive suite of Java class libraries that implement many state-of-the-art machine learning and data mining algorithms. Weka is freely available on the World-Wide Web and accompanies a new text on data mining [1] which documents and fully explains all the algorithms it contains. Applications written using the Weka class libraries can be run on any computer with a Web browsing capability; this allows users to apply machine learning techniques to their own data regardless of computer platform. Tools are provided for pre-processing data, feeding it into a variety of learning schemes, and analyzing the resulting classifiers and their performance. An important resource for navigating through Weka is its on-line documentation, which is automatically generated from the source. The primary learning methods in Weka are classifiers, and they induce a rule set or decision tree that models the data.
Conference Paper
In this paper, we compare learning techniques based on statistical classification to traditional methods of relevance feedback for the document routing problem. We consider three classification techniques which have decision rules that are derived via explicit error minimization: linear discriminant analysis, logistic regression, and neural networks. We demonstrate that the classifiers perform 10-15% better than relevance feedback via Rocchio expansion for the TREC-2 and TREC-3 routing tasks. Error minimization is difficult in high-dimensional feature spaces because the convergence process is slow and the models are prone to overfitting. We use two different strategies, latent semantic indexing and optimal term selection, to reduce the number of features. Our results indicate that features based on latent semantic indexing are more effective for techniques such as linear discriminant analysis and logistic regression, which have no way to protect against overfitting. Neural networks perform equally well with either set of features and can take advantage of the additional information available when both feature sets are used as input.