Conference Paper

Fast categorisation of large document collections


... Our intuitive feature selection technique is to use a fixed number of terms extracted from the beginning of each document. This approach has been applied in non-hierarchical categorisation [19] but, to our knowledge, has not been investigated in hierarchical categorisation. Our approach is intuitive as it resembles the behaviour of a visitor searching for information at a library. ...
... This approach determines whether a document should be assigned to a category based on the computation of a linear function [11]. This approach is effective, and can be used efficiently on large scale datasets on general-purpose hardware [19]. ...
... Our techniques have the advantage that the features that are used from each document are chosen solely on the content of that document, that is, collection statistics are not used and do not need to be maintained. Our approach — which Shanks and Williams [19] refer to as first m words — is to extract as features the first fragment of each training document; in their approach, m is a constant. The rationale is that, in general, a summary of each document is present at its beginning and this is supported by their experimental results; in contrast, they show that the last words, middle words, and random words are not good representative summaries. ...
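The "first m words" idea quoted above can be illustrated with a minimal Python sketch. The tokenisation rule (lowercased alphanumeric runs) and the function name `first_m_words` are assumptions made for illustration; the cited work may tokenise, stem, or stop documents differently.

```python
import re


def first_m_words(text: str, m: int) -> list[str]:
    """Return the first m word tokens of a document as its feature list.

    Assumption: a "word" is a run of ASCII letters or digits, lowercased.
    No collection statistics are consulted; only this document is read.
    """
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    return tokens[:m]


doc = "Fast categorisation of large document collections is useful for retrieval."
print(first_m_words(doc, 5))
```

Because the features depend only on the document itself, nothing has to be recomputed when the collection grows, which is the advantage the snippet describes.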
Conference Paper
Categorisation of digital documents is useful for organisation and retrieval. While document categories can be a set of unstructured category labels, some document categories are hierarchically structured. This paper investigates automatic hierarchical categorisation and, specifically, the role of features in the development of more effective categorisers. We show that a good hierarchical machine learning-based categoriser can be developed using small numbers of features from pre-categorised training documents. Overall, we show that by using a few terms, categorisation accuracy can be improved substantially: unstructured leaf level categorisation can be improved by up to 8.6%, while top-down hierarchical categorisation accuracy can be improved by up to 12%. In addition, unlike other feature selection models, which typically require different feature selection parameters for categories at different hierarchical levels, our technique works equally well for all categories in a hierarchical structure. We conclude that, in general, more accurate hierarchical categorisation is possible by using our simple feature selection technique.
... Our secondary goals were efficiency, a similar goal to our recent categorisation research [7], and to experiment with various term extraction and profiling techniques to see which work better and why. ...
... The system we used was based on a Rocchio [6] routing categoriser we previously developed for fast document categorisation [7]. The system was developed in C and experiments were carried out on a Pentium III based server with 256 megabytes of main memory. ...
Article
This is RMIT's first year of participation in the TDT evaluation. Our system uses a linear classifier to track topics and an approach based on our previous work in document routing. We aimed this year to develop a baseline system, and to then test selected variations, including adaptive tracking. Our contribution this year is to have implemented an efficient system, that is, one that maximises tracking document throughput and minimises system overheads.
... The classification was actually conducted on the summaries of the web text documents, which are organized using a word-based approach. Shanks and Williams used only the first fragment of each document for their classification task [19]. However, this approach only works well for documents that present an overview of the whole document at the beginning. ...
... F-measure is often viewed as the harmonic mean of precision and recall. In addition, we have also included the value of the area under the ROC curve (AUC), generated by Weka [19]. Weka uses the Mann-Whitney statistic to calculate the AUC and ROC (receiver operating characteristic curve), where ROC ...
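The two measures mentioned in the snippet above can be written out in a few lines. This is a minimal sketch, not Weka's implementation: `f_measure` is the balanced F-measure (harmonic mean of precision and recall), and `auc_mann_whitney` computes the AUC as the fraction of (positive, negative) score pairs that are ranked correctly, counting ties as half, which is the Mann-Whitney formulation. Both function names are illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def auc_mann_whitney(pos_scores: list[float], neg_scores: list[float]) -> float:
    """AUC as the Mann-Whitney statistic normalised by the number of pairs.

    Each (positive, negative) pair counts 1 if the positive example scores
    higher, 0.5 on a tie, and 0 otherwise.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


print(f_measure(1.0, 0.5))
print(auc_mann_whitney([0.9, 0.8, 0.4], [0.7, 0.3]))
```

The pairwise loop is quadratic in the number of examples; a rank-based computation gives the same value faster, but the loop makes the definition easier to see.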
Conference Paper
Full-text available
Web pages are conventionally represented by the words found within their contents for classification purposes. However, word-based web page representation suffers from several limitations, such as synonymy and homonymy. Motivated by the limitations of word-based representation, we explore the potential of representing web pages using information extraction patterns, in addition to words that are identified within the web contents. In this paper, we share the results as well as the findings learned from our experiments. Our empirical study, conducted using the WebKB dataset, indicates that the addition of information extraction patterns in web page representation helps to improve classification precision, especially in categories that have highly diverse web content.
... Some simple approaches have been proven effective. For example, Shanks and Williams [174] showed that only using the first fragment of each document offers fast and accurate classification of news articles. This approach is based on an assumption that a summary is present at the beginning of each document, which is usually true for news articles, but does not always hold for other kinds of documents. ...
... This approach makes the assumption that the most important information and discriminating features are found near the beginning of the document. Shanks and Williams [112] were able to accurately classify text documents with this method of feature set reduction, while Wibowo and Williams [140] applied this approach to the hierarchical classification of Web pages. Kim and Ross [63,64,65,66] also followed this approach for one of the classifiers they investigated for the task of classifying documents in PDF by genre; the visual layout features for the classifier were extracted from only the first page of a PDF file when it was treated as an image. ...
Article
The extraordinary growth in both the size and popularity of the World Wide Web has created a growing interest not only in identifying Web page genres, but also in using these genres to classify Web pages. The hypothesis of this research is that an n-gram representation of a Web page can be used effectively to automatically classify that Web page by genre. This research involves the development and testing of a new model for the automatic identification of Web page genre; classification results using this model compare very favorably with those of other researchers.
... In this research work, we use a similarity-based linear categoriser that determines whether a document should be assigned to a category based on the computation of a linear function [10]. This approach is effective, and can be used efficiently on large scale datasets on general-purpose hardware [15]. Moreover, these categorisers are term-based, that is, tokens such as words are assigned weights that represent their importance in each category. ...
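A term-based linear categoriser of the kind described in the snippet above can be sketched as follows. The weights, the category names, and the zero threshold are hypothetical values chosen for illustration; how the weights are learned (for example by Rocchio or Widrow-Hoff training) is outside this sketch.

```python
def linear_score(doc_terms: list[str], weights: dict[str, float]) -> float:
    """Linear function of the document: the sum of the weights of its terms.

    Terms with no weight in the category profile contribute nothing.
    """
    return sum(weights.get(term, 0.0) for term in doc_terms)


def assign(doc_terms: list[str],
           category_weights: dict[str, dict[str, float]],
           threshold: float = 0.0) -> list[str]:
    """Assign the document to every category whose score exceeds the threshold."""
    return [
        category
        for category, weights in category_weights.items()
        if linear_score(doc_terms, weights) > threshold
    ]


# Hypothetical per-category term weights, for illustration only.
category_weights = {
    "sport": {"goal": 1.0, "match": 0.5, "bank": -0.5},
    "finance": {"bank": 1.0, "loan": 0.5, "goal": -0.25},
}
print(assign(["bank", "loan", "match"], category_weights))
```

Scoring a document costs one dictionary lookup per term per category, which is one reason such categorisers run efficiently on general-purpose hardware.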
Conference Paper
On the Web, browsing and searching categories is a popular method of finding documents. Two well-known category-based search systems are the Yahoo! and DMOZ hierarchies, which are maintained by experts who assign documents to categories. However, manual categorisation by experts is costly, subjective, and not scalable with the increasing volumes of data that must be processed. Several methods have been investigated for effective automatic text categorisation. These include selection of categorisation methods, selection of pre-categorised training samples, use of hierarchies, and selection of document fragments or features. In this paper, we further investigate categorisation into Web hierarchies and the role of hierarchical information in improving categorisation effectiveness. We introduce new strategies to reduce errors in hierarchical categorisation. In particular, we propose novel techniques that shift the assignment into higher level categories when lower level assignment is uncertain. Our results show that absolute error rates can be reduced by over 2%.
... In the current work, our primary interest was not Web page categorization itself, and other text categorization methods could be explored for use in X4. Techniques such as feature selection [36] might be used to improve both efficiency and accuracy. Problems associated with incremental crawling and dynamically changing content were not considered and should be examined by future work. ...
Article
A major concern in the implementation of a distributed Web crawler is the choice of a strategy for partitioning the Web among the nodes in the system. Our goal in selecting this strategy is to minimize the overlap between the activities of individual nodes. We propose a topic-oriented approach, in which the Web is partitioned into general subject areas with a crawler assigned to each. We examine design alternatives for a topic-oriented distributed crawler, including the creation of a Web page classifier for use in this context. The approach is compared experimentally with a hash-based partitioning, in which crawler assignments are determined by hash functions computed over URLs and page contents. The experimental evaluation demonstrates the feasibility of the approach, addressing issues of communication overhead, duplicate content detection, and page quality assessment.
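The hash-based baseline mentioned in the abstract above can be sketched in a few lines. This is an assumption-laden illustration: it hashes only the URL string (the abstract also mentions hashing page contents), and the choice of SHA-1 and the name `node_for_url` are not taken from the paper.

```python
import hashlib


def node_for_url(url: str, num_nodes: int) -> int:
    """Map a URL to a crawler node index in the range [0, num_nodes).

    A cryptographic digest is used so the assignment is stable across
    processes and machines, unlike Python's built-in hash() for strings.
    """
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


urls = ["http://example.org/a", "http://example.org/b", "http://example.com/"]
for url in urls:
    print(url, "->", node_for_url(url, 4))
```

Such a scheme spreads pages evenly but ignores subject matter, which is the contrast the topic-oriented partitioning is evaluated against.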
Chapter
A Web page, a kind of semi-structured document, includes a lot of additional attribute content besides text information. Traditional web page classification technology is mostly based on text classification methods, which ignore the additional attribute information of web page text. We propose WEB-GNN, an approach for Web page classification. There are two major contributions to this work. First, we propose a web page graph representation method called W2G that reconstructs text nodes into a graph representation based on text visual association relationships and DOM-tree hierarchy relationships, and realizes the efficient integration of web page content and structure. Our second contribution is a web page classification method based on a graph convolutional neural network. It takes the web page graph representation as input, integrates text features and structure features through a graph convolution layer, and generates an advanced web page feature representation. Experimental results on the Web-black dataset suggest that the proposed method significantly outperforms the text-only method.
Chapter
In this paper, we benchmark the efficiency of support vector machines (SVMs), in terms of classification accuracy and classification speed, against two other popular classification algorithms: decision tree and Naïve Bayes. We conduct the study on the 4-University data set, using 4-fold cross validation. The empirical results indicate that both SVMs and Naïve Bayes achieve comparable results in average precision and recall, while the decision tree ID3 algorithm outperforms the rest in average accuracy. Nevertheless, ID3 consumes the longest time in generating the classification model as well as classifying the web pages.
Conference Paper
Automatic categorization, in which documents are processed and automatically assigned to pre-defined categories, has been shown to be an accurate alternative to manual categorization. The accuracy of different methods for categorization has been studied extensively, but their efficiency has seldom been mentioned. Aiming to maintain effectiveness while improving efficiency, we proposed a fast algorithm for text categorization and a compressed document vector representation method based on a novel class space model. The experiments showed that our methods have better efficiency and tolerable effectiveness.
Article
Classification of web page content is essential to many tasks in web information retrieval such as maintaining web directories and focused crawling. The uncontrolled nature of web content presents additional challenges to web page classification as compared to traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process. As we review work in web page classification, we note the importance of these web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the use of information from neighboring pages.
Article
Full-text available
Recent developments in and widespread usage of the Internet have allowed business and processes to be completed faster and more easily in electronic media. The increasing size of the stored, transferred and processed data brings many problems that affect access to information on the Web. Because users need to access information in the electronic environment quickly, correctly and appropriately, different methods of classification and categorization of data are strictly needed. Millions of search engines should be supported with new approaches every day in order for users to get access to relevant information quickly. In this study, a Multilayered Perceptron (MLP) artificial neural network model is used to classify web sites according to specified subjects. Software is developed to select the feature vector, to train the neural network and finally to categorize the web sites correctly. It is considered that this intelligent approach will provide a more accurate and secure platform for Internet users to classify web contents precisely.
Conference Paper
Full-text available
As the WWW grows at an increasing speed, classifiers targeted at hypertext have come into high demand. While document categorization is quite mature, the issue of utilizing hypertext structure and hyperlinks has been relatively unexplored. In this paper, we propose a practical method for enhancing both the speed and the quality of hypertext categorization using hyperlinks. In comparison against a recently proposed technique that appears to be the only one of its kind, we obtained up to 18.5% improvement in effectiveness while reducing the processing time dramatically. We attempt to explain through experiments what factors contribute to the improvement.
Conference Paper
Full-text available
We investigate several recent approaches for text categorization under the framework of similarity-based learning. They include two families of text categorization techniques, namely the k-nearest neighbor (k-NN) algorithm and linear classifiers. After identifying the weaknesses and strengths of each technique, we propose a new technique known as the generalized instance set (GIS) algorithm by unifying the strengths of k-NN and linear classifiers and adapting to characteristics of text categorization problems. We also explore some variants of our GIS approach. We have implemented our GIS algorithm, the ExpNet algorithm, and some linear classifiers. Extensive experiments have been conducted on two common document corpora, namely the OHSUMED collection and the Reuters-21578 collection. The results show that our new approach outperforms the latest k-NN approach and linear classifiers in all experiments.
Article
Full-text available
In this article, we evaluate the retrieval performance of an algorithm that automatically categorizes medical documents. The categorization, which consists in assigning an International Code of Disease (ICD) to the medical document under examination, is based on well-known information retrieval techniques. The algorithm, which we proposed, operates in a fully automatic mode and requires no supervision or training data. Using a database of 20,569 documents, we verify that the algorithm attains levels of average precision in the 70–80% range for category coding and in the 60–70% range for subcategory coding. We also carefully analyze the case of those documents whose categorization is not in accordance with the one provided by the human specialists. The vast majority of them represent cases that can only be fully categorized with the assistance of a human subject (because, for instance, they require specific knowledge of a given pathology). For a slim fraction of all documents (0.77% for category coding and 1.4% for subcategory coding), the algorithm makes assignments that are clearly incorrect. However, this fraction corresponds to only one-fourth of the mistakes made by the human specialists.
Article
Full-text available
We describe the results of extensive experiments on large document collections using optimized rule-based induction methods. The goal of these methods is to automatically discover classification patterns that can be used for general document categorization or personalized filtering of free text. Previous reports indicate that human-engineered rule-based systems, requiring many man-years of developmental efforts, have been successfully built to "read" documents and assign topics to them. In this paper, we show that machine-generated decision rules appear comparable to human performance, while using the identical rule-based representation. In comparison with other machine learning techniques, results on a key benchmark from the Reuters collection show a large gain in performance, from a previously reported 65% recall/precision breakeven point to 80.5%. In the context of a very high dimensional feature space, several methodological alternatives are examined, including universal versu...
Article
Full-text available
Systems for text retrieval, routing, categorization and other IR tasks rely heavily on linear classifiers. We propose that two machine learning algorithms, the Widrow-Hoff and EG algorithms, be used in training linear text classifiers. In contrast to most IR methods, theoretical analysis provides performance guarantees and guidance on parameter settings for these algorithms. Experimental data is presented showing Widrow-Hoff and EG to be more effective than the widely used Rocchio algorithm on several categorization and routing tasks. 1 Introduction Document retrieval, categorization, routing, and filtering systems often are based on classification. That is, the IR system decides for each document which of two or more classes it belongs to, or how strongly it belongs to a class, in order to accomplish the IR task of interest. For instance, the two classes may be the documents relevant to and not relevant to a particular user, and the system may rank documents based on how likely it i...
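For readers unfamiliar with the Widrow-Hoff (least mean squares) algorithm named in the abstract above, the following minimal sketch shows one training step for a linear classifier: the weight vector moves against the gradient of the squared error on a single example. The learning rate `eta` and the toy example are assumptions; the EG algorithm, which uses a multiplicative update, is not shown.

```python
def widrow_hoff_update(w: list[float], x: list[float], y: float,
                       eta: float) -> list[float]:
    """One Widrow-Hoff (LMS) step: w <- w - 2 * eta * (w.x - y) * x.

    w is the current weight vector, x the document's feature vector,
    y the target label value, and eta the learning rate.
    """
    prediction = sum(wi * xi for wi, xi in zip(w, x))
    error = prediction - y
    return [wi - 2.0 * eta * error * xi for wi, xi in zip(w, x)]


# Toy example: repeatedly presenting one training instance drives the
# prediction for that instance towards its target value of 1.0.
w = [0.0, 0.0]
for _ in range(50):
    w = widrow_hoff_update(w, [1.0, 0.0], 1.0, eta=0.25)
print(w)
```

With `eta=0.25` and a unit-length input, each step halves the remaining error on this instance, so the first weight approaches 1.0 while the second, whose feature is always zero, stays unchanged.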
Article
The automatic removal of suffixes from words in English is of particular interest in the field of information retrieval. An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL. Although simple, it performs slightly better than a much more elaborate system with which it has been compared. It effectively works by treating complex suffixes as compounds made up of simple suffixes, and removing the simple suffixes in a number of steps. In each step the removal of the suffix is made to depend upon the form of the remaining stem, which usually involves a measure of its syllable length.
Book
A general introduction to adaptive systems is presented with emphasis given to: open and closed loop adaptation; the adaptive linear combiner; and alternative expressions of the gradient. Some theoretical considerations in adaptive processing of stationary signals are discussed including: the properties of the quadratic performance surface; methods for searching the performance surface; and the effect of gradient estimation on adaptation. Finally, the basic adaptive algorithm and structures are described, with attention given to LMS algorithms; the z-transform; the sequential regression algorithm; and adaptive recursive filters. Some of the applications of adaptive signal processing are also considered, including adaptive control systems; adaptive interference canceling; and adaptive beam forming.
Conference Paper
In this paper, we compare learning techniques based on statistical classification to traditional methods of relevance feedback for the document routing problem. We consider three classification techniques which have decision rules that are derived via explicit error minimization: linear discriminant analysis, logistic regression, and neural networks. We demonstrate that the classifiers perform 10-15% better than relevance feedback via Rocchio expansion for the TREC-2 and TREC-3 routing tasks. Error minimization is difficult in high-dimensional feature spaces because the convergence process is slow and the models are prone to overfitting. We use two different strategies, latent semantic indexing and optimal term selection, to reduce the number of features. Our results indicate that features based on latent semantic indexing are more effective for techniques such as linear discriminant analysis and logistic regression, which have no way to protect against overfitting. Neural networks perform equally well with either set of features and can take advantage of the additional information available when both feature sets are used as input.
Article
Query-processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. Retrieval time for inverted lists can be greatly reduced by the use of compression, but this adds to the CPU time required. Here we show that the CPU component of query response time for conjunctive Boolean queries and for informal ranked queries can be similarly reduced, at little cost in terms of storage, by the inclusion of an internal index in each compressed inverted list. This method has been applied in a retrieval system for a collection of nearly two million short documents. Our experimental results show that the self-indexing strategy adds less than 20% to the size of the compressed inverted file, which itself occupies less than 10% of the indexed text, yet can reduce processing time for Boolean queries of 5-10 terms to under one fifth of the previous cost. Similarly, ranked queries of 40-50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness.
Article
This article studies aggressive word removal in text categorization to reduce the noise in free texts and to enhance the computational efficiency of categorization. We use a novel stop word identification method to automatically generate domain-specific stoplists which are much larger than a conventional domain-independent stoplist. In our tests with three categorization methods on text collections from different domains/applications, significant numbers of words were removed without sacrificing categorization effectiveness. In the test of the Expert Network method on CACM documents, for example, an 87% removal of unique words reduced the vocabulary of documents from 8,002 distinct words to 1,045 words, which resulted in a 63% time saving and a 74% memory saving in the computation of category ranking, with a 10% precision improvement, on average, over not using word removal. It is evident in this study that automated word removal based on corpus statistics has a practical and significant impact on the computational tractability of categorization methods in large databases.
Article
In November of 1992 the first Text REtrieval Conference (TREC-1) was held at NIST (Harman 1993). This conference, co-sponsored by ARPA and NIST, brought together information retrieval researchers to discuss their system results on the new TIPSTER test collection. This was the first time that such groups had ever compared results on the same data using the same evaluation methods, and represented a breakthrough in cross-system evaluation in information retrieval. It was also the first time that most of these groups had tackled such a large test collection and required a major effort by all groups to scale up their retrieval techniques.
Article
A central problem in information retrieval is the automated classification of text documents. Given a set of documents, and a set of topics, what is sought is an algorithm that can determine whether or not each document is about each topic. The Defence applications of text document classifiers are broad ranging, and often have the potential to enhance Defence capability significantly. For example, an intelligence analyst can work more efficiently and effectively if the large numbers of text documents available to them are automatically organised by topic. This paper presents preliminary work on a text document classifier that integrates a number of psychological insights. The result is a classifier that is reasonably accurate, makes decisions rapidly, and is able to give a measure of confidence in its decisions.
Conference Paper
Computer technology is continually developing, with ongoing rapid improvements in processor speed and disk capacity. At the same time, demands on retrieval systems are increasing, with, in applications such as World-Wide Web search engines, growth in data volumes outstripping gains in hardware performance. We experimentally explore the relationship between hardware and data volumes using a new framework designed for retrieval systems. We show that changes in performance depend entirely on the application: in some cases, even with large increases in data volume, the faster hardware allows improvements in response time; but in other cases, performance degrades far more than either raw hardware statistics or speed on processor-bound tasks would suggest. Overall, it appears that seek times rather than processor limitations are a crucial bottleneck and there is little likelihood of reductions in retrieval system response time without improvements in disk performance.
Conference Paper
Given a large hierarchical dictionary of concepts, the task of selection of the concepts that describe the contents of a given document is considered. The problem consists in proper handling of the top-level concepts in the hierarchy. As a representation of the document, a histogram of the topics with their respective contribution in the document is used. The contribution is determined by comparison of the document with the “ideal” document for each topic in the dictionary. The “ideal” document for a concept is one that contains only the keywords belonging to this concept, in proportion to their occurrences in the training corpus. A fast algorithm of comparison for some types of metrics is proposed. The application of the method in a system classifier is discussed.
Article
Fast access to files of integers is crucial for the efficient resolution of queries to databases. Integers are the basis of indexes used to resolve queries, for example, in large internet search systems, and numeric data forms a large part of most databases. Disk access costs can be reduced by compression, if the cost of retrieving a compressed representation from disk and the CPU cost of decoding such a representation is less than that of retrieving uncompressed data. In this paper we show experimentally that, for large or small collections, storing integers in a compressed format reduces the time required for either sequential stream access or random access. We compare different approaches to compressing integers, including the Elias gamma and delta codes, Golomb coding, and a variable-byte integer scheme. As a conclusion, we recommend that, for fast access to integers, files be stored compressed.
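To make the variable-byte idea in the abstract above concrete, here is a minimal sketch of one common byte-aligned convention: each byte carries seven data bits, least significant group first, and the high bit signals that more bytes follow. This bit layout is an assumption made for illustration; the scheme evaluated in the paper may order the groups or use the flag bit differently.

```python
def vbyte_encode(n: int) -> bytes:
    """Encode a non-negative integer using 7 data bits per byte.

    The high bit of a byte is set when further bytes follow.
    """
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)  # more bytes follow
        else:
            out.append(low)         # final byte of this integer
            return bytes(out)


def vbyte_decode_stream(data: bytes) -> list[int]:
    """Decode a concatenation of variable-byte codes back into integers."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


numbers = [5, 300, 0, 16384]
encoded = b"".join(vbyte_encode(n) for n in numbers)
print(len(encoded), vbyte_decode_stream(encoded))
```

Small integers, which dominate gap-encoded inverted lists, occupy a single byte, and decoding needs only shifts and masks on whole bytes, which is why byte-aligned codes tend to be cheap to decode.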
Article
Text retrieval systems are used to fetch documents from large text collections, using queries consisting of words and word sequences.
Article
Two recently implemented machine-learning algorithms, RIPPER and sleeping-experts for phrases, are evaluated on a number of large text categorization problems. These algorithms both construct classifiers that allow the "context" of a word w to affect how (or even whether) the presence or absence of w will contribute to a classification. However, RIPPER and sleeping-experts differ radically in many other respects: differences include different notions as to what constitutes a context, different ways of combining contexts to construct a classifier, different methods to search for a combination of contexts, and different criteria as to what contexts should be included in such a combination. In spite of these differences, both RIPPER and sleeping-experts perform extremely well across a wide variety of categorization problems, generally outperforming previously applied learning methods. We view this result as a confirmation of the usefulness of classifiers that represent contextual information.
Article
This paper is a comparative study of feature selection methods in statistical learning of text categorization. The focus is on aggressive dimensionality reduction. Five methods were evaluated, including term selection based on document frequency (DF), information gain (IG), mutual information (MI), a χ²-test (CHI), and term strength (TS). We found IG and CHI most effective in our experiments. Using IG thresholding with a k-nearest neighbor classifier on the Reuters corpus, removal of up to 98% of unique terms actually yielded an improved classification accuracy (measured by average precision). DF thresholding performed similarly. Indeed we found strong correlations between the DF, IG and CHI values of a term. This suggests that DF thresholding, the simplest method with the lowest cost in computation, can be reliably used instead of IG or CHI when the computation of these measures is too expensive. TS compares favorably with the other methods with up to 5...
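Document frequency thresholding, described in this abstract as the simplest of the five methods, can be sketched as follows. This is an illustrative sketch only: documents are assumed to be pre-tokenised lists of terms, and the cutoff `min_df` and all function names are hypothetical choices for the example, not values or interfaces from the study.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return df


def select_by_df(docs, min_df):
    """Keep terms occurring in at least min_df documents (hypothetical cutoff)."""
    df = document_frequencies(docs)
    return {term for term, count in df.items() if count >= min_df}


def project(tokens, vocabulary):
    """Drop every token that is not in the selected vocabulary."""
    return [t for t in tokens if t in vocabulary]


# Toy collection of three tokenised documents (made up for illustration).
docs = [
    ["wheat", "price", "rise"],
    ["wheat", "export"],
    ["price", "wheat", "wheat"],
]
vocabulary = select_by_df(docs, min_df=2)
```

With this toy collection, only "wheat" and "price" survive a cutoff of two documents. Unlike IG or CHI, the computation needs no category labels, only a single pass over the collection.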
Article
A probabilistic analysis of the Rocchio relevance feedback algorithm, one of the most popular learning methods from information retrieval, is presented in a text categorization framework. The analysis results in a probabilistic version of the Rocchio classifier and offers an explanation for the TFIDF word weighting heuristic. The Rocchio classifier, its probabilistic variant and a standard naive Bayes classifier are compared on three text categorization tasks. The results suggest that the probabilistic algorithms are preferable to the heuristic Rocchio classifier. This research is sponsored by the Wright Laboratory, Aeronautical Systems Center, Air Force Materiel Command, USAF, and the Advanced Research Projects Agency (ARPA) under grant F33615-93-1-1330. The US Government is authorized to reproduce and distribute reprints for Government purposes, notwithstanding any copyright notation thereon. Views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of Wright Laboratory or the United States Government. Keywords: text categorization, relevance feedback, naive Bayes classifier, information retrieval, vector space retrieval model, machine learning
Article
The proliferation of topic hierarchies for text documents has resulted in a need for tools that automatically classify new documents within such hierarchies. Existing classification schemes which ignore the hierarchical structure and treat the topics as separate classes are often inadequate in text classification where there is a large number of classes and a huge number of relevant features needed to distinguish between them. We propose an approach that utilizes the hierarchical topic structure to decompose the classification task into a set of simpler problems, one at each node in the classification tree. As we show, each of these smaller problems can be solved accurately by focusing only on a very small set of features, those relevant to the task at hand. This set of relevant features varies widely throughout the hierarchy, so that, while the overall relevant feature set may be large, each classifier only examines a small subset. The use of reduced feature sets allows us to utilize more complex (probabilistic) models, without encountering many of the standard computational and robustness difficulties.
Article
This paper explores the use of hierarchical structure for classifying a large, heterogeneous collection of web content. The hierarchical structure is initially used to train different second-level classifiers. In the hierarchical case, a model is learned to distinguish a second-level category from other categories within the same top level. In the flat non-hierarchical case, a model distinguishes a second-level category from all other second-level categories. Scoring rules can further take advantage of the hierarchy by considering only second-level categories that exceed a threshold at the top level. We use support vector machine (SVM) classifiers, which have been shown to be efficient and effective for classification, but not previously explored in the context of hierarchical classification. We found small advantages in accuracy for hierarchical models over flat models. For the hierarchical approach, we found the same accuracy using a sequential Boolean decision rule and a multiplica...
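The idea of considering only second-level categories whose top-level category passes a threshold can be sketched as below. This is a rough illustration under stated assumptions: the scores are taken as given (the classifiers that produce them are not modelled), and the function name, the `parent_of` mapping, the category names, and the 0.5 thresholds are all hypothetical.

```python
def gated_categories(top_scores, second_scores, parent_of,
                     top_threshold=0.5, second_threshold=0.5):
    """Return second-level categories accepted under a top-level gate.

    A second-level category is considered only if its parent's top-level
    score reaches top_threshold; it is then accepted if its own score
    reaches second_threshold. Both thresholds are assumed values.
    """
    accepted = []
    for category, score in second_scores.items():
        parent = parent_of[category]
        if top_scores.get(parent, 0.0) < top_threshold:
            continue  # parent rejected at the top level, skip this category
        if score >= second_threshold:
            accepted.append(category)
    return sorted(accepted)


# Made-up scores for one document.
top_scores = {"sports": 0.9, "finance": 0.2}
second_scores = {
    "sports/tennis": 0.8,
    "sports/golf": 0.3,
    "finance/stocks": 0.95,
}
parent_of = {
    "sports/tennis": "sports",
    "sports/golf": "sports",
    "finance/stocks": "finance",
}
result = gated_categories(top_scores, second_scores, parent_of)
```

Here "finance/stocks" is rejected despite its high second-level score, because its parent fails the top-level gate; a flat classifier that ignores the hierarchy would have accepted it.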
R. Sedgewick. Algorithms in C: Parts 1-4: Fundamentals, data structures, sorting, searching. Addison-Wesley, Reading, MA, USA, 1998.
W. Wibowo and H. Williams. On using hierarchies for document classification. In Proc. Australasian Document Computing Symposium, pages 31-37, Coffs Harbour, Australia, December 1999. Southern Cross University.
H. Williams and J. Zobel. Compressing integers for fast file access. Computer Journal, 42(3):193-201, 1999.
H. Williams, J. Zobel, and P. Anderson. What's next? Index structures for efficient phrase querying. In Proc. Australasian Database Conference, pages 141-152, Auckland, New Zealand, January 1999. Springer.
I. Witten, A. Moffat, and T. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, Los Altos, CA 94022, USA, second edition, 1999.
Y. Yang and X. Liu. A re-examination of text categorization methods. In Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 42-49, Berkeley, California, August 1999. ACM Press.
Y. Yang and J. Pedersen. A comparative study on feature selection in text categorization. In Proc. International Conference on Machine Learning, pages 412-420, Nashville, July 1997. Morgan Kaufmann.
Y. Yang and J. Wilbur. Using corpus statistics to remove redundant words in text categorization. Journal of the American Society for Information Science, 47(5):357-369, 1996.
J. Zobel, H. Williams, and S. Kimberley. Trends in retrieval system performance. In Proc. Australasian Computer Science Conference, volume 22, pages 241-248, Canberra, January/February 2000. IEEE Computer Society Press.