Conference Paper

Minimising Errors in Hierarchical Web Categorisation

Abstract

On the Web, browsing and searching categories is a popular method of finding documents. Two well-known category-based search systems are the Yahoo! and DMOZ hierarchies, which are maintained by experts who assign documents to categories. However, manual categorisation by experts is costly, subjective, and not scalable to the increasing volumes of data that must be processed. Several methods have been investigated for effective automatic text categorisation. These include selection of categorisation methods, selection of pre-categorised training samples, use of hierarchies, and selection of document fragments or features. In this paper, we further investigate categorisation into Web hierarchies and the role of hierarchical information in improving categorisation effectiveness. We introduce new strategies to reduce errors in hierarchical categorisation. In particular, we propose novel techniques that shift the assignment into higher-level categories when lower-level assignment is uncertain. Our results show that absolute error rates can be reduced by over 2%.
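The core shift-up strategy lends itself to a short illustration. The sketch below (Python, with an assumed score dictionary, parent map, and threshold, none of which come from the paper) assigns a document to the best-scoring leaf only when that score is confident, and otherwise backs off to the leaf's parent category.

```python
# Illustrative sketch of shifting uncertain assignments to a parent
# category; the threshold and data structures are assumptions, not
# taken from the paper.

def assign_with_shift_up(doc_scores, parent_of, threshold=0.6):
    """doc_scores: dict mapping leaf category -> classifier score in [0, 1].
    parent_of: dict mapping each category to its parent (None at the root).
    Returns the best leaf if its score is confident enough, otherwise
    that leaf's parent (a higher, safer level in the hierarchy)."""
    best_leaf = max(doc_scores, key=doc_scores.get)
    if doc_scores[best_leaf] >= threshold:
        return best_leaf                           # confident: keep the leaf
    return parent_of.get(best_leaf) or best_leaf   # uncertain: move up

# Example: a document scored against three leaves of a small taxonomy.
parent_of = {"python": "programming", "java": "programming", "tennis": "sport"}
scores = {"python": 0.41, "java": 0.38, "tennis": 0.21}
print(assign_with_shift_up(scores, parent_of))     # -> "programming"
```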


... In the seminal work by [21], a document to be classified proceeds top-down along the given taxonomy, each classifier being used to decide to which subtree(s) the document should be sent, until one or more leaves of the taxonomy are reached. This approach, which requires a multiclass classifier at each parent node, gave rise to a variety of actual systems, e.g., [25], [10], [36], and [26]. ...
... This solution is normally adopted in monolithic systems, where only one classifier is entrusted with distinguishing among all categories in a taxonomy [16, 19]. Variations on this theme can be found in [36] and in [24]. In local approaches, different sets of features are selected for different nodes in the taxonomy, thus taking advantage of dividing a large initial problem into subproblems, e.g., [36]. ...
... Variations on this theme can be found in [36] and in [24]. In local approaches, different sets of features are selected for different nodes in the taxonomy, thus taking advantage of dividing a large initial problem into subproblems, e.g., [36]. This is the default choice for Pachinko machines. ...
Article
Full-text available
Progressive filtering is a simple way to perform hierarchical classification, inspired by the behavior that most humans put into practice while attempting to categorize an item according to an underlying taxonomy. Each node of the taxonomy being associated with a different category, one may visualize the categorization process by looking at the item going downwards through all the nodes that accept it as belonging to the corresponding category. This paper is aimed at modeling the progressive filtering technique from a probabilistic perspective, in a hierarchical text categorization setting. As a result, the designer of a system based on progressive filtering should be facilitated in the task of devising, training, and testing it.
... In the seminal work by [31], a sample to be classified proceeds top-down along the given taxonomy, each classifier being used to decide to which subtree(s) the sample should be sent, until one or more leaves of the taxonomy are reached. This approach, which requires a multiclass classifier at each parent node, gave rise to a variety of actual systems, e.g., [35], [15], [49], and [37]. ...
... This solution is normally adopted in monolithic systems, where only one classifier is entrusted with distinguishing among all categories in a taxonomy [22,28]. Variations on this theme can be found in [49] and in [34]. In local approaches, different sets of features are selected for different nodes in the taxonomy, thus taking advantage of dividing a large initial problem into subproblems, e.g., [49]. ...
... Variations on this theme can be found in [49] and in [34]. In local approaches, different sets of features are selected for different nodes in the taxonomy, thus taking advantage of dividing a large initial problem into subproblems, e.g., [49]. This is the default choice for Pachinko machines. ...
Article
Full-text available
Progressive filtering is a simple way to perform hierarchical classification, inspired by the behavior that most humans put into practice while attempting to categorize an item according to an underlying taxonomy. Each node of the taxonomy being associated with a different category, one may visualize the categorization process by looking at the item going downwards through all the nodes that accept it as belonging to the corresponding category. This paper is aimed at modeling the progressive filtering technique from a probabilistic perspective. As a result, the designer of a system based on progressive filtering should be facilitated in the task of devising, training, and testing it.
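As a rough sketch of progressive filtering as described above, the item below trickles down a toy taxonomy through every node whose binary acceptor admits it; the keyword-overlap predicate is a stand-in for real trained per-node classifiers.

```python
# Minimal sketch of progressive filtering: an item goes downwards through
# every node that accepts it, possibly along several branches. The
# accepts() predicate stands in for a trained binary classifier per node.

def progressive_filter(item, node, children, accepts):
    """Return the set of deepest nodes that accept the item."""
    kids = [c for c in children.get(node, []) if accepts(c, item)]
    if not kids:
        return {node}              # no child accepts: stop at this node
    results = set()
    for c in kids:                 # an item may follow several branches
        results |= progressive_filter(item, c, children, accepts)
    return results

# Example with a toy two-level taxonomy and keyword-based acceptance.
children = {"root": ["science", "sport"], "science": ["physics", "biology"]}
keywords = {"science": {"quark"}, "sport": {"goal"},
            "physics": {"quark"}, "biology": {"cell"}}
accepts = lambda node, item: bool(keywords[node] & item)
print(progressive_filter({"quark"}, "root", children, accepts))  # {'physics'}
```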
... They found not only that the hierarchical model was more accurate than the baseline flat model, but also that it was more efficient, saving evaluation time (Dumais and Chen, 2000). Wibowo and Williams (2002b) distinguished two approaches to hierarchical categorization – the top-down approach and the combination-schemes approach. The top-down approach is intuitive. ...
... Moreover, because there are fewer categories to choose from, given that the parent labeling is correct, the accuracy of child assignment might be higher. In the combination-schemes approach, we need to calculate the similarity of a document to the classes at each level of the tree structure, and then combine the results from different levels to make a final assignment decision (Wibowo and Williams, 2002b). Dumais and Chen's (2000) multiplicative scoring function and sequential Boolean function are examples of combining scores from different-level classifiers. ...
... Our proposed cascading model is different from other approaches reported in the hierarchical text classification literature. The top-down approach and the combination-schemes approach are the two major methods for hierarchical text classification (Dumais and Chen, 2000; Wibowo and Williams, 2002b). In our proposed approach, we take a two-stage approach and simply pass down the predicted Parent class label to the Child classification model. ...
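The two combination schemes mentioned in these excerpts can be sketched directly, assuming per-level classifier scores are available; the taxonomy, scores, and thresholds below are illustrative only.

```python
# Sketch of combining scores from different levels of the hierarchy.
# The per-level scores are assumed outputs of separate classifiers;
# here they are hard-coded toy values.

def multiplicative_score(path_scores):
    """Score of a leaf = product of classifier scores along its path."""
    product = 1.0
    for s in path_scores:
        product *= s
    return product

def boolean_rule(path_scores, thresholds):
    """Sequential Boolean rule: every level must clear its threshold."""
    return all(s >= t for s, t in zip(path_scores, thresholds))

# A document scored against two root-to-leaf paths in a toy hierarchy.
paths = {("computers", "ai"): [0.9, 0.7], ("recreation", "autos"): [0.4, 0.8]}
best = max(paths, key=lambda p: multiplicative_score(paths[p]))
print(best, multiplicative_score(paths[best]))        # ('computers', 'ai') 0.63
print(boolean_rule(paths[best], thresholds=[0.5, 0.5]))  # True
```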
Conference Paper
Full-text available
Many real-world classification problems involve classes organized in a hierarchical tree-like structure. However, in many cases the hierarchical structure is ignored and each class is treated in isolation, or in other words the class structure is flattened (Dumais and Chen, 2000). In this paper, we propose a new approach of incorporating hierarchical structure knowledge by cascading it as an additional feature for the Child-level classifier. We posit that our cascading model will outperform the baseline “flat” model. Our empirical experiments provide strong evidence supporting our proposal. Interestingly, even imperfect hierarchical structure knowledge improves classification performance.
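A minimal sketch of the cascading idea, assuming scikit-learn is available and using synthetic data: the Parent-level prediction is appended as one extra feature column for the Child-level classifier.

```python
# Sketch of a two-stage cascade: the predicted Parent label becomes an
# extra feature for the Child classifier. scikit-learn and the synthetic
# data below are assumptions for illustration, not the paper's setup.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                 # toy document feature vectors
y_parent = (X[:, 0] > 0).astype(int)           # two toy Parent classes
y_child = 2 * y_parent + (X[:, 1] > 0)         # four toy Child classes

parent_clf = LogisticRegression(max_iter=1000).fit(X, y_parent)

# Cascade: append the Parent prediction as one additional column.
X_cascade = np.hstack([X, parent_clf.predict(X).reshape(-1, 1)])
child_clf = LogisticRegression(max_iter=1000).fit(X_cascade, y_child)

x_new = rng.normal(size=(1, 10))
p = parent_clf.predict(x_new).reshape(-1, 1)
print(child_clf.predict(np.hstack([x_new, p])))  # Child label, given Parent
```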
... A semantically sound taxonomy can be used as the base for a divide-and-conquer strategy. A classifier is built independently at each internal node of the hierarchy using all the documents of the subcategories of this category, and a document is labeled using these classifiers to greedily select sub-branches until we reach a leaf node, or until certain constraints are satisfied (e.g., the score should be larger than a threshold [8], or the predictions of adjacent levels should be consistent [22]). Feature selection is often performed at each node before constructing a classifier [10], [4]. ...
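The greedy descent with a score threshold can be sketched as follows; the per-node scoring function here is a hard-coded stand-in for trained classifiers, and the threshold value is an assumption.

```python
# Sketch of greedy top-down labeling: at each internal node pick the
# best-scoring child, stopping early if the best score drops below a
# threshold. score() is a stand-in for per-node classifiers.

def greedy_label(doc, node, children, score, threshold=0.5):
    while children.get(node):                        # until a leaf...
        best = max(children[node], key=lambda c: score(c, doc))
        if score(best, doc) < threshold:
            return node          # constraint violated: stop at this level
        node = best
    return node

children = {"root": ["arts", "science"], "science": ["physics", "chemistry"]}
toy_scores = {"arts": 0.2, "science": 0.9, "physics": 0.45, "chemistry": 0.4}
score = lambda c, doc: toy_scores[c]
print(greedy_label("some text", "root", children, score))  # stops at 'science'
```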
Conference Paper
Full-text available
Hierarchical models have been shown to be effective in content classification. However, we observe through empirical study that the performance of a hierarchical model varies with given taxonomies; even a semantically sound taxonomy has potential to change its structure for better classification. By scrutinizing typical cases, we elucidate why a given semantics-based hierarchy does not work well in content classification, and how it could be improved for accurate hierarchical classification. With these understandings, we propose effective localized solutions that modify the given taxonomy for accurate hierarchical classification. We conduct extensive experiments on both toy and real-world data sets, report improved performance and interesting findings, and provide further analysis of algorithmic issues such as time complexity, robustness, and sensitivity to the number of features.
... The purpose of hierarchical classification of web data is to minimize the error rate by assigning documents to an intermediate level in the hierarchy instead of categorizing them at the leaf level. It leverages content from ancestors and descendants to reduce errors when assigning data [31, 12]. Although these approaches are promising for web categorization, they are not directly applicable to indexing, as they require training data. ...
Article
Full-text available
When documents are atomically structured, it is possible to assign them keyword vectors to support indexing. Most web content, however, has non-atomic structures. These include navigational/semantic hierarchies on the web. Although they are especially effective for browsing, such structures make it hard for individual nodes to be properly indexed. This is because, in many cases, their contents have to be inferred from the contents of their neighbors, ancestors, and descendants in the structure. In this paper, we propose a novel keyword and keyword-weight propagation technique to properly enrich the data nodes in structured content. In particular, our approach first relies on understanding the context provided by the relative content relationships between entries in the structure. We then leverage this information for relative-content-preserving keyword propagation. Experiments show a significant improvement (10-15%) in precision with the proposed keyword propagation algorithm.
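One way to picture such keyword-weight propagation is the sketch below, in which each node absorbs a damped fraction of its neighbors' keyword weights; the damping factor, number of rounds, and toy tree are assumptions, not the authors' algorithm.

```python
# Sketch of keyword-weight propagation in a hierarchy: each node mixes
# a damped fraction of its neighbors' keyword weights into its own.
# The damping factor (0.3) and toy graph are illustrative assumptions.
from collections import defaultdict

def propagate(weights, neighbors, damping=0.3, rounds=2):
    for _ in range(rounds):
        new = {}
        for node, vec in weights.items():
            acc = defaultdict(float, vec)
            for nb in neighbors.get(node, []):
                for term, w in weights[nb].items():
                    acc[term] += damping * w   # inherit damped neighbor weight
            new[node] = dict(acc)
        weights = new
    return weights

tree = {"root": ["db", "ir"], "db": ["root"], "ir": ["root"]}
w = {"root": {}, "db": {"sql": 1.0}, "ir": {"query": 1.0}}
print(propagate(w, tree)["root"])   # root now carries 'sql' and 'query' mass
```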
... A similar idea is used in the "Hierarchical Mixture Model" [26] proposed by Toutanova et al., where a generative model incorporates the term probabilities from all parent classes into the current class. Wibowo and Williams [27] suggested that an instance should be assigned to a higher-level category when a lower-level classifier is uncertain. The hierarchical classification approaches mentioned above share a common characteristic: they were posed as meta-classifiers built on top of base classifiers. ...
Article
Full-text available
Hierarchical classification has been shown to outperform flat classification. It is typically performed on hierarchies created by and for humans rather than for classification performance. As a result, classification based on such hierarchies often yields suboptimal results. In this paper, we propose a novel genetic algorithm-based method for hierarchy adaptation for improved classification. Our approach customizes the typical GA to optimize classification hierarchies. In several text classification tasks, our approach produced hierarchies that significantly improved upon the accuracy of the original hierarchy as well as hierarchies generated by state-of-the-art methods.
... For example Chung and Clarke [2002] used HTML data and metadata to classify Web documents using such diverse algorithms as Naïve Bayes Classifiers [Mitchell 1997], Rocchio Feedback [Joachims 1997] and Support Vector Machines (SVM) [Joachims 2001]. In addition, a linear function categorizer shows promise [Wibowo and Williams 2002]. Categorization of large-scale DLs, such as the ACM DL, may require a relational database or other optimization techniques to ensure adequate performance as the system scales up. ...
Article
Large scale research Digital Libraries (DLs) have a large array of potentially useful metadata. Yet, many popular DLs do not provide a convenient way to navigate the metadata or to visualize classification schema in the user session. For example, in the broad world of Management Information Systems (MIS) research, a high-level overview of MIS topics and their inter-relationships would be useful to navigate a MIS DL before zooming in on a specific article. To address this obstacle, this paper describes a prototype, the Technical Report Visualizer System (TRV), which uses a wide variety of open standards to expose DL classification metadata in the navigation interface. The system captures MIS article metadata from the Open Archives Initiative (OAI) compliant arXiv e-Print archive at Cornell University. The OAI Protocol for Metadata Harvesting (OAI-PMH) is used to collect the topic metadata: the articles' Association for Computing Machinery (ACM) Computing Classification System codes. We display the topic metadata in a Java hyperbolic tree and make use of XML conceptual product and implementation product standards and specifications, such as the Dublin Core and BiblioML bibliographic metadata sets, XML Topic Maps, Xalan and Xerces, to link user navigation activity to the abstracts and full text contents of the articles. We discuss the flexibility and convenience of XML standards and link this effort to related digital library visualization approaches.
... It may also be complementary: augmenting manually-constructed directories with automatically-labelled documents is an important potential application of automatic text categorisation. In such a process, automatic categorisation could be used to recommend each document be forwarded to a particular maintainer, who then performs the manual category allocation [27]. There are many other applications for automatic text categorisation, such as fighting the battle against spam emails, and prioritising incoming email or voicemail in a high-volume system. ...
Conference Paper
Categorisation is a useful method for organising documents into subcollections that can be browsed or searched to more accurately and quickly meet information needs. On the Web, category-based portals such as Yahoo! and DMOZ are extremely popular: DMOZ is maintained by over 56,000 volunteers, is used as the basis of the popular Google directory, and is perhaps used by millions of users each day. Support Vector Machines (SVM) is a machine-learning algorithm which has been shown to be highly effective for automatic text categorisation. However, a problem with iterative training techniques such as SVM is that during their learning or training phase, they require the entire training collection to be held in main-memory; this is infeasible for large training collections such as DMOZ or large news wire feeds. In this paper, we show how inverted indexes can be used for scalable training in categorisation, and propose novel heuristics for a fast, accurate, and memory efficient approach. Our results show that an index can be constructed on a desktop workstation with little effect on categorisation accuracy compared to a memory-based approach. We conclude that our techniques permit automatic categorisation using very large training collections, vocabularies, and numbers of categories.
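The inverted-index idea can be sketched with an in-memory dictionary standing in for the on-disk index: training statistics are accumulated one postings list at a time, so whole documents never need to be held in memory together.

```python
# Sketch of index-driven training: instead of holding all training
# documents in memory, iterate term-by-term over postings lists. The
# dict below stands in for an on-disk inverted index that a real
# system would stream from.
from collections import defaultdict

docs = {1: "hierarchy of web categories", 2: "support vector machines",
        3: "web categorisation with svm"}
labels = {1: "web", 2: "ml", 3: "web"}

# Build the inverted index: term -> list of (doc_id, term_frequency).
index = defaultdict(list)
for doc_id, text in docs.items():
    counts = defaultdict(int)
    for term in text.split():
        counts[term] += 1
    for term, tf in counts.items():
        index[term].append((doc_id, tf))

# Training pass driven by the index: accumulate per-class term weights
# one postings list at a time, never touching whole documents.
weights = defaultdict(lambda: defaultdict(float))
for term, postings in index.items():
    for doc_id, tf in postings:
        weights[labels[doc_id]][term] += tf

print(dict(weights["web"]))   # accumulated term weights for 'web'
```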
... Dumais and Chen [8] demonstrated that making use of the hierarchical structure of web directories can improve both efficiency and accuracy. Wibowo and Williams [24] also studied the problem of hierarchical web classification and suggested methods to minimize errors. Chakrabarti et al. [4] have shown that directly including neighboring pages' textual content into the page does not improve the performance of classification because too much noise is introduced in this approach. ...
Conference Paper
Full-text available
Web page classification is important to many tasks in information retrieval and web mining. However, applying traditional textual classifiers on web data often produces unsatisfying results. Fortunately, hyperlink information provides important clues to the categorization of a web page. In this paper, an improved method is proposed to enhance web page classification by utilizing the class information from neighboring pages in the link graph. The categories represented by four kinds of neighbors (parents, children, siblings and spouses) are combined to help with the page in question. In experiments to study the effect of these factors on our algorithm, we find that the method proposed is able to boost the classification accuracy of common textual classifiers from around 70% to more than 90% on a large dataset of pages from the Open Directory Project, and outperforms existing algorithms. Unlike prior techniques, our approach utilizes same-host links and can improve classification accuracy even when neighboring pages are unlabeled. Finally, while all neighbor types can contribute, sibling pages are found to be the most important.
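A sketch of the neighbor-based scheme, with made-up neighbor-type weights (the paper itself reports siblings as the most informative type): the page's own textual scores are combined with votes from the classes of its linked neighbors.

```python
# Sketch of boosting a page's classification with neighbor classes.
# The neighbor-type weights and mixing factor are illustrative
# assumptions, not values from the paper.
from collections import defaultdict

TYPE_WEIGHT = {"parent": 0.5, "child": 0.5, "sibling": 1.0, "spouse": 0.5}

def classify(text_scores, neighbors, alpha=0.5):
    """text_scores: class -> textual classifier score for this page.
    neighbors: list of (neighbor_type, neighbor_class) pairs; unlabeled
    neighbors would first be given a predicted class."""
    votes = defaultdict(float)
    for ntype, ncls in neighbors:
        votes[ncls] += TYPE_WEIGHT[ntype]
    total = sum(votes.values()) or 1.0
    combined = {c: (1 - alpha) * s + alpha * votes[c] / total
                for c, s in text_scores.items()}
    return max(combined, key=combined.get)

scores = {"arts": 0.55, "science": 0.45}
nbrs = [("sibling", "science"), ("sibling", "science"), ("parent", "arts")]
print(classify(scores, nbrs))   # neighbors tip the decision to 'science'
```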
... The document annotation or classification problem of phase two is interesting in that the codes themselves are structured hierarchically. Similar hierarchical classification problems have been addressed [8-10], including by our own group [11, 12]. When working on GO annotation one may certainly draw from these related papers. ...
Article
Full-text available
Annotating genes and their products with Gene Ontology codes is an important area of research. One approach is to use the information available about these genes in the biomedical literature. The goal in this paper, based on this approach, is to develop automatic annotation methods that can supplement the expensive manual annotation processes currently in place. Using a set of Support Vector Machine (SVM) classifiers we were able to achieve F-scores of 0.49, 0.41 and 0.33 for codes of the molecular function, cellular component and biological process GO hierarchies respectively. We find that alternative term weighting strategies do not differ from each other in performance and that feature selection strategies reduce performance. The best thresholding strategy is one where a single threshold is picked for each hierarchy. Hierarchy level is important, especially for molecular function and biological process. The cellular component hierarchy stands apart from the other two in many respects. This may be due to fundamental differences in link semantics. This research shows that it is possible to beneficially exploit the hierarchical structures by defining and testing a relaxed criterion for classification correctness. Finally, it is possible to build classifiers for codes with very few associated documents, but as expected a huge penalty is paid in performance. The GO annotation problem is complex. Several key observations have been made, for example about topic drift, that may be important to consider in annotation strategies.
Article
Classification of Web page content is essential to many tasks in Web information retrieval such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges to Web page classification as compared to traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process. As we review work in Web page classification, we note the importance of these Web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the use of information from neighboring pages.
Article
Information is often organized as a text hierarchy. A hierarchical text-classification system is thus essential for the management, sharing, and dissemination of information. It aims to automatically classify each incoming document into zero, one, or several categories in the text hierarchy. In this paper, we present a technique called CRHTC (context recognition for hierarchical text classification) that performs hierarchical text classification by recognizing the context of discussion (COD) of each category. A category's COD is governed by its ancestor categories, whose contents indicate contextual backgrounds of the category. A document may be classified into a category only if its content matches the category's COD. CRHTC does not require any trials to manually set parameters, and hence is more portable and easier to implement than other methods. It is empirically evaluated under various conditions. The results show that CRHTC achieves both better and more stable performance than several hierarchical and nonhierarchical text-classification methodologies.
Article
Full-text available
A topic taxonomy is an effective representation that describes salient features of virtual groups or online communities. A topic taxonomy consists of topic nodes. Each internal node is defined by its vertical path (i.e., ancestor and child nodes) and its horizontal list of attributes (or terms). In a text-dominant environment, a topic taxonomy can be used to flexibly describe a group's interests with varying granularity. However, the stagnant nature of a taxonomy may fail to timely capture the dynamic change of a group's interest. This article addresses the problem of how to adapt a topic taxonomy to the accumulated data that reflects the change of a group's interest to achieve dynamic group profiling. We first discuss the issues related to topic taxonomy. We next formulate taxonomy adaptation as an optimization problem to find the taxonomy that best fits the data. We then present a viable algorithm that can efficiently accomplish taxonomy adaptation. We conduct extensive experiments to evaluate our approach's efficacy for group profiling, compare the approach with some alternatives, and study its performance for dynamic group profiling. While pointing out various applications of taxonomy adaptation, we suggest some future work that can take advantage of burgeoning Web 2.0 services for online targeted marketing, counterterrorism in connecting dots, and community tracking.
Article
Full-text available
Thesis (Ph.D.)--University of Ottawa, 2006. Includes bibliographies.
Article
Full-text available
We consider the problem of assigning level numbers (weights) to hierarchically organized categories during the process of text categorization. These levels control the ability of the categories to attract documents during the categorization process. The levels are adjusted in order to obtain a balance between recall and precision for each category. If a category's recall exceeds its precision, the category is too strong and its level is reduced. Conversely, a category's level is increased to strengthen it if its precision exceeds its recall. The categorization algorithm used is a supervised learning procedure that uses a linear classifier based on the category levels. We are given a set of categories, organized hierarchically. We are also given a training corpus of documents already placed in one or more categories. From these, we extract vocabulary, words that appear with high frequency within a given category, characterizing each subject area. Each node's vocabulary is filtered and its words assigned weights with respect to the specific category. Then, test documents are scanned and categories ranked based on the presence of vocabulary terms. Documents are assigned to categories based on these rankings. We demonstrate that precision and recall can be significantly improved by solving the categorization problem taking hierarchy into account. Specifically, we show that by adjusting the category levels in a principled way, precision can be significantly improved, from 84% to 91%, on the much-studied Reuters-21578 corpus organized in a three-level hierarchy of categories.
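The adjustment rule described here admits a compact sketch: after each evaluation pass, a category whose recall exceeds its precision has its level reduced, and one whose precision exceeds its recall has it increased. The step size, number of passes, and toy evaluation function are assumptions.

```python
# Sketch of the level-adjustment rule: weaken a category whose recall
# exceeds its precision, strengthen it in the opposite case. The step
# size and the evaluate() stub are illustrative assumptions.

def adjust_levels(levels, evaluate, step=0.1, passes=5):
    """levels: category -> level number controlling its pull on documents.
    evaluate: callable(levels) -> {category: (precision, recall)}."""
    for _ in range(passes):
        metrics = evaluate(levels)
        for cat, (precision, recall) in metrics.items():
            if recall > precision:
                levels[cat] -= step    # too strong: attracts wrong docs
            elif precision > recall:
                levels[cat] += step    # too weak: misses its own docs
    return levels

# Toy evaluation: pretend a higher level raises recall, lowers precision.
def toy_eval(levels):
    return {c: (max(0.0, 1 - 0.4 * l), min(1.0, 0.4 * l))
            for c, l in levels.items()}

print(adjust_levels({"grain": 1.0, "trade": 1.5}, toy_eval))
```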
Conference Paper
Full-text available
Given a set of categories, with or without a preexisting hierarchy among them, we consider the problem of assigning documents to one or more of these categories from the point of view of a hierarchy with more or less depth. We can choose to make use of none, part or all of the hierarchical structure to improve the categorization effectiveness and efficiency. It is possible to create additional hierarchy among the categories. We describe a procedure for generating a hierarchy of classifiers that models the hierarchy structure. We report on computational experience using this procedure. We show that judicious use of a hierarchy can significantly improve both the speed and effectiveness of the categorization process. Using the Reuters-21578 corpus, we obtain an improvement in running time of over a factor of three and a 5% improvement in F-measure.
Conference Paper
Full-text available
This paper presents the design and evaluation of a text categorization method based on the Hierarchical Mixture of Experts model. This model uses a divide and conquer principle to define smaller categorization problems based on a predefined hierarchical structure. The final classifier is a hierarchical array of neural networks. The method is evaluated using the UMLS Metathesaurus as the underlying hierarchical structure, and the OHSUMED test set of MEDLINE records. Comparisons with traditional Rocchio's algorithm adapted for text categorization, as well as flat neural network classifiers, are provided. The results show that the use of the hierarchical structure improves text categorization performance significantly.
Article
Full-text available
This paper describes automatic document categorization based on a large text hierarchy. We handle the large number of features and training examples by taking into account the hierarchical structure of examples and using feature selection for large text data. We experimentally evaluate feature subset selection on real-world text data collected from the existing Web hierarchy named Yahoo. In our learning experiments a naive Bayesian classifier was used on text data using a feature-vector document representation that includes word sequences (n-grams) instead of just single words (unigrams). Experimental evaluation on real-world data collected from the Web shows that our approach gives promising results and can potentially be used for document categorization on the Web. Additionally, the best result on our data is achieved for a relatively small feature subset, while for larger subsets the performance substantially drops. The best performance among the six tested feature scoring measures was achieved by Odds ratio, a measure known from information retrieval.
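Odds ratio scoring of the kind used above can be computed directly from class-conditional term probabilities, as in this sketch; the add-one smoothing and toy documents are assumptions.

```python
# Sketch of Odds ratio feature scoring: rank each term by
# log[ P(t|pos) * (1 - P(t|neg)) / ((1 - P(t|pos)) * P(t|neg)) ].
# Laplace (add-one) smoothing is an assumption for illustration.
import math

def odds_ratio(term, pos_docs, neg_docs):
    p_pos = (sum(term in d for d in pos_docs) + 1) / (len(pos_docs) + 2)
    p_neg = (sum(term in d for d in neg_docs) + 1) / (len(neg_docs) + 2)
    return math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))

pos = [{"jazz", "music"}, {"music", "guitar"}]        # toy positive class
neg = [{"football", "score"}, {"score", "team"}, {"music", "team"}]
vocab = set().union(*pos, *neg)
ranked = sorted(vocab, key=lambda t: odds_ratio(t, pos, neg), reverse=True)
print(ranked[:3])   # terms most indicative of the positive category
```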
Article
Full-text available
We describe the results of extensive experiments on large document collections using optimized rule-based induction methods. The goal of these methods is to automatically discover classification patterns that can be used for general document categorization or personalized filtering of free text. Previous reports indicate that human-engineered rule-based systems, requiring many man-years of developmental effort, have been successfully built to "read" documents and assign topics to them. In this paper, we show that machine-generated decision rules appear comparable to human performance, while using the identical rule-based representation. In comparison with other machine learning techniques, results on a key benchmark from the Reuters collection show a large gain in performance, from a previously reported 65% recall/precision breakeven point to 80.5%. In the context of a very high dimensional feature space, several methodological alternatives are examined, including universal versu...
Article
Full-text available
Systems for text retrieval, routing, categorization and other IR tasks rely heavily on linear classifiers. We propose that two machine learning algorithms, the Widrow-Hoff and EG algorithms, be used in training linear text classifiers. In contrast to most IR methods, theoretical analysis provides performance guarantees and guidance on parameter settings for these algorithms. Experimental data is presented showing Widrow-Hoff and EG to be more effective than the widely used Rocchio algorithm on several categorization and routing tasks.
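Both updates are simple enough to sketch from their standard forms: Widrow-Hoff makes an additive least-mean-squares correction, while EG makes a multiplicative (exponentiated-gradient) correction and renormalizes. Learning rates and the toy data are illustrative.

```python
# Sketch of the Widrow-Hoff (LMS) and EG updates on one training pair.
# Learning rates and the toy example are illustrative assumptions.
import math

def widrow_hoff(w, x, y, eta=0.1):
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [wi - 2 * eta * err * xi for wi, xi in zip(w, x)]  # additive step

def eg(w, x, y, eta=0.1):
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    w = [wi * math.exp(-2 * eta * err * xi) for wi, xi in zip(w, x)]
    s = sum(w)
    return [wi / s for wi in w]        # multiplicative step + renormalize

x, y = [1.0, 0.0, 1.0], 1.0            # toy document vector and target
w = [1 / 3] * 3
for _ in range(20):
    w = widrow_hoff(w, x, y)
print([round(wi, 3) for wi in w])      # additive weights fit w.x ~ 1

w = [1 / 3] * 3
for _ in range(20):
    w = eg(w, x, y)
print([round(wi, 3) for wi in w])      # EG weights stay positive, sum to 1
```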
Article
Full-text available
A training algorithm that maximizes the margin between the training patterns and the decision boundary is presented. The technique is applicable to a wide variety of classification functions, including Perceptrons, polynomials, and Radial Basis Functions. The effective number of parameters is adjusted automatically to match the complexity of the problem. The solution is expressed as a linear combination of supporting patterns. These are the subset of training patterns that are closest to the decision boundary. Bounds on the generalization performance based on the leave-one-out method and the VC-dimension are given. Experimental results on optical character recognition problems demonstrate the good generalization obtained when compared with other learning algorithms.
Conference Paper
The Construe news story categorization system assigns indexing terms to news stories according to their content using knowledge-based techniques. An initial deployment of Construe in Reuters Ltd. topic identification system (TIS) has replaced human indexing for Reuters Country Reports, an online information service based on news stories indexed by country and type of news. TIS indexing is comparable to human indexing in overall accuracy but costs much less, is more consistent, and is available much more rapidly. TIS can be justified in terms of cost savings alone, but Reuters also expects the speed and consistency of TIS to provide significant competitive advantage and, hence, an increased market share for Country Reports and other products from Reuters Historical Information Products Division.
Article
With the recent dramatic increase in electronic access to documents, text categorization-the task of assigning topics to a given document-has moved to the center of the information sciences and knowledge management. This article uses the structure that is present in the semantic space of topics in order to improve performance in text categorization: according to their meaning, topics can be grouped together into "meta-topics", e.g., gold, silver, and copper are all metals. The proposed architecture matches the hierarchical structure of the topic space, as opposed to a flat model that ignores the structure. It accommodates both single and multiple topic assignments for each document. Its probabilistic interpretation allows its predictions to be combined in a principled way with information from other sources. The first level of the architecture predicts the probabilities of the meta-topic groups. This allows the individual models for each topic on the second level to focus on finer discriminations within the group. Evaluating the performance of a two-level implementation on the Reuters-22173 testbed of newswire articles shows the most significant improvement for rare classes.
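The probabilistic reading of this two-level architecture can be sketched in a few lines: P(topic | d) = P(meta-topic | d) · P(topic | meta-topic, d), so each second-level model only discriminates within its group. All probabilities below are made up for illustration.

```python
# Sketch of the two-level probabilistic architecture:
# P(topic | d) = P(meta-topic | d) * P(topic | meta-topic, d).
# All probabilities below are illustrative, not model outputs.

p_meta = {"metals": 0.8, "grains": 0.2}                  # level 1
p_topic_given_meta = {                                   # level 2
    "metals": {"gold": 0.5, "silver": 0.3, "copper": 0.2},
    "grains": {"wheat": 0.7, "corn": 0.3},
}

p_topic = {t: p_meta[m] * p
           for m, topics in p_topic_given_meta.items()
           for t, p in topics.items()}
print(max(p_topic, key=p_topic.get), p_topic)   # 'gold' wins with 0.4
```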
Article
Three different types of classifiers were investigated in the context of a text categorization problem in the medical domain: the automatic assignment of ICD9 codes to dictated inpatient discharge summaries. K-nearest-neighbor, relevance feedback, and Bayesian independence classifiers were applied individually and in combination. A combination of different classifiers produced better results than any single type of classifier. For this specific medical categorization problem, new query formulation and weighting methods used in the k-nearest-neighbor classifier improved performance.
Article
This paper presents the design and evaluation of a text categorization method based on the Hierarchical Mixture of Experts model. This model uses a divide and conquer principle to define smaller categorization problems based on a predefined hierarchical structure. The final classifier is a hierarchical array of neural networks. The method is evaluated using the UMLS Metathesaurus as the underlying hierarchical structure, and the OHSUMED test set of MEDLINE records. Comparisons with traditional Rocchio's algorithm adapted for text categorization, as well as flat neural network classifiers, are provided. The results show that the use of the hierarchical structure improves text categorization performance significantly.
Article
In designing data structures for text databases, it is valuable to know how many different words are likely to be encountered in a particular collection. For example, vocabulary accumulation is central to index construction for text database systems; it is useful to be able to estimate the space requirements and performance characteristics of the main-memory data structures used for this task. However, it is not clear how many distinct words will be found in a text collection or whether new words will continue to appear after inspecting large volumes of data. We propose practical definitions of a word, and investigate new word occurrences under these models in a large text collection. We inspected around two billion word occurrences in 45 gigabytes of world-wide web documents, and found just over 9.74 million different words in 5.5 million documents; overall, 1 word in 200 was new. We observe that new words continue to occur, even in very large data sets, and that choosing stricter definitions of what constitutes a word has only limited impact on the number of new words found.
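Vocabulary accumulation of this kind reduces to streaming word occurrences through a set and counting how often a word is unseen; the regex tokenizer below is just one possible definition of a word, assumed for illustration.

```python
# Sketch of vocabulary accumulation: stream word occurrences, count how
# often a word has never been seen before. The regex tokenizer is one
# possible 'definition of a word', assumed for illustration.
import re

def new_word_rate(texts):
    seen, occurrences, new = set(), 0, 0
    for text in texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            occurrences += 1
            if word not in seen:
                seen.add(word)
                new += 1
    return len(seen), occurrences, new / occurrences

stream = ["the web grows", "the vocabulary of the web grows faster"]
distinct, total, rate = new_word_rate(stream)
print(distinct, total, f"{rate:.2f}")   # distinct words and new-word rate
```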
Article
A probabilistic analysis of the Rocchio relevance feedback algorithm, one of the most popular learning methods from information retrieval, is presented in a text categorization framework. The analysis results in a probabilistic version of the Rocchio classifier and offers an explanation for the TFIDF word weighting heuristic. The Rocchio classifier, its probabilistic variant and a standard naive Bayes classifier are compared on three text categorization tasks. The results suggest that the probabilistic algorithms are preferable to the heuristic Rocchio classifier.
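A minimal TFIDF Rocchio sketch, under assumed weighting details: each class is represented by the centroid of its training documents' TFIDF vectors, and a new document goes to the most cosine-similar centroid.

```python
# Sketch of a TFIDF Rocchio classifier: classes are centroids of their
# documents' TFIDF vectors; a document goes to the nearest centroid.
# The exact weighting scheme is an assumption for illustration.
import math
from collections import Counter, defaultdict

train = [("buy gold gold", "metals"), ("wheat harvest", "grains"),
         ("silver price", "metals")]

df = Counter(t for text, _ in train for t in set(text.split()))
N = len(train)
idf = {t: math.log(N / df[t]) + 1 for t in df}             # smoothed idf

def tfidf(text):
    tf = Counter(text.split())
    return {t: c * idf.get(t, 1.0) for t, c in tf.items()}

centroids = defaultdict(lambda: defaultdict(float))
for text, cls in train:                                    # Rocchio centroid
    for t, w in tfidf(text).items():
        centroids[cls][t] += w

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

doc = tfidf("gold price")
print(max(centroids, key=lambda c: cosine(doc, centroids[c])))  # 'metals'
```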
Article
This paper is a comparative study of text categorization methods. Fourteen methods are investigated, based on previously published results and newly obtained results from additional experiments. Corpus biases in commonly used document collections are examined using the performance of three classifiers. Problems in previously published experiments are analyzed, and the results of flawed experiments are excluded from the cross-method evaluation. As a result, eleven of the fourteen methods remain. A k-nearest neighbor (kNN) classifier was chosen as the performance baseline on several collections; on each collection, the performance scores of other methods were normalized using the score of kNN. This provides a common basis for a global observation on methods whose results are only available on individual collections. Widrow-Hoff, k-nearest neighbor, neural networks and the Linear Least Squares Fit mapping are the top-performing classifiers, while the Rocchio approaches had relatively poor results compared to the other learning methods. KNN is the only learning method that has scaled to the full domain of MEDLINE categories, showing graceful behavior when the target space grows from the level of one hundred categories to a level of tens of thousands.
Article
The proliferation of topic hierarchies for text documents has resulted in a need for tools that automatically classify new documents within such hierarchies. Existing classification schemes which ignore the hierarchical structure and treat the topics as separate classes are often inadequate in text classification, where there is a large number of classes and a huge number of relevant features needed to distinguish between them. We propose an approach that utilizes the hierarchical topic structure to decompose the classification task into a set of simpler problems, one at each node in the classification tree. As we show, each of these smaller problems can be solved accurately by focusing only on a very small set of features, those relevant to the task at hand. This set of relevant features varies widely throughout the hierarchy, so that, while the overall relevant feature set may be large, each classifier only examines a small subset. The use of reduced feature sets allows us to utilize more complex (probabilistic) models, without encountering many of the standard computational and robustness difficulties.
Article
This paper explores the use of hierarchical structure for classifying a large, heterogeneous collection of web content. The hierarchical structure is initially used to train different second-level classifiers. In the hierarchical case, a model is learned to distinguish a second-level category from other categories within the same top level. In the flat non-hierarchical case, a model distinguishes a second-level category from all other second-level categories. Scoring rules can further take advantage of the hierarchy by considering only second-level categories that exceed a threshold at the top level. We use support vector machine (SVM) classifiers, which have been shown to be efficient and effective for classification, but not previously explored in the context of hierarchical classification. We found small advantages in accuracy for hierarchical models over flat models. For the hierarchical approach, we found the same accuracy using a sequential Boolean decision rule and a multiplica...
The effect of using hierarchical classifiers in text categorization
  • S D Alessio
  • K Murray
  • R Schiaffino
  • A Kershenbaum
S. D'Alessio, K. Murray, R. Schiaffino, and A. Kershenbaum. The effect of using hierarchical classifiers in text categorization. In Proceeding of RIAO-00, 6th International Conference "Recherche d'Information Assistee par Ordinateur", pages 302–313, Paris, FR, 2000.
Category levels in hierarchical text categorization
  • S D Alessio
  • K Murray
  • R Schiaffino
  • A Kershenbaum
S. D'Alessio, K. Murray, R. Schiaffino, and A. Kershenbaum. Category levels in hierarchical text categorization. In Proc. of EMNLP-98, 3rd Conference on Empirical Methods in Natural Language Processing, Granada, Spain, 1998. Association for Computational Linguistics, Morristown.

Hierarchical classification of Web content
  • S T Dumais
  • H Chen
S. T. Dumais and H. Chen. Hierarchical classification of Web content. In N.J. Belkin, P. Ingwersen, and M.-K. Leong, editors, Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 256–263, Athens, Greece, 2000. ACM Press, New York.