Conference Paper

Simple and accurate feature selection for hierarchical categorisation


Abstract

Categorisation of digital documents is useful for organisation and retrieval. While document categories can be a set of unstructured category labels, some document categories are hierarchically structured. This paper investigates automatic hierarchical categorisation and, specifically, the role of features in the development of more effective categorisers. We show that a good hierarchical machine learning-based categoriser can be developed using small numbers of features from pre-categorised training documents. Overall, we show that by using a few terms, categorisation accuracy can be improved substantially: unstructured leaf level categorisation can be improved by up to 8.6%, while top-down hierarchical categorisation accuracy can be improved by up to 12%. In addition, unlike other feature selection models, which typically require different feature selection parameters for categories at different hierarchical levels, our technique works equally well for all categories in a hierarchical structure. We conclude that, in general, more accurate hierarchical categorisation is possible by using our simple feature selection technique.


... web pages) text corpora. We report on experience with our approach on Reuters-21578 in comparison with the papers of D'Alessio et al [7] and Chakrabarti et al [4], on 20 newsgroups data set in comparison with the paper of McCallum et al [17], and Wibowo and Williams [30], and on TV closed caption data set in comparison with the paper of Chuang et al [6] (see Section 4). ...
... Besides speeding up categorization, papers also report that it can increase classifier performance by a few percent if only a certain subset of terms is used to represent documents (see e.g. [12,30]). In our previous experiments [26] we also found that performance can be increased slightly (less than 1%) if rare terms are disregarded, but the effect of DR on time efficiency is more significant. ...
... Observe that our results show the same relationship between the hierarchies as reported by D'Alessio et al: the E4 hierarchy yields the best result. Wibowo and Williams [30] experimented on this collection with another hierarchy [10]; perhaps this is the reason why their best result, 73.74%, is even lower than what has been achieved by flat categorizers. Our method achieved remarkable results on the flat category system as well. ...
Article
Full-text available
Text categorization is the task of assigning a text document to an appropriate category in a predefined set of categories. This paper focuses on the special case when categories are organized in a hierarchy. We present a new approach to this recently emerged subfield of text categorization. The algorithm applies an iterative learning module that allows a classifier to be created gradually by a trial-and-error-like method. Experimental results on three document corpora (including the well-known Reuters-21578 and 20 newsgroups data sets) with several topic hierarchies show that our approach outperforms existing ones by up to 10%. We also indicate another application of the method in the field of fuzzy relational thesauri (FRT): the expansion of the knowledge base can be supported in a cost-effective way.
... Chakrabarti et al [9] also applied their hierarchical text classifier to this corpus but without a taxonomy. Wibowo and Williams [11] applied the hierarchical category system of Hayes [12] to the corpus, but their results were not superior to flat results. On the other hand, D'Alessio et al [13] defined 3 different hierarchies that comprise all categories of the original flat settings, and they achieved better accuracy than had been reached so far by flat categorizers. ...
... Besides speeding up categorization, papers also report that it can increase classifier performance by a few percent if only a certain subset of terms is used to represent documents (see e.g. [10], [11]). In our previous experiments [19] we also found that performance can be increased slightly (less than 1%) if rare terms are disregarded, but the effect of DR on time efficiency is more significant. ...
... We remark that our results show the same relationship among taxonomies as reported by D'Alessio et al: E4 yields the best result (see Table V). Wibowo and Williams [11] experimented on this collection with another hierarchy [12]; perhaps this is the reason why their best result, 73.74%, is even lower than what has been achieved by flat categorizers. Our method achieved remarkable results on the flat category system as well. ...
Article
Full-text available
Text categorization is the task of assigning a text document to an appropriate category in a predefined set of categories. We present an approach to hierarchical text categorization, a recently emerged subfield of the main topic. Here, documents are assigned to leaf-level categories of a category tree (called a taxonomy). The algorithm applies an iterative learning module that allows a classifier to be created gradually by a weight-adjusting method. We experimented on the well-known Reuters-21578 document corpus with different taxonomies. Results show that our approach outperforms existing ones by up to 10%.
... Rocchio classifiers assume that the representation of a particular category must combine the properties of both positive and negative example documents [10]. The training algorithm consists of constructing a representative vector w j for each category, which is updated according to ...
... Hence feature selection, which aims to reduce the dimensionality of the feature vector by only retaining those features that are most informative, or most likely to distinguish. The notion of information content is encapsulated by several theoretical measures [13] [8], as well as more empirical metrics [10] and less direct measures retrieved as part of the learning procedure [3]. ...
... Wibowo and Williams [10] take a dramatically simpler approach to feature selection. Their problem domain is the classification of documents into hierarchical categories. ...
Article
this paper! ) that there is a strong correlation between the three information measures. 3.3 Feature Weighting The third and final component of feature vector construction is feature weighting, that is, assigning scalar weight values to every feature left in the vector after extraction and selection. Weighting serves to scale the feature vector, which has consequences for the similarity measures often used in classifiers. Consider the most often used distance metric, employed by kNN classifiers amongst others, the Euclidean: $D(w_1, w_2) = \sqrt{\sum_{i=1}^{n} (w_{1i} - w_{2i})^2}$, where $w_j$ is the weight applied to the $j$-th feature and $n$ is the number of features. Clearly the distance computation depends heavily on the feature weights
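To illustrate the point about weighting and distance, here is a minimal sketch in Python. It is not taken from the cited work: the function names, the toy term counts, and the two weighting schemes are all assumptions chosen for illustration. It shows that rescaling a single feature changes the Euclidean distance between the same two documents.

```python
import math

def apply_weights(counts, weights):
    """Scale raw feature values by per-feature weights (hypothetical helper)."""
    return [c * w for c, w in zip(counts, weights)]

def euclidean_distance(w1, w2):
    """Euclidean distance between two equal-length weighted feature vectors."""
    if len(w1) != len(w2):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(w1, w2)))

# Toy raw term counts for two documents over three features (made-up data).
doc_a = [3, 0, 1]
doc_b = [0, 4, 1]

uniform = [1.0, 1.0, 1.0]   # every feature counts equally
damped = [1.0, 0.5, 1.0]    # the second feature is down-weighted

d_uniform = euclidean_distance(apply_weights(doc_a, uniform),
                               apply_weights(doc_b, uniform))
d_damped = euclidean_distance(apply_weights(doc_a, damped),
                              apply_weights(doc_b, damped))
```

With these toy values the down-weighted variant yields a smaller distance than the uniform one, so a nearest-neighbour decision made on such distances can change with the weighting scheme alone.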
... Besides speeding up categorization, papers also report that it can increase classifier performance by a few percent if only a certain subset of terms is used as the feature set for the representation of documents (see e.g. [9,13]). ...
... We remark that our results show the same relationship among taxonomies as reported by D'Alessio et al: E4 yields the best result. Wibowo and Williams [13] experimented on this collection with another hierarchy [8]; perhaps this is the reason why their best result, 73.74%, is even lower than what has been achieved by flat categorizers. Our method achieved remarkable results on the flat category system as well. ...
Article
Full-text available
We developed a general categorizer engine, called UFEX (Universal Feature EXtractor), that is able to recognize relevant characteristics of categories based on training examples. The engine is specialized to categorize entities into hierarchical category systems (taxonomies), but it is equally able to work on flat category systems. The iterative learning algorithm of UFEX, which allows a classifier to be created gradually by a trial-and-error-like method, is able to work with arbitrary inputs, but here we solely focus on text categorization (TC). TC is the task of assigning a text document to an appropriate category in a predefined set of categories. This exposé introduces our software HITEC (HIerarchical TExt Categorizer), which contains the implementation of UFEX and I/O interfaces for hierarchical TC. This work presents HITEC's and UFEX's effectiveness on large data collections. We experimented on several document collections: the well-known Reuters-21578 database, a corporate database containing 38730 documents in a category tree of about 12000 elements in 10 levels, and the very large WIPO-alpha (World Intellectual Property Organization, Geneva, Switzerland, 2002) English patent database that consists of about 75000 XML documents distributed over 5000 categories. HITEC is able to index the corpus quickly and creates a classifier in a few iteration cycles. We present the results achieved by the classifier w.r.t. various test settings.
... Before the release of RCV1 there was no widely used benchmark text collection available for hierarchic TC, and therefore researchers performed their tests on diverse corpora (see e.g. [1] [2] [3] [4] [5] [6] [7] [8]), which made the comparison of the methods extremely difficult and unreliable [9]. Therefore it is a straightforward step for developers of hierarchical categorizers to run tests on the new collection. ...
Article
Full-text available
This paper presents categorization results obtained by means of the HITEC categorizer tool on the new benchmark document collection for text categorization, the Reuters Corpus Volume 1 (RCV1). RCV1 is an archive of over 800,000 manually categorized newswire stories made available by Reuters in 2000 for research purposes. This collection was released to take the place of the Reuters-21578 collection, which has been widely used in the text retrieval community. This paper intends to add some interesting results to the characterization of RCV1 and the HITEC categorizer.
... Sorting documents into a hierarchical category system (termed a taxonomy) can be performed by means of automatic hierarchical text categorizers. In this regard, the first papers were published in the late 1990s [2] [3] [4] [5]. However, most of the early methods were not able to cope with extremely large taxonomies and document corpora, such as the International Patent Classification (IPC) taxonomy consisting of about 5000 categories at the top four levels, since they did not incorporate the hierarchy in the categorization algorithm, but applied a classical 'flat' classifier algorithm to the flattened category system. ...
Article
Full-text available
In this paper we present a categorization-based method for supporting navigation in large document corpora that are created, collected, revised, tagged etc. by independent individuals forming an online community. Users often face the problem of redundancy when they intend to contribute to the content created by a community. Since different people have different skills, backgrounds, and use of terminology, one may easily 'reinvent the wheel' when adding seemingly new, but semantically duplicated content to an open document corpus. This reduces the usability and coherence, and hence decreases the quality of the corpus. One way to minimize the risk of such a problem is to support the users with navigational and searching facilities in the corpus. Our paper proposes to exploit any available topical structure to build a category-based model of the corpus, and to apply categorization to new content. We performed a case study on the English Wikipedia, one of the largest online document corpora, which shows that our methodology can be useful for regular users, as well as for administrators. The category-based model was built by the HITEC document processing and categorization framework.
... Originally, research in text categorization addressed the binary problem, where a document is either relevant or not w.r.t. a given category. In real-world situations, however, the great variety of different sources and hence categories usually poses a multi-class classification problem, where a document belongs to exactly one category selected from a predefined set [2, 7, 15, 17]. Even more general is the multi-label problem, where a document can be classified into more than one category. ...
Conference Paper
Full-text available
Text categorization is the task of assigning a text document to an appropriate category in a predefined set of categories. We focus on the special case when categories are organized in a hierarchy. We present a new approach to this recently emerged subfield of text categorization. The algorithm applies an iterative learning module that allows a classifier to be created gradually by a trial-and-error-like method. We present software that has been developed on the basis of the algorithm to illustrate the capability of the algorithm on large data collections. We experimented on a very large benchmark collection, the WIPO-alpha (World Intellectual Property Organization, Geneva, Switzerland, 2002) English patent database that consists of about 75000 XML documents distributed over 5000 categories. Our software is able to index the corpus quickly and creates a classifier in a few iteration cycles. We present the results achieved by the classifier w.r.t. various test settings.
... Besides speeding up categorization, papers also report that it can increase classifier performance by a few percent if only a certain subset of terms is used to represent documents (see e.g. [9] [16]). In our previous experiments [13] we also found that performance can be increased slightly (less than 1%) if rare terms are disregarded, but the effect of DR on time efficiency is more significant. ...
Conference Paper
Full-text available
HITEC is a hierarchical text categorizer tool that is based on UFEX (universal feature extractor) algorithm. This paper presents experiments on the effectiveness of HITEC on several natural languages (English, German) and with various kinds of text corpora. The obtained results show that HITEC outperforms its known competitors on the investigated corpora, and its performance is independent from the processed languages. The time and storage requirement of HITEC is considerable, therefore it can be run on an average PC.
... In addition, a considerable number of studies have been performed by many researchers who sought to reveal the link between text classification models and feature selection methods [4][6] [7][8] [9][10] [11][12] [13] [14][15] [16]. Recently, many studies have been published on the invention of new feature selection methods based on existing feature evaluation models, which seek to enhance the performance of text classification [20][21] [21] [22] [23] [24] [25][26] [27][28] [29]. In particular, nearly all of these aforementioned approaches are dependent upon the statistical characteristics of each target domain. ...
Article
Full-text available
This paper introduces DICE, a Domain-Independent text Classification Engine. DICE is robust, efficient, and domain-independent in terms of software and architecture. Each module of the system is clearly modularized and encapsulated for extensibility. The clear modular architecture allows for simple and continuous verification and facilitates changes in multiple cycles, even after its major development period is complete. Those who want to make use of DICE can easily implement their ideas on this test bed and optimize it for a particular domain by simply adjusting the configuration file. Unlike other publicly available tool kits or development environments targeted at general purpose classification models, DICE specializes in text classification with a number of useful functions specific to it. This paper focuses on the ways to locate the optimal states of a practical text classification framework by using various adaptation methods provided by the system such as feature selection, lemmatization, and classification models.
... This approach makes the assumption that the most important information and discriminating features are found near the beginning of the document. Shanks and Williams [112] were able to accurately classify text documents with this method of feature set reduction, while Wibowo and Williams [140] applied this approach to the hierarchical classification of Web pages. Kim and Ross [63,64,65,66] also followed this approach for one of the classifiers they investigated for the task of classifying documents in PDF by genre; the visual layout features for the classifier were extracted from only the first page of a PDF file when it was treated as an image. ...
Article
The extraordinary growth in both the size and popularity of the World Wide Web has created a growing interest not only in identifying Web page genres, but also in using these genres to classify Web pages. The hypothesis of this research is that an n-gram representation of a Web page can be used effectively to automatically classify that Web page by genre. This research involves the development and testing of a new model for the automatic identification of Web page genre; classification results using this model compare very favorably with those of other researchers.
... Besides speeding up categorization, papers also report that it can increase classifier performance by a few percent if only a certain subset of terms is used to represent documents (see e.g. [15, 16]). In our previous experiments [17] we also found that performance can be increased slightly (less than 1%) if rare terms are disregarded, but the effect of DR on time efficiency is more significant. ...
Article
Full-text available
This paper presents experiments with a hierarchical text categorizer on a multi-lingual (English, French) corpus. The results obtained are very similar for both languages. The results allow us to apply in the near future cross-language text categorization that can be used to support automatic translation to create multi-lingual topic glossary.
... This approach is based on an assumption that a summary is present at the beginning of each document, which is usually true for news articles, but does not always hold for other kinds of documents. However, this approach was later applied to hierarchical classification of web pages by Wibowo and Williams [198], and was shown to be useful for web documents. ...
... Text classification [9] is considered as the act of dividing a set of input documents into two or more categories where each document can be said to belong to one or multiple classes. Large growth of information flows and especially the explosive growth of Internet and computer network promoted growth of automated text classification. ...
... This dropped the number of features used from 17,827 unique terms to 6,883 unique terms. Wibowo and Williams (2002) combine stop word removal with document frequency thresholding to eliminate not only the more rarely used words but also the most commonly used words. ...
... Therefore, removing these terms will not affect the quality of the classification (Efron et al., 2003). However, it will reduce the complexity and the required time as well as increasing the accuracy of the classification process (Wibowo and Williams, 2002). ...
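The combination mentioned in the snippets above, stop word removal plus document frequency thresholding, can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of Wibowo and Williams (2002): the stop list, the thresholds `min_df` and `max_df_ratio`, the whitespace tokenisation, and the toy documents are all hypothetical.

```python
from collections import Counter

# Tiny illustrative stop list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "of", "and", "is", "in"}

def select_terms(docs, min_df=2, max_df_ratio=0.5, stop_words=STOP_WORDS):
    """Keep terms that are not stop words, occur in at least `min_df`
    documents, and occur in at most `max_df_ratio` of all documents."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        # Count each term once per document (document frequency).
        df.update(set(doc.lower().split()))
    return {
        term
        for term, count in df.items()
        if term not in stop_words
        and count >= min_df
        and count / n_docs <= max_df_ratio
    }

docs = [
    "the market price of oil rose",
    "oil price fell in the market",
    "the wheat market is stable",
    "wheat and corn price report",
]
selected = select_terms(docs)
```

On this toy corpus the rare terms (for example "rose" and "corn") fall below `min_df`, the very common terms ("market" and "price") exceed `max_df_ratio`, and only the mid-frequency terms "oil" and "wheat" survive.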
Article
Full-text available
The role of intelligence and security informatics based on statistical computations is becoming more significant in detecting terrorism activities proactively as the extremist groups are misusing many of the obtainable facilities on the Internet to incite violence and hatred. However, the performance of statistical methods is limited due to the inadequate accuracy produced by the inability of these methods to comprehend the texts created by humans. In this paper, we propose a hybridized feature selection method based on the basic term-weighting techniques for accurate terrorism activities detection in textual contexts. The proposed method combines the feature sets selected based on different individual feature selection methods into one feature space for effective web pages classification. UNION and Symmetric Difference combination functions are proposed for dimensionality reduction of the combined feature space. The method is tested on a selected dataset from the Dark Web Forum Portal and benchmarked using various famous text classifiers. Experimental results show that the hybridized method efficiently identifies the terrorist activities content and outperforms the individual methods. Furthermore, the results revealed that the classification performance achieved by hybridizing few feature sets is relatively competitive in the number of features used for classification with higher hybridization levels. Moreover, the experiments of hybridizing functions show that the dimensionality of the feature sets is significantly reduced by applying the Symmetric Difference function for feature sets combination.
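The two combination functions named in the abstract above correspond to standard set operations. The sketch below is only an assumption about how such a combination could look; the abstract does not give the authors' exact definitions, and the function names and term sets are made up for illustration.

```python
def combine_union(*feature_sets):
    """Union: keep every feature chosen by at least one selection method."""
    combined = set()
    for features in feature_sets:
        combined |= features
    return combined

def combine_symmetric_difference(set_a, set_b):
    """Symmetric difference: keep features chosen by exactly one of two methods."""
    return set_a ^ set_b

# Hypothetical outputs of two individual feature selection methods.
selected_by_method_a = {"alpha", "beta", "gamma"}
selected_by_method_b = {"beta", "gamma", "delta"}

union_features = combine_union(selected_by_method_a, selected_by_method_b)
symdiff_features = combine_symmetric_difference(selected_by_method_a,
                                                selected_by_method_b)
```

In this toy case the union holds four features while the symmetric difference holds two, which is in line with the abstract's remark that the symmetric difference yields a smaller combined feature space.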
... There have been several studies focused on feature selection methods for the flat classification problem [18,19,20,21,22,23]. However, very few works emphasize feature selection for the HC problem, and these are limited to a small number of categories [24,25]. Figure 1 demonstrates the importance of feature selection for hierarchical settings where only the relevant features are chosen at each of the decision (internal) nodes. ...
Article
Full-text available
Large-scale Hierarchical Classification (HC) involves datasets consisting of thousands of classes and millions of training instances with high-dimensional features, posing several big data challenges. Feature selection, which aims to select the subset of discriminant features, is an effective strategy to deal with the large-scale HC problem. It speeds up the training process, reduces the prediction time and minimizes the memory requirements by compressing the total size of learned model weight vectors. The majority of studies have also shown feature selection to be competent and successful in improving the classification accuracy by removing irrelevant features. In this work, we investigate various filter-based feature selection methods for dimensionality reduction to solve the large-scale HC problem. Our experimental evaluation on text and image datasets with varying distribution of features, classes and instances shows up to 3x order of speed-up on massive datasets and up to 45% less memory requirements for storing the weight vectors of the learned model without any significant loss (improvement for some datasets) in the classification accuracy. Source Code: https://cs.gmu.edu/~mlbio/featureselection.
... The research community has also devoted efforts to cope with high-level error recovering. The interested reader will find several proposals aimed at tackling this issue in [8] and in [37]. ...
Article
Full-text available
Progressive filtering is a simple way to perform hierarchical classification, inspired by the behavior that most humans put into practice while attempting to categorize an item according to an underlying taxonomy. Each node of the taxonomy being associated with a different category, one may visualize the categorization process by looking at the item going downwards through all the nodes that accept it as belonging to the corresponding category. This paper is aimed at modeling the progressive filtering technique from a probabilistic perspective, in a hierarchical text categorization setting. As a result, the designer of a system based on progressive filtering should be facilitated in the task of devising, training, and testing it.
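The top-down process described in the abstract above can be sketched as a recursive walk over a taxonomy. This is a minimal illustration and not the paper's probabilistic model: the `Node` class, the keyword-based `accepts` predicates (stand-ins for trained per-category classifiers), and the toy taxonomy are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Node:
    name: str
    # Stand-in for a trained binary classifier attached to this category.
    accepts: Callable[[str], bool]
    children: List["Node"] = field(default_factory=list)

def progressive_filter(node: Node, item: str,
                       path: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
    """Return the deepest paths along which every node accepts the item."""
    if not node.accepts(item):
        return []  # rejected here, so nothing below this node is explored
    path = path + (node.name,)
    deeper: List[Tuple[str, ...]] = []
    for child in node.children:
        deeper.extend(progressive_filter(child, item, path))
    # If no child accepts the item, it stops at the current node.
    return deeper or [path]

# Toy taxonomy with keyword predicates (illustrative only).
football = Node("football", lambda text: "goal" in text)
tennis = Node("tennis", lambda text: "serve" in text)
sport = Node("sport", lambda text: "match" in text, [football, tennis])
finance = Node("finance", lambda text: "stock" in text)
root = Node("root", lambda text: True, [sport, finance])

paths = progressive_filter(root, "the match ended with a late goal")
```

Because a rejected node prunes its whole subtree, an error at a high level cannot be undone further down in this simple form, which is the high-level error recovery issue that the citing snippets mention.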
... There are various methods of selecting features for hierarchical classification; nevertheless, most of these methods are focused on hierarchical document classification ([10], [26]), webpage classification ([27], [28]) and text classification [29]. ...
Conference Paper
Full-text available
Protein datasets have a high-dimensional feature space. These features may include noise and may not be relevant to protein function. Therefore, we need to select the appropriate features to improve the efficiency and performance of the classifier. Feature selection is an important step in any classification task. Filter methods are important in order to obtain only the features relevant to the class and to avoid redundancy, while wrapper methods are applied to get optimized features and better classification accuracy. This paper proposes a feature selection strategy for hierarchical classification of G-Protein-Coupled Receptors (GPCR) based on hybridization of correlation feature selection (CFS) filter and genetic algorithm (GA) wrapper methods. The optimum features were then classified using the K-nearest neighbor algorithm. These methods are able to reduce the number of features and achieve comparable classification accuracy at every hierarchy level. The results also show that the integration of CFS and GA is capable of finding the optimum features for hierarchical protein classification.
Article
Full-text available
Text categorization is the task of assigning a text document to an appropriate category in a predefined set of categories. This paper focuses on the special case when categories are organized in a hierarchy. We present a new approach to this recently emerged subfield of text categorization. The algorithm applies an iterative learning module that allows a classifier to be created gradually by a trial-and-error-like method. We present software that has been developed on the basis of the algorithm to illustrate the capability of the algorithm on large data collections. We experimented on a very large benchmark collection, the WIPO-alpha (World Intellectual Property Organization, Geneva, Switzerland, 2002) English patent database that consists of about 75000 XML documents distributed over 5000 categories. Our software is able to index the corpus quickly and creates a classifier in a few iteration cycles. We present the results achieved by the classifier w.r.t. various test settings.
... The research community has also devoted efforts to cope with high-level error recovering. The interested reader will find several proposals aimed at tackling this issue in [13] and in [50]. ...
Article
Full-text available
Progressive filtering is a simple way to perform hierarchical classification, inspired by the behavior that most humans put into practice while attempting to categorize an item according to an underlying taxonomy. Each node of the taxonomy being associated with a different category, one may visualize the categorization process by looking at the item going downwards through all the nodes that accept it as belonging to the corresponding category. This paper is aimed at modeling the progressive filtering technique from a probabilistic perspective. As a result, the designer of a system based on progressive filtering should be facilitated in the task of devising, training, and testing it.
... Therefore, removing these terms will not affect the quality of the classification [34]. However, it will reduce the complexity and the required time as well as increasing the accuracy of the classification process [101]. ...
Article
As a commonly used technique in data preprocessing for machine learning, feature selection identifies important features and removes irrelevant, redundant or noisy features to reduce the dimensionality of the feature space. It improves the efficiency, accuracy and comprehensibility of the models built by learning algorithms. Feature selection techniques have been widely employed in a variety of applications, such as genomic analysis, information retrieval, and text categorization. Researchers have introduced many feature selection algorithms with different selection criteria. However, it has been discovered that no single criterion is best for all applications. We propose a hybrid feature selection framework based on genetic algorithms (GAs) that employs a target learning algorithm to evaluate features, that is, a wrapper method. We call it the hybrid genetic feature selection (HGFS) framework. The advantages of this approach include the ability to accommodate multiple feature selection criteria and to find small subsets of features that perform well for the target algorithm. The experiments on genomic data demonstrate that ours is a robust
Conference Paper
Crowdsourcing is a low cost way of obtaining human judgements on a large number of items, but the knowledge in these judgements is not reusable and further items to be processed require further human judgement. Ideally one could also obtain the reasons people have for these judgements, so the ability to make the same judgements could be incorporated into a crowd-sourced knowledge base. This paper reports on experiments with 27 students building knowledge bases to classify the same set of 1000 documents. We have assessed the performance of the students building the knowledge bases using the same students to assess the performance of each other’s knowledge bases on a set of test documents. We have explored simple techniques for combining the knowledge from the students. These results suggest that although people vary in document classification, simple merging may produce reasonable consensus knowledge bases.
Article
Classification of Web page content is essential to many tasks in Web information retrieval such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges to Web page classification as compared to traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process. As we review work in Web page classification, we note the importance of these Web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the use of information from neighboring pages.
Chapter
Full-text available
We present HITEC, an automated text classifier, and its application to categorizing the English and German patent collections of WIPO under the IPC taxonomy. IPC covers all areas of technology and is currently used by the industrial property offices of many countries. Patent classification is indispensable for the retrieval of patent documents in the search for prior art. Such retrieval is crucial to patent-issuing authorities, potential inventors, research and development units, and others concerned with the application or development of technology. An efficient automated patent classifier is a crucial component in providing an automated classification assistance system for categorizing patent applications in the IPC, which is a main aim at WIPO (Fall et al., 2002). HITEC can be a prominent candidate for this purpose.
Conference Paper
Full-text available
Many real-world classification problems involve classes organized in a hierarchical tree-like structure. However, in many cases the hierarchical structure is ignored and each class is treated in isolation; in other words, the class structure is flattened (Dumais and Chen, 2000). In this paper, we propose a new approach to incorporating hierarchical structure knowledge by cascading it as an additional feature for the child-level classifier. We posit that our cascading model will outperform the baseline "flat" model. Our empirical experiment provides strong evidence supporting our proposal. Interestingly, even imperfect hierarchical structure knowledge also improves classification performance.
Chapter
LSHC involves datasets consisting of thousands of classes and millions of training instances with high-dimensional features, posing several big data challenges. Feature selection, which aims to select a subset of discriminant features, is an effective strategy for dealing with large-scale problems. It speeds up the training process, reduces the prediction time, and minimizes the memory requirements by compressing the total size of the learned model weight vectors. The majority of studies have also shown feature selection to be successful in improving classification accuracy by removing irrelevant features. In this chapter, we investigate various filter-based feature selection methods for dimensionality reduction to solve the LSHC problem.
Article
Full-text available
Thesis (Ph.D.)--University of Ottawa, 2006. Includes bibliographies.
Chapter
Patent categorization (PC) is a typical application area of text categorization (TC). TC can be applied in different scenarios at the work of patent offices depending on at what stage the categorization is needed. This is a challenging field for TC algorithms, since the applications have to deal simultaneously with large number of categories (in the magnitude of 1000–10000) organized in hierarchy, large number of long documents with huge vocabularies at training, and they are required to work fast and accurate at on-the-fly categorization. In this paper we present a hierarchical online classifier, called HITEC, which meets the above requirements. The novelty of the method relies on the taxonomy dependent architecture of the classifier, the applied weight updating scheme, and on the relaxed category selection method. We evaluate the presented method on two large English patent application databases, the WIPO-alpha and the Espace A/B corpora. We also compare the presented method to other TC algorithms on these collections, and show that it outperforms them significantly.
Article
Residents have a limited time to be trained. Although having a highly variable caseload should be beneficial for resident training, residents do not necessarily get a uniform distribution of cases. By developing a dashboard where residents and their attendings can track the procedures they have done and cases that they have seen, we hope to give residents a greater insight into their training and into where gaps in their training may be occurring. By taking advantage of modern advances in NLP techniques, we process medical records and generate statistics describing each resident’s progress so far. We have built the system described and its life within the NYP ecosystem. By creating better tracking, we hope that caseloads can be shifted to better close any individual gaps in training. One of the educational pain points for radiology residency is the assignment of cases to match a well-balanced curriculum. By illuminating the historical cases of a resident, we can better assign future cases for a better educational experience.
Article
Full-text available
Patent categorization (PC) is a typical application area of text categorization (TC). TC can be applied in different scenarios at the work of patent offices depending on at what stage the categorization is needed. This is a challenging field for TC algorithms, since the applications have to deal simultaneously with large number of categories (in the magnitude of 1000–10000) organized in hierarchy, large number of long documents with huge vocabularies at training, and they are required to work fast and accurate at on-the-fly categorization. In this paper we present a hierarchical online classifier, called HITEC, which meets the above requirements. The novelty of the method lies in the taxonomy dependent architecture of the classifier, the applied weight updating scheme, and in the relaxed category selection method. We evaluate the presented method on two large English patent application databases, the WIPO-alpha and the Espace A/B corpora. We also compare the method to other TC algorithms on these collections, and show that it outperforms them significantly. Keywords: Patent Categorization, Hierarchical Text Categorization, International Patent
Article
Full-text available
We consider the problem of assigning level numbers (weights) to hierarchically organized categories during the process of text categorization. These levels control the ability of the categories to attract documents during the categorization process. The levels are adjusted in order to obtain a balance between recall and precision for each category. If a category's recall exceeds its precision, the category is too strong and its level is reduced. Conversely, a category's level is increased to strengthen it if its precision exceeds its recall. The categorization algorithm used is a supervised learning procedure that uses a linear classifier based on the category levels. We are given a set of categories, organized hierarchically. We are also given a training corpus of documents already placed in one or more categories. From these, we extract vocabulary, words that appear with high frequency within a given category, characterizing each subject area. Each node's vocabulary is filtered and its words assigned weights with respect to the specific category. Then, test documents are scanned and categories ranked based on the presence of vocabulary terms. Documents are assigned to categories based on these rankings. We demonstrate that precision and recall can be significantly improved by solving the categorization problem taking hierarchy into account. Specifically, we show that by adjusting the category levels in a principled way, precision can be significantly improved, from 84% to 91%, on the much-studied Reuters-21578 corpus organized in a three-level hierarchy of categories.
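The adjustment rule stated in this abstract (lower the level when recall exceeds precision, raise it when precision exceeds recall) can be sketched in Python as follows. The integer step size, the function name `adjust_level`, and the example numbers are assumptions for illustration; the abstract does not specify how large each adjustment is or how often it is applied.

```python
def adjust_level(level: int, precision: float, recall: float, step: int = 1) -> int:
    """Return an updated category level given measured precision and recall.

    Assumption: levels are integers moved by a fixed step per update.
    """
    if recall > precision:
        # The category attracts too many documents: weaken it.
        return level - step
    if precision > recall:
        # The category attracts too few documents: strengthen it.
        return level + step
    return level


# Hypothetical categories: name -> (current level, precision, recall).
categories = {
    "grain": (5, 0.80, 0.90),
    "metals": (5, 0.90, 0.80),
    "energy": (5, 0.85, 0.85),
}

new_levels = {
    name: adjust_level(level, precision, recall)
    for name, (level, precision, recall) in categories.items()
}
```

Repeating such an update until precision and recall are roughly balanced is one plausible reading of "adjusted in order to obtain a balance"; the stopping criterion is likewise an assumption.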
Article
Full-text available
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.
Conference Paper
Full-text available
Given a set of categories, with or without a preexisting hierarchy among them, we consider the problem of assigning documents to one or more of these categories from the point of view of a hierarchy with more or less depth. We can choose to make use of none, part or all of the hierarchical structure to improve the categorization effectiveness and efficiency. It is possible to create additional hierarchy among the categories. We describe a procedure for generating a hierarchy of classifiers that models the hierarchy structure. We report on computational experience using this procedure. We show that judicious use of a hierarchy can significantly improve both the speed and effectiveness of the categorization process. Using the Reuters-21578 corpus, we obtain an improvement in running time of over a factor of three and a 5% improvement in F-measure. 1. Introduction and Background The document categorization problem is one of assigning newly arriving documents to one or more preexisting c...
Conference Paper
Full-text available
This paper presents the design and evaluation of a text categorization method based on the Hierarchical Mixture of Experts model. This model uses a divide and conquer principle to define smaller categorization problems based on a predefined hierarchical structure. The final classifier is a hierarchical array of neural networks. The method is evaluated using the UMLS Metathesaurus as the underlying hierarchical structure, and the OHSUMED test set of MEDLINE records. Comparisons with traditional Rocchio's algorithm adapted for text categorization, as well as flat neural network classifiers are provided. The results show that the use of the hierarchical structure improves text categorization performance significantly.
Article
Full-text available
This paper describes automatic document categorization based on a large text hierarchy. We handle the large number of features and training examples by taking into account the hierarchical structure of examples and using feature selection for large text data. We experimentally evaluate feature subset selection on real-world text data collected from the existing Web hierarchy named Yahoo. In our learning experiments, a naive Bayesian classifier was used on text data using a feature-vector document representation that includes word sequences (n-grams) instead of just single words (unigrams). Experimental evaluation on real-world data collected from the Web shows that our approach gives promising results and can potentially be used for document categorization on the Web. Additionally, the best result on our data is achieved for a relatively small feature subset, while for larger subsets the performance drops substantially. The best performance among the six tested feature scoring measures was achieved by the measure called Odds ratio, which is known from information retrieval.
Article
Full-text available
We describe the results of extensive experiments on large document collections using optimized rule-based induction methods. The goal of these methods is to automatically discover classification patterns that can be used for general document categorization or personalized filtering of free text. Previous reports indicate that human-engineered rule-based systems, requiring many man-years of developmental efforts, have been successfully built to "read" documents and assign topics to them. In this paper, we show that machine-generated decision rules appear comparable to human performance, while using the identical rule-based representation. In comparison with other machine learning techniques, results on a key benchmark from the Reuters collection show a large gain in performance, from a previously reported 65% recall/precision breakeven point to 80.5%. In the context of a very high dimensional feature space, several methodological alternatives are examined, including universal versu...
Article
Full-text available
Systems for text retrieval, routing, categorization and other IR tasks rely heavily on linear classifiers. We propose that two machine learning algorithms, the Widrow-Hoff and EG algorithms, be used in training linear text classifiers. In contrast to most IR methods, theoretical analysis provides performance guarantees and guidance on parameter settings for these algorithms. Experimental data is presented showing Widrow-Hoff and EG to be more effective than the widely used Rocchio algorithm on several categorization and routing tasks. 1 Introduction Document retrieval, categorization, routing, and filtering systems often are based on classification. That is, the IR system decides for each document which of two or more classes it belongs to, or how strongly it belongs to a class, in order to accomplish the IR task of interest. For instance, the two classes may be the documents relevant to and not relevant to a particular user, and the system may rank documents based on how likely it i...
Article
This paper examines statistical techniques for exploiting relevance information to weight search terms. These techniques are presented as a natural extension of weighting methods using information about the distribution of index terms in documents in general. A series of relevance weighting functions is derived and is justified by theoretical considerations. In particular, it is shown that specific weighted search methods are implied by a general probabilistic theory of retrieval. Different applications of relevance weighting are illustrated by experimental results for test collections.
Conference Paper
In this paper, we compare learning techniques based on statistical classification to traditional methods of relevance feedback for the document routing problem. We consider three classification techniques which have decision rules that are derived via explicit error minimization: linear discriminant analysis, logistic regression, and neural networks. We demonstrate that the classifiers perform 10-15% better than relevance feedback via Rocchio expansion for the TREC-2 and TREC-3 routing tasks. Error minimization is difficult in high-dimensional feature spaces because the convergence process is slow and the models are prone to overfitting. We use two different strategies, latent semantic indexing and optimal term selection, to reduce the number of features. Our results indicate that features based on latent semantic indexing are more effective for techniques such as linear discriminant analysis and logistic regression, which have no way to protect against overfitting. Neural networks perform equally well with either set of features and can take advantage of the additional information available when both feature sets are used as input.
Conference Paper
The Construe news story categorization system assigns indexing terms to news stories according to their content using knowledge-based techniques. An initial deployment of Construe in Reuters Ltd. topic identification system (TIS) has replaced human indexing for Reuters Country Reports, an online information service based on news stories indexed by country and type of news. TIS indexing is comparable to human indexing in overall accuracy but costs much less, is more consistent, and is available much more rapidly. TIS can be justified in terms of cost savings alone, but Reuters also expects the speed and consistency of TIS to provide significant competitive advantage and, hence, an increased market share for Country Reports and other products from Reuters Historical Information Products Division.
Article
With the recent dramatic increase in electronic access to documents, text categorization, the task of assigning topics to a given document, has moved to the center of the information sciences and knowledge management. This article uses the structure that is present in the semantic space of topics in order to improve performance in text categorization: according to their meaning, topics can be grouped together into "meta-topics", e.g., gold, silver, and copper are all metals. The proposed architecture matches the hierarchical structure of the topic space, as opposed to a flat model that ignores the structure. It accommodates both single and multiple topic assignments for each document. Its probabilistic interpretation allows its predictions to be combined in a principled way with information from other sources. The first level of the architecture predicts the probabilities of the meta-topic groups. This allows the individual models for each topic on the second level to focus on finer discriminations within the group. Evaluating the performance of a two-level implementation on the Reuters-22173 testbed of newswire articles shows the most significant improvement for rare classes.
Article
Training a support vector machine (SVM) leads to a quadratic optimization problem with bound constraints and one linear equality constraint. Despite the fact that this type of problem is well understood, there are many issues to be considered in designing an SVM learner. In particular, for large learning tasks with many training examples, off-the-shelf optimization techniques for general quadratic programs quickly become intractable in their memory and time requirements. SVMlight is an implementation of an SVM learner which addresses the problem of large tasks. This chapter presents algorithmic and computational results developed for SVMlight V 2.0, which make large-scale SVM training more practical. The results give guidelines for the application of SVMs to large domains.
Article
This paper describes the application of Distributional Clustering [20] to document classification. This approach clusters words into groups based on the distribution of class labels associated with each word. Thus, unlike some other unsupervised dimensionality-reduction techniques, such as Latent Semantic Indexing, we are able to compress the feature space much more aggressively, while still maintaining high document classification accuracy. Experimental results obtained on three real-world data sets show that we can reduce the feature dimensionality by three orders of magnitude and lose only 2% accuracy, significantly better than Latent Semantic Indexing [6], class-based clustering [1], feature selection by mutual information [23], or Markov-blanket-based feature selection [13]. We also show that less aggressive clustering sometimes results in improved classification accuracy over classification without clustering. 1 Introduction The popularity of the Internet has caused an exponen...
Article
This paper presents the design and evaluation of a text categorization method based on the Hierarchical Mixture of Experts model. This model uses a divide and conquer principle to define smaller categorization problems based on a predefined hierarchical structure. The final classifier is a hierarchical array of neural networks. The method is evaluated using the UMLS Metathesaurus as the underlying hierarchical structure, and the OHSUMED test set of MEDLINE records. Comparisons with traditional Rocchio's algorithm adapted for text categorization, as well as flat neural network classifiers are provided. The results show that the use of the hierarchical structure improves text categorization performance significantly. 1 Introduction Text categorization, also known as automatic indexing, is the process of algorithmically analyzing an electronic document to assign a set of categories (or index terms) that succinctly describe the content of the document. This assignment can be used for classic...
Article
In designing data structures for text databases, it is valuable to know how many different words are likely to be encountered in a particular collection. For example, vocabulary accumulation is central to index construction for text database systems; it is useful to be able to estimate the space requirements and performance characteristics of the main-memory data structures used for this task. However, it is not clear how many distinct words will be found in a text collection or whether new words will continue to appear after inspecting large volumes of data. We propose practical definitions of a word, and investigate new word occurrences under these models in a large text collection. We inspected around two billion word occurrences in 45 gigabytes of world-wide web documents, and found just over 9.74 million different words in 5.5 million documents; overall, 1 word in 200 was new. We observe that new words continue to occur, even in very large data sets, and that choosing stricter definitions of what constitutes a word has only limited impact on the number of new words found.
Article
This paper is a comparative study of feature selection methods in statistical learning of text categorization. The focus is on aggressive dimensionality reduction. Five methods were evaluated, including term selection based on document frequency (DF), information gain (IG), mutual information (MI), a χ²-test (CHI), and term strength (TS). We found IG and CHI most effective in our experiments. Using IG thresholding with a k-nearest-neighbor classifier on the Reuters corpus, removal of up to 98% of unique terms actually yielded an improved classification accuracy (measured by average precision). DF thresholding performed similarly. Indeed, we found strong correlations between the DF, IG and CHI values of a term. This suggests that DF thresholding, the simplest method with the lowest cost in computation, can be reliably used instead of IG or CHI when the computation of these measures is too expensive. TS compares favorably with the other methods with up to 5...
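To make the CHI criterion concrete, the sketch below computes the standard χ² statistic for a term and a category from a 2×2 document-count table and keeps the top-scoring terms. The count names `a`, `b`, `c`, `d`, the helper `top_k_terms`, and the example counts are assumptions for illustration; the numbers are invented, not taken from the paper.

```python
from typing import Dict, List, Tuple


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic for one term and one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category without the term
    d: documents outside the category without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (e.g. the term occurs in every document).
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def top_k_terms(counts: Dict[str, Tuple[int, int, int, int]], k: int) -> List[str]:
    """Return the k terms with the highest chi-square score."""
    ranked = sorted(counts, key=lambda t: chi_square(*counts[t]), reverse=True)
    return ranked[:k]


# Invented counts over 100 documents, 50 of them in the category.
counts = {
    "wheat": (40, 10, 10, 40),
    "the": (25, 25, 25, 25),
    "grain": (30, 5, 20, 45),
}

selected = top_k_terms(counts, 2)
```

In this notation the document frequency of a term is simply `a + b`, which is why DF thresholding is the cheapest of the criteria listed in the abstract.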
Article
A probabilistic analysis of the Rocchio relevance feedback algorithm, one of the most popular learning methods from information retrieval, is presented in a text categorization framework. The analysis results in a probabilistic version of the Rocchio classifier and offers an explanation for the TFIDF word weighting heuristic. The Rocchio classifier, its probabilistic variant and a standard naive Bayes classifier are compared on three text categorization tasks. The results suggest that the probabilistic algorithms are preferable to the heuristic Rocchio classifier. This research is sponsored by the Wright Laboratory, Aeronautical Systems Center, Air Force Materiel Command, USAF, and the Advanced Research Projects Agency (ARPA) under grant F33615-93-1-1330. The US Government is authorized to reproduce and distribute reprints for Government purposes, notwithstanding any copyright notation thereon. Views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of Wright Laboratory or the United States Government. Keywords: text categorization, relevance feedback, naive Bayes classifier, information retrieval, vector space retrieval model, machine learning
Article
This paper studies noise reduction for computational efficiency improvements in a statistical learning method for text categorization, the Linear Least Squares Fit (LLSF) mapping. Multiple noise reduction strategies are proposed and evaluated, including: an aggressive removal of "non-informative words" from texts before training; the use of a truncated singular value decomposition to cut off noisy "latent semantic structures" during training; the elimination of non-influential components in the LLSF solution (a word-concept association matrix) after training. Text collections in different domains were used for evaluation. Significant improvements in computational efficiency without losing categorization accuracy were evident in the testing results.
Article
The proliferation of topic hierarchies for text documents has resulted in a need for tools that automatically classify new documents within such hierarchies. Existing classification schemes which ignore the hierarchical structure and treat the topics as separate classes are often inadequate in text classification where there is a large number of classes and a huge number of relevant features needed to distinguish between them. We propose an approach that utilizes the hierarchical topic structure to decompose the classification task into a set of simpler problems, one at each node in the classification tree. As we show, each of these smaller problems can be solved accurately by focusing only on a very small set of features, those relevant to the task at hand. This set of relevant features varies widely throughout the hierarchy, so that, while the overall relevant feature set may be large, each classifier only examines a small subset. The use of reduced feature sets allows us to utilize more complex (probabilistic) models, without encountering many of the standard computational and robustness difficulties.
Article
This paper explores the use of hierarchical structure for classifying a large, heterogeneous collection of web content. The hierarchical structure is initially used to train different second-level classifiers. In the hierarchical case, a model is learned to distinguish a second-level category from other categories within the same top level. In the flat non-hierarchical case, a model distinguishes a second-level category from all other second-level categories. Scoring rules can further take advantage of the hierarchy by considering only second-level categories that exceed a threshold at the top level. We use support vector machine (SVM) classifiers, which have been shown to be efficient and effective for classification, but not previously explored in the context of hierarchical classification. We found small advantages in accuracy for hierarchical models over flat models. For the hierarchical approach, we found the same accuracy using a sequential Boolean decision rule and a multiplica...
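The two ways of combining top-level and second-level scores mentioned in this abstract can be contrasted with a short Python sketch. The function names, the threshold values, and the assumption that both classifiers output probabilities in [0, 1] are illustrative choices, not details given in the abstract.

```python
def boolean_rule(p_top: float, p_second: float,
                 top_threshold: float = 0.2,
                 second_threshold: float = 0.2) -> bool:
    """Sequential Boolean rule: the top level must pass before the
    second-level score is considered at all."""
    return p_top >= top_threshold and p_second >= second_threshold


def multiplicative_rule(p_top: float, p_second: float,
                        threshold: float = 0.05) -> bool:
    """Multiplicative rule: threshold the product of the two scores."""
    return p_top * p_second >= threshold


# Hypothetical (top-level, second-level) scores for three documents.
scores = [(0.9, 0.5), (0.1, 0.9), (0.1, 0.1)]

boolean_decisions = [boolean_rule(t, s) for t, s in scores]
multiplicative_decisions = [multiplicative_rule(t, s) for t, s in scores]
```

With these assumed thresholds the second document shows where the rules can differ: a strong second-level score compensates for a weak top-level score under the product, but not under the sequential rule.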
S. D'Alessio, K. Murray, R. Schiaffino, and A. Kershenbaum. The effect of using hierarchical classifiers in text categorization. In Proceedings of RIAO-00, 6th International Conference "Recherche d'Information Assistee par Ordinateur", pages 302–313, Paris, 2000.
T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In C. Nédellec and C. Rouveirol, editors, Proceedings of the 10th European Conference on Machine Learning (ECML-98), volume 1398, pages 137–142, Berlin, 1998. Springer.
T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, chapter 11, pages 169–184. The MIT Press, 1999.
C.J. van Rijsbergen. Information Retrieval. Butterworths, second edition, 1979.