Conference Paper

Strategies for minimising errors in hierarchical web categorisation

Authors: Wibowo and Williams

... In the seminal work by [21], a document to be classified proceeds top-down along the given taxonomy, each classifier being used to decide to which subtree(s) the document should be sent, until one or more leaves of the taxonomy are reached. This approach, which requires a multiclass classifier at each parent node, gave rise to a variety of actual systems, e.g., [25], [10], [36], and [26]. ...
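
The top-down scheme described in this excerpt is easy to prototype. The sketch below (illustrative Python with hypothetical names, not the implementation from [21]) routes a document down a taxonomy in which every internal node holds its own multiclass classifier over its children.

```python
# Illustrative sketch of top-down ("Pachinko"-style) hierarchical routing.
# Each internal node owns a classifier that picks one of its children;
# routing stops when a leaf is reached. All names and data are toy values.

class Node:
    def __init__(self, label, children=None, classifier=None):
        self.label = label
        self.children = children or {}   # child label -> Node
        self.classifier = classifier     # callable: doc -> child label

    def is_leaf(self):
        return not self.children

def route(doc, node):
    """Send the document down the taxonomy until a leaf is reached."""
    while not node.is_leaf():
        child_label = node.classifier(doc)   # per-node multiclass decision
        node = node.children[child_label]
    return node.label

# Toy taxonomy: root -> {sports, science}, sports -> {football, tennis}
def toy_root_clf(doc):
    return "sports" if "match" in doc else "science"

def toy_sports_clf(doc):
    return "football" if "goal" in doc else "tennis"

taxonomy = Node("root", classifier=toy_root_clf, children={
    "sports": Node("sports", classifier=toy_sports_clf, children={
        "football": Node("football"), "tennis": Node("tennis")}),
    "science": Node("science"),
})

print(route("a match with a late goal", taxonomy))   # -> football
```
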
... This solution is normally adopted in monolithic systems, where only one classifier is entrusted with distinguishing among all categories in a taxonomy [16,19]. Variations on this theme can be found in [36] and in [24]. In local approaches, different sets of features are selected for different nodes in the taxonomy, thus taking advantage of dividing a large initial problem into subproblems, e.g., [36]. This is the default choice for Pachinko machines. ...
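
As a rough illustration of the local approach mentioned in this excerpt, the sketch below (plain Python, toy data and a made-up discrimination score) selects a different small vocabulary for each internal node by scoring terms only on the documents falling under that node's children, so each sub-problem gets its own feature set.

```python
from collections import Counter

# Sketch of per-node (local) feature selection: score terms only on the
# documents of one node's children and keep the top-k for that sub-problem.
# The scoring function and data are illustrative, not from any cited system.

def top_k_terms(docs_by_child, k=4):
    """Keep terms concentrated in a single child's documents."""
    overall = Counter(t for docs in docs_by_child.values()
                      for d in docs for t in d.split())
    scores = Counter()
    for child, docs in docs_by_child.items():
        local = Counter(t for d in docs for t in d.split())
        for term, freq in local.items():
            # simple discrimination score: this child's share of the term
            scores[term] = max(scores[term], freq / overall[term])
    return [t for t, _ in scores.most_common(k)]

# Toy node "sports" with children "football" and "tennis"
docs_by_child = {
    "football": ["goal penalty ball", "goal ball referee"],
    "tennis":   ["serve ace ball", "ace ball court"],
}
print(top_k_terms(docs_by_child))   # shared term "ball" is ranked low
```
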
Article
Full-text available
Progressive filtering is a simple way to perform hierarchical classification, inspired by the behavior most humans adopt when categorizing an item according to an underlying taxonomy. Since each node of the taxonomy is associated with a different category, one may visualize the categorization process as the item moving downwards through all the nodes that accept it as belonging to the corresponding category. This paper aims to model the progressive filtering technique from a probabilistic perspective, in a hierarchical text categorization setting. As a result, the designer of a system based on progressive filtering should find it easier to devise, train, and test it.
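
A minimal sketch of the progressive-filtering behaviour described in this abstract (illustrative only, not the paper's probabilistic model): every node holds a binary acceptor, and an item flows downwards through every node that accepts it, possibly along several branches.

```python
# Sketch of progressive filtering: each node has a binary classifier that
# accepts or rejects an item; accepted items are forwarded to all children.
# Node structure and acceptors are hypothetical.

def progressive_filter(item, node, accepted=None):
    accepted = [] if accepted is None else accepted
    if node["accepts"](item):
        accepted.append(node["label"])
        for child in node.get("children", []):
            progressive_filter(item, child, accepted)
    return accepted

taxonomy = {
    "label": "root", "accepts": lambda x: True,
    "children": [
        {"label": "sports", "accepts": lambda x: "match" in x,
         "children": [
             {"label": "football", "accepts": lambda x: "goal" in x},
             {"label": "tennis", "accepts": lambda x: "serve" in x},
         ]},
        {"label": "science", "accepts": lambda x: "theorem" in x},
    ],
}

print(progressive_filter("a match decided by a late goal", taxonomy))
# -> ['root', 'sports', 'football']
```
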
... It is demonstrated in the paper that splitting the classification problem into a number of sub-problems at each level of the hierarchy is more efficient and accurate than classifying in a non-hierarchical way. Wibowo and Williams [2002] also studied the problem of hierarchical web classification and suggested methods to minimize errors by shifting the assignment to higher-level categories when the lower-level assignment is uncertain. Peng and Choi [2002] proposed an efficient method that classifies a web page into a hierarchy through only one path of the hierarchical tree and is able to expand the hierarchical tree dynamically. ...
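
The error-minimisation idea attributed to Wibowo and Williams, keeping an uncertain assignment at a higher and therefore safer level, can be sketched as below. This is a simplified illustration with a hypothetical confidence threshold, not the authors' exact method.

```python
# Sketch: descend the hierarchy only while the classifier at the current
# level is confident; otherwise stop and return the higher-level category.
# Threshold, scores, and node layout are illustrative assumptions.

def classify_with_fallback(doc, node, threshold=0.7):
    label = node["label"]
    while node.get("children"):
        # per-node classifier returns (best child, confidence in [0, 1])
        child_label, confidence = node["classifier"](doc)
        if confidence < threshold:
            break                      # too uncertain: keep the parent label
        node = node["children"][child_label]
        label = node["label"]
    return label

# Toy two-level hierarchy with a deliberately uncertain leaf-level decision.
hierarchy = {
    "label": "sports",
    "classifier": lambda d: ("football", 0.55),   # below threshold
    "children": {"football": {"label": "football"},
                 "tennis": {"label": "tennis"}},
}
print(classify_with_fallback("ambiguous sports story", hierarchy))  # -> 'sports'
```
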
Article
Classification of Web page content is essential to many tasks in Web information retrieval such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges to Web page classification as compared to traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process. As we review work in Web page classification, we note the importance of these Web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the use of information from neighboring pages.
... A similar idea is used in "Hierarchical Mixture Model" [192] proposed by Toutanova et al., where a generative model incorporates the term probabilities from all parent classes into the current class. Wibowo and Williams [199] suggested that an instance should be assigned to a higher level category when a lower level classifier is uncertain. ...
... The purpose of hierarchical classification of web data is to minimize the error rate by assigning items to an intermediate level of the hierarchy instead of categorizing them at the leaf level. It leverages content from ancestors and descendants to reduce errors when assigning data [31,12]. Even though these approaches are promising for web categorization, they are not directly applicable to indexing, as they need training data. ...
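
The idea of leveraging ancestor content, as in the hierarchical mixture model mentioned above, can be illustrated with a small sketch of parent smoothing (in the spirit of such mixtures, but not Toutanova et al.'s actual model; the mixture weights and counts below are made up): a class's term distribution is interpolated with those of its ancestors, so sparse leaf classes borrow evidence from higher levels.

```python
from collections import Counter

# Sketch: interpolate a leaf class's term distribution with its ancestors'
# distributions (a "shrinkage"-style hierarchical mixture). Weights and
# counts are toy values chosen purely for illustration.

def term_distribution(counts):
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def smoothed_distribution(path_counts, weights):
    """path_counts: term counts from leaf up to root; weights sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    vocab = set(t for counts in path_counts for t in counts)
    dists = [term_distribution(c) for c in path_counts]
    return {t: sum(w * d.get(t, 0.0) for w, d in zip(weights, dists))
            for t in vocab}

leaf_counts   = Counter({"goal": 3, "keeper": 1})             # class: football
parent_counts = Counter({"match": 5, "goal": 2, "score": 3})  # class: sports
root_counts   = Counter({"the": 50, "match": 5, "news": 10})  # background

smoothed = smoothed_distribution(
    [leaf_counts, parent_counts, root_counts], weights=[0.6, 0.3, 0.1])
print(round(smoothed["match"], 3))  # leaf never saw "match", yet it gets mass
```
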
Article
Full-text available
When documents are atomically structured, it is possible to assign them keyword vectors to support indexing. Most web content, however, has non-atomic structure, including navigational/semantic hierarchies on the web. Although such structures are especially effective for browsing, they make it hard for individual nodes to be properly indexed. This is because, in many cases, their contents have to be inferred from the contents of their neighbors, ancestors, and descendants in the structure. In this paper, we propose a novel keyword and keyword weight propagation technique to properly enrich the data nodes in structured content. In particular, our approach first relies on understanding the context provided by the relative content relationships between entries in the structure. We then leverage this information for relative-content preserving keyword propagation. Experiments show a significant improvement (10-15%) in precision with the proposed keyword propagation algorithm.
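
A rough sketch of the general keyword-propagation idea (not the paper's relative-content-preserving algorithm; the decay factor and tree are assumptions): each node's keyword weights are augmented with decayed contributions from its descendants, so sparsely described internal nodes inherit terms from below.

```python
# Sketch: propagate keyword weights upwards from descendants with a decay
# factor, so internal nodes with little text of their own still get indexed.
# The decay value and the tree are illustrative only.

def propagate_keywords(node, decay=0.5):
    """Return enriched keyword weights for `node`, folding in descendants."""
    enriched = dict(node.get("keywords", {}))
    for child in node.get("children", []):
        child_keywords = propagate_keywords(child, decay)
        for term, weight in child_keywords.items():
            enriched[term] = enriched.get(term, 0.0) + decay * weight
    node["enriched"] = enriched
    return enriched

tree = {
    "name": "electronics", "keywords": {"electronics": 1.0},
    "children": [
        {"name": "cameras", "keywords": {"camera": 1.0, "lens": 0.6}},
        {"name": "phones",  "keywords": {"phone": 1.0, "battery": 0.4}},
    ],
}
propagate_keywords(tree)
print(tree["enriched"])
# {'electronics': 1.0, 'camera': 0.5, 'lens': 0.3, 'phone': 0.5, 'battery': 0.2}
```
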
... A similar idea is used in "Hierarchical Mixture Model" [26] proposed by Toutanova et al., where a generative model incorporates the term probabilities from all parent classes into the current class. Wibowo and Williams [27] suggested that an instance should be assigned to a higher level category when a lower level classifier is uncertain. The hierarchical classification approaches mentioned above share a common characteristic: they were posed as meta-classifiers built on top of base classifiers. ...
Article
Full-text available
Hierarchical classification has been shown to perform better than flat classification. It is typically performed on hierarchies created by and for humans rather than for classification performance. As a result, classification based on such hierarchies often yields suboptimal results. In this paper, we propose a novel genetic algorithm-based method of hierarchy adaptation for improved classification. Our approach customizes the typical GA to optimize classification hierarchies. In several text classification tasks, our approach produced hierarchies that significantly improved upon the accuracy of the original hierarchy as well as hierarchies generated by state-of-the-art methods.
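
A very compact sketch of the flavour of such hierarchy adaptation (a single-individual mutate-and-select loop rather than a full population-based GA, with a stand-in fitness function; not the method proposed in the paper): represent the hierarchy as a parent map, mutate it by moving a leaf category under a different internal node, and keep whichever variant scores better.

```python
import random

# Sketch of GA-style hierarchy adaptation: mutate the hierarchy by moving a
# leaf under a different internal node and keep the fitter variant. The
# fitness function is a stand-in for held-out classification accuracy.

def mutate(parent_of, internal_nodes):
    """Move one randomly chosen leaf under a different internal node."""
    child = random.choice([n for n in parent_of if n not in internal_nodes])
    new_parent = random.choice(
        [n for n in internal_nodes if n != parent_of[child]])
    mutated = dict(parent_of)
    mutated[child] = new_parent
    return mutated

def adapt_hierarchy(parent_of, internal_nodes, fitness, generations=50):
    best, best_score = parent_of, fitness(parent_of)
    for _ in range(generations):
        candidate = mutate(best, internal_nodes)
        score = fitness(candidate)
        if score > best_score:               # greedy survival of the fitter
            best, best_score = candidate, score
    return best

# Toy setup: 'tablets' arguably belongs under 'computing', not 'phones'.
hierarchy = {"phones": "root", "computing": "root",
             "smartphones": "phones", "tablets": "phones",
             "laptops": "computing"}
internal = {"root", "phones", "computing"}

def toy_fitness(h):                          # stand-in for validation accuracy
    return 1.0 if h["tablets"] == "computing" else 0.5

print(adapt_hierarchy(hierarchy, internal, toy_fitness))
```
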
... Moreover, because there are fewer categories to choose from, the accuracy of the child assignment might be higher, provided the parent labeling is correct. In the combination-schemes approach, we need to calculate the similarity of a document to the classes at each level of the tree structure, and then combine the results from different levels to make a final assignment decision (Wibowo and Williams, 2002b). Dumais and Chen's (2000) multiplicative scoring function and sequential Boolean function are examples of combining scores from classifiers at different levels. ...
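
For instance, a multiplicative combination of per-level scores (in the spirit of Dumais and Chen's scoring function, though the scores here are made-up posterior-like values) can be sketched as follows.

```python
# Sketch: combine classifier scores from different levels multiplicatively
# and assign the document to the leaf whose full path scores highest.
# Candidate paths and scores are hypothetical.

def path_score(level_scores):
    """Multiply the scores along one root-to-leaf path."""
    score = 1.0
    for s in level_scores:
        score *= s
    return score

candidate_paths = {
    ("sports", "football"): [0.9, 0.8],   # P(sports), P(football | sports)
    ("sports", "tennis"):   [0.9, 0.2],
    ("science", "physics"): [0.1, 0.7],
}

best_path = max(candidate_paths, key=lambda p: path_score(candidate_paths[p]))
print(best_path, round(path_score(candidate_paths[best_path]), 2))
# -> ('sports', 'football') 0.72
```
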
Conference Paper
Full-text available
Many real world classification problems involve classes organized in a hierarchical tree-like structure. However, in many cases the hierarchical structure is ignored and each class is treated in isolation, or in other words the class structure is flattened (Dumais and Chen, 2000). In this paper, we propose a new approach that incorporates hierarchical structure knowledge by cascading it as an additional feature for the child-level classifier. We posit that our cascading model will outperform the baseline “flat” model. Our empirical experiments provide strong evidence supporting our proposal. Interestingly, even imperfect hierarchical structure knowledge also improves classification performance.
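
A minimal sketch of that cascading idea (using scikit-learn for brevity; the random data and feature layout are illustrative, not the paper's experimental setup): the parent-level prediction is appended to the document's feature vector before the child-level classifier is trained.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch: cascade the parent-level prediction as an extra feature for the
# child-level classifier. Data is random and purely illustrative; in practice
# the parent predictions would come from held-out data to avoid leakage.

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                        # toy document features
parent_y = (X[:, 0] > 0).astype(int)                 # toy parent-level labels
child_y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # toy child-level labels

parent_clf = LogisticRegression().fit(X, parent_y)

# Append the parent prediction (even if imperfect) as an additional feature.
parent_pred = parent_clf.predict(X).reshape(-1, 1)
X_cascaded = np.hstack([X, parent_pred])

child_clf = LogisticRegression().fit(X_cascaded, child_y)
print("child-level training accuracy:", child_clf.score(X_cascaded, child_y))
```
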
... A topic taxonomy can be used as a base for a divide-and-conquer strategy. A classifier is built independently at each internal node of the hierarchy, using all the documents of the subcategories of this category, and a document is labeled using these classifiers to greedily select subbranches, until we reach a leaf node or certain constraints are satisfied (e.g., the score should be larger than a threshold [Dumais and Chen 2000] or the predictions of adjacent levels should be consistent [Wibowo and Williams 2002]). Feature selection is often performed at each node before constructing a classifier [Chakrabarti et al. 1998; Liu and Motoda 2007]. ...
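
The adjacent-level consistency constraint mentioned in this excerpt can be sketched as follows (illustrative only; the per-level predictions and parent map are assumptions): a label at one level is accepted only if its parent was accepted at the level above, otherwise the document stays at the deepest consistent level.

```python
# Sketch: accept a lower-level label only if its parent was predicted at the
# level above; otherwise stop at the deepest consistent level.

def deepest_consistent(level_predictions, parent_of):
    """level_predictions: predicted label per level, from root level down."""
    accepted = []
    for label in level_predictions:
        if accepted and parent_of.get(label) != accepted[-1]:
            break                     # inconsistent with the level above
        accepted.append(label)
    return accepted[-1] if accepted else None

parent_of = {"football": "sports", "tennis": "sports", "physics": "science"}

# Level 1 says "sports", but level 2's best guess is "physics": inconsistent,
# so the document stays at the higher-level category "sports".
print(deepest_consistent(["sports", "physics"], parent_of))   # -> 'sports'
```
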
Article
Full-text available
A topic taxonomy is an effective representation that describes salient features of virtual groups or online communities. A topic taxonomy consists of topic nodes. Each internal node is defined by its vertical path (i.e., ancestor and child nodes) and its horizontal list of attributes (or terms). In a text-dominant environment, a topic taxonomy can be used to flexibly describe a group's interests with varying granularity. However, the stagnant nature of a taxonomy may fail to capture the dynamic change of a group's interests in a timely manner. This article addresses the problem of how to adapt a topic taxonomy to the accumulated data that reflects the change of a group's interest, in order to achieve dynamic group profiling. We first discuss the issues related to topic taxonomy. We next formulate taxonomy adaptation as an optimization problem to find the taxonomy that best fits the data. We then present a viable algorithm that can efficiently accomplish taxonomy adaptation. We conduct extensive experiments to evaluate our approach's efficacy for group profiling, compare the approach with some alternatives, and study its performance for dynamic group profiling. While pointing out various applications of taxonomy adaptation, we suggest some future work that can take advantage of burgeoning Web 2.0 services for online targeted marketing, counterterrorism (connecting the dots), and community tracking.
... Dumais and Chen [8] demonstrated that making use of the hierarchical structure of web directories can improve both efficiency and accuracy. Wibowo and Williams [24] also studied the problem of hierarchical web classification and suggested methods to minimize errors. ...
Conference Paper
Full-text available
Web page classification is important to many tasks in information retrieval and web mining. However, applying traditional textual classifiers on web data often produces unsatisfying results. Fortunately, hyperlink information provides important clues to the categorization of a web page. In this paper, an improved method is proposed to enhance web page classification by utilizing the class information from neighboring pages in the link graph. The categories represented by four kinds of neighbors (parents, children, siblings and spouses) are combined to help with the page in question. In experiments to study the effect of these factors on our algorithm, we find that the method proposed is able to boost the classification accuracy of common textual classifiers from around 70% to more than 90% on a large dataset of pages from the Open Directory Project, and outperforms existing algorithms. Unlike prior techniques, our approach utilizes same-host links and can improve classification accuracy even when neighboring pages are unlabeled. Finally, while all neighbor types can contribute, sibling pages are found to be the most important.
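
The core idea of combining the categories of different neighbor types can be sketched as a weighted vote (the weights and neighbor labels below are hypothetical, not the values evaluated in the paper; the slightly larger sibling weight merely echoes the abstract's finding that sibling pages matter most).

```python
from collections import defaultdict

# Sketch: weighted vote over the categories of a page's neighbors in the link
# graph (parents, children, siblings, spouses). Weights are illustrative.

NEIGHBOR_WEIGHTS = {"parent": 1.0, "child": 1.0, "sibling": 1.5, "spouse": 0.8}

def neighbor_vote(neighbors):
    """neighbors: list of (neighbor_type, category) pairs."""
    scores = defaultdict(float)
    for ntype, category in neighbors:
        scores[category] += NEIGHBOR_WEIGHTS.get(ntype, 0.0)
    return max(scores, key=scores.get) if scores else None

neighbors = [("parent", "sports"), ("sibling", "sports"),
             ("sibling", "news"), ("spouse", "sports"), ("child", "news")]
print(neighbor_vote(neighbors))   # -> 'sports'
```
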