Conference Paper

A Comparison Study: Web Pages Categorization with Bayesian Classifiers.

DOI: 10.1109/HPCC.2008.80
Conference: 10th IEEE International Conference on High Performance Computing and Communications, HPCC 2008, 25-27 Sept. 2008, Dalian, China
Source: DBLP

ABSTRACT In recent years, web mining has become a hotspot of data mining with the development of the Internet. Web page classification is one of the essential techniques for web mining, since identifying the web pages that belong to a class of interest is often the first step in mining the web. The high-dimensional text vocabulary space is one of the main challenges of web page classification. In this paper, we study the capabilities of Bayesian classifiers for web page categorization. Several feature selection techniques, such as Chi-Squared, Information Gain, and Gain Ratio, are used to select relevant words in web pages. Results on a benchmark dataset show that Aggregating One-Dependence Estimators (AODE) and Hidden Naive Bayes (HNB) are both more competitive than other traditional methods.
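As a rough illustration of the pipeline the abstract describes (score words, keep the most relevant ones, then classify with a Bayesian model), the following minimal sketch ranks terms by the chi-squared statistic and trains a multinomial naive Bayes classifier with Laplace smoothing. It is not the authors' implementation: the toy corpus, the function names, and the choice of plain naive Bayes (rather than AODE or HNB) are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def chi_squared(doc_sets, labels, term, cls):
    """Chi-squared statistic of the 2x2 table (term present?, label == cls?)."""
    n = len(doc_sets)
    a = b = c = d = 0
    for tokens, label in zip(doc_sets, labels):
        present, in_cls = term in tokens, label == cls
        if present and in_cls:
            a += 1  # term present, document in cls
        elif present:
            b += 1  # term present, document in another class
        elif in_cls:
            c += 1  # term absent, document in cls
        else:
            d += 1  # term absent, document in another class
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_features(docs, labels, k):
    """Keep the k terms with the highest chi-squared score over all classes."""
    doc_sets = [set(doc) for doc in docs]
    vocab = set().union(*doc_sets)
    classes = set(labels)
    score = {t: max(chi_squared(doc_sets, labels, t, c) for c in classes)
             for t in vocab}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(vocab, key=lambda t: (-score[t], t))[:k]


def train_naive_bayes(docs, labels, features):
    """Multinomial naive Bayes with Laplace smoothing, restricted to `features`."""
    features = set(features)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(t for t in doc if t in features)
    model = {}
    for cls, n_docs in Counter(labels).items():
        total = sum(term_counts[cls].values())
        log_prior = math.log(n_docs / len(docs))
        log_likelihood = {
            t: math.log((term_counts[cls][t] + 1) / (total + len(features)))
            for t in features
        }
        model[cls] = (log_prior, log_likelihood)
    return model


def predict(model, doc):
    """Return the class with the highest log posterior; unknown terms are ignored."""
    def log_posterior(cls):
        log_prior, log_likelihood = model[cls]
        return log_prior + sum(log_likelihood[t] for t in doc if t in log_likelihood)
    return max(sorted(model), key=log_posterior)


# Hypothetical toy corpus: each "page" is already tokenized.
docs = [
    ["football", "match", "score", "the"],
    ["team", "score", "goal", "the"],
    ["stock", "market", "price", "the"],
    ["market", "bank", "price", "the"],
]
labels = ["sports", "sports", "finance", "finance"]

selected = select_features(docs, labels, k=3)
model = train_naive_bayes(docs, labels, selected)
print(selected)
print(predict(model, ["price", "market", "the"]))
```

In this toy corpus the word "the" occurs in every document, so its chi-squared score is zero and it is dropped, which is the kind of vocabulary reduction that feature selection is meant to provide.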

  • Source
    ABSTRACT: This paper describes keyword-based Web page categorization. Our goal is to embed our categorization technique into information retrieval (IR) systems to facilitate the end-users' search task. In such systems, search results must be categorized quickly while keeping accuracy high. Our categorization system uses a knowledge base (KB) to assign categories to Web pages. The KB contains a set of characteristic keywords with weights by category, and is automatically generated from training texts. With the keyword-based approach, the algorithms that extract keywords and assign weights to them must be chosen carefully, because they strongly affect both categorization accuracy and processing speed. Furthermore, we must take two characteristics of Web pages into account: (1) the text length is highly variable, which makes it harder to use statistics such as word frequency to calculate keyword weights, and (2) a huge number of distinct words are used, which makes the KB bigger and therefore pro...
  • Source
    ABSTRACT: Naive Bayes is very popular in commercial and open-source anti-spam e-mail filters. There are, however, several forms of Naive Bayes, something the anti-spam literature does not always acknowledge. We discuss five different versions of Naive Bayes, and compare them on six new, non-encoded datasets that contain ham messages of particular Enron users and fresh spam messages. The new datasets, which we make publicly available, are more realistic than previous comparable benchmarks, because they maintain the temporal order of the messages in the two categories, and they emulate the varying proportion of spam and ham messages that users receive over time. We adopt an experimental procedure that emulates the incremental training of personalized spam filters, and we plot ROC curves that allow us to compare the different versions of NB over the entire tradeoff between true positives and true negatives.
    CEAS 2006 - The Third Conference on Email and Anti-Spam, July 27-28, 2006, Mountain View, California, USA; 01/2006
  • Source
    Conference Paper: Hidden Naive Bayes
    ABSTRACT: The conditional independence assumption of naive Bayes essentially ignores attribute dependencies and is often violated. On the other hand, although a Bayesian network can represent arbitrary attribute dependencies, learning an optimal Bayesian network from data is intractable. The main reason is that learning the optimal structure of a Bayesian network is extremely time-consuming. Thus, a Bayesian model without structure learning is desirable. In this paper, we propose a novel model, called hidden naive Bayes (HNB). In an HNB, a hidden parent is created for each attribute which combines the influences from all other attributes. We present an approach to creating hidden parents using the average of weighted one-dependence estimators. HNB inherits the structural simplicity of naive Bayes and can be easily learned without structure learning. We propose an algorithm for learning HNB based on conditional mutual information. We experimentally test HNB in terms of classification accuracy, using the 36 UCI data sets recommended by Weka (Witten & Frank 2000), and compare it to naive Bayes (Langley, Iba, & Thomas 1992), C4.5 (Quinlan 1993), SBC (Langley & Sage 1994), NBTree (Kohavi 1996), CL-TAN (Friedman, Geiger, & Goldszmidt 1997), and AODE (Webb, Boughton, & Wang 2005). The experimental results show that HNB outperforms naive Bayes, C4.5, SBC, NBTree, and CL-TAN, and is competitive with AODE.
    Proceedings, The Twentieth National Conference on Artificial Intelligence and the Seventeenth Innovative Applications of Artificial Intelligence Conference, July 9-13, 2005, Pittsburgh, Pennsylvania, USA; 01/2005
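
The Hidden Naive Bayes abstract above mentions weighting one-dependence estimators using conditional mutual information. The sketch below shows one way such weights could be computed from discrete data: it estimates I(A_i; A_j | C) from counts and normalizes the values over all other attributes j. The normalization step, the function names, and the toy data are assumptions for illustration; the abstract itself does not spell out the exact weighting formula.

```python
import math
from collections import Counter


def conditional_mutual_information(xs, ys, cs):
    """Estimate I(X; Y | C) in nats from three aligned lists of discrete values."""
    n = len(cs)
    n_xyc = Counter(zip(xs, ys, cs))
    n_xc = Counter(zip(xs, cs))
    n_yc = Counter(zip(ys, cs))
    n_c = Counter(cs)
    total = 0.0
    for (x, y, c), count in n_xyc.items():
        # P(x,y,c) * log( P(x,y|c) / (P(x|c) * P(y|c)) ), written with raw counts.
        ratio = count * n_c[c] / (n_xc[(x, c)] * n_yc[(y, c)])
        total += (count / n) * math.log(ratio)
    return total


def hidden_parent_weights(columns, cs, i):
    """Weights of every attribute j != i in the hidden parent of attribute i.

    Assumption: weights are the conditional mutual information values
    normalized to sum to 1; a uniform fallback is used if all are zero.
    """
    others = [j for j in range(len(columns)) if j != i]
    scores = {j: conditional_mutual_information(columns[i], columns[j], cs)
              for j in others}
    total = sum(scores.values())
    if total == 0:
        return {j: 1.0 / len(others) for j in others}
    return {j: s / total for j, s in scores.items()}


# Hypothetical toy data: a1 duplicates a0, a2 is independent of a0 within each class.
cs = [0, 0, 0, 0, 1, 1, 1, 1]
a0 = [0, 0, 1, 1, 0, 0, 1, 1]
a1 = [0, 0, 1, 1, 0, 0, 1, 1]
a2 = [0, 1, 0, 1, 0, 1, 0, 1]

weights = hidden_parent_weights([a0, a1, a2], cs, i=0)
print(weights)
```

Here the duplicated attribute receives all of the weight and the class-conditionally independent attribute receives none, which matches the intuition that a hidden parent should emphasize the attributes an attribute actually depends on.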