A Comparison Study: Web Pages Categorization with Bayesian Classifiers
Zengmei Fu1, Chuanliang Chen1, Yunchao Gong2, Rongfang Bie1
1Department of Computer Science, Beijing Normal University, Beijing 100875, China
2Software Institute, Nanjing University, Nanjing, China
firstname.lastname@example.org, C.L.Chen86@gmail.com, Corresponding Author: email@example.com
Abstract

In recent years, web mining has become a hotspot of data mining with the development of the Internet. Web page classification is one of the essential techniques for web mining, since classifying web pages of an interesting class is often the first step of mining the web. The high-dimensional text vocabulary space is one of the main challenges of web page classification. In this paper, we study the capabilities of Bayesian classifiers for web page categorization. Several feature selection techniques, such as Chi Squared, Information Gain and Gain Ratio, are used for selecting relevant words in web pages. Results on a benchmark dataset show that the performances of Aggregating One-Dependence Estimators (AODE) and Hidden Naive Bayes (HNB) are both more competitive than those of other traditional methods.
1. Introduction

Since the Internet has become a huge repository of information, many studies address the issue of web page classification. Web pages are based on loosely structured text, and therefore various statistical text learning algorithms have been applied to web page categorization [6, 11]. The classification methods we consider include some novel ones, such as Naive Bayes, Bayes Network, Hidden Naive Bayes, Aggregating One-Dependence Estimators and Complement class Naive Bayes, as well as traditional ones such as Support Vector Machine. Our motivation originates from the great success of Naive Bayes for web page classification. In this paper, we investigate the capabilities of Bayesian algorithms for web page categorization.
Feature selection means that we want to find a subset of words which helps to discriminate between different kinds of web pages. In this paper, we apply several feature selection methods, such as Chi Squared, Information Gain and Gain Ratio, to extract relevant words from web pages in order to reduce the complexity of the classifiers while preserving their performance.
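As an illustration of how such filter methods rank words, the following sketch computes Information Gain for a single word over a toy collection (the documents, labels and function names are hypothetical, not taken from the paper):

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a class-label list, in bits
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, word):
    # docs: list of token sets; labels: parallel class labels.
    # Gain = H(class) - H(class | word present/absent)
    base = entropy(labels)
    with_w = [l for d, l in zip(docs, labels) if word in d]
    without_w = [l for d, l in zip(docs, labels) if word not in d]
    cond = sum(len(part) / len(labels) * entropy(part)
               for part in (with_w, without_w) if part)
    return base - cond

docs = [{"stock", "market"}, {"stock", "price"}, {"goal", "match"}, {"match", "team"}]
labels = ["finance", "finance", "sports", "sports"]
# "stock" perfectly separates the two classes, so its gain equals the base entropy
print(information_gain(docs, labels, "stock"))  # 1.0
```

In practice one computes such a score for every word in the vocabulary and keeps only the top-ranked subset as features.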
The remainder of this paper is organized as follows. In section 2, we briefly review the five Bayesian classification methods. Section 3 describes several feature selection methods. In section 4, we present the performance measures and analyze the experimental results. Finally, we conclude our work in section 5.
2. Comparison of Different Classifiers
2.1. Naive Bayes
The Naive Bayesian classifier is also simply named Naive Bayes [1, 2, 3, 4, 6]. It is widely deployed for classification due to its simplicity, efficiency and efficacy.
Figure 1. An example of Naive Bayes
The structure of Naive Bayes is depicted in Fig. 1. In Naive Bayes, each attribute node has the class node as its parent, but it does not have any parent among the other attribute nodes. For a given sample x, the Naive Bayes classifier searches for the class ci which maximizes the posterior probability P(ci|x; θ') by applying the Bayes rule. Then x can be classified by computing

c(x) = arg max_ci P(ci) ∏j P(aj | ci)

where aj denotes the value of the j-th attribute of x.
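As a concrete illustration of this maximum a posteriori decision rule, here is a minimal multinomial Naive Bayes sketch with Laplace smoothing (the toy corpus, class names and API are hypothetical, not the paper's implementation):

```python
import math
from collections import Counter

class NaiveBayesText:
    # Minimal multinomial Naive Bayes sketch with Laplace (add-one) smoothing
    def fit(self, docs, labels):
        self.classes = set(labels)
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.word_counts = {c: Counter() for c in self.classes}
        for d, l in zip(docs, labels):
            self.word_counts[l].update(d)
        self.vocab = {w for d in docs for w in d}
        return self

    def predict(self, doc):
        # arg max over classes of log P(c) + sum_j log P(w_j | c)
        def log_post(c):
            total = sum(self.word_counts[c].values())
            return math.log(self.prior[c]) + sum(
                math.log((self.word_counts[c][w] + 1) / (total + len(self.vocab)))
                for w in doc if w in self.vocab)
        return max(self.classes, key=log_post)

docs = [["stock", "market", "price"], ["stock", "trade"],
        ["goal", "match"], ["team", "goal"]]
labels = ["finance", "finance", "sports", "sports"]
nb = NaiveBayesText().fit(docs, labels)
print(nb.predict(["stock", "price"]))  # finance
```

Working in log space avoids floating-point underflow when the product runs over many vocabulary terms.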
Bayesian classifiers perform better than the others in web page categorization.
Moreover, compared with Boolean, the highest average accuracy on TF is 88.02%, achieved by HNB, and AODE is second with 86.01%. In contrast, NB and SVM do not perform as well as they do on Boolean: the average accuracy of NB is 75.88% and that of SVM is 70.10%, the third and fourth highest. As for the average F-measure, the highest value is 0.884, achieved by HNB; the second is AODE with 0.867; NB is third and SVM fourth, with values of 0.765 and 0.719 respectively. Across the two representations, both the highest average accuracy and the highest average F-measure are achieved by HNB on TF. All the evidence above shows that TF contains more information than Boolean, and classifiers perform better in both average accuracy and average F-measure with TF than with Boolean.
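The two document representations compared above can be sketched as follows (the toy vocabulary and function names are hypothetical): Boolean records only whether a word occurs, while TF keeps the raw term frequency, which is why TF retains more information.

```python
from collections import Counter

def vectorize(tokens, vocab, scheme="tf"):
    # Boolean: 1 if the word occurs at all; TF: raw term frequency
    counts = Counter(tokens)
    if scheme == "boolean":
        return [1 if w in counts else 0 for w in vocab]
    return [counts[w] for w in vocab]

vocab = ["stock", "market", "goal"]
tokens = ["stock", "stock", "market"]
print(vectorize(tokens, vocab, "tf"))       # [2, 1, 0]
print(vectorize(tokens, vocab, "boolean"))  # [1, 1, 0]
```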
5. Conclusion

In this paper, we report our work on web page categorization and compare Bayesian classification methods: Naive Bayes, Bayes Network, AODE, HNB and CNB. Other traditional methods are also evaluated for comparison. In our experiments, several feature selection methods, such as Chi Squared, Information Gain and Gain Ratio, are used for selecting relevant words in web pages. Two popular evaluation metrics, accuracy and F-measure, are used for evaluating the performance of the classifiers. Our empirical study shows that Bayesian classifiers perform satisfactorily, especially AODE and HNB, which are both more competitive than the other methods. SVM also performs well for certain numbers of attributes, although it is limited by its high computational complexity.
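For reference, the two evaluation metrics used above can be computed from confusion-matrix counts as in this small sketch (the counts are hypothetical):

```python
def accuracy(tp, tn, fp, fn):
    # Fraction of correctly classified samples
    return (tp + tn) / (tp + tn + fp + fn)

def f_measure(tp, fp, fn):
    # Harmonic mean of precision and recall
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(accuracy(40, 45, 5, 10))           # 0.85
print(round(f_measure(40, 5, 10), 3))    # 0.842
```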
Acknowledgments

The research work described in this paper was supported by the National Science Foundation of China under Grant No. 10601064.
References

[1] M. Sahami, S. Dumais, D. Heckerman, E. Horvitz. "A Bayesian Approach to Filtering Junk E-mail". In Learning for Text Categorization, Madison, Wisconsin, 1998.
[2] V. Metsis, I. Androutsopoulos, G. Paliouras. "Spam Filtering with Naive Bayes – Which Naive Bayes?". CEAS 2006 Third Conference on Email and Anti-Spam, Mountain View, California, USA, Jul. 2006.
[3] H. Zhang, L. Jiang, J. Su. "Hidden Naive Bayes". Proceedings of the Twentieth National Conference on Artificial Intelligence, pp. 919-924, AAAI Press, 2005.
[4] G. I. Webb, J. R. Boughton, Z. Wang. "Not So Naive Bayes: Aggregating One-Dependence Estimators". Machine Learning, 58, 5-24, 2005.
[5] J. R. Quinlan. "Induction of Decision Trees". Machine Learning, 1:81-106, 1986.
[6] H. Mase. "Experiments on Automatic Web Page Categorization for IR System". Technical Report, Stanford University, Stanford, Calif., 1998.
[7] J. D. M. Rennie, L. Shih, J. Teevan, D. R. Karger. "Tackling the Poor Assumptions of Naive Bayes Text Classifiers". Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, 2003.
[8] J. R. Quinlan. "C4.5: Programs for Machine Learning". Morgan Kaufmann Publishers, San Mateo, CA, 1993.
[9] Industry Sector Dataset.
[10] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, K. R. K. Murthy. "Improvements to Platt's SMO Algorithm for SVM Classifier Design". Neural Computation, 13(3), pp. 637-649, 2001.
[11] Z. Zhao, H. Liu. "Searching for Interacting Features". In Proc. of the International Joint Conference on Artificial Intelligence, Hyderabad, India, Jan. 2007.
[12] H. Yu, J. Han, K. C.-C. Chang. "PEBL: Web Page Classification without Negative Examples". IEEE Trans. Knowl. Data Eng., 16(1), 2004.
[13] K. Kira, L. A. Rendell. "The Feature Selection Problem: Traditional Methods and a New Algorithm". Proceedings of AAAI'92, 1992.
[14] K. Kira, L. A. Rendell. "A Practical Approach to Feature Selection". In D. Sleeman and P. Edwards (eds.): Machine Learning: Proceedings of International Conference, pp. 249-256, Morgan Kaufmann, 1992.
[15] M. A. Hall, L. A. Smith. "Practical Feature Subset Selection for Machine Learning". In Proceedings of the 21st Australian Computer Science Conference, 1998.
[16] R. C. Holte. "Very Simple Classification Rules Perform Well on Most Commonly Used Datasets". Machine Learning, vol. 11, pp. 63-91, 1993.
[17] G. Buddhinath, D. Derry. "A Simple Enhancement to One Rule Classification". Technical Report, 2006.
[18] L. Yu, H. Liu. "Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution". In Proceedings of ICML 2003, 2003.