
A Comparison Study: Web Pages Categorization with Bayesian Classifiers

Zengmei Fu1, Chuanliang Chen1, Yunchao Gong2, Rongfang Bie1

1Department of Computer Science, Beijing Normal University, Beijing 100875, China

2Software Institute, Nanjing University, Nanjing, China

fzm806@163.com, C.L.Chen86@gmail.com, Corresponding Author: rfbie@bnu.edu.cn

Abstract

In recent years, web mining has become a hotspot of data mining with the development of the Internet. Web page classification is one of the essential techniques for web mining, since classifying web pages of an interesting class is often the first step of mining the web. The high-dimensional text vocabulary space is one of the main challenges of web page classification. In this paper, we study the capabilities of Bayesian classifiers for web page categorization. Several feature selection techniques, such as Chi Squared, Information Gain and Gain Ratio, are used for selecting relevant words in web pages. Results on a benchmark dataset show that Aggregating One-Dependence Estimators (AODE) and Hidden Naive Bayes (HNB) are both more competitive than other traditional methods.

1. Introduction

Since the Internet has become a huge repository of information, many studies address the issue of web page classification. Web pages are based on loosely structured text, and therefore various statistical text learning algorithms have been applied to web page categorization [6, 11]. The classification methods include some novel ones (Naive Bayes, Bayes Network, Hidden Naive Bayes, Aggregating One-Dependence Estimators and Complement class Naive Bayes) and some traditional ones such as Support Vector Machine. Our work is motivated by the great success of Naive Bayes for web page classification. In this paper, we investigate the capabilities of Bayesian algorithms for web page categorization.

Feature selection means finding a subset of words that helps to discriminate between different kinds of web pages. In this paper, we apply several feature selection methods, such as Chi Squared, Information Gain and Gain Ratio, to extract the relevant words of web pages in order to reduce the complexity of the classifiers while preserving their performance.

The remainder of this paper is organized as follows. In Section 2, we briefly review the five Bayesian classification methods. Section 3 describes several feature selection methods. In Section 4, we present the performance measures and the experimental results together with their analysis. Finally, we conclude our work in Section 5.

2. Comparison of Different Classifiers

2.1. Naive Bayes

The Naive Bayesian classifier is also simply named

Naive Bayes [1, 2, 3, 4, 6]. It is widely deployed for

classification due to its simplicity, efficiency and

efficacy.

Figure 1. An example of Naive Bayes

The structure of Naive Bayes is depicted in Fig. 1.

In Naive Bayes, each attribute node has the class node

as its parent but it does not have any parent from

attribute nodes.

For a given sample x, the Naive Bayes classifier searches for the class ci which maximizes the posterior probability P(ci|x;θ’) by applying Bayes' rule. Then x can be classified by computing the following equation:

$$c_l = \arg\max_{c_i \in C} P(c_i \mid \theta')\, P(x \mid c_i; \theta').$$
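The decision rule above can be illustrated with a minimal Python sketch for Boolean word features. It is not the implementation used in the experiments: the function names `train_nb` and `classify_nb`, the Laplace smoothing and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, vocab):
    """Estimate P(c) and P(word present | c); Laplace smoothing is an assumption."""
    n = len(docs)
    class_counts = Counter(labels)
    prior = {c: class_counts[c] / n for c in class_counts}
    present = defaultdict(Counter)  # present[c][w] = class-c documents containing w
    for doc, c in zip(docs, labels):
        for w in set(doc) & set(vocab):
            present[c][w] += 1
    cond = {c: {w: (present[c][w] + 1) / (class_counts[c] + 2) for w in vocab}
            for c in class_counts}
    return prior, cond

def classify_nb(doc, prior, cond):
    """Return the class maximizing log P(c) + sum over words of log P(x_w | c)."""
    words = set(doc)

    def log_score(c):
        s = math.log(prior[c])
        for w, p in cond[c].items():
            s += math.log(p if w in words else 1.0 - p)
        return s

    return max(prior, key=log_score)

# Hypothetical toy corpus with two of the sector names as labels.
docs = [["oil", "gas"], ["oil", "drill"], ["bank", "loan"], ["bank", "stock"]]
labels = ["energy", "energy", "financial", "financial"]
vocab = ["oil", "gas", "drill", "bank", "loan", "stock"]
prior, cond = train_nb(docs, labels, vocab)
print(classify_nb(["oil", "gas"], prior, cond))
```

Working in log space avoids numerical underflow when the vocabulary is large, which matters for a dictionary of several thousand words.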

2.2. Bayes Network

By specifying a set of conditional independence

statements together with a set of conditional

probability functions, Bayes Network [3] estimates the

probability density function governing a set of random

variables.

Assume that A1, A2, …, An are n attributes. E is an example represented by a vector (a1, a2, …, an), where ai is the value of Ai. The class variable is represented by C. We use c to represent the value that C takes and c(E) to denote the class of E. The Bayes Network classifier is defined by the following equation:

$$c(E) = \arg\max_{c \in C} P(c)\, P(a_1, a_2, \ldots, a_n \mid c).$$

2.3. Hidden Naive Bayes

The following picture shows the structure of Hidden

Naive Bayes (HNB).

Figure 2. The structure of HNB

In an HNB [3], hidden parents of attributes

represent attribute dependencies. The class node is

represented by C which is also the parent of all

attribute nodes. Each attribute Ai has a hidden parent

Ahpi, where i = 1, 2,…, n, represented by a dashed

circle. In order to distinguish from regular arcs, the arc

from the hidden parent Ahpi to Ai is also represented by

a dashed directed line.

The following equation defines the joint distribution represented by an HNB:

$$P(A_1, \ldots, A_n, C) = P(C) \prod_{i=1}^{n} P(A_i \mid A_{hp_i}, C),$$

where

$$P(A_i \mid A_{hp_i}, C) = \sum_{j=1, j \neq i}^{n} W_{ij} \ast P(A_i \mid A_j, C).$$

Essentially, the hidden parent Ahpi for Ai is a mixture of the weighted influences from all other attributes. The HNB classifier can be defined by the following equation, where M is an example:

$$c(M) = \arg\max_{c \in C} P(c) \prod_{i=1}^{n} P(a_i \mid a_{hp_i}, c).$$

Algorithm HNB (X)
Input: a set X of training web pages.
For each value c of C
    Compute P(c) from X
For each pair of words/attributes Ai and Aj
    For each assignment ai, aj, and c to Ai, Aj, and C
        Compute P(ai, aj | c) from X
For each pair of attributes Ai and Aj
    Compute $I_P(A_i; A_j \mid C)$
For each attribute Ai
    Compute $W_i = \sum_{j=1, j \neq i}^{n} I_P(A_i; A_j \mid C)$
    For each attribute Aj and j ≠ i
        Compute $W_{ij} = I_P(A_i; A_j \mid C) / W_i$
Output: HNB models for X

Figure 3. The process of HNB algorithm
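The weight computation in the algorithm of Fig. 3 can be sketched in Python as follows. This is a sketch under stated assumptions: `cond_mutual_info` and `hnb_weights` are hypothetical names, the conditional mutual information is taken from raw empirical counts without smoothing, and the uniform fallback for an attribute that is independent of all others is our own choice, not part of the algorithm as stated.

```python
import math
from collections import Counter

def cond_mutual_info(col_i, col_j, labels):
    """Empirical I_P(A_i; A_j | C) in nats, from raw counts (no smoothing)."""
    n = len(labels)
    n_c = Counter(labels)
    n_ic = Counter(zip(col_i, labels))
    n_jc = Counter(zip(col_j, labels))
    n_ijc = Counter(zip(col_i, col_j, labels))
    total = 0.0
    for (a_i, a_j, c), count in n_ijc.items():
        # P(ai, aj | c) / (P(ai | c) P(aj | c)) expressed with counts
        ratio = count * n_c[c] / (n_ic[(a_i, c)] * n_jc[(a_j, c)])
        total += (count / n) * math.log(ratio)
    return total

def hnb_weights(columns, labels):
    """W[i][j] = I_P(A_i; A_j | C) / W_i with W_i = sum over j != i; W[i][i] = 0."""
    n_attr = len(columns)
    info = [[0.0 if i == j else cond_mutual_info(columns[i], columns[j], labels)
             for j in range(n_attr)] for i in range(n_attr)]
    weights = []
    for i in range(n_attr):
        w_i = sum(info[i])
        # Assumed fallback: uniform weights when A_i carries no information
        # about any other attribute given the class.
        weights.append([0.0 if j == i else
                        (info[i][j] / w_i if w_i > 0 else 1.0 / (n_attr - 1))
                        for j in range(n_attr)])
    return weights

# Hypothetical Boolean attributes: A0 and A1 are identical, A2 is independent.
labels = ["a"] * 4 + ["b"] * 4
columns = [[0, 0, 1, 1, 0, 0, 1, 1],
           [0, 0, 1, 1, 0, 0, 1, 1],
           [0, 1, 0, 1, 0, 1, 0, 1]]
weights = hnb_weights(columns, labels)
print(weights[0])
```

In this toy example all of the weight for A0 goes to A1, its exact duplicate, which shows how the hidden parent concentrates on the most informative attributes.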

2.4. AODE

Aggregating One-Dependence Estimators (AODE) averages the models from a restricted class of one-dependence classifiers: those in which all other attributes depend on a common attribute and the class [4].

By restricting itself to this class of one-dependence classifiers, in which a single attribute is the parent of all other attributes, AODE weakens the attribute independence assumption of NB [4].

When classifying an object x = <x1, …, xn>, AODE excludes the models for which the training data contain fewer than m samples with the value xi of the parent attribute, where m is a predefined threshold on the sample size required for statistical inference. The main equation is defined as follows:

$$P(y, x) = \frac{\sum_{i: 1 \le i \le n \wedge F(x_i) \ge m} P(y, x_i)\, P(x \mid y, x_i)}{\left|\{ i : 1 \le i \le n \wedge F(x_i) \ge m \}\right|}, \tag{1}$$

where F(xi) is a count of the number of training instances having attribute-value xi and is used for enforcing the limit m.

AODE estimates the class probabilities through the above equation. Since the denominator of Eq. 1 is the same for every class, it need not be calculated; we can estimate the probabilities in the numerator of Eq. 1 and seek the class which maximizes the obtained term. The class satisfies:

$$\arg\max_{y} \left( \sum_{i: 1 \le i \le n \wedge F(x_i) \ge m} \hat{P}(y, x_i) \prod_{j=1}^{n} \hat{P}(x_j \mid y, x_i) \right).$$

When $\neg\exists i : 1 \le i \le n \wedge F(x_i) \ge m$, AODE defaults to NB [4].
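A minimal Python sketch of the AODE decision rule for Boolean attributes is given below. It makes several assumptions that are not taken from the paper: the function name `aode_predict`, the Laplace-style smoothing constants, the toy data, and the choice to skip the factor j = i (which equals 1 without smoothing).

```python
def aode_predict(x, data, labels, m=1):
    """Score each class by the sum over qualifying parents i of
    P^(y, x_i) * prod_{j != i} P^(x_j | y, x_i); fall back to NB if no parent qualifies."""
    n, n_attr = len(data), len(x)
    classes = sorted(set(labels))
    # Parents are the attributes whose observed value occurs at least m times: F(x_i) >= m.
    parents = [i for i in range(n_attr)
               if sum(1 for row in data if row[i] == x[i]) >= m]
    scores = {}
    for y in classes:
        rows_y = [row for row, c in zip(data, labels) if c == y]
        if not parents:
            # Naive Bayes fallback with Laplace smoothing (assumed).
            score = (len(rows_y) + 1) / (n + len(classes))
            for j in range(n_attr):
                match = sum(1 for row in rows_y if row[j] == x[j])
                score *= (match + 1) / (len(rows_y) + 2)
            scores[y] = score
            continue
        total = 0.0
        for i in parents:
            rows_yi = [row for row in rows_y if row[i] == x[i]]
            # Smoothed estimate of P(y, x_i) for Boolean attributes.
            term = (len(rows_yi) + 1) / (n + 2 * len(classes))
            for j in range(n_attr):
                if j == i:
                    continue
                match = sum(1 for row in rows_yi if row[j] == x[j])
                term *= (match + 1) / (len(rows_yi) + 2)
            total += term
        scores[y] = total
    return max(scores, key=scores.get), scores

# Hypothetical Boolean training data for two classes.
data = [(1, 1, 0), (1, 0, 0), (1, 1, 1), (0, 0, 1), (0, 1, 1), (0, 0, 0)]
labels = ["e", "e", "e", "f", "f", "f"]
prediction, scores = aode_predict((1, 1, 0), data, labels, m=1)
print(prediction)
```

Setting m larger than the training set size forces the fallback branch, which is a convenient way to compare the AODE score with the plain NB score on the same data.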

2.5. Complement class Naive Bayes

In order to deal with skewed training data, we introduce Complement class Naive Bayes (CNB) [7], a variant of Naive Bayes that estimates the parameters of class c using data from all classes except c. Because each estimate then uses a more even amount of training data, the bias in the weight estimates is lessened, and CNB obtains more stable weight estimates and improved classification accuracy. The estimate of CNB is as follows:

$$\hat{\theta}_{\tilde{c}i} = \frac{N_{\tilde{c}i} + \alpha_i}{N_{\tilde{c}} + \alpha},$$

where $N_{\tilde{c}i}$ is the number of times that word i occurred in documents of classes other than c; $N_{\tilde{c}}$ is the total number of word occurrences in classes other than c; $\alpha_i$ and $\alpha$ are smoothing parameters. The rule of classification is defined as follows:

$$l_{CNB}(d) = \arg\max_{c} \left[ \log p(\theta_c) - \sum_{i} f_i \log \frac{N_{\tilde{c}i} + \alpha_i}{N_{\tilde{c}} + \alpha} \right],$$

where $f_i$ is the frequency of word i in document d.
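The complement estimate and the classification rule can be sketched in Python as follows. The names `train_cnb` and `classify_cnb`, the choice of alpha as alpha_i times the vocabulary size, and the toy corpus are assumptions for illustration only.

```python
import math
from collections import Counter

def train_cnb(docs, labels, vocab, alpha_i=1.0):
    """Complement estimates theta[c][w] = (N_~c,w + alpha_i) / (N_~c + alpha)."""
    alpha = alpha_i * len(vocab)  # assumed choice: alpha is the sum of the alpha_i
    classes = sorted(set(labels))
    theta = {}
    for c in classes:
        counts = Counter()
        for doc, label in zip(docs, labels):
            if label != c:  # use the documents of every class except c
                counts.update(w for w in doc if w in vocab)
        total = sum(counts.values())
        theta[c] = {w: (counts[w] + alpha_i) / (total + alpha) for w in vocab}
    priors = {c: labels.count(c) / len(labels) for c in classes}
    return priors, theta

def classify_cnb(doc, priors, theta):
    """Return argmax_c [log p(c) - sum_i f_i * log theta[c][i]]."""
    known = next(iter(theta.values()))
    freqs = Counter(w for w in doc if w in known)

    def score(c):
        return math.log(priors[c]) - sum(f * math.log(theta[c][w])
                                         for w, f in freqs.items())

    return max(priors, key=score)

# Hypothetical toy corpus.
docs = [["oil", "gas"], ["oil", "drill"], ["bank", "loan"], ["bank", "stock"]]
labels = ["energy", "energy", "financial", "financial"]
vocab = ["oil", "gas", "drill", "bank", "loan", "stock"]
priors, theta = train_cnb(docs, labels, vocab)
print(classify_cnb(["oil", "oil", "gas"], priors, theta))
```

The minus sign in the score is the key design point: a word that is frequent in the complement of c is evidence against c, so a class wins when its complement explains the document poorly.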

3. Feature Selection Methods

Four feature selection methods [13-15] are used in this paper to evaluate the performance of the algorithms. Their definitions are stated below.

In the following equations, m denotes the number of classes, and $C_i$ represents the ith class. The number of partitions into which a feature can split the training set is represented by V. $N_{C_i}$ is the total number of samples in class i, and N is the total number of samples. The number of samples belonging to class i in the vth partition is denoted by $N_{C_i}^{(v)}$.

Chi Squared: The Chi Squared statistic is calculated by comparing the obtained frequency of a class with its a priori frequency. The definition is as follows:

$$\chi^2 = \sum_{i=1}^{m} \sum_{v=1}^{V} \frac{\left(N_{C_i}^{(v)} - \tilde{N}_{C_i}^{(v)}\right)^2}{\tilde{N}_{C_i}^{(v)}},$$

where the prior frequency of the class is given by the equation:

$$\tilde{N}_{C_i}^{(v)} = \left(N^{(v)} / N\right) N_{C_i},$$

with $N^{(v)}$ denoting the number of samples in the vth partition.
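As an illustration, the Chi Squared score of a single feature can be computed from counts as in the following Python sketch. The function name `chi_squared` and the example vectors are hypothetical; each distinct feature value is treated as one partition v.

```python
from collections import Counter

def chi_squared(feature, labels):
    """Sum over classes and partitions of (observed - expected)^2 / expected."""
    n = len(labels)
    n_c = Counter(labels)                 # samples per class
    n_v = Counter(feature)                # samples per partition
    n_cv = Counter(zip(labels, feature))  # samples of class c inside partition v
    score = 0.0
    for c in n_c:
        for v in n_v:
            expected = n_v[v] * n_c[c] / n  # prior frequency (N^(v) / N) * N_C
            score += (n_cv[(c, v)] - expected) ** 2 / expected
    return score

labels = ["a", "a", "b", "b"]
print(chi_squared([1, 1, 0, 0], labels))  # feature perfectly aligned with the class
print(chi_squared([1, 0, 1, 0], labels))  # feature independent of the class
```

A word whose presence is independent of the class scores zero, so ranking words by this score in descending order puts the most class-specific words first.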

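Information Gain: Information Gain measures how much a feature decreases entropy, and it can be defined with the following equation:

$$InfoGain = -\sum_{i=1}^{m} \left[ \frac{N_{C_i}}{N} \log \frac{N_{C_i}}{N} \right] - \left[ -\sum_{v=1}^{V} \sum_{i=1}^{m} \frac{N^{(v)}}{N} \frac{N_{C_i}^{(v)}}{N^{(v)}} \log \frac{N_{C_i}^{(v)}}{N^{(v)}} \right],$$

where $N^{(v)} = \sum_{i=1}^{m} N_{C_i}^{(v)}$ is the size of the vth partition.

Gain Ratio: Gain Ratio was first used in C4.5 and the definition is as follows:

$$GainRatio = InfoGain \Big/ \left[ -\sum_{i=1}^{m} \frac{N_{C_i}}{N} \log \frac{N_{C_i}}{N} \right].$$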
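The two definitions can be sketched in Python as follows. The function names and the base-2 logarithm are assumptions, and `gain_ratio` divides by the entropy of the class distribution to follow the normalization written in this section; it is not meant to reproduce any particular C4.5 implementation.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy of a list of discrete values, in bits (log base 2 assumed)."""
    n = len(values)
    return -sum((k / n) * math.log2(k / n) for k in Counter(values).values())

def info_gain(feature, labels):
    """Class entropy minus the partition-weighted class entropy."""
    n = len(labels)
    conditional = 0.0
    for v, n_v in Counter(feature).items():
        subset = [c for f, c in zip(feature, labels) if f == v]
        conditional += (n_v / n) * entropy(subset)
    return entropy(labels) - conditional

def gain_ratio(feature, labels):
    """InfoGain divided by the class entropy, as in the definition of this section."""
    class_entropy = entropy(labels)
    return info_gain(feature, labels) / class_entropy if class_entropy > 0 else 0.0

labels = ["a", "a", "b", "b"]
print(info_gain([1, 1, 0, 0], labels), gain_ratio([1, 1, 0, 0], labels))
```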

Symmetrical Uncertainty: Symmetrical Uncertainty

has been described in books about information theory

and in numerical recipes [19]. It is often used as a fast

correlation measure to evaluate the relevance of

individual features. With this method, the most

relevant feature is positioned at the beginning of the

list. This criterion compensates for the inherent bias of

Information Gain through dividing it by the sum of the

entropies of class labels M and features X [15]:

$$SU = 2 \times InfoGain / (Ent(M) + Ent(X)).$$

The value of SU ranges from 0 to 1. A value of 0 indicates that the attribute X and the class M have no association, while a value of 1 indicates that X can completely predict M.
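A compact Python sketch of SU is shown below. It uses the identity InfoGain = Ent(M) + Ent(X) - Ent(M, X), where Ent(M, X) is the joint entropy, so no explicit partitioning is needed. The function names and the base-2 logarithm are assumptions.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy of a list of discrete values, in bits."""
    n = len(values)
    return -sum((k / n) * math.log2(k / n) for k in Counter(values).values())

def symmetrical_uncertainty(feature, labels):
    """SU = 2 * InfoGain / (Ent(M) + Ent(X))."""
    h_x, h_m = entropy(feature), entropy(labels)
    if h_x + h_m == 0:
        return 0.0  # both variables are constant, so there is nothing to predict
    joint = entropy(list(zip(feature, labels)))
    info_gain = h_x + h_m - joint
    return 2.0 * info_gain / (h_x + h_m)

labels = ["a", "a", "b", "b"]
print(symmetrical_uncertainty([1, 1, 0, 0], labels))  # X completely predicts M
print(symmetrical_uncertainty([1, 0, 1, 0], labels))  # no association
```

Sorting the dictionary words by this score in descending order and keeping a prefix of the list gives the kind of top-N selection used in the experiments.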

4. Experiments

4.1. Corpus and Preprocessing

In our experiments, we use the CMU Industry Sector dataset, a collection of web pages belonging to companies from various economic sectors. A subset of the original data, which forms a two-level hierarchy, is used in this research. There are 527 instances partitioned into seven classes: materials, energy, financial, healthcare, technology, transportation and utilities.

Each web page of the corpus is represented as a set of words. After analyzing all the web pages of the corpus, a dictionary with N words is formed. Two data types are used in the experiments: Boolean and term frequency (TF). The Boolean type represents whether a word occurs in a web page, while TF describes the frequency of a word in a web page.

During preprocessing, we perform word stemming, stop-word removal and Document Frequency Thresholding (DFT) [18], all of which reduce the dimension of the feature space for web page categorization. In the end, the first 3,000 tokens of the dictionary are extracted according to their Mutual Information and form the corpus used in this paper.
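The two data types can be illustrated with a short Python sketch. The function name `vectorize` and the toy dictionary are hypothetical, and TF is taken here as the raw count of a word in a page, which is an assumption since no normalization is specified.

```python
from collections import Counter

def vectorize(tokens, dictionary, data_type="boolean"):
    """Map a tokenized web page onto the dictionary as a Boolean or TF vector."""
    counts = Counter(tokens)
    if data_type == "boolean":
        return [1 if counts[w] > 0 else 0 for w in dictionary]
    if data_type == "tf":
        return [counts[w] for w in dictionary]  # raw counts (assumed)
    raise ValueError("data_type must be 'boolean' or 'tf'")

page = ["oil", "gas", "oil"]
dictionary = ["gas", "oil", "bank"]
print(vectorize(page, dictionary, "boolean"))
print(vectorize(page, dictionary, "tf"))
```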

4.2. Performance Measure


Accuracy and F-measure are two popular evaluation metrics [17] in the text categorization domain, used for measuring the performance of classifiers.

Accuracy: Accuracy represents the percentage of correct predictions among all predictions. It can be defined as follows:

$$Accuracy = \frac{P_c}{P_t} \times 100\%,$$

where Pc is the number of correct predictions and Pt is the total number of predictions.

F-measure: F-measure can be defined as follows:

$$F = \frac{2 \times R \times P}{R + P},$$

where R is the Recall, the percentage of the web pages of a given category which are classified correctly, and P is the Precision, the percentage of the web pages predicted for a given class which are classified correctly. F-measure ranges from 0 to 1, and higher is better.
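Both measures can be computed directly from the true and predicted labels, as in the following Python sketch. The function names are hypothetical, and the F-measure is computed for one category at a time; how the per-category values are averaged over the seven classes is not specified here, so no averaging is shown.

```python
def accuracy(y_true, y_pred):
    """Percentage of correct predictions among all predictions."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)

def f_measure(y_true, y_pred, category):
    """F = 2 * R * P / (R + P) for a single category."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == category and p == category)
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is correctly predicted
    predicted = sum(1 for p in y_pred if p == category)
    actual = sum(1 for t in y_true if t == category)
    precision, recall = tp / predicted, tp / actual
    return 2 * recall * precision / (recall + precision)

y_true = ["energy", "energy", "financial", "financial"]
y_pred = ["energy", "financial", "financial", "financial"]
print(accuracy(y_true, y_pred), f_measure(y_true, y_pred, "energy"))
```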

4.3. Results and Analysis

We use 10-fold cross-validation on this benchmark dataset to estimate classification performance in our experiments, comparing eight classifiers (the five Bayesian classifiers above together with SVM, kNN and C4.5) and four feature selection methods.
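The splitting step of 10-fold cross-validation can be sketched in Python as follows. The generator name `k_fold_indices`, the shuffling and the fixed seed are assumptions; whether the folds in the experiments were shuffled or stratified by class is not stated.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(527, k=10))  # 527 instances, as in the corpus
print([len(test) for _, test in splits])
```

Each classifier is trained on the train indices and evaluated on the test indices of every split, and the reported figure is the average over the ten folds.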

For Fig. 4, we select the top 100 relevant words with each of the four feature selection methods and compare the capabilities of the eight algorithms. The results show that HNB is a better classifier than the other seven methods in terms of both accuracy and F-measure. In Fig. 4, the accuracy of HNB reaches 88.97% and its F-measure reaches 0.895, both of which are the highest. We also find that selecting relevant words by SU is more competitive than the other three methods, since both the highest accuracy and the highest F-measure occur when we select features according to SU scores.

Because SU has been little studied for web page categorization, we further study the capability and stability of SU by running the classifiers on different numbers of relevant words selected according to SU scores.

In the following experiment, we sort words according to their SU scores, and then study the performance of the eight classifiers on different numbers of top relevant words. As shown in Fig. 5, we select the attributes by keeping the top N words according to SU scores, where N is the number of attributes; in our experiments, it ranges from 30 to 500. When the number of attributes reaches 500, CNB achieves the highest accuracy and the highest F-measure, 95.63% and 0.956 respectively. Compared with kNN and C4.5, the Bayesian classifiers perform satisfactorily, especially HNB and AODE, which are more stable than the other classifiers in both accuracy and F-measure. Because of its high complexity, SVM is time-consuming and better suited to small-size problems. In contrast, the complexity of AODE and HNB is much lower, so they are more feasible in practice. The ability of SU to select relevant words is also promising.

Figure 4. The accuracy (left) and F-measure (right) comparison of Bayesian classifiers with the Boolean data type. BN represents Bayes Network, NB represents Naive Bayes. [Bar charts for AODE, BN, CNB, HNB, NB, SVM, kNN and C4.5 under InfoGain, GainRatio, ChiSquared and SU.]

Figure 5. Accuracy (left) and F-measure (right) curves of the eight classifiers with different numbers of relevant words according to their SU scores and Boolean data type. BN represents Bayes Network, NB represents Naive Bayes. [Horizontal axis: Number of Attributes (30, 50, 80, 100, 200, 300, 500).]

Figure 6. The accuracy (left) and F-measure (right) comparison of Bayesian classifiers with the TF data type. BN represents Bayes Network, NB represents Naive Bayes. [Bar charts for AODE, BN, CNB, HNB, NB, SVM, kNN and C4.5 under InfoGain, GainRatio, ChiSquared and SU.]

Figure 7. Accuracy (left) and F-measure (right) curves of the eight classifiers with different numbers of relevant words according to their SU scores and TF data type. BN represents Bayes Network, NB represents Naive Bayes. [Horizontal axis: Number of Attributes (30, 50, 80, 100, 200, 300, 500).]

As with the Boolean data type, we study the TF data type as follows. In Fig. 6, we again select the top 100 relevant words. The results show that both the accuracy and the F-measure of SVM decline, while AODE and HNB still perform well. From Fig. 6, we observe that the accuracy of HNB reaches 90.30%, which is the highest. Similarly, the highest F-measure, 0.906, is also achieved by HNB.

The lowest accuracy and F-measure are 68.06% and 0.706 respectively, both obtained by SVM. We again find that selecting relevant words by SU is more competitive, since both the highest accuracy and the highest F-measure occur when we select features according to SU scores.

In the following experiment, we sort the words according to their SU scores, and then study the performance of the eight classifiers on different numbers of top N relevant words. In this experiment, N ranges from 30 to 500. In Fig. 7, we find that when the number of attributes reaches 500, the accuracy of CNB reaches its highest value, 96.96%. Fig. 7 also shows how the F-measure changes across the classifiers, and we find that the F-measure of CNB likewise reaches its highest value when the number of attributes is 500.

Our experiments also allow a comparison across the two data types. On the Boolean data type, the highest average accuracy is 87.34%, achieved by NB, and the second highest is 87.18%, achieved by HNB. SVM is third and AODE is fourth, with 87.10% and 85.58% respectively. As to the average F-measure, the highest is 0.878, achieved by NB, followed by HNB with 0.877. The third is 0.873, achieved by SVM, and the fourth is 0.864, achieved by AODE. From the data offered above, we believe