
Incremental Hierarchical Clustering of Text Documents

by Nachiketa Sahoo

Adviser: Jamie Callan

May 5, 2006

Abstract

Incremental hierarchical text document clustering algorithms are important in organizing documents generated from streaming on-line sources, such as Newswire and Blogs. However, this is a relatively unexplored area in the text document clustering literature. Popular incremental hierarchical clustering algorithms, namely Cobweb and Classit, have not been applied to text document data. We discuss why, in their current form, these algorithms are not suitable for text clustering and propose an alternative formulation. This includes changes to the underlying distributional assumption of the algorithm in order to conform with the empirical data. Both the original Classit algorithm and our proposed algorithm are evaluated using Reuters newswire articles and the Ohsumed dataset, and the gain from using a more appropriate distribution is demonstrated.

1 Introduction

Document clustering is an effective tool to manage information overload. By grouping similar documents together, we enable a human observer to quickly browse large document collections [6], make it possible to easily grasp the distinct topics and subtopics (concept hierarchies) in them, and allow search engines to efficiently query large document collections [16], among many other applications. Hence, it has been widely studied as part of the broad literature on data clustering. One survey of the existing clustering literature can be found in Jain et al. [13].

The most studied document clustering algorithms are batch clustering algorithms, which require all the documents to be present at the start of the exercise and cluster the document collection by making multiple iterations over it. But with the advent of online publishing on the World Wide Web, the number of documents generated every day has increased considerably. Popular sources of informational text documents, such as Newswire and Blogs, are continuous in nature. To organize such documents naively using existing batch clustering algorithms, one might attempt to perform clustering on the documents collected so far. But this is extremely time consuming, if not impossible, due to the sheer volume of documents. One might be tempted to convert an existing batch clustering algorithm into an incremental one by performing batch clustering on periodically collected small batches of documents and then merging the generated clusters. However, ignoring for the moment the problem of deciding on an appropriate time window in which to collect documents, there would always be a wait before a newly generated document could appear in the cluster hierarchy. This delay would be unacceptable in several important scenarios, e.g., financial services, where trading decisions depend on breaking news and quick access to appropriately classified news documents is important. A clustering algorithm in such a setting needs to process the documents as soon as they arrive. This calls for the use of an incremental clustering algorithm.

There has been some work on incremental clustering of text documents as part of the Topic Detection and Tracking initiative ([1], [19], [10] and [7]) to detect a new event from a stream of news articles. But the clusters generated by this task are not hierarchical in nature. Although that was adequate for the purpose of new event detection, we believe it is a limitation. The benefits of using a hierarchy of clusters instead of clusters residing at the same level of granularity are twofold. First, by describing the relationships between groups of documents, a hierarchy makes it possible to quickly browse to the specific topic of interest. The second reason is a technical one. Finding the right number of clusters in a set of documents is an ill-formed problem when one does not know the information needs of the end user. But if we present the user with a topic hierarchy populated with documents, which she can browse at her desired level of specificity, we circumvent the problem of finding the right number of clusters while generating a solution that satisfies users with different needs.

In spite of the potential benefits of an incremental algorithm that can cluster text documents, as they arrive, into an informative cluster hierarchy, this is a relatively unexplored area in the text document clustering literature. In this work we examine a well-known incremental hierarchical clustering algorithm, Cobweb, that has been used in non-text domains, and its variant Classit. We discuss why they are not suitable for direct application to text clustering and propose a variant of these algorithms that is based on the properties of text document data. We then evaluate both algorithms using real-world data and show the gains obtained by our proposed algorithm.

1.1 Contribution of this research

In this paper we demonstrate methods to carry out incremental hierarchical clustering of text

documents. Speciﬁcally, the contributions of this work are:

1. A Cobweb-based algorithm for text document clustering where word occurrence

attributes follow Katz’s distribution.

2. Evaluation of the existing algorithms and our proposed algorithm on large real world doc-

ument datasets.

In Section 2 we briefly review the text clustering literature, including the Cobweb and Classit algorithms. In Section 3 we describe key properties of text documents that are central to this work. In Section 4 we explain the contributions of our work. In Section 5 we describe the cluster quality metrics that we have used to evaluate the results obtained. In Section 6 we explain the setup of the experiment and discuss the results. In Section 7 we conclude with scope for future research.

2 Literature review

Clustering is a widely studied problem in the Machine Learning literature [13]. The prevalent clustering algorithms have been categorized in different ways depending on different criteria, such as hierarchical vs. non-hierarchical, partitional vs. agglomerative, deterministic vs. probabilistic, and incremental vs. batch algorithms. Hierarchical and non-hierarchical clustering algorithms are distinguished by whether they produce a cluster hierarchy or a set of clusters all belonging to the same level. Different hierarchical and non-hierarchical clustering algorithms for text documents have been discussed by Manning and Schutze [17]. Clustering algorithms can be partitional or agglomerative in nature. A partitional algorithm starts with one large cluster containing all the documents in the dataset and divides it into smaller clusters, whereas an agglomerative clustering algorithm starts with every document in its own cluster and combines the most similar clusters until the desired number of clusters is obtained. Deterministic clustering algorithms assign each document to only one cluster, while probabilistic clustering algorithms produce the probabilities of each item belonging to each cluster; the former are said to make "hard" assignments, the latter "soft" assignments. Incremental clustering algorithms make one or very few passes over the entire dataset and decide the cluster of an item as they see it, whereas batch clustering algorithms iterate over the entire dataset many times and gradually change the assignments of items to clusters so that a clustering criterion function is improved. One such criterion function is the average similarity among documents inside the clusters formed; another is the average similarity between a document in a cluster and documents outside the cluster. The first is called average internal similarity and the second average external similarity. In a clustering solution we want high average internal similarity, because that means our clusters are composed of similar items, and low average external similarity, because that means our clusters are dissimilar, i.e., they do not overlap. The final set of clusters is produced after many iterations, when no further improvement of the cluster assignment is possible.

Clustering to browse large document collections (Scatter/Gather)

Cutting et al. [6] were among the first to suggest a cluster-aided approach, called Scatter/Gather, to browse large document collections. They describe two fast routines, named Buckshot and Fractionation, to find the centroids of the clusters to be formed. The documents in the collection are then assigned to the nearest centroid, and the centroids are recomputed iteratively until very little or no improvement is observed. The last step is similar to Simple K-means clustering, except that in Simple K-means one initially assigns $k$ randomly picked items as the centroids of the $k$ clusters [17]. Note that $k$ is a fixed, user-provided number. Buckshot finds the $k$ centers in the document dataset by drawing a sample of $\sqrt{kn}$ documents and clustering them into $k$ clusters using an agglomerative hierarchical clustering routine. Agglomerative hierarchical clustering algorithms have a time complexity of $O(n^2)$; by drawing a random sample of size $\sqrt{kn}$, the time complexity is reduced to $O(kn)$. Fractionation, on the other hand, finds $k$ centroids in the following manner. It divides the set of documents into buckets of size $m$, where $m > k$. It then clusters each bucket into $\rho m$ clusters, where $\rho < 1$ is a constant. It repeats this process of partitioning the data and clustering, treating each formed cluster as one data item, until $k$ clusters are obtained. Cutting et al. have shown that Fractionation has a time complexity of $O(mn)$. The centers of the clusters formed by either method are returned as the starting points for the Simple K-means clustering routine. With the help of these two routines they propose a cluster-aided approach to browsing document collections in which the program presents the user with a set of clusters of the document dataset (Scatter) along with their descriptive labels. The user then selects the clusters that interest her and submits them to the program. The program merges the documents contained in those clusters (Gather) and clusters them again. This process is repeated until the user's information need is met or the user decides to stop. The recursive clustering idea proposed in Scatter/Gather can be effective in browsing large document sets, especially when one does not know enough about the documents to query a deployed search engine using keywords. This concept loosely parallels the idea of organizing documents into a hierarchy of topics and subtopics, except that the organization in this case is guided by the user and executed by a clustering routine. However, Scatter/Gather has its limitations. It is a batch clustering routine, hence it cannot be used in some important scenarios, such as those described in the introduction. Another limitation, which Scatter/Gather shares with many other clustering algorithms, is that it requires the input of $k$, the number of clusters to present to the user. A value of $k$ different from the number of subtopics in the collection might lead to meaningless clusters.

Right number of clusters

Finding the right number of clusters in a non-hierarchical clustering exercise is often a diﬃ-

cult problem [18]. The approaches suggested in the literature can, in general, be divided into

two groups [4]. The ﬁrst approach is a multi-fold cross validation one with likelihood as the

objective function, in which one ﬁts a series of mixture models with diﬀerent numbers of compo-

nents to a subset of the data called training data and computes the likelihood of each model

given the remaining subset of the data called testing data. The model that results in the highest

likelihood is selected. The second approach also ﬁts a mixture model to the data and computes

the likelihood of the model given the entire dataset for different numbers of clusters, but it penalizes a model with a higher number of clusters for its increased complexity. Observe that a higher number of clusters can be made to fit any dataset better than a lower number of clusters. Hence, by penalizing a clustering solution for its complexity one can achieve a trade-off between the fitness, or likelihood, of the model and its complexity, which is optimized at the right number of clusters. One such work is by Cheeseman and Stutz in their AutoClass algorithm [5]. Other such works include the Bayesian Information Criterion and the Minimum Description Length criterion [8]. A different approach has been suggested by Liu et al. [16] for clustering text

documents. It uses stability of clustering solutions over multiple runs at each of a set of cluster

counts to decide the right number of clusters for the document dataset.


Even when the “right” number of clusters can be determined by an algorithm based on some

criterion, human observers often diﬀer from each other about the clusters existing in the dataset

and what should be the right number of clusters. One alternative solution is to generate a hier-

archy of clusters, also called a dendrogram, with all the documents belonging to a single cluster

at the top of the hierarchy, each document in its individual cluster at the lowest level of the

hierarchy, and intermediate numbers of clusters at the levels in between. Thus, the user can look at the desired level in the hierarchy and find a number of clusters that meets her requirement ([17], [13]).

Incremental document clustering

As part of Topic Detection and Tracking (TDT) initiative ([1], [19], [10] and [7]) some experi-

ments have been done in incrementally clustering text documents. The TDT initiative is a

DARPA sponsored project started to study and advance the state of the art in detection and

tracking of new events in streams of news broadcasts and intelligence reports. The identified tasks

of TDT are Story Segmentation, Retrospective Topic Detection, On-line New Event Detection,

Topic Tracking and Link Detection. The Story Segmentation task involves breaking a stream of

text or audio data without story delimiters into its constituent stories. Retrospective topic

detection involves detecting new events in the already collected set of documents. On-line new

event detection involves identifying a new event, e.g., an earthquake or a road accident, in a new

document. Tracking involves keeping track of evolution of an event by assigning the incoming

news stories to their corresponding events. Among these tasks the on-line new event detection

task involves incremental clustering. In this task a decision is made, after observing a new item,

whether it belongs to one of the existing clusters, or it belongs to a new cluster of its own.

The TDT team at Carnegie Mellon University (CMU) uses a threshold-based rule to decide whether a new document is another story about one of the detected events or belongs to a new event of its own. If the maximum similarity between the new document and any of the existing clusters is more than a threshold ($t_c$), the new document is said to belong to the cluster to which it is most similar and is merged into that cluster. If the maximum similarity is less than $t_c$ but more than another threshold, $t_n$, then the document is assumed to be an old story but is not merged into any cluster. If the maximum similarity is less than $t_n$, then the document is accepted as being about a new event and a new cluster is formed. They have also investigated adding a time component to the incremental clustering. In this experiment, similarities of a new document to each of the past $m$ documents are computed, but they are weighted down linearly depending on how old the past documents are. If the similarity scores computed in this manner are less than a preset threshold, the new document is presumed to be about a new event. This work finds that the use of a time component improves the performance of the new event detection task.

The TDT team at the University of Massachusetts Amherst (UMASS) takes a variable thresholding approach to the on-line event detection task [1]. For each document that initiates a new cluster, the top $n$ words are extracted and called a query vector. The similarity of the query vector to the document from which the query was extracted defines an upper bound on the threshold required to be met by a document to match the query. A time-dependent component is also used in the variable threshold, which makes it harder for a new document to match an older query. When a new document $d_j$ is compared to a past query $q_i$, the threshold is computed as $0.4 + p \times (\mathrm{sim}(q_i, d_i) - 0.4) + tp \times (j - i)$, where $0 < p < 1$ and $tp$, a time penalty factor, are tunable parameters; $q_i$ is the query generated from document $d_i$. Such a threshold is computed for all existing queries $q_i$. If the similarity of the new document $d_j$ does not exceed any of the thresholds, then the document is assigned to a new cluster and a query is computed for it; otherwise it is added to the clusters assigned to the queries it triggers. A newly generated cluster is said to have detected a new news event.

Outside the TDT initiative, Zhang and Liu in a recent study have proposed a competitive

learning algorithm, which is incremental in nature and does not need to be supplied with the

correct number of clusters [20]. The algorithm, called Self Splitting Competitive Learning, starts

with a prototype vector that is a property of the only cluster present initially. During the execution of the algorithm the prototype vector is split and updated to approximate the centroids of the clusters in the dataset. The update of the prototype vector is controlled, i.e., when a new data point is added to the cluster, the prototype vector is updated only if the data point is near enough to the prototype. This is determined by another property vector that starts away from the prototype and zeroes in on it as more and more data points are added. The time for splitting the cluster associated with the prototype is determined by a threshold condition. When there is more than one prototype, a new data point is added to the cluster of the prototype nearest to it. They have demonstrated their algorithm on text snippets returned by search engines in response to a query. However, the success of this algorithm on datasets with longer text documents is yet to be demonstrated.

Yet another on-line algorithm, called frequency sensitive competitive learning, has been proposed and evaluated on text datasets by Banerjee and Ghosh [2]; it is designed to produce

clusters of items of approximately equal sizes. In this work a version of the K-means clustering

algorithm called spherical K-means has been modiﬁed so that the dispersion of the distributions

associated with the clusters reduces as more and more data points are added to them. This

makes larger clusters less likely candidates for a new data point than the smaller clusters. Thus,

the algorithm is tailored to produce clusters which are more or less equal in size.

All of these algorithms produce non-hierarchical clustering solutions, which forgoes the opportunity to use clustering as an aid in detecting the topic and subtopic structure within a large document collection. Also, the TDT experiments effectively exploit the information in the time stamp available with news stories, i.e., they assume that news stories that describe the same event will occur within a brief span of time. Such information may not always be available.

Incremental Hierarchical Clustering: Nominal Attributes

Methods have been proposed in the non-text domain to cluster items in an incremental manner

into hierarchies. Most notable among them is the Cobweb algorithm by Fisher [9] and its

derivative Classit [12]. Cobweb is an algorithm to incrementally cluster data points with

nominal attributes into cluster hierarchies.

At the heart of Cobweb is a cluster quality measure called Category Utility.

Let $C_1, \dots, C_K$ be the child clusters of a cluster $C_p$. The Category Utility of $C_1, \dots, C_K$ is computed as

$$CU_p[C_1, \dots, C_K] = \frac{\sum_{k=1}^{K} P(C_k) \sum_i \sum_j \left[P(A_i = V_{ij} \mid C_k)^2 - P(A_i = V_{ij} \mid C_p)^2\right]}{K}, \qquad (1)$$

where
$P(C_k)$ = the probability that an item belonging to the parent cluster $C_p$ belongs to the child cluster $C_k$,
$A_i$ = the $i$th attribute of the items being clustered (say $A_1 \in \{\text{male}, \text{female}\}$, $A_2 \in \{\text{Red}, \text{Green}, \text{Blue}\}$; assumed to be a multinomial variable),
$V_{ij}$ = the $j$th value of the $i$th attribute (say, $V_{12}$ indicates "female").

$\sum_j P(A_i = V_{ij} \mid C_k)^2$ is the expected number of times we can correctly guess the value of the multinomial variable $A_i$ for an item in cluster $k$ when one follows a probability matching guessing strategy. For example, if we have a variable that takes values A, B and C with probabilities 0.3, 0.5 and 0.2, and we randomly predict that the variable takes value A 0.3 of the time, B 0.5 of the time and C 0.2 of the time, we would be correct in predicting A $0.3 \times 0.3 = 0.09$ of the time, B 0.25 of the time and C 0.04 of the time. A good cluster, in which the attributes of the items take similar values, will have high $P(A_i = V_{ij} \mid C_k)$ values, and hence a high score $\sum_j P(A_i = V_{ij} \mid C_k)^2$. Cobweb maximizes the sum of the $P(A_i = V_{ij} \mid C_k)^2$ scores over all possible assignments of a document to child clusters. When the algorithm assigns a new item to a child node of node $p$, it assigns the item in such a manner that the total gain in the expected number of correct guesses from moving the item from $p$ to its child node, $\sum_i \sum_j [P(A_i = V_{ij} \mid C_k)^2 - P(A_i = V_{ij} \mid C_p)^2]$, is maximized. In this manner the algorithm maximizes the utility function for each node to which a new item is added.
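To make the Category Utility computation concrete, here is a minimal sketch (our own illustration, not code from the paper) that evaluates Expression (1) for a proposed partition of a parent cluster; items are represented as attribute-to-value dictionaries, and all function and variable names are ours.

    from collections import Counter

    def attribute_value_probs(items, attr):
        # Empirical P(A_attr = v) within a set of items.
        counts = Counter(item[attr] for item in items)
        n = len(items)
        return {v: c / n for v, c in counts.items()}

    def category_utility(parent_items, child_partition, attributes):
        # Expression (1): sum over children of
        # P(C_k) * sum_i sum_j [P(A_i=V_ij|C_k)^2 - P(A_i=V_ij|C_p)^2],
        # divided by the number of children K.
        n_parent = len(parent_items)
        K = len(child_partition)
        total = 0.0
        for child_items in child_partition:
            p_ck = len(child_items) / n_parent
            gain = 0.0
            for attr in attributes:
                child_probs = attribute_value_probs(child_items, attr)
                parent_probs = attribute_value_probs(parent_items, attr)
                gain += sum(p ** 2 for p in child_probs.values())
                gain -= sum(p ** 2 for p in parent_probs.values())
            total += p_ck * gain
        return total / K

An incremental implementation would keep running value counts per node instead of recomputing the probabilities from raw items, but the arithmetic is the same.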

The Cobweb control structure is shown in Figure 1.


Algorithm Cobweb (adapted from Fisher's original work [9])

function Cobweb(item, root)
    Update the attribute value statistics at the root
    If root is a leaf node then
        Return the expanded node that accommodates the new object
    else
        Find the best child of the root to host the item and perform the
        qualifying step (if any) among the following:
          1. Create a new node for the item instead of adding it to the
             best host, if that leads to improved Category Utility.
          2. Merge nodes if that leads to improved Category Utility, and
             call Cobweb(item, merged node).
          3. Split a node if that leads to improved Category Utility, and
             call Cobweb(item, root).
        If none of the above steps is performed then
            Call Cobweb(item, best child of root)
        end if
    end if

Figure 1. Cobweb control structure.

An illustration of the clustering process is given in Figure 2.

[Figure 2. Cobweb illustrated. Top: the addition of a new item (2) to a leaf node (1) expands the leaf into a node with children (1) and (2). Bottom: a new item (104) arrives at a node (89) with children (34) and (67), where (67) has children (23) and (12). At (89): which node should the new item be added to, (34) or (67), or should it belong to a cluster of its own next to (34) and (67)? The Category Utility comparison, as described in Figure 1, decides; let the answer be (67). At (67): which node should the new item be added to, (23) or (12)?]

Assume that there is only one attribute of interest, called $t$, and that it takes values in $\{A, B, C\}$. Also assume that we have three items $a$, $b$ and $c$ with $t$ values $A$, $B$ and $C$ respectively. Further assume that the objects are presented in the order specified, i.e., first $a$, followed by $b$, which is followed by $c$.

After the first two items are presented, the following cluster configuration is arrived at without any computation of category utility (first part of Figure 2).

C3 (a and b): P(C3) = 1; t = A, t = B
├─ C1 (a): P(C1) = 0.5; t = A
└─ C2 (b): P(C2) = 0.5; t = B

Figure 3. After the first two items are added.

$C_3$ is the root cluster and $C_1$ and $C_2$ are two child clusters, each containing one item. $P(C_1)$ is the probability that a document randomly picked from the parent cluster of $C_1$, i.e., $C_3$, belongs to $C_1$; similarly for $C_2$.


Let us add the third item $c$ to the root node. We can add it at the level of $C_1$ and $C_2$ (level 2) as another cluster $C_4$, or we can add it inside $C_1$ or $C_2$, which would delegate the item $c$ to a third (new) level. So, our options are the following (omitting the $c$-within-$(b, c)$ configuration, which is analogous to the $c$-within-$(a, c)$ configuration described below):

Configuration 1:

C3 (a, b and c): P(C3) = 1; t = A, t = B, t = C
├─ C1 (a): P(C1) = 1/3; t = A
├─ C2 (b): P(C2) = 1/3; t = B
└─ C4 (c): P(C4) = 1/3; t = C

Configuration 2:

C3 (a, b and c): P(C3) = 1; t = A, t = B, t = C
├─ C4 (a and c): P(C4) = 2/3; t = A, t = C
│  ├─ C1 (a): P(C1) = 0.5; t = A
│  └─ C5 (c): P(C5) = 0.5; t = C
└─ C2 (b): P(C2) = 1/3; t = B

Figure 4. Two partitions of the root cluster.

At this point the Category Utilities of the two configurations let us decide which configuration to choose. Note that we need to compute the category utility of the two partitions of the root cluster. They can be computed using Expression (1) as described below.

For the first configuration in Figure 4 the parent cluster is $C_3$ and the child clusters are $C_1$, $C_2$ and $C_4$. The category utility of this configuration is

$$CU_1 = \frac{\sum_{k \in \{1,2,4\}} P(C_k)\left[\sum_{t \in \{A,B,C\}} P(t \mid C_k)^2 - \sum_{t \in \{A,B,C\}} P(t \mid C_3)^2\right]}{3} = \frac{3 \cdot \frac{1}{3}\left[1^2 - \left(\left(\tfrac{1}{3}\right)^2 + \left(\tfrac{1}{3}\right)^2 + \left(\tfrac{1}{3}\right)^2\right)\right]}{3} = \frac{2}{9}$$

For the second configuration in Figure 4 the parent cluster is $C_3$ and the child clusters are $C_4$ and $C_2$:

$$CU_2 = \frac{\sum_{k \in \{4,2\}} P(C_k)\left[\sum_{t \in \{A,B,C\}} P(t \mid C_k)^2 - \sum_{t \in \{A,B,C\}} P(t \mid C_3)^2\right]}{2} = \frac{\frac{2}{3}\left[\left(\tfrac{1}{2}\right)^2 + \left(\tfrac{1}{2}\right)^2 - \tfrac{1}{3}\right] + \frac{1}{3}\left[1^2 - \tfrac{1}{3}\right]}{2} = \frac{1}{6}$$


Since $CU_1 > CU_2$, we select configuration 1 over configuration 2. Looking at Figure 4, it is intuitive to make a new cluster for the third item, because it has an attribute value not seen in any of the existing categories.

There is one more possible configuration, in which $c$ is added below $C_2$ instead of $C_1$, but it is symmetrical to the second configuration in Figure 4, so the analysis is identical to the one shown in the previous paragraph.
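The worked example above can be checked numerically with the category_utility sketch given earlier (again, our own illustration, not the paper's code):

    a, b, c = {"t": "A"}, {"t": "B"}, {"t": "C"}
    parent = [a, b, c]
    config1 = [[a], [b], [c]]   # c gets a new cluster at level 2
    config2 = [[a, c], [b]]     # c is delegated below, next to a

    print(category_utility(parent, config1, ["t"]))  # 0.2222... = 2/9 = CU_1
    print(category_utility(parent, config2, ["t"]))  # 0.1666... = 1/6 = CU_2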

Incremental clustering algorithms, such as Cobweb, are sensitive to the order in which items are presented [9]. Cobweb makes use of split and merge operations to correct this problem. In the merge operation the child nodes with the highest and second-highest Category Utility are removed from the original node and made child nodes of a new node, which takes their place under the parent node. In the split operation the best node is removed and its child nodes are made children of the parent of the removed node. Merge and split operations are only carried out if they lead to a better Category Utility than obtainable by either assigning the item to the existing best node or assigning it to a new cluster of its own. By using these two operators, the algorithm remains flexible in the face of changes in the properties of the data items in subsequent observations.

Merge (merging 3 and 4 into 5):

Before:           After:
1                 1
├─ 2              ├─ 2
├─ 3              └─ 5
└─ 4                 ├─ 3
                     └─ 4

Split (splitting 3 into 4 and 5):

Before:           After:
1                 1
├─ 2              ├─ 2
└─ 3              ├─ 4
   ├─ 4           └─ 5
   └─ 5

Figure 5. Merge and split operations illustrated.

Incremental Hierarchical Clustering: Numerical Attributes

We now consider an extension of Cobweb from nominal attributes to numerical attributes. Gennari et al. [12] have shown that in order to use Cobweb for data items with numeric, rather than nominal, attribute values we need to make some assumption about the distribution of attribute values. When the values of each attribute follow a normal distribution, they have shown that the Category Utility function can be written as

$$CU_p[C_1, \dots, C_K] = \frac{\sum_k P(C_k) \sum_i \left(\frac{1}{\sigma_{ik}} - \frac{1}{\sigma_{ip}}\right)}{K}$$

where
$\sigma_{ip}$ = the standard deviation of the value of attribute $i$ in the parent node $p$, and
$\sigma_{ik}$ = the standard deviation of the value of attribute $i$ in the child node $k$.

This algorithm is known as the Classit algorithm.
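A minimal sketch of this numeric Category Utility, under our own cluster representation (each cluster stores a list of values per attribute). Classit bounds $1/\sigma$ with an "acuity" parameter so that singleton clusters, whose standard deviation is zero, do not produce an infinite score; the default value below is our assumption.

    import statistics

    def classit_category_utility(parent_cols, child_cols_list, acuity=1.0):
        # (1/K) * sum_k P(C_k) * sum_i (1/sigma_ik - 1/sigma_ip),
        # where parent_cols / child_cols map attribute -> list of values.
        def inv_sigma(values):
            sigma = statistics.pstdev(values) if len(values) > 1 else 0.0
            return 1.0 / max(sigma, acuity)   # acuity floor (our assumption)

        n_parent = len(next(iter(parent_cols.values())))
        K = len(child_cols_list)
        total = 0.0
        for child_cols in child_cols_list:
            n_child = len(next(iter(child_cols.values())))
            p_ck = n_child / n_parent
            total += p_ck * sum(inv_sigma(child_cols[a]) - inv_sigma(parent_cols[a])
                                for a in parent_cols)
        return total / K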

We have not seen any prior application of either of these algorithms to text clustering. Hence, their performance on text document data was uncertain at the time of this work. Further, word occurrence counts, the attributes of text documents that are commonly used to represent a document, follow a skewed distribution, unlike the Normal distribution (Figure 6). The Normal distribution also assumes that the attributes are real numbers, but word occurrence counts are nonnegative integers. They cannot be treated as nominal attributes either, because the occurrence counts are not contained in a bounded set, which one would have to assume in order to treat them as nominal attributes. A more suitable distribution for such count data is the Negative Binomial, or Katz's distribution [14].

Our work proposes to improve upon the original Cobweb algorithm using distributional assumptions that are more appropriate for word count data.


3 Text Documents and word distributions

Text, as we commonly know it, is available in the form of unstructured documents. Before we can use such documents for classification or clustering, we need to convert them to items with attributes and values. A popular way of converting a document to such a form is to use the words¹ in the document as attributes and the number of times each word occurs in the document, or some function of it, as the value of the corresponding attribute. This is called the "Bag of Words" approach. One consequence of using such a method is that one forgoes the information contained in the order of the words. Despite this drawback, the bag-of-words approach is one of the most successful and widely used methods of converting text documents into an actionable form.

Several attempts have been made to characterize the distribution of words across documents. This is useful in judging the information content of a word. For instance, a word that occurs uniformly in every document of the corpus, e.g., "the", is not as informative as a word that occurs frequently in only a few, e.g., "Zipf".

Occurrence statistics of a word in a document can be used along with the information content of the word to infer the topic of the document and to cluster documents of similar topics into the same group, as is done in this work. Manning and Schutze have discussed several models that characterize the occurrence of words across different documents [17].

3.1 Models based on Poisson distribution

3.1.1 Poisson. The Poisson distribution has been used to model the number of times a word occurs in a document. The probability of a word occurring $k$ times in a document is given by

$$P(k) = \frac{\lambda^k e^{-\lambda}}{k!} \qquad (2)$$

where $\lambda$ is a rate parameter. However, it has been found from empirical observations that the Poisson distribution tends to overestimate the frequency of informative words (content words) [17].

3.1.2 Two Poisson Model. There have been attempts to characterize the occurrence of a word across documents using a mixture of Poisson distributions. One such attempt uses two Poisson distributions to model the probability of a word occurring a certain number of times in a document. One of the distributions captures the rate of occurrence when the word occurs because it is topically relevant to the document; the other captures the rate of occurrence when the word occurs without being topically relevant to the document. This mixture of two probability distributions has the probability density function

$$P(k) = \alpha \frac{\lambda_1^k e^{-\lambda_1}}{k!} + (1-\alpha) \frac{\lambda_2^k e^{-\lambda_2}}{k!} \qquad (3)$$

where $\alpha$ is the probability of the word being topically relevant and $1-\alpha$ is the probability of the word being topically unrelated to the document.

It has been empirically observed that, although the two Poisson model fits the data better than a single Poisson model [3], a spurious drop is seen in the fitted probability of a word occurring twice in a document [14]. The fitted distribution gives a lower probability to a word occurring twice in a document than to it occurring three times, i.e., it predicts that there are fewer documents that contain a word twice than there are documents that contain the same word three times. But empirically it has been observed that the document count monotonically decreases for increasing numbers of occurrences of a word (see Figure 6).

1. Throughout this paper we shall use word and term interchangeably to refer to the same thing, i.e., a contiguous sequence of alphanumeric characters delimited by non-alphanumeric character(s). E.g., the first word or term in this footnote is "Throughout".


3.1.3 Negative Binomial. A proposed solution to the above problem is to use a mixture of more than two Poisson distributions to model the word occurrences. A natural extension of this idea is to use a Negative Binomial distribution, which is a gamma mixture of an infinite number of Poisson distributions [11]. The probability density function of the Negative Binomial distribution is

$$P(k) = \binom{k+r-1}{r-1} p^r (1-p)^k, \qquad (4)$$

where $p$ and $r$ are parameters of the distribution.

Although the Negative Binomial distribution fits the word occurrence data very well, it can be hard to work with because it often involves computing a large number of coefficients [17]. This has been confirmed in our analysis (see Expressions (28) and (29) in Section 4.2).

3.1.4 Zero Inflated Poisson. When we observe the word occurrence counts in documents, we find that most words occur in only a few documents of the corpus. So, for most words, the count of documents in which they occur zero times is very large (see Figure 6). Looking at the shape of the empirical probability density function, we attempt to model the occurrence counts using a Zero Inflated Poisson distribution, which assigns a large probability mass to the value 0 and distributes the remaining probability mass over the rest of the occurrence counts according to a Poisson distribution.

[Figure 6. The occurrence of a typical word ("result") across different documents in our test collection: the empirical density of occurrence counts 0 through 7, with most of the mass at the zero count.]

The probability density function of the Zero Inflated Poisson distribution is given by

$$P(k) = (1-\alpha)\delta_k + \alpha \frac{\lambda^k e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \dots \qquad (5)$$

where

$$\delta_k = \begin{cases} 1, & \text{if } k = 0 \\ 0, & \text{otherwise} \end{cases}$$

As we shall demonstrate in Section 3.3, this distribution does not fit text data as well as the Negative Binomial or Katz's distribution.

3.2 Katz’s K-mixture model

This distribution, proposed by Katz [14], although simple to work with, has been shown to model the occurrences of words in documents better than many other distributions, such as the Poisson and Two Poisson, and about as well as the more complex Negative Binomial distribution [17]. Katz's distribution assigns the following probability to the event that word $i$ occurs $k$ times in a document²:

$$P(k) = (1-\alpha)\delta_k + \frac{\alpha}{\beta+1}\left(\frac{\beta}{\beta+1}\right)^k \qquad (6)$$

where $\delta_k = 1$ if $k = 0$ and $0$ otherwise.

The MLE estimates of the parameters $\alpha$ and $\beta$ are:

$$\hat{\beta} = \frac{cf - df}{df} \qquad (7)$$

$$\hat{\alpha} = \frac{1}{\beta} \times \frac{cf}{N} \qquad (8)$$

where
$cf$ = collection frequency = the number of times word $i$ occurs in the document collection, obtained by adding up the number of times the word occurs in each document. Here, a collection can be whatever we deem our universe of documents to be: the entire corpus or a subset of it.
$df$ = document frequency = the number of documents in the collection that contain word $i$.

From (6) it follows that

$$P(0) = 1 - \alpha + \frac{\alpha}{\beta + 1} = 1 - \frac{df}{N} \qquad (9)$$

$= 1 - \Pr(\text{the word occurs in a document})$
$= \Pr(\text{the word does not occur in a document})$

It also follows that

$$P(k) = \frac{\alpha}{\beta + 1}\left(\frac{\beta}{\beta + 1}\right)^k, \qquad k = 1, 2, \dots \qquad (10)$$

Substituting $p$ for $\frac{\beta}{\beta + 1}$, we have

$$P(k) = \alpha (1 - p) p^k \qquad (11)$$

Let us define a parameter $p_0$ as

$$p_0 = P(0) \qquad (12)$$

Using (7) we find that

$$p = \frac{(cf - df)/df}{cf/df} = \frac{cf - df}{cf} \qquad (13)$$

$$= \frac{\Pr(\text{the word repeats in a document})}{\Pr(\text{the word occurs in a document})} = \frac{\Pr(\text{the word repeats} \cap \text{the word occurs})}{\Pr(\text{the word occurs})} = \Pr(\text{the word repeats} \mid \text{the word occurs})$$

2. In this section we shall discuss the case of one word, the $i$th word. Hence, we shall drop the subscript $i$ from the equations and expressions.


Hence, $1-p$ can be interpreted as the probability of the word occurring only once. Or, it can be thought of as a scaling factor that makes (11) and (12) together a valid probability density function.

We can write Expression (6) for $k = 0$, using $p$, as

$$P(0) = (1-\alpha) + \alpha(1-p) = 1 - \alpha + \alpha - \alpha p$$

Hence, $\alpha$ in terms of $p_0$ and $p$ is

$$p_0 = 1 - \alpha p \;\Rightarrow\; \alpha p = 1 - p_0 \;\Rightarrow\; \alpha = \frac{1 - p_0}{p} \qquad (14)$$

Expression (11) can now be written as

$$P(k) = (1 - p_0)(1-p)p^{k-1} \qquad (15)$$

when $k > 0$.

Using Expressions (12) and (15), we can fully specify Katz's distribution. The two parameters are $p_0$ and $p$, which can be estimated as (see Expressions 9 and 13)

$$\hat{p_0} = 1 - \frac{df}{N} \qquad (16)$$

and

$$\hat{p} = \frac{cf - df}{cf} \qquad (17)$$

It can be shown that if a distribution is defined by Expressions (12) and (15), then the estimates (16) and (17) are the MLE of the parameters $p_0$ and $p$ (see Appendix A).
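A minimal sketch of the estimators (16)-(17) and the density given by (12) and (15); this is our own illustration, and the counts in the usage lines are hypothetical.

    def katz_estimates(cf, df, N):
        # MLE of Katz's parameters, Expressions (16) and (17).
        p0 = 1.0 - df / N        # P(word does not occur in a document)
        p = (cf - df) / cf       # P(word repeats | word occurs)
        return p0, p

    def katz_pmf(k, p0, p):
        # P(word occurs k times in a document), Expressions (12) and (15).
        if k == 0:
            return p0
        return (1.0 - p0) * (1.0 - p) * p ** (k - 1)

    # A word occurring 150 times in 100 of 1000 documents (hypothetical counts):
    p0, p = katz_estimates(cf=150, df=100, N=1000)
    print(p0, p)                 # 0.9, 0.3333...
    print(katz_pmf(1, p0, p))    # 0.0666...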

3.3 Fitness comparison

We estimated the parameters of the Zero Inflated Poisson and the Negative Binomial using the method of moments, and the parameters of Katz's distribution using the Maximum Likelihood Estimation (MLE) method. The reason for using the method of moments rather than MLE is that for the Negative Binomial and Zero Inflated Poisson distributions the MLE can only be found numerically, which is computationally too complex for our task of incremental clustering. One could still use numerical methods to determine the MLEs of the parameters, which admittedly have better properties, if one were willing to pay the cost in terms of delay. In this work we limit ourselves to estimates that have closed-form expressions and can be computed efficiently, because our goal is to carry out incremental document clustering in real time.

3.3.1 Zero Inflated Poisson. If the probability density function of a Zero Inflated Poisson distribution is given in the form of Expression (5), then the method of moments estimates of its parameters $\alpha$ and $\lambda$ are

$$\hat{\lambda} = \frac{\operatorname{Var}(X)}{\bar{X}} + \bar{X} - 1 \qquad (18)$$

and

$$\hat{\alpha} = \frac{\bar{X}}{\hat{\lambda}} \qquad (19)$$

3.3.2 Negative Binomial. For the Negative Binomial distribution, the parameters $p$ and $r$ can be estimated as

$$\hat{r} = \frac{\bar{X}^2}{\operatorname{Var}(X) - \bar{X}} \qquad (20)$$

$$\hat{p} = \frac{\bar{X}}{\operatorname{Var}(X)} \qquad (21)$$

For Katz's distribution we used Expressions (16) and (17) to estimate the parameters $p_0$ and $p$.
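A sketch of these method-of-moments estimators; here xs is the vector of per-document occurrence counts of one word, and the use of the population variance is our assumption, as the text does not specify a variance estimator.

    import statistics

    def zip_moment_estimates(xs):
        # Expressions (18)-(19): lambda = Var(X)/mean + mean - 1, alpha = mean/lambda.
        mean = statistics.fmean(xs)
        var = statistics.pvariance(xs)
        lam = var / mean + mean - 1.0
        return mean / lam, lam              # (alpha, lambda)

    def nb_moment_estimates(xs):
        # Expressions (20)-(21): r = mean^2/(Var - mean), p = mean/Var.
        # Requires Var(X) > mean (over-dispersed counts).
        mean = statistics.fmean(xs)
        var = statistics.pvariance(xs)
        return mean ** 2 / (var - mean), mean / var   # (r, p)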


We evaluated the fitness of these three distributions by computing the probabilities of the word occurrences using the estimated parameters, on three different datasets. For each dataset we selected the top 100 terms by their $cf \times \log(N/df)$ score. A distribution that has a higher likelihood than another can be considered a better fit to the data. For each term a pairwise comparison of the fitness of the different distributions was carried out in this manner. The results are shown in the form of three dominance matrices in Table 1. Each cell records the number of terms for which the row distribution has a 10% or higher likelihood than the column distribution.

dataset: classic
             NB   Katz's  ZIP
    NB        0     55     92
    Katz's   41      0     96
    ZIP       7      4      0

dataset: tr41
             NB   Katz's  ZIP
    NB        0     41     98
    Katz's   58      0     98
    ZIP       2      2      0

dataset: k1a
             NB   Katz's  ZIP
    NB        0     63     98
    Katz's   35      0     98
    ZIP       2      2      0

Table 1. Likelihood comparisons: each cell counts the terms for which (likelihood of row distribution) > (likelihood of column distribution) × 1.1.

It can be observed from the table that Katz's distribution is not only easier to work with, as we will see in Section 4; it also fits better than the Zero Inflated Poisson (ZIP) and gives a fit comparable to the Negative Binomial (NB) distribution.

4 Algorithms for text

4.1 COBWEB: when attribute values follow Katz’s distribution

4.1.1 Category utility

Using words as attributes, we can derive the Category Utility function assuming that word occurrences follow Katz's distribution. For reference, the Category Utility formula as given in Cobweb is

$$\frac{1}{K}\sum_k P(C_k)\left[\sum_i \sum_j P(A_i = V_{i,j} \mid C_k)^2 - P(A_i = V_{i,j} \mid C_p)^2\right]$$

Notice that for each attribute indexed $i$ we need to compute

$$\sum_j P(A_i = V_{i,j} \mid C_k)^2 - P(A_i = V_{i,j} \mid C_p)^2 \qquad (22)$$

where $j$ is an index over the values of attribute $i$. In this case $V_{i,j}$ takes values $0, 1, 2, \dots$ because we are working with count data.

Hence, the first part of Expression (22) can be written as

$$CU_{i,k} = \sum_{f=0}^{\infty} P(A_i = f \mid C_k)^2 \qquad (23)$$

where we use $CU_{i,k}$ to refer to the contribution of attribute $i$ towards the Category Utility of cluster $k$.

Substituting Expressions (12) and (15) in Expression (23), we obtain

$$CU_{i,k} = \sum_{f=0}^{\infty} P(A_i = f \mid C_k)^2 = \frac{1 - 2p_0(1 - p_0) - p(1 - 2p_0)}{1 + p} \qquad (24)$$

Substituting the estimates of $p_0$ and $p$ from Expressions (16) and (17) in Expression (24) and simplifying, we get

$$CU_{i,k} = \sum_{f=0}^{\infty} P(A_i = f \mid C_k)^2 = 1 - \frac{2\left(df \, N - \frac{cf \, df^2}{2\, cf - df}\right)}{N^2} \qquad (25)$$

where $df$, $cf$, and $N$ are counted within category $k$.

Expression (25) specifies how to calculate the Category Utility contribution of an attribute in a category. Hence, the Category Utility of the Classit algorithm, when the distribution of attributes follows Katz's model, is given by

$$CU_p = \frac{1}{K}\sum_k P(C_k)\left[\sum_i CU_{i,k} - \sum_i CU_{i,p}\right] \qquad (26)$$

where $CU_{i,k}$ is given by Expression (25).
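Expression (25) translates directly into code; this sketch follows our reading of the formula, with $df$, $cf$ and $N$ counted within the cluster in question.

    def katz_cu_contribution(df, cf, N):
        # CU_{i,k} of Expression (25):
        #   df: documents in the cluster that contain the word,
        #   cf: total occurrences of the word in the cluster,
        #   N : number of documents in the cluster.
        return 1.0 - 2.0 * (df * N - cf * df ** 2 / (2.0 * cf - df)) / N ** 2

For example, with N = 1000, df = 100 and cf = 150 this gives 0.815, the same value obtained by evaluating (24) with $\hat{p_0} = 0.9$ and $\hat{p} = 1/3$.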

4.2 Cobweb: when attribute values follow Negative Binomial distribution

The probability density function of the Negative Binomial distribution is

$$P(x) = \binom{x+r-1}{r-1} p^r (1-p)^x \qquad (27)$$

where $p$ and $r$ are the parameters of the distribution, which are to be estimated from the data.

4.2.1 Category utility

Substituting Expression (27) in (23), we obtain the contribution of a word in a child cluster towards Category Utility:

$$CU_{i,k} = \sum_{x=0}^{\infty} \left[\frac{(x+r-1)!}{x!\,(r-1)!}\, p^r (1-p)^x\right]^2 \qquad (28)$$

This expression cannot be reduced to any simpler form, although it can be written using a hypergeometric function in the following manner:

$$CU_{i,k} = p^{2r}\, {}_2F_1\!\left(r, r; 1; (1-p)^2\right) \qquad (29)$$

One can use a library, such as the one available with Mathematica, to numerically evaluate ${}_2F_1(r, r; 1; (1-p)^2)$. In our experience this computation is three orders of magnitude more resource intensive than computing (25), the equivalent expression for Katz's distribution. As we described in Section 3.3, in this work we shall limit ourselves to methods that let us carry out incremental clustering in real time, i.e., in the time available between the arrivals of two documents.
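For comparison, a sketch of Expression (29) using SciPy's hyp2f1 as a stand-in for the Mathematica routine mentioned above (our substitution):

    from scipy.special import hyp2f1

    def nb_cu_contribution(r, p):
        # Expression (29): p^(2r) * 2F1(r, r; 1; (1-p)^2).
        # The series converges since 0 < (1-p)^2 < 1 for 0 < p < 1.
        return p ** (2 * r) * hyp2f1(r, r, 1.0, (1.0 - p) ** 2)

    # Sanity check: r = 1 is the geometric case, where the sum of squared
    # probabilities is p^2 / (1 - (1-p)^2); both give 1/3 at p = 0.5.
    print(nb_cu_contribution(1.0, 0.5))   # 0.3333...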

For this reason, and for the reasons cited in Sections 3.1 and 3.3, we shall fully explore only Katz's distribution and the original Classit algorithm based on the Normal distribution in our work.

5 Cluster Evaluation Methods

5.1 Evaluating the clusters

One commonly used cluster quality measure is the purity of the clustering solution. The purity of a cluster is defined as

$$p_k = \frac{\max_c \{CF_k(c)\}}{N_k} \qquad (30)$$

where
$c$ is the index of classes (a class is a pre-specified group of items),
$k$ is the index of clusters (a cluster is an algorithm-generated group of items),
$CF_k(c)$ = the number of items from class $c$ occurring in cluster $k$, i.e., the frequency of class $c$ in cluster $k$,
$N_k$ = the number of items in cluster $k$.

The purity of the entire collection of clusters can be found by taking the average of the cluster purities. There are two kinds of averages one might consider: weighted and unweighted. If we assign each cluster a weight proportional to its size and take the weighted average, it is called the micro average, since each document gets equal weight. If we instead give equal weight to each cluster and take the arithmetic average, it is called the macro average. The first is a document-level evaluation, while the second is a cluster-level evaluation. Both purities lie between 0 and 1.

The drawback of relying only on purity to evaluate the quality of a set of clusters becomes apparent in hierarchical clustering. When we collect clusters occurring at or near the lowest level of the hierarchy, we get clusters with very few documents in them, and hence clusters with high purity scores. In the limit, at the lowest level there are $N$ clusters, each containing only one item; hence $\max_c \{CF_k(c)\}$ is 1 for each $k \in \{1, \dots, N\}$, resulting in a purity score of 1. We get larger clusters at higher levels of the hierarchy, which are more likely to contain documents belonging to different classes, leading to lower purity scores. This illustrates how the purity score can be misleading when the number of clusters formed differs from the number of classes in the dataset: if we make more clusters than there are classes in the dataset we bias the purity score up, and if we make fewer we bias it down.

To correct this problem, we define another score of the clustering solution in the following manner:

$$r_c = \frac{\max_k \{CF_k(c)\}}{N_c}$$

where $N_c$ is the size of class $c$; the other variables are as defined for the purity score in Expression (30). Here also we can compute the micro average or the macro average to obtain a score for the entire solution.

This is a purity computation with the clustering solution treated as the true classes of the data items and the human-generated classes treated as the solution to be evaluated. Using this measure we evaluate how well the "true" classes in the dataset are represented in the clusters formed.

These metrics, $p_k$ and $r_c$, have interpretations that parallel the precision and recall metrics, respectively, of the information retrieval literature. Precision is the fraction of the retrieved documents that are relevant. Our $p_k$ has the precision interpretation when we think of a cluster as retrieving documents from the class to which the majority of its elements belong. Recall, on the other hand, is the fraction of all the relevant documents that are retrieved. In the framework we described for $p_k$, our metric $r_c$ has the recall interpretation.

Taking a cue from the $F$ measure commonly used in the IR literature to combine precision and recall, we compute the $F$ score as the harmonic mean of the $P$ and $R$ values:

$$\frac{1}{F} = \frac{1}{2}\left(\frac{1}{P} + \frac{1}{R}\right) \qquad (31)$$

The $F$ score is the metric by which we shall measure the quality of our clusters.
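A small sketch (our own names and data representation) of the macro-averaged $p_k$, $r_c$ and the $F$ score of Expression (31), for clusters given as lists of item ids and true classes given as a dict from item id to label:

    from collections import Counter

    def purity_recall_f(clusters, labels):
        # Macro-averaged purity (Expression 30), recall r_c, and F (Expression 31).
        class_sizes = Counter(labels.values())
        purities = []
        best_per_class = Counter()
        for members in clusters:
            freq = Counter(labels[i] for i in members)
            purities.append(max(freq.values()) / len(members))
            for c, n in freq.items():
                best_per_class[c] = max(best_per_class[c], n)
        P = sum(purities) / len(purities)
        R = sum(best_per_class[c] / n for c, n in class_sizes.items()) / len(class_sizes)
        return P, R, 2 * P * R / (P + R)   # harmonic mean of P and R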

5.2 Evaluating the hierarchy

Another question of interest when evaluating a hierarchical clustering algorithm is: to what extent does the generated cluster hierarchy agree with the class hierarchy present in the data? As we shall describe in Section 6, the datasets we have used in our experiments have a hierarchy of classes and provide us with a rare opportunity to evaluate our generated cluster hierarchy for correctness. As a reminder, a class is a document category that has been provided to us as part of the dataset. It is what the documents have been labeled with by an external entity, and it helps us evaluate how good our algorithm is. A cluster, on the other hand, is a grouping of documents that our algorithm generates by grouping together the documents it considers similar.


Matching the generated cluster hierarchy with the existing class hierarchy is a non-trivial task. Instead, in this work we focus on measuring how often the sibling clusters in the generated hierarchy have sibling classes, i.e., how often the child clusters of a parent cluster have child classes of the class that is assigned to the parent cluster. For instance, consider the generated cluster subtree shown in Figure 7.

K0
├─ K1 (C1.1)
├─ K2 (C1.2)
├─ K3 (C2.1)
└─ K4 (C1.1.3)

Parent class frequencies: C1: 2, C2: 1, C1.1: 1

After relabeling each child with the parent of its class:

K0 (C1)
├─ K1 (C1)
├─ K2 (C1)
├─ K3 (C2)
└─ K4 (C1.1)

Figure 7. A sample subtree with its child nodes. Class labels of the child nodes are given in parentheses.

In this case we have already determined the classes of the child clusters³. To be able to measure whether they are filed under the correct class, we need to find the class of the parent cluster. To do this we tabulate the parent classes of the child clusters and assign the most frequent parent class to the parent cluster $K_0$. So, in this case the parent cluster $K_0$ gets the label $C_1$. Then we evaluate this cluster configuration as if $K_0$ were merely a cluster of four other smaller entities, each of which has a class label equal to the parent class of the one it really has. This is equivalent to saying that as long as the child clusters of $K_0$ have child classes of the class of $K_0$, i.e., $C_1$ in this case, they are correct. Clusters with all other class labels that occur under that parent cluster are incorrect classifications by the algorithm; they should have been somewhere else.

So, in the above example the precision of $K_0$ would be $\frac{2}{4} = 0.5$. We compute this precision for all the internal nodes of the cluster tree and take their average (both micro average and macro average) to compute the overall precision of the hierarchy. This gives us a measure of how much the generated cluster hierarchy agrees with the class hierarchy present in the data. We call it the sibling precision score of the cluster hierarchy.

We needed to make a few decisions while evaluating the hierarchy in this manner. For instance, we used only the internal nodes to compute the precision of any node. This is because leaf nodes often co-exist with internal nodes as children of another internal node, and if we compute precision based on leaf nodes, i.e., single documents, then we are mixing the precision of the kind we described in Section 5.1 with the precision of the hierarchy, and it is not clear how we should interpret the resulting number. Another decision that needed to be made was what to do if a child cluster has a broadest class label assigned to it. Since we cannot find a parent class for such classes, we explored the possibilities of

i. dropping such child clusters from our evaluation, and

ii. treating them as their own parent classes, since they are the broadest-level classes.

In our experiments the results do not change much with either strategy. So, we shall report only the results we got by treating the broadest classes as their own parent classes.

3. At the lowest level each cluster has only one document, and its class can be read from the data directly.
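The sibling precision computation can be sketched as follows; the dotted label format (e.g., "C1.1.3" with parent "C1.1") is our assumption, and broadest labels are treated as their own parents, following the strategy reported above.

    from collections import Counter

    def parent_class(label):
        # "C1.1.3" -> "C1.1"; a broadest label such as "C1" is its own parent.
        return label.rsplit(".", 1)[0] if "." in label else label

    def sibling_precision(internal_nodes):
        # internal_nodes: one list of child-cluster class labels per internal node.
        # Returns (macro average, micro average) sibling precision.
        per_node, correct, total = [], 0, 0
        for child_labels in internal_nodes:
            parents = Counter(parent_class(l) for l in child_labels)
            hits = parents.most_common(1)[0][1]   # most frequent parent class
            per_node.append(hits / len(child_labels))
            correct += hits
            total += len(child_labels)
        return sum(per_node) / len(per_node), correct / total

    # The Figure 7 example: children labeled C1.1, C1.2, C2.1, C1.1.3.
    print(sibling_precision([["C1.1", "C1.2", "C2.1", "C1.1.3"]]))  # (0.5, 0.5)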

6 Experiment setup and results

We evaluate our algorithm over two text document collections, Reuters-RCV1 and Ohsumed (88-91). These datasets were picked because of the presence of human-labeled hierarchical class labels and the reasonably large number of documents in them. They are described in more detail in the following sections.

6.1 Reuters-RCV1

Incremental clustering algorithms process the data points only once and in the order in which they are presented, and the order in which data points occur in the dataset influences the clusters produced⁴. Therefore, it is imperative that we test the incremental clustering algorithms with an ordering of the data points similar to the one they are expected to receive during deployment. As we envision the two algorithms in this work being used to process streams of text documents from newswire, newsgroups, Blogs, etc., the natural ordering among the documents is determined by the time at which they are received. Therefore, we need a document dataset in which the time order of the documents is preserved. Reuters-RCV1 [15] is one such dataset.

Reuters-RCV1 dataset is a collection of over 800,000 English newswire articles collected from

Reuters over a period of one year (20th Aug 1996 to 19th Aug 1997). These documents have

been classiﬁed by editors at Reuters simultaneously under three category hierarchies: “Topic”

hierarchy, “Industry” hierarchy and “Region” hierarchy. The Topic hierarchy contains four cate-

gories at depth one of the tree, namely “Corporate/Industrial”, “Economics”, “Govern-

ment/Social” and “Market”. There are ten such categories in the Industry hierarchy. Some of

them are “Metals and Minerals”, “Construction”, etc. The Region hierarchy has geographical

locations, such as country names, and economic/political groups as categories. There are no

ﬁner sub-categories in the Region hierarchy.

Topic Root: Corporate/Industrial, Economics, Government/Social, Market
Region Root: MEX, USA, UK, INDIA, ...

Figure 8. Three classification hierarchies.

The classification policy, also called The Coding Policy, requires that each document have at least one Topic category and at least one Region category assigned to it. It also requires that each document be assigned to the most specific possible subcategory in a classification hierarchy. A document might be, and often is, assigned more than one category from any one of the three category hierarchies. The documents are present in the dataset in the temporal order in which they were collected.

4. However, the ideal incremental clustering algorithm is expected to be insensitive to the order in which it encounters the data points. Such a characteristic is partly achieved by the Cobweb algorithm through its split and merge operators.


number of documents        62935
number of unique words     93792
average document length    222
number of classes          259

Table 2. RCV1 dataset (first 30 days). Classes are the region classes.

6.1.1 Evaluating clusters

Experiment setup. For our experiments, articles from the first 30 days of the Reuters-RCV1 dataset were used; there were 62935 such articles. Stop words were removed from the documents and the terms were stemmed. The most informative terms were then selected by their $cf \times \log(N/df)$ scores to represent the documents. We repeated the experiments using 100 to 800 terms at a step size of 100.
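The term selection step can be sketched as follows (our own code; docs is assumed to be a list of token lists with stop words already removed and terms stemmed):

    import math
    from collections import Counter

    def top_terms(docs, vocab_size):
        # Rank terms by cf * log(N / df) and keep the top vocab_size of them.
        N = len(docs)
        cf, df = Counter(), Counter()
        for tokens in docs:
            cf.update(tokens)           # collection frequency
            df.update(set(tokens))      # document frequency
        score = {t: cf[t] * math.log(N / df[t]) for t in cf}
        return sorted(score, key=score.get, reverse=True)[:vocab_size]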

We have evaluated the clustering solutions for the correctness of the assignment of documents to clusters using the region categories, because (i) in the region class hierarchy all the assigned classes belong to one level, and (ii) fewer articles are assigned multiple region class labels than other kinds of class labels, suggesting that the region classes in the dataset do not overlap much. This allows us to evaluate our algorithm on a dataset with well-defined classes. There were 259 region categories present in the selected documents, so we extracted 259 clusters from the dendrogram constructed by the clustering algorithms and measured their quality using the Region categories of the documents.

Results and Discussion. The results of the clustering exercise are given in Table 3. We can see that the Katz's distribution based Classit algorithm dominates the Normal distribution based Classit algorithm across varying vocabulary sizes in both the micro and the macro average of the $F$ scores.

V      K (micro)  N (micro)  K (macro)  N (macro)
100    0.46       0.31       0.83       0.60
200    0.45       0.43       0.81       0.74
300    0.45       0.33       0.85       0.67
400    0.45       0.42       0.79       0.74
500    0.45       0.36       0.84       0.69
600    0.45       0.42       0.82       0.76
700    0.45       0.39       0.81       0.74
800    0.45       0.30       0.83       0.61

Table 3. Cluster quality comparison on RCV1 data ($F$ scores; V = vocabulary size, K = Katz-based Classit, N = Normal-based Classit).


[Figure 9. Cluster quality comparison on RCV1 data: Katz vs. Normal over vocabulary sizes 100 to 800. The left panel shows the micro average of the F score and the right panel shows the macro average.]

As we can see, the Katz-based Classit algorithm consistently performs better than the Normal-based Classit algorithm on this dataset. However, we are cautious in interpreting the micro-averaged F score. Both of these algorithms produce clusters of widely different sizes: a few big clusters, a few more clusters of intermediate size, and a lot of smaller clusters. The micro-averaged F score is affected by this, because performance over a few large clusters dominates the entire metric. This explains the flat nature of the plot of the micro-averaged F score for Katz-based Classit: the larger clusters generated by the algorithm do not change much over different vocabulary sizes, so the micro-averaged F score remains nearly constant. Therefore, we also compute the macro-averaged F score, where each cluster gets equal weight, and find that Katz-based Classit performs better than Normal-based Classit over a wide range of vocabulary sizes.

6.1.2 Evaluating hierarchy

We evaluate the generated cluster hierarchy using the topic hierarchy of classes⁵ as our reference. There are 63 different topic codes in the documents we used, whereas the entire topic hierarchy contains 103 topic codes.

We pre-processed the documents using the steps described in the previous section. Evaluated

the accuracy of the parent/child cluster conﬁgurations as described in Section 5.2. The results

are given in Table 4.

V     Normal (macro avg)  Katz (macro avg)  Normal (micro avg)  Katz (micro avg)
100   0.925               0.956             0.814               0.959
200   0.924               0.935             0.797               0.943
300   0.926               0.874             0.825               0.871
400   0.920               0.866             0.814               0.789
500   0.918               0.896             0.812               0.871
600   0.922               0.841             0.814               0.989
700   0.929               0.836             0.846               0.653
800   0.918               0.855             0.832               0.718

Table 4. Evaluation of the cluster hierarchy using RCV1 data.

The values in the table cells are the average sibling precision of internal nodes of the cluster hierarchy. As we can see, there is no clear winner in this case, although both algorithms do reasonably well in assigning sibling classes under the same cluster. However, we must be careful to interpret these values as the correctness with which sibling classes are grouped together, not as recovery of the entire original class hierarchy.

5. This can be obtained from [15], Appendix 2.


6.2 OHSUMED (88-91)

The Ohsumed test collection is a set of 348,566 abstracts collected from 270 medical journals over a period of 5 years. Each abstract is annotated by human observers with MeSH (Medical Subject Heading) labels, which indicate the topic of the abstract. Unlike the RCV1 dataset, these documents are not in temporal order. Another property of this dataset is that, being from a specific subject area, the documents draw on a much smaller vocabulary. Due to the presence of human-assigned MeSH keywords over such a large collection, this dataset provides us with an opportunity to evaluate our algorithm over a large dataset and against real topic labels.

number of documents        196555
number of unique words     16133
average document length    167
number of classes          14138

Table 5. Ohsumed dataset (88-91)

6.2.1 Evaluating clusters

Experiment Setup We used the Ohsumed 88-91 dataset from the TREC-9 filtering track to evaluate our algorithm for the correctness of assignment of documents to classes. We selected only those articles for which both the MeSH labels and the abstract text were present. There were 196,555 such articles. As with the RCV1 dataset, the most informative words were selected using the cf × log(N/df) scores of the words. We repeated the clustering exercise using 25 to 200 words, at a step size of 25. To determine the number of different topics present in this dataset one could look at the unique MeSH labels present in it. But, as there are tens of thousands of such labels, we used fixed numbers of clusters to evaluate the algorithms (see Table 6).

Results and discussion The F-score results of the experiments are given in Table 6.

k →            5      10     20     40     80     160    320    640

V=25    µ K/N  57/55  57/53  57/53  57/53  55/38  55/61  36/34  27/34
        M K/N  62/62  60/61  63/62  62/62  60/54  37/54  49/52  43/52
        winner ?      ?      K      ?      K      N      ?      N

V=50    µ K/N  70/57  70/57  69/57  69/57  69/57  69/57  48/51  47/51
        M K/N  74/65  75/63  75/65  76/69  76/70  76/71  60/65  59/65
        winner K      K      K      K      K      K      N      N

V=75    µ K/N  70/70  70/70  70/70  69/70  69/70  69/70  69/39  69/35
        M K/N  71/71  69/70  73/77  76/80  77/81  78/82  78/56  79/53
        winner ?      ?      ?      N      N      N      K      K

V=100   µ K/N  70/62  69/62  69/62  69/62  68/62  68/45  69/45  69/45
        M K/N  72/69  71/70  75/73  78/75  78/76  79/62  80/62  80/62
        winner K      K      K      K      K      K      K      K

V=125   µ K/N  71/61  71/61  69/61  69/61  69/61  53/61  53/47  53/46
        M K/N  74/68  76/68  77/71  78/72  80/74  68/74  69/61  69/60
        winner K      K      K      K      K      N      K      K

V=150   µ K/N  72/54  72/51  59/51  59/51  55/51  54/51  54/51  48/51
        M K/N  72/65  77/61  72/64  74/66  71/66  71/67  71/67  66/67
        winner K      K      K      K      K      K      K      N

V=175   µ K/N  71/52  71/51  71/51  71/51  59/51  58/43  54/43  54/41
        M K/N  74/64  78/62  81/64  83/66  75/67  74/60  71/60  71/58
        winner K      K      K      K      K      K      K      K

V=200   µ K/N  62/52  62/50  62/50  62/50  62/50  62/50  62/50  62/50
        M K/N  72/63  75/62  77/65  78/65  79/66  79/67  79/67  79/67
        winner K      K      K      K      K      K      K      K

Table 6. Cluster quality comparison on OHSUMED data at different numbers of clusters (k) and vocabulary sizes (V). The figures in the table are F-score × 100, reported as K/N pairs where K stands for Katz-Classit and N for the original (Normal-based) Classit. The µ rows hold the micro average and the M rows the macro average of the F-score. In the winner rows, K marks cells where Katz-Classit performs better, N marks cells where Normal-Classit performs better, and ? marks cells with no clear winner.


We can see from the table that Normal-Classit is most competitive when the vocabulary size is small and the number of clusters formed is large. For all other settings, i.e., when the vocabulary used is larger or when the number of clusters formed is smaller, Katz-Classit performs better. This shows that the Katz-Classit algorithm is more robust, as it performs well across a much larger range of parameter values.

The performance of both algorithms suffers when we create more clusters, which makes sense because there are then fewer features on which to distinguish between clusters.

6.2.2 Evaluating hierarchy

The MeSH labels present in the Ohsumed collection have a hierarchical structure⁶. This provides us with another opportunity to evaluate the correctness of our hierarchy. This class hierarchy is much larger than the topic hierarchy of the RCV1 dataset: there are 42610 different MeSH labels in all. Each MeSH label has a code attached to it, and the class hierarchy information can be read directly from this code. For instance, the first three records of the 2005 "ASCII MeSH collection" read

Body Regions;A01

Abdomen;A01.047

Abdominal Cavity;A01.047.025

...

Figure 10. First three lines of the MeSH labels file (filename: mtrees2005.bin)

This says that the topic labeled "Abdominal Cavity" (A01.047.025) is a child of the topic with code A01.047, which we can find from the file to be "Abdomen", which in turn is a child of the topic with code A01, which the file identifies as "Body Regions". These "."-separated topic codes let us easily find the parent topic of a label by dropping the last suffix of its code. Not all MeSH labels appear in our dataset: only about 14138 different MeSH labels are used in the document set we used for our experiments.
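As a concrete illustration of reading the hierarchy from these codes, here is a minimal sketch assuming the "Label;Code" line format shown in Figure 10; the function names are ours, not part of any MeSH tooling.

    # Minimal sketch: reading the MeSH hierarchy from an ASCII MeSH tree file
    # such as mtrees2005.bin, where each line is "Label;Code" and the parent
    # code is obtained by dropping the last "."-separated component.

    def load_mesh_tree(path):
        """Return a {code: label} mapping read from an ASCII MeSH tree file."""
        code_to_label = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                label, code = line.rsplit(";", 1)
                code_to_label[code] = label
        return code_to_label

    def parent_code(code):
        """Drop the last '.'-separated component; top-level codes have no parent."""
        return code.rsplit(".", 1)[0] if "." in code else None

    # With the three records quoted above:
    #   parent_code("A01.047.025") -> "A01.047"  ("Abdomen")
    #   parent_code("A01.047")     -> "A01"      ("Body Regions")
    #   parent_code("A01")         -> None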

Documents were pre-processed as described in the previous section. The entire cluster hierarchy was generated and the correctness of the hierarchy was evaluated as described in Section 5.2. The precision values are reported in Table 7.

V     Normal (macro avg)  Katz (macro avg)  Normal (micro avg)  Katz (micro avg)
25    0.786               0.795             0.626               0.749
50    0.781               0.831             0.667               0.784
75    0.790               0.857             0.654               0.831
100   0.801               0.888             0.742               0.891
125   0.828               0.939             0.788               0.976
150   0.847               0.935             0.812               0.963
175   0.876               0.910             0.859               0.858
200   0.894               0.958             0.819               0.919

Table 7. Evaluation of the cluster hierarchy using Ohsumed data.

Here again, both algorithms do reasonably well in grouping classes with common parents under the same cluster, with Katz-Classit appearing to hold an advantage over Normal-Classit⁶ across all vocabulary sizes. But we must be careful not to interpret these precision values as the closeness of the entire cluster hierarchy to the existing class hierarchy; rather, they measure the accuracy of the algorithms in placing sibling classes under the same parent cluster.

6. The entire collection of MeSH labels can be downloaded from the website of the National Institutes of Health (http://www.nlm.nih.gov). We have used the 2005 MeSH label collection for our purpose.

We also tracked the sibling precision score at different depths of the generated cluster tree (Figures 11 and 12).

Figure 11. Tracing the sibling precision over the height of the tree, for vocabulary sizes 25 and 75. Each pair of panels plots sibling precision against depth for Normal and Katz; the left panels show the macro average and the right panels the micro average.


Figure 12. Tracing the sibling precision over the height of the tree, for vocabulary sizes 125 and 175. Panels are laid out as in Figure 11.

These plots show the general trend at different vocabulary sizes. As we can see, there is considerable variation in the sibling precision across depths. Amidst these variations we can observe that the sibling precision is higher and more consistent at the nodes occurring in the lower layers of the tree. We also find that on these layers Katz-Classit usually performs better than Normal-Classit.

It is interesting to observe the general consistency at the lower levels of the tree and the lack of it at the higher levels. At the lower levels we have a large number of nodes in each layer. When we average the performance of each algorithm over this large number of nodes, we get a score that is robust to random mistakes. So, we get a consistent score from layer to layer and it is easier to see which algorithm does better. But it is not so at the higher


levels, where we have only a few nodes in each layer over which to average the score, so the average is more sensitive to random mistakes. Note that both the micro average and the macro average are sensitive to these random mistakes: a wrong node in the higher levels of the tree gets either a weight equal to the other nodes (macro average) or a weight proportional to the number of documents in it (micro average), and both of these weights are significant at those levels of the tree. This is why the plot of average sibling precision fluctuates so much at these levels, and why we do not get a clear winner across the layers in the upper part of the tree.

7 Conclusion

To our knowledge, this is the first attempt at incremental hierarchical clustering of text documents. We have evaluated an incremental hierarchical clustering algorithm, which is often used with non-text datasets, on text document datasets. We have also proposed a variation of the same algorithm that has more desirable properties when used for incremental hierarchical text clustering.

The variation of the Cobweb/Classit algorithm that we have demonstrated in this work uses Katz's distribution instead of the Normal distribution used in the original formulation of the Classit algorithm. Katz's distribution is more appropriate for word occurrence data, as has been shown in prior work [14] and empirically observed in our work. We have evaluated both algorithms over the Reuters-RCV1 dataset, which allows us to carry out the experiments in a scenario very similar to real life. We tested the algorithms by presenting them Newswire articles from the Reuters-RCV1 dataset in time order, and have shown that our algorithm performs consistently better than the Normal-based Classit algorithm, as measured by both the micro and macro average of the F-score, over a range of vocabulary sizes. We have also evaluated both algorithms using the Ohsumed 88-91 dataset and have found that Katz-Classit performs better except for a narrow range of parameter values with small vocabulary sizes and large numbers of clusters, where results are likely to be unreliable. This shows that the performance of Katz-Classit is more robust across broad parameter settings.

We have also proposed a way to evaluate the quality of the hierarchy generated by hierarchical clustering algorithms, by observing how often the child clusters of a cluster are assigned child classes of the class assigned to that cluster. We found that although both the existing algorithm and our proposed algorithm perform well on this metric, our algorithm performs marginally better on the Ohsumed dataset.

The most important contribution we think we have made in this work is the separation of the attribute distribution and its parameter estimation from the control structure of the Classit algorithm. Thus, one can use a new attribute distribution, which may be different from Normal or Katz but more appropriate for the data at hand, inside the well established control structure of the Classit algorithm to carry out incremental hierarchical clustering of a new kind of data. For instance, if the Negative Binomial were considered a better fit for the word distribution than Katz's distribution, and one could come up with an efficient way to estimate its parameters, it could be used within the framework of the existing Classit algorithm as demonstrated in this work. One can also experiment with a Bayesian approach to estimating the parameters of the distribution within this framework, which might lead to better results due to more reliable parameter estimates for clusters with a small number of documents. A sketch of this pluggable-distribution idea is given below.
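The following sketch (our own hypothetical class and method names, not the paper's code) wraps Katz's distribution behind a small interface that exposes only incremental updates and a per-attribute score; a Negative Binomial or Bayesian variant would implement the same interface. The closed-form score is an assumption of this sketch, obtained by applying the Cobweb-style expected score Σk P(k)² to Katz's distribution.

    # Illustrative sketch of separating the attribute distribution from the
    # Classit control structure. The control structure only needs `add` and
    # `expected_score` on each attribute; the distribution behind them can
    # be swapped freely.

    from abc import ABC, abstractmethod

    class AttributeDistribution(ABC):
        @abstractmethod
        def add(self, count):
            """Fold one observed term count into the sufficient statistics."""

        @abstractmethod
        def expected_score(self):
            """Per-attribute contribution to the category utility."""

    class KatzAttribute(AttributeDistribution):
        """Katz's distribution with the MLEs derived in Appendix A:
        p0 = 1 - df/N and p = (cf - df)/cf."""

        def __init__(self):
            self.n = 0    # documents seen so far (N)
            self.df = 0   # documents containing the term
            self.cf = 0   # total occurrences of the term

        def add(self, count):
            self.n += 1
            if count > 0:
                self.df += 1
                self.cf += count

        def expected_score(self):
            # Cobweb-style expected score sum_k P(k)^2, which for Katz's
            # distribution has the closed form below (our assumption, not a
            # formula quoted from the paper):
            #   p0^2 + (1 - p0)^2 (1 - p) / (1 + p)
            if self.n == 0:
                return 0.0
            p0 = 1.0 - self.df / self.n
            p = (self.cf - self.df) / self.cf if self.cf else 0.0
            return p0 ** 2 + (1 - p0) ** 2 * (1 - p) / (1 + p)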

Bibliography

[1] James Allan, Ron Papka, and Victor Lavrenko. On-line new event detection and tracking. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 37–45. ACM Press, 1998.

[2] A. Banerjee and J. Ghosh. Competitive learning mechanisms for scalable, incremental and balanced clustering of streaming texts. In Proceedings of the International Joint Conference on Neural Networks, volume 4, pages 2697–2702, Jul 2003.

[3] Abraham Bookstein and Don R. Swanson. A decision theoretic foundation for indexing. Journal of the American Society for Information Science, pages 45–50, Jan–Feb 1975.

[4] Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, 2002.

[5] P. Cheeseman and J. Stutz. Bayesian classification (AUTOCLASS): Theory and results. Advances in Knowledge Discovery and Data Mining, 1996.

[6] Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey. Scatter/Gather: A cluster-based approach to browsing large document collections. In Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Interface Design and Display, pages 318–329, 1992.

[7] George Doddington, Jaime Carbonell, James Allan, Jonathan Yamron, UMass Amherst, and Yiming Yang. Topic detection and tracking pilot study final report, Jul 2000.

[8] M. A. T. Figueiredo and A. K. Jain. Unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):381–396, March 2002.

[9] Douglas H. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139–172, 1987.

[10] Martin Franz, Todd Ward, J. Scott McCarley, and Wei-Jing Zhu. Unsupervised and supervised clustering for topic tracking. In SIGIR '01: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 310–317. ACM Press, 2001.

[11] Frederick Mosteller and David L. Wallace. Applied Bayesian and Classical Inference: The Case of The Federalist Papers. Springer Series in Statistics. Springer-Verlag, 1983.

[12] J. H. Gennari, P. Langley, and D. Fisher. Models of incremental concept formation. Journal of Artificial Intelligence, 40:11–61, 1989.

[13] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, 1999.

[14] Slava M. Katz. Distribution of content words and phrases in text and language modelling. Natural Language Engineering, 2(1):15–59, 1996.

[15] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.

[16] Xiaoyong Liu and W. Bruce Croft. Cluster-based retrieval using language models. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Language Models, pages 186–193, 2004.

[17] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, England, 2000.

[18] Padhraic Smyth. Clustering using Monte Carlo cross-validation. In Evangelos Simoudis, Jiawei Han, and Usama Fayyad, editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), page 126. AAAI Press, 1996.

[19] Yiming Yang, Tom Pierce, and Jaime Carbonell. A study of retrospective and on-line event detection. In SIGIR '98: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 28–36. ACM Press, 1998.

[20] Ya-Jun Zhang and Zhi-Qiang Liu. Refining web search engine results using incremental clustering. International Journal of Intelligent Systems, 19:191–199, 2004.

Appendix A

MLE of Katz’s distribution parameters

Katz's distribution is defined as:

\[
P(0) = p_0, \qquad P(k) = (1 - p_0)(1 - p)\, p^{k-1} \quad \text{for } k > 0 \tag{32}
\]

where $p_0$ and $p$ are the parameters of the distribution.
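For concreteness, the probability mass function in Expression (32) can be evaluated directly; this one-line sketch is ours, not code from the paper.

    # Probability mass function of Katz's distribution, Expression (32).
    def katz_pmf(k, p0, p):
        """P(k) for k = 0, 1, 2, ... with parameters p0 and p."""
        return p0 if k == 0 else (1 - p0) * (1 - p) * p ** (k - 1)

    # e.g. with p0 = 0.4, p = 0.5: P(0) = 0.4, P(1) = 0.3, P(2) = 0.15, ...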

Let us discuss the distribution of a single word or term. The data are the counts of occurrences of the word in each document in the text collection. So, if we have $N$ documents in the dataset, we have $N$ observations, each of which is a count.


Let us also define $n_k$ to be the number of observations equal to $k$, i.e., the number of documents in which the term occurs $k$ times, and let us assume the maximum value of $k$ is $K$. Hence,

• document frequency $df = N - n_0 = \sum_{k=1}^{K} n_k$, and

• collection term frequency $cf = \sum_{k=1}^{K} k\, n_k$.

The likelihood $L(p, p_0)$ of the parameters given the data is

\[
L(p, p_0) = \prod_{i=1}^{N} \Pr(\text{the word occurs } x_i \text{ times in document } i)
          = \prod_{i=1}^{N} \left[ \delta(x_i)\, p_0 + \bigl(1 - \delta(x_i)\bigr)(1 - p_0)(1 - p)\, p^{x_i - 1} \right]
          = p_0^{\,n_0} \prod_{k=1}^{K} (1 - p_0)^{n_k} (1 - p)^{n_k} \left( p^{k-1} \right)^{n_k}
\]

where $\delta(\cdot)$ is the indicator function that is 1 if its argument is zero and 0 otherwise.

The log-likelihood is

\[
LL(p, p_0) = n_0 \log(p_0) + \sum_{k=1}^{K} \bigl[ n_k \log(1 - p_0) + n_k \log(1 - p) + n_k (k - 1) \log(p) \bigr]
\]

Taking the partial derivative of the log-likelihood with respect to $p_0$ and equating it to 0:

\[
\frac{\partial LL}{\partial p_0} = \frac{n_0}{\hat{p}_0} - \sum_{k=1}^{K} \frac{n_k}{1 - \hat{p}_0} = 0
\]
\[
\Rightarrow \frac{n_0}{\hat{p}_0} = \frac{1}{1 - \hat{p}_0} \sum_{k=1}^{K} n_k = \frac{N - n_0}{1 - \hat{p}_0}
\quad \Rightarrow \quad \frac{1 - \hat{p}_0}{\hat{p}_0} = \frac{N - n_0}{n_0}
\quad \Rightarrow \quad \frac{1}{\hat{p}_0} - 1 = \frac{N}{n_0} - 1
\]
\[
\Rightarrow \hat{p}_0 = \frac{n_0}{N} = \frac{N - df}{N} = 1 - \frac{df}{N} \tag{33}
\]

We can find the MLE of $p$ in a similar manner.

\[
\frac{\partial LL}{\partial p} = \sum_{k=1}^{K} \left[ \frac{-n_k}{1 - \hat{p}} + \frac{n_k (k - 1)}{\hat{p}} \right] = 0
\]
\[
\Rightarrow 0 = \frac{1}{\hat{p}} \sum_{k=1}^{K} n_k (k - 1) - \frac{1}{1 - \hat{p}} \sum_{k=1}^{K} n_k
= \frac{1}{\hat{p}} \left( \sum_{k=1}^{K} k\, n_k - \sum_{k=1}^{K} n_k \right) - \frac{1}{1 - \hat{p}} \sum_{k=1}^{K} n_k
\]
\[
\Rightarrow 0 = \frac{1}{\hat{p}}\,(cf - df) - \frac{df}{1 - \hat{p}}
\quad \Rightarrow \quad \frac{1 - \hat{p}}{\hat{p}} = \frac{df}{cf - df}
\quad \Rightarrow \quad \frac{1}{\hat{p}} = \frac{cf}{cf - df}
\quad \Rightarrow \quad \hat{p} = \frac{cf - df}{cf} \tag{34}
\]

Expressions (33) and (34) are the MLEs of the parameters of Katz's distribution defined in Expression (32).
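As a small worked sketch of these estimators (our own code, directly following Expressions (33) and (34)):

    # Compute the Katz MLEs from the per-document occurrence counts of a term.
    def katz_mle(counts):
        """counts: occurrences of the term in each of the N documents."""
        n = len(counts)                        # N
        df = sum(1 for c in counts if c > 0)   # document frequency
        cf = sum(counts)                       # collection term frequency
        p0_hat = 1.0 - df / n                  # Expression (33)
        p_hat = (cf - df) / cf if cf else 0.0  # Expression (34)
        return p0_hat, p_hat

    # Example: counts = [0, 2, 0, 1, 3] gives df = 3 and cf = 6,
    # so p0_hat = 1 - 3/5 = 0.4 and p_hat = (6 - 3)/6 = 0.5.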