Incremental Hierarchical Clustering of
Text Documents
by Nachiketa Sahoo
Adviser: Jamie Callan
May 5, 2006
Abstract
Incremental hierarchical text document clustering algorithms are important in organizing documents generated from streaming on-line sources, such as Newswire and Blogs. However, this is a relatively unexplored area in the text document clustering literature. Popular incremental hierarchical clustering algorithms, namely Cobweb and Classit, have not been applied to text document data. We discuss why, in their current form, these algorithms are not suitable for text clustering and propose an alternative formulation. This includes changes to the underlying distributional assumption of the algorithm in order to conform with the empirical data. Both the original Classit algorithm and our proposed algorithm are evaluated using Reuters newswire articles and the Ohsumed dataset, and the gain from using a more appropriate distribution is demonstrated.
1 Introduction
Document clustering is an effective tool to manage information overload. By grouping similar documents together, we enable a human observer to quickly browse large document collections [6], make it possible to easily grasp the distinct topics and subtopics (concept hierarchies) in them, and allow search engines to efficiently query large document collections [16], among many other applications. Hence, it has been widely studied as part of the broad literature on data clustering. One survey of the existing clustering literature can be found in Jain et al. [13].
The commonly studied document clustering algorithms are batch clustering algorithms, which require all the documents to be present at the start of the exercise and cluster the document collection by making multiple iterations over it. But with the advent of online publishing on the World Wide Web, the number of documents being generated every day has increased considerably. Popular sources of informational text documents, such as newswire and blogs, are continuous in nature. To organize such documents naively using existing batch clustering algorithms,
one might attempt to perform clustering on the documents collected so far. But, this is
extremely time consuming, if not impossible, due to the sheer volume of documents. One might
be tempted to convert existing batch clustering algorithms into incremental ones by performing batch clustering on periodically collected small batches of documents and then merging the generated clusters. However, ignoring for the moment the problem of deciding
on an appropriate time window to collect documents, there will always be a wait time before a
newly generated document can appear in the cluster hierarchy. This delay would be unaccept-
able in several important scenarios, e.g., financial services, where trading decisions depend on
breaking news, and quick access to appropriately classified news documents is important. A
clustering algorithm in such a setting needs to process the documents as soon as they arrive.
This calls for the use of an incremental clustering algorithm.
There has been some work in incremental clustering of text documents as a part of Topic
Detection and Tracking initiative ([1], [19], [10] and [7]) to detect a new event from a stream of
news articles. But the clusters generated by this task are not hierarchical in nature. Although that was adequate for the purpose of new event detection, we believe it is a limitation. The benefits of using a hierarchy of clusters instead of clusters residing at the same level of granularity are twofold. First, by describing the relationships between groups of documents, a hierarchy makes it possible to quickly browse to the specific topic of interest. The second reason is a technical one.
Finding the right number of clusters in a set of documents is an ill-formed problem when one
does not know the information needs of the end user. But, if we present the user with a topic
hierarchy populated with documents, which she can browse at her desired level of specificity, we
would circumvent the problem of finding the right number of clusters while generating a solution
that would satisfy users with different needs.
In spite of the potential benefits of an incremental algorithm that can cluster text documents as they arrive into an informative cluster hierarchy, this is a relatively unexplored area in the text document clustering literature. In this work we examine a well-known incremental hierarchical clustering algorithm, Cobweb, that has been used in non-text domains, and its variant Classit. We discuss why they are not suitable to be directly applied to text clustering and propose a variant of these algorithms that is based on the properties of text document data. Then we evaluate both the original algorithms and ours using real-world data and show the gains obtained by our proposed algorithm.
1.1 Contribution of this research
In this paper we demonstrate methods to carry out incremental hierarchical clustering of text
documents. Specifically, the contributions of this work are:
1. A Cobweb-based algorithm for text document clustering where word occurrence
attributes follow Katz’s distribution.
2. Evaluation of the existing algorithms and our proposed algorithm on large real world doc-
ument datasets.
In Section 2 we briefly review the text clustering literature and describe the Cobweb and Classit algorithms. In Section 3 we describe key properties of text documents that are central to this work. In Section 4 we explain the contributions of our work. In Section 5
we describe the cluster quality metrics that we have used to evaluate the results obtained. In
Section 6 we explain the setup of the experiment and discuss the results. In Section 7 we con-
clude with scope for future research.
2 Literature review
Clustering is a widely studied problem in the Machine Learning literature [13]. The prevalent
clustering algorithms have been categorized in different ways depending on different criteria,
such as hierarchical vs. non-hierarchical, partitional vs. agglomerative algorithms, deterministic
vs. probabilistic algorithms, incremental vs. batch algorithms, etc. Hierarchical clustering algo-
rithms and non-hierarchical clustering algorithms are categorized based on whether they pro-
duce a cluster hierarchy or a set of clusters all belonging to the same level. Different hierarchical
and non-hierarchical clustering algorithms for text documents have been discussed by Manning
and Schutze[17]. Clustering algorithms can be partitional or agglomerative in nature. In a parti-
tional algorithm one starts with one large cluster containing all the documents in the dataset
and divides it into smaller clusters. On the other hand, an agglomerative clustering algorithm
starts with all documents belonging to their individual clusters and combines the most similar
clusters until the desired number of clusters are obtained. Deterministic clustering algorithms
assign each document to only one cluster, while probabilistic clustering algorithms produce the
probabilities of each item belonging to each cluster. The former is said to make “hard” assignments while the latter is said to make “soft” assignments. Incremental clustering algorithms make
one or very few passes over the entire dataset and they decide the cluster of an item as they see
it. But, the batch clustering algorithms iterate over the entire dataset many times and gradually
change the assignments of the items to the cluster so that a clustering criterion function is
improved. One such criterion function is the average similarity among documents inside the
clusters formed. Another criterion function is the average similarity between a document in a
cluster and documents outside the cluster. The first criterion is called average internal simi-
larity and the second criterion is called average external similarity. In a clustering solution we
would want high average internal similarity, because that would mean that our clusters are composed of similar items. We would also want low average external similarity because that would
mean our clusters are dissimilar, i.e., they do not overlap. The final set of clusters is produced
after many iterations when no further improvement of the cluster assignment is possible.
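The two criterion functions can be made concrete with a short sketch. The cosine similarity measure and the dictionary-based term-frequency vectors below are illustrative choices on our part, not details mandated by the text.

```python
from itertools import combinations
import math

def cosine(u, v):
    # Cosine similarity between two term-frequency vectors (dicts).
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def avg_internal_similarity(clusters):
    # Mean pairwise similarity between documents in the same cluster.
    sims = [cosine(a, b)
            for docs in clusters
            for a, b in combinations(docs, 2)]
    return sum(sims) / len(sims) if sims else 0.0

def avg_external_similarity(clusters):
    # Mean similarity between documents in different clusters.
    sims = [cosine(a, b)
            for ci, cj in combinations(clusters, 2)
            for a in ci for b in cj]
    return sum(sims) / len(sims) if sims else 0.0
```

A good clustering solution scores high on the first function and low on the second; for two clusters with disjoint vocabularies the external similarity is exactly zero.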
Clustering to browse large document collections (Scatter/Gather)
Cutting et al.[6] is one of the first to suggest a cluster aided approach, called Scatter/Gather,
to browse large document collections. It describes two fast routines named Buckshot and Frac-
tionation to find the centroids of the clusters to be formed. Then it assigns the documents in
the collection to the nearest centroid and recomputes the centroids iteratively until very little or
no improvement is observed. The last step is similar to Simple K-means clustering, except that in Simple K-means one initially assigns k randomly chosen items as the centroids of the k clusters [17]. Note that k is a fixed, user-provided number. Buckshot finds the k centers in the document dataset by drawing a random sample of √(kn) documents and clustering them into k clusters using an agglomerative hierarchical clustering routine. Agglomerative hierarchical clustering algorithms have a time complexity of O(n²); by drawing a random sample of size √(kn), the time complexity is reduced to O(kn). Fractionation, on the other hand, finds k centroids in the following manner. It divides the set of documents into buckets of size m, where m > k. Then it clusters each bucket into ρm clusters, where ρ < 1 is a constant. It then repeats the process of partitioning the data and clustering it, treating each formed cluster as one data item, until k clusters are obtained. Cutting et al. have shown that Fractionation has a time complexity of O(mn). The centers of the clusters formed by either method are returned as the starting points for the Simple K-means clustering routine. With the help of these two
routines they have proposed a cluster aided approach to browse document collections in which
the program presents the user with a set of clusters for the document dataset (Scatter) along
with their descriptive labels. Then the user can select the clusters which interest her and submit
them to the program. The program merges the documents contained in those clusters (Gather)
and clusters them again. This process is repeated until the user’s information need is met or the
user decides to stop the process. The recursive clustering idea proposed in Scatter/Gather can
be effective in browsing large document sets, especially when one does not know enough about
the documents to query a deployed search engine using key words. This concept loosely parallels
the idea of organizing documents into a hierarchy of topics and subtopics, except that the orga-
nization in this case is guided by the user and executed by a clustering routine. However,
Scatter/Gather has its limitations. It is a batch clustering routine, hence it cannot be used in some important scenarios such as those described in the introduction. Another limitation that Scatter/Gather shares with many other clustering algorithms is that it requires the input of k, the number of clusters to present to the user. A value of k different from the number of subtopics in the collection might lead to meaningless clusters.
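The Buckshot idea above can be sketched in a few lines. The routine below is an illustration only: the centroid-linkage merge rule, the Euclidean feature vectors, and the function names are our assumptions, not details given in the paper.

```python
import math
import random

def centroid(cluster):
    # Component-wise mean of the vectors in a cluster.
    dim = len(cluster[0])
    return [sum(v[a] for v in cluster) / len(cluster) for a in range(dim)]

def buckshot_centroids(docs, k, distance):
    # Draw a random sample of about sqrt(k*n) documents so that quadratic
    # agglomerative clustering of the sample costs only O(k*n) overall.
    size = min(len(docs), int(math.sqrt(k * len(docs))))
    sample = random.sample(docs, size)
    # Greedy agglomerative clustering: start from singletons and repeatedly
    # merge the pair of clusters whose centroids are closest.
    clusters = [[d] for d in sample]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return [centroid(c) for c in clusters]
```

The returned centroids then seed the Simple K-means pass, exactly as in the description above.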
Right number of clusters
Finding the right number of clusters in a non-hierarchical clustering exercise is often a diffi-
cult problem [18]. The approaches suggested in the literature can, in general, be divided into
two groups [4]. The first approach is a multi-fold cross validation one with likelihood as the
objective function, in which one fits a series of mixture models with different numbers of compo-
nents to a subset of the data called training data and computes the likelihood of each model
given the remaining subset of the data called testing data. The model that results in the highest
likelihood is selected. The second approach also fits a mixture model to the data and computes
the likelihood of the model given the entire dataset using different number of clusters, but it
penalizes a model with a higher number of clusters for increased complexity. Observe that a
higher number of clusters can be made to fit any dataset better than a lower number of clusters.
Hence, by penalizing a clustering solution for its complexity one can achieve a trade off between
fitness, or likelihood, of the model and its complexity, which is optimized at the right number of
clusters. One such work has been done by Cheeseman and Stutz in their AutoClass algo-
rithm[5]. Other such works include Bayesian Information Criteria and Minimum Descriptor
Length criteria [8]. A different approach has been suggested in Liu et al.[16] for clustering text
documents. It uses stability of clustering solutions over multiple runs at each of a set of cluster
counts to decide the right number of clusters for the document dataset.
Even when the “right” number of clusters can be determined by an algorithm based on some
criterion, human observers often differ from each other about the clusters existing in the dataset
and what should be the right number of clusters. One alternative solution is to generate a hier-
archy of clusters, also called a dendrogram, with all the documents belonging to a single cluster
at the top of the hierarchy, each document in its individual cluster at the lowest level of the
hierarchy and intermediate number of clusters at levels between the two. Thus, the user can
look at the desired level in the hierarchy and find a number of clusters that meets her require-
ment ([17],[13]).
Incremental document clustering
As part of Topic Detection and Tracking (TDT) initiative ([1], [19], [10] and [7]) some experi-
ments have been done in incrementally clustering text documents. The TDT initiative is a
DARPA sponsored project started to study and advance the state of the art in detection and
tracking of new events in stream of news broadcast and intelligence reports. The identified tasks
of TDT are Story Segmentation, Retrospective Topic Detection, On-line New Event Detection,
Topic Tracking and Link Detection. The Story Segmentation task involves breaking a stream of
text or audio data without story delimiters into its constituent stories. Retrospective topic
detection involves detecting new events in the already collected set of documents. On-line new
event detection involves identifying a new event, e.g., an earthquake or a road accident, in a new
document. Tracking involves keeping track of evolution of an event by assigning the incoming
news stories to their corresponding events. Among these tasks the on-line new event detection
task involves incremental clustering. In this task a decision is made, after observing a new item,
whether it belongs to one of the existing clusters, or it belongs to a new cluster of its own.
The TDT team at the Carnegie Mellon University (CMU) uses a threshold-based rule to
decide whether a new document is another story of one of the detected events or it belongs to a
new event of its own. If the maximum similarity between the new document and any of the
existing clusters is more than a threshold t_c, the new document is said to belong to the cluster to which it is most similar, and it is merged into that cluster. If the maximum similarity is less than t_c but more than another threshold t_n, then the document is assumed to be an old story, but it is not merged into any cluster. If the maximum similarity is less than t_n, then the document is accepted to be about a new event and a new cluster is formed. They have also investigated adding a time component to the incremental clustering. In this experiment, similarities of a new document to each of the past m documents are computed, but they are weighted down linearly depending on how old the past documents are. If the similarity scores computed in this manner are less than a preset threshold, then the new document is presumed to be about a new event. This work finds that the use of a time component improves the performance of the new event detection task.
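The two-threshold rule can be sketched as follows. The threshold values and the similarity function below are illustrative placeholders of our own; in the actual system both are tuned.

```python
def assign_story(doc, clusters, similarity, t_c=0.5, t_n=0.3):
    # Two-threshold new event detection (a sketch).  clusters is a list of
    # lists of documents; similarity(doc, cluster) -> float in [0, 1].
    if not clusters:
        clusters.append([doc])
        return "new"
    best = max(clusters, key=lambda c: similarity(doc, c))
    s = similarity(doc, best)
    if s >= t_c:              # same event as an existing cluster: merge
        best.append(doc)
        return "merge"
    if s >= t_n:              # an old story, but not merged anywhere
        return "old"
    clusters.append([doc])    # dissimilar to everything seen: new event
    return "new"
```

A "new" return value corresponds to the system declaring that a new event has been detected.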
The TDT team at the University of Massachusetts Amherst (UMass) takes a variable thresholding approach to the on-line event detection task [1]. For each document that initiates a new cluster, the top n words are extracted and called a query vector. The similarity of the query vector to the document from which it was extracted defines an upper bound on the threshold required to be met by a document to match the query. A time-dependent component is also used in the variable threshold, which makes it harder for a new document to match an older query. When a new document d_j is compared to a past query q_i, the threshold is computed as 0.4 + p × (sim(q_i, d_i) − 0.4) + tp × (j − i), where 0 < p < 1 and tp, a time penalty factor, are tunable parameters, and q_i is the query generated from document d_i. Such a threshold is computed for every existing query q_i. If the similarity of the new document d_j does not exceed any of the thresholds, then the document is assigned to a new cluster and a query is computed for the document; otherwise it is added to the clusters assigned to the queries it triggers. A newly generated cluster indicates detection of a new news event.
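The variable threshold is a one-line formula; the default values of the tunable parameters p and tp below are placeholders of our own.

```python
def umass_threshold(sim_qi_di, j, i, p=0.5, tp=0.01):
    # Threshold for matching document j against the query built from
    # document i:  0.4 + p * (sim(q_i, d_i) - 0.4) + tp * (j - i).
    # The time penalty tp * (j - i) raises the bar for older queries.
    return 0.4 + p * (sim_qi_di - 0.4) + tp * (j - i)
```

Note that the threshold grows linearly with the document-index gap j − i, which is what makes new documents less likely to match old queries.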
Outside the TDT initiative, Zhang and Liu in a recent study have proposed a competitive
learning algorithm, which is incremental in nature and does not need to be supplied with the
correct number of clusters [20]. The algorithm, called Self Splitting Competitive Learning, starts
with a prototype vector that is a property of the only cluster present initially. During the execu-
tion of the algorithm the prototype vector is split and updated to approximate the centroids of
the clusters in the dataset. The update of the prototype vector is controlled, i.e., when a new data point is added to the cluster, the prototype vector is updated only if the data point is near enough to the prototype. This is determined by another property vector that starts away from the prototype and zeroes in on it as more and more data points are added. The time for splitting the cluster associated with the prototype is determined based on a threshold condition. When there is more than one prototype, a new data point is added to the cluster of the prototype nearest to it. They
have demonstrated their algorithm over text snippets returned from search engines as a response
to a query. However, the success of this algorithm on datasets with longer text documents is yet
to be demonstrated.
Yet another on-line algorithm called frequency sensitive competitive learning has been pro-
posed and evaluated on text datasets by Banerjee and Ghosh[2], which is designed to produce
clusters of items of approximately equal sizes. In this work a version of the K-means clustering
algorithm called spherical K-means has been modified so that the dispersion of the distributions
associated with the clusters reduces as more and more data points are added to them. This
makes larger clusters less likely candidates for a new data point than the smaller clusters. Thus,
the algorithm is tailored to produce clusters which are more or less equal in size.
All of these algorithms produce non-hierarchical clustering solutions, which forgo the opportunity to use clustering as an aid to detect the topic and subtopic structure within a large document collection. Also, the TDT experiments effectively exploit the information in the time stamps available with news stories, i.e., they assume that news stories describing the same event will occur within a brief span of time. Such information may not always be available.
Incremental Hierarchical Clustering: Nominal Attributes
Methods have been proposed in the non-text domain to cluster items in an incremental manner
into hierarchies. Most notable among them is the Cobweb algorithm by Fisher [9] and its
derivative Classit [12]. Cobweb is an algorithm to incrementally cluster data points with
nominal attributes into cluster hierarchies.
At the heart of Cobweb is a cluster quality measure called Category Utility.
Let C_1, …, C_K be the child clusters of a cluster C_p. The Category Utility of C_1, …, C_K is computed as

CU_p[C_1, …, C_K] = ( Σ_{k=1..K} P(C_k) Σ_i Σ_j [ P(A_i = V_ij | C_k)² − P(A_i = V_ij | C_p)² ] ) / K      (1)

where
P(C_k) = the probability that a document belonging to the parent cluster C_p belongs to the child cluster C_k,
A_i = the ith attribute of the items being clustered (say A_1 ∈ {male, female}, A_2 ∈ {Red, Green, Blue}; assumed to be a multinomial variable),
V_ij = the jth value of the ith attribute (say, V_12 indicates “female”).
P(A_i = V_ij | C_k)² is the expected fraction of times we can correctly guess the value of the multinomial variable A_i to be V_ij for an item in cluster k when one follows a probability-matching guessing strategy. For example, if we have a variable that takes values A, B and C with probabilities 0.3, 0.5 and 0.2, and we randomly predict that the variable takes value A 0.3 fraction of the time, B 0.5 fraction of the time and C 0.2 fraction of the time, we would be correct in predicting A 0.3 × 0.3 = 0.09 fraction of the time, B 0.25 fraction of the time and C 0.04 fraction of the time. A good cluster, in which the attributes of the items take similar values, will have high P(A_i = V_ij | C_k) values, hence a high score of Σ_j P(A_i = V_ij | C_k)². Cobweb maximizes the sum of P(A_i = V_ij | C_k)² scores over all possible assignments of a document to the child clusters. When the algorithm assigns a new item to a child node of node p, it does so in such a manner that the total gain in the expected number of correct guesses by moving an item from p to its child node, Σ_i Σ_j [ P(A_i = V_ij | C_k)² − P(A_i = V_ij | C_p)² ], is maximized. In this manner the algorithm maximizes the utility function for each node to which a new item is added.
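The probability-matching argument above boils down to a sum of squared probabilities, which the following one-liner makes concrete:

```python
def expected_correct(probs):
    # A probability-matching guesser predicts value v a p_v fraction of
    # the time, and the value actually is v a p_v fraction of the time,
    # so the overall probability of a correct guess is sum of p_v squared.
    return sum(p * p for p in probs)

expected_correct([0.3, 0.5, 0.2])   # 0.09 + 0.25 + 0.04 = 0.38
```

For a given number of values, the score is maximized when one value has all the probability mass (a "pure" cluster) and minimized for the uniform distribution.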
The Cobweb control structure is shown in Figure 1.
Algorithm Cobweb (adapted from Fisher’s original work [9])

function Cobweb(item, root)
  Update the attribute value statistics at the root
  If root is a leaf node then
    Return the expanded node that accommodates the new object
  Else
    Find the best child of the root to host the item and perform the
    qualifying step (if any) among the following:
      1. Create a new node for the item instead of adding it to the
         best host, if that leads to improved Category Utility.
      2. Merge nodes if it leads to improved Category Utility and
         call Cobweb(item, merged node).
      3. Split node if it leads to improved Category Utility and
         call Cobweb(item, root).
    If none of the above steps are performed then
      Call Cobweb(item, best child of root)
    end if
  end if
Figure 1. Cobweb control structure.
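The control structure above can be sketched in code. The version below is a deliberately simplified illustration of our own: it handles only the "host in best child" and "create new node" choices and omits the merge and split operators of Fisher's full algorithm; all class and function names are ours.

```python
import copy
from collections import Counter, defaultdict

class Node:
    # A cluster node: item count, attribute-value counts, child clusters.
    def __init__(self):
        self.n = 0
        self.items = []
        self.counts = defaultdict(Counter)   # attribute -> value -> count
        self.children = []

    def add(self, item):                     # item: {attribute: value}
        self.n += 1
        self.items.append(item)
        for attr, val in item.items():
            self.counts[attr][val] += 1

def cu(parent, children):
    # Category Utility of a partition, as in expression (1).
    def sq(node):                            # sum of P(A_i = V_ij | node)^2
        return sum((c / node.n) ** 2
                   for counter in node.counts.values()
                   for c in counter.values())
    return sum(ch.n / parent.n * (sq(ch) - sq(parent))
               for ch in children) / len(children)

def cobweb_insert(root, item):
    # Descend greedily, choosing between hosting the item in an existing
    # child and creating a new singleton child, whichever yields the
    # higher Category Utility.
    root.add(item)
    if not root.children:
        if root.n > 1:                       # expand a leaf into singletons
            for it in root.items:
                child = Node()
                child.add(it)
                root.children.append(child)
        return
    scored = []
    for i in range(len(root.children)):
        trial = copy.deepcopy(root.children)
        trial[i].add(item)
        scored.append((cu(root, trial), "host", i))
    new = Node()
    new.add(item)
    scored.append((cu(root, root.children + [new]), "new", -1))
    score, action, i = max(scored)
    if action == "new":
        root.children.append(new)
    else:
        cobweb_insert(root.children[i], item)
```

On the three-item example worked out below (items with t = A, B, C), this sketch creates a third singleton child for the item with t = C, matching the Category Utility comparison in the text.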
An illustration of the clustering process is given in Figure 2.
Figure 2. Cobweb illustrated. (The figure shows a new item being added to a leaf node, and a new item descending the tree: at each node, a Category Utility comparison, as described in Figure 1, decides which existing child should host the new item or whether it should be placed in a new cluster of its own.)
Assume that there is only one attribute of interest, called t, and that it takes values in {A, B, C}. Also assume that we have three items a, b and c with t values A, B and C respectively. Further assume that the objects are presented in the order specified, i.e., first a, followed by b, which is followed by c.
After the first two items are presented, the following cluster configuration is arrived at without any computation of Category Utility (Figure 3):

  C3 (root), P(C3) = 1: contains a and b; t = A, t = B
    C1, P(C1) = 0.5: contains a; t = A
    C2, P(C2) = 0.5: contains b; t = B

Figure 3. After the first two items are added.

C3 is the root cluster and C1 and C2 are two child clusters, each containing one item. P(C1) is the probability that a document randomly picked from the parent cluster of C1, i.e., C3, belongs to C1. Similarly for C2.
Let us add the third item c to the root node. We can add it at the level of C1 and C2 (level 2) as another cluster, or we can add it inside C1 or C2, which would delegate the item c to a third (new) level. So, our options are (omitting the c-within-(b, c) configuration, which is analogous to the c-within-(a, c) configuration described below):

Configuration 1:
  C3 (root), P(C3) = 1: contains a, b and c; t = A, t = B, t = C
    C1, P(C1) = 1/3: contains a; t = A
    C2, P(C2) = 1/3: contains b; t = B
    C4, P(C4) = 1/3: contains c; t = C

Configuration 2:
  C3 (root), P(C3) = 1: contains a, b and c; t = A, t = B, t = C
    C4, P(C4) = 2/3: contains a and c; t = A, t = C
      C1, P(C1) = 0.5: contains a; t = A
      C5, P(C5) = 0.5: contains c; t = C
    C2, P(C2) = 1/3: contains b; t = B

Figure 4. Two partitions of the root cluster.
At this point the Category Utilities of the two configurations let us decide which configuration to choose. Note that we need to compute the Category Utility of the two partitions of the root cluster. They can be computed using expression (1) as described below.

For the first configuration in Figure 4 the parent cluster is C3 and the child clusters are C1, C2 and C4. The Category Utility of this configuration is:

CU_1 = (1/3) Σ_{k∈{1,2,4}} P(C_k) [ Σ_{t∈{A,B,C}} P(t|C_k)² − Σ_{t∈{A,B,C}} P(t|C_3)² ]
     = (1/3) [ 3 × (1/3) × (1² − 1/3) ]
     = 2/9

For the second configuration in Figure 4 the parent cluster is C3 and the child clusters are C4 and C2:

CU_2 = (1/2) Σ_{k∈{4,2}} P(C_k) [ Σ_{t∈{A,B,C}} P(t|C_k)² − Σ_{t∈{A,B,C}} P(t|C_3)² ]
     = (1/2) [ (2/3) × (2 × (1/2)² − 1/3) + (1/3) × (1² − 1/3) ]
     = 1/6
Since CU_1 > CU_2, we select configuration 1 over configuration 2. Looking at Figure 4, it is intuitive to make a new cluster for the third item, because it has an attribute value not seen in any of the existing categories.
There is one more possible configuration, where c is added below C2 instead of C1, but that is symmetrical to the second configuration in Figure 4, so the analysis is identical to the one shown in the previous paragraph.
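The comparison above can be checked numerically with a small helper for expression (1), restricted to a single multinomial attribute (the function name and input format are ours):

```python
def category_utility(parent_probs, children):
    # parent_probs: {value: P(value | parent)};
    # children: list of (P(C_k), {value: P(value | C_k)}) pairs.
    parent_sq = sum(p * p for p in parent_probs.values())
    gain = sum(p_k * (sum(p * p for p in probs.values()) - parent_sq)
               for p_k, probs in children)
    return gain / len(children)

parent = {'A': 1/3, 'B': 1/3, 'C': 1/3}
cu1 = category_utility(parent, [(1/3, {'A': 1.0}),
                                (1/3, {'B': 1.0}),
                                (1/3, {'C': 1.0})])
cu2 = category_utility(parent, [(2/3, {'A': 0.5, 'C': 0.5}),
                                (1/3, {'B': 1.0})])
# cu1 = 2/9 and cu2 = 1/6, so configuration 1 is preferred
```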
Incremental clustering algorithms, such as Cobweb, are sensitive to the order in which items are presented [9]. Cobweb makes use of split and merge operations to mitigate this problem. In the merge operation the child nodes with the highest and second highest Category Utility are removed from the original node and made child nodes of a new node, which takes their place under the parent node. In the split operation the best node is removed and its child nodes are made children of the parent of the removed node. Merge and split operations are only carried out if they lead to a better Category Utility than is obtainable by assigning the item to the existing best node or to a new cluster of its own. By using these two operators, the algorithm remains flexible in the face of changes in the properties of the data items observed subsequently.
Figure 5. Merge and split operations illustrated. (In the merge example, child nodes 3 and 4 are removed and placed under a new node 5, which takes their place; in the split example, node 3 is removed and its children 4 and 5 become children of its parent.)
Incremental Hierarchical Clustering: Numerical Attributes
We now consider an extension of Cobweb from nominal attributes to numerical attributes. Gennari et al. [12] have shown that in order to use Cobweb for data items with numeric, rather than nominal, attribute values, we need to make some assumption about the distribution of attribute values. When the values of each attribute follow a normal distribution, they have shown that the Category Utility function can be written as

CU_p[C_1, …, C_K] = ( Σ_k P(C_k) Σ_i (1/σ_ik − 1/σ_ip) ) / K

where
σ_ip = the standard deviation of the values of attribute i in the parent node p, and
σ_ik = the standard deviation of the values of attribute i in the child node k.

This algorithm is known as the Classit algorithm.
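The numeric Category Utility is a direct translation of the formula above. The sketch below assumes all standard deviations are nonzero; the actual Classit algorithm additionally bounds 1/σ with an "acuity" parameter so that singleton clusters (σ = 0) do not blow up the score.

```python
def classit_cu(p_k, sigma_child, sigma_parent):
    # p_k[k] = P(C_k); sigma_child[k][i] = std dev of attribute i in
    # child k; sigma_parent[i] = std dev of attribute i in the parent.
    K = len(p_k)
    return sum(p * sum(1.0 / s_k - 1.0 / s_p
                       for s_k, s_p in zip(sigmas, sigma_parent))
               for p, sigmas in zip(p_k, sigma_child)) / K
```

Tighter child clusters (smaller σ_ik relative to σ_ip) yield a higher score, mirroring the role of the squared conditional probabilities in the nominal case.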
We have not seen any prior application of either of these algorithms to text clustering. Hence, their performance on text document data was uncertain at the time of this work. Further, word occurrence counts, the attributes commonly used to represent a text document, follow a skewed distribution, unlike the Normal distribution (Figure 6). Also, the Normal distribution assumes that the attributes are real numbers, but word occurrence counts are nonnegative integers. They cannot be treated as nominal attributes either, because the occurrence counts are not contained in a bounded set, which one would have to assume while treating them as nominal attributes. A more suitable distribution for such count data is the Negative Binomial, or Katz’s distribution [14].
Our work proposes to improve upon the original Cobweb algorithm using distributional
assumptions that are more appropriate for word count data.
3 Text Documents and word distributions
Text, as we commonly know it, is available in the form of unstructured documents. Before we
can use such documents for classification or clustering, we need to convert them to items with
attributes and values. A popular way of converting a document to such a form is to use the words¹ in the document as attributes and the number of times each word occurs in the document, or some function of it, as the value of the attribute. This is called the “Bag of Words” approach. One consequence of using such a method to convert documents to an actionable form is that one forgoes the information contained in the order of the words. Despite this drawback, the bag-of-words approach is one of the most successful and widely used methods of converting text documents into actionable form.
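The conversion can be sketched in a couple of lines, using the delimiting convention the paper's footnote describes (runs of non-alphanumeric characters separate terms); lowercasing is an extra normalization choice of ours.

```python
import re
from collections import Counter

def bag_of_words(text):
    # Split on runs of non-alphanumeric characters and count how many
    # times each term occurs; the counts become the attribute values.
    return Counter(t.lower() for t in re.split(r'[^0-9A-Za-z]+', text) if t)

bag_of_words("The cat saw the dog.")
# Counter({'the': 2, 'cat': 1, 'saw': 1, 'dog': 1})
```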
Several attempts have been made to characterize the distribution of words across documents. This is useful in judging the information content of a word. For instance, a word that occurs uniformly in every document of the corpus, e.g., “the”, is not as informative as a word that occurs frequently in only a few, e.g., “Zipf”.
The occurrence statistics of a word in a document can be used along with the information content of the word to infer the topic of the document and cluster documents of similar topics into the same group, as is done in this work. Manning and Schutze have discussed several models to characterize the occurrence of words across different documents [17].
3.1 Models based on Poisson distribution
3.1.1 Poisson. The Poisson distribution has been used to model the number of times a word occurs in a document. The probability of a word occurring k times in a document is given by

P(k) = λ^k e^(−λ) / k!

where λ is a rate parameter. However, from empirical observations, it has been found that the Poisson distribution tends to overestimate the frequency of informative words (content words).
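The probability mass function translates directly to code:

```python
import math

def poisson_pmf(k, lam):
    # P(k) = lam^k * e^(-lam) / k!
    return lam ** k * math.exp(-lam) / math.factorial(k)
```

For any rate λ the probabilities over k = 0, 1, 2, … sum to one, which makes a convenient sanity check.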
3.1.2 Two Poisson Model. There have been attempts to characterize the occurrence of a word across documents using a mixture of Poisson distributions. One such attempt uses two Poisson distributions to model the probability of a word occurring a certain number of times in a document. One of the distributions captures the rate of the word occurrence when the word occurs because it is topically relevant to the document. The second distribution captures the rate of the word occurrence when the word occurs without being topically relevant to the document. This mixture of two probability distributions has the probability density function:

P(k) = α e^(−λ1) λ1^k / k! + (1 − α) e^(−λ2) λ2^k / k!

where α is the probability of the word being topically relevant and 1 − α is the probability of the word being topically unrelated to the document.
It has been empirically observed that, although the two Poisson model fits the data better
than the single Poisson model [3], a spurious drop is seen in the probability of a word occurring
twice in a document [14]. The fitted distribution has a lower probability for a word occurring twice
in a document than for it occurring three times, i.e., it predicts that there are fewer documents that
contain a word twice than there are documents that contain the same word three times. But
empirically it has been observed that the document count monotonically decreases with increasing
number of occurrences of a word (see Figure 6).
1. Throughout this paper we shall use word and term interchangeably to refer to the same thing, i.e., a con-
tiguous sequence of alphanumeric characters delimited by non-alphanumeric character(s). E.g., the first word or
term in this footnote is “Throughout”.
3.1.3 Negative Binomial A proposed solution to the above problem is to use a mixture of
more than two Poisson distributions to model the word occurrences. A natural extension of this
idea is to use a Negative Binomial distribution, which is a gamma mixture of an infinite number of
Poisson distributions [11]. The probability density function of a Negative Binomial distribution
is given below:

P(k) = C(k + r − 1, r − 1) p^r (1 − p)^k,   (4)

where p and r are parameters of the distribution and C(n, m) denotes the binomial coefficient.
Although the Negative Binomial distribution fits the word occurrence data very well, it can
be hard to work with because it often involves computing a large number of coefficients [17].
This has been confirmed in our analysis (see Expressions (28) and (29) in Section 4.2).
3.1.4 Zero inflated Poisson When we observe the word occurrence counts in documents, we
find that most words occur in only a few documents in the corpus. So, for most of the words,
the count of documents where they occur zero times is very large (see Figure 6). Looking at the
shape of the empirical probability density function, we attempt to model the occurrence counts
using a Zero Inflated Poisson distribution, which assigns a large probability mass to the variable
value 0 and distributes the remaining probability mass over the rest of the occurrence counts
according to a Poisson distribution.
Figure 6. The occurrence of a typical word (“result”) across different documents in our test collection.
The probability density function of the Zero Inflated Poisson distribution is given by

P(k) = (1 − α) δ_k + α λ^k e^(−λ) / k!,   k = 0, 1, 2, …   (5)

where δ_k = 1 iff k = 0, and 0 otherwise.
As we shall demonstrate in Section 3.3, this distribution does not fit text data as well as the
Negative Binomial or the Katz’s distribution.
3.2 Katz’s K-mixture model
This distribution, proposed by Katz [14], although simple to work with, has been shown to model
the occurrences of words in documents better than many other distributions, such as the Poisson
and Two Poisson, and about as well as the more complex Negative Binomial distribution [17].
Katz's distribution assigns the following probability to the event that word i occurs k times in a
document2:

P(k) = (1 − α) δ_k + (α / (β + 1)) (β / (β + 1))^k   (6)

where δ_k = 1 iff k = 0, and 0 otherwise.
The MLE estimates of the parameters α and β are:

β̂ = (cf − df) / df   (7)
α̂ = (cf / N) · (1 / β̂)   (8)

where N is the number of documents in the collection, and

cf = collection frequency = number of times word i occurred in the document collection,
obtained by adding up the times the word occurred in each document. Here, a collection can be
whatever we deem our universe of documents to be. It can be the entire corpus of documents or
a subset of it.

df = document frequency = number of documents in the entire collection that contain the
word i.
From (6) it follows that

P(0) = 1 − α + α / (β + 1)
     = 1 − df/N   (9)
     = 1 − Pr(the word occurs in a document)
     = Pr(the word does not occur in a document)
Also, it follows that

P(k) = (α / (β + 1)) (β / (β + 1))^k,   k = 1, 2, …   (10)

Substituting p for β / (β + 1), we have

P(k) = α (1 − p) p^k   (11)
Let's define a parameter p0 as

p0 = P(0)   (12)

Using (7) we find that

p = β / (β + 1)
  = (cf − df) / cf   (13)
  = Pr(the word repeats in a document) / Pr(the word occurs in a document)
  = Pr(the word repeats ∩ the word occurs) / Pr(the word occurs)
  = Pr(the word repeats | the word occurs)
2. In this section we shall discuss the case of one word, the ith word. Hence, we shall drop the subscript i
from the equations and expressions.
Hence, 1 − p can be interpreted as the probability of the word occurring only once. Or, it can be
thought of as a scaling factor used to make (11) and (12) together a valid probability density
function.

We can write Expression (6) for k = 0, using p, as

P(0) = (1 − α) + α (1 − p)
     = 1 − α + α − αp
     = 1 − αp

Hence, α in terms of p0 and p is

p0 = 1 − αp
αp = 1 − p0
α = (1 − p0) / p   (14)

Expression (11) can now be written as

P(k) = (1 − p0)(1 − p) p^(k−1)   (15)

when k > 0.
Using Expressions (12) and (15), we can fully specify the Katz's distribution. The two
parameters are p0 and p, which can be estimated as (see Expressions 9 and 13)

p̂0 = 1 − df/N   (16)
p̂ = (cf − df) / cf   (17)
It can be shown that if a distribution is defined by Expressions (12) and (15), then the estimates
(16) and (17) are the MLE of the parameters p0and p(see Appendix A).
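The two-parameter form above can be sketched in a few lines of Python; the helper names `katz_fit` and `katz_pmf` are ours, introduced only for illustration of Expressions (12), (15), (16) and (17):

```python
def katz_fit(cf, df, n_docs):
    """Closed-form estimates of the Katz K-mixture parameters p0 and p.

    cf: collection frequency of the word, df: its document frequency,
    n_docs: number of documents N in the collection (Expressions 16 and 17)."""
    p0 = 1.0 - df / n_docs        # probability the word does not occur in a document
    p = (cf - df) / cf            # probability the word repeats, given that it occurs
    return p0, p

def katz_pmf(k, p0, p):
    """P(word occurs k times in a document) under Katz's distribution."""
    if k == 0:
        return p0                               # Expression (12)
    return (1.0 - p0) * (1.0 - p) * p ** (k - 1)  # Expression (15)
```

For cf = 500, df = 200, N = 1000 this yields p0 = 0.8 and p = 0.6, and the probabilities sum to one over k = 0, 1, 2, …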
3.3 Fitness comparison
We estimated the parameters of the Zero Inflated Poisson and the Negative Binomial using the method
of moments, and the parameters of Katz's distribution using the Maximum Likelihood Estimation
(MLE) method. The reason for using the method of moments and not the MLE is that for the
Negative Binomial and the Zero Inflated Poisson distributions the MLE can only be found
numerically, which is computationally too expensive for our task of incremental clustering. One can
still use numerical methods to determine the MLEs of the parameters of the distributions, which
admittedly have better properties, if one is willing to pay the cost in terms of delay. In this
work we shall limit ourselves to the estimates that have closed form expressions and can be com-
puted efficiently, because our goal is to carry out the incremental document clustering in real
time.
3.3.1 Zero Inflated Poisson If the probability density function of a Zero Inflated Poisson dis-
tribution is given in the form of Expression (5), then the method of moments estimates of its
parameters α and λ are

λ̂ = m2 / m1 − 1
α̂ = m1 / λ̂

where m1 and m2 are the first and second sample moments of the word occurrence counts.
3.3.2 Negative Binomial For the Negative Binomial distribution, the parameters p and r can be
estimated as

p̂ = X̄ / S²
r̂ = X̄² / (S² − X̄)

where X̄ and S² are the sample mean and variance of the word occurrence counts.
For the Katz's distribution we used Expressions (16) and (17) to estimate the parameters p0 and p.
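The moment-based estimators above admit a direct implementation. The following is an illustrative sketch under the parameterizations used in this section (the function names are ours); note that the Negative Binomial fit is only well defined when the sample variance exceeds the sample mean, and the ZIP fit needs at least some repeated occurrences:

```python
def zip_moments_fit(counts):
    """Method-of-moments estimates (alpha, lam) for the Zero Inflated Poisson,
    using E[X] = alpha*lam and E[X^2] = alpha*lam*(1 + lam)."""
    n = len(counts)
    m1 = sum(counts) / n                  # first sample moment (mean)
    m2 = sum(k * k for k in counts) / n   # second sample moment
    lam = m2 / m1 - 1.0                   # requires some counts > 1, else lam = 0
    alpha = m1 / lam
    return alpha, lam

def nb_moments_fit(counts):
    """Method-of-moments estimates (r, p) for the Negative Binomial,
    parameterized as P(k) = C(k+r-1, r-1) p^r (1-p)^k."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((k - mean) ** 2 for k in counts) / n
    p = mean / var                        # mean/variance ratio; needs var > mean
    r = mean * mean / (var - mean)
    return r, p
```

Both estimators recover the first two sample moments by construction, which is what makes them cheap enough for incremental use.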
We evaluated the fitness of these three distributions by computing the probabilities of the
word occurrences, using the estimated parameters, on three different datasets. For each dataset
we selected the top 100 terms by their cf × log(N/df) score. The distribution that has a higher
likelihood than another can be considered a better fit to the data. For each term a pairwise
comparison of the fitness of the different distributions is carried out in this manner. The results are
shown in the form of three dominance matrices in Table 1. Each cell records the number of
terms for which the distribution of the row has a 10% or higher likelihood than the distribution of
the column.
One dominance matrix is shown per dataset:

         NB  Katz's  ZIP        NB  Katz's  ZIP        NB  Katz's  ZIP
NB        0    55     92         0    41     98         0    63     98
Katz's   41     0     96        58     0     98        35     0     98
ZIP       7     4      0         2     2      0         2     2      0

Table 1. Likelihood comparisons: count of terms with likelihood of row distribution > likelihood of col distribution × 1.1.
It can be observed from the table that Katz's distribution is not only easier to work with, as
we will see in Section 4, but it also fits better than the Zero Inflated Poisson (ZIP) and gives a fit
comparable to the Negative Binomial (NB) distribution.
4 Algorithms for text
4.1 COBWEB: when attribute values follow Katz’s distribution
4.1.1 Category utility
Using words as attributes, we can derive the Category Utility function assuming that word
occurrences follow Katz's distribution. For reference, the Category Utility formula as given in
Cobweb is

CU = (1/K) Σ_k P(C_k) Σ_i [ Σ_j P(A_i = V_{i,j} | C_k)² − Σ_j P(A_i = V_{i,j} | C_p)² ]

Notice that for each attribute indexed i we need to compute

Σ_j [ P(A_i = V_{i,j} | C_k)² − P(A_i = V_{i,j} | C_p)² ]   (22)

where j is an index over the values of the attribute i. In this case V_{i,j} would take the values 0, 1, 2, …
because we are working with count data.

Let's use CU_{i,k} to refer to the contribution of the attribute i towards the Category Utility of the
cluster k. Hence, the first part of Expression (22) can be written as

CU_{i,k} = Σ_{f=0}^∞ P(A_i = f | C_k)²   (23)

Substituting Expressions (12) and (15) in Expression (23), we obtain

CU_{i,k} = Σ_{f=0}^∞ P(A_i = f | C_k)² = [1 − 2 p0 (1 − p0) − p (1 − 2 p0)] / (1 + p)   (24)
Substituting the estimates of p0 and p from Expressions (16) and (17) in Expression (24), and sim-
plifying, we get

CU_{i,k} = Σ_{f=0}^∞ P(A_i = f | C_k)² = 1 − [2 df (2 N cf − (N + cf) df)] / [N² (2 cf − df)]   (25)

where df, cf, and N are counted in the category k.
Expression (25) specifies how to calculate the Category Utility contribution of an attribute in
a category. Hence, the Category Utility of the Classit algorithm, when the distribution of
attributes follows Katz's model, is given by

CU = (1/K) Σ_k P(C_k) Σ_i [ CU_{i,k} − CU_{i,p} ]   (26)

where CU_{i,k} is given by Expression (25).
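The per-attribute computation thus needs only three counts maintained per cluster. A minimal sketch (the function name `cu_attribute_katz` is ours) using the p0/p form of Expression (24):

```python
def cu_attribute_katz(cf, df, n_docs):
    """Contribution CU_{i,k} of one word to the Category Utility of a cluster,
    assuming word counts follow Katz's distribution. cf, df and n_docs are the
    collection frequency, document frequency and document count within the cluster."""
    p0 = 1.0 - df / n_docs
    p = (cf - df) / cf
    # sum_f P(A_i = f | C_k)^2 for the Katz pmf:
    #   p0^2 + (1 - p0)^2 (1 - p)^2 * sum_{f >= 1} p^{2(f-1)}
    # which sums the geometric series to the closed form below.
    return p0 ** 2 + (1.0 - p0) ** 2 * (1.0 - p) ** 2 / (1.0 - p ** 2)
```

This agrees term by term with the brute-force sum of squared Katz probabilities, and with the count-based form in Expression (25).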
4.2 Cobweb: when attribute values follow Negative Binomial distribution
The probability density function of the Negative Binomial distribution is

P(x) = C(x + r − 1, r − 1) p^r (1 − p)^x   (27)

where p and r are the parameters of the distribution, which are to be estimated from the data.
4.2.1 Category utility
Substituting Expression (27) in (23), we obtain the contribution of a word in a child cluster
towards the Category Utility:

CU_{i,k} = Σ_{x=0}^∞ [ ((x + r − 1)! / (x! (r − 1)!)) p^r (1 − p)^x ]²   (28)

This expression cannot be reduced to any simpler form, although it can be written using a
hypergeometric function in the following manner:

CU_{i,k} = p^{2r} · 2F1(r, r; 1; (1 − p)²)   (29)
One can use a library, such as the one available with Mathematica, to numerically evaluate
2F1(r, r; 1; (1 − p)²). In our experience this computation is three orders of magnitude more
resource intensive than computing (25), the equivalent expression for Katz's distribution. As we
described in Section 3.3, in this work we shall limit ourselves to the methods that will let us
carry out incremental clustering in real time, i.e., in the time available between the arrival of two
documents.
For this reason and the reasons cited in Sections 3.1 and 3.3, we shall fully explore only
Katz's distribution and the original Classit algorithm based on the Normal distribution in our work.
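To make the cost comparison concrete, the sketch below (our own helper names, integer r for simplicity) evaluates Expression (28) by direct summation and Expression (29) by summing the hypergeometric series term by term; either way an infinite series must be truncated, in contrast to the closed form (25):

```python
import math

def cu_attribute_nb_direct(r, p, terms=200):
    """CU contribution under the Negative Binomial: sum over x of
    (C(x+r-1, r-1) p^r (1-p)^x)^2, truncated after `terms` terms."""
    return sum((math.comb(x + r - 1, r - 1) * p ** r * (1 - p) ** x) ** 2
               for x in range(terms))

def cu_attribute_nb_2f1(r, p, terms=200):
    """Same quantity via p^(2r) * 2F1(r, r; 1; (1-p)^2), summing the
    hypergeometric series; coeff is the x-th term (r)_x (r)_x z^x / ((1)_x x!)."""
    z = (1 - p) ** 2
    total, coeff = 0.0, 1.0
    for x in range(terms):
        total += coeff
        coeff *= ((r + x) / (x + 1)) ** 2 * z  # ratio of consecutive series terms
    return p ** (2 * r) * total
```

The two routines agree to numerical precision, which also serves as a check of the identity between (28) and (29).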
5 Cluster Evaluation Methods
5.1 Evaluating the clusters
One commonly used cluster quality measure is the purity of the clustering solution. Purity of a
cluster is defined as

p_k = max_c {CF_k(c)} / N_k   (30)

where
c is the index of classes,
class is a pre-specified group of items,
k is the index of clusters,
cluster is an algorithm-generated group of items,
CF_k(c) = number of items from class c occurring in cluster k, or the frequency of class c in
cluster k,
N_k = number of items in cluster k.
Purity of the entire collection of clusters can be found by taking the average of the cluster
qualities. Here, there are two kinds of averages one might consider: weighted or unweighted. If
we assign a weight to each cluster proportional to the size of the cluster and take the weighted
average, then it is called the micro average, since each of the documents gets equal weight. If we
instead want to give equal weight to each cluster, we compute the arithmetic average instead.
This is called the macro average. The first one is a document level evaluation, while the second one
is a cluster level evaluation. Both these purity scores lie between 0 and 1.
The drawback of relying only on purity to evaluate the quality of a set of clusters becomes
apparent in hierarchical clustering. When we collect clusters occurring at or near the lowest
level of the hierarchy, we get clusters with very few documents in them. Hence, we obtain clus-
ters with high purity scores. In the limit, at the lowest level there are N clusters each containing
only one item. Hence, max_c {CF_k(c)} is 1 for each k ∈ {1, …, N}, resulting in a purity score of 1.
We get larger clusters at a higher level in the hierarchy, which are more likely to contain docu-
ments belonging to different classes, leading to a lower purity score. This illustrates how the purity
score can be misleading when the number of clusters formed is different from the number of
classes in the dataset. If we make more clusters than there are classes in the dataset, we bias
the purity score up. If we make fewer clusters than there are classes in the dataset, we bias the
purity score down.
To correct this problem, we define another score of the clustering solution in the following
manner:

r_c = max_k {CF_k(c)} / N_c   (31)

where N_c is the size of the class c. The other variables are as defined for the expression of the
purity score in Expression (30). Here also we can compute the micro average or the macro
average to compute the score for the entire solution.
This is a purity computation with the clustering solution treated as the true classes of the
data items and the human-generated classes as the solutions to be evaluated. Using this mea-
sure we evaluate how well the “true” classes in the datasets are represented in the clusters
formed.
These metrics, p_k and r_c, have interpretations that parallel the precision and recall metrics,
respectively, in the information retrieval literature. Precision is the fraction of the retrieved docu-
ments that are relevant. Our p_k has the precision interpretation when we think of a cluster as
retrieving documents from the class to which the majority of its elements belong. On the other hand,
recall is the fraction of all the relevant documents that are retrieved. In the framework we
described for p_k, our metric r_c has the recall interpretation.
Taking a cue from the F measure commonly used in the IR literature to combine precision and
recall, we computed the F score as the harmonic mean of the P and R values:

F = 2 P R / (P + R)

The F score is the metric by which we shall measure the quality of our clusters.
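The computation of P, R and F from a flat clustering can be sketched as below; `f_score` is a hypothetical helper name, and macro averaging (equal weight per cluster and per class) is used for both p_k and r_c:

```python
from collections import Counter

def f_score(clusters, classes):
    """clusters, classes: parallel lists giving each document's cluster id and
    true class label. Returns macro-averaged precision P, recall R, and F."""
    docs = list(zip(clusters, classes))
    by_cluster = Counter(k for k, _ in docs)   # N_k: items per cluster
    by_class = Counter(c for _, c in docs)     # N_c: items per class
    joint = Counter(docs)                      # CF_k(c): class c items in cluster k
    # p_k: fraction of cluster k taken up by its majority class
    p = [max(joint[(k, c)] for c in by_class) / nk for k, nk in by_cluster.items()]
    # r_c: fraction of class c captured by the cluster that holds most of it
    r = [max(joint[(k, c)] for k in by_cluster) / nc for c, nc in by_class.items()]
    P = sum(p) / len(p)
    R = sum(r) / len(r)
    return P, R, 2 * P * R / (P + R)
```

For example, three documents of classes a, a, b in one cluster and three of class b in another give P = 5/6 and R = 7/8.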
5.2 Evaluating the hierarchy
Another question of interest when evaluating a hierarchical clustering algorithm is “To what
extent does the generated cluster hierarchy agree with the class hierarchy present in the data?”. As
we shall describe in Section 6, the datasets we have used in our experiments have a hierarchy of
classes and provide us a rare opportunity to evaluate our generated cluster hierarchy for correct-
ness. As a reminder, a class is a document category that has been provided to us as a part of
the dataset. It is what the documents have been labeled with by an external entity, and it helps us
in evaluating how good our algorithm is. On the other hand, a cluster is a grouping of docu-
ments that our algorithm generates. It does so by grouping together the documents it considers
similar.
Matching the generated cluster hierarchy with the existing class hierarchy is a non-trivial
task. Instead, in this work we focus on measuring how often the sibling clusters in the gener-
ated hierarchy have sibling classes, i.e., how often the children clusters of a parent cluster have chil-
dren classes of the class that is assigned to the parent cluster. For instance, consider the gener-
ated cluster subtree shown in Figure 7.
Figure 7. A sample subtree with children nodes K1 (C1.1), K2 (C1.2), K3 (C2.1) and K4 (C1.1.3) under
a parent cluster K0; the corresponding parent classes are C1, C1, C2 and C1.1. Class labels of the
children nodes are given in parenthesis.
In this case we have already determined the classes of the child clusters3. To be able to measure
if they are filed under the correct class, we need to find the class of the parent cluster. To do
this we tabulate the parent classes of the child clusters and assign the most frequent parent class
to the parent cluster K0. So, in this case the parent cluster K0 gets the label C1. Then we eval-
uate this cluster configuration as if K0 is merely a cluster of four other smaller entities, each of
which has a class label same as the parent class of what they really have. This is equivalent to
saying that as long as the children clusters of K0 have children classes of the class of K0, i.e., C1
in this case, they are correct. Clusters with all other class labels that occur under that parent
cluster are incorrect classifications by the algorithm. They should have been somewhere else.
So, in the above example the precision of K0 would be 2/4 = 0.5. We compute this precision
for all the internal nodes of the cluster tree and take their average (both micro average and
macro average) to compute the overall precision of the hierarchy. This gives us a measure of
how much the generated cluster hierarchy agrees with the class hierarchy present in the data. We
call it the sibling precision score of the cluster hierarchy.
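The per-node computation just described can be sketched as follows (the helper name and the `parent_of` callback are ours, introduced for illustration):

```python
from collections import Counter

def sibling_precision(child_classes, parent_of):
    """Precision of one internal node: tabulate the parent classes of the
    children's class labels, label the node with the most frequent parent
    class, and score the fraction of children consistent with that label.

    child_classes: class labels of the child clusters, e.g. ['C1.1', 'C1.2', ...]
    parent_of: function mapping a class label to its parent class label."""
    parents = Counter(parent_of(c) for c in child_classes)
    label, freq = parents.most_common(1)[0]
    return freq / len(child_classes)
```

Applied to the Figure 7 example, with a parent function that drops the last dot-separated component, the node's precision comes out as 2/4 = 0.5.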
We needed to make a few decisions while evaluating the hierarchy in this manner. For
instance, we used only the internal nodes to compute the precision of any node. This is because
oftentimes leaf nodes co-exist with internal nodes as children of another internal node. In this
case, if we compute precision based on leaf nodes, i.e., single documents, then we are mixing the
precision of the kind we described in Section 5.1 with the precision of the hierarchy, and it is not
clear how we should interpret the resulting number. Another decision that needed to be made
was: what should we do if a child cluster has the broadest class label assigned to it? Since we
cannot find a parent class for these classes, we explored the possibility of
i. dropping such child clusters from our evaluation, and
ii. treating them as their own parent clusters since they are the broadest level classes.
3. At the lowest level each cluster has only one document and its class can be read from the data directly.
In our experiments the results do not change much if we take either of these strategies. So, we
shall report only the results we got by treating the broadest classes as their own parent classes.
6 Experiment setup and results
We evaluate our algorithm over two text document collections, namely Reuters-RCV1 and
Ohsumed (88-91). These datasets were picked because of the presence of human labeled hierar-
chical class labels and the reasonably large number of documents in them. They are described in
more detail in the following sections.
6.1 Reuters-RCV1
Incremental clustering algorithms process the data points only once and in the order in which
they are presented, and the order in which data points are present in the dataset influences the
clusters produced4. Therefore, it is imperative that we test the incremental clustering algorithms
with an ordering of data points that is similar to what they are expected to receive during
their deployment. As we envision the two algorithms in this work to be used to process streams
of text documents from newswire, newsgroups, Blogs, etc., the natural ordering among the docu-
ments is determined by the time at which they are received. Therefore, we need a document
dataset in which the time order of the documents is preserved. Reuters-RCV1 [15] is one such
dataset.
The Reuters-RCV1 dataset is a collection of over 800,000 English newswire articles collected from
Reuters over a period of one year (20th Aug 1996 to 19th Aug 1997). These documents have
been classified by editors at Reuters simultaneously under three category hierarchies: the “Topic”
hierarchy, the “Industry” hierarchy and the “Region” hierarchy. The Topic hierarchy contains four cate-
gories at depth one of the tree, namely “Corporate/Industrial”, “Economics”, “Govern-
ment/Social” and “Market”. There are ten such categories in the Industry hierarchy, some of
which are “Metals and Minerals”, “Construction”, etc. The Region hierarchy has geographical
locations, such as country names, and economic/political groups as categories. There are no
finer sub-categories in the Region hierarchy.
Figure 8. Three classification hierarchies: the Topic hierarchy (with depth-one categories Corporate/Industrial,
Economics, Government/Social and Market), the Industry hierarchy, and the Region hierarchy.
The classification policy, also called The Coding Policy, requires that each document must
have at least one Topic category and at least one Region category assigned to it. It also requires
that each document be assigned to the most specific possible subcategory in a classification hier-
archy. A document might be, and often is, assigned more than one category from any one of
the three category hierarchies. The documents are present in the dataset in the temporal order in
which they were collected.
4. However, the ideal incremental clustering algorithm is expected to be insensitive to the order in which it
encounters the data points. This characteristic is partly achieved by the Cobweb algorithm through its split and
merge operators.
number of documents 62935
number of unique words 93792
average document length 222
number of classes 259
Table 2. Rcv1 dataset (first 30 days). Classes are the region classes.
6.1.1 Evaluating clusters
Experiment setup For our experiments, articles from the first 30 days of the Reuters-RCV1
dataset were used. There were 62935 articles. Stop words were removed from the documents
and the terms were stemmed. Then the most informative terms were selected by their cf ×
log(N/df) scores to represent the documents. We repeated the experiments using 100 to 800
terms at a step size of 100.
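The term-selection step can be sketched as below; `select_terms` is a hypothetical helper, and stop-word removal and stemming are assumed to have happened upstream:

```python
import math
from collections import Counter

def select_terms(docs, n_terms):
    """Pick the top n_terms by cf * log(N / df), the score used in the text.
    docs: list of documents, each a list of (stemmed, stopword-free) tokens."""
    n = len(docs)
    cf, df = Counter(), Counter()
    for tokens in docs:
        cf.update(tokens)        # collection frequency: every occurrence counts
        df.update(set(tokens))   # document frequency: one count per document
    score = {w: cf[w] * math.log(n / df[w]) for w in cf}
    return sorted(score, key=score.get, reverse=True)[:n_terms]
```

Note that a word appearing in every document scores zero (log 1 = 0) regardless of how frequent it is, which matches the intuition from Section 3 that such words carry little information.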
We have evaluated the clustering solutions for the correctness of assignment of documents to
the clusters using the region categories, because (i) in the region class hierarchy all the assigned
classes belong to one level and (ii) fewer articles are assigned multiple region class labels than
other class labels, suggesting that the region classes in the dataset do not
overlap a lot. This allows us to evaluate our algorithm on a dataset with well defined classes.
There were 259 region categories present in the selected documents. So, we have extracted 259
clusters from the dendrogram constructed by the clustering algorithms and measured their
quality using the Region categories of the documents.
Results and Discussion The results of the clustering exercise are given in Table 3. We can see
that the Katz's distribution based Classit algorithm dominates the Normal distribution based Classit
algorithm across varying vocabulary sizes in both the micro and macro averages of F scores.
V     Katz (micro)  Normal (micro)  Katz (macro)  Normal (macro)
100   0.46          0.31            0.83          0.60
200   0.45          0.43            0.81          0.74
300   0.45          0.33            0.85          0.67
400   0.45          0.42            0.79          0.74
500   0.45          0.36            0.84          0.69
600   0.45          0.42            0.82          0.76
700   0.45          0.39            0.81          0.74
800   0.45          0.30            0.83          0.61
Table 3. Cluster quality comparison on RCV1 data (F scores; rows correspond to vocabulary sizes 100 through 800).
Figure 9. Cluster quality comparison on RCV1 data. The left panel shows the micro average of the F-score
and the right panel shows the macro average of the F-score.
As we can see, the Katz-based Classit algorithm consistently performs better than the Normal-
based Classit algorithm on this dataset. However, we are cautious in interpreting the micro-
averaged F score. Both of these algorithms produce clusters of widely different sizes, i.e., a few
big clusters, a few more clusters of intermediate size and a lot of smaller clusters. The micro-
averaged F score is affected by this, because performance over a few good clusters dominates the
entire performance metric. This explains the flat nature of the plot of the micro-averaged F score
with Katz-based Classit. The larger of the clusters generated by the algorithm do not change
much over different vocabulary sizes, so the micro-averaged F score remains nearly constant.
Therefore, we also compute the macro-averaged F score, where each cluster gets equal weight,
and find that Katz-based Classit performs better than Normal-based Classit over a wide
range of vocabulary sizes.
6.1.2 Evaluating hierarchy
We evaluate the generated cluster hierarchy using the topic hierarchy of classes5 as our refer-
ence. There are 63 different topic codes in the documents we used, whereas in the entire topic
hierarchy there are 103 topic codes.
We pre-processed the documents using the steps described in the previous section, and evaluated
the accuracy of the parent/child cluster configurations as described in Section 5.2. The results
are given in Table 4.
V     Normal (macro)  Katz (macro)  Normal (micro)  Katz (micro)
100 0.925 0.956 0.814 0.959
200 0.924 0.935 0.797 0.943
300 0.926 0.874 0.825 0.871
400 0.92 0.866 0.814 0.789
500 0.918 0.896 0.812 0.871
600 0.922 0.841 0.814 0.989
700 0.929 0.836 0.846 0.653
800 0.918 0.855 0.832 0.718
Table 4. Evaluation of the cluster hierarchy using Rcv1 data
The values in the table cells are the average sibling precision of the internal nodes of the cluster
hierarchy. As we can see, there is no clear winner in this case, although both the algorithms do
reasonably well in assigning sibling classes under the same cluster. However, we must be careful
to interpret these values as the correctness of sibling classes getting grouped together, and
not as recovering all of the original class hierarchy.
5. This can be obtained from [15] Appendix 2.
6.2 OHSUMED (88-91)
The Ohsumed test collection is a set of 348,566 abstracts collected from 270 medical journals
over a period of 5 years. Each abstract is annotated with MeSH (Medical Subject Heading)
labels by human observers, which indicate the topic of the abstract. Unlike the rcv1 dataset,
these documents are not in temporal order. Another property of this dataset is that, being from a
specific subject area, the documents contain words from a much smaller vocabulary. Due to the presence
of human assigned MeSH keywords over such a large collection, this dataset provides us with an
opportunity to evaluate our algorithm over a large dataset and against real topic labels.
number of documents 196555
number of unique words 16133
average document length 167
number of classes 14138
Table 5. Ohsumed dataset (88-91)
6.2.1 Evaluating clusters
Experiment Setup We used the Ohsumed 88-91 dataset from the TREC-9 filtering track to
evaluate our algorithm for the correctness of assignment of documents to the classes. We
selected only those articles for which both the MeSH labels and the abstract text were present.
There were 196,555 such articles. As with the rcv1 dataset, the most informative words in the
dataset were selected using the cf × log(N/df) score of the words. We repeated the clustering exercise
using 25 to 200 words at a step size of 25. To determine the number of different topics present
in this dataset one can look at the unique MeSH labels present in the dataset. But, as there are
tens of thousands of such labels present, we used fixed numbers of clusters to evaluate (see Table
6) the algorithms.
Results and discussion The F-score results of the experiments are given in Table 6.
        k=5      k=10     k=20     k=40     k=80     k=160    k=320    k=640
V=25  µ ? 57/55  ? 57/53  ⋆ 57/53  ? 57/53  ⋆ 55/38  × 55/61  ? 36/34  × 27/34
      M   62/62    60/61    63/62    62/62    60/54    37/54    49/52    43/52
V=50  µ ⋆ 70/57  ⋆ 70/57  ⋆ 69/57  ⋆ 69/57  ⋆ 69/57  ⋆ 69/57  × 48/51  × 47/51
      M   74/65    75/63    75/65    76/69    76/70    76/71    60/65    59/65
V=75  µ ? 70/70  ? 70/70  ? 70/70  × 69/70  × 69/70  × 69/70  ⋆ 69/39  ⋆ 69/35
      M   71/71    69/70    73/77    76/80    77/81    78/82    78/56    79/53
V=100 µ ⋆ 70/62  ⋆ 69/62  ⋆ 69/62  ⋆ 69/62  ⋆ 68/62  ⋆ 68/45  ⋆ 69/45  ⋆ 69/45
      M   72/69    71/70    75/73    78/75    78/76    79/62    80/62    80/62
V=125 µ ⋆ 71/61  ⋆ 71/61  ⋆ 69/61  ⋆ 69/61  ⋆ 69/61  × 53/61  ⋆ 53/47  ⋆ 53/46
      M   74/68    76/68    77/71    78/72    80/74    68/74    69/61    69/60
V=150 µ ⋆ 72/54  ⋆ 72/51  ⋆ 59/51  ⋆ 59/51  ⋆ 55/51  ⋆ 54/51  ⋆ 54/51  × 48/51
      M   72/65    77/61    72/64    74/66    71/66    71/67    71/67    66/67
V=175 µ ⋆ 71/52  ⋆ 71/51  ⋆ 71/51  ⋆ 71/51  ⋆ 59/51  ⋆ 58/43  ⋆ 54/43  ⋆ 54/41
      M   74/64    78/62    81/64    83/66    75/67    74/60    71/60    71/58
V=200 µ ⋆ 62/52  ⋆ 62/50  ⋆ 62/50  ⋆ 62/50  ⋆ 62/50  ⋆ 62/50  ⋆ 62/50  ⋆ 62/50
      M   72/63    75/62    77/65    78/65    79/66    79/67    79/67    79/67

Table 6. Cluster quality comparison on OHSUMED data at different numbers of clusters (k) and vocabu-
lary sizes (V). The figures in the table are F-score × 100, shown as Katz-Classit/Normal-Classit
(K stands for Katz-Classit, N for the original Classit). The µ and M rows hold the micro and macro
averages of the F-score respectively. The cells where Katz-Classit performs better are marked with a ⋆,
the cells where Normal-Classit performs better are marked with a ×, and the cells where there is no clear
winner are marked with a ?.
We can see from the table that Normal-Classit is most competitive when the vocabu-
lary size is small and the number of clusters formed is large. For all other settings, i.e., when
the size of the vocabulary used is larger or when the number of clusters formed is smaller, Katz-
Classit performs better. This shows that the Katz-Classit algorithm is more robust, as it per-
forms well across a much larger range of parameter values.
The performance of both the algorithms suffers when we create a larger number of clusters, which
makes sense because there are then fewer features based on which to distinguish between clusters.
6.2.2 Evaluating hierarchy
The MeSH labels present in the Ohsumed collection have a hierarchical structure6. This pro-
vides us with another opportunity to evaluate the correctness of our hierarchy. This class hier-
archy is much larger than the topic hierarchy of the Rcv1 dataset. There are 42610 different MeSH
labels. Each MeSH label has a code attached to it. The class hierarchy information can be
directly read from this code. For instance, the first three records of the 2005 “ASCII MeSH collec-
tion” read

Body Regions;A01
Abdomen;A01.047
Abdominal Cavity;A01.047.025

Figure 10. First three lines of the MeSH labels file (filename: mtrees2005.bin)

This says that the topic labeled “Abdominal Cavity” (A01.047.025) is a child topic of the label
with code A01.047, which we can find from the file as the topic “Abdomen” (A01.047), which in
turn is a child topic of a topic with code A01. We can find from the file that this is the code of
the label “Body Regions”. These “.”-separated topic codes let us easily find the parent topics by
dropping the suffix of the code. Not all the MeSH labels are seen in our dataset. There were
only about 14138 different MeSH labels used in the document set we used for our experiments.
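Reading the parent of a MeSH tree code thus amounts to dropping the last “.”-separated component, as sketched below (the helper name is ours; treating top-level codes as their own parents matches the strategy adopted in Section 5.2 for the broadest classes):

```python
def mesh_parent(code):
    """Parent of a MeSH tree code: drop the last '.'-separated component.
    Top-level codes (no '.') are treated as their own parents."""
    return code.rsplit('.', 1)[0] if '.' in code else code
```

For example, the parent of A01.047.025 (“Abdominal Cavity”) is A01.047 (“Abdomen”), whose parent in turn is A01 (“Body Regions”).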
Documents were pre-processed as described in the previous section. The entire cluster hierarchy
was generated and the correctness of the hierarchy was evaluated as described in Section 5.2.
The precision values are reported in Table 7.
V     Normal (macro)  Katz (macro)  Normal (micro)  Katz (micro)
25 0.786 0.795 0.626 0.749
50 0.781 0.831 0.667 0.784
75 0.79 0.857 0.654 0.831
100 0.801 0.888 0.742 0.891
125 0.828 0.939 0.788 0.976
150 0.847 0.935 0.812 0.963
175 0.876 0.91 0.859 0.858
200 0.894 0.958 0.819 0.919
Table 7. Evaluation of the cluster hierarchy using Ohsumed data
Here again both the algorithms do reasonably well in grouping classes with common parents
under the same cluster, with Katz-Classit appearing to have an advantage over Normal-Classit
across all vocabulary sizes. But we must be careful here not to interpret these precision values
as the closeness of the entire cluster hierarchy to the existing class hierarchy. Instead, they measure the accu-
racy of the algorithms in classifying sibling classes under the same parent cluster.
6. The entire collection of MeSH labels can be downloaded from the web-site of the National Institute of Health.
We have used the 2005 MeSH label collection for our purpose.
We also tracked the sibling precision score at different depths of the generated cluster tree
(Figures 11 and 12).
Figure 11. Tracing the sibling precision (macro and micro averages) over the height of the tree. Vocabulary sizes 25 and 75.

Figure 12. Tracing the sibling precision (macro and micro averages) over the height of the tree. Vocabulary sizes 125 and 175.
These plots show the general trend at different vocabulary sizes. As we can see, there is considerable variation in the sibling precision over different depths. Amidst these variations, we can observe that the sibling precision is higher and more consistent at the nodes occurring in the lower layers of the tree. We also find that on these layers Katz-Classit usually performs better than Normal-Classit.
It is interesting to observe the general consistency at the lower levels of the tree and the lack of it at the higher levels. At the lower levels we have a large number of nodes in each layer. When we average the performance of each algorithm over this large number of nodes, we get a score that is robust to random mistakes. So we get a consistent score from layer to layer, and it is easier to see which algorithm does better. But it is not so at the higher levels. In the higher
levels we have only a few nodes in each layer over which to average the score, so the average is more sensitive to random mistakes. Note that both the micro average and the macro average are sensitive to these mistakes: the wrong nodes in the higher levels of the tree get either a weight equal to that of the other nodes (macro average) or a weight proportional to the number of documents in them (micro average), and both of these weights are significant at these levels of the tree. This is why the plot of average sibling precision fluctuates a lot at these levels and we do not get a clear winner across the layers in the upper part of the tree.
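The difference between the two averaging schemes discussed above can be made concrete with a small sketch. This is our own illustration, not the paper's code; the per-node sibling precisions and document counts for one layer are hypothetical:

```python
# Micro vs. macro averaging of per-node sibling precision scores in one layer.
# Each node contributes (sibling_precision, number_of_documents).

def macro_average(nodes):
    """Unweighted mean: every node counts equally, regardless of size."""
    return sum(score for score, _ in nodes) / len(nodes)

def micro_average(nodes):
    """Document-weighted mean: nodes with many documents dominate."""
    total_docs = sum(n_docs for _, n_docs in nodes)
    return sum(score * n_docs for score, n_docs in nodes) / total_docs

# One large accurate node, one tiny inaccurate node, one mid-sized node.
layer = [(0.9, 500), (0.5, 10), (0.8, 200)]

print(round(macro_average(layer), 3))  # 0.733
print(round(micro_average(layer), 3))  # 0.866
```

A single wrong node drags the macro average down by a full 1/n share, while its effect on the micro average depends on how many documents it holds, which is why both averages become unstable in the sparse upper layers.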
7 Conclusion
To our knowledge, this is the first attempt at incremental hierarchical clustering of text documents. We have evaluated an incremental hierarchical clustering algorithm, which is often used with non-text datasets, on text document datasets. We have also proposed a variation of it that has more desirable properties when used for incremental hierarchical text clustering.
The variation of the Cobweb/Classit algorithm that we have demonstrated in this work uses Katz's distribution instead of the Normal distribution used in the original formulation of the Classit algorithm. Katz's distribution is more appropriate for word occurrence data, as has been shown in prior work [14] and observed empirically in our work. We have evaluated both algorithms on the Reuters-RCV1 dataset, which allows us to carry out the experiments in a scenario very similar to real life. We tested the algorithms by presenting them Newswire articles from the Reuters-RCV1 dataset in time order, and have shown that our algorithm performs consistently better than the Normal-based Classit algorithm, as measured by both the micro and macro averages of the F-score, over a range of vocabulary sizes. We have also evaluated both algorithms on the Ohsumed 88-91 dataset and found that Katz-Classit performs better except for a narrow range of parameter values with small vocabulary sizes and a large number of clusters, where results are likely to be unreliable. This shows that the performance of Katz-Classit is more robust across broad parameter settings.
We have also proposed a way to evaluate the quality of the hierarchy generated by hierarchical clustering algorithms, by observing how often the children clusters of a cluster are assigned the children classes of the class assigned to that cluster. We found that although both the existing algorithm and our proposed algorithm perform well on this metric, our algorithm performs marginally better on the Ohsumed dataset.
The most important contribution we think we have made in this work is the separation of the attribute distribution and its parameter estimation from the control structure of the Classit algorithm. Thus, one can use a new attribute distribution, which may be different from Normal or Katz but is more appropriate for the data at hand, inside the well-established control structure of the Classit algorithm to carry out incremental hierarchical clustering of a new kind of data. For instance, if the Negative Binomial is considered a better fit for the word distribution than Katz's distribution, and one can come up with an efficient way to estimate its parameters, it can be used in the framework of the existing Classit algorithm as demonstrated in this work. One can also experiment with a Bayesian approach to estimating the parameters of the distribution and carry out incremental hierarchical clustering in this framework, which might lead to better results due to more reliable parameter estimates for clusters with a small number of documents.
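The separation described above can be sketched as a small interface. This is our own hypothetical illustration, not the paper's implementation; the class and method names are invented, and only the Katz estimators, expressions (33) and (34), come from the paper:

```python
# A distribution plugged into the Classit control structure only needs two
# operations: fit parameters from per-document counts of a term, and score a
# count. Normal, Katz, or Negative Binomial variants can then be swapped in
# without touching the tree operations.
import math
from abc import ABC, abstractmethod
from typing import Sequence

class AttributeDistribution(ABC):
    @abstractmethod
    def fit(self, counts: Sequence[int]) -> None:
        """Estimate parameters from per-document counts of one term."""

    @abstractmethod
    def log_prob(self, k: int) -> float:
        """Log-probability of observing count k, used when scoring a cluster."""

class KatzDistribution(AttributeDistribution):
    def fit(self, counts):
        n = len(counts)
        df = sum(1 for c in counts if c > 0)        # document frequency
        cf = sum(counts)                            # collection term frequency
        self.p0 = 1.0 - df / n                      # MLE, expression (33)
        self.p = (cf - df) / cf if cf > 0 else 0.0  # MLE, expression (34)

    def log_prob(self, k):
        if k == 0:
            return math.log(self.p0)
        return (math.log(1 - self.p0) + math.log(1 - self.p)
                + (k - 1) * math.log(self.p))

d = KatzDistribution()
d.fit([0, 0, 1, 3, 2])                 # term absent from 2 of 5 documents
print(round(d.p0, 2), round(d.p, 2))   # 0.4 0.5
```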
[1] James Allan, Ron Papka, and Victor Lavrenko. On-line new event detection and tracking. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 37-45. ACM Press, 1998.
[2] A. Banerjee and J. Ghosh. Competitive learning mechanisms for scalable, incremental and balanced clustering of streaming texts. In Proceedings of the International Joint Conference on Neural Networks, volume 4, pages 2697-2702, Jul 2003.
[3] Abraham Bookstein and Don R. Swanson. A decision theoretic foundation for indexing. Journal of the American Society for Information Science, pages 45-50, Jan-Feb 1975.
[4] Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan-Kauffman.
[5] P. Cheeseman and J. Stutz. Bayesian classification (AUTOCLASS): Theory and results. Advances in Knowledge Discovery and Data Mining, 1996.
[6] Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey. Scatter/gather: A cluster-based approach to browsing large document collections. In Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Interface Design and Display, pages 318-329, 1992.
[7] George Doddington, Jaime Carbonell, James Allan, Jonathan Yamron, and Yiming Yang. Topic detection and tracking pilot study final report, Jul 2000.
[8] M. A. T. Figueiredo and A. K. Jain. Unsupervised learning of finite mixture models. IEEE Trans. on Patt. Analysis and Machine Intell., 24(3):381-396, March 2002.
[9] Douglas H. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
[10] Martin Franz, Todd Ward, J. Scott McCarley, and Wei-Jing Zhu. Unsupervised and supervised clustering for topic tracking. In SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 310-317. ACM Press, 2001.
[11] Frederick Mosteller and David L. Wallace. Applied Bayesian and Classical Inference: The Case of The Federalist Papers. Springer Series in Statistics. Springer-Verlag, 1983.
[12] J. H. Gennari, P. Langley, and D. Fisher. Models of incremental concept formation. Journal of Artificial Intelligence, 40:11-61, 1989.
[13] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264-323, 1999.
[14] Slava M. Katz. Distribution of content words and phrases in text and language modelling. Nat. Lang. Eng., 2(1):15-59, 1996.
[15] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361-397, 2004.
[16] Xiaoyong Liu and W. Bruce Croft. Cluster-based retrieval using language models. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Language models, pages 186-193, 2004.
[17] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, England, 2000.
[18] Padhraic Smyth. Clustering using Monte Carlo cross-validation. In Evangelos Simoudis, Jia Wei Han, and Usama Fayyad, editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), page 126. AAAI Press, 1996.
[19] Yiming Yang, Tom Pierce, and Jaime Carbonell. A study of retrospective and on-line event detection. In SIGIR '98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 28-36. ACM Press, 1998.
[20] Ya-Jun Zhang and Zhi-Qiang Liu. Refining web search engine results using incremental clustering. International Journal of Intelligent Systems, 19:191-199, 2004.
Appendix A
MLE of Katz’s distribution parameters
The Katz's distribution is defined as:

    P(0) = p_0
    P(k) = (1 - p_0)(1 - p) p^{k-1},  when k > 0    (32)

where p_0 and p are the parameters of the distribution.
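As a quick check that (32) defines a proper distribution, note that p_0 + sum_{k >= 1} (1 - p_0)(1 - p) p^{k-1} = p_0 + (1 - p_0) = 1. A minimal numerical sketch of this (our own code; the parameter values are arbitrary illustrations):

```python
# Numerical check that the Katz probabilities in equation (32) sum to one
# over k = 0, 1, 2, ...

def katz_pmf(k, p0, p):
    if k == 0:
        return p0
    return (1 - p0) * (1 - p) * p ** (k - 1)

p0, p = 0.7, 0.4                  # arbitrary illustrative parameter values
total = sum(katz_pmf(k, p0, p) for k in range(200))  # tail beyond 200 is negligible
print(round(total, 10))  # 1.0
```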
Let us discuss the distribution of only one word or term. The data are the counts of occurrences of the word in each document in the text collection. So, if we have N documents in the dataset, we have N observations, each of which is a count.
Let us also define n_k to be the number of observations equal to k, i.e., the number of documents in which the term occurs exactly k times, and let K be the maximum value of k. Then

    document frequency        df = N - n_0 = \sum_{k=1}^{K} n_k
    collection term frequency cf = \sum_{k=1}^{K} k n_k
The likelihood L(p, p_0) of the parameters given the data is

    L(p, p_0) = \prod_{i=1}^{N} Pr(the word occurs x_i times in document i)
              = \prod_{i=1}^{N} [ \delta(x_i) p_0 + (1 - \delta(x_i)) (1 - p_0)(1 - p) p^{x_i - 1} ]
              = p_0^{n_0} \prod_{k=1}^{K} (1 - p_0)^{n_k} (1 - p)^{n_k} (p^{k-1})^{n_k}

where \delta(\cdot) is the indicator function that is 1 if its argument is zero and 0 otherwise.
The log-likelihood is

    LL(p, p_0) = n_0 \log(p_0) + \sum_{k=1}^{K} [ n_k \log(1 - p_0) + n_k \log(1 - p) + n_k (k - 1) \log(p) ]
Taking the partial derivative of the log-likelihood with respect to p_0 and equating it to 0:

    \partial LL / \partial p_0 = n_0 / \hat{p}_0 - \sum_{k=1}^{K} n_k / (1 - \hat{p}_0) = 0
    \Rightarrow  n_0 (1 - \hat{p}_0) = (N - n_0) \hat{p}_0
    \Rightarrow  \hat{p}_0 = n_0 / N = 1 - df / N    (33)
We can find the MLE of p in a similar manner.

    \partial LL / \partial p = \sum_{k=1}^{K} [ - n_k / (1 - \hat{p}) + n_k (k - 1) / \hat{p} ] = 0
    \Rightarrow  (1 / \hat{p}) \sum_{k=1}^{K} n_k (k - 1) = (1 / (1 - \hat{p})) \sum_{k=1}^{K} n_k
    \Rightarrow  (1 / \hat{p}) (cf - df) = df / (1 - \hat{p})
    \Rightarrow  (1 - \hat{p}) (cf - df) = \hat{p} df
    \Rightarrow  \hat{p} = (cf - df) / cf    (34)
Expressions (33) and (34) are the MLEs of the parameters of Katz's distribution defined in (32).
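The closed forms (33) and (34) are easy to sanity-check numerically. The following sketch is our own code with a hypothetical count histogram; it computes the estimates from df and cf and verifies that nearby parameter values score a lower log-likelihood:

```python
# MLE of Katz's parameters from the histogram n_k of per-document term counts,
# using expressions (33)-(34), plus a numerical check that the closed forms
# maximize the log-likelihood.
import math

def katz_mle(n_k, N):
    """n_k maps a count k >= 1 to the number of documents with that count."""
    df = sum(n_k.values())                   # documents containing the term
    cf = sum(k * n for k, n in n_k.items())  # total occurrences of the term
    return 1 - df / N, (cf - df) / cf        # (p0_hat, p_hat)

def log_likelihood(n_k, N, p0, p):
    n0 = N - sum(n_k.values())               # documents without the term
    ll = n0 * math.log(p0)
    for k, n in n_k.items():
        ll += n * (math.log(1 - p0) + math.log(1 - p) + (k - 1) * math.log(p))
    return ll

n_k = {1: 30, 2: 15, 3: 5}                   # hypothetical: 50 of N=100 docs
N = 100
p0_hat, p_hat = katz_mle(n_k, N)
print(p0_hat, round(p_hat, 4))               # 0.5 and (75-50)/75 = 1/3

best = log_likelihood(n_k, N, p0_hat, p_hat)
for eps in (-0.01, 0.01):                    # perturbed parameters score worse
    assert log_likelihood(n_k, N, p0_hat + eps, p_hat) < best
    assert log_likelihood(n_k, N, p0_hat, p_hat + eps) < best
```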
This paper proposes an unsupervised algorithm for learning a finite mixture model from multivariate data. The adjective "unsupervised" is justified by two properties of the algorithm: 1) it is capable of selecting the number of components and 2) unlike the standard expectation-maximization (EM) algorithm, it does not require careful initialization. The proposed method also avoids another drawback of EM for mixture fitting: the possibility of convergence toward a singular estimate at the boundary of the parameter space. The novelty of our approach is that we do not use a model selection criterion to choose one among a set of preestimated candidate models; instead, we seamlessly integrate estimation and model selection in a single algorithm. Our technique can be applied to any type of parametric mixture model for which it is possible to write an EM algorithm; in this paper, we illustrate it with experiments involving Gaussian mixtures. These experiments testify for the good performance of our approach