Mining Common Topics from Multiple Asynchronous Text Streams∗
Xiang Wang
School of Software, Tsinghua University
Beijing 100084, China
xiang_w00@mails.tsinghua.edu.cn
Xiaoming Jin
School of Software, Tsinghua University
Beijing 100084, China
xmjin@tsinghua.edu.cn
Kai Zhang
School of Software, Tsinghua University
Beijing 100084, China
z-k06@mails.tsinghua.edu.cn
Dou Shen
Microsoft Adcenter Labs
One Microsoft Way, Redmond, WA, USA
doushen@microsoft.com
ABSTRACT
Text streams are becoming more and more ubiquitous, in the forms of news feeds, weblog archives and so on, which result in a large volume of data. An effective way to explore the semantic as well as temporal information in text streams is topic mining, which can further facilitate other knowledge discovery procedures. In many applications, we are facing multiple text streams which are related to each other and share common topics. The correlation among these streams can provide more meaningful and comprehensive clues for topic mining than those from each individual stream. However, it is nontrivial to explore the correlation in the presence of asynchronism among multiple streams, i.e., documents from different streams about the same topic may have different timestamps, which remains unsolved in the context of topic mining. In this paper, we formally address this problem and put forward a novel algorithm based on the generative topic model. Our algorithm consists of two alternating steps: the first step extracts common topics from multiple streams based on the timestamps adjusted by the second step; the second step adjusts the timestamps of the documents according to the time distribution of the topics discovered by the first step. We perform these two steps alternately, and a monotone convergence of our objective function is guaranteed. The effectiveness and advantage of our approach were justified by extensive empirical studies on two real data sets consisting of six research paper streams and two news article streams, respectively.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information
Search and Retrieval—clustering
∗The work was partly supported by NSFC 60403021,
60673140 and 863 funding 2007AA01Z156.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
WSDM ’09, February 9-12, 2009, Barcelona, Spain.
Copyright 2009 ACM 978-1-60558-390-7 ...$5.00.
General Terms
Algorithms
Keywords
Temporal text mining, topic model, asynchronous streams
1. INTRODUCTION
More and more text streams are being generated in var-
ious forms, such as news streams, weblog articles, emails,
instant messages, research paper archives, web forum dis-
cussion threads, and so forth. To discover valuable knowl-
edge from a text stream, a first step is usually to extract
topics from the stream containing both semantic and tem-
poral information, which are described by two distributions,
respectively: a word distribution describing the semantics
of the topic and a time distribution describing the topic’s
intensity over time [3, 5, 7, 8, 10, 11, 12, 14, 15].
In many real-world applications, we are facing multiple
text streams that are correlated to each other by sharing
common topics. Intuitively, the interactions among these
streams could provide clues to derive more meaningful and
comprehensive topics than topics found using information
from each individual stream alone. The intuition was con-
firmed by very recent work [16], which utilized the temporal
correlation over multiple streams to explore the semantic
correlation among common topics. The method proposed therein relied on the critical assumption that different streams are always synchronous in time (or, in the authors' own term, coordinated), which means that the common topics share the same time distribution over different streams.
However, this assumption is too strong to hold in all cases.
Rather, asynchronism among multiple streams, i.e. docu-
ments from different streams about the same topic have dif-
ferent timestamps, is actually very common in practice. For
instance, in news streams, there is no guarantee that news
articles covering the same topic are indexed by the same
timestamps. There can be hours of delay for news agencies,
days for newspapers, and even weeks for periodicals. This
is because some news feeds try to provide first-hand flashes
shortly after the incidents, while others provide more comprehensive reviews afterwards. Another example is research paper archives, where the latest research topics are closely followed by newsletters and communications within weeks or months, then the extended versions may appear in conference proceedings, which are usually published annually, and at last in journals, which may sometimes take years to appear after submission. Specifically, consider the relative frequency of the occurrences of two terms, warehouse and mining, in the titles of all research papers published in SIGMOD (ACM International Conference on Management of Data) and TKDE (IEEE Transactions on Knowledge and Data Engineering) from 1992 to 2006. The first term identifies the topic data warehouse and the second data mining, which are two common topics shared by both streams. As shown in Fig. 1(a), the bursts of both terms in SIGMOD are significantly earlier than those in TKDE, which suggests the presence of asynchronism between these two streams. Thus, in this paper, we do not assume that given text streams are always synchronous. Instead, we deal with text streams that share common topics yet are temporally asynchronous.
[Figure 1 plots: (a) Before synchronization, (b) After synchronization. Each panel shows Relative Frequency against Year (1992 to 2006) for the terms warehouse and mining in SIGMOD and TKDE.]
Figure 1: An illustrative example of the asynchronism between two text streams and how it is eliminated by our method.
We naturally expect multiple correlated streams to facilitate topic mining. However, the asynchronism among streams brings new challenges to conventional topic mining methods. As shown in Fig. 1(a), we may fail to discover the topics of data mining and/or data warehouse since they are relatively weak in each individual stream and the bursts in the two streams do not coincide. On the other hand, as shown in Fig. 1(b), after adjusting the timestamps of documents in the two streams using our proposed method, the relative frequencies of both warehouse and mining are boosted over a certain range of time. This suggests that eliminating asynchronism can significantly benefit the topic discovery process. However, as desirable as it is for topic discovery to detect the temporal asynchronism among streams and eventually synchronize them, the task is difficult without knowing beforehand the topics to which the documents belong. A naïve solution is to use a coarse granularity for the timestamps of streams so that the asynchronism among streams can be smoothed out. This is unsatisfactory, as it may lead to an unbearable loss of temporal information about common topics, and different topics would inevitably be mixed up. A second way, shifting or scaling the time dimension manually and empirically, may not work either, because the time difference of topics among different streams can vary largely and irregularly, and we can never have enough prior knowledge of it.
In this paper, we target the problem of mining common
topics from multiple asynchronous text streams and pro-
pose an effective method to solve it. We formally define
the problem by introducing a principled probabilistic frame-
work, based on which a unified objective function can be
derived. Then we put forward an algorithm to optimize this
objective function by exploiting the mutual impact between
topic discovery and time synchronization.
The key idea of our approach is to utilize the semantic
and temporal correlation among streams and to build up a
mutual reinforcement process. We start with extracting a
set of common topics from given streams using their orig-
inal timestamps. Based on the extracted topics and their
word distributions, we update the timestamps of documents in all streams by assigning them to the most relevant topics. This step reduces the asynchronism among streams. Then, after synchronization, we refine the common topics according to the new timestamps. These two steps are repeated alternately to maximize a unified objective function, which provably converges monotonically.
Besides theoretical justification, our method was also evaluated empirically on two real-world data sets. The first is a collection of 6 literature streams consisting of research papers on database technology from 1975 to 2006, and the second contains 2 news streams of 61 days' news articles between April 1 and May 31, 2007. We show that our
method is able to detect and eliminate the underlying asyn-
chronism among different streams and effectively discover
meaningful and highly discriminative common topics.
To sum up, the main contributions of our work are:
• We address the problem of mining common topics from multiple asynchronous text streams. To the best of our knowledge, this is the first attempt to solve this problem.
• We formalize our problem by introducing a principled probabilistic framework and propose an objective function for our problem.
• We develop a novel alternating optimization algorithm to solve the objective function with a theoretically guaranteed (local) optimum.
• The effectiveness and advantage of our method are validated by an extensive empirical study on two real-world data sets.
The rest of the paper is organized as follows: related work
is briefly discussed in Section 2; we formalize our problem
and propose a generative model with a unified objective
function in Section 3; we show how to optimize the objec-
tive function in Section 4; empirical results are presented in
Section 5; we conclude our work in Section 6.
2. RELATED WORK
Topic mining has been extensively studied in the literature, starting with the Topic Detection and Tracking (TDT) project [1, 17], which aimed to find and track topics (events) in news streams with clustering-based techniques. Later on, probabilistic generative models were introduced, such as Probabilistic Latent Semantic Analysis (PLSA) [6], Latent Dirichlet Allocation (LDA) [4] and their derivatives [2, 9, 13].
Table 1: Symbols and their meanings
Symbol   Description
d        document
t        timestamp
w        word
z        topic
M        number of streams
T        number of different timestamps
V        number of different words
K        number of topics
In many real applications, text collections carry generic
temporal information and thus can be considered as text
streams. To capture the temporal dynamics of topics, var-
ious methods have been proposed to discover topics over
time in text streams [3, 5, 7, 8, 10, 11, 12, 14, 15]. How-
ever, these methods were designed to extract topics from a
single stream. For example, [10, 15] adopted a generative model in which the timestamps of individual documents were modeled with a random variable, either discrete or continuous, and the timestamp of a document was assumed to be generated conditionally independently of its words. In [3], the authors introduced hyper-parameters that evolve over time in state transfer models in the stream. For each time slice, a hyper-parameter is assigned a state by a probability distribution, given the state on the former time slice. In [12], the time dimension of the stream was cut into time slices and topics were discovered from documents in each slice independently. As a result, in multiple-stream cases, topics in each stream can only be estimated separately, and the potential correlation between topics in different streams, both semantic and temporal, cannot be fully explored. In [2, 9, 13], the semantic correlation between different topics in static text collections was considered. Similarly, [18] explored common topics in multiple static text collections.
A very recent work by Wang et al. [16] first proposed a topic mining method that aimed to discover common (bursty) topics over multiple text streams. Their approach is different from ours because they tried to find topics that shared a common time distribution over different streams by assuming that the streams were synchronous, or coordinated. Based on this premise, documents with the same timestamps are combined together over different streams so that the word distributions of topics in individual streams can be discovered. In contrast, in our work, we aim to find topics that are common in semantics, while having asynchronous time distributions in different streams.
3. PROBLEM AND OBJECTIVE FUNCTION
In this section, we formally define our problem of mining
common topics from multiple asynchronous text streams.
We introduce a generative topic model which incorporates
both temporal and semantic information in given text streams.
We derive our objective function, which is to maximize the likelihood subject to certain constraints. The main symbols used throughout the paper are listed in Table 1.
First of all, we define text stream as follows:
Definition 1 (Text Stream). A text stream S is a sequence of N documents (d1,...,dN). Each document d is a collection of words over vocabulary V and indexed by a unique timestamp t ∈ {1,...,T}.
Note that in our definition, we allow multiple documents in the same stream to share a common timestamp, which is usually the case in real applications.
Figure 2: An illustration of our generative model. Shaded nodes mean observable variables while white nodes mean unobservable variables. Arrows indicate the generation relationship.
Given M text streams, we aim to extract K common topics from them (K is given by users), which are defined as:
Definition 2 (Common Topic). A common topic Z over text streams is defined by a word distribution over vocabulary V and a time distribution over timestamps {1,...,T}.
To find common topics {Zk : 1 ≤ k ≤ K} over text streams {Sm : 1 ≤ m ≤ M}, we put forward a novel generative model, derived from the topic model family that has been widely used in topic mining tasks. Our generative model is able to capture the interaction between temporal and semantic information of topics and, as shown later, this interaction can be used to extract common topics from asynchronous streams with an alternating optimization process.
The documents {d ∈ Sm : 1 ≤ m ≤ M} are modeled by a discrete random variable d. The words are modeled by a discrete random variable w over vocabulary V. The timestamps are modeled by a discrete random variable t over {1,...,T}. Finally, the common topics Z are encoded by a discrete random variable z ∈ {1,2,...,K}. Note that semantic information of a topic is encoded by the conditional distribution p(w|z) and its temporal information by p(z|t).
The generating process is as follows (also see Fig. 2):
1. Pick a document d with probability p(d).
2. Given the document d, pick a timestamp t with probability p(t|d) ∼ Mult(η,{0,1}), which is a multinomial distribution with parameter η whose values p(t|d) are either 0 or 1. This means that a given document has one and only one timestamp.
3. Given the timestamp t, pick a common topic z with
probability p(z|t) ∼ Mult(θ).
4. Given the topic z, pick a word w with probability
p(w|z) ∼ Mult(ϕ).
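The four steps can be sketched in Python as a small sampler. This is only an illustration under stated assumptions: the function names `generate_word` and `generate_corpus`, the list-of-lists layout of the distributions, and the 0-based indexing of documents, timestamps, topics and words are all hypothetical choices, not part of the paper.

```python
import random

def generate_word(d, doc_time, p_z_given_t, p_w_given_z, rng):
    """Steps 2-4 for a document d that has already been picked in step 1."""
    # Step 2: p(t|d) is 0/1, so the timestamp is fully determined by d.
    t = doc_time[d]
    # Step 3: draw a topic from the multinomial p(z|t).
    z = rng.choices(range(len(p_z_given_t[t])), weights=p_z_given_t[t])[0]
    # Step 4: draw a word from the multinomial p(w|z).
    w = rng.choices(range(len(p_w_given_z[z])), weights=p_w_given_z[z])[0]
    return t, z, w

def generate_corpus(n_words, p_d, doc_time, p_z_given_t, p_w_given_z, seed=0):
    """Repeat steps 1-4 n_words times; returns a list of (d, t, z, w) tuples."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(n_words):
        d = rng.choices(range(len(p_d)), weights=p_d)[0]  # step 1: pick d
        corpus.append((d,) + generate_word(d, doc_time, p_z_given_t,
                                           p_w_given_z, rng))
    return corpus
```

In this sketch only d and w would be observed in real data together with the recorded timestamp; z is the latent variable that the inference procedure has to recover.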
According to the generative process, the probability of word w in document d is
\[ p(w,d) = \sum_{t,z} p(d)\,p(t|d)\,p(z|t)\,p(w|z). \]
Consequently, the log-likelihood function over all streams is
\[ L = \sum_{w} \sum_{d} c(w,d) \log p(w,d), \]
where c(w,d) is the number of occurrences of word w in
document d.
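As a rough illustration, L can be evaluated directly from word counts once each document has a single timestamp. The sketch below assumes a hypothetical data layout (a dict of word counts per document, nested lists for the distributions, 0-based indices) and uses the fact that a 0/1 p(t|d) collapses the sum over t to the document's own timestamp.

```python
import math

def log_likelihood(counts, doc_time, p_d, p_z_given_t, p_w_given_z):
    """L = sum_{w,d} c(w,d) log p(w,d), with p(t|d) one-hot at doc_time[d].

    counts[d] is a dict {word index: c(w,d)}.
    """
    total = 0.0
    n_topics = len(p_w_given_z)
    for d, word_counts in enumerate(counts):
        t = doc_time[d]
        for w, c in word_counts.items():
            # sum_z p(z|t) p(w|z): the only surviving term of the sum over t
            mix = sum(p_z_given_t[t][z] * p_w_given_z[z][w]
                      for z in range(n_topics))
            total += c * math.log(p_d[d] * mix)
    return total
```

Changing an entry of `doc_time` and re-evaluating shows how L depends on the timestamp assignment.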
Conventional methods on topic mining try to maximize
the likelihood function L by adjusting p(z|t) and p(w|z)
while assuming p(t|d) is known. However, in our work, we
need to consider the potential asynchronism among different
streams, i.e., p(t|d) is also to be determined. Thus, besides finding the optimal p(z|t) and p(w|z), we also need to decide p(t|d) to further maximize L. In other words, we want to reassign each document with timestamp t to a new timestamp g(t) by determining its relevance to the respective topics, so that we can obtain a larger L or, equivalently, topics of better quality.
Note that the mapping from t to g(t) is not arbitrary.
By the term asynchronism, we refer to the time distortion
among different streams. The relative temporal order within
each individual stream is still considered meaningful and
generally correct (otherwise the current temporal informa-
tion in the streams will be discarded and the problem would
reduce to mining topics from a collection of texts, not text
streams). Therefore, during each synchronization step, we preserve the relative temporal order of documents in each individual stream, i.e., a document with an earlier timestamp before adjustment will never be assigned a later timestamp after adjustment than its successors. This constraint aims to protect local temporal information within each individual stream while fixing the asynchronism among different streams. Formally, given two documents d1 and d2 in the same stream, with current timestamps t1 and t2, we require that:
g(t1) ≤ g(t2) iff t1 ≤ t2.
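A small check of this constraint for one stream might look as follows. The sketch makes an interpretive assumption that should be read as such: it treats the constraint as "a document that was strictly earlier never ends up strictly later", and lets documents that share an original timestamp move independently. The function name `preserves_order` is hypothetical.

```python
def preserves_order(old_times, new_times):
    """True if t1 < t2 implies g(t1) <= g(t2) for all document pairs.

    old_times[d] and new_times[d] are the timestamps of document d in one
    stream before and after adjustment.
    """
    earliest, latest = {}, {}
    for t, g in zip(old_times, new_times):
        earliest[t] = min(g, earliest.get(t, g))
        latest[t] = max(g, latest.get(t, g))
    running_max = None  # latest new timestamp among strictly earlier groups
    for t in sorted(earliest):
        if running_max is not None and earliest[t] < running_max:
            return False
        running_max = latest[t] if running_max is None else max(running_max, latest[t])
    return True
```

For example, mapping documents at times 3 and 5 both to time 4 passes the check, while mapping them to 5 and 4 does not.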
In sum we have:
Definition 3 (Asynchronism). Given M text streams {Sm : 1 ≤ m ≤ M}, in which documents are indexed by timestamps {t : 1 ≤ t ≤ T}, asynchronism means that the timestamps of the documents sharing the same topic in different streams are not properly aligned. However, it does not involve the relative temporal order between documents within the same stream.
Finally, our objective is to maximize the likelihood function L by adjusting p(z|t) and p(w|z) as well as p(t|d), subject to the constraint of preserving the temporal order within each stream. Formally:
\[ \arg\max_{p(t|d),\,p(z|t),\,p(w|z)} L, \quad \text{s.t. } \forall d_1, d_2 \in S_m,\; g(t_1) \le g(t_2) \text{ iff } t_1 \le t_2, \tag{1} \]
for 1 ≤ m ≤ M, where t1 and t2 are the current timestamps of d1 and d2, respectively, and g(t1) and g(t2) are the timestamps after adjustment.
4. ALGORITHM
In this section we show how to solve our objective function
in Eq.(1) through an alternate (constrained) optimization
scheme. The outline of our algorithm is:
Step 1 We assume the current timestamps of streams are
synchronous and extract common topics from them.
Step 2 We synchronize the timestamps of all documents by matching them to the most related topics, respectively. Then we go back to Step 1 until convergence.
4.1 Topic Extraction
First we assume the current timestamps of all streams are
already synchronous and extract common topics from them.
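In other words, now p(t|d) is fixed and we try to maximize the likelihood function by adjusting p(z|t) and p(w|z). Thus we can rewrite the likelihood function as follows:
\[ \begin{aligned}
L &= \sum_{w}\sum_{d} c(w,d) \log \sum_{t}\sum_{z} p(d)\,p(t|d)\,p(z|t)\,p(w|z) \\
  &= \sum_{w}\sum_{d} c(w,d) \log p(d) + \sum_{w}\sum_{d} c(w,d) \log \sum_{t} p(t|d) \sum_{z} p(z|t)\,p(w|z).
\end{aligned} \]
Since p(t|d) ∼ Mult(η,{0,1}), the above equation can be reduced, up to the constant first term, to
\[ \begin{aligned}
L &= \sum_{w}\sum_{d}\sum_{t} c(w,d,t) \log \sum_{z} p(z|t)\,p(w|z) \\
  &= \sum_{w}\sum_{t} c(w,t) \log \sum_{z} p(z|t)\,p(w|z).
\end{aligned} \tag{2} \]
Here c(w,d,t) denotes the number of occurrences of word w in document d at time t, c(w,t) = Σ_d c(w,d,t) is the corresponding count over all documents at time t, and the term involving p(d) is dropped because it can be considered as a constant in the formula [6].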
Eq.(2) can be solved by the well-established EM algorithm [6]. The E-step is:
\[ p(z|w,t) = \frac{p(z|t)\,p(w|z)}{\sum_{z'} p(z'|t)\,p(w|z')}, \tag{3} \]
and the M-step is:
\[ p(z|t) = \frac{\sum_{w} c(w,t)\,p(z|w,t)}{\sum_{z'}\sum_{w} c(w,t)\,p(z'|w,t)}, \qquad
   p(w|z) = \frac{\sum_{t} c(w,t)\,p(z|w,t)}{\sum_{w'}\sum_{t} c(w',t)\,p(z|w',t)}. \tag{4} \]
The E- and M-steps repeat alternately and our objective function will converge to a local optimum after finite rounds.
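The E- and M-step updates can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the names `em_topics`, `objective` and `normalize`, the dense `c_wt[t][w]` count layout, the random initialization and the fixed iteration count (instead of a convergence test) are all assumptions made for the example.

```python
import math
import random

def normalize(row):
    total = sum(row)
    return [x / total for x in row]

def objective(c_wt, p_z_t, p_w_z):
    """Eq. (2): sum_{w,t} c(w,t) log sum_z p(z|t) p(w|z)."""
    value = 0.0
    for t, row in enumerate(c_wt):
        for w, count in enumerate(row):
            if count:
                mix = sum(pz * p_w_z[z][w] for z, pz in enumerate(p_z_t[t]))
                value += count * math.log(mix)
    return value

def em_topics(c_wt, K, iters=50, seed=0):
    """c_wt[t][w] holds c(w,t). Returns (p_z_t, p_w_z) as nested lists."""
    rng = random.Random(seed)
    T, V = len(c_wt), len(c_wt[0])
    p_z_t = [normalize([rng.random() + 0.1 for _ in range(K)]) for _ in range(T)]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(V)]) for _ in range(K)]
    for _ in range(iters):
        acc_zt = [[0.0] * K for _ in range(T)]  # sum_w c(w,t) p(z|w,t)
        acc_wz = [[0.0] * V for _ in range(K)]  # sum_t c(w,t) p(z|w,t)
        for t in range(T):
            for w in range(V):
                if not c_wt[t][w]:
                    continue
                joint = [p_z_t[t][z] * p_w_z[z][w] for z in range(K)]
                norm = sum(joint)
                for z in range(K):
                    resp = joint[z] / norm  # E-step, Eq. (3)
                    acc_zt[t][z] += c_wt[t][w] * resp
                    acc_wz[z][w] += c_wt[t][w] * resp
        # M-step, Eq. (4); rows that received no mass keep their old values.
        p_z_t = [normalize(r) if sum(r) > 0 else p_z_t[t]
                 for t, r in enumerate(acc_zt)]
        p_w_z = [normalize(r) if sum(r) > 0 else p_w_z[z]
                 for z, r in enumerate(acc_wz)]
    return p_z_t, p_w_z
```

Calling `objective` on the result after different numbers of iterations (with the same seed) is a simple way to observe that the value of Eq.(2) does not decrease.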
4.2 Time Synchronization
Once the common topics are extracted, we match docu-
ments in all streams to these topics and adjust their times-
tamps to synchronize the streams.
Specifically, p(z|t) and p(w|z) are now assumed to be known and we try to adjust p(t|d) to maximize our objective function. Given document d, we denote its current timestamp by t and its timestamp after adjustment by g(t). Then our objective function in Eq.(1) can be rewritten as:
\[ \arg\max_{g(t)} \sum_{m=1}^{M} \sum_{w} \sum_{s=1}^{T} Q(w,s) \sum_{\{d \in S_m :\, g(t)=s\}} c(w,d), \quad \text{s.t. } \forall d_1,d_2 \in S_m,\; g(t_1) \le g(t_2) \text{ iff } t_1 \le t_2, \tag{5} \]
where Q(w,s) = log Σ_z p(z|s)p(w|z). Since the sum over streams in Eq.(5) is separable and the constraint applies within each stream only, we can solve Eq.(5) by solving the following objective function for each stream separately:
\[ \max_{g(t)} \sum_{w} \sum_{s=1}^{T} Q(w,s) \sum_{\{d :\, g(t)=s\}} c(w,d), \quad \text{s.t. } \forall d_1,d_2,\; g(t_1) \le g(t_2) \text{ iff } t_1 \le t_2. \tag{6} \]
Then p(t|d) is determined by p(t = g(t)|d) = 1 and p(t ≠ g(t)|d) = 0.
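To make the role of Q(w,s) concrete, the sketch below tabulates Q from given topic distributions and evaluates the per-stream objective of Eq.(6) for one candidate assignment of new timestamps. The names `q_table` and `sync_objective`, the dict-of-counts document layout and the 0-based indices are assumptions for illustration; the order constraint is not checked here.

```python
import math

def q_table(p_z_given_s, p_w_given_z):
    """Q[s][w] = log sum_z p(z|s) p(w|z); -inf where the mixture is zero."""
    n_words = len(p_w_given_z[0])
    table = []
    for p_z in p_z_given_s:
        row = []
        for w in range(n_words):
            mix = sum(pz * p_w_given_z[z][w] for z, pz in enumerate(p_z))
            row.append(math.log(mix) if mix > 0 else float("-inf"))
        table.append(row)
    return table

def sync_objective(docs, new_times, Q):
    """Eq. (6) for one stream: each document d placed at s = new_times[d]
    contributes sum_w Q(w,s) c(w,d). docs[d] is a dict {word: c(w,d)}."""
    return sum(count * Q[s][w]
               for doc, s in zip(docs, new_times)
               for w, count in doc.items())
```

A document scores higher at a timestamp whose topic mixture explains its words well, which is what drives the reassignment of timestamps.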
Next we define the following function:
\[ H(1{:}i,\,1{:}j) = \max_{g(t)} \sum_{w} \sum_{s=1}^{j} Q(w,s) \sum_{r=1}^{i} \sum_{d(r,s)} c(w,d), \]
where 1 ≤ i,j ≤ T. Here d(r,s) denotes the set of all documents whose timestamps are changed from r to s, i.e., {d : t = r,g(t) = s}. It is easy to see that our objective function in Eq.(6) equals H(1 : T,1 : T).
Then we show how to compute H(1 : T,1 : T) recursively. The basic idea behind our approach is as follows: suppose we already have j timestamps {1,...,j} and the documents whose current timestamps range from 1 to i−1, i.e., {d : 1 ≤ t ≤ i−1}; then, given the documents whose current timestamp is i, according to our constraint, their new timestamps must be no smaller than the new timestamps of the documents in {d : 1 ≤ t ≤ i − 1}. Thus if the smallest new timestamp of the documents in {d : t = i} is a, then the documents in {d : 1 ≤ t ≤ i−1} can only be matched to timestamps from 1 to a. So we can enumerate all possible matchings for 1 ≤ a ≤ j to find an optimal a for H(1 : i,1 : j). Formally, we have
\[ \begin{aligned}
H(1{:}T,\,1{:}T)
&= \max_{g(t)} \sum_{w} \sum_{s=1}^{T} Q(w,s) \Big( \sum_{r=1}^{T-1} \sum_{d(r,s)} c(w,d) + \sum_{d(T,s)} c(w,d) \Big) \\
&= \max_{1 \le a \le T} \max_{g(t)} \sum_{w} \Big( \sum_{s=1}^{a} Q(w,s) \sum_{r=1}^{T-1} \sum_{d(r,s)} c(w,d) + \sum_{s=a}^{T} Q(w,s) \sum_{d(T,s)} c(w,d) \Big) \\
&= \max_{1 \le a \le T} \big( H(1{:}(T-1),\,1{:}a) + \delta(T;\,a{:}T) \big),
\end{aligned} \]
where the second term equals
\[ \delta(r;\,a{:}T) = \sum_{\{d :\, t=r\}} \max_{a \le s \le T} \sum_{w} Q(w,s)\,c(w,d), \]
for 1 ≤ r ≤ T, and the first term can be computed recursively as
\[ H(1{:}i,\,1{:}j) = \max_{1 \le a \le j} \big( H(1{:}(i-1),\,1{:}a) + \delta(i;\,a{:}j) \big) \tag{7} \]
for 2 ≤ i ≤ T and 1 ≤ j ≤ T. In particular, we have
\[ H(1{:}1,\,1{:}a) = \sum_{\{d :\, t=1\}} \max_{1 \le s \le a} \sum_{w} Q(w,s)\,c(w,d) \]
for 1 ≤ a ≤ T. Once H(1 : T,1 : T) has been computed recursively, it gives the global optimum of our objective function in Eq.(6).
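The recursion can be sketched as a small dynamic program for a single stream. This is an unoptimized illustration under assumptions: timestamps are 0-based, each document's score at every candidate timestamp, i.e. Σ_w Q(w,s)c(w,d), is precomputed in `doc_scores[d][s]`, and the function name `synchronize_stream` and the backtracking step that recovers the new timestamps are additions for the example.

```python
def synchronize_stream(doc_times, doc_scores, T):
    """Return (H(1:T, 1:T), new timestamp of every document).

    doc_times[d] in 0..T-1 is the current timestamp of document d;
    doc_scores[d][s] = sum_w Q(w,s) c(w,d).
    """
    by_time = [[] for _ in range(T)]
    for d, t in enumerate(doc_times):
        by_time[t].append(d)

    def delta(i, a, j):
        # Each document at old time i independently takes its best slot in [a, j].
        return sum(max(doc_scores[d][a:j + 1]) for d in by_time[i])

    H = [[0.0] * T for _ in range(T)]
    arg = [[0] * T for _ in range(T)]
    for j in range(T):                      # base case H(1:1, 1:j)
        H[0][j] = delta(0, 0, j)
    for i in range(1, T):                   # recursion of Eq. (7)
        for j in range(T):
            best_a = max(range(j + 1),
                         key=lambda a: H[i - 1][a] + delta(i, a, j))
            arg[i][j] = best_a
            H[i][j] = H[i - 1][best_a] + delta(i, best_a, j)

    # Backtrack: documents at old time i are placed in [a, j], then j shrinks to a.
    new_times = [0] * len(doc_times)
    j = T - 1
    for i in range(T - 1, -1, -1):
        a = arg[i][j] if i > 0 else 0
        for d in by_time[i]:
            window = doc_scores[d][a:j + 1]
            new_times[d] = a + window.index(max(window))
        j = a
    return H[T - 1][T - 1], new_times
```

Because `delta` is recomputed for every (i, a, j), the sketch is slower than necessary; caching the per-document window maxima would be a natural refinement.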
Our algorithm is summarized in Algorithm 1. K is the number of topics and is specified by users. The initial values of p(t|d) and c(w,d,t) are counted from the original timestamps in the streams.
The computational complexity of the topic extraction step (with the EM algorithm) is O(KVT), while the complexity of the time synchronization step is approximately O(VMT^3). Thus the overall complexity of our algorithm is O(VT(K + MT^2)), where V is the size of the vocabulary, T the number of different timestamps, K the number of topics and M the number of streams. If we take V, K and M as constants and only consider the length of the stream, which is T, the complexity of Algorithm 1 becomes O(T^3). We will show in Section 4.3 how to reduce it to O(T^2) with a local search strategy.
Algorithm 1: Topic mining with time synchronization
Input: K, p(t|d), c(w,d,t);
Output: p(w|z), p(z|t), p(t|d);
repeat
    Update c(w,t) with p(t|d) and c(w,d,t);
    Initialize p(z|t) and p(w|z) with random values;
    repeat
        Update p(z|t) and p(w|z) following Eq.(3) and (4);
    until Convergence;
    for m = 1 to M do
        for j = 1 to T do Initialize H(1 : 1,1 : j);
        for i = 2 to T do
            for j = 1 to T do
                Compute H(1 : i,1 : j) as shown in Eq.(7);
            end
        end
        Update p(t|d);
    end
until Convergence;
4.3 Remarks
Constraint on Time Synchronization. During each
synchronization step, the constraint in Eq.(6) requires that
a document with an earlier timestamp can only be assigned
to an earlier timestamp, as compared to its successors in the
same stream. At first glance, this may seem too strict because the original temporal order of given text streams cannot be perfect. However, the constraint in our algorithm is much more tolerant than it appears to be. Specifically, after several iterations, it is possible that two adjacent documents swap their positions along the time dimension. For instance, suppose we have document d1 with timestamp 3 and d2 with timestamp 5. After the first round of synchronization, both d1 and d2 are mapped to time 4. Now we use 4 as the input value for both d1 and d2; thus in the following round, it is possible that d2 would be assigned to an earlier timestamp than d1, without violating our constraint. As we
will show later in the experimental results, in practice, doc-
uments tend to find new timestamps in the neighborhoods
of their original positions and local swapping of documents’
positions often happens, which can empirically justify the
flexibility and robustness of our method.
Convergence. Since both steps in our algorithm guarantee a monotone improvement of our objective function in Eq.(1), the algorithm will converge to a local optimum after a finite number of iterations. Note that there is a trivial solution to the objective function, which is to assign all documents to a single (arbitrary) timestamp, and our algorithm would terminate at this local optimum. This local optimum is meaningless since it is equivalent to discarding all temporal information of the text streams and treating them as a collection of documents. Nevertheless, this trivial solution only exists theoretically. In practice, our algorithm will not converge to this trivial solution, as long as we use the original timestamps of text streams as initial values and have K > 1, where K is the number of topics. As shown in Section 5, the adjusted timestamps of documents always converge to more than K different time points.
The Local Search Strategy. In some real-world appli-