
Multidimensional Mining of Large-Scale Search Logs: A Topic-Concept Cube Approach

Dongyeop Kang1∗, Daxin Jiang2, Jian Pei3, Zhen Liao4, Xiaohui Sun2, Ho-Jin Choi1

1Korea Advanced Institute of Science and Technology, 2Microsoft Research Asia, 3Simon Fraser University, 4Nankai University

Email: 1{dykang, hojinc}@kaist.ac.kr, 2{djiang, xiaos}@microsoft.com, 3jpei@cs.sfu.ca, 4liaozhen@mail.nankai.edu.cn

ABSTRACT

In addition to search queries and the corresponding click-

through information, search engine logs record multidimen-

sional information about user search activities, such as

search time, location, vertical, and search device. Multi-

dimensional mining of search logs can provide novel insights

and useful knowledge for both search engine users and de-

velopers. In this paper, we describe our topic-concept cube

project, which addresses the business need of supporting

multidimensional mining of search logs effectively and ef-

ficiently. We address two challenges. First, search queries

and click-through data are well recognized as sparse, and thus

have to be aggregated properly for effective analysis. Sec-

ond, there is often a gap between the topic hierarchies in

multidimensional aggregate analysis and queries in search

logs. To address those challenges, we develop a novel topic-

concept model that learns a hierarchy of concepts and top-

ics automatically from search logs. Enabled by the topic-

concept model, we construct a topic-concept cube that sup-

ports online multidimensional mining of search log data. A

distinct feature of our approach is that, in addition to the

standard dimensions such as time and location, our topic-

concept cube has a dimension of topics and concepts, which

substantially facilitates the analysis of log data. To handle a

huge amount of log data, we develop distributed algorithms

for learning model parameters efficiently. We also devise ap-

proaches to computing a topic-concept cube. We report an

empirical study verifying the effectiveness and efficiency of

our approach on a real data set of 1.96 billion queries and

2.73 billion clicks.

Categories and Subject Descriptors

H.2.8 [Database Management]: Database Applications—

Data Mining

∗The work was done when Dongyeop Kang was an intern at

Microsoft Research Asia.

Permission to make digital or hard copies of all or part of this work for

personal or classroom use is granted without fee provided that copies are

not made or distributed for profit or commercial advantage and that copies

bear this notice and the full citation on the first page. To copy otherwise, to

republish, to post on servers or to redistribute to lists, requires prior specific

permission and/or a fee.

WSDM’11, February 9–12, 2011, Hong Kong, China.

Copyright 2011 ACM 978-1-4503-0493-1/11/02 ...$10.00.

General Terms

Algorithms, Design, Experimentation

Keywords

OLAP, search log, topic-concept cube

1. INTRODUCTION

Search logs in search engines record rich information about

user search activities. In addition to search queries and the

corresponding click-through information, the related infor-

mation is also recorded on multiple attributes, such as search

time, location, vertical, and search device. Multidimensional

mining of such rich search logs can provide novel insights and

useful knowledge for both search engine users and develop-

ers. Let us consider two multidimensional analysis tasks.

A multidimensional lookup (lookup for short) spec-

ifies a subset of user queries and clicks using multidimen-

sional constraints such as time, location and general topics,

and requests the aggregation of the user search activities.

For example, by looking up “the top-5 electronics that were

most popularly searched by the users in the US in Decem-

ber 2009”, a business analyst can know the common inter-

ests of search engine users on topic “Electronics”. Moreover,

search engine developers can use the results from the lookup

to improve query suggestion, document ranking, and spon-

sored search. Multidimensional lookups can be extended in

many ways to achieve advanced business intelligence analy-

sis. For example, using multiple lookups with different mul-

tidimensional constraints, one may compare the major in-

terests about electronics from users in different regions such

as the US, Asia, and Europe.

A multidimensional reverse lookup (reverse lookup

for short) is concerned about the multidimensional group-

bys where one specific object is intensively queried. For example, using reverse lookup “What are the group-bys in

time and region where Apple iPad was popularly searched

for?”, an iPad accessory manufacturer can find the regions

where the accessories may have a good market. Using the

results from the reverse lookup, a search engine can improve

its service by, for example, locality-sensitive search. Again,

reverse lookups can be used to compose advanced business

intelligence analysis. For example, by organizing the results

from the reverse lookup about iPad, one may keep track of

how iPad becomes popular in time and in region, and also

compare the trend of iPad with those of iPod and iPhone.

This is interesting to both business parties and users.

(a) individual queries    (b) concepts
1  ipad                   1  ipad
2  apple ipad             2  kindle
3  ipad 32g               3  iphone
4  kindle                 4  xbox 360
5  amazon kindle          5  wii

Table 1: Answers to “the top-5 electronics that were most popularly searched by the users in the US in December 2009” by (a) individual queries and (b) concepts.

As search engines have accumulated rich log data, it

becomes more and more important to develop a service

that supports multidimensional mining of search logs effec-

tively and efficiently. To answer multidimensional analyti-

cal queries online, a data warehousing approach is a natural

choice, which pre-computes all multidimensional aggregates

offline. However, traditional data warehousing approaches

only explore a series of statistical aggregates such as MIN,

MAX, and AVG; they cannot summarize the semantic infor-

mation of user queries and clicks. In particular, multidi-

mensional analysis on search log data presents two special

challenges.

Challenge 1: sparseness of queries in log data.

Queries in search engine logs are usually very sparse, since

users may formulate different queries for the same informa-

tion need [9]. For example, to search for Apple iPad, users

may issue queries such as “ipad”, “apple ipad”, “ipad 32g”,

“i pad apple”, and so on. Aggregating only on individual

queries cannot summarize user information needs recorded

in logs comprehensively. For example, when a business ana-

lyst asks for “the top-5 electronics that were most popularly

searched by the users in the US in December 2009”, a naïve

method may simply count the frequency of the queries in

the topic of “Electronics” and return the top-5 most fre-

quently asked queries. Due to the sparseness of queries in

the logs, the analyst may get an answer with many redun-

dant queries, such as the one shown in Table 1(a). Instead,

if we can summarize various query formulations of the same

information need and provide non-duplicate answers (e.g.,

Table 1(b)), the user experience can be improved greatly.

Similarly, in reverse lookup, when an iPad accessory manu-

facturer asks the question “What are the group-bys in time

and region where Apple iPad was popularly searched for?”,

the system should consider not only aggregates of the query

“Apple iPad” but also its various formulations. To address

the sparseness of log data, we have to aggregate queries and

click-through data in logs.

Challenge 2: mismatching between topic hierar-

chies used in analytics and learned from log data.

More often than not, people use different topic hierarchies

in searching detailed information and summarizing analytic

information. For example, when users search electronics on

the web, often the queries are about specific products, brand

names, or features. A query topic hierarchy automatically

learned from log data in a data-driven way depends on the

distribution and occurrences of such queries. “Apple prod-

ucts” may be a popular topic. When an analyst explores

a huge amount of log data, she may bear in mind a product taxonomy (e.g., a well-adopted ontology), with TV & video, audio, mobile phones, cameras & camcorders, computers, and so on as the first-level categories. The an-

alytic topic hierarchy may be very different from the query

topic hierarchy learned from log data.

For example, “Apple products” in the query topic hierarchy corresponds

to multiple topics in the analytic topic hierarchy. This mis-

matching in topic hierarchies is partly due to the different

information needs in web search and web log data analysis.

Web searches often opt for detailed information, while web

log analysis usually tries to summarize and characterize pop-

ular user behavior patterns. To bridge the gap, we need to

map the aggregates from logs to an analytic topic hierarchy.

In this paper, we describe our topic-concept cube project

that builds a multidimensional service on search log data.

We make the following contributions.

First, we tackle the sparseness of queries in logs and the

gap between concept taxonomy in analytics and queries in

logs by a novel concept-topic model. Figure 1 illustrates our

ideas. We first mine click-through information in search logs

and group similar queries into concepts. Intuitively, users

with the same information need tend to click on the same

URLs. Therefore, various query formulations, for example,

of Apple iPad, such as “ipad”, “apple ipad”, “ipad 32g”, and “i

pad apple”, can be grouped into the same concept, since all

of them lead to clicks on the web page www.apple.com/ipad.

More interestingly, some misspelled queries, such as “apple

ipda” and “apple ipade”, can also be clustered into this con-

cept, since they also lead to clicks on the ipad page. Once we

summarize queries and clicks into concepts, we will answer

lookups and reverse lookups by concepts instead of individ-

ual queries. For each concept, we use the most frequently

asked query as the representative of the concept. In this

way, we can effectively avoid redundant queries in lookup

answers. At the same time, we can effectively cover all rel-

evant queries in reverse lookup answers.

Our concept-topic model further maps concepts to topics

in a given taxonomy, which is essentially a query classifica-

tion problem. For example, suppose a concept consists of

queries “apple ipad”, “ipad 32g”, etc., we classify them into

the topic “Electronics”. Compared with classifying individ-

ual queries to topics, mapping concepts has several advan-

tages. For example, for a misspelled query “apple ipda”, the

classification problem becomes much easier once we know

that this query belongs to a concept that also contains other

queries such as “apple ipad”. Moreover, through the content

of the web pages that are commonly clicked as answers to the

queries in the concept, we may further enrich the features

to classify “apple ipda”.

Our concept-topic model provides the “semantic” aggre-

gates for search log data. Those concepts and topics not

only provide us a meaningful way to answer lookups and re-

verse lookups, but also serve as an important dimension for

multidimensional analysis and exploration.

Second, to handle large volumes of search log data, which

may contain billions of queries and clicks, we develop dis-

tributed algorithms to learn the topic-concept models effi-

ciently. In particular, we develop a strategy to initialize the

model parameters such that each machine only needs to hold

a subset of parameters much smaller than the whole set.

Third, to serve online multidimensional mining of search

log data, we build a topic-concept cube. In addition to the standard dimensions such as time and location, a topic-concept cube has a dimension of topics and concepts, such as “electronics” and “Apple iPad” used in the lookup and reverse lookup examples.

Figure 1: The hierarchy of topics, concepts, queries, and clicks (topic taxonomy at the top, concepts in the middle, queries and clicks at the bottom).

Uid  Time Stamp    Location           Type   Value
U1   100605110843  Seattle, WA, US    Query  “wsdm 2011”
U2   100605110843  Vancouver, BC, CA  Query  “you tube”
U1   100605110846  Seattle, WA, US    Click  wsdm2011.org
...  ...           ...                ...    ...

Figure 2: A search log as a stream of query and click events with multidimensional information.

We devise effective approaches for

computing a topic-concept cube. In particular, queries are

assigned to a hierarchy of concepts and topics in the mate-

rialization of the cube.

Finally, we conduct extensive experiments on a real log

data set containing 1.96 billion queries and 2.73 billion clicks.

We examine the effectiveness of the topic-concept model as

well as the efficiency and scalability of our training algo-

rithms. We also demonstrate several concrete examples of

lookups and reverse lookups answered by our topic-concept

cube system. The experimental results clearly show that our

approach is effective and efficient.

The rest of the paper is organized as follows. We present

the framework of our system in Section 2, review the related

work in Section 3, describe the topic-concept model in Sec-

tion 4, and develop the distributed algorithms for learning

the topic-concept model from large-scale log data in Sec-

tion 5. Section 6 briefly discusses computing topic-concept

cubes. We report the experimental results in Section 7, and

conclude the paper in section 8.

2. OUR FRAMEWORK

When a user issues a query to a search engine, a set of URLs is returned by the search engine as the search results.

The user may browse the snippets of the top search results

and selectively click on some of them. A search log can be

regarded as a sequence of query-and-click events by users.

For each event, a search engine may record the type and

content of the event as well as some other information such

as the time stamp, location, and device associated with the

event. Figure 2 shows a small segment of a search log.

Some dimensions of search events may have a hierarchical

structure. For example, the location dimension can be or-

ganized into levels of country → state → city, and the time

dimension can be represented at levels of year → month →

day → hour. Therefore, the multi-dimensional, hierarchi-

cal log data can be naturally organized into a raw log data

cube [13], where each cell is a group-by using the dimensions.

For example, a cell may contain all query-and-click events

of time “February 2010” and location “Washington State”.

We can aggregate the query-and-click events in a cell and

derive a click-through bipartite, where each query node corresponds to a unique query in the cell and each URL node corresponds to a unique URL, as demonstrated in Figure 3(a).

Figure 3: An example of (a) a click-through bipartite and (b) the corresponding QU-matrix.

An edge eij is created between query node qi and URL node

uj if uj is a clicked URL of qi. The weight wij of edge eij

is the total number of times when uj is a clicked result of qi

among all events in the cell.

A click-through bipartite can be represented as a query-

URL matrix (QU-matrix for short), where each row corre-

sponds to a query node qi and each column corresponds to

a URL node uj. The value of entry nij is simply the weight

wij between qi and uj, as shown in Figure 3(b).

The QU-matrix at a cell is often sparse. Moreover, a QU-

matrix represents information at the level of individual

queries and URLs. As discussed before, we need to summa-

rize and aggregate the information in a QU-matrix to facili-

tate online multidimensional analysis. This will be achieved

by the topic-concept model to be developed in Section 4.
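As a concrete illustration, the following Python sketch aggregates the query-and-click events of one cell into such a sparse QU-matrix; the tuple format is a simplification of the log schema in Figure 2, and the function name is ours, not part of the system.

```python
from collections import defaultdict

def build_qu_matrix(events):
    """Aggregate the click events of one cube cell into a sparse
    QU-matrix, stored as {(query, url): count} with nonzero entries only."""
    qu = defaultdict(int)
    for query, url in events:
        qu[(query, url)] += 1
    return dict(qu)

# A toy cell with three click events.
events = [("ipad", "www.apple.com/ipad"),
          ("apple ipad", "www.apple.com/ipad"),
          ("ipad", "www.apple.com/ipad")]
print(build_qu_matrix(events))
# -> {('ipad', 'www.apple.com/ipad'): 2, ('apple ipad', 'www.apple.com/ipad'): 1}
```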

Figure 4 shows the framework of our system. In the off-

line stage, we first form a raw log data cube by partitioning

the search log data along various dimensions and at different

levels. For each cell of the raw log data cube, we construct

a click-through bipartite and derive the QU-matrix. Then,

we materialize the cube by learning topic-concept models

that summarize the distributions of topics and concepts on

the QU-matrix for each cell. The resulting data cube is called the topic-concept cube. In the online stage, we use the learned model parameters to support multidimensional lookups, reverse lookups, and advanced analytical explorations.

3. RELATED WORK

Supporting multidimensional online analysis of large-scale

search log data is a new problem. To the best of our knowl-

edge, the most related work to our project is a query traffic

analysis service provided by a major commercial search en-

gine1. The service allows users to look up and compare the

hottest queries in specified time ranges, regions, verticals,

and topics. However, the service organizes the user interests

at only two levels: the lower individual query level contain-

ing individual queries, and the higher topic level consisting

of 27 topics such as “Health” and “Entertainment”.

As will be illustrated in our experiment results, using only

27 topics seems insufficient to summarize user interests from

time to time. Instead, a richer hierarchical structure of top-

ics learned from search logs, as implemented in our project,

1Due to the policy of Microsoft, we cannot reveal the name

of the search engine mentioned here.


is more effective in multidimensional analysis. For example,

after browsing the hottest queries in topic “Entertainment”,

a user may want to drill down to a subtopic “Entertain-

ment/Film”. The current two layer structure in the existing

project can only provide limited analysis power.

Moreover, using individual queries to represent user inter-

ests seems ineffective. It is well recognized that users may

formulate various queries for the same information need.

Therefore, the search log data at the individual query level

may be sparse. For example, the system returns queries

“games”, “game”, “games online”, and “free games” as the

1st, 2nd, 7th, and 8th hottest queries, respectively, on topic

“Game” in the US. Clearly, those queries carry similar in-

formation needs. To make the analysis more effective, as

achieved by the topic-concept model in our project, we need

to summarize similar queries into concepts and represent

user interests by concepts instead of individual queries.

To a broader extent, our project is related to the previ-

ous studies on search query traffic patterns, user interest

summarization, and data cube computation.

Several previous studies explored the patterns of query

traffic with respect to various aspects, such as time, loca-

tions, and search devices. For example, Beitzel et al. [8]

investigated how the web query traffic varied hourly. Back-

strom et al. [5] reported a correlation between the locations

referred in queries and the geographic focus of the users

who issued those queries. Kamvar et al. [17] presented

a log-based comparison on the distribution and variabil-

ity of search tasks that users performed from three plat-

forms, namely computers, iPhones, and conventional mobile

phones. However, those studies mainly focused on the gen-

eral trends of user query traffic without mining user interests

from the log data.

Previous approaches to summarizing user search queries

can be divided into two categories: the clustering approaches

and the categorization approaches. A clustering approach

groups similar queries and URLs in an unsupervised way.

For example, Zhao et al. [21] identified events in a time-

series of click-through bipartites derived from search logs.

Each event consists of a set of queries and clicked URLs

that evolve synchronously along the time-series. In [6, 7, 9,

19], the authors clustered the click-through bipartites and

grouped similar queries into concepts. A categorization ap-

proach classifies queries into a set of pre-defined topics in a

supervised way. For example, Shen et al. [18] leveraged the

search results returned by a search engine and converted

the query categorization problem into a text categorization

problem. Both the clustering and categorization approaches

are effective to summarize user interests into events, con-

cepts, or topics. However, they do not consider how the

interests vary with respect to various dimensions such as

time and locations. Consequently, those methods cannot be

directly used to support lookups and reverse lookups as well

as advanced online multidimensional exploration.

Gray et al. [13] developed data cubes as the core of data

warehouses and OLAP systems. A data cube contains ag-

gregated numeric measures with respect to group-bys of di-

mensions. Zhang et al. [20] proposed a topic cube that ex-

tends the traditional data cube with a measure in a hierar-

chy of topics. Each cell in the cube stores the parameters

learned from a topic modeling process. Users can apply the

OLAP operations such as roll-up and drill-down along both

standard dimensions and the topic dimension. The system was built on a single machine.

Figure 4: The framework of our system (offline: search logs → raw log data cube → click-through bipartite and topic-concept model per cell → topic-concept cube; online: look-up, comparison, reverse look-up, tracing).

Figure 5: A graphical representation of the TC-model (latent topics T and concepts C; observed queries Q and URLs U).

There are several critical

differences between our topic-concept cube and the topic

cube [20]. The topic model pLSA [14] used in [20] targets

modeling documents, which involves only two types of vari-

ables, namely the terms as observed variables and the topics

as hidden variables. However, to summarize the common in-

terests in search log data, we have to consider more variables,

especially, queries and clicked URLs as observed variables,

and concepts and topics as hidden variables. Therefore, the

traditional pLSA model cannot be applied in our project.

Consequently, the methods to materialize our topic-concept

cubes are very different from those to materialize the topic

cubes. Moreover, we report an empirical study on a much

larger set of real data, containing billions of queries and

clicks, and processed in a distributed environment.

4. TOPIC-CONCEPT MODEL

We propose a novel topic-concept model (TC-model for

short), a graphical model as shown in Figure 5, to describe

the generation process of a QU-matrix. Essentially, we as-

sume that a user bears some search intent in mind when

interacting with a search engine. The search intent belongs

to certain topics and focuses on several specific concepts.

Based on the search intent, the user formulates queries and

selectively clicks on search results.

From the search log data, we can observe user queries q

and clicks u. Following the convention of graphical mod-

els, these two observable variables are represented by black

circles in Figure 5. Since user search intents cannot be ob-

served, the topics t and concepts c are latent variables, which

are represented by white circles.

Let Q and U be the sets of unique queries and unique

URLs in a QU-matrix, respectively. Let C and T be the sets

of concepts and topics to model user interests. The training

process of the topic-concept model is to learn four groups

of model parameters Θ = (Φ,∆,ΥQ,ΥU). Here, the prior

topic distribution Φ = {P(tk)}, where tk ∈ T and P(tk) is

the prior probability that a user’s search intent involves topic tk. The concept generation distribution ∆ = {P(cl|tk)}, where cl ∈ C, tk ∈ T, and P(cl|tk) is the probability that topic tk generates concept cl. The query generation distribution ΥQ = {P(qi|cl)}, where qi ∈ Q, cl ∈ C, and P(qi|cl) is the probability that concept cl generates query qi. The URL generation distribution ΥU = {P(uj|cl)}, where uj ∈ U, cl ∈ C, and P(uj|cl) is the probability that concept cl generates a click on URL uj.

Given that a user bears a search intent on specific con-

cepts c, we assume that (1) the formulation of queries is

conditionally independent of the clicks on search results,

i.e., P(q,u|c) = P(q|c) · P(u|c); and (2) both the formu-

lation of queries and the clicks on search results are condi-

tionally independent of the topics t of the search intent, i.e.,

P(q,u|t,c) = P(q,u|c). Then, the likelihood for each entry

(qi,uj) in the QU-matrix can be factorized as follows.

L(q_i, u_j; \Theta) = \Big( \sum_{t_k \in T} \sum_{c_l \in C} P(q_i, u_j, c_l, t_k; \Theta) \Big)^{n_{ij}} = \Big( \sum_{t_k \in T} \sum_{c_l \in C} P(t_k) P(c_l|t_k) P(q_i|c_l) P(u_j|c_l) \Big)^{n_{ij}}    (1)

where nij is the value of entry (qi,uj) in the QU-matrix. The likelihood for the whole QU-matrix D is L(D; \Theta) = \prod_{q_i, u_j} L(q_i, u_j; \Theta).

Since the data likelihood is hard to maximize analytically, we apply the Expectation Maximization (EM) algorithm [12]. The EM algorithm iterates between the E-step and the M-step. The E-step computes the expectation of the log data likelihood with respect to the distribution of the latent variables derived from the current estimation of the model parameters. In the M-step, the model parameters are estimated to maximize the expected log likelihood found in the E-step. We have the following equations for the E-step in the r-th iteration.

P_r(c_l|q_i, u_j) \propto \sum_{t_k} P_{r-1}(t_k) \, P_{r-1}(c_l|t_k) \, P_{r-1}(q_i|c_l) \, P_{r-1}(u_j|c_l)    (2)

P_r(t_k|q_i, u_j) \propto \sum_{c_l} P_{r-1}(t_k) \, P_{r-1}(c_l|t_k) \, P_{r-1}(q_i|c_l) \, P_{r-1}(u_j|c_l)    (3)

In the M-step of the r-th iteration, the model parameters are updated by the following equations.

P_r(t_k) = \frac{\sum_{q_i, u_j} n_{ij} P_r(t_k|q_i, u_j)}{\sum_{t_{k'}} \sum_{q_i, u_j} n_{ij} P_r(t_{k'}|q_i, u_j)}    (4)

P_r(q_i|c_l) = \frac{\sum_{u_j} n_{ij} P_r(c_l|q_i, u_j)}{\sum_{q_{i'}, u_j} n_{i'j} P_r(c_l|q_{i'}, u_j)}    (5)

P_r(u_j|c_l) = \frac{\sum_{q_i} n_{ij} P_r(c_l|q_i, u_j)}{\sum_{q_i, u_{j'}} n_{ij'} P_r(c_l|q_i, u_{j'})}    (6)

P_r(c_l|t_k) = \frac{\sum_{q_i, u_j} n_{ij} P_r(c_l|q_i, u_j) P_r(t_k|q_i, u_j)}{\sum_{c_{l'}} \sum_{q_i, u_j} n_{ij} P_r(c_{l'}|q_i, u_j) P_r(t_k|q_i, u_j)}    (7)

5. LEARNING LARGE TC-MODELS

Although the EM algorithm can effectively learn the parameters in TC-models, there are still several challenges to apply it on huge search log data. In Section 5.1, we will develop distributed algorithms for learning TC-models from a huge amount of data. In Section 5.2, we will discuss the model initialization steps. Last, in Section 5.3 we will develop effective heuristics to reduce the number of parameters to learn in each machine.

Algorithm 1 The r-th round E-step for each process node.
Input: the subset of training data S; the model parameters Θr−1 of the last round
1: Load model parameters Θr−1;
2: for each tuple (qi, uj, nij) in S do
3:   σij = 0;
4:   for each topic tk ∈ T do σ^t_ijk = 0;
5:   let Cij = {cl | Pr−1(qi|cl) > 0 && Pr−1(uj|cl) > 0};
6:   for each concept cl ∈ Cij do
7:     σ^c_ijl = 0;
8:     for each topic tk ∈ T such that Pr−1(cl|tk) > 0 do
9:       v = Pr−1(tk) Pr−1(cl|tk) Pr−1(qi|cl) Pr−1(uj|cl);
10:      σ^c_ijl += v; σ^t_ijk += v; σij += v;
11:  for each concept cl ∈ Cij do
12:    for each topic tk ∈ T such that Pr−1(cl|tk) > 0 do
13:      output(qi, uj, cl, tk, nij, σ^c_ijl/σij, σ^t_ijk/σij);
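For concreteness, the following is a minimal Python rendering of Algorithm 1. It assumes the nonzero parameters of Θr−1 are held in nested dictionaries on the process node; it mirrors the pseudocode above and is a sketch rather than the production implementation.

```python
def e_step(tuples, theta):
    """One E-step pass over a node's training tuples (cf. Algorithm 1).
    theta holds the nonzero parameters of the previous round:
      theta["t"][tk]      = P(tk)
      theta["ct"][cl][tk] = P(cl|tk)
      theta["qc"][qi][cl] = P(qi|cl)
      theta["uc"][uj][cl] = P(uj|cl)
    Yields (qi, uj, cl, tk, nij, P(cl|qi,uj), P(tk|qi,uj))."""
    for qi, uj, nij in tuples:
        # Concepts with nonzero query AND URL generation probability (line 5).
        cands = set(theta["qc"].get(qi, {})) & set(theta["uc"].get(uj, {}))
        sigma, sig_c, sig_t = 0.0, {}, {}
        for cl in cands:
            for tk, p_ct in theta["ct"].get(cl, {}).items():
                v = theta["t"][tk] * p_ct * theta["qc"][qi][cl] * theta["uc"][uj][cl]
                sig_c[cl] = sig_c.get(cl, 0.0) + v
                sig_t[tk] = sig_t.get(tk, 0.0) + v
                sigma += v
        if sigma == 0.0:
            continue  # the tuple touches no nonzero parameters
        for cl in cands:
            for tk in theta["ct"].get(cl, {}):
                # sig_c[cl]/sigma and sig_t[tk]/sigma realize Equations 2 and 3.
                yield qi, uj, cl, tk, nij, sig_c[cl] / sigma, sig_t[tk] / sigma
```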

5.1 Distributed Learning of Parameters

Search logs typically contain billions of query-and-click

events involving tens of millions of unique queries and URLs.

To address this challenge, we develop distributed algorithms

for the E-step and M-step.

In our learning process, a QU-matrix is represented by a

set of (qi,uj,nij) tuples. Since a query usually has a small

number of clicked URLs, a QU-matrix is very sparse. We

only need to record the tuples where nij > 0. We first partition the QU-matrix into subsets and distribute each

subset to a machine (called a process node). Then we carry

out the E-step and the M-step.

In the E-step of the r-th iteration (Algorithm 1), each process node loads the current estimation of the model parameters and scans the assigned subset of training data once. For each tuple (qi, uj, nij), the process node enumerates all the concepts cl such that Pr−1(qi|cl) > 0 and Pr−1(uj|cl) > 0. For each enumerated concept cl, the process node further enumerates each topic tk such that Pr−1(cl|tk) > 0 and evaluates the value vk,l = Pr−1(tk) Pr−1(cl|tk) Pr−1(qi|cl) Pr−1(uj|cl). The values of vk,l are summed up to estimate Pr(cl|qi,uj) and Pr(tk|qi,uj) using Equations 2 and 3, respectively. Finally, we output the probabilities for the latent variables. Those results will serve as the input of the M-step.

In the M-step, we estimate the model parameters based on the probabilities of the hidden variables. According to Equations 4-6, the estimation for each parameter involves a sum over all the queries and URLs. Since the matrix is distributed on multiple machines, the summation involves aggregating the intermediate results across machines, which is particularly suitable for a Map-Reduce system [11].

In the map stage of the M-step, each process node receives a subset of tuples (qi, uj, cl, tk, nij, σ^c_ijl/σij, σ^t_ijk/σij). For each tuple, the process node emits four key-value pairs as shown in Table 2. In the reduce stage, the process nodes simply sum up all the values with the same key and update the model parameters using Equations 4-6.

Key        Value
⟨tk⟩       nij · σ^t_ijk / σij
⟨cl, tk⟩   nij · σ^c_ijl · σ^t_ijk / σij^2
⟨qi, cl⟩   nij · σ^c_ijl / σij
⟨uj, cl⟩   nij · σ^c_ijl / σij

Table 2: The key/value pairs at the map stage of the r-th round of the M-step.
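The map and reduce stages of the M-step can be sketched in Python as follows; the record layout follows Algorithm 1 and Table 2, while the function names and key encodings are illustrative assumptions.

```python
from collections import defaultdict

def map_stage(record):
    """Emit the four key/value pairs of Table 2 for one E-step output record."""
    qi, uj, cl, tk, nij, p_c, p_t = record  # p_c = P(cl|qi,uj), p_t = P(tk|qi,uj)
    yield ("t", tk), nij * p_t              # numerator of Equation 4
    yield ("ct", cl, tk), nij * p_c * p_t   # numerator of Equation 7
    yield ("qc", qi, cl), nij * p_c         # numerator of Equation 5
    yield ("uc", uj, cl), nij * p_c         # numerator of Equation 6

def reduce_stage(pairs):
    """Sum the values of each key; normalizing the per-key sums over the
    appropriate denominators then yields Equations 4-7."""
    acc = defaultdict(float)
    for key, value in pairs:
        acc[key] += value
    return dict(acc)
```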

5.2 Model Initialization

The Topic-Concept model consists of four sets of param-

eters, Φ,∆,ΥQ and ΥU. We first initialize the query-and-

click generation probabilities ΥQ and ΥU by mining the con-

cepts from the click-through bipartite. We then initialize the

prior topic probabilities Φ and the concept generation prob-

abilities ∆ by assigning concepts to topics.

To mine concepts from a click-through bipartite, we clus-

ter queries from the query-URL bipartite graph by a two-

step propagation approach [10]. For each query cluster Ql,

we find the set of URLs Ul such that each URL u ∈ Ul is

connected with at least one query in Ql. In the first step of propagation, Ql is expanded to Q′l such that each query q′ ∈ Q′l is connected with at least one URL u ∈ Ul. In the second step of propagation, Ul is expanded to U′l such that each URL u′ ∈ U′l is connected with at least one query q′ ∈ Q′l. Finally, we represent each concept cl by the pair of query and URL sets (Q′l, U′l), and initialize the query and URL generation probabilities by

P_0(q_i|c_l) \propto \sum_{u_j \in U'_l} n_{ij}; \qquad P_0(u_j|c_l) \propto \sum_{q_i \in Q'_l} n_{ij},

where nij is the value of entry (qi,uj) in the QU-matrix.
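A minimal sketch of this two-step propagation and of the initialization of the generation probabilities, assuming the click-through bipartite is stored as a {(query, url): count} dictionary:

```python
from collections import defaultdict

def expand_cluster(Ql, bipartite):
    """Two-step propagation around a query cluster Ql [10].
    bipartite maps (query, url) -> click count."""
    Ul = {u for (q, u) in bipartite if q in Ql}     # URLs clicked from Ql
    Qpl = {q for (q, u) in bipartite if u in Ul}    # step 1: expand queries
    Upl = {u for (q, u) in bipartite if q in Qpl}   # step 2: expand URLs
    return Qpl, Upl

def init_generation_probs(Qpl, Upl, bipartite):
    """P0(q|c) and P0(u|c) proportional to the click counts inside the concept;
    assumes the concept contains at least one click."""
    p_q, p_u, total = defaultdict(float), defaultdict(float), 0
    for (q, u), n in bipartite.items():
        if q in Qpl and u in Upl:
            p_q[q] += n
            p_u[u] += n
            total += n
    return ({q: v / total for q, v in p_q.items()},
            {u: v / total for u, v in p_u.items()})
```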

After deriving the set of concepts C, we consider the set

of topics T. Although we may automatically mine topics

by clustering concepts, in practice, there are several well-

accepted topic taxonomies, such as Yahoo! Directory [4],

Wikipedia [3], and ODP [2]. We use the ODP topic taxon-

omy in this paper, though others can be adopted as well.

The ODP taxonomy is a hierarchical structure where each

parent topic subsumes several sub topics, and each leaf topic

is manually associated with a list of URLs by the ODP edi-

tors. Given a set of topics at some level in the taxonomy, we

can initialize the concept generation probabilities P(cl|tk)

as follows.

According to Bayes Theorem, P(cl|tk) ∝ P(cl)P(tk|cl).

The prior probability P(cl) indicates the popularity of con-

cept cl and the probability P(tk|cl) indicates how likely cl

involves topic tk. Suppose cl is represented by the query-

and-URL sets (Q′l, U′l). The popularity of cl can be estimated by \hat{P}(c_l) \propto \sum_{q_i \in Q'_l, u_j \in U'_l} n_{ij}, where nij is the value of entry (qi,uj) in the QU-matrix. To tell how likely cl involves topic tk, we merge the text content of the URLs u ∈ U′l into a pseudo-document dl. Then, the problem of estimating P(tk|cl) is converted into a text categorization problem, and P(tk|cl) can be estimated by applying any text categorization techniques (e.g., [15, 16]) on the pseudo-document dl. Based on the estimated P̂(cl) and P̂(tk|cl), we initialize the parameters by

P_0(c_l|t_k) \propto \hat{P}(c_l)\,\hat{P}(t_k|c_l); \qquad P_0(t_k) \propto \sum_{c_l} \hat{P}(c_l)\,\hat{P}(t_k|c_l).
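The initialization of P0(cl|tk) and P0(tk) can be sketched as follows; the input dictionaries holding P̂(cl) and the classifier scores P̂(tk|cl) are assumed to come from the concept mining and text categorization steps above.

```python
from collections import defaultdict

def init_topic_params(p_hat_c, p_hat_t_given_c):
    """Initialize P0(c|t) and P0(t) from the concept popularity p_hat_c[c]
    and the classifier scores p_hat_t_given_c[c][t] (both assumed given)."""
    p_t = defaultdict(float)
    p_ct = defaultdict(dict)          # p_ct[t][c] ∝ P̂(c) · P̂(t|c)
    for c, scores in p_hat_t_given_c.items():
        for t, s in scores.items():
            v = p_hat_c[c] * s
            p_t[t] += v               # P0(t) ∝ Σ_c P̂(c) P̂(t|c)
            p_ct[t][c] = v
    z = sum(p_t.values())
    p_t = {t: v / z for t, v in p_t.items()}
    # Normalize P0(c|t) within each topic t.
    p_ct = {t: {c: v / sum(row.values()) for c, v in row.items()}
            for t, row in p_ct.items()}
    return p_t, p_ct
```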

Why do we still need the EM iterations given that we

can estimate all the model parameters in the initialization

stage? The EM iterations can improve the quality of con-

cepts and topics by a mutual reinforcement process. In the

TC-model, the probabilities P(q|c) and P(u|c) assign queries

and URLs to concepts, while the probabilities P(c|t) assign

concepts to topics. In the initialization stage, those two

types of probabilities are estimated independently. If two

queries/URLs belong to the same concept, it is more likely

that they belong to the same topic, and vice versa. There-

fore, if we jointly consider those two types of probabilities,

we may derive more accurate assignments of concepts and

topics. In the EM iterations, the relationship between con-

cepts and topics is captured by the probabilities P(c|q,u)

and P(t|q,u), which contribute to the increase of the data

likelihood. In our experiments on a real data set, the data

likelihood is increased by 11% after the EM iterations.

5.3 Reducing Re-estimated Parameters

As described in Section 5.1, in the E-step, each process

node estimates P(cl|qi,uj) and P(tk|qi,uj) on the basis

of the last round estimation of parameters Φ,∆,ΥQ, and

ΥU. Let Nt, Nc, Nq, and Nu be the numbers of topics, concepts, unique queries, and unique URLs, respectively. The sizes of the parameter sets are |Φ| = Nt, |∆| = Nt · Nc, |ΥQ| = Nq · Nc, and |ΥU| = Nu · Nc. In practice, we usu-

ally have tens of millions of unique queries and URLs in

the search log data, which may form millions of concepts.

For example, in the real data set in our experiments, we

have 11.76 million unique queries, 9.5 million unique URLs,

4.71 million concepts, and several hundred topics. The total

size of the parameter space reaches 10^14. Consequently, it

is infeasible to hold the full parameter space into the main

memory of a process node.

To reduce the number of parameters to be re-estimated,

we analyze the cases when the model parameters remain zero

during the EM iterations. Suppose a process node receives

a subset S of training data in the E-step, we give a tight

superset Θ(S) of the nonzero model parameters that need

to be accessed by the process node in the E-step. In our

experiments, |Θ(S)| for each process node is several orders of magnitude smaller than the size of the full parameter space.

Each process node only needs to process a subset of Θ(S).

Lemma 1. The query generation probability at the r-th

iteration Pr(qi|cl) = 0 if P0(qi|cl) = 0.

Proof. Let U be the whole set of unique URLs. From

Equation 2, if Pr−1(qi|cl) = 0, then Pr(cl|qi,uj) = 0 holds

for every uj ∈ U. According to Equation 5, if Pr(cl|qi,uj) =

0 holds for every uj ∈ U, then Pr(qi|cl) = 0. Therefore,

we have Pr−1(qi|cl) = 0 ⇒ Pr(qi|cl) = 0. Using simple induction, we can prove P0(qi|cl) = 0 ⇒ Pr(qi|cl) = 0.

Similarly, we can prove the following lemma.

Lemma 2. The URL generation probability at the r-th it-

eration Pr(uj|cl) = 0 if P0(uj|cl) = 0.

Let us consider the concept generation probabilities P(cl|tk). We say that a pair (qi,uj) belongs to concept cl, denoted by (qi,uj) ∈ cl, if nij > 0, P0(qi|cl) > 0, and


P0(uj|cl) > 0. Two concepts cl and cl′ are associated if there exists a pair (qi,uj) belonging to both concepts. Trivially, a concept is associated with itself. Let A(cl) be the set of concepts associated with cl, and QU(cl) be the set of pairs (qi,uj) that belong to at least one concept associated with cl, i.e., QU(cl) = {(qi,uj) | ∃cl′ ∈ A(cl), (qi,uj) ∈ cl′}. We have the following.

Lemma 3. The concept generation probability at the r-th iteration Pr(cl|tk) = 0 if ∀cl′ ∈ A(cl), Pr−1(cl′|tk) = 0.

Proof. According to the definitions, for any (qi,uj) ∉ cl, one of the following three predicates holds: (1) nij = 0; (2) P0(qi|cl) = 0; or (3) P0(uj|cl) = 0. If nij = 0, then from Equation 7, (qi,uj) does not contribute to Pr(cl|tk). Otherwise, if P0(qi|cl) = 0 or P0(uj|cl) = 0, according to Lemmas 1 and 2, we have either Pr−1(qi|cl) = 0 or Pr−1(uj|cl) = 0. From Equation 2, if either Pr−1(qi|cl) = 0 or Pr−1(uj|cl) = 0, then Pr(cl|qi,uj) = 0. Therefore, Equation 7 can be re-written as

P_r(c_l|t_k) \propto \sum_{(q_i,u_j) \in c_l} n_{ij} \, P_r(c_l|q_i,u_j) \, P_r(t_k|q_i,u_j).    (8)

Now we only need to focus on Pr(tk|qi,uj) for pairs (qi,uj) ∈ cl. According to the definition of A(cl), for any pair (qi,uj) ∈ cl and concept cl′′ ∉ A(cl), either P0(qi|cl′′) = 0 or P0(uj|cl′′) = 0 holds. Using Lemmas 1 and 2, we can rewrite Equation 3 for every pair (qi,uj) ∈ cl as

P_r(t_k|q_i,u_j) \propto \sum_{c_{l'} \in A(c_l)} P_{r-1}(t_k) \, P_{r-1}(c_{l'}|t_k) \, P_{r-1}(q_i|c_{l'}) \, P_{r-1}(u_j|c_{l'}).    (9)

According to Equation 9, if ∀cl′ ∈ A(cl), Pr−1(cl′|tk) = 0, then Pr(tk|qi,uj) = 0 holds for every (qi,uj) ∈ cl. Further, according to Equation 8, if Pr(tk|qi,uj) = 0 holds for every (qi,uj) ∈ cl, then Pr(cl|tk) = 0. Therefore, if ∀cl′ ∈ A(cl), Pr−1(cl′|tk) = 0, then Pr(cl|tk) = 0.

Lemma 3 suggests that at each round of iteration, a concept cl propagates its nonzero topics tk (i.e., topics such that P(cl|tk) > 0) one step further to all its associated concepts.

To further explore the conditions for Pr(cl|tk) = 0, we build a concept association graph G(V,E), where each vertex v ∈ V represents a concept c, and two concepts ca and cb are linked by an edge eab ∈ E if they are associated with each other. In the association graph, two concepts ca and cb are connected if there exists a path between ca and cb. The connected component N∗(ca) of concept ca consists of all concepts cb that are connected with ca. The distance between two concepts ca and cb is the length of the shortest path between ca and cb in the graph. If ca and cb are not connected, the distance is set to ∞. The set of m-step neighbors Nm(ca) (1 ≤ m < ∞) of concept ca consists of the concepts whose distance from ca is at most m. We have the following lemma by recursively applying Lemma 3.

Lemma 4. The concept generation probability at the r-

th iteration Pr(cl|tk) = 0 if ∀cl′ ∈ Nm(cl) (1 ≤ m ≤ r),

Pr−m(cl′|tk) = 0. Moreover, Pr(cl|tk) = 0 if ∀cl′ ∈ N∗(cl),

P0(cl′|tk) = 0.

Using Lemmas 1-4, we can give a tight superset of the

parameters needed in the E-step for any subset S of training

data. Let (qi,uj,nij) be a training tuple in S. In the E-step,

we enumerate the concepts cl such that Pr−1(qi|cl) > 0

and Pr−1(uj|cl) > 0. According to Lemmas 1 and 2, to

process (qi,uj,nij), we can enumerate only those concepts

C′ij = {cl | (qi,uj) ∈ cl}.

We consider the nonzero parameters for each concept cl. Using Lemmas 1 and 2, the nonzero query and URL generation probabilities are simply Υ+Q(cl) = {P(qi|cl) | P0(qi|cl) > 0} and Υ+U(cl) = {P(uj|cl) | P0(uj|cl) > 0}, respectively. Furthermore, let T(cl) = {P(cl|tk) | P0(cl|tk) > 0} and T∗(cl) = ∪_{cl′ ∈ N∗(cl)} T(cl′). Using Lemma 4, the nonzero concept generation probabilities are ∆+(cl) = {P(cl|tk) | tk ∈ T∗(cl)}.

Let C′S be the set of concepts that are enumerated for the training tuples in S, i.e., C′S = ∪_{sij ∈ S} C′ij. We summarize the above discussion as follows.

Theorem 1. Let S be a subset of training data; the set of nonzero parameters that need to be accessed in the E-step for S is a subset of Θ(S), where

\Theta(S) = \{P(t_k)\} \cup \Big(\bigcup_{c_l \in C'_S} \Upsilon^+_Q(c_l)\Big) \cup \Big(\bigcup_{c_l \in C'_S} \Upsilon^+_U(c_l)\Big) \cup \Big(\bigcup_{c_l \in C'_S} \Delta^+(c_l)\Big).
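To make Theorem 1 concrete, the following sketch assembles Θ(S) for one process node. All input indexes are hypothetical bookkeeping structures assumed to be precomputed during initialization; the global prior set {P(tk)} is small and is simply loaded in full.

```python
def estimate_theta_s(S, concepts_of_pair, t_star, nonzero_q, nonzero_u):
    """Assemble the superset Theta(S) of nonzero parameters a node must
    load for its training tuples S (Theorem 1). concepts_of_pair[(qi, uj)]
    lists the concepts the pair belongs to; t_star[cl] is T*(cl);
    nonzero_q[cl] / nonzero_u[cl] list the queries / URLs with nonzero
    initial generation probability for cl."""
    c_s = set()
    for qi, uj, _nij in S:
        c_s.update(concepts_of_pair.get((qi, uj), ()))
    theta_q = {(qi, cl) for cl in c_s for qi in nonzero_q[cl]}   # Υ+_Q(cl)
    theta_u = {(uj, cl) for cl in c_s for uj in nonzero_u[cl]}   # Υ+_U(cl)
    theta_ct = {(cl, tk) for cl in c_s for tk in t_star[cl]}     # ∆+(cl)
    return theta_q, theta_u, theta_ct  # plus the global prior set {P(tk)}
```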

In practice, a concept association graph can be highly

connected. That is, for any two concepts ca and cb, there

likely exists a path ca, cl1, ..., clm, cb. In some cases, although each pair of adjacent concepts on the path are related to each other, the two end concepts ca and cb of the path may be about dramatically different topics. As discussed before, in the EM iterations, each concept propagates its nonzero topics to its neighbors. Consequently, after several rounds of iterations, two totally irrelevant concepts ca and cb may exchange their nonzero topics through the path ca, cl1, ..., clm, cb. To avoid over-propagation of the nonzero topics, we may constrain the propagation up to ς steps. Specifically, for each concept cl, let T(cl) = {P(cl|tk) | P0(cl|tk) > 0} and Tς(cl) = ∪_{cl′ ∈ Nς(cl)} T(cl′); we constrain the concept generation probability P(cl|tk) = 0 if tk ∉ Tς(cl). In our experiments, we find that the nonzero topics propagated from the neighbors of more than one step away are often noisy. Therefore, we set ς to 1.

Theorem 1 greatly reduces the number of parameters to be re-estimated in process nodes in practice. For example, when we use 50 process nodes in our experiments, each process node only needs to re-estimate 62 million parameters, which is about 10^-7 of the size of the total parameter space. In practice, 62 million parameters may still be too many for a machine with small memory, e.g., less than 2 GB. In this case, the process node can recursively split the assigned training data Sn into smaller blocks Snb ⊂ Sn until the necessary nonzero parameters Θ(Snb) for each block can be loaded into the main memory. Then, the process node can carry out the E-step block by block. We report the details of the experiment in Section 7.2.

6. CUBE CONSTRUCTION AND REQUEST ANSWERING

Similar to a traditional data cube, a topic-concept cube (TC-cube for short) contains some standard dimensions such as time and locations. However, a TC-cube differs from

Figure 6: The cube construction approaches on (a) the standard dimensions and (b) the TC-dimension (top-down and bottom-up materialization between a parent cell C1 and its child cells C21, ..., C2M).

a traditional data cube in several critical aspects. First,

for each cell in a TC-cube, we learn the TC-models from

the training data in the cell and use the model parameters

as the measure of the cell. Those parameters allow us to

effectively answer lookups and reverse lookups introduced

in Section 1. Second, a TC-cube contains a special topic-

concept dimension (TC-dimension for short) as shown in

Figure 1.

We have two alternative approaches to materialize the

whole TC-cube that consists of both standard dimensions

and the TC-dimension. The standard-dimension-first ap-

proach materializes a raw log data cube using the standard

dimensions, and then materializes along the TC-dimension

for each cell in the raw log data cube. The TC-dimension-

first approach processes the topic hierarchy level by level.

For each level, it materializes the cells formed by the stan-

dard dimensions. The full technical details can be found in

the extended version [1].

After materializing the whole TC-cube, we answer the

lookups and reverse lookups using the model parameters in

the TC-cube. Since the number of model parameters can be

large, we store the parameters distributively on a cluster of

process nodes, where each node contains the parameters for

a set of cells. When the system receives a lookup request, for

example, “(time=Dec., 2009; location=US; topic=Games)”,

it will delegate the query to the process node where the

model parameters of the corresponding cell are stored. Then

the process node will select the top k concepts c with the

largest concept generation probabilities P(c|t = Games).

For each top concept, the process node will use the query q

with the largest P(q|c) as the representative query. Finally,

the system returns a list of representative queries of the top

concepts as the answer to the lookup request.
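A sketch of this lookup procedure over the stored parameters of a single cell; the dictionary layout is an assumption for illustration only.

```python
def answer_lookup(cell, topic, k=5):
    """Answer a lookup on one cell: the top-k concepts of `topic`, each
    represented by its most probable query.
    cell["ct"][topic][cl] = P(cl|t); cell["qc"][cl][q] = P(q|cl)."""
    ranked = sorted(cell["ct"].get(topic, {}).items(),
                    key=lambda kv: kv[1], reverse=True)[:k]
    # For each top concept, pick the query with the largest P(q|cl).
    return [(max(cell["qc"][cl], key=cell["qc"][cl].get), p)
            for cl, p in ranked]
```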

To answer the reverse lookups, we build inverted lists that

map key words to concepts. The inverted list can be stored

distributively on a cluster of process nodes, where each node

takes charge of a range of key words. Suppose a user re-

quests a reverse lookup about “hurricane Bill”. The system

will delegate the key words to the corresponding node that

stores the inverted list for “hurricane Bill”. The node re-

trieves from the inverted list the set of concepts C_hurricane Bill, where each concept is related to “hurricane Bill”. The system then broadcasts the concepts in C_hurricane Bill to all the

nodes that store the model parameters. Each node checks

the measures of all its cells and reports (Dval,Count) for

each cell, where Dval consists of the corresponding val-

ues of the standard dimensions of the cell, and Count is

the frequency of the concepts in C_hurricane Bill in the cell, i.e., Count = \sum_{c \in C_{\text{hurricane Bill}}} \sum_{(q_i,u_j) \in c} n_{ij}, where nij is the value of entry (qi,uj) in the QU-matrix of the cell.

No.  baseline      TC-cube          P(c|t)
1    games         games            0.020
2    game          pogo             0.013
3    cheats        maxgames         0.012
4    wow           aol games        0.011
5    lottery       wow heroes       0.010
6    xbox          killing games    0.009
7    games online  addicted games   0.008
8    free games    age of war       0.008
9    wii           powder game      0.008
10   runescape     monopoly online  0.008

Table 3: The top ten queries returned by our TC-cube and the baseline for lookup “(time=ALL; location=US; topic=Games)”.

If the

user specifies the levels of the standard dimensions, for ex-

ample, time@day;location@country, the system returns the

Dvals of the top k cells that match the specified levels of the

standard dimension. If the user does not specify the levels,

the system will answer the request at the default levels. The

user can further drill down or roll up to different levels.
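The reverse lookup can be sketched as follows, assuming the inverted lists and the per-cell QU-entries are held in plain dictionaries; in the real system both are distributed across process nodes, and Dval identifies a cell by its standard-dimension values.

```python
def reverse_lookup(keyword, inverted_index, cells, k=5):
    """Rank the cells (group-bys) where the concepts matching `keyword`
    were searched most intensively.
    inverted_index[keyword] -> set of matching concepts;
    cells[dval][cl] -> {(qi, uj): nij}, the cell's QU-entries of concept cl."""
    concepts = inverted_index.get(keyword, set())
    scored = []
    for dval, cell in cells.items():
        count = sum(nij for cl in concepts
                        for nij in cell.get(cl, {}).values())
        scored.append((dval, count))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```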

7. EXPERIMENTS

In this section, we report the results from a systematic

empirical study using a large search log from a major com-

mercial search engine. The extracted log data set spans for

four months and contains 1.96 billion queries and 2.73 bil-

lion clicks from five markets, i.e., the United States, Canada,

United Kingdom, Malaysia, and New Zealand. In the follow-

ing, we first demonstrate the effectiveness of our approach

using several examples of the lookup and reverse lookup re-

quests. Then, we examine the efficiency and scalability of

our distributed training algorithms for the TC-model.

7.1 Examples of Lookups and Reverse Lookups

In this subsection, we show some real examples for the

lookups and reverse lookups answered by our system. We

use the query traffic analysis service by a major commercial

search engine as the baseline. Please refer to Section 3 for a

more detailed description of the baseline.

Table 3 compares the results for the lookup request (time

= ALL; location = US; topic = Games) returned by our

system and the baseline. Since the baseline does not group

similar queries into concepts, the top 10 results are quite

redundant. For example, the 1st, 2nd, 7th, and 8th queries

are similar. Our system summarizes similar queries into con-

cepts and selects only one query as the representative for

each concept. Consequently, the top 10 queries returned by

our system are more informative. We further request the top

results for four sub topics of “Games”, namely “card games”,

“gambling”, “party games”, and “puzzles”. The queries re-

turned by our system are informative (Table 4). However,

the baseline only organizes the user queries by a flat set of

27 topics; it does not support drilling down to sub topics.

As an example for reverse lookup, we asked for the group-

bys where the search for Hurricane Bill was popular by

a request “(time@day, location@state, keyword=‘hurrican

bill’)”. Purposely we misspelled the keyword “hurricane” to

“hurrican” to test the summarization capability of our TC-model.

card games      P(c|t)    gambling    P(c|t)
pogo            0.020     sun bingo   0.004
gogirlsgames    0.004     wink bingo  0.004
solitaire       0.004     tombola     0.003
aol games       0.003     skybet      0.003
scrabble blast  0.003     ladbrokes   0.002

party games     P(c|t)    puzzles         P(c|t)
tombola         0.003     pogo            0.006
oyunlar         0.003     sudoku          0.004
fashion games   0.003     meriam webster  0.003
drinking games  0.002     thesaurus com   0.003
evite           0.002     mathgames       0.002

Table 4: The top queries returned by TC-cube for four sub topics of “Games” in the US.

Figure 7: The top five states of the US (Florida, Georgia, New Jersey, Massachusetts, Virginia) where Hurricane Bill was most intensively searched in August 2009 (number of search times per day).

Our system can infer that the keyword “hurrican

bill” belongs to the concept that consists of queries “hurri-

cane bill”, “hurrican bill”, “huricane bill”, “projected path of

hurricane bill”, “hurricane bill 2009” and some other vari-

ants. Therefore, the system sums up the frequencies of all

the queries in the concept and answers the top five states

during the days in August 2009 (Figure 7). Figure 8 vi-

sualizes the trend of the popularity of the whole concept

according to the output of the reverse lookup. The dates

in the figure indicate when the concept was most intensively

searched in different states in the US. Interestingly, the trend

shown in Figure 8 reflects well the trajectory and the influ-

ence of the hurricane geographically and temporally, which

indicates that the real world events can be reflected by the

popular queries issued to search engines. However, when we

sent the same request to the baseline, it answered that the

search volume was not enough to show trend. The reason is

that the baseline may only consider the query that exactly

matches the misspelled keyword “hurrican bill”, which may

not be searched often.

7.2 Training TC-models

The TC-model was initialized as described in Section 5.2.

We derived 4.71 million concepts, which involve 11.76 million

unique queries and 9.5 million unique URLs. On average,

a concept consists of 4.68 unique queries and 6.77 unique

URLs. We further chose the second level of the ODP [2]

taxonomy and applied the text classifier in [15] to categorize

the concepts into the 483 topics. For each concept, we kept

the top five topics returned by the classifier.

From the raw log data, we derived 23 million training

tuples, where each training tuple is in the form (qi,uj,nij) and nij is the number of times URL uj was clicked on as an answer to query qi.

Figure 8: The trajectory of Hurricane Bill (the dates, from 8/17 to 8/21, 2009, mark when the concept was most intensively searched in each state).

Figure 9: (a) The data likelihood and (b) the average percentage of parameter changes during the EM iterations.


Figures 9(a) and (b) show the data likelihood and the av-

erage percentage of parameter changes with respect to the

number of EM iterations. The iteration process converges

fast; the data likelihood and parameters do not change much

(less than 0.1%) after five iterations. The results suggest

that our initialization methods are effective to set the ini-

tial parameters close to a local maximum. Moreover, the

data likelihood increases by 11% after ten iterations. As ex-

plained in Section 5.2, this indicates that the EM algorithm

is effective to improve the quality of the TC-model by jointly

mining the assignments of concepts and topics in a mutual

reinforcement process.

Figures 10(a) and (b) show the runtime of the E-step and

the M-step with respect to the percentage of the complete

data set with 50, 100, and 200 process nodes, respectively.

Each process node has a four-core 2.67GHz CPU and 4GB

main memory. We observe the following in Figure 10(a).

First, the more process nodes used, the shorter runtime for

the E-step. The runtime needed for the E-step on the com-

plete data set by 50, 100, and 200 process nodes is approxi-

mately in ratio 4:2:1. This suggests that our algorithm scales

well with respect to the number of process nodes. Second,

the more process nodes are used, the more scalable is the

E-step. For example, when 50 process nodes were used, the

runtime increased dramatically when 40%, 70%, and 100%

of the data was loaded. As explained in Section 5.3, if the

training data for a process node involves too many param-

eters to be held in the main memory, the algorithm recur-

Figure 10: The scalability of (a) the E-step and (b) the M-step (runtime in seconds vs. percentage of the full data set, for 50, 100, and 200 process nodes).

# pn   |S|      |Θ(S)|       # nonzero parameters   Ratio    # B
50     460,062  62,325,884   56,682,113             5.7e-7   4
100    230,031  35,368,823   30,370,194             3.0e-7   2
200    115,015  18,656,725   15,821,818             1.6e-7   1

Table 5: The effectiveness of Theorem 1 (# pn: number of process nodes; |S|: average number of training tuples per node; Ratio: |Θ(S)| over the full parameter space; # B: number of blocks per node).


Therefore, the runtime of the E-step mainly depends on the

number of disk scans of the parameter file, i.e., the number

of blocks to be processed. When we used 50 process nodes,

each node split the assigned training data into 2, 3, and 4

blocks when 40%, 70%, and 100% of the complete data set

was used for training, respectively. This explains why the

runtime increases dramatically at those points. When we

used 200 nodes, each node can process the assigned data

without splitting even for the complete data set. Conse-

quently, the runtime increases linearly.

In Figure 10(b), the runtime of M-step increases almost

linearly with respect to the data set size, indicating the good

scalability of our algorithm. Interestingly, the runtime of the

M-step does not change much with respect to the number of

process nodes. This is because the major cost of the map-

reduce process of the M-step is the merging of parameters,

which is done on a single machine. This bottleneck makes the M-step take much longer than the E-step.

Table 5 evaluates the effectiveness of Theorem 1. We ex-

ecuted the E-step on the complete data set with 50, 100,

and 200 process nodes, respectively. For each setting, e.g.,

using 50 nodes, we recorded the average number of training

tuples |S| assigned to each process, the average number of

the estimated nonzero parameters Θ(S) by Theorem 1, the

average number of nonzero parameters after ten iterations,

the ratio of the average size of Θ(S) over the size of the

whole parameter space, and the number of blocks processed

by each process node. Table 5 suggests the following. First,

the average size of Θ(S) over the size of the whole param-

eter space is very small, in the order of 10−7. This means

Theorem 1 can greatly reduce the number of parameters to

be held by each process node. Moreover, the size of the

estimated nonzero parameters is close to that of nonzero

parameters during the iterations. This indicates that the

superset of nonzero parameters given by Theorem 1 is tight.

8. CONCLUSION

In this paper, we described our topic-concept cube project

that supports online multidimensional mining of search logs.

We proposed a novel topic-concept model to summarize user

interests and developed distributed algorithms to automati-

cally learn the topics and concepts from large-scale log data.

We also explored various approaches for efficient materializa-

tion of TC-cubes. Finally, we conducted an empirical study

on a large log data set and demonstrated the effectiveness

and efficiency of our approach. A prototype system that can

provide public online services is under development.

9. REFERENCES

[1] http://research.microsoft.com/en-us/people/djiang/ext.pdf.

[2] ODP: http://www.dmoz.org.

[3] Wikipedia: http://en.wikipedia.org.

[4] Yahoo! Directory: http://dir.yahoo.com.

[5] Backstrom, L., et al. Spatial variation in search engine

queries. In WWW’08.

[6] Baeza-Yates, R.A., et al. Query recommendation using

query logs in search engines. In EDBT’04 Workshop.

[7] Beeferman, D. and Berger, A. Agglomerative

clustering of a search engine query log. In KDD’00.

[8] Beitzel, S.M., et al. Hourly analysis of a very large

topically categorized web query log. In SIGIR’04.

[9] Cao, H., et al. Context-aware query suggestion by

mining click-through and session data. In KDD’08.

[10] Cao, H., et al. Towards context-aware search by

learning a very large variable length hidden markov

model from search logs. In WWW’09.

[11] Dean, J., et al. MapReduce: simplified data processing

on large clusters. In OSDI’04.

[12] Dempster, A.P., et al. Maximum likelihood from

incomplete data via the EM algorithm. Journal of the

Royal Statistical Society, Ser B(39):1–38, 1977.

[13] Gray, J., et al. Data cube: a relational aggregation

operator generalizing group-by, cross-tab, and

sub-totals. In ICDE’96.

[14] Hofmann, T. Probabilistic Latent Semantic Analysis.

In UAI’99.

[15] Joachims, T. Text categorization with support vector

machines: learning with many relevant features. In

ECML’98.

[16] Joachims, T. Transductive inference for text

classification using support vector machines. In

ICML’99.

[17] Kamvar, M. et al. Computers and iphones and mobile

phones, oh my!: a logs-based comparison of search

users on different devices. In WWW’09.

[18] Shen, D. et al. Q2c@ust: our winning solution to

query classification in KDDCUP 2005. SIGKDD Explorations,

7(2), 2005.

[19] Wen, J., et al. Clustering user queries of a search

engine. In WWW’01.

[20] Zhang, D., et al. Topic cube: topic modeling for OLAP

on multidimensional text databases. In SDM’09.

[21] Zhao, Q., et al. Event detection from evolution of

click-through data. In KDD’06.