Conference Paper

Multidimensional mining of large-scale search logs: a topic-concept cube approach.

DOI: 10.1145/1935826.1935888 · Conference: Proceedings of the Fourth International Conference on Web Search and Web Data Mining, WSDM 2011, Hong Kong, China, February 9-12, 2011
Source: DBLP


In addition to search queries and the corresponding clickthrough information, search engine logs record multidimensional information about user search activities, such as search time, location, vertical, and search device. Multidimensional mining of search logs can provide novel insights and useful knowledge for both search engine users and developers. In this paper, we describe our topic-concept cube project, which addresses the business need of supporting multidimensional mining of search logs effectively and efficiently. We address two challenges. First, search queries and clickthrough data are widely recognized as sparse, and thus have to be aggregated properly for effective analysis. Second, there is often a gap between the topic hierarchies in multidimensional aggregate analysis and queries in search logs. To address those challenges, we develop a novel topic-concept model that learns a hierarchy of concepts and topics automatically from search logs. Enabled by the topic-concept model, we construct a topic-concept cube that supports online multidimensional mining of search log data. A distinct feature of our approach is that, in addition to the standard dimensions such as time and location, our topic-concept cube has a dimension of topics and concepts, which substantially facilitates the analysis of log data. To handle a huge amount of log data, we develop distributed algorithms for learning model parameters efficiently. We also devise approaches to computing a topic-concept cube. We report an empirical study verifying the effectiveness and efficiency of our approach on a real data set of 1.96 billion queries and 2.73 billion clicks.
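To make the cube idea concrete, the following is a minimal sketch of how sparse per-query click counts might be aggregated into cells over topic, time, and location, with roll-ups along any subset of dimensions. It is not the authors' implementation: the toy log records, the hand-written `QUERY_TOPIC` mapping (the paper learns this hierarchy from logs with its topic-concept model), and the `build_cube` helper are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy log records: (query, month, location, clicks).
LOG = [
    ("cheap flights", "2010-01", "US", 3),
    ("hotel deals", "2010-01", "US", 2),
    ("python tutorial", "2010-01", "UK", 1),
    ("cheap flights", "2010-02", "UK", 4),
]

# Assumed, hand-written query -> topic mapping. In the paper this
# mapping comes from a learned topic-concept model, not a lookup table.
QUERY_TOPIC = {
    "cheap flights": "travel",
    "hotel deals": "travel",
    "python tutorial": "programming",
}

ALL = "*"  # marker for a dimension that has been rolled up


def build_cube(log, query_topic):
    """Aggregate clicks into every (topic, month, location) cell,
    including cells where some dimensions are rolled up to ALL."""
    cube = defaultdict(int)
    for query, month, location, clicks in log:
        base = (query_topic[query], month, location)
        # Enumerate every subset of dimensions to keep; the others become ALL.
        for r in range(len(base) + 1):
            for keep in combinations(range(len(base)), r):
                cell = tuple(
                    base[i] if i in keep else ALL for i in range(len(base))
                )
                cube[cell] += clicks
    return dict(cube)


cube = build_cube(LOG, QUERY_TOPIC)
print(cube[("travel", ALL, ALL)])   # clicks on the travel topic overall
print(cube[("travel", ALL, "US")])  # travel clicks from US users
print(cube[(ALL, ALL, ALL)])        # grand total of clicks
```

Materializing every cell this way is only practical for toy data; at the scale reported in the abstract, a real system would need distributed computation and selective materialization.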

Available from: Ho-Jin Choi
  • ABSTRACT: Huge amounts of search log data have been accumulated at Web search engines. Currently, a popular Web search engine may receive billions of queries and collect terabytes of records about user search behavior daily. Besides search log data, huge amounts of browse log data have also been collected through client-side browser plugins. Such massive amounts of search and browse log data provide great opportunities for mining the wisdom of crowds and improving Web search. At the same time, designing effective and efficient methods to clean, process, and model log data also presents great challenges. In this survey, we focus on mining search and browse log data for Web search. We start with an introduction to search and browse log data and an overview of frequently used data summarizations in log mining. We then elaborate on how log mining applications enhance the five major components of a search engine, namely, query understanding, document understanding, document ranking, user understanding, and monitoring and feedback. For each aspect, we survey the major tasks, fundamental principles, and state-of-the-art methods.
    Article · Sep 2013 · ACM Transactions on Intelligent Systems and Technology
  • ABSTRACT: Online social networks, such as Twitter and Facebook, are continuously generating new content and relationships. To fully understand the spread of topics, some essential questions remain open. Why do some seemingly ordinary topics attract attention? Is it due to the attractiveness of the content itself, or do external factors, such as network structure, time, or event location, play a larger role in the dissemination of information? Analyzing the influence and spread of upcoming content is an interesting and useful research direction with promising prospects for web advertising and spam detection. In this paper, a novel time series model for predicting the social influence of topics is introduced. In this model, the existing user-generated content is summarized with a set of valued sequences, and a hybrid model consisting of topical, social, and geographic attributes is adopted for predicting influence trends of newly arriving content. The empirical study conducted on large real data sets indicates that our model is interesting and meaningful, and our methods are effective and efficient in practice.
    Article · Nov 2014 · Neurocomputing
  • ABSTRACT: Mining the latent intents behind search queries is critical for contemporary search engines. Therefore, much effort has been devoted to studying how to infer the intents of search queries from search engine query logs. However, the task of query log-based intent inference is not trivial, since it involves cross-disciplinary knowledge of data modeling and data mining. In this paper, we tackle the problem of query intent inference by integrating multiple information sources in a seamless manner. We first propose a comprehensive data model called Search Query Log Structure (SQLS) that represents the relation between search queries via the User dimension, the URL dimension, the Session dimension, and the Term dimension. In order to explore effective ways of using the multidimensional information modeled by SQLS, we survey and compare three frameworks, namely Result-Oriented Framework, Laplacian-Oriented Framework, and Topic-Oriented Framework, to infer the intents of search queries. Experimental results show that the three frameworks significantly outperform the state-of-the-art approach and meet the diverse requirements arising from different application scenarios.
    Article · Jan 2016 · Knowledge and Information Systems