Yiqun Liu

Tsinghua University, Beijing, China

Publications (56) · 2.83 Total impact

  •
    ABSTRACT: Page quality estimation is one of the greatest challenges for Web search engines. Hyperlink analysis algorithms such as PageRank and TrustRank are usually adopted for this task. However, low quality, unreliable and even spam data in the Web hyperlink graph makes it increasingly difficult to estimate page quality effectively. Analyzing large-scale user browsing behavior logs, we found that a more reliable Web graph can be constructed by incorporating browsing behavior information. The experimental results show that hyperlink graphs constructed with the proposed methods are much smaller in size than the original graph. In addition, algorithms based on the proposed “surfing with prior knowledge” model obtain better estimation results with these graphs for both high quality page and spam page identification tasks. Hyperlink graphs constructed with the proposed methods evaluate Web page quality more precisely and with less computational effort.
    Decision Support Systems 12/2012; 54(1):390–401. · 2.20 Impact Factor
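The "surfing with prior knowledge" model is only named in the abstract above; a common way to inject a prior page-quality estimate into link analysis, sketched here as an assumption rather than the paper's exact algorithm, is PageRank with a non-uniform teleport vector:

```python
def pagerank_with_prior(links, prior, damping=0.85, iters=100):
    """Power iteration for PageRank where the teleport distribution is a
    prior page-quality estimate rather than the uniform vector."""
    n = len(prior)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1 - damping) * prior[i] for i in range(n)]
        for i, outs in enumerate(links):
            if outs:  # spread this page's rank along its out-links
                share = damping * rank[i] / len(outs)
                for j in outs:
                    new[j] += share
            else:     # dangling node: teleport according to the prior
                for j in range(n):
                    new[j] += damping * rank[i] * prior[j]
        rank = new
    return rank

# Toy 3-page cycle 0 -> 1 -> 2 -> 0 with a prior favoring page 0.
links = [[1], [2], [0]]
prior = [0.6, 0.2, 0.2]
scores = pagerank_with_prior(links, prior)
```

On a symmetric cycle, only the prior breaks the tie, so the favored page ends up with the highest score.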
  •
    ABSTRACT: Click-through behaviors are treated as invaluable sources of user feedback, and they have been leveraged in several commercial search engines in recent years. However, estimating unbiased relevance is always a challenging task because of position bias. To solve this problem, many researchers have proposed a variety of assumptions to model click-through behaviors. Most of these models share a common examination hypothesis: users examine search results from top to bottom. Nevertheless, this model cannot draw a complete picture of information-seeking behaviors. Many eye-tracking studies find that user interactions are not sequential but contain revisiting patterns. If a user clicks on a higher-ranked document after having clicked on a lower-ranked one, we call this scenario a revisiting pattern, and we believe that revisiting patterns are important signals of a user's click preferences. This paper incorporates revisiting behaviors into click models and introduces a novel click model named the Temporal Hidden Click Model (THCM), which dynamically models users' click behaviors in temporal order. In our experiment, we collect over 115 million query sessions from a widely-used commercial search engine and then conduct a comparative analysis between our model and several state-of-the-art click models. The experimental results show that the THCM model achieves significant improvements in Normalized Discounted Cumulative Gain (NDCG), click perplexity, and click distribution metrics.
    Proceedings of the Fifth International Conference on Web Search and Web Data Mining, WSDM 2012, Seattle, WA, USA, February 8-12, 2012; 01/2012
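The NDCG metric used in the evaluation above is standard; a minimal sketch, using the common 2^rel − 1 gain and log2 rank discount (the paper's exact variant may differ in detail):

```python
import math

def dcg(rels):
    """Discounted cumulative gain with 2^rel - 1 gain and a
    log2(rank + 1) position discount (ranks are 1-based)."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg(rels):
    """Normalize DCG by the DCG of the ideal (relevance-sorted) ranking."""
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

# A ranking with graded relevance labels 0-3; the best document sits second.
score = ndcg([2, 3, 0, 1])
```

A perfectly ordered list scores exactly 1.0; any mis-ordering of graded labels lands strictly below it.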
  •
    ABSTRACT: Combating Web spam has become one of the top challenges for Web search engines. State-of-the-art spam-detection techniques are usually designed for specific, known types of Web spam and are incapable of dealing with newly appearing spam types efficiently. With user-behavior analyses from Web access logs, a spam page-detection algorithm is proposed based on a learning scheme. The main contributions are the following. (1) User-visiting patterns of spam pages are studied, and a number of user-behavior features are proposed for separating Web spam pages from ordinary pages. (2) A novel spam-detection framework is proposed that can detect various kinds of Web spam, including newly appearing ones, with the help of the user-behavior analysis. Experiments on large-scale practical Web access log data show the effectiveness of the proposed features and the detection framework.
    ACM Transactions on the Web (TWEB); 01/2012
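The abstract above names neither the behavior features nor the learner; as a purely illustrative sketch (the feature names, toy data, and choice of logistic regression are all hypothetical), a tiny classifier over per-page behavior ratios:

```python
import math

def train_logreg(X, y, lr=0.5, epochs=500):
    """Minimal logistic regression trained by per-sample gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1 / (1 + math.exp(-z))
            g = p - yi  # gradient of the log loss w.r.t. z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 / (1 + math.exp(-z))

# Hypothetical per-page features from access logs:
# (search_referral_ratio, short_visit_ratio, return_visit_ratio).
# Toy premise: spam pages get nearly all traffic from search engines
# with very short visits, while ordinary pages see return visitors.
X = [[0.95, 0.90, 0.02], [0.90, 0.85, 0.05],   # spam (label 1)
     [0.30, 0.20, 0.40], [0.25, 0.35, 0.50]]   # ordinary (label 0)
y = [1, 1, 0, 0]
w, b = train_logreg(X, y)
```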
  •
    ABSTRACT: Microblogging services attract people and companies to share their ideas and interests. Since the text of a microblog message is limited in length, people post URLs linking to other websites for detailed information. Hence, URLs that attract more attention spread widely and represent popular information. However, not all of these URLs are useful. Many are spam URLs posted by automated agents or pushed automatically by services on other websites. Based on the features of popular URLs, we divide them into four categories and propose a clustering and classification algorithm to distinguish spam URLs from the genuinely popular ones. Comparative experiments are conducted on English (Twitter) and Chinese (Sina Weibo) messages. We conclude that more than half of the popular URLs are spam. Most of them are pushed from other websites; even the genuinely popular ones gain much of their attention from pushing services. Although the proportions of URLs in Twitter and Sina Weibo messages differ, the characteristics of the spam URLs are similar. Our method efficiently detects spam URLs and their authors without annotations, and is helpful for both research and business on microblogs.
    Cloud Computing and Intelligence Systems (CCIS), 2011 IEEE International Conference on; 10/2011
  • IJCAI 2011, Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Barcelona, Catalonia, Spain, July 16-22, 2011; 01/2011
  •
    ABSTRACT: Anchor texts complement Web page content and have been used extensively in commercial Web search engines. Existing methods for anchor text weighting rely on hyperlink information, which is created by page content editors. Since anchor texts are created to help users browse the Web, the browsing behavior of Web users may also provide useful or complementary information for anchor text weighting. In this paper, we discuss the possibility and effectiveness of incorporating the browsing activities of Web users into anchor texts for Web search. We first analyze the effectiveness of anchor texts combined with browsing activities, and then propose two new anchor models that incorporate browsing activities. To deal with the data sparseness problem of user-clicked anchor texts, two features of users' browsing behavior are explored and analyzed. Based on these features, a smoothing method for the new anchor models is proposed. Experimental results show that by incorporating browsing activities, the new anchor models outperform state-of-the-art anchor models that use only hyperlink information. This study demonstrates the benefit of Web browsing activities for anchor text weighting.
    Information Retrieval 01/2011; 14:290-314. · 0.63 Impact Factor
  •
    ABSTRACT: Ground truth labels are one of the most important parts of many test collections for information retrieval. Each label, depicting the relevance of a query-document pair, is usually judged by a human, and this process is time-consuming and labor-intensive. Automatically generating labels from click-through data has therefore attracted increasing attention. In this paper, we propose a Unified Click Model to predict multi-level labels, which aims to combine the advantages of position models and cascade models. Experiments show that the proposed click model outperforms existing click models in predicting multi-level labels and could replace human-judged labels in test collections.
    Proceedings of the 20th International Conference on World Wide Web, WWW 2011, Hyderabad, India, March 28 - April 1, 2011 (Companion Volume); 01/2011
  • Information Retrieval Technology - 7th Asia Information Retrieval Societies Conference, AIRS 2011, Dubai, United Arab Emirates, December 18-20, 2011. Proceedings; 01/2011
  •
    ABSTRACT: User behavior analysis has played an important role in Web information retrieval. Rare queries, whose frequencies are rather low, are usually ignored in existing studies due to data sparseness. Little is known about the mass of rare queries, in terms of either the information needs they express or the associated user behavior. In this paper, we make an empirical study of users' behavior on rare queries using a large-scale search log. Features concerning the query, the resource, and post-query actions are analyzed, based on which we propose a practical categorization framework and obtain an overview of rare query composition. Further, we study the characteristics of several of the most commonly occurring types of rare queries and suggest improving their search performance separately. This work gives more insight into understanding the long tail of queries and will be helpful for Web search on rare queries.
    Proceedings of the 2011 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2011, Campus Scientifique de la Doua, Lyon, France, August 22-27, 2011; 01/2011
  •
    ABSTRACT: Query recommendation helps users describe their information needs more clearly so that search engines can return appropriate answers and meet those needs. State-of-the-art research shows that using users' behavior information helps improve query recommendation performance. Instead of finding the terms most similar to those previous users queried, we focus on how to detect users' actual information needs based on their search behaviors. The key idea of this paper is that although clicked documents are not always relevant to users' queries, the snippets that lead users to click most probably meet their information needs. Based on an analysis of large-scale practical search behavior logs, two snippet click behavior models are constructed and corresponding query recommendation algorithms are proposed. Experimental results based on the click-through data of two widely-used commercial search engines show that the proposed algorithms outperform the practical recommendation methods of these two search engines. To the best of our knowledge, this is the first time that snippet click models have been proposed for the query recommendation task.
    Expert Systems with Applications 01/2011; 38:13847-13856.
  •
    Information Retrieval Technology - 7th Asia Information Retrieval Societies Conference, AIRS 2011, Dubai, United Arab Emirates, December 18-20, 2011. Proceedings; 01/2011
  •
    ABSTRACT: Users' query and click behavior information has been widely used in relevance feedback techniques to improve search engine performance. However, there is a special kind of user behavior: submitting a query but not clicking any result returned by the search engine. Queries ending without a click make up a large fraction of user search activities, but few user behavior studies have examined them. In this paper, we investigate non-click behavior using large-scale search logs from a commercial search engine. We analyze query and non-click behavior characteristics at three levels: query, session, and user. Query frequency, the results returned by the search engine, and the category of information need are all observed to be related to non-click behavior. There are significant differences between the post-query actions of clicked and non-clicked queries. Users' personal preferences can also result in non-click behavior. Our findings have implications for separating queries that search engines handle well from those they do not, and are useful for user behavior reliability studies.
    Information Retrieval Technology - 6th Asia Information Retrieval Societies Conference, AIRS 2010, Taipei, Taiwan, December 1-3, 2010. Proceedings; 01/2010
  •
    Proceedings of the 19th International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 26-30, 2010; 01/2010
  •
    ABSTRACT: Ranking is an essential part of information retrieval (IR) tasks such as Web search. Nowadays there are hundreds of features for ranking, so learning to rank (LTR), an interdisciplinary field of IR and machine learning (ML), has attracted increasing attention. The features used in IR are not always independent of each other; hence feature selection, an important issue in ML, deserves attention in LTR. However, state-of-the-art LTR approaches rarely analyze the connections among features from the perspective of feature selection. In this paper, we propose a hierarchical feature selection strategy containing two phases for ranking and learn ranking functions accordingly. The experimental results show that ranking functions based on the selected feature subset significantly outperform those based on all features.
    Proceedings of the 19th International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 26-30, 2010; 01/2010
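The abstract does not specify what the two phases of the hierarchical strategy are; one common reading, assumed here for illustration and not taken from the paper, is a redundancy-removal pass followed by ranking the survivors by correlation with the relevance labels:

```python
def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb) if sa and sb else 0.0

def two_phase_select(features, labels, redundancy=0.95, k=2):
    """Phase 1: drop features that nearly duplicate an already-kept one.
    Phase 2: keep the k survivors most correlated with the labels."""
    kept = []
    for name, vals in features.items():
        if all(abs(pearson(vals, features[o])) < redundancy for o in kept):
            kept.append(name)
    kept.sort(key=lambda n: -abs(pearson(features[n], labels)))
    return kept[:k]

# Toy ranking data: f2 duplicates f1; f3 is noise; labels follow f1.
features = {
    "f1": [0.1, 0.4, 0.5, 0.9],
    "f2": [0.2, 0.8, 1.0, 1.8],   # exactly 2 * f1, redundant
    "f3": [0.7, 0.1, 0.6, 0.2],   # uncorrelated noise
}
labels = [0, 0, 1, 1]
selected = two_phase_select(features, labels)
```

The redundant copy is discarded in phase 1, and the label-correlated feature outranks the noise feature in phase 2.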
  •
    ABSTRACT: We consider the problem of detecting epidemic tendencies by mining search logs. We propose an algorithm based on click-through information to select epidemic-related queries/terms, and adopt linear regression to model epidemic occurrences against the frequencies of epidemic-related terms (ERTs) in search logs. The results show our algorithm is effective in finding ERTs that correlate highly with epidemic occurrences. We also find the proposed method performs better when combining different ERTs than when using a single ERT.
    Proceedings of the 19th International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 26-30, 2010; 01/2010
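The linear model relating ERT frequencies to epidemic occurrences can be sketched with ordinary least squares; the term and the weekly counts below are invented for illustration:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

# Hypothetical weekly counts: queries containing an epidemic-related
# term ("flu symptoms") vs. reported case numbers for the same weeks.
term_freq = [120, 150, 300, 460, 520]
cases = [40, 55, 110, 170, 195]
a, b = fit_line(term_freq, cases)
predicted = a * 400 + b  # estimated cases for a week with 400 such queries
```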
  •
    ABSTRACT: Under different language contexts, people choose different terms or phrases to express their feelings and opinions. When a user is writing a paper or chatting with a friend, he/she applies a specific language model corresponding to the underlying goal. This paper presents a log-based study analyzing language models associated with specific goals. We present statistics on terms and software programs, propose methods to estimate the divergence of language models with specific user goals, and measure the discrimination of these models. Experimental results show that language models with different user goals have large divergence and different discrimination. These conclusions can be applied to understand user needs and improve Human-Computer Interaction (HCI).
    Proceedings of the 19th International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 26-30, 2010; 01/2010
  •
    ABSTRACT: In this paper, we propose a new query recommendation method designed to generate recommended queries that are not only related to the input query but also lead to high-quality search results. Existing query recommendation methods mostly focus on users' intentions or the relationship between the input query and recommended queries. Because of the limitations of Web resources and search engines' indexes, not all recommended queries lead to good search results, and such recommendations do not help users find the information they need. In our work, we use machine learning methods to re-rank a pre-generated recommendation candidate list, selecting user behavior features to filter out queries with poor search performance. The experimental results show that our method recommends queries that are both related to the input and lead to useful results.
    Information Retrieval Technology - 6th Asia Information Retrieval Societies Conference, AIRS 2010, Taipei, Taiwan, December 1-3, 2010. Proceedings; 01/2010
  •
    ABSTRACT: User behavior information analysis has been shown to be important for the optimization and evaluation of Web search and has become a major area in both information retrieval and knowledge management research. This paper focuses on the reliability of users' searching behavior, based on large-scale query and click-through logs collected from commercial search engines. The concept of reliability is defined in probabilistic terms. The context of user click behavior on search results is analyzed in terms of relevance. Five features, namely query number, click entropy, first click ratio, last click ratio, and rank position, are proposed and studied to separate reliable user clicks from the others. Experimental results show that the proposed method evaluates the reliability of user behavior effectively: the AUC value of the ROC curve is 0.792, and the algorithm retains 92.8% of relevant clicks when filtering out 40% of low-quality clicks.
    Information Retrieval Technology, 5th Asia Information Retrieval Symposium, AIRS 2009, Sapporo, Japan, October 21-23, 2009. Proceedings; 01/2009
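Of the five features listed above, click entropy has a compact standard definition; a sketch follows (the reliability interpretation paraphrases the abstract, while the click counts are invented):

```python
import math

def click_entropy(clicks):
    """Shannon entropy of the click distribution over results for one
    query. Low entropy means clicks concentrate on a few results (a
    strong, consistent signal); high entropy means clicks scatter."""
    total = sum(clicks.values())
    ent = 0.0
    for c in clicks.values():
        if c:
            p = c / total
            ent -= p * math.log2(p)
    return ent

# Hypothetical click counts per result URL for two queries.
navigational = {"u1": 98, "u2": 1, "u3": 1}   # clicks concentrate
ambiguous = {"u1": 30, "u2": 35, "u3": 35}    # clicks scatter
```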
  •
    ABSTRACT: This paper focuses on the 'user browsing graph', which is constructed from users' click-through behavior modeled with Web access logs. The user browsing graph has recently been adopted to improve Web search performance, and initial studies show it is more reliable than the hyperlink graph for inferring page importance. However, the structure and evolution of the user browsing graph have not been fully studied, and many questions remain to be answered. In this paper, we look into the structure of the user browsing graph and its evolution over time. We give a quantitative analysis of the difference in graph structure between the hyperlink graph and the user browsing graph, and then examine why link analysis algorithms perform better on the browsing graph. We also propose a method for combining user behavior information with the hyperlink graph. Experimental results show that the user browsing graph and the hyperlink graph share few links in common, and a combination of the two graphs achieves good performance in page quality estimation.
    Proceedings of the Second International Conference on Web Search and Web Data Mining, WSDM 2009, Barcelona, Spain, February 9-11, 2009; 01/2009
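The abstract says the two graphs share few links and that combining them helps, but not how the combination is built; one plausible sketch (the convex-combination edge weighting is an assumption, not the paper's method) is a weighted union of the two edge sets:

```python
def combine_graphs(hyperlink, browsing, alpha=0.5):
    """Weighted union of two weighted edge sets: each edge's weight is a
    convex combination of its hyperlink and browsing-graph weights, with
    a missing edge contributing weight 0."""
    edges = set(hyperlink) | set(browsing)
    return {e: alpha * hyperlink.get(e, 0.0)
               + (1 - alpha) * browsing.get(e, 0.0)
            for e in edges}

# Toy edge sets: the two graphs share only the edge (a, b).
hyperlink = {("a", "b"): 1.0, ("b", "c"): 1.0}
browsing = {("a", "b"): 0.7, ("a", "c"): 0.3}
combined = combine_graphs(hyperlink, browsing)
```

Edges found only in one source survive with reduced weight, so browsing evidence can both reinforce existing hyperlinks and introduce new transitions.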
  • Information Retrieval Technology, 5th Asia Information Retrieval Symposium, AIRS 2009, Sapporo, Japan, October 21-23, 2009. Proceedings; 01/2009