Conference Paper

Expediting search trend detection via prediction of query counts

Authors: Golbandi, Katzir, Koren, and Lempel

Abstract

The massive volume of queries submitted to major Web search engines reflects human interest at a global scale. While the popularity of many search queries is stable over time or fluctuates with periodic regularity, some queries experience a sudden and ephemeral rise in popularity that is unexplained by their past volumes. Typically the popularity surge is precipitated by some real-life event in the news cycle. Such queries form what are known as search trends. All major search engines, using query log analysis and other signals, invest in detecting such trends. The goal is to surface trends accurately, with low latency relative to the actual event that sparked the trend. This work formally defines precision, recall and latency metrics related to top-k search trend detection. Then, observing that many trend detection algorithms rely on query counts, we develop a linear auto-regression model to predict future query counts. Subsequently, we tap the predicted counts to expedite search trend detection by plugging them into an existing trend detection scheme. Experimenting with query logs from a major Web search engine, we report both the stand-alone accuracy of our query count predictions, as well as the task-oriented effects of the prediction on the emitted trends. We show an average reduction in trend detection latency of roughly twenty minutes, with a negligible impact on the precision and recall metrics.
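The full text is not available here, but the abstract's central component, a linear auto-regression over a query's recent counts that predicts the next count, can be sketched as below. The lag order, interval width, and least-squares training procedure are illustrative assumptions, not the authors' exact model.

```python
import numpy as np

def fit_ar(counts, p=6):
    """Fit an order-p linear auto-regressive model by ordinary least squares.

    counts: 1-D array of historical per-interval query counts.
    Returns weights w with count[t] ~ w[0] + w[1:] @ counts[t-p:t].
    """
    counts = np.asarray(counts, dtype=float)
    X = np.array([counts[t - p:t] for t in range(p, len(counts))])
    X = np.hstack([np.ones((len(X), 1)), X])   # intercept column
    y = counts[p:]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def predict_next(counts, w):
    """One-step-ahead count prediction from the last p observations."""
    p = len(w) - 1
    x = np.concatenate([[1.0], np.asarray(counts, dtype=float)[-p:]])
    return float(x @ w)

# Toy example: noisy per-interval counts of a single query.
rng = np.random.default_rng(0)
history = rng.poisson(lam=40, size=300).astype(float)
w = fit_ar(history, p=6)
print("predicted next count:", predict_next(history, w))
```

A trend detector that normally waits for the actual count of an interval can instead be fed the predicted count as soon as the interval starts, which is the source of the latency reduction the abstract reports.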


... First, like in previous work [20,24,9], PocketTrend detects trending keywords by analyzing the frequency at which these keywords are used in the search queries. The goal is to detect keywords that are searched significantly more frequently than normal (e.g., five times more frequently than during the same hour the day before). ...
... As in related work on trend detection [20,24,9], we leverage the observation that in the absence of active global trends, the same keyword within the same hour (but for different days) tends to have similar numbers of appearances in search queries. We define a keyword to be trending if the number of appearances within a specific hour in the current day is significantly higher (e.g., 5x more) than in the same hour during a reference day (footnote: we use a previous day as the reference point to identify trending keywords). Formally, a keyword is defined to be frequent if ... ...
... Discovering trends in search is a well-studied problem [9]. Commercial search engines already offer products such as Google Trends [6]. ...
Conference Paper
Full-text available
Trending search topics cause unpredictable query load spikes that hurt the end-user search experience, particularly the mobile one, by introducing longer delays. To understand how trending search topics are formed and evolve over time, we analyze 21 million queries submitted during periods where popular events caused search query volume spikes. Based on our findings, we design and evaluate PocketTrend, a system that automatically detects trending topics in real time, identifies the search content associated to the topics, and then intelligently pushes this content to users in a timely manner. In that way, PocketTrend enables a client-side search engine that can instantly answer user queries related to trending events, while at the same time reducing the impact of these trends on the datacenter workload. Our results, using real mobile search logs, show that in the presence of a trending event, up to 13-17% of the overall search traffic can be eliminated from the datacenter, with as many as 19% of all users benefiting from PocketTrend.
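The detection rule quoted in the excerpts above, flag a keyword when its count in the current hour is several times (e.g., 5x) its count in the same hour of a reference day, reduces to a few lines. The data structures and the min_count guard are illustrative assumptions:

```python
from collections import Counter

def trending_keywords(current_hour_counts, reference_hour_counts,
                      ratio=5.0, min_count=10):
    """Flag keywords whose current-hour count exceeds `ratio` times their
    count in the same hour of a reference day.

    min_count filters out keywords too rare for the ratio to be meaningful.
    """
    trending = []
    for kw, c in current_hour_counts.items():
        ref = reference_hour_counts.get(kw, 0)
        if c >= min_count and c >= ratio * max(ref, 1):
            trending.append((kw, c, ref))
    return sorted(trending, key=lambda t: t[1], reverse=True)

today = Counter({"earthquake": 120, "weather": 300, "youtube": 500})
yesterday = Counter({"earthquake": 8, "weather": 280, "youtube": 480})
print(trending_keywords(today, yesterday))   # only "earthquake" is flagged
```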
... Vakali et al. proposed a cloud-based framework for detecting trending topics on Twitter and blogging systems [28], focusing particularly on implementing the framework on the cloud, which is complementary to our goal. Golbandi et al. [14] tackled trend topic detection for search engines. Despite the similar goal, their solution applies to a very different domain, and thus focuses on different elements (query terms) and uses different techniques (language models) for prediction. ...
... It then takes, for each object d, the largest of the computed probabilities (line 11) and the associated class (line 12), and tests whether it is possible to state that d belongs to that class with enough confidence at t_r, i.e., whether: (1) the probability exceeds the minimum confidence for the class, and (2) t_r exceeds the per-class minimum threshold (line 13). If the test succeeds, the algorithm stops monitoring the object (line 16), saving the current t_r and the per-class probabilities computed at this window in t and P (lines 14-15). After exhausting all possible monitoring periods (t_r > γ_max) or whenever the number of objects being monitored n_objs reaches 0, the algorithm returns. ...
... [algorithm listing residue: "12: t, P ← MultiClassProbs(D_test, C_D, θ, γ); 13: return t, PredictERTree(D_test, P, obj. feats)"] For each video, the datasets contain the following features (shown in Table 3): the time series of the numbers of views, comments and favorites, as well as the ten most important referrers (incoming links), along with the date each referrer was first encountered, the video's upload date and its category. The original datasets contain videos of various ages, ranging from days to years. ...
Article
Full-text available
We here focus on the problem of predicting the popularity trend of user generated content (UGC) as early as possible. Taking YouTube videos as case study, we propose a novel two-step learning approach that: (1) extracts popularity trends from previously uploaded objects, and (2) predicts trends for new content. Unlike previous work, our solution explicitly addresses the inherent tradeoff between prediction accuracy and remaining interest in the content after prediction, solving it on a per-object basis. Our experimental results show great improvements of our solution over alternatives, and its applicability to improve the accuracy of state-of-the-art popularity prediction methods.
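The monitoring loop excerpted above commits to a class for each object once the top class probability clears a per-class confidence threshold after a per-class minimum monitoring time. A simplified, self-contained sketch of that control flow, with hypothetical probability inputs standing in for the paper's learned classifier:

```python
def early_classify(objects, class_probs, min_conf, min_time, t_max):
    """Assign each object to a class as early as confidence allows.

    class_probs(obj, t) -> dict of class -> probability at monitoring time t.
    min_conf[c]: minimum probability required to commit to class c.
    min_time[c]: earliest monitoring time at which class c may be assigned.
    Returns dict obj -> (class, commit_time) for objects decided by t_max.
    """
    decided = {}
    monitoring = set(objects)
    for t in range(1, t_max + 1):
        for obj in list(monitoring):
            probs = class_probs(obj, t)
            best_class, best_p = max(probs.items(), key=lambda kv: kv[1])
            if best_p >= min_conf[best_class] and t >= min_time[best_class]:
                decided[obj] = (best_class, t)      # stop monitoring this object
                monitoring.discard(obj)
        if not monitoring:
            break
    return decided

# Toy usage: probabilities sharpen as more of the popularity curve is seen.
probs = lambda obj, t: {"viral": min(0.5 + 0.1 * t, 1.0),
                        "flat": 1.0 - min(0.5 + 0.1 * t, 1.0)}
print(early_classify(["v1"], probs,
                     min_conf={"viral": 0.8, "flat": 0.8},
                     min_time={"viral": 2, "flat": 2}, t_max=10))
```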
... It has been discussed in the media that the reporting of such incidents in one city would have been helpful in preventing similar events in other cities. Trend detection is also an important task for other types of streaming data monitoring, such as search engine query monitoring [18], climate-related spatio-temporal data analysis [33], and social media event analysis [6,30]. ...
... In search engine query monitoring, trend detection techniques identify rising queries that reflect users' attention at the moment. In [18], Golbandi et al. advanced the basic search trend detection algorithm proposed by Dong et al. in [14] and reduced the latency of the detection algorithm by 20 minutes. They used linear regression to predict future query counts. ...
Conference Paper
Full-text available
Methods for detecting and summarizing emergent keywords have been extensively studied since social media and microblogging activities have started to play an important role in data analysis and decision making. We present a system for monitoring emergent keywords and summarizing a document stream based on the dynamic semantic graphs of streaming documents. We introduce the notion of dynamic eigenvector centrality for ranking emergent keywords, and present an algorithm for summarizing emergent events that is based on the minimum weight set cover. We demonstrate our system with an analysis of streaming Twitter data related to public security events.
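The ranking signal named here, eigenvector centrality over a semantic graph of streaming documents, can be illustrated with a plain power iteration on a static keyword co-occurrence matrix. The paper's dynamic variant updates the graph as documents stream in, which this sketch omits:

```python
import numpy as np

def eigenvector_centrality(adj, iters=100, tol=1e-9):
    """Power iteration on a symmetric, nonnegative co-occurrence matrix."""
    v = np.ones(adj.shape[0]) / adj.shape[0]
    for _ in range(iters):
        nxt = adj @ v
        nxt /= np.linalg.norm(nxt)
        done = np.linalg.norm(nxt - v) < tol
        v = nxt
        if done:
            break
    return v

# Keywords 0..3; keyword 1 co-occurs most strongly and should rank first.
A = np.array([[0, 3, 0, 1],
              [3, 0, 2, 2],
              [0, 2, 0, 0],
              [1, 2, 0, 0]], dtype=float)
scores = eigenvector_centrality(A)
print(np.argsort(-scores))   # keywords ranked by centrality
```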
... In [4], semantic change between consecutive queries and the relationship between the changed query and the clicked document are used to infer query context. In addition, query clustering [3], geographical location [15], and association rules [1] are some of the methods used by researchers for better information retrieval. However, we argue that these context extraction methods are confined by the capacity of their employed representation, which is hardly generalizable and not optimal for retrieval tasks. ...
Article
With the global proliferation of information, the search engine has become an indispensable tool that helps users find information in a simple, easy and quick way. These search engines employ sophisticated document ranking algorithms based on query context, link structure and user behavior characterization. However, all these features keep changing in real-world settings. Ideally, ranking algorithms must be robust to time-sensitive queries. Microblog content is typically short-lived, as it is often intended to provide quick updates or share brief information in a concise manner. The proposed technique first determines if a query is currently in high demand, then automatically appends a time-sensitive context to the query by mining those microblogs whose torrent matches the query in demand. The extracted contextual terms are further used in re-ranking the search results. The experimental results reveal the existence of a strong correlation between ephemeral search queries and microblog volumes. These volumes are analyzed to identify the temporal proximity of their torrents. It is observed that approximately 70% of search torrents occurred one day before or after blog torrents for lower threshold values. When the threshold is increased, the torrent match ratio rises to ∼90%. In addition, the performance of the proposed model is analyzed for two combining principles, namely aggregate relevance (AR) and disjunctive relevance (DR). The DR variant of the proposed model outperforms the AR variant in terms of relevance and interest scores. Further, the proposed model's performance is compared with three categories of retrieval models: the log-logistic model, the sequential dependence model (SDM) and the embedding-based query expansion model (EQE1). The experimental results demonstrate the effectiveness of the proposed technique in terms of result relevancy and user satisfaction, with improvements of ∼25% in the result relevance score and ∼35% in the user satisfaction score over the underlying retrieval models. The work can be expanded in many directions in the future, as researchers can combine these strategies to build recommendation systems, automatic query reformulation, chatbots, and NLP toolkits.
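The abstract contrasts aggregate relevance (AR) and disjunctive relevance (DR) without defining them in this excerpt. One common reading, assumed here purely for illustration, is that AR accumulates evidence across context terms while DR is satisfied by the strongest single match:

```python
def aggregate_relevance(term_scores):
    """AR (assumed reading): reward documents matching many context terms."""
    return sum(term_scores)

def disjunctive_relevance(term_scores):
    """DR (assumed reading): one strong contextual match suffices."""
    return max(term_scores, default=0.0)

# Per-context-term match scores of one document against three context terms.
scores = [0.10, 0.05, 0.90]
print(aggregate_relevance(scores))    # 1.05: credit for broad coverage
print(disjunctive_relevance(scores))  # 0.90: credit for the best single match
```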
... Thus, the posts in microblogs can immediately reflect users' current interests [5], [6]. Some studies also show that user interest in microblogs lags user interest in search engines by a few days [7], [8]. ...
Article
Full-text available
In this paper, an approach is proposed to evaluate and rearrange web pages based on query-related web context. The contexts we focus on are the terms co-occurring with queries in microblogs. The proposed approach operates on the results retrieved by a search engine. Given a query, it retrieves the search results and checks whether the query is in a burst state. If the query is in a burst (or popular) state, our method applies the query-related context to the search results. Since context terms can reflect the current interest of people regarding the query, a web page can be considered within the current interest of people if it contains many of the context terms. Thus, the retrieved web pages are re-ranked based on the context terms, to present the search results in accordance with current public interest. We present observations showing that microblog contents and search queries are strongly related when queries are in a burst state. To verify the effect of context terms, we conduct experiments and compare the results with Google using a questionnaire survey.
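The re-ranking step described here, boosting pages that contain many of the context terms mined from microblogs, can be sketched as a simple score interpolation. The bag-of-words matching and the alpha weight are illustrative assumptions:

```python
def rerank_by_context(pages, context_terms, alpha=0.5):
    """Combine each page's original retrieval score with context-term coverage.

    pages: list of (page_id, base_score, text).
    context_terms: set of terms co-occurring with the query in microblogs.
    alpha balances the original ranking against the burst-time context signal.
    """
    def context_score(text):
        words = set(text.lower().split())
        return len(words & context_terms) / max(len(context_terms), 1)

    rescored = [(pid, alpha * base + (1 - alpha) * context_score(text))
                for pid, base, text in pages]
    return sorted(rescored, key=lambda p: p[1], reverse=True)

pages = [("p1", 0.9, "general history of the festival"),
         ("p2", 0.7, "lineup schedule tickets stage times tonight")]
ctx = {"lineup", "tickets", "tonight"}
print(rerank_by_context(pages, ctx))   # p2 overtakes p1 during the burst
```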
... As such, some characteristics are likely to change over time. In recent years, many research papers have focused specifically on the problem of detecting trends over such data [2,3,5]. In addition, not all trends may warrant the same level of attention from all users. ...
Conference Paper
In recent years, studies about trend detection in online social media streams have begun to emerge. Since not all users are likely to always be interested in the same set of trends, some of the research also focused on personalizing the trends by using some predefined personalized context. In this paper, we take this problem further to a setting in which the user's context is not predefined, but rather determined as the user issues a query. This presents a new challenge since trends cannot be computed ahead of time using high latency algorithms. We present RT-Trend, an online trend detection algorithm that promptly finds relevant in-context trends as users issue search queries over a dataset of documents. We evaluate our approach using real data from an online social network by assessing its ability to predict actual activity increase of social network entities in the context of a search result. Since we implemented this feature into an existing tool with an active pool of users, we also report click data, which suggests positive feedback.
... This feature has been actively implemented in almost all popular search engines, such as Google Trends, Yahoo! Trending Now, and Bing Trends. Golbandi, Katzir, Koren, and Lempel (2013) provided a method to detect search trends accurately, with low latency relative to the actual event that sparked the trend. Similar approaches can be found in weather and stock market studies. ...
Article
Most information retrieval (IR) systems consider relevance, usefulness, and quality of information objects (documents, queries) for evaluation, prediction, and recommendation, often ignoring the underlying search process of information seeking. This may leave out opportunities for making recommendations that analyze the search process and/or recommend alternative search process instead of objects. To overcome this limitation, we investigated whether by analyzing a searcher's current processes we could forecast his likelihood of achieving a certain level of success with respect to search performance in the future. We propose a machine-learning-based method to dynamically evaluate and predict search performance several time-steps ahead at each given time point of the search process during an exploratory search task. Our prediction method uses a collection of features extracted from expression of information need and coverage of information. For testing, we used log data collected from 4 user studies that included 216 users (96 individuals and 60 pairs). Our results show 80-90% accuracy in prediction depending on the number of time-steps ahead. In effect, the work reported here provides a framework for evaluating search processes during exploratory search tasks and predicting search performance. Importantly, the proposed approach is based on user processes and is independent of any IR system.
Chapter
Finding important users from social media is a challenging and significant task. In this paper, we focus on the users in the blogosphere and propose an approach to identify prophetic bloggers by estimating bloggers’ prediction ability on buzzwords and categories. We conduct a time-series analysis on large-scale blog data, which includes categorizing a blogger into knowledgeable categories, identifying past buzzwords, analyzing a buzzword’s peak time content and growth period, and estimating a blogger’s prediction ability on a buzzword and on a category. Bloggers’ prediction ability on a buzzword is evaluated considering three factors: post earliness, content similarity and entry frequency. Bloggers’ prediction ability on a category is evaluated considering the buzzword coverage in that category. For calculating bloggers’ prediction ability on a category, we propose multiple formulas and compare the accuracy through experiments. Experimental results show that the proposed approach can find prophetic bloggers on real-world blog data.
Article
Query completion approaches assist searchers in formulating queries with few keystrokes when using an information retrieval system, helping users avoid spelling mistakes and produce clear query formulations. Previous work on query completion algorithms returns a ranked list of queries to the user, mostly based on the overall observed search popularity of query candidates in the whole query log. However, query search popularity changes over time, i.e., it is time-aware. Thus, ranking approaches based on overall search popularity may not work well, and users may fail to find an acceptable query in the returned list, limiting search satisfaction. Hence, this paper proposes a Learning-based Personalized Query Ranking approach (LQR), which exploits features of the observed and predicted search popularity both in the whole log and in the recent period. Taking a pair-wise learning scenario, this paper presents a method for generating a ranked list of query candidates, and then reranks the candidates by their similarity to the current search context. The experimental results show the proposed approach outperforms the baseline in terms of Mean Reciprocal Rank (MRR), reporting an average MRR improvement of 7% over the baseline.
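Mean Reciprocal Rank (MRR), the measure used here and in several of the query auto completion papers below, is the average over test queries of the reciprocal rank of the first correct item:

```python
def mean_reciprocal_rank(ranked_lists, relevant):
    """ranked_lists[i]: the ranking produced for query i.
    relevant[i]: the item considered correct for query i."""
    total = 0.0
    for ranking, target in zip(ranked_lists, relevant):
        for rank, item in enumerate(ranking, start=1):
            if item == target:
                total += 1.0 / rank
                break                 # only the first hit counts
    return total / len(ranked_lists)

# Correct item at rank 2 for query 1 and rank 1 for query 2: (0.5 + 1.0) / 2.
print(mean_reciprocal_rank([["a", "b", "c"], ["x", "y"]], ["b", "x"]))  # 0.75
```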
Article
In information retrieval, query auto completion (QAC), also known as type-ahead [Xiao et al., 2013, Cai et al., 2014b] and auto-complete suggestion [Jain and Mishne, 2010], refers to the following functionality: given a prefix consisting of a number of characters entered into a search box, the user interface proposes alternative ways of extending the prefix to a full query. Ranking query completions is a challenging task due to the limited length of prefixes entered by users, the large volume of possible query completions matching a prefix, and the broad range of possible search intents. In recent years, a large number of query auto completion approaches have been proposed that produce ranked lists of alternative query completions by mining query logs. In this survey, we review work on query auto completion that has been published before 2016. We focus mainly on web search and provide a formal definition of the query auto completion problem. We describe two dominant families of approaches to the query auto completion problem, one based on heuristic models and the other based on learning to rank. We also identify dominant trends in published work on query auto completion, viz. the use of time-sensitive signals and the use of user-specific signals. We describe the datasets and metrics that are used to evaluate algorithms for query auto completion. We also devote a chapter to efficiency and a chapter to presentation and interaction aspects of query auto completion. We end by discussing related tasks as well as potential research directions to further the area.
Conference Paper
The majority of Web email is known to be generated by machines, even when one excludes spam. Many machine-generated email messages such as invoices or travel itineraries are critical to users. Recent research establishes that causality relations between certain types of machine-generated email messages exist and can be mined. These relations exhibit a link between a given message and a past message that gave rise to its creation. For example, a shipment notification message can often be linked to a past online purchase message. Instead of studying how an incoming message can be linked to the past, we propose here to focus on predicting future email arrival as implied by causality relations. Such a prediction method has several potential applications, ranging from improved ad targeting in up-sell scenarios to reducing false positives in spam detection. We introduce a novel approach for predicting which types of machine-generated email messages, represented by so-called "email templates", a user should receive in future time windows. Our prediction approach relies on (1) statistically inferring causality relations between email templates, (2) building a generative model that explains the inbox of each user using those causality relations, and (3) combining those results to predict which email templates are likely to appear in future time frames. We present preliminary experimental results and some data insights obtained by analyzing several million inboxes of Yahoo Mail users, who voluntarily opted in for such research.
Article
Purpose Identifying important users from social media has recently attracted much attention in the information and knowledge management community. Although researchers have focused on users' knowledge levels on certain topics or influence degrees on other users in social networks, previous works have not studied users' ability to predict future popularity. This paper aims to propose a novel approach to find prophetic bloggers based on their buzzword prediction ability. Design/methodology/approach The main approach is to conduct a time-series analysis in the blogosphere considering four factors: post earliness, content similarity, entry frequency and buzzword coverage. Our method has four steps: categorizing a blogger into knowledgeable categories, identifying past buzzwords, analyzing a buzzword's peak time content and growth period and, finally, evaluating a blogger's prediction ability on a buzzword and on a category. Findings Experimental results on real-world blog data consisting of 150 million entries from 11 million bloggers demonstrate that the proposed approach can find prophetic bloggers and outperforms others that do not take temporal features into account. Originality/value To the best of the authors' knowledge, our approach is the first successful attempt to identify prophetic bloggers. Finding prophetic bloggers can bring great value for two reasons. First, as prophetic bloggers tend to post creative and insightful information, analysis of their blog entries may help find future buzzword candidates. Second, communication with prophetic bloggers can help understand future trends, gain insight into early adopters' thoughts on new technology or even foresee things that will become popular.
Conference Paper
Prediction of scholar popularity has become an important research topic for a number of reasons. In this paper, we tackle the problem of predicting the popularity trend of scholars, concentrating on making predictions both as early and as accurately as possible. To perform the prediction task, we first extract the popularity trends of scholars from a training set. To that end, we apply a time series clustering algorithm called K-Spectral Clustering (K-SC) to identify the popularity trends as cluster centroids. We then predict trends for scholars in a test set by solving a classification problem. Specifically, we first compute a set of measures for each scholar based on the distance between earlier points in her particular popularity curve and the identified centroids. We then combine those distance measures with a set of academic features (e.g., number of publications, number of venues, etc.) collected during the same monitoring period, and use them as input to a classification method. One aspect that distinguishes our method from other approaches is that the monitoring period, during which we gather information on each scholar's popularity and academic features, is determined on a per-scholar basis, as part of our approach. Using total citation count as the measure of scientific popularity, we evaluate our solution on the popularity time series of more than 500,000 Computer Science scholars, gathered from Microsoft Azure Marketplace (https://datamarket.azure.com/dataset/mrc/microsoftacademic). The experimental results show that our prediction method outperforms alternative prediction methods. We also show how to apply our method jointly with regression models to improve the prediction of scholar popularity values (e.g., number of citations) at a given future time.
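The classification step described above, measuring the distance between an early popularity curve and each trend centroid and feeding those distances (plus academic features) to a classifier, can be sketched as follows. Plain nearest-centroid stands in for the paper's K-SC clustering and learned classifier, and the centroids are invented for illustration:

```python
import numpy as np

def centroid_distance_features(early_curve, centroids):
    """Distances from an object's early popularity curve to each trend centroid,
    compared over the observed prefix only."""
    t = len(early_curve)
    return np.array([np.linalg.norm(early_curve - c[:t]) for c in centroids])

# Hypothetical trend centroids learned from training curves.
centroids = [np.linspace(0, 100, 30),                              # steady growth
             np.concatenate([np.zeros(15), np.full(15, 80.0)])]    # late burst
early = np.array([2.0, 5.0, 9.0, 15.0, 22.0])                      # first 5 points
feats = centroid_distance_features(early, centroids)
print("nearest trend:", int(np.argmin(feats)))   # naive nearest-centroid label
```

In the paper these distances are not used directly as labels; they are combined with side features and passed to a classifier, which this sketch omits.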
Article
Query auto completion (QAC) methods recommend queries to search engine users when they start entering a query. Current QAC methods mostly rank query completions based on their past popularity, i.e., on the number of times they have previously been submitted as a query. However, query popularity changes over time and may vary drastically across users. Accordingly, the ranking of query completions should be adjusted. Previous time-sensitive and user-specific QAC methods have been developed separately, yielding significant improvements over methods that are neither time-sensitive nor personalized. We propose a hybrid QAC method that is both time-sensitive and personalized. We extend it to handle long-tail prefixes, which we achieve by assigning optimal weights to the contribution from time-sensitivity and personalization. Using real-world search log datasets, we return top N query suggestions ranked by predicted popularity as estimated from popularity trends and cyclic popularity behavior; we rerank them by integrating similarities to a user's previous queries (both in the current session and in previous sessions). Our method outperforms state-of-the-art time-sensitive QAC baselines, achieving total improvements of between 3 and 7 percent in terms of mean reciprocal rank (MRR). After optimizing the weights, our extended model achieves MRR improvements of between 4 and 8 percent.
Conference Paper
Identifying important users from social media has recently attracted much attention in the information and knowledge management community. Although researchers have focused on users' knowledge levels on certain topics or influence degrees on other users in social networks, previous works have not studied users' ability to predict future popularity. In this paper, we propose a novel approach to find important bloggers based on their buzzword prediction ability. We conduct a time-series analysis in the blogosphere considering four factors: post earliness, content similarity, entry frequency and buzzword coverage. We perform preparatory work in categorizing a blogger into knowledgeable categories, identifying past buzzwords, and analyzing a buzzword's peak time content and growth period, and finally evaluate a blogger's prediction ability on a buzzword and on a category. Experimental results on real-world blog data consisting of 150 million entries from 11 million bloggers demonstrate that the proposed approach can find prophetic bloggers and outperforms others that do not take temporal features into account.
Article
Query auto completion (QAC) models recommend possible queries to web search users when they start typing a query prefix. Most of today’s QAC models rank candidate queries by popularity (i.e., frequency), and in doing so they tend to follow a strict query matching policy when counting the queries. That is, they ignore the contributions from so-called homologous queries, queries with the same terms but ordered differently or queries that expand the original query. Importantly, homologous queries often express a remarkably similar search intent. Moreover, today’s QAC approaches often ignore semantically related terms. We argue that users are prone to combine semantically related terms when generating queries. We propose a learning to rank-based QAC approach, where, for the first time, features derived from homologous queries and semantically related terms are introduced. In particular, we consider: (i) the observed and predicted popularity of homologous queries for a query candidate; and (ii) the semantic relatedness of pairs of terms inside a query and pairs of queries inside a session. We quantify the improvement of the proposed new features using two large-scale real-world query logs and show that the mean reciprocal rank and the success rate can be improved by up to 9% over state-of-the-art QAC models.
Article
Given a set of n elements and a corresponding stream of its subsets, we consider the problem of selecting k elements that should appear in at least d such subsets arriving in the "near" future with high probability. For this min-doccur problem, we present an algorithm that provides a solution with a success probability of at least 1 - O(kd log n / D + 1/n), where D is a known constant. Our empirical observations on two streaming data sets show that this algorithm achieves high precision and recall values. We further present a sliding window adaptation of the proposed algorithm to provide a continuous selection of these elements. In contrast to existing work on predicting trends based on potential increases in popularity, our work focuses on a setting with provable guarantees.
Article
Query auto-completion (QAC) is a prominent feature of modern search engines. It is aimed at saving user's time and enhancing the search experience. Current QAC models mostly rank matching QAC candidates according to their past popularity, i.e., frequency. However, query popularity changes over time and may vary drastically across users. Hence, rankings of QAC candidates should be adjusted accordingly. In previous work time-sensitive QAC models and user-specific QAC models have been developed separately. Both types of QAC model lead to important improvements over models that are neither time-sensitive nor personalized. We propose a hybrid QAC model that considers both of these aspects: time-sensitivity and personalization. Using search logs, we return the top N QAC candidates by predicted popularity based on their recent trend and cyclic behavior. We use auto-correlation to detect query periodicity by long-term time-series analysis, and anticipate the query popularity trend based on observations within an optimal time window returned by a regression model. We rerank the returned top N candidates by integrating their similarities with a user's preceding queries (both in the current session and in previous sessions by the same user) on a character level to produce a final QAC list. Our experimental results on two real-world datasets show that our hybrid QAC model outperforms state-of-the-art time-sensitive QAC baseline, achieving total improvements of between 3% and 7% in terms of MRR.
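The periodicity-detection step described here, using auto-correlation over a long history of query counts, can be sketched with a plain sample autocorrelation and a peak check. The candidate lags and acceptance threshold are illustrative choices, not the paper's tuned values:

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of series x at a given lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return float(np.dot(x[:-lag], x[lag:]) / denom) if denom else 0.0

def detect_period(series, candidate_lags=(7, 30, 365), threshold=0.5):
    """Return the candidate lag with the strongest autocorrelation, if any."""
    scores = {lag: autocorr(series, lag)
              for lag in candidate_lags if lag < len(series)}
    best = max(scores, key=scores.get, default=None)
    return best if best is not None and scores[best] >= threshold else None

# Toy weekly-periodic query volume over 400 days.
days = np.arange(400)
volume = (100 + 30 * np.sin(2 * np.pi * days / 7)
          + np.random.default_rng(1).normal(0, 5, 400))
print(detect_period(volume))   # -> 7
```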
Article
Localized query prediction (LQP) is the task of estimating web query trends for a specific location. This problem subsumes many interesting personalized web applications, such as personalized buzz query detection, query expansion, and query recommendation. These personalized applications can greatly enhance user interaction with web search engines by providing more customized information discovered from user input (i.e., queries), but the LQP task has rarely been investigated in the literature. Although abundant work exists on estimating global web search trends, it often encounters the big challenge of data sparsity when personalization comes into play. In this article, we tackle the LQP task by proposing a series of collaborative language models (CLMs). CLMs alleviate the data sparsity issue by collaboratively collecting queries and trend information from other locations. Traditional statistical language models assume a fixed background language model, which loses the taste of personalization. In contrast, CLMs are personalized language models with flexible background language models customized to various locations. The most sophisticated CLM enables the collaboration to adapt to specific query topics, which further advances the personalization level. An extensive set of experiments has been conducted on a large-scale web query log to demonstrate the effectiveness of the proposed models.
Article
Trending search suggestion is leading a new paradigm of image search, where the user's exploratory search experience is facilitated by the automatic suggestion of trending queries. Existing image search engines, however, only provide general suggestions and hence cannot capture the user's personal interest. In this paper, we move one step forward to investigate personalized suggestion of trending image searches according to users' search behaviors. To this end, we propose a learning-based framework including two novel components. The first component, i.e., trending-aware weight-regularized matrix factorization (TA-WRMF), is able to suggest personalized trending search queries by learning user preference from many users as well as auxiliary common searches. The second component associates the most representative and trending image with each suggested query. Each personalized image search suggestion consists of a trending textual query and its associated trending image. The combined textual-visual queries not only are trending (bursty) and personalized to the user's search preference, but also provide a compelling visual aspect of these queries. We evaluate our proposed learning-based framework on large-scale search logs with 21 million users and 41 million queries over two weeks from a commercial image search engine. The evaluations demonstrate that our system achieves about a 50% gain compared with the state-of-the-art in terms of query prediction accuracy.
Conference Paper
Analyzing people's Web search behavior has been a significant topic of interest in the Information Retrieval domain and search engine industry over the past decade. Research in this area has focused on improving search and retrieval capabilities leading to high demands and expectations of Web search users. Understanding and analyzing the Web search process when users are performing Web search tasks is a challenging problem due to many reasons such as subjectivity, dynamic nature, difficulty in measurement of success and difficulty in evaluation. I propose to analyze the users' Web search behavior in order to identify the strategies and tactics they use in fulfilling their task. In order to achieve this, I intend to use data mining and machine learning methods with an emphasis on time series analysis given that the user search process can be considered as a sequence of time related events.
Conference Paper
Query auto-completion (QAC) is a common interactive feature that assists users in formulating queries by providing completion suggestions as they type. In order for QAC to minimise the user's cognitive and physical effort, it must: (i) suggest the user's intended query after minimal input keystrokes, and (ii) rank the user's intended query highly in completion suggestions. Typically, QAC approaches rank completion suggestions by their past popularity. Accordingly, QAC is usually very effective for previously seen and consistently popular queries. Users are increasingly turning to search engines to find out about unpredictable emerging and ongoing events and phenomena, often using previously unseen or unpopular queries. Consequently, QAC must be both robust and time-sensitive -- that is, able to sufficiently rank both consistently and recently popular queries in completion suggestions. To address this trade-off, we propose several practical completion suggestion ranking approaches, including: (i) a sliding window of query popularity evidence from the past 2-28 days, (ii) the query popularity distribution in the last N queries observed with a given prefix, and (iii) short-range query popularity prediction based on recently observed trends. Using real-time simulation experiments, we extensively investigated the parameters necessary to maximise QAC effectiveness for three openly available query log datasets with prefixes of 2-5 characters: MSN and AOL (both English), and Sogou 2008 (Chinese). Optimal parameters vary for each query log, capturing the differing temporal dynamics and querying distributions. Results demonstrate consistent and language-independent improvements of up to 9.2% over a non-temporal QAC baseline for all query logs with prefix lengths of 2-3 characters. This work is an important step towards more effective QAC approaches.
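The first of the proposed rankers, scoring each completion candidate by its popularity inside a sliding window of recent days, is straightforward to sketch. The log format, window length, and cutoff are assumptions for illustration:

```python
from collections import Counter
from datetime import date, timedelta

def windowed_completions(log, prefix, today, window_days=7, top_n=5):
    """Rank completions for `prefix` by frequency within the last `window_days`.

    log: iterable of (query_string, date) pairs from a query log.
    """
    cutoff = today - timedelta(days=window_days)
    counts = Counter(q for q, d in log if d >= cutoff and q.startswith(prefix))
    return [q for q, _ in counts.most_common(top_n)]

log = ([("world cup", date(2024, 6, 1))] * 50        # old spike, outside window
       + [("world news", date(2024, 6, 20))] * 30    # recent interest
       + [("world cup", date(2024, 6, 21))] * 10)
print(windowed_completions(log, "world", date(2024, 6, 21)))
# -> the recent window favors "world news" over the historically popular query
```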
Conference Paper
Among the many tasks driven by very large-scale web search queries, an interesting one is to predict how likely queries about a topic are to become popular (a.k.a. trending or buzzing) as news in the near future, known as "detecting trending queries." This task is nontrivial since recognizing buzzing trends of queries often requires sufficient statistics from users' activities. To address this challenge, we propose a novel framework that predicts whether queries will become trending in the future. In principle, our system is built on two learners. The first learns the dynamics of time series for queries. The second, our decision maker, learns a binary classifier that determines whether queries become trending. Our framework is efficient to build, taking advantage of a grid architecture that can handle the large volume of data. In addition, it is flexible enough to continuously adapt as trending patterns evolve. The experimental results show that our approach achieves high accuracy (over a 77.5% true positive rate) and detects trends much earlier (on average 29 hours in advance) than the baseline system.
Conference Paper
Full-text available
In web search, recency ranking refers to ranking documents by relevance which takes freshness into account. In this paper, we propose a retrieval system which automatically detects and responds to recency sensitive queries. The system detects recency sensitive queries using a high precision classifier. The system responds to recency sensitive queries by using a machine learned ranking model trained for such queries. We use multiple recency features to provide temporal evidence which effectively represents document recency. Furthermore, we propose several training methodologies important for training recency sensitive rankers. Finally, we develop new evaluation metrics for recency sensitive queries. Our experiments demonstrate the efficacy of the proposed approaches.
Conference Paper
Full-text available
Web search is strongly influenced by time. The queries people issue change over time, with some queries occasionally spiking in popularity (e.g., earthquake) and others remaining relatively constant (e.g., youtube). The documents indexed by the search engine also change, with some documents always being about a particular query (e.g., the Wikipedia page on earthquakes is about the query earthquake) and others being about the query only at a particular point in time (e.g., the New York Times is only about earthquakes following a major seismic activity). The relationship between documents and queries can also change as people's intent changes (e.g., people sought different content for the query earthquake before the Haitian earthquake than they did after). In this paper, we explore how queries, their associated documents, and the query intent change over the course of 10 weeks by analyzing query log data, a daily Web crawl, and periodic human relevance judgments. We identify several interesting features by which changes to query popularity can be classified, and show that presence of these features, when accompanied by changes in result content, can be a good indicator of change in query intent.
Conference Paper
Full-text available
Seasonal events such as Halloween and Christmas repeat every year and initiate several temporal information needs. The impact of such events on users is often reflected in search logs in the form of seasonal spikes in the frequency of related queries (e.g. "halloween costumes", "where is santa"). Many seasonal queries such as "sigir conference" mainly target fresh pages (e.g. sigir2011.org) that have less usage data such as clicks and anchor-text compared to older alternatives (e.g. sigir2009.org). Thus, it is important for search engines to correctly identify seasonal queries and make sure that their results are temporally reordered if necessary. In this poster, we focus on detecting seasonal queries using time-series analysis. We demonstrate that the seasonality of a query can be determined with high accuracy according to its historical frequency distribution.
Conference Paper
Full-text available
The analysis of query logs from blog search engines shows that news-related queries occupy a significant portion of the logs. This raises an interesting research question: can the blogosphere be used to identify important news stories? In this paper, we present novel approaches to identify important news story headlines from the blogosphere for a given day. The proposed system consists of two components based on the language model framework: the query likelihood and the news headline prior. For the query likelihood, we propose several approaches to estimate the query language model and the news headline language model. We also suggest several criteria to evaluate the news headline prior, that is, the prior belief about the importance or newsworthiness of the news headline for a given day. Experimental results show that our system significantly outperforms a baseline system. Specifically, the proposed approach gives 2.62% and 10.19% further increases in MAP and P@5 over the best performing result of the TREC'09 Top Stories Identification Task.
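The query-likelihood component of the described language-model framework can be sketched with standard Dirichlet-smoothed unigram models; the headline prior and the paper's specific estimation methods are omitted, and the smoothing parameter is a conventional default rather than the paper's setting:

```python
import math
from collections import Counter

def query_likelihood(query_terms, doc_terms, collection_counts, mu=2000):
    """log P(query | doc) under a Dirichlet-smoothed unigram language model."""
    doc = Counter(doc_terms)
    doc_len = sum(doc.values())
    coll_len = sum(collection_counts.values())
    vocab = max(len(collection_counts), 1)
    score = 0.0
    for t in query_terms:
        p_coll = (collection_counts.get(t, 0) + 1) / (coll_len + vocab)  # add-one floor
        score += math.log((doc[t] + mu * p_coll) / (doc_len + mu))
    return score

# Rank two headlines for the blog query "flu outbreak".
collection = Counter("flu outbreak vaccine season sports scores election flu".split())
headlines = {"h1": "flu outbreak spreads during vaccine season".split(),
             "h2": "election night scores and results".split()}
query = "flu outbreak".split()
ranked = sorted(headlines, reverse=True,
                key=lambda h: query_likelihood(query, headlines[h], collection))
print(ranked)   # h1 ranks first
```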
Conference Paper
Full-text available
We present several methods for mining knowledge from the query logs of the MSN search engine. Using the query logs, we build a time series for each query word or phrase (e.g., 'Thanksgiving' or 'Christmas gifts') where the elements of the time series are the number of times that a query is issued on a day. All of the methods we describe use sequences of this form and can be applied to time series data generally. Our primary goal is the discovery of semantically similar queries and we do so by identifying queries with similar demand patterns. Utilizing the best Fourier coefficients and the energy of the omitted components, we improve upon the state-of-the-art in time-series similarity matching. The extracted sequence features are then organized in an efficient metric tree index structure. We also demonstrate how to efficiently and accurately discover the important periods in a time-series. Finally we propose a simple but effective method for identification of bursts (long or short-term). Using the burst information extracted from a sequence, we are able to efficiently perform 'query-by-burst' on the database of time-series. We conclude the presentation with the description of a tool that uses the described methods, and serves as an interactive exploratory data discovery tool for the MSN query database.
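The similarity-matching trick described here, representing each query time series by its largest Fourier coefficients while accounting for the energy of the omitted ones, can be sketched with numpy's FFT. The distance below is a simple illustrative combination, not the paper's exact formulation:

```python
import numpy as np

def fourier_signature(series, k=8):
    """Keep the k largest-magnitude rFFT coefficients plus the omitted energy."""
    f = np.fft.rfft(np.asarray(series, dtype=float))
    keep = np.argsort(-np.abs(f))[:k]
    sparse = np.zeros_like(f)
    sparse[keep] = f[keep]
    omitted = float(np.sum(np.abs(f) ** 2) - np.sum(np.abs(sparse) ** 2))
    return sparse, omitted

def signature_distance(sig_a, sig_b):
    """Distance on compressed signatures, penalized by discarded energy."""
    (fa, ea), (fb, eb) = sig_a, sig_b
    return float(np.linalg.norm(fa - fb) + np.sqrt(ea + eb))

days = np.arange(365, dtype=float)
x = 100 + 30 * np.sin(2 * np.pi * days / 7)          # weekly demand pattern
y = 100 + 30 * np.sin(2 * np.pi * (days - 1) / 7)    # same pattern, shifted a day
z = 100 + days / 2                                   # steadily growing query
sx, sy, sz = map(fourier_signature, (x, y, z))
print(signature_distance(sx, sy) < signature_distance(sx, sz))  # True
```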
Article
Full-text available
Seasonal influenza epidemics are a major public health concern, causing tens of millions of respiratory illnesses and 250,000 to 500,000 deaths worldwide each year. In addition to seasonal influenza, a new strain of influenza virus against which no previous immunity exists and that demonstrates human-to-human transmission could result in a pandemic with millions of fatalities. Early detection of disease activity, when followed by a rapid response, can reduce the impact of both seasonal and pandemic influenza. One way to improve early detection is to monitor health-seeking behaviour in the form of queries to online search engines, which are submitted by millions of users around the world each day. Here we present a method of analysing large numbers of Google search queries to track influenza-like illness in a population. Because the relative frequency of certain queries is highly correlated with the percentage of physician visits in which a patient presents with influenza-like symptoms, we can accurately estimate the current level of weekly influenza activity in each region of the United States, with a reporting lag of about one day. This approach may make it possible to use search queries to detect influenza epidemics in areas with a large population of web search users.
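The published method is essentially a univariate linear fit between the log-odds of the influenza-related query fraction and the log-odds of the ILI physician-visit percentage. A toy version of that fit, with invented numbers standing in for the real query and CDC data:

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def fit_flu_model(query_fraction, ili_percentage):
    """Fit logit(ILI) = b0 + b1 * logit(query share) on historical weeks."""
    x, y = logit(query_fraction), logit(ili_percentage / 100.0)
    b1, b0 = np.polyfit(x, y, 1)   # slope, intercept
    return b0, b1

def estimate_ili(query_fraction, b0, b1):
    """Nowcast the ILI percentage from the current query share."""
    z = b0 + b1 * logit(query_fraction)
    return 100.0 / (1.0 + np.exp(-z))

# Invented weekly data: flu-related query share vs physician-visit ILI %.
q = np.array([0.001, 0.002, 0.004, 0.008, 0.012])
ili = np.array([1.0, 1.8, 3.2, 5.5, 7.0])
b0, b1 = fit_flu_model(q, ili)
print(estimate_ili(0.006, b0, b1))   # nowcast for an unseen week
```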
Article
Periodic Queries (P-queries), which account for more than 21% of total queries, aim to find information about events that occur periodically. However, to the best of our knowledge, search engines typically display P-queries' search results in a linear page list without considering their periodic features. In this paper, we identify a distinctive characteristic of P-queries, namely multiple Periodic Burst Segments (PBS), and propose an approach to present and rank search results for P-queries based on PBS. We first introduce a method to automatically detect P-queries based on PBS detection from query frequencies recorded in Web query logs. Then, we design a novel search result presentation scheme for P-queries which organizes and visualizes search results according to the PBS. Finally, we present a ranking method for P-queries, which uses different ranking functions for inter-PBS and intra-PBS ranking. Experiments on P-query detection demonstrate that our method can identify P-queries effectively in terms of accuracy.
Article
Although the World Wide Web is a dynamic information space in which the number and content of pages continuously change over time, most conventional page ranking algorithms only take the content similarity between a query and indexed pages into account. In this paper, we address this issue by utilizing the temporal features of both queries and pages to improve ranking results. We first design a classification algorithm to determine a query's category based on the query frequency recorded in Web query logs. Then, we propose a dynamic theoretical model for page ranking, which considers the time relevance between a page's publication time and the temporal information need implicitly contained in a user query's category, the text relevance score between the page and the query, as well as the importance score of the page. Experiments demonstrate that our time-based query classification algorithm and page ranking method achieve high performance.
Article
In this paper, we present our approach for geographic personalization of a content recommendation system. More specifically, our work focuses on recommending query topics to users. We do this by mining the search query logs to detect trending local topics. For a set of queries we compute their counts and what we call buzz scores, which is a metric for detecting trending behavior. We also compute the entropy of the geographic distribution of the queries as means of detecting their location affinity. We cluster the queries into trending topics and assign the topics to their corresponding location. Human editors then select a subset of these local topics and enter them into a recommendation system. In turn the recommendation system optimizes a pool of trending local and global topics by exploiting user feedback. We present some editorial evaluation of the technique and results of a live experiment. Inclusion of local topics in selected locations into the global pool of topics resulted in more than 6% relative increase in user engagement with the recommendation system compared to using the global topics exclusively.
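Two of the signals described here, a buzz score for trending behavior and the entropy of a query's geographic distribution for location affinity, are easy to sketch. The production scoring is not specified in the abstract, so both functions below are illustrative:

```python
import math
from collections import Counter

def geo_entropy(locations):
    """Shannon entropy of a query's location distribution.

    Low entropy means the query is concentrated in few locations (local affinity)."""
    counts = Counter(locations)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def buzz_score(recent_count, baseline_count):
    """Simple surge ratio against a historical baseline (illustrative)."""
    return recent_count / max(baseline_count, 1)

local = ["austin"] * 90 + ["dallas"] * 10
national = ["ny", "la", "chicago", "houston", "phoenix"] * 20
print(geo_entropy(local), geo_entropy(national))   # ~0.47 vs ~2.32 bits
print(buzz_score(recent_count=500, baseline_count=50))  # a 10x surge
```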
Article
STL is a filtering procedure for decomposing a time series into trend, seasonal, and remainder components. STL has a simple design that consists of a sequence of applications of the loess smoother; the simplicity allows analysis of the properties of the procedure and allows fast computation, even for very long time series and large amounts of trend and seasonal smoothing. Other features of STL are specification of amounts of seasonal and trend smoothing that range, in a nearly continuous way, from a very small amount of smoothing to a very large amount; robust estimates of the trend and seasonal components that are not distorted by aberrant behavior in the data; specification of the period of the seasonal component to any integer multiple of the time sampling interval greater than one; and the ability to decompose time series with missing values.
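An implementation of STL following this paper is available off the shelf in statsmodels; a minimal usage sketch on synthetic daily query counts with weekly seasonality:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

# Synthetic daily counts: slow trend + weekly seasonality + noise.
idx = pd.date_range("2023-01-01", periods=365, freq="D")
t = np.arange(365)
y = (100 + 0.1 * t + 20 * np.sin(2 * np.pi * t / 7)
     + np.random.default_rng(2).normal(0, 3, 365))
series = pd.Series(y, index=idx)

# robust=True enables the robust loess iterations described in the paper.
res = STL(series, period=7, robust=True).fit()
print(res.trend.iloc[180], res.seasonal.iloc[180], res.resid.iloc[180])
```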
Article
Can Google queries help predict economic activity? The answer depends on what you mean by "predict." Google Trends and Google Insights for Search provide a real-time report on query volume, while economic data is typically released several days after the close of the month. Given this time lag, it is not implausible that Google queries in a category like "Automotive/Vehicle Shopping" during the first few weeks of March may help predict what actual March automotive sales will be like when the official data is released halfway through April. That famous economist Yogi Berra once said "It's tough to make predictions, especially about the future." This inspired our approach: let us lower the bar and just try to predict the present. Our work to date is summarized in a paper called Predicting the Present with Google Trends. We find that Google Trends data can help improve forecasts of the current level of activity for a number of different economic time series, including automobile sales, home sales, retail sales, and travel behavior. Even predicting the present is useful, since it may help identify "turning points" in economic time series. If people start doing significantly more searches for "Real Estate Agents" in a certain location, it is tempting to think that house sales might increase in that area in the near future. Our paper outlines one approach to short-term economic prediction, but we expect that there are several other interesting ideas out there. So we suggest that forecasting wannabes download some Google Trends data and try to relate it to other economic time series. If you find an interesting pattern, post your findings on a website and send a link to econ-forecast@google.com. We'll report on the most interesting results in a later blog post. It has been said that if you put a million monkeys in front of a million computers, you would eventually produce an accurate economic forecast. Let's see how well that theory works.
Conference Paper
We present TwitterMonitor, a system that performs trend detection over the Twitter stream. The system identifies emerging topics (i.e. 'trends') on Twitter in real time and provides meaningful analytics that synthesize an accurate description of each topic. Users interact with the system by ordering the identified trends using different criteria and submitting their own description for each trend. We discuss the motivation for trend detection over social media streams and the challenges that lie therein. We then describe our approach to trend detection, as well as the architecture of TwitterMonitor. Finally, we lay out our demonstration scenario.
Conference Paper
Recurrent event queries (REQ) constitute a special class of search queries occurring at regular, predictable time intervals. The freshness of documents ranked for such queries is generally of critical importance. REQs form a significant volume, as much as 6% of the query traffic received by search engines. In this work, we develop an improved REQ classifier that provides significant improvements in addressing this problem. We analyze REQ queries, develop novel features from multiple sources, and evaluate them using machine learning techniques. From historical query logs, we develop features utilizing query frequency, click information, and user intent dynamics within a search session. We also develop temporal features by time series analysis of query frequency. Other generated features include word matching with recurrent event seed words and the time sensitivity of the search result set. We use Naive Bayes, SVM and decision-tree-based logistic regression models to train the REQ classifier. The results on test data show that our models significantly outperform the baseline approach. Experiments on a commercial Web search engine also show significant gains in overall relevance, and thus overall user experience.
Conference Paper
MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.
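The programming model is easy to mimic in-process: a mapper emits key-value pairs, a shuffle groups them by key, and a reducer folds each group. This toy single-machine version only illustrates the model; the paper's actual contribution is the distributed, fault-tolerant runtime, which is not reproduced here:

```python
from collections import defaultdict
from itertools import chain

def map_reduce(inputs, mapper, reducer):
    """Minimal in-process illustration of the MapReduce model (no cluster)."""
    # Map phase: each input yields (key, value) pairs.
    pairs = chain.from_iterable(mapper(x) for x in inputs)
    # Shuffle phase: group values by key.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    # Reduce phase: fold each group into a result.
    return {k: reducer(k, vs) for k, vs in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
wordcount = map_reduce(docs,
                       mapper=lambda doc: [(w, 1) for w in doc.split()],
                       reducer=lambda k, vs: sum(vs))
print(wordcount)   # {'the': 3, 'fox': 2, ...}
```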
Article
Common measures of term importance in information retrieval (IR) rely on counts of term frequency; rare terms receive higher weight in document ranking than common terms receive. However, realistic scenarios yield additional information about terms in a collection. Of interest in this paper is the temporal behavior of terms as a collection changes over time. We propose capturing each term's collection frequency at discrete time intervals over the lifespan of a corpus and analyzing the resulting time series. We hypothesize that the collection frequency of a term x at time t is predictable by a linear model of the term's prior observations. On the other hand, a linear time series model for a strong discriminator's collection frequency will yield a poor fit to the data. Operationalizing this hypothesis, we induce three time-based measures of term importance and test these against state-of-the-art term weighting models.
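Operationally, this hypothesis suggests scoring a term by how poorly a linear AR model fits its collection-frequency series. A sketch of one such time-based importance score, where the model order and normalization are illustrative rather than the paper's exact measures:

```python
import numpy as np

def ar_fit_error(series, p=4):
    """Normalized residual error of an order-p linear AR fit to a term's
    collection-frequency series. High error suggests a strong discriminator."""
    s = np.asarray(series, dtype=float)
    X = np.array([s[t - p:t] for t in range(p, len(s))])
    X = np.hstack([np.ones((len(X), 1)), X])   # intercept column
    y = s[p:]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ w
    return float(np.sqrt(np.mean(resid ** 2)) / (s.std() + 1e-9))

rng = np.random.default_rng(3)
t = np.arange(120)
steady = 50 + 10 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.5, 120)  # predictable weekly term
bursty = np.zeros(120)
bursty[80:85] = 40.0                                                    # event-driven term
print(ar_fit_error(steady), ar_fit_error(bursty))   # bursty scores much higher
```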
Article
Time-series of count data occur in many different contexts, including internet navigation logs, freeway traffic monitoring, and security logs associated with buildings. In this paper we describe a framework for detecting anomalous events in such data using an unsupervised learning approach. Normal periodic behavior is modeled via a time-varying Poisson process model, which in turn is modulated by a hidden Markov process that accounts for bursty events. We outline a Bayesian framework for learning the parameters of this model from count time series. Two large real-world data sets of time series counts are used as test beds to validate the approach, consisting of freeway traffic data and logs of people entering and exiting a building. We show that the proposed model is significantly more accurate at detecting known events than a more traditional threshold-based technique. We also describe how the model can be used to investigate different degrees of periodicity in the data, including systematic day-of-week and time-of-day effects, and to make inferences about different aspects of events such as the number of vehicles or people involved. The results indicate that the Markov-modulated Poisson framework provides a robust and accurate framework for adaptively and autonomously learning how to separate unusual bursty events from traces of normal human activity.
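The full model couples a time-varying Poisson baseline with a hidden Markov process for bursts. A much-simplified version of the detection idea, flagging counts whose upper-tail probability under a (day-of-week, hour) Poisson baseline is tiny, can be sketched as follows; the Markov modulation and Bayesian learning are omitted:

```python
import math
from collections import defaultdict

def fit_baseline(observations):
    """Mean count per (day_of_week, hour) bucket: a time-varying Poisson rate."""
    sums, ns = defaultdict(float), defaultdict(int)
    for dow, hour, count in observations:
        sums[(dow, hour)] += count
        ns[(dow, hour)] += 1
    return {k: sums[k] / ns[k] for k in sums}

def poisson_tail(k, lam):
    """P[X >= k] for X ~ Poisson(lam), via the complement of the CDF."""
    if k <= 0:
        return 1.0
    pmf, cdf = math.exp(-lam), math.exp(-lam)
    for i in range(1, k):
        pmf *= lam / i
        cdf += pmf
    return max(1.0 - cdf, 0.0)

def is_bursty(dow, hour, count, baseline, alpha=1e-4):
    """Flag a count whose tail probability under the baseline is below alpha."""
    lam = baseline.get((dow, hour), 1.0)
    return poisson_tail(count, lam) < alpha

obs = [(0, 9, c) for c in (48, 52, 50, 49, 51)]   # five Mondays at 9am, ~50 entries
base = fit_baseline(obs)
print(is_bursty(0, 9, 55, base), is_bursty(0, 9, 120, base))  # False, True
```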
Book
During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It is a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting, the first comprehensive treatment of this topic in any book. This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression and path algorithms for the lasso, non-negative matrix factorization, and spectral clustering. There is also a chapter on methods for "wide" data (p bigger than n), including multiple testing and false discovery rates. Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie co-developed much of the statistical modeling software and environment in R/S-PLUS and invented principal curves and surfaces. Tibshirani proposed the lasso and is co-author of the very successful An Introduction to the Bootstrap. Friedman is the co-inventor of many data-mining tools including CART, MARS, projection pursuit and gradient boosting.
Article
Recent work has demonstrated that Web search volume can "predict the present," meaning that it can be used to accurately track outcomes such as unemployment levels, auto and home sales, and disease prevalence in near real time. Here we show that what consumers are searching for online can also predict their collective future behavior days or even weeks in advance. Specifically we use search query volume to forecast the opening weekend box-office revenue for feature films, first-month sales of video games, and the rank of songs on the Billboard Hot 100 chart, finding in all cases that search counts are highly predictive of future outcomes. We also find that search counts generally boost the performance of baseline models fit on other publicly available data, where the boost varies from modest to dramatic, depending on the application in question. Finally, we reexamine previous work on tracking flu trends and show that, perhaps surprisingly, the utility of search data relative to a simple autoregressive model is modest. We conclude that in the absence of other data sources, or where small improvements in predictive performance are material, search queries provide a useful guide to the near future.
Apache. The Apache Hadoop Project. http://wiki.apache.org/hadoop.
C. Macdonald, I. Ounis, and I. Soboroff. Overview of the TREC 2009 Blog Track. In TREC, 2009.