Conference Paper · PDF available

Diverse and Proportional Size-l Object Summaries for Keyword Search


Abstract

The abundance and ubiquity of graphs (e.g., Online Social Networks such as Google+ and Facebook; bibliographic graphs such as DBLP) necessitate their effective and efficient search. Given a set of keywords that can identify a Data Subject (DS), a recently proposed relational keyword search paradigm produces, as a query result, a set of Object Summaries (OSs). An OS is a tree structure rooted at the DS node (i.e., a tuple containing the keywords) with surrounding nodes that summarize all data held on the graph about the DS. OS snippets, denoted as size-l OSs, have also been investigated. Size-l OSs are partial OSs containing l nodes such that the summation of their importance scores results in the maximum possible total score. However, the set of nodes that maximizes the total importance score may result in an uninformative size-l OS, as very important nodes may be repeated in it, dominating other representative information. In view of this limitation, in this paper we investigate the effective and efficient generation of two novel types of OS snippets, i.e., diverse and proportional size-l OSs, denoted as DSize-l and PSize-l OSs. Namely, apart from the importance of each node, we also consider its frequency in the OS and its repetitions in the snippets. We conduct an extensive evaluation on two real graphs (DBLP and Google+). We verify effectiveness by collecting user feedback, e.g., by asking DBLP authors (i.e., the DSs themselves) to evaluate our results. In addition, we verify the efficiency of our algorithms and evaluate the quality of the snippets that they produce.
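The diversity idea in the abstract can be illustrated with a small greedy sketch. This is our own illustration, not the paper's DSize-l algorithm: each node carries an importance score and a "kind" (e.g., a relation or attribute type), and a node's score is discounted each time its kind is already represented in the snippet. All names and the penalty factor are assumptions.

```python
# Illustrative greedy sketch (not the paper's DSize-l algorithm): pick l nodes,
# discounting a node's importance for every already-chosen node of the same kind.
def diverse_size_l(nodes, l, penalty=0.5):
    """nodes: list of (node_id, kind, importance); returns l chosen node ids."""
    chosen, kind_count = [], {}
    remaining = list(nodes)
    for _ in range(min(l, len(remaining))):
        # discounted score = importance * penalty^(#chosen nodes of this kind)
        best = max(remaining,
                   key=lambda n: n[2] * penalty ** kind_count.get(n[1], 0))
        remaining.remove(best)
        chosen.append(best[0])
        kind_count[best[1]] = kind_count.get(best[1], 0) + 1
    return chosen
```

With a plain top-l choice, two highly important nodes of the same kind would crowd out everything else; the discount lets a less important but novel kind enter the snippet.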
... Diversification. Diversification of query results has attracted a lot of attention as a method for improving the quality of results by balancing relevance to the query and dissimilarity among results [9,21,22,30,53]. The motivation is that, in non-diversified search methods, users are overwhelmed with many similar answers with minor differences [37]. ...
... However, this method disregards the relevance of items to the query and thus may pick irrelevant items. In [21,55], this limitation is addressed by considering relevance in the objective function. Proportionality has also been studied in recommendation systems. ...
... both (1) the size of sets and (2) the number of sets. For instance, minhash is an approximation algorithm that detects near-duplicate web pages. Many of these algorithms are top-k (or threshold-based) and thus are designed to terminate fast by pre-processing sets (e.g., sorting or LSH (locality-sensitive hashing) [3,40]). ...
Article
More often than not, spatial objects are associated with some context, in the form of text, descriptive tags (e.g., points of interest, flickr photos), or linked entities in semantic graphs (e.g., Yago2, DBpedia). Hence, location-based retrieval should be extended to consider not only the locations but also the context of the objects, especially when the retrieved objects are too many and the query result is overwhelming. In this paper, we study the problem of selecting the most representative subset of the query result. We argue that objects with similar context and nearby locations should be proportionally represented in the selection. Proportionality dictates the pairwise comparison of all retrieved objects and hence bears a high cost. We propose novel algorithms which greatly reduce the cost of proportional object selection in practice. In addition, we propose pre-processing, pruning, and approximate computation techniques whose combination reduces the computational cost of the algorithms even further. We theoretically analyze the approximation quality of our approaches. Extensive empirical studies on real datasets show that our algorithms are effective and efficient. A user evaluation verifies that proportional selection is preferable to random selection and to selection based on object diversification.
... Our proposed proportional selection framework considers (1) the relevance of the objects to the query (i.e., spatial distance and keyword similarity), (2) contextual proportionality, and (3) spatial proportionality w.r.t. the query location. To the best of our knowledge, there is no previous work that considers all of these together in proportional selection, as Table 1 shows. Hereby, we discuss and compare related work in diversification and proportionality. ...
... Diversification. Diversification of query results has attracted a lot of attention as a method for improving the quality of results by balancing relevance to the query and dissimilarity among results [8,19,20,25,45]. The motivation is that, in non-diversified search methods, users are overwhelmed with many similar answers with minor differences [31]. ...
... Diversification. Diversification of query results has attracted a lot of attention recently as a method for improving the quality of results by balancing similarity (relevance) to a query q and dissimilarity among results [12,24,25,30,52]. Diversification has also been considered in keyword search over graphs and databases, where the result is usually a subgraph that contains the set of query keywords. ...
Article
Full-text available
The abundance and ubiquity of RDF data (such as DBpedia and YAGO2) necessitate their effective and efficient retrieval. For this purpose, keyword search paradigms liberate users from understanding the RDF schema and the SPARQL query language. Popular RDF knowledge bases (e.g., YAGO2) also include spatial semantics that enable location-based search. In an earlier location-based keyword search paradigm, the user inputs a set of keywords, a query location, and a number of RDF spatial entities to be retrieved. The output entities should be geographically close to the query location and relevant to the query keywords. However, the results can be similar to each other, compromising query effectiveness. In view of this limitation, we integrate textual and spatial diversification into RDF spatial keyword search, facilitating the retrieval of entities with diverse characteristics and directions with respect to the query location. Since finding the optimal set of query results is NP-hard, we propose two approximate algorithms with guaranteed quality. Extensive empirical studies on two real datasets show that the algorithms only add insignificant overhead compared to non-diversified search, while returning results of high quality in practice (which is verified by a user evaluation study we conducted).
... The link analysis algorithms have been widely used in keyword search (i.e., [2], [8], [9], [12]). As one of the most classical link analysis algorithms, the PageRank algorithm calculates the PageRank value of each webpage and ranks the webpages accordingly. ...
Article
Full-text available
Optimal path planning is one of the hot spots in intelligent transportation and geographic information systems research. There are many products and applications in path planning and navigation; however, due to the complexity of urban road networks, the difficulty of traffic prediction increases. The optimal path means not only the shortest geographic distance, but also the shortest time, the lowest cost, the maximum road capacity, etc. In fast-paced modern cities, people tend to reach the destination in the shortest time; the corresponding paths are considered the optimal paths. However, due to the high data sensing speed of GPS devices, it is difficult to collect or describe real traffic flows. To address this problem, we propose an innovative path planning method in this paper. Specifically, we first introduce a crossroad link analysis algorithm to calculate the real-time traffic conditions of crossroads (i.e., the CrossRank values). Then, we adopt a CrossRank-value-based A-Star for path planning that considers the real-time traffic conditions. To avoid high-volume updates of CrossRank values, an R-Tree structure is proposed to dynamically update local CrossRank values from the multi-level subareas. In the optimization process, to achieve the desired navigation results, we establish a traffic congestion coefficient to reflect different traffic congestion conditions. To verify the effectiveness of the proposed method, we use actual traffic data from Beijing. The experimental results show that our method is able to generate an appropriate path plan in both peak and low dynamic traffic conditions as compared to online applications.
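The routing step described above can be sketched with a plain A* search whose edge costs are scaled by a congestion factor. This is our simplified stand-in for the CrossRank-weighted costs; the toy graph, the zero heuristic, and all names below are assumptions, not the paper's implementation.

```python
import heapq

# Sketch only: A* over a toy road graph; each edge's base cost is multiplied by
# a hypothetical congestion factor (standing in for CrossRank-derived weights).
def a_star(graph, congestion, start, goal, h):
    """graph: {node: [(neighbor, base_cost), ...]}; congestion: {(u, v): factor}."""
    frontier = [(h(start), 0.0, start, [start])]   # entries are (f, g, node, path)
    best_g = {}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if node in best_g and best_g[node] <= g:
            continue                                # already reached more cheaply
        best_g[node] = g
        for nbr, cost in graph.get(node, []):
            ng = g + cost * congestion.get((node, nbr), 1.0)
            heapq.heappush(frontier, (ng + h(nbr), ng, nbr, path + [nbr]))
    return None
```

With congestion placed on one edge, the nominally shorter route loses to a detour, which is the behavior the abstract describes for peak traffic.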
Article
Data summarization, which presents a small subset of a dataset to users, has been widely applied in numerous applications and systems. Many datasets are coded with hierarchical terminologies, e.g., gene ontology and disease ontology, to name a few. In this paper, we study weighted tree summarization. We motivate and formulate our kWTS-problem as selecting a diverse set of k nodes to summarize a hierarchical tree T with weighted terminologies. We first propose an efficient greedy tree summarization algorithm, GTS. It solves the problem with a (1-1/e)-approximation guarantee. Although GTS achieves approximately quality-guaranteed answers, it is still not optimal. To tackle the problem optimally, we further develop a dynamic programming algorithm, OTS, to obtain optimal answers for the kWTS-problem in O(nhk^3) time, where n and h are the number of nodes and the height of tree T, respectively. The complexity and correctness of OTS are theoretically analyzed. In addition, we propose a useful tree-reduction optimization technique that removes useless nodes with zero weights and shrinks the tree into a smaller one, which accelerates both GTS and OTS on real-world datasets. Moreover, we illustrate a useful application of graph visualization based on the answer of k-sized tree summarization and show it in a novel case study.
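The greedy idea behind such (1-1/e)-style guarantees can be sketched as a submodular-flavored selection. This is our own construction, not the paper's GTS implementation: we assume a selected node "covers" itself and all of its ancestors, and we repeatedly add the node with the largest uncovered-weight gain. The coverage model and all names are assumptions.

```python
# Greedy sketch: each chosen node covers itself and its ancestors; at every step,
# add the node whose path to the root contributes the most uncovered weight.
def greedy_tree_summary(parent, weight, k):
    """parent: {node: parent or None}; weight: {node: w}; returns k chosen nodes."""
    def path_to_root(v):
        out = []
        while v is not None:
            out.append(v)
            v = parent[v]
        return out
    covered, chosen = set(), []
    for _ in range(k):
        best, best_gain = None, -1.0
        for v in parent:
            gain = sum(weight[u] for u in path_to_root(v) if u not in covered)
            if gain > best_gain:
                best, best_gain = v, gain
        chosen.append(best)
        covered.update(path_to_root(best))
    return chosen
```

Because marginal gains only shrink as more nodes are covered, this kind of greedy coverage objective is exactly where (1-1/e) guarantees classically arise.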
Article
Some queries retrieve points of interest (POIs) considering features in their neighborhood. The POIs are ranked in terms of the information needs defined by the user. However, the information provided by users is usually not enough to distinguish the best POIs. By analyzing the distance between features and POIs, we propose a novel ranking function that extrapolates the information provided by the user. Recent findings suggest that a user's interest in a POI increases as the distance between the POI and the feature decreases. We demonstrate the suitability of the Pareto distribution for modeling this interest in the ranking and propose two algorithms to search for POIs with the novel ranking function. Extensive experiments show that our method can boost ranking accuracy in comparison to top-ranked methods. The proposed ranking function achieves an average NDCG gain of 8.66% and a Tau gain of 7.83% in comparison with state-of-the-art ranking functions on real-world datasets.
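The Pareto-shaped interest can be illustrated with a toy scoring function. This is our construction, not the paper's ranking function; the scale parameter m, the shape a, and the clamping below are all assumptions. The point it shows is that interest follows the Pareto density, so a feature's contribution falls off polynomially with distance and nearby features dominate.

```python
# Toy sketch: interest follows the Pareto density f(d) = a * m**a / d**(a+1),
# clamped at the scale parameter m, so closer features contribute far more.
def pareto_interest(distance, m=0.1, a=2.0):
    d = max(distance, m)   # the density is defined for d >= m
    return (a * m ** a) / d ** (a + 1)

def poi_score(feature_distances, **kw):
    """Score a POI by summing interest over its distances to nearby features."""
    return sum(pareto_interest(d, **kw) for d in feature_distances)
```

Under this model, one very close feature outweighs several moderately close ones, matching the cited finding that interest rises sharply as distance shrinks.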
Article
A hierarchical directed acyclic graph (HDAG) is an essential graph model for representing terminology relationships in a hierarchy, such as Disease Ontology, Gene Ontology, and Wikipedia. However, due to the massive number of terminologies and the complex structures in an HDAG, an end user might find it difficult to explore and summarize the whole graph, which is practically useful but less studied in the literature. In this demo, we develop an interactive system, HDAG-Explorer, to help users summarize an HDAG with highly important and diverse vertices. Our HDAG-Explorer system exhibits several useful features, including summarized visualization, interactive exploration, and structural statistics reports. All these features facilitate in-depth understanding of HDAG data. We showcase the usability of HDAG-Explorer through two real-world applications: summarized topic recommendation and visual data exploration.
Article
Many real-world data sets are modeled as entity relationship graphs or heterogeneous information networks. In these graphs, nodes represent entities and edges mimic relationships. ObjectRank extends the well-known PageRank authority flow–based ranking method to entity relationship graphs using an authority flow weight vector (W). The vector W assigns a different authority flow–based importance (weight) to each edge type based on domain knowledge or personalization. In this paper, our contribution is a framework for Learning to Rank in entity relationship graphs to learn W, in the context of authority flow. We show that the problem is similar to learning a recursive scoring function. We present a two-phase iterative solution and multiple variants of learning. In pointwise learning, we learn W, and hence the scoring function, from the scores of a sample of nodes. In pairwise learning, we learn W from given preferences for pairs of nodes. To demonstrate our contribution in a real setting, we apply our framework to learn the rank, with high accuracy, for a real-world challenge of predicting future citations in a bibliographic archive—that is, the FutureRank score. Our extensive experiments show that with a small amount of training data, and a limited number of iterations, our Learning to Rank approach learns W with high accuracy. Learning works well with pairwise training data in large graphs.
Article
Full-text available
The Object Summary (OS) is a recently proposed tree structure, which summarizes all data held in a relational database about a data subject. An OS can potentially be very large in size and therefore unfriendly for users who wish to view synoptic information about the data subject. In this paper, we investigate the effective and efficient retrieval of concise and informative OS snippets (denoted as size-l OSs). We propose and investigate the effectiveness of two types of size-l OSs, namely size-l OS(t)s and size-l OS(a)s, which consist of l tuple nodes and l attribute nodes respectively. For computing size-l OSs, we propose an optimal dynamic programming algorithm, two greedy algorithms and preprocessing heuristics. By collecting feedback from real users (e.g., from DBLP authors), we assess the relative usability of the two different types of snippets, the choice of the size-l parameter, as well as the effectiveness of the snippets with respect to user expectations. In addition, via thorough evaluation on real databases, we test the speed and effectiveness of our techniques.
Article
In this paper, we summarize our work on diversification based on dissimilarity and coverage (DisC diversity) by presenting our main theoretical results and contributions.
Article
Recently, result diversification has attracted a lot of attention as a means to improve the quality of results retrieved by user queries. In this paper, we propose a new, intuitive definition of diversity called DisC diversity. A DisC diverse subset of a query result contains objects such that each object in the result is represented by a similar object in the diverse subset and the objects in the diverse subset are dissimilar to each other. We show that locating a minimum DisC diverse subset is an NP-hard problem and provide heuristics for its approximation. We also propose adapting DisC diverse subsets to a different degree of diversification. We call this operation zooming. We present efficient implementations of our algorithms based on the M-tree, a spatial index structure, and experimentally evaluate their performance.
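A drastically simplified sketch of the DisC requirement can be given in one dimension. This is our own greedy, not the paper's M-tree-based algorithms: every object must lie within radius r of some chosen representative (coverage), while chosen representatives are pairwise more than r apart (dissimilarity).

```python
# Simplified 1-D sketch of a DisC-style subset: sweep points in order; the
# smallest uncovered point becomes a representative, and every point within
# radius r of it is then covered.
def disc_diverse(points, r):
    uncovered = sorted(points)
    chosen = []
    while uncovered:
        p = uncovered[0]
        chosen.append(p)
        uncovered = [q for q in uncovered if abs(q - p) > r]
    return chosen
```

In 1-D this sweep happens to be exact; the paper's NP-hardness result concerns general distance spaces, where no such ordering exists.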
Article
Online Social Networks (OSNs) allow users to create and share content (e.g., posts, status updates, comments) in real time. These activities are collected in an activity log (e.g., Facebook Wall, Google+ Stream) on the user's social network profile. With time, the activity logs of users, which record the sequences of social activities, become too long and consequently hard to view and navigate. To alleviate this clutter, it is useful to select a small subset of the social activities within a specified time period as representative, i.e., as a summary, of this time period. In this paper, we study the novel problem of social activity log summarization. We propose LogRank, a novel and principled algorithm to select activities that satisfy three desirable criteria: first, activities must be important for the user; second, they must be diverse in terms of topic, e.g., cover several of the major topics in the activity log; third, they should be time-dispersed, that is, spread across the specified time range of the activity log. LogRank operates on an appropriately augmented social interaction graph and employs random-walk techniques to holistically balance all three criteria. We evaluate LogRank and its variants on a real dataset from the Google+ social network and show that they outperform baseline approaches.
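The random-walk machinery such methods build on can be sketched with a plain PageRank power iteration. This is generic PageRank, not LogRank: the augmented interaction graph and the three criteria from the abstract are not modeled here, and the damping value is the conventional 0.85, an assumption.

```python
# Plain PageRank power iteration over an adjacency-list graph.
def pagerank(adj, damping=0.85, iters=100):
    """adj: {node: [out-neighbors]}; returns {node: score}, scores sum to 1."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: (1 - damping) / n for v in nodes}
        for v in nodes:
            outs = adj[v] or nodes          # dangling node: spread to all nodes
            share = damping * rank[v] / len(outs)
            for u in outs:
                nxt[u] += share
        rank = nxt
    return rank
```

On a small graph the hub that everything links to accumulates the highest score, which is the property a LogRank-style method exploits to find important activities.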
Conference Paper
We study the problem of answering ambiguous web queries in a setting where there exists a taxonomy of information, and that both queries and documents may belong to more than one category according to this taxonomy. We present a systematic approach to diversifying results that aims to minimize the risk of dissatisfaction of the average user. We propose an algorithm that well approximates this objective in general, and is provably optimal for a natural special case. Furthermore, we generalize several classical IR metrics, including NDCG, MRR, and MAP, to explicitly account for the value of diversification. We demonstrate empirically that our algorithm scores higher in these generalized metrics compared to results produced by commercial search engines.
Conference Paper
In this paper we describe a general framework for evaluation and optimization of methods for diversifying query results. In these methods, an initial candidate set produced by a query is used to construct a result set, where elements are ranked with respect to relevance and diversity features, i.e., the retrieved elements should be as relevant as possible to the query, and, at the same time, the result set should be as diverse as possible. While addressing relevance is relatively simple and has been heavily studied, diversity is a harder problem to solve. One major contribution of this paper is that, using the above framework, we adapt, implement and evaluate several existing methods for diversifying query results. We also propose two new approaches, namely the Greedy with Marginal Contribution (GMC) and the Greedy Randomized with Neighborhood Expansion (GNE) methods. Another major contribution of this paper is that we present the first thorough experimental evaluation of the various diversification techniques implemented in a common framework. We examine the methods' performance with respect to precision, running time and quality of the result. Our experimental results show that while the proposed methods have higher running times, they achieve precision very close to the optimal, while also providing the best result quality. While GMC is deterministic, the randomized approach (GNE) can achieve better result quality if the user is willing to trade off running time.
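The greedy marginal-contribution flavor can be sketched compactly. This is our simplification, not the exact GMC or GNE procedures: each step adds the item with the best weighted trade-off between its relevance and its dissimilarity to the items already selected. The parameter lam and the toy distance function are assumptions.

```python
# Greedy relevance/diversity trade-off sketch: an item's score is
# lam * relevance + (1 - lam) * distance to the nearest already-selected item.
def greedy_diversify(items, rel, dist, k, lam=0.7):
    """items: ids; rel: {id: relevance}; dist(a, b): dissimilarity in [0, 1]."""
    selected, pool = [], set(items)
    while pool and len(selected) < k:
        def score(i):
            div = min((dist(i, j) for j in selected), default=1.0)
            return lam * rel[i] + (1 - lam) * div
        best = max(sorted(pool), key=score)   # sorted() for deterministic ties
        pool.remove(best)
        selected.append(best)
    return selected
```

A near-duplicate of an already-selected item gets a diversity score of zero, so a less relevant but distinct item can overtake it, which is the redundancy-avoidance behavior these methods aim for.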
Article
Given an entity represented by a single node q in a semantic knowledge graph D, the Graphical Entity Summarisation problem (GES) consists in selecting out of D a very small surrounding graph S that constitutes a generic summary of the information concerning the entity q, with a given limit on the size of S. This article concerns the role of diversity in this quite novel problem. It gives an overview of the diversity concept in information retrieval and proposes how to adapt it to GES. A measure of diversity for GES, called ALC, is defined and two algorithms presented: the baseline, diversity-oblivious PRECIS and the diversity-aware DIVERSUM. A reported experiment shows that DIVERSUM actually achieves higher values of the ALC diversity measure than PRECIS. Next, an objective evaluation experiment demonstrates that the diversity-aware algorithm is superior to the diversity-oblivious one in terms of fact selection. More precisely, DIVERSUM clearly achieves higher recall than PRECIS on ground-truth reference entity summaries extracted from Wikipedia. We also report another intrinsic experiment, in which the output of the diversity-aware algorithm is significantly preferred by human expert evaluators. Importantly, the user feedback clearly indicates that the notion of diversity is the key reason for the preference. In addition, the experiment is repeated twice on an anonymous sample of a broad population of Internet users by means of a crowd-sourcing platform, which further confirms the results mentioned above.
Article
This paper presents a different perspective on diversity in search results: diversity by proportionality. We consider a result list most diverse, with respect to some set of topics related to the query, when the number of documents it provides on each topic is proportional to the topic's popularity. Consequently, we propose a framework for optimizing proportionality for search result diversification, which is motivated by the problem of assigning seats to members of competing political parties. Our technique iteratively determines, for each position in the result ranked list, the topic that best maintains the overall proportionality. It then selects the best document on this topic for this position. We demonstrate empirically that our method significantly outperforms the top performing approach in the literature not only on our proposed metric for proportionality, but also on several standard diversity measures. This result indicates that promoting proportionality naturally leads to minimal redundancy, which is a goal of the current diversity approaches.
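The seat-assignment analogy invites a concrete sketch: the Sainte-Laguë apportionment rule below allocates k result slots to topics in proportion to topic popularity. This mirrors the spirit of the paper's iterative procedure but is not its exact method, and the data is made up.

```python
# Sainte-Laguë apportionment: repeatedly award the next slot to the topic with
# the largest quotient votes / (2 * slots_won + 1).
def allocate_slots(popularity, k):
    """popularity: {topic: votes}; returns {topic: number of result slots}."""
    slots = {t: 0 for t in popularity}
    for _ in range(k):
        winner = max(popularity, key=lambda t: popularity[t] / (2 * slots[t] + 1))
        slots[winner] += 1
    return slots
```

After the slots are apportioned, each topic's positions would be filled with its best documents, so a topic twice as popular contributes roughly twice as many results, which is the proportionality the paper argues naturally minimizes redundancy.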