Conference Paper

Diverse and Proportional Size-l Object Summaries for Keyword Search


Abstract

The abundance and ubiquity of graphs (e.g., Online Social Networks such as Google+ and Facebook; bibliographic graphs such as DBLP) necessitate effective and efficient search over them. Given a set of keywords that can identify a Data Subject (DS), a recently proposed relational keyword search paradigm produces, as a query result, a set of Object Summaries (OSs). An OS is a tree structure rooted at the DS node (i.e., a tuple containing the keywords) with surrounding nodes that summarize all data held in the graph about the DS. OS snippets, denoted as size-l OSs, have also been investigated. Size-l OSs are partial OSs containing l nodes such that the sum of their importance scores is the maximum possible total score. However, the set of nodes that maximizes the total importance score may result in an uninformative size-l OS, as very important nodes may be repeated in it, dominating other representative information. In view of this limitation, in this paper we investigate the effective and efficient generation of two novel types of OS snippets, i.e., diverse and proportional size-l OSs, denoted as DSize-l and PSize-l OSs. Namely, apart from the importance of each node, we also consider its frequency in the OS and its repetitions in the snippets. We conduct an extensive evaluation on two real graphs (DBLP and Google+). We verify effectiveness by collecting user feedback, e.g., by asking DBLP authors (i.e., the DSs themselves) to evaluate our results. In addition, we verify the efficiency of our algorithms and evaluate the quality of the snippets that they produce.
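To make the trade-off concrete, here is a minimal greedy sketch of diverse size-l selection. It assumes the OS has been flattened into (label, importance) pairs and uses an illustrative geometric discount for repeated labels; the paper's actual scoring and algorithms are not reproduced here.

```python
# Illustrative greedy selection of a diverse size-l snippet: repeated
# occurrences of the same label contribute geometrically discounted
# importance, so a dominant label stops crowding out other nodes.

def greedy_diverse_size_l(nodes, l, discount=0.5):
    """nodes: list of (label, importance) pairs; returns l selected nodes."""
    selected, label_counts = [], {}
    candidates = list(nodes)
    for _ in range(min(l, len(candidates))):
        def gain(node):
            label, importance = node
            # The k-th repetition of a label is worth discount**k of its score.
            return importance * discount ** label_counts.get(label, 0)
        best = max(candidates, key=gain)
        candidates.remove(best)
        selected.append(best)
        label_counts[best[0]] = label_counts.get(best[0], 0) + 1
    return selected

# A dominant "paper" label no longer monopolizes the 3-node snippet:
os_nodes = [("paper", 0.9), ("paper", 0.85), ("paper", 0.8),
            ("coauthor", 0.6), ("venue", 0.5)]
print(greedy_diverse_size_l(os_nodes, 3))
```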
... Our proposed proportional selection framework considers (1) the relevance of the objects to the query (i.e., spatial distance and keyword similarity), (2) contextual proportionality, and (3) spatial proportionality w.r.t. the query location. To the best of our knowledge, there is no previous work that considers all of these together in proportional selection, as Table 1 shows. Hereby, we discuss and compare related work in diversification and proportionality. ...
... Diversification. Diversification of query results has attracted a lot of attention as a method for improving the quality of results by balancing relevance to the query and dissimilarity among results [8,19,20,25,45]. The motivation is that, in non-diversified search methods, users are overwhelmed with many similar answers with minor differences [31]. ...
... Diversification. Diversification of query results has attracted a lot of attention recently as a method for improving the quality of results by balancing similarity (relevance) to a query q and dissimilarity among results [12,24,25,30,52]. Diversification has also been considered in keyword search over graphs and databases, where the result is usually a subgraph that contains the set of query keywords. ...
Article
Full-text available
The abundance and ubiquity of RDF data (such as DBpedia and YAGO2) necessitate their effective and efficient retrieval. For this purpose, keyword search paradigms liberate users from understanding the RDF schema and the SPARQL query language. Popular RDF knowledge bases (e.g., YAGO2) also include spatial semantics that enable location-based search. In an earlier location-based keyword search paradigm, the user inputs a set of keywords, a query location, and a number of RDF spatial entities to be retrieved. The output entities should be geographically close to the query location and relevant to the query keywords. However, the results can be similar to each other, compromising query effectiveness. In view of this limitation, we integrate textual and spatial diversification into RDF spatial keyword search, facilitating the retrieval of entities with diverse characteristics and directions with respect to the query location. Since finding the optimal set of query results is NP-hard, we propose two approximate algorithms with guaranteed quality. Extensive empirical studies on two real datasets show that the algorithms only add insignificant overhead compared to non-diversified search, while returning results of high quality in practice (which is verified by a user evaluation study we conducted).
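As a rough illustration of how such approximate diversified retrieval typically proceeds, the following MMR-style greedy sketch balances relevance against dissimilarity to already-selected results. The trade-off parameter lam and the two scoring functions are assumptions, not the paper's exact definitions or its two algorithms.

```python
# MMR-style greedy diversified top-k (a generic stand-in for the
# paper's approximation algorithms): each step picks the entity that
# best trades off query relevance against dissimilarity to results
# already selected.

def diversified_top_k(entities, relevance, dissimilarity, k, lam=0.5):
    """relevance: list of scores aligned with entities;
    dissimilarity: function over two entities, higher = more different."""
    selected = []                      # indices of chosen entities
    pool = set(range(len(entities)))
    while pool and len(selected) < k:
        def score(i):
            # Distance to the closest already-chosen result.
            div = min((dissimilarity(entities[i], entities[j]) for j in selected),
                      default=1.0)
            return lam * relevance[i] + (1 - lam) * div
        best = max(pool, key=score)
        pool.remove(best)
        selected.append(best)
    return [entities[i] for i in selected]
```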
... The link analysis algorithms have been widely used in keyword search (e.g., [2], [8], [9], [12]). As one of the most classical link analysis algorithms, the PageRank algorithm calculates the PageRank value of each webpage and ranks ...
Article
Full-text available
Optimal path planning is one of the hot spots in the research of intelligent transportation and geographic information systems. There are many products and applications in path planning and navigation; however, due to the complexity of urban road networks, traffic prediction is increasingly difficult. The optimal path means not only the shortest geographic distance, but also the shortest time, the lowest cost, the maximum road capacity, etc. In fast-paced modern cities, people tend to want to reach the destination in the shortest time, and the corresponding paths are considered the optimal paths. However, due to the high data sensing speed of GPS devices, it is difficult to collect or describe real traffic flows. To address this problem, we propose an innovative path planning method in this paper. Specifically, we first introduce a crossroad link analysis algorithm to calculate the real-time traffic conditions of crossroads (i.e., the CrossRank values). Then, we adopt a CrossRank-value-based A-Star algorithm for path planning that considers real-time traffic conditions. To avoid high-volume updates of CrossRank values, an R-tree structure is proposed to dynamically update local CrossRank values from the multi-level subareas. In the optimization process, to achieve the desired navigation results, we establish a traffic congestion coefficient to reflect different traffic congestion conditions. To verify the effectiveness of the proposed method, we use actual traffic data from Beijing. The experimental results show that our method is able to generate appropriate path plans under both peak and low dynamic traffic conditions, as compared to online applications.
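The pipeline described above can be pictured with a standard A* search whose edge costs are inflated by a congestion coefficient, as in the sketch below. The graph encoding, heuristic, and congestion function are illustrative assumptions; the CrossRank computation and the R-tree update scheme are omitted.

```python
import heapq

# A* path search where each edge's base cost is scaled by a congestion
# coefficient (e.g., derived from CrossRank-like values at crossroads).

def a_star(graph, congestion, heuristic, start, goal):
    """graph: {node: [(neighbor, base_cost), ...]};
    congestion(u, v) >= 1 inflates the cost of congested edges."""
    frontier = [(heuristic(start, goal), 0.0, start, [start])]
    visited = set()
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, cost
        if node in visited:
            continue
        visited.add(node)
        for nxt, base in graph.get(node, []):
            if nxt not in visited:
                new_cost = cost + base * congestion(node, nxt)
                heapq.heappush(frontier, (new_cost + heuristic(nxt, goal),
                                          new_cost, nxt, path + [nxt]))
    return None, float("inf")
```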
... A related research problem is entity summarization [12], namely to extract a subset of triples from the description of an entity in a dataset. The quality of an entity summary depends on whether it selects the most distinctive features of an entity [12,19,20,23], and further, whether it can help users quickly distinguish one entity from another [13,14,30]. ...
Article
Triple-structured open data creates value in many ways. However, the reuse of datasets is still challenging: users find it difficult to assess the usefulness of a large dataset containing thousands or millions of triples. To satisfy these needs, existing abstractive methods produce a concise high-level abstraction of data. Complementary to that, we adopt the extractive strategy and aim to select the optimum small subset of data from a dataset as a snippet that compactly illustrates the content of the dataset. This has been formulated as a combinatorial optimization problem in our previous work. In this article, we design a new algorithm for the problem, which is an order of magnitude faster than the previous one but has the same approximation ratio. We also develop an anytime algorithm that can generate empirically better solutions using additional time. To suit datasets that are partially accessible via online query services (e.g., SPARQL endpoints for RDF data), we adapt our algorithms to trade off snippet quality for feasibility and efficiency in the Web environment. We carry out extensive experiments based on real RDF datasets and SPARQL endpoints to evaluate quality and running time. The results demonstrate the effectiveness and practicality of our proposed algorithms.
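A generic way to picture extractive snippet selection is a budgeted greedy that maximizes marginal coverage, as sketched below. The coverage objective is an illustrative stand-in; the paper's actual optimization problem, approximation analysis, and anytime variant differ.

```python
# Budgeted greedy snippet extraction: repeatedly add the triple that
# covers the most not-yet-covered terms (subjects, properties, objects).

def greedy_snippet(triples, budget):
    covered, snippet = set(), []
    remaining = list(triples)
    while remaining and len(snippet) < budget:
        best = max(remaining, key=lambda t: len(set(t) - covered))
        if not set(best) - covered:
            break  # nothing left adds new coverage
        remaining.remove(best)
        snippet.append(best)
        covered |= set(best)
    return snippet

data = [("alice", "knows", "bob"), ("alice", "worksAt", "acme"),
        ("bob", "knows", "carol"), ("alice", "knows", "carol")]
print(greedy_snippet(data, 2))  # two triples maximizing term coverage
```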
... In Precis [5], weights added to tables and attributes in an offline phase by an expert enable the system to decide which extra features should be added to the answer. Fakas [18] assigns importance weights to each table based on a formula that takes only structural features, such as relative cardinality, schema connectivity, and distance between tables, into consideration. We are going to improve state-of-the-art expansion methodologies by making them smarter, content-based, and query-driven. ...
Article
Full-text available
Recently, there has been significant growth in keyword search due to its wide range of use cases in people's everyday lives. While keyword search has been applied to different kinds of data, ambiguity always exists no matter what data the query is asked over. Generally, when users submit a query they need an exact answer that perfectly meets their needs, rather than having to wonder about the different possible answers retrieved by the system. To achieve this, search systems need a disambiguation functionality that can efficiently filter and rank all possible answers to a query before showing them to the user. In this paper, we describe how we improve the state of the art in various stages of a keyword-search pipeline in order to retrieve the answers that best match the user's intent.
Article
Some queries retrieve points of interest (POIs) considering features in their neighborhood. The POIs are ranked in terms of the information needs defined by the user. However, the information provided by the user is usually not enough to distinguish the best POIs. Analyzing the distance between features and POIs, we propose a novel ranking function that extrapolates the information provided by the user. Recent findings suggest that the user's interest in a POI increases as the distance between the POI and the feature decreases. We demonstrate the suitability of the Pareto distribution for modeling this interest in the ranking and propose two algorithms to search for POIs with the novel ranking function. Extensive experiments show that our method can boost ranking accuracy in comparison to top-ranked methods. The proposed ranking function achieves an average NDCG gain of 8.66% and a Tau gain of 7.83% over state-of-the-art ranking functions on real-world datasets.
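The distance-decay idea can be sketched with the Pareto density as a proximity weight, as below. The scale x_m and shape alpha values are assumptions, and the aggregation over features is simplified relative to the paper's ranking function.

```python
# Pareto-density proximity weight: interest decays polynomially with
# the POI-to-feature distance; summing over relevant features gives a
# simple ranking score.

def pareto_weight(distance, x_m=1.0, alpha=2.0):
    """Pareto pdf used as a proximity weight; distances below x_m saturate."""
    d = max(distance, x_m)
    return alpha * x_m ** alpha / d ** (alpha + 1)

def rank_pois(pois, feature_distances):
    """feature_distances: {poi: [distance to each relevant feature]}."""
    scores = {p: sum(pareto_weight(d) for d in feature_distances[p]) for p in pois}
    return sorted(pois, key=scores.get, reverse=True)
```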
Article
Many real-world data sets are modeled as entity relationship graphs or heterogeneous information networks. In these graphs, nodes represent entities and edges mimic relationships. ObjectRank extends the well-known PageRank authority flow–based ranking method to entity relationship graphs using an authority flow weight vector (W). The vector W assigns a different authority flow–based importance (weight) to each edge type based on domain knowledge or personalization. In this paper, our contribution is a framework for Learning to Rank in entity relationship graphs to learn W, in the context of authority flow. We show that the problem is similar to learning a recursive scoring function. We present a two-phase iterative solution and multiple variants of learning. In pointwise learning, we learn W, and hence the scoring function, from the scores of a sample of nodes. In pairwise learning, we learn W from given preferences for pairs of nodes. To demonstrate our contribution in a real setting, we apply our framework to learn the rank, with high accuracy, for a real-world challenge of predicting future citations in a bibliographic archive—that is, the FutureRank score. Our extensive experiments show that with a small amount of training data, and a limited number of iterations, our Learning to Rank approach learns W with high accuracy. Learning works well with pairwise training data in large graphs.
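A compact way to see authority flow with a learned weight vector W is the power-iteration sketch below: scores propagate along typed edges, each type scaled by W[edge_type]. The damping factor, iteration count, and handling of dangling nodes are illustrative simplifications, and the learning of W itself is not shown.

```python
# Power-iteration authority flow over typed edges: each edge type
# carries a learned weight W[etype], and a node's outgoing score is
# split in proportion to those weights (dangling mass is dropped here).

def authority_flow(nodes, edges, W, damping=0.85, iters=50):
    """edges: list of (src, dst, edge_type); returns {node: score}."""
    score = {n: 1.0 / len(nodes) for n in nodes}
    out_weight = {n: 0.0 for n in nodes}
    for src, _, etype in edges:
        out_weight[src] += W[etype]
    for _ in range(iters):
        nxt = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, dst, etype in edges:
            if out_weight[src] > 0:
                nxt[dst] += damping * score[src] * W[etype] / out_weight[src]
        score = nxt
    return score
```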
Article
With the development of society and the economy, urban rail transit has become an important component of urban transportation systems, and the construction of urban rail greatly improves public transportation. Currently, much research focuses on passenger flow prediction from historical data; however, such historical volumes are unavailable for stations that are newly planned or under construction, which makes it hard to support the corresponding transport models. In view of this limitation, we provide a novel method for urban rail station characteristics analysis in intelligent transportation that considers city land usage. First, points of interest (POIs) are divided by the proposed RC-tree (Colored R-tree)-based algorithm into bounded areas for each station. Second, Diversity and Proportion approaches are proposed to extract the top-k POIs from the bounded areas based on their semantic and spatial characteristics. Then, stations are classified based on the similarity of the extracted top-k POIs. Moreover, we conduct a case study on a real dataset, including a large volume of Automatic Fare Collection (AFC) system records, for the experimental evaluation; the results show that the proposed method can verify the rationality of land use and support the application of transportation model technology.
Article
Full-text available
The Object Summary (OS) is a recently proposed tree structure, which summarizes all data held in a relational database about a data subject. An OS can potentially be very large in size and therefore unfriendly for users who wish to view synoptic information about the data subject. In this paper, we investigate the effective and efficient retrieval of concise and informative OS snippets (denoted as size-$l$ OSs). We propose and investigate the effectiveness of two types of size-$l$ OSs, namely size-$l$ OS$(t)$s and size-$l$ OS$(a)$s, which consist of $l$ tuple nodes and $l$ attribute nodes respectively. For computing size-$l$ OSs, we propose an optimal dynamic programming algorithm, two greedy algorithms, and preprocessing heuristics. By collecting feedback from real users (e.g., from DBLP authors), we assess the relative usability of the two different types of snippets, the choice of the size-$l$ parameter, as well as the effectiveness of the snippets with respect to user expectations. In addition, via thorough evaluation on real databases, we test the speed and effectiveness of our techniques.
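The optimal computation can be pictured as a tree-knapsack dynamic program: choose an l-node subtree that contains the root and maximizes total importance. The sketch below is the textbook form of that DP, shown only to convey the idea; the paper's optimized algorithm and greedy heuristics differ in detail.

```python
# Tree-knapsack DP: best[k] is the maximum total importance of a k-node
# subtree rooted at the current node (the node itself is always included).

def best_rooted_subtree(tree, score, root, l):
    """tree: {node: [children]}; returns the best total score of an
    l-node subtree of `tree` that contains `root`. Assumes l >= 1."""
    def dp(node):
        best = [float("-inf")] * (l + 1)
        best[1] = score[node]
        for child in tree.get(node, []):
            child_best = dp(child)
            merged = best[:]  # taking 0 nodes from this child is allowed
            for k in range(1, l + 1):
                if best[k] == float("-inf"):
                    continue
                for c in range(1, l - k + 1):
                    if child_best[c] != float("-inf"):
                        merged[k + c] = max(merged[k + c], best[k] + child_best[c])
            best = merged
        return best
    return dp(root)[l]
```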
Article
In this paper, we summarize our work on diversification based on dissimilarity and coverage (DisC diversity) by presenting our main theoretical results and contributions.
Article
Recently, result diversification has attracted a lot of attention as a means to improve the quality of results retrieved by user queries. In this paper, we propose a new, intuitive definition of diversity called DisC diversity. A DisC diverse subset of a query result contains objects such that each object in the result is represented by a similar object in the diverse subset and the objects in the diverse subset are dissimilar to each other. We show that locating a minimum DisC diverse subset is an NP-hard problem and provide heuristics for its approximation. We also propose adapting DisC diverse subsets to a different degree of diversification. We call this operation zooming. We present efficient implementations of our algorithms based on the M-tree, a spatial index structure, and experimentally evaluate their performance.
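A simplified greedy heuristic in the spirit of DisC diversity is sketched below: repeatedly add an uncovered object to the diverse subset and mark everything within radius r as represented. The similar() predicate is an assumption standing in for the paper's M-tree-backed range queries, and the arbitrary pick order is a simplification of the proposed heuristics.

```python
# Greedy covering heuristic: every object ends up represented by some
# member of the diverse subset, and members are mutually dissimilar
# because covered objects are never picked again.

def greedy_disc(objects, similar, r):
    """similar(a, b, r): True if a and b are within distance r."""
    uncovered = set(objects)
    diverse = []
    while uncovered:
        pick = uncovered.pop()           # arbitrary pick; a simplification
        diverse.append(pick)
        uncovered = {o for o in uncovered if not similar(pick, o, r)}
    return diverse
```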
Article
Online Social Networks (OSNs) allow users to create and share content (e.g., posts, status updates, comments) in real time. These activities are collected in an activity log (e.g., the Facebook Wall or Google+ Stream) on the user's social network profile. With time, the activity logs of users, which record the sequences of social activities, become too long and consequently hard to view and navigate. To alleviate this clutter, it is useful to select a small subset of the social activities within a specified time period as representative, i.e., as a summary, for this time period. In this paper, we study the novel problem of social activity log summarization. We propose LogRank, a novel and principled algorithm to select activities that satisfy three desirable criteria: First, activities must be important for the user. Second, they must be diverse in terms of topic, e.g., cover several of the major topics in the activity log. Third, they should be time-dispersed, that is, spread across the specified time range of the activity log. LogRank operates on an appropriately augmented social interaction graph and employs random-walk techniques to holistically balance all three criteria. We evaluate LogRank and its variants on a real dataset from the Google+ social network and show that they outperform baseline approaches.
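Since LogRank is described as a random-walk method over an augmented graph, a personalized PageRank sketch conveys the mechanism. The augmented-graph construction linking activities to topic and time nodes is an assumption here; the actual balancing of the three criteria follows the paper, not this sketch.

```python
# Personalized PageRank over an (assumed) augmented graph in which
# activity nodes link to topic nodes and time-bucket nodes; stationary
# scores then reward important, topically spread, time-dispersed posts.

def personalized_pagerank(adj, restart, alpha=0.15, iters=100):
    """adj: {node: [neighbors]}; every node must appear as a key.
    restart: {node: prob}, summing to 1."""
    score = dict(restart)
    for _ in range(iters):
        nxt = {n: alpha * restart.get(n, 0.0) for n in adj}
        for n, neighbors in adj.items():
            if neighbors:
                share = (1 - alpha) * score.get(n, 0.0) / len(neighbors)
                for m in neighbors:
                    nxt[m] += share
        score = nxt
    return score
```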
Conference Paper
We study the problem of answering ambiguous web queries in a setting where there exists a taxonomy of information, and that both queries and documents may belong to more than one category according to this taxonomy. We present a systematic approach to diversifying results that aims to minimize the risk of dissatisfaction of the average user. We propose an algorithm that well approximates this objective in general, and is provably optimal for a natural special case. Furthermore, we generalize several classical IR metrics, including NDCG, MRR, and MAP, to explicitly account for the value of diversification. We demonstrate empirically that our algorithm scores higher in these generalized metrics compared to results produced by commercial search engines.
Conference Paper
In this paper we describe a general framework for evaluating and optimizing methods for diversifying query results. In these methods, an initial candidate set produced by a query is used to construct a result set whose elements are ranked with respect to relevance and diversity features, i.e., the retrieved elements should be as relevant as possible to the query and, at the same time, the result set should be as diverse as possible. While addressing relevance is relatively simple and has been heavily studied, diversity is a harder problem to solve. One major contribution of this paper is that, using the above framework, we adapt, implement, and evaluate several existing methods for diversifying query results. We also propose two new approaches, namely the Greedy with Marginal Contribution (GMC) and the Greedy Randomized with Neighborhood Expansion (GNE) methods. Another major contribution of this paper is that we present the first thorough experimental evaluation of the various diversification techniques implemented in a common framework. We examine the methods' performance with respect to precision, running time, and quality of the result. Our experimental results show that while the proposed methods have higher running times, they achieve precision very close to the optimal, while also providing the best result quality. While GMC is deterministic, the randomized approach (GNE) can achieve better result quality if the user is willing to trade off running time.
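To illustrate the "marginal contribution" idea behind GMC, the sketch below greedily adds the candidate whose relevance plus summed distance to the already-chosen elements is largest (a max-sum flavor, in contrast to the max-min MMR sketch earlier). The weight lam is an assumed parameter, and GNE's randomized neighborhood expansion is not shown.

```python
# Greedy with marginal contribution (max-sum flavor): a candidate's
# gain is its relevance plus its total distance to the chosen set.

def gmc_greedy(candidates, rel, dist, k, lam=0.5):
    result, pool = [], list(candidates)
    while pool and len(result) < k:
        best = max(pool, key=lambda c: lam * rel(c)
                   + (1 - lam) * sum(dist(c, s) for s in result))
        pool.remove(best)
        result.append(best)
    return result
```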
Article
Given an entity represented by a single node q in a semantic knowledge graph D, the Graphical Entity Summarisation (GES) problem consists in selecting out of D a very small surrounding graph S that constitutes a generic summary of the information concerning the entity q, with a given limit on the size of S. This article concerns the role of diversity in this quite novel problem. It gives an overview of the diversity concept in information retrieval and proposes how to adapt it to GES. A measure of diversity for GES, called ALC, is defined, and two algorithms are presented: the baseline, diversity-oblivious PRECIS and the diversity-aware DIVERSUM. A reported experiment shows that DIVERSUM actually achieves higher values of the ALC diversity measure than PRECIS. Next, an objective evaluation experiment demonstrates that the diversity-aware algorithm is superior to the diversity-oblivious one in terms of fact selection. More precisely, DIVERSUM clearly achieves higher recall than PRECIS on ground-truth reference entity summaries extracted from Wikipedia. We also report another intrinsic experiment, in which the output of the diversity-aware algorithm is significantly preferred by human expert evaluators. Importantly, the user feedback clearly indicates that the notion of diversity is the key reason for the preference. In addition, the experiment is repeated twice on an anonymous sample of a broad population of Internet users by means of a crowd-sourcing platform, which further confirms the results mentioned above.
Article
This paper presents a different perspective on diversity in search results: diversity by proportionality. We consider a result list most diverse, with respect to some set of topics related to the query, when the number of documents it provides on each topic is proportional to the topic's popularity. Consequently, we propose a framework for optimizing proportionality for search result diversification, which is motivated by the problem of assigning seats to members of competing political parties. Our technique iteratively determines, for each position in the result ranked list, the topic that best maintains the overall proportionality. It then selects the best document on this topic for this position. We demonstrate empirically that our method significantly outperforms the top performing approach in the literature not only on our proposed metric for proportionality, but also on several standard diversity measures. This result indicates that promoting proportionality naturally leads to minimal redundancy, which is a goal of the current diversity approaches.
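Since the framework is motivated by seat apportionment, a Sainte-Lague-style sketch captures the mechanics: at each rank position, award the "seat" to the topic with the highest quotient popularity / (2 * seats + 1) and emit that topic's best remaining document. The per-topic popularities and document rankings are assumed inputs, and the paper's full method differs in its document-selection step.

```python
# Sainte-Lague-style proportional ranking: the topic whose quotient
# popularity / (2 * seats + 1) is highest wins the next result slot.

def proportional_ranking(topic_popularity, topic_docs, k):
    """topic_docs: {topic: [documents, best first]}; returns k results."""
    seats = {t: 0 for t in topic_popularity}
    ranking = []
    for _ in range(k):
        eligible = [t for t in topic_popularity if topic_docs[t]]
        if not eligible:
            break
        winner = max(eligible,
                     key=lambda t: topic_popularity[t] / (2 * seats[t] + 1))
        ranking.append(topic_docs[winner].pop(0))
        seats[winner] += 1
    return ranking
```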