Conference Paper

Entwining structure into web search

Authors:
To read the full-text of this research, you can request a copy directly from the author.

Abstract

Keyword based retrieval has dominated web search with its simplicity and effectiveness. However, given the abundance of structured and semi-structured data, there is an opportunity for an improved user experience. In this paper, we present an overview of our past work on analyzing keyword queries and extracting latent structured semantics, mapping them to corresponding data sources and enriching the user experience with the inclusion of structured content.

No full-text available

Request Full-text Paper PDF

To read the full-text of this research,
you can request a copy directly from the author.

Article
Keyword search is an effective paradigm for information discovery and has been introduced recently to query XML documents. Scoring of XML search results is an important issue in XML keyword search. Traditional “bag-of-words” model cannot differentiate the roles of keywords as well as the relationship between keywords, thus is not proper for XML keyword queries. In this paper, we present a new scoring method based on a novel query model, called keyword query with structure (QWS), which is specially designed for XML keyword query. The method is based on a totally new view taken by the QWS model on a keyword query that, a keyword query is a composition of several query units, each representing a query condition. We believe that this method captures the semantic relevance of the search results. The paper first introduces an algorithm reformulating a keyword query to a QWS. Then, a scoring method is presented which measures the relevance of search results according to how many and how well the query conditions are matched. The scoring method is also extended to clusters of search results. Experimental results verify the effectiveness of our methods.
Article
Full-text available
Nowadays, there are many queries issued to search engines targeting at finding values from structured data (e.g., movie showtime of a specific location). In such scenarios, there is often a mismatch between the values of structured data (how content creators describe entities) and the Web queries (how different users try to retrieve them). Therefore, recognizing the alternative ways people use to reference an entity, is crucial for structured Web search. In this paper, we study the problem of automatic generation of entity synonyms over structured data towards closing the gap between users and structured data. We propose an off-line, data-driven approach that mines query logs for instances where content creators and Web users apply a variety of strings to refer to the same webpages. This way, given a set of strings that reference entities, we generate an expanded set of equivalent strings (entity synonyms) for each entity. Our framework consists of three modules: candidate generation, candidate selection and noise cleaning. We further study the cause of the problem through the identification of different entity synonym classes. The proposed method is verified with experiments on real-life datasets showing that we can significantly increase the coverage of structured Web queries with good precision.
Conference Paper
Full-text available
In recent years, there has been a strong trend of incorporating results from structured data sources into keyword-based web search systems such as Bing or Amazon. When presenting structured data, facets are a powerful tool for navigating, refining, and grouping the results. For a given structured data source, a fundamental problem in supporting faceted search is finding an ordered selection of attributes and values that will populate the facets. This creates two sets of challenges. First, because of the limited screen real-estate, it is important that the top facets best match the anticipated user intent. Second, the huge scale of available data to such engines demands an automated unsupervised solution. In this paper, we model the user faceted-search behavior using the intersection of web query-logs with existing structured data. Since web queries are formulated as free-text queries, a challenge in our approach is the inherent ambiguity in mapping keywords to the different possible attributes of a given entity type. We present an automated solution that elicits user preferences on attributes and values, employing different disambiguation techniques ranging from simple keyword matching, to more sophisticated probabilistic models. We demonstrate experimentally the scalability of our solution by running it on over a thousand categories of diverse entity types and measure the facet quality with a real-user study.
Conference Paper
Full-text available
Queries asked on web search engines often target structured data, such as commercial products, movie showtimes, or airline schedules. However, surfacing relevant results from such data is a highly challenging problem, due to the unstructured language of the web queries, and the imposing scalability and speed requirements of web search. In this paper, we discover latent structured semantics in web queries and produce Structured Annotations for them. We consider an annotation as a mapping of a query to a table of structured data and attributes of this table. Given a collection of structured tables, we present a fast and scalable tagging mechanism for obtaining all possible annotations of a query over these tables. However, we observe that for a given query only few are sensible for the user needs. We thus propose a principled probabilistic scoring mechanism, using a generative model, for assessing the likelihood of a structured annotation, and we define a dynamic threshold for filtering out misinterpreted query annotations. Our techniques are completely unsupervised, obviating the need for costly manual labeling effort. We evaluated our techniques using real world queries and data and present promising experimental results.
Conference Paper
In web search today, a user types a few keywords which are then matched against a large collection of unstructured web pages. This leaves a lot to be desired for when the best answer to a query is contained in structured data stores and/or when the user includes some structural semantics in the query. In our work, we include information from structured data sources into web results. Such sources can vary from fully relational DBs, to flat tables and XML files. In addition, we take advantage of information in such sources to automatically extract corresponding semantics from the query and use them appropriately in improving the overall relevance of results. For this demonstration, we show how we effectively capture, an- notate and translate web user queries such as 'popular digital cam- era around $425' returning results from a shopping-like DB. Categories and Subject Descriptors: H.2.8 Database Applica- tions General Terms: Performance.
Conference Paper
Recognizing the alternative ways people use to reference an entity, is important for many Web applications that query structured data. In such applications, there is often a mismatch between how content creators describe entities and how different users try to retrieve them. In this paper, we consider the problem of determining whether a candidate query approximately matches with an entity. We propose an off-line, data-driven, bottom-up approach that mines query logs for instances where Web content creators and Web users apply a variety of strings to refer to the same Web pages. This way, given a set of strings that reference entities, we generate an expanded set of equivalent strings for each entity. The proposed method is verified with experiments on real-life data sets showing that we can dramatically increase the queries that can be matched.
Article
Web search engines and specialized online verticals are increasingly incorporating results from structured data sources to answer semantically rich user queries. For example, the query \WebQuery{Samsung 50 inch led tv} can be answered using information from a table of television data. However, the users are not domain experts and quite often enter values that do not match precisely the underlying data. Samsung makes 46- or 55- inch led tvs, but not 50-inch ones. So a literal execution of the above mentioned query will return zero results. For optimal user experience, a search engine would prefer to return at least a minimum number of results as close to the original query as possible. Furthermore, due to typical fast retrieval speeds in web-search, a search engine query execution is time-bound. In this paper, we address these challenges by proposing algorithms that rewrite the user query in a principled manner, surfacing at least the required number of results while satisfying the low-latency constraint. We formalize these requirements and introduce a general formulation of the problem. We show that under a natural formulation, the problem is NP-Hard to solve optimally, and present approximation algorithms that produce good rewrites. We empirically validate our algorithms on large-scale data obtained from a commercial search engine's shopping vertical.