
Alon Halevy - Google Inc.
Publications: 211
Reads: 47,228
Citations: 26,435
Publications (211)
Every five years, a group of the leading database researchers meet to reflect on their community's impact on the computing industry as well as examine current research challenges.
Approximately every five years, a group of database researchers meet to do a self-assessment of our community, including reflections on our impact on the industry as well as challenges facing our research community. This report summarizes the discussion and conclusions of the 9th such meeting, held during October 9-10, 2018 in Seattle.
In 2008, we wrote about WebTables, an effort to exploit the large and diverse set of structured databases casually published online in the form of HTML tables. The past decade has seen a flurry of research and commercial activities around the WebTables project itself, as well as the broad topic of informal online structured data. In this paper, we...
Every few years a group of database researchers meets to discuss the state of database research, its impact on practice, and important new directions. This report summarizes the discussion and conclusions of the eighth such meeting, held October 14-15, 2013 in Irvine, California. It observes that Big Data has now become a defining challenge...
List information can be extracted into database tables. A number of fields are independently determined for items in list. A number of database table columns are determined from most common number of list item fields. New fields are determined for items with more fields than database columns. Null fields are inserted into items with fewer fields th...
We consider the problem of using humans to find a bounded number of items satisfying certain properties, from a data set. For instance, we may want humans to identify a select number of travel photos from a data set of photos to display on a travel website, or a candidate set of resumes that meet certain requirements from a large pool of applicants...
Large-scale map visualization systems play an increasingly important role in presenting geographic datasets to end-users. Since these datasets can be extremely large, a map rendering system often needs to select a small fraction of the data to visualize them in a limited space. This article addresses the fundamental challenge of thinning: determini...
For the first time since the emergence of the Web, structured data is playing a key role in search engines and is therefore being collected via a concerted effort. Much of this data is being extracted from the Web, which contains vast quantities of structured data on a variety of domains, such as hobbies, products and reference data. Moreover, the...
With massive amounts of data being generated and stored ubiquitously in every discipline and every aspect of our daily life, how to handle such big data poses many challenging issues to researchers in data and information systems. The participants of CIKM 2013 are active researchers on large scale data, information and knowledge management, from mu...
Since the beginning of the Semantic Web initiative, significant efforts have been invested in finding efficient ways to publish, store, and query metadata on the Web. RDF and SPARQL have become the standard data model and query language, respectively, ...
A system and a method for ranking search results of local search queries. A local search query and a current location of a user are received. Next, two or more places that satisfy the local search query are identified, and for each respective place a corresponding distance from the current location of the user to the respective place is also identi...
Several recent works have focused on harvesting HTML tables from the Web and recovering their semantics [Cafarella et al., 2008a; Elmeleegy et al., 2009; Limaye et al., 2010; Venetis et al., 2011]. As a result, hundreds of millions of high quality structured data tables can now be explored by the users. In this paper, we argue that those efforts on...
Among other disclosure, a computer-implemented method of analyzing a form page for indexing includes identifying a form page that is configured for use in requesting any of multiple target pages. The form page includes multiple input controls. The method includes identifying at least one of the multiple input controls as being informative with rega...
One embodiment of the present invention provides a system that facilitates searching through content which is accessible through web-based forms. During operation, the system receives a query containing keywords. Next, the system analyzes the query to create a structured query. The system then performs a lookup based on the structured query in a dat...
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for clustering query refinements. One method includes building a representation of a graph for a first query, wherein the graph has a node for the first query, a node for each of a plurality of refinements for the first query, and a node for each documen...
Google Fusion Tables aims to support an ecosystem of structured data on the Web by providing a tool for managing and visualizing data on the one hand, and for searching and exploring for data on the other. This paper describes a few recent developments in our efforts to further the ecosystem.
The World-Wide Web contains vast quantities of structured data on a variety of domains, such as hobbies, products and reference data. Moreover, the Web provides a platform that can encourage publishing more data sets from governments and other public organizations and support new data management opportunities, such as effective crisis response, dat...
We consider the problem of finding related tables in a large corpus of heterogeneous tables. Detecting related tables provides users a powerful tool for enhancing their tables with additional data and enables effective reuse of available public data. Our first contribution is a framework that captures several types of relatedness, including tables t...
Large-scale map visualization systems play an increasingly important role in presenting geographic datasets to end users. Since these datasets can be extremely large, a map rendering system often needs to select a small fraction of the data to visualize them in a limited space. This paper addresses the fundamental challenge of thinning: determining...
We are in the midst of very exciting times in which structured data is having a profound impact on many aspects of our lives. In many countries, citizens take for granted the fact that governments, local authorities, and non-government organizations should make a variety of data sets available to the public. These data sets span a variety of topics...
Data integration has been an important area of research for several years. However, such systems suffer from one of the main drawbacks of database systems: the need to invest significant modeling effort upfront. Dataspace support platforms (DSSP) envision a system that offers useful services on its data without any setup effort and that improves wi...
Conceptual modeling has been used mainly for supporting information systems (IS) development. In order to better capture requirements for developing IS, we have been extending conceptual models to include more business context (e.g., mission of the organization). ...
Many web-search queries serve as the beginning of an exploration of an unknown space of information, rather than looking for a specific web page. To answer such queries effectively, the search engine should attempt to organize the space of relevant information in a way that facilitates exploration. We describe the Aspector system that computes as...
Schemr is a search engine for users to search for and visualize schemas in a metadata repository. Users may search by keywords and by example, using schema fragments as query terms. Schemr uses a novel search algorithm, based on a combination of text search and schema matching techniques, coupled with a structurally-aware scoring metric. Schemr pre...
The Web offers a corpus of over 100 million tables [6], but the meaning of each table is rarely explicit from the table itself. Header rows exist in few cases and even when they do, the attribute names are typically useless. We describe a system that attempts to recover the semantics of tables by enriching the table with additional annotations. Our...
The practice of crowdsourcing is transforming the Web and giving rise to a new field.
Google's Web Tables and Deep Web Crawler identify and deliver this otherwise inaccessible resource directly to end users.
Studies find that at least 20% of web queries have local intent; and the fraction of queries with local intent that originate from mobile properties may be twice as high. The emergence of standardized support for location providers in web browsers, as well as of providers of accurate locations, enables so-called hyper-local web querying where the l...
Data integration systems offer users a uniform interface to a set of data sources. Previous work has typically assumed that the data sources are independent of each other; however, in scenarios involving large numbers of sources, such as the Web or large enterprises, there is an eco-system of dependent sources, where some sources copy parts of thei...
Web Data Management (or WDM) refers to a body of work concerned with leveraging the large collections of structured data that can be extracted from the Web. Over the past few years, several research and commercial efforts have explored these collections of data with the goal of improving Web search and developing mechanisms for surfacing different...
OpenII (openintegration.org) is a collaborative effort to create a suite of open-source tools for information integration (II). The project is leveraging the latest developments in II research to create a platform on which integration tools can be built and further research conducted. In addition to a scalable, extensible platform, OpenII includes...
The AAAI-10 Workshop program was held Sunday and Monday, July 11–12, 2010 at the Westin Peachtree Plaza in Atlanta, Georgia. The AAAI-10 workshop program included 13 workshops covering a wide range of topics in artificial intelligence. The titles of the workshops were AI and Fun, Bridging the Gap between Task and Motion Planning, Collaboratively-Bu...
It has long been observed that database management systems focus on traditional business applications, and that few people use a database management system outside their workplace. Many have wondered what it will take to enable the use of data management technology by a broader class of users and for a much wider range of applications. Google Fusio...
We describe the social features of Google Fusion Tables, a cloud-based data management service whose goal is to facilitate collaboration around data sets. The social features include the ability to specify attribution of data sets, a mechanism for conducting discussions on data (at fine granularity, such as row, column or cell), the ability to merg...
Communications' Virtual Extension brings more quality articles to ACM members. These articles are now available in the ACM Digital Library.
Google Fusion Tables is a cloud-based service for data management and integration. Fusion Tables enables users to upload tabular data files (spreadsheets, CSV, KML), currently of up to 100MB. The system provides several ways of visualizing the data (e.g., charts, maps, and timelines) and the ability to filter and aggregate the data. It supports the...
We address the problem of clustering the refinements of a user search query. The clusters computed by our proposed algorithm can be used to improve the selection and placement of the query suggestions proposed by a search engine, and can also serve to summarize the different aspects of information relevant to the original user query. Our algorithm...
The Age-Old Practice of Mass Collaboration is Transforming the Web and Giving Rise to a New Field. Mass collaboration systems enlist a multitude of humans to help solve a wide variety of problems. Over the past decade, numerous such systems have appeared on the World-Wide Web. Prime examples include Wikipedia, Linux, Yahoo! Answers, Amazon's Mechan...
In general terms, an uncertain relation encodes a set of possible certain relations. There are many ways to represent uncertainty, ranging from alternative values for attributes to rich constraint languages. Among the possible models for uncertain data, there is a tension between simple and intuitive models, which tend to be incomplete, and complet...
Over the past few years, we have built a system that has exposed large volumes of Deep-Web content to Google.com users. The content that our system exposes contributes to more than 1000 search queries per-second and spans over 50 languages and hundreds of domains. The Deep Web has long been acknowledged to be a major source of structured data on th...
The Web contains a vast amount of structured information such as HTML tables, HTML lists and deep-web databases; there is enormous potential in combining and re-purposing this data in creative ways. However, integrating data from this relational web raises several challenges that are not addressed by current data integration systems or mash-up tool...
The question of which role structured data can play in Web search has been raised from the early days of the Web. On the one hand, structured data can be used to answer factual queries. On the other, large amounts of structured data can be used to better organize web-content and therefore to improve search on a wide range of queries.
A large number of web pages contain data structured in the form of "lists". Many such lists can be further split into multi-column tables, which can then be used in more semantically meaningful tasks. However, harvesting relational tables from such lists can be a challenging task. The lists are manually generated and hence need not have well define...
Schemr is a schema search engine, and provides users the ability to search for and visualize schemas stored in a metadata repository. Users may search by keywords and by example, using schema fragments as query terms. Schemr uses a novel search algorithm, based on a combination of text search and schema matching techniques, as well as a struct...
Though search on the World-Wide Web has focused mostly on unstructured text, there is an increasing amount of structured data on the Web and growing interest in harnessing such data. I will describe several current projects at Google whose overall goal is to leverage structured data and better expose it to our users.
The first project is on crawlin...
A group of database researchers, architects, users, and pundits met in May 2008 at the Claremont Resort in Berkeley, CA, to discuss the state of database research and its effects on practice. This was the seventh meeting of this sort over the past 20 years and was distinguished by a broad consensus that the database community is at a turning point...
At Brown University, there was excitement of having access to the Brown Corpus, containing one million English words. Since then, we have seen several notable corpora that are about 100 times larger, and in 2006, Google released a trillion-word corpus with frequency counts for all sequences up to five words long. In some ways this corpus is a step b...
This paper reports our first set of results on managing uncertainty in data integration. We posit that data-integration systems need to handle uncertainty at three levels and do so in a principled fashion. First, the semantic mappings between the data sources and the mediated schema may be approximate because there may be too many of them to be cre...
This talk describes two projects whose overall goal is to make database management systems usable by a wider audience. Dataspaces aim to eliminate the upfront effort involved in creating a database. Data management for collaboration attempts to shift the focus of data management to supporting users in their natural environments and workflow.
Recently, the opportunity of extracting structured data from the Web has been identified by a number of research projects. One such example is that millions of relational-style HTML tables can be extracted from the Web. Traditional data integration approaches do not scale over such corpora with hundreds of small tables in one domain. To solve this...
Data integration has been an important area of research for several years. However, such systems suffer from one of the main drawbacks of database systems: the need to invest significant modeling effort upfront. Dataspace Support Platforms (DSSP) envision a system that offers useful services on its data without any setup effort, and improve with ti...
Data integration has been an important area of research for several years. In this chapter, we argue that supporting modern data integration applications requires systems to handle uncertainty at every step of integration. We provide a formal framework for data integration systems with uncertainty. We define probabilistic schema mappings and probab...
A long-standing goal of Web research has been to construct a unified Web knowledge base. Information extraction techniques have shown good results on Web inputs, but even most domain-independent ones are not appropriate for Web-scale operation. In this paper we describe three recent extraction systems that can be operated on the entire Web (t...
The Deep Web, i.e., content hidden behind HTML forms, has long been acknowledged as a significant gap in search engine coverage. Since it represents a large portion of the structured data on the Web, accessing Deep-Web content has been a long-standing challenge for the database community. This paper describes a system for surfacing Deep-Web content...
The World-Wide Web consists of a huge number of unstructured documents, but it also contains structured data in the form of HTML tables. We extracted 14.1 billion HTML tables from Google's general-purpose web crawl, and used statistical classification techniques to find the estimated 154M that contain high-quality relational data. Because each rela...
Dataspace systems offer services on data without requiring upfront semantic integration. In sharp contrast with existing information-integration systems, dataspaces systems offer best-effort answers even before semantic mappings are provided to the system. Dataspaces offer a pay-as-you-go approach to data management. Users (or administrators) of th...
This article describes some of the ongoing research projects related to structured data management at Google today. The organization of Google encourages research scientists to work closely with engineering teams. As a result, the research projects tend to be motivated by real needs faced by Google's products and services, and solutions are put int...
Web 2.0 refers to a set of technologies that enables individuals to create and share content on the Web. The types of content that are shared on Web 2.0 are quite varied and include photos and videos (e.g., Flickr, YouTube), encyclopedic knowledge (e.g., Wikipedia), the blogosphere, social book-marking and even structured data (e.g., Swivel, Many-ey...
This paper introduces ULDBs, an extension of relational databases with simple yet expressive constructs for representing and manipulating both lineage and uncertainty. Uncertain data and data lineage are two important areas of data management that have been considered extensively in isolation; however, many applications require the features in tande...
The World-Wide Web consists of a huge number of unstructured hypertext documents, but it also contains structured data in the form of HTML tables. Many of these tables contain both relational-style data and a small "schema" of labeled and typed columns, making each such table a small structured database. The WebTables project is an effort to extr...
Data integration systems offer a uniform interface to a set of data sources. Despite recent progress, setting up and maintaining a data integration application still requires significant up front effort of creating a mediated schema and semantic mappings from the data sources to the mediated schema. Many application contexts involving multiple dat...
A primary challenge to large-scale data integration is creating semantic equivalences between elements from different data sources that correspond to the same real-world entity or concept. Dataspaces propose a pay-as-you-go approach: automated mechanisms such as schema matching and reference reconciliation provide initial correspondences...
Both the resource description framework (RDF), used in the semantic web, and Maya Viz u-forms represent data as a graph of objects connected by labeled edges. Existing systems for flexible visualization of this kind of data require manual specification of the possible visualization roles for each data attribute. When the schema is large and unfamil...
A primary challenge to large-scale data integration is creating semantic equivalences between elements from different...
Genomic medicine aims to revolutionize health care by applying our growing understanding of the molecular basis of disease. Research in this arena is data intensive, which means data sets are large and highly heterogeneous. To create knowledge from data, researchers must integrate these large and diverse data sets. This presents daunting informatic...
The World Wide Web is witnessing an increase in the amount of structured content - vast heterogeneous collections of structured data are on the rise due to the Deep Web, annotation schemes like Flickr, and sites like Google Base. While this phenomenon is creating an opportunity for structured data management, dealing with heterogeneity on the web...
Dataspaces are collections of heterogeneous and partially unstructured data. Unlike data-integration systems that also offer uniform access to heterogeneous data sources, dataspaces do not assume that all the semantic relationships between sources are known and specified. Much of the user interaction with dataspaces involves exploring the data,...
Web 2.0 is a buzzword we have been hearing for over 2 years. According to Wikipedia, it hints at an improved form of the World Wide Web where technologies such as weblogs, social bookmarking, RSS feeds, photo and video sharing, based on an architecture of participation and democracy that encourages users to add value to the application as they use...
Most data management scenarios today rarely have a situation in which all the data that needs to be managed can fit nicely into a conventional relational DBMS, or into any other single data model or system. Instead, we see a set of loosely connected data sources, typically with the following recurring challenges:
– Users want to be able to search the...
The most acute information management challenges today stem from organizations relying on a large number of diverse, interrelated data sources, but having no means of managing them in a convenient, integrated, or principled fashion. These challenges arise in enterprise and government data management, digital libraries, "smart" homes and personal in...
The World Wide Web is witnessing an increase in the amount of structured content - vast heterogeneous collections of structured data are on the rise due to the Deep Web, annotation schemes like Flickr, and sites like Google Base. While this phenomenon is creating an opportunity for structured data management, dealing with heterogeneity on the web...
The development of relational database management systems served to focus the data management community for decades, with spectacular results. In recent years, however, the rapidly-expanding demands of "data everywhere" have led to a field comprised of interesting and productive efforts, but without a central focus or coordinated agenda. T...
This paper explores an inherent tension in modeling and querying uncertain data: simple, intuitive representations of uncertain data capture many application requirements, but these representations are generally incomplete, since standard operations over the data may result in unrepresentable types of uncertainty. Complete models are theoretically attract...
Data integration is a pervasive challenge faced in applications that need to query across multiple autonomous and heterogeneous data sources. Data integration is crucial in large enterprises that own a multitude of data sources, for progress in large-scale scientific projects, where data sets are being produced independently by multiple researcher...
Research on the Semantic Web has focused on reasoning about data that is semantically annotated in the RDF data model, with concepts and properties specified in rich ontology languages such as OWL. However, to flourish, the Semantic Web needs to provide interoperability both between sites with different ontologies and with existing, non-RDF data an...
The development of relational database management systems served to focus the data management community for decades, with spectacular results. In recent years, however, the rapidly-expanding demands of "data everywhere" have led to a field comprised of interesting and productive efforts, but without a central focus or coordinated agenda. The most a...