Jeffrey Ullman

Stanford University | SU · Department of Computer Science

About

412
Publications
74,381
Reads
73,510
Citations

Publications (412)
Article
Full-text available
Cloud data lakes emerge as an inexpensive solution for storing very large amounts of data. The main idea is the separation of compute and storage layers. Thus, cheap cloud storage is used for storing the data, while compute engines are used for running analytics on this data in "on-demand" mode. However, to perform any computation on the data in th...
Chapter
Full-text available
Open addressing hash tables, possibly under a double hashing policy, are regarded as more memory efficient than linked-list hashing, since the memory used for pointers can instead be used for a longer table; they also allow better expected performance, as the load factor is smaller and there are fewer expected collisions. We suggest further eliminating the single pointer...
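As a hedged illustration of the baseline being improved upon, here is a minimal open-addressing table under a double-hashing probe policy (the class name and sizing are invented for this sketch; it is not the pointer-eliminating construction the chapter proposes):

```python
# Minimal open-addressing hash table with double hashing.
# Illustrative sketch only, not the chapter's construction.

class DoubleHashTable:
    def __init__(self, capacity=11):          # capacity kept prime so every slot is probed
        self.capacity = capacity
        self.slots = [None] * capacity

    def _h1(self, key):
        return hash(key) % self.capacity

    def _h2(self, key):
        # Second hash must never be 0, so the probe sequence always advances.
        return 1 + (hash(key) // self.capacity) % (self.capacity - 1)

    def insert(self, key, value):
        i, step = self._h1(key), self._h2(key)
        for _ in range(self.capacity):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return
            i = (i + step) % self.capacity    # probe sequence: h1 + j*h2 (mod capacity)
        raise RuntimeError("table full")

    def get(self, key):
        i, step = self._h1(key), self._h2(key)
        for _ in range(self.capacity):
            if self.slots[i] is None:
                return None
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + step) % self.capacity
        return None
```

With a prime capacity, the step `h2` is coprime to the table size, so a probe sequence visits every slot, which is what makes double hashing behave well at higher load factors.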
Chapter
Full-text available
Computations such as syndromic surveillance and e-commerce are executed over datasets collected from different geographical locations. Modern data processing systems, such as MapReduce/Hadoop or Spark, also require collecting the data from different geographical locations to a single global location before executing an application, and thus...
Article
Full-text available
In recent years, an increasing amount of data has been collected in different, and often non-cooperating, databases. The problem of privacy-preserving distributed calculations over separate databases, and the related issue of private data release, has been intensively investigated. However, despite considerable progress, computational complexity a...
Article
Full-text available
Despite extensive research on cryptography, secure and efficient query processing over outsourced data remains an open challenge. This paper continues along the emerging trend in secure data processing that recognizes that the entire dataset may not be sensitive, and hence, non-sensitivity of data can be exploited to overcome limitations of ex...
Preprint
Full-text available
Despite extensive research on cryptography, secure and efficient query processing over outsourced data remains an open challenge. This paper continues along the emerging trend in secure data processing that recognizes that the entire dataset may not be sensitive, and hence, non-sensitivity of data can be exploited to overcome limitations of ex...
Presentation
Full-text available
Despite extensive research on cryptography, secure and efficient query processing over outsourced data remains an open challenge. This paper continues along the emerging trend in secure data processing that recognizes that the entire dataset may not be sensitive, and hence, non-sensitivity of data can be exploited to overcome limitations of existin...
Conference Paper
Full-text available
Despite extensive research on cryptography, secure and efficient query processing over outsourced data remains an open challenge. This poster continues along the emerging trend in secure data processing that recognizes that the entire dataset may not be sensitive, and hence, non-sensitivity of data can be exploited to overcome some of the limitatio...
Preprint
Full-text available
Despite extensive research on cryptography, secure and efficient query processing over outsourced data remains an open challenge. This paper continues along the emerging trend in secure data processing that recognizes that the entire dataset may not be sensitive, and hence, non-sensitivity of data can be exploited to overcome limitations of existin...
Article
Full-text available
In this paper we offer an algorithm that computes the multiway join efficiently in MapReduce even when the data is skewed. Handling skew is one of the major challenges in query processing, and computing joins is both important and costly. When data is huge, distributed computational platforms must be used. The algorithm Shares for computing multiway...
Article
Full-text available
A MapReduce algorithm can be described by a mapping schema, which assigns inputs to a set of reducers, such that for each required output there exists a reducer that receives all the inputs participating in the computation of this output. Reducers have a capacity that limits the sets of inputs they can be assigned. However, individual inputs may va...
Conference Paper
This panel critically examines the state of data: how its growth and ubiquity have confronted the computer science and particularly the database community, with new challenges. These challenges require practitioners and teachers to learn new skills and engage with other disciplines in ways they had not done before. Panelists will examine the impact...
Conference Paper
Full-text available
We consider the problem of computing the data-cube marginals of a fixed order k (i.e., all marginals that aggregate over k dimensions), using a single round of MapReduce. The focus is on the relationship between the reducer size (number of inputs allowed at a single reducer) and the replication rate (number of reducers to which an input is sent). W...
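The notion of an order-k marginal can be made concrete with a small single-machine sketch (function name and data layout are assumptions; the paper's subject is the reducer-size/replication-rate trade-off in MapReduce, not this in-memory computation):

```python
from itertools import combinations
from collections import defaultdict

def marginals_of_order_k(rows, d, k):
    """All data-cube marginals that aggregate away exactly k of the d dimensions.

    rows: iterable of (coords, value), where coords is a length-d tuple.
    Returns {kept_dimensions: {kept_coordinates: summed value}}.
    Illustrative sketch only; names and layout are invented here.
    """
    result = {}
    for kept in combinations(range(d), d - k):   # the d-k dimensions we keep
        sums = defaultdict(int)
        for coords, value in rows:
            sums[tuple(coords[i] for i in kept)] += value
        result[kept] = dict(sums)
    return result
```

Each marginal is a group-by over the kept dimensions; there are C(d, k) of them, which is why the assignment of inputs to reducers (the replication rate) becomes the central cost question in the paper's setting.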
Article
Full-text available
In recent years, an increasing amount of data has been collected in different, and often non-cooperating, databases. The problem of privacy-preserving distributed calculations over separate databases, and the related issue of private data release, has been intensively investigated. However, despite considerable progress, computational complexity, du...
Article
Full-text available
We consider the problem of computing the data-cube marginals of a fixed order $k$ (i.e., all marginals that aggregate over $k$ dimensions), using a single round of MapReduce. The focus is on the relationship between the reducer size (number of inputs allowed at a single reducer) and the replication rate (number of reducers to which an input is sent...
Conference Paper
Full-text available
The federation of cloud and big data activities is the next challenge, where MapReduce should be modified to avoid (big) data migration across remote (cloud) sites. This is exactly our scope of research: only the data essential for obtaining the result is transmitted, reducing communication and processing costs and preserving data privacy as much...
Conference Paper
Full-text available
An important parameter to be considered in MapReduce algorithms is the "reducer capacity," introduced here for the first time. The reducer capacity is an upper bound on the sum of the sizes of the inputs that are assigned to the reducer. We consider, for the first time, inputs of different sizes being sent to the reducers. Another s...
Poster
Full-text available
An important parameter to be considered in MapReduce algorithms is the “reducer capacity,” introduced here for the first time. The reducer capacity is an upper bound on the sum of the sizes of the inputs that are assigned to the reducer. We consider, for the first time, inputs of different sizes being sent to the reducers. Another s...
Article
Handling skew is one of the major challenges in query processing. In distributed computational environments such as MapReduce, uneven distribution of the data to the servers is not desired. One of the dominant measures that we want to optimize in distributed environments is communication cost. In a MapReduce job this is the amount of data that is t...
Conference Paper
Full-text available
We consider the problem of 2-way interval join, where we want to find all pairs of overlapping intervals, i.e., intervals that share at least one point in common. We present lower and upper bounds on the replication rate for this problem when it is implemented in MapReduce. We study three cases, where intervals in the input are: (i) unit-length and...
Data
A MapReduce algorithm can be described by a mapping schema, which assigns inputs to a set of reducers, such that for each required output there exists a reducer that receives all the inputs that participate in the computation of this output. Reducers have a capacity, which limits the sets of inputs that they can be assigned. However, individual inp...
Data
A MapReduce algorithm can be described by a mapping schema, which assigns inputs to a set of reducers, such that for each required output there exists a reducer that receives all the inputs that participate in the computation of this output. Reducers have a capacity, which limits the sets of inputs that they can be assigned. However, individual inp...
Article
Full-text available
We study the problem of computing the join of $n$ relations in multiple rounds of MapReduce. We introduce a distributed and generalized version of Yannakakis's algorithm, called GYM. GYM takes as input any generalized hypertree decomposition (GHD) of a query of width $w$ and depth $d$, and computes the query in $O(d)$ rounds and $O(n(\mathrm{IN}^w...
Article
In Dremel, data is stored as nested relations. The schema for a relation is a tree, all of whose nodes are attributes, and whose leaf attributes hold values. We explore filter and aggregate queries that are given in the Dremel dialect of SQL. Complications arise because of repeated attributes, i.e., attributes that are allowed to have more than one...
Book
The popularity of the Web and Internet commerce provides many extremely large datasets from which information can be gleaned by data mining. This book focuses on practical algorithms that have been used to solve key problems in data mining and which can be used on even the largest datasets. It begins with a discussion of the map-reduce framework, a...
Conference Paper
The all-pairs problem is an input-output relationship where each output corresponds to a pair of inputs, and each pair of inputs has a corresponding output. It models similarity joins where no simplification of the search for similar pairs, e.g., locality-sensitive hashing, is possible, and each input must be compared with every other input to dete...
Conference Paper
Full-text available
Recently, a great deal of interest in Big Data has arisen, driven mainly by a wide range of research problems strongly related to real-life applications and systems, such as representing, modeling, processing, querying and mining massive, distributed, large-scale repositories (mostly of an unstructured nature). Inspired by this main tre...
Article
As MapReduce/Hadoop grows in importance, we find more exotic applications being written this way. Not every program written for this platform performs as well as we might wish. There are several reasons why a MapReduce program can underperform expectations. One is the need to balance the communication cost of transporting data from the mappers to t...
Article
An introduction to designing algorithms for the MapReduce framework for parallel processing of big data.
Article
Full-text available
The theme of this paper is how to find all instances of a given "sample" graph in a larger "data graph," using a single round of map-reduce. For the simplest sample graph, the triangle, we improve upon the best known such algorithm. We then examine the general case, considering both the communication cost between mappers and reducers and the total...
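For the triangle, the simplest sample graph, a single-machine enumeration looks like the following sketch (names invented; the paper's one-round map-reduce algorithm and its communication-cost analysis are not reproduced here):

```python
from itertools import combinations

def triangles(edges):
    """Enumerate all triangles in an undirected graph given as an edge list.

    Illustrative single-machine sketch; the paper's contribution is
    distributing this search in one round of map-reduce.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    found = set()
    for u in adj:
        # For each pair of neighbors of u, check whether they are adjacent.
        for v, w in combinations(sorted(adj[u]), 2):
            if w in adj[v]:
                found.add(tuple(sorted((u, v, w))))   # canonical form avoids duplicates
    return found
```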
Article
Full-text available
In this paper we study the tradeoff between parallelism and communication cost in a map-reduce computation. For any problem that is not "embarrassingly parallel," the finer we partition the work of the reducers so that more parallelism can be extracted, the greater will be the total communication between mappers and reducers. We introduce a model o...
Article
Full-text available
A significant amount of recent research work has addressed the problem of solving various data management problems in the cloud. The major algorithmic challenges in map-reduce computations involve balancing a multitude of factors such as the number of machines available for mappers/reducers, their memory requirements, and communication cost (total...
Article
Full-text available
Fuzzy/similarity joins have been widely studied in the research community and extensively used in real-world applications. This paper proposes and evaluates several algorithms for finding all pairs of elements from an input set that meet a similarity threshold. The computation model is a single MapReduce job. Because we allow only one MapReduce rou...
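The problem statement, all pairs from an input set meeting a similarity threshold, can be sketched as a naive quadratic baseline (Jaccard similarity is assumed here for illustration; the paper's contribution is performing this in a single MapReduce round with bounded reducer input, which this sketch does not attempt):

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

def similarity_join(sets, threshold):
    """Naive all-pairs similarity join: indices (i, j) of every pair of
    input sets whose Jaccard similarity meets the threshold.
    Quadratic single-machine baseline, for illustration only."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(sets), 2)
            if jaccard(a, b) >= threshold]
```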
Conference Paper
Full-text available
In this paper, we design and analyze parallel algorithms for skyline queries. The skyline of a multidimensional set consists of the points for which no other point exists that is at least as good along every dimension. As a framework for parallel computation, we use both the MP model proposed in (Koutris and Suciu, PODS 2011), which requires that t...
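The skyline definition above translates directly into a quadratic single-machine sketch (a "larger is better" dominance convention is assumed here; the paper's interest is the parallel computation, not this baseline):

```python
def skyline(points):
    """Points not dominated by any other point.

    p dominates q if p is at least as good as q in every dimension and
    strictly better in at least one ('larger is better' assumed).
    Quadratic illustrative sketch only.
    """
    def dominates(p, q):
        return (all(a >= b for a, b in zip(p, q))
                and any(a > b for a, b in zip(p, q)))
    return [p for p in points if not any(dominates(q, p) for q in points)]
```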
Article
Implementations of map-reduce are being used to perform many operations on very large data. We examine strategies for joining several relations in the map-reduce environment. Our new approach begins by identifying the "map-key," the set of attributes that identify the Reduce process to which a Map process must send a particular tuple. Each attribut...
Article
Full-text available
Implementations of map-reduce are being used to perform many operations on very large data. We explore alternative ways that a system could use the environment and capabilities of map-reduce implementations such as Hadoop. In particular, we look at strategies for combining the natural join of several relations. The general strategy we em...
Article
Implementations of map-reduce are being used to perform many operations on very large data. We examine strategies for joining several relations in the map-reduce environment. Our new approach begins by identifying the “map-key,” the set of attributes that identify the Reduce process to which a Map process must send a particular tuple. Each attribut...
Article
While classic data management focuses on the data itself, research on Business Processes considers also the context in which this data is generated and manipulated, namely the processes, the users, and the goals that this data serves. This allows ...
Conference Paper
Full-text available
We survey the recent wave of extensions to the popular map-reduce systems, including those that have begun to address the implementation of recursive queries using the same computing environment as map-reduce. A central problem is that recursive tasks cannot deliver their output only at the end, which makes recovery from failures much more complica...
Article
Full-text available
Implementing recursive algorithms on computing clusters presents a number of new challenges. In particular, we consider the endgame problem: later rounds of a recursion often transfer only small amounts of data, causing high overhead for interprocessor communication. One way to deal with the endgame problem is to use an algorithm that reduces the n...
Conference Paper
We consider the problem of recommending the best set of k items when there is an inherent ordering between items, expressed as a set of prerequisites (e.g., the movie 'Godfather I' is a prerequisite of 'Godfather II'). Since this general problem is computationally intractable, we develop 3 approximation algorithms to solve this problem for various...
Article
There has been considerable past work studying data integration and uncertain data in isolation. We develop the theory of local-as-view (LAV) data integration when the sources being integrated are uncertain. We motivate two distinct settings for uncertain-data integration. We then define containment of uncertain databases in these settings, which a...
Conference Paper
The cluster-computing environment typified by Hadoop, the open-source implementation of map-reduce, is receiving serious attention as the way to execute queries and other operations on very large-scale data. Datalog execution presents several unusual issues for this enviroment. We discuss the best way to execute a round of seminaive evaluation on a...
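A round of seminaive evaluation, the strategy discussed for the cluster setting, can be sketched on a single machine with the classic transitive-closure Datalog program (names invented; the execution over Hadoop is the paper's subject, not this in-memory loop):

```python
def transitive_closure(edges):
    """Seminaive evaluation of:
        path(x, y) :- edge(x, y).
        path(x, z) :- path(x, y), edge(y, z).

    Each round joins only the newly derived facts (the delta) with edge,
    so already-known facts are not rederived. Single-machine sketch.
    """
    edges = set(edges)
    total = set(edges)
    delta = set(edges)
    while delta:
        new = {(x, z) for (x, y) in delta for (y2, z) in edges if y == y2}
        delta = new - total          # only genuinely new facts feed the next round
        total |= delta
    return total
```

The delta set shrinking toward the end of the recursion is precisely the "endgame" behavior that makes fault-tolerant cluster execution of such programs awkward.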
Conference Paper
We address schema design in uncertain databases. Since uncertain data is relational in nature, decomposition becomes a key issue in design. Decomposition relies on dependency theory, and primarily on functional dependencies. We study the theory of functional dependencies (FDs) for uncertain relations. We define several kinds of horizonta...
Article
We study the problem of generating efficient, equivalent rewritings using views to compute the answer to a query. We take the closed-world assumption, in which views are materialized from base relations, rather than views describing sources in terms of abstract predicates, as is common when the open-world assumption is used. In the closed-world mod...
Article
This book provides the foundation for understanding the theory and practice of compilers.
Chapter
A number of ideas concerning information-integration tools can be thought of as constructing answers to queries using views that represent the capabilities of information sources. We review the formal basis of these techniques, which are closely related to containment algorithms for conjunctive queries and/or Datalog programs. Then we compare the a...
Chapter
Wrappers provide access to heterogeneous information sources by converting application queries into source specific queries or commands. In this paper we present a wrapper implementation toolkit that facilitates rapid development of wrappers. We focus on the query translation component of the toolkit, called the converter. The converter takes as in...
Article
Full-text available
The database research conducted at Lowell, focusing on the integration of text, data, and code, the fusion of information from heterogeneous data sources, and information privacy, is discussed. The object-oriented (OO) and object-relational (OR) database management systems (DBMS) showed how text and other data types can be added to a DBMS. Several goals ment...
Conference Paper
Gradiance On-Line Accelerated Learning (GOAL) is a system for creating and automatically grading homeworks, programming laboratories, and tests. Through the concept of "root questions," Gradiance encourages students to solve complete problems, even though the homework appears to be in a multiple-choice format. Students are offered a hint or advice...
Conference Paper
The education industry has a very poor record of productivity gains. In this brief article, I outline some of the ways the teaching of a college course in database systems could be made more efficient, and staff time used more productively. These ideas ...
Conference Paper
The education industry has a very poor record of productivity gains. In this brief article, I outline some of the ways the teaching of a college course in database systems could be made more efficient, and staff time used more productively. These ideas carry over to other programming-oriented courses, and many of them apply to any academic subject...
Conference Paper
As database system research evolves, there are several enduring themes. One, of course, is how we deal with the largest possible amounts of data. A less obvious theme is optimization - it is an essential ingredient of all modern forms of database system. Because we deal with large volumes of data, we are often forced to process that data in regular ways....
Article
The goal of the Tsimmis Project is to develop tools that facilitate the rapid integration of heterogeneous information sources that may include both structured and unstructured data. This paper gives an overview of the project, describing components that extract properties from unstructured objects, that translate information into a common object model, th...
Article
We consider the problem of analyzing market-basket data and present several important contributions. First, we present a new algorithm for finding large itemsets which uses fewer passes over the data than classic algorithms, and yet uses fewer candidate itemsets than methods based on sampling. We investigate the idea of item reordering, which can i...
Conference Paper
Full-text available
The problem of answering queries using views has been studied extensively due to its relevance in a wide variety of data-management applications. In these applications, we often need to select a subset of views to maintain due to limited resources. In this paper, we show that traditional query containment is not a good basis for deciding whether or...
Article
Full-text available
The TSIMMIS system [1] integrates data from multiple heterogeneous sources and provides users with seamless integrated views of the data. It translates a user query on the integrated views into a set of source queries and postprocessing steps that compute the answer to the user query from the results of the source queries. TSIMMIS uses...
Article
Full-text available
We study the problem of generating efficient, equivalent rewritings using views to compute the answer to a query. We take the closed-world assumption, in which views are materialized from base relations, rather than views describing sources in terms of abstract predicates, as is common when the open-world assumption is used. In the closed-world mod...
Article
In data-integration systems, the queries supported by a mediator are affected by the query-processing limitations of the sources being integrated. Existing mediation systems employ a variety of mechanisms to describe the query-processing capabilities of sources. However, these systems do not compute the capabilities of the mediators based on the cap...
Conference Paper
Several commercial applications, such as online comparison shopping and process automation, require integrating information that is scattered across multiple websites or XML documents. Much research has been devoted to this problem, resulting in several research prototypes and commercial implementations. Such systems rely on wrappers that provide r...
Book
Data Structures is a first book on algorithms and data structures, using an object- oriented approach. The target audience for the book is a second-year CS class introducing fundamental data structures and their associated algorithms. This second ...
Article
Full-text available
Association-rule mining has heretofore relied on the condition of high support to do its work efficiently. In particular, the well-known a priori algorithm is only effective when the only rules of interest are relationships that occur very frequently. However, there are a number of applications, such as data mining, identification of similar Web do...
Conference Paper
The standard model for association-rule mining involves a set of “items” and a set of “baskets.” The baskets contain items that some customer has purchased at the same time. The problem is to find pairs, or perhaps larger sets, of items that frequently appear together in baskets. We mention the principal approaches to efficient, large-scale discove...
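The a-priori idea, counting only candidate pairs whose items are individually frequent, can be sketched as a two-pass baseline over the items-and-baskets model described above (function name and support convention are assumptions for this sketch):

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, support):
    """Two-pass a-priori sketch for frequent pairs.

    Pass 1 counts items; a pair can be frequent only if both of its items
    are, so pass 2 counts only pairs of individually frequent items.
    Illustrative; thresholds and names are invented here.
    """
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent = {i for i, c in item_counts.items() if c >= support}
    pair_counts = Counter(
        pair
        for basket in baskets
        for pair in combinations(sorted(frequent.intersection(basket)), 2)
    )
    return {pair for pair, c in pair_counts.items() if c >= support}
```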
Article
Many applications compute aggregate functions (such as COUNT, SUM) over an attribute (or set of attributes) to find aggregate values above some specified threshold. We call such queries iceberg queries because the number of above-threshold results is often very small (the tip of an iceberg), relative to the large amount of input data (the iceberg)....
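An iceberg query is, in SQL terms, a GROUP BY with a HAVING threshold. A minimal in-memory sketch follows (field names are invented; the paper's algorithms are precisely about avoiding materializing all groups when only the "tip of the iceberg" exceeds the threshold):

```python
from collections import Counter

def iceberg_query(rows, key, agg_attr, threshold):
    """SELECT key, SUM(agg_attr) FROM rows
       GROUP BY key HAVING SUM(agg_attr) >= threshold.

    rows: iterable of dicts. In-memory illustrative sketch only.
    """
    sums = Counter()
    for row in rows:
        sums[row[key]] += row[agg_attr]
    return {k: v for k, v in sums.items() if v >= threshold}
```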
Article
Dynamic Miss-Counting algorithms are proposed, which find all implication and similarity rules with confidence pruning but without support pruning. To handle data sets with a large number of columns, we propose dynamic pruning techniques that can be applied during data scanning. DMC counts the number of rows in which each pair of columns disagree i...
Article
Full-text available
Association-rule mining has proved a highly successful technique for extracting useful information from very large databases. This success is attributed not only to the appropriateness of the objectives, but to the fact that a number of new query-optimization ideas, such as the "a-priori" trick, make association-rule mining run much faster than mig...
Article
Existing data-integration systems based on the mediation architecture employ a variety of mechanisms to describe the query-processing capabilities of sources. However, these systems do not compute the capabilities of the mediators based on the capabilities of the sources they integrate. In this paper, we propose a framework to capture a rich varie...
Chapter
When an application is built, an underlying data model is chosen to make that application effective. Frequently, other applications need the same data, only modeled differently. The naïve solution of copying the underlying data and modeling is costly in terms of storage and makes data maintenance and evolution impossible. View mechanisms are a tech...
Article
Semistructured data has no absolute schema fixed in advance and its structure may be irregular or incomplete. Such data commonly arises in sources that do not impose a rigid structure (such as the World-Wide Web) and when data is combined from several heterogeneous sources. Data models and query languages designed for well structured data are ina...
Article
Mediators are used for integration of heterogeneous information sources. In this paper we present a system for declaratively specifying mediators. It is targeted for integration of sources with unstructured or semi-structured data and/or sources with changing schemas. In the paper we illustrate the main features of the Mediator Specification Langua...
Article
The goal of the Tsimmis Project is to develop tools that facilitate the rapid integration of heterogeneous information sources that may include both structured and unstructured data. This paper gives an overview of the project, describing components that extract properties from unstructured objects, that translate information into a common object m...
Article
In this paper, we show how the TSIMMIS mediator takes into account the capabilities of the sources to generate feasible query plans for user queries. Section 2 explains how the mediator processes user queries in general, and also discusses the way in which source capabilities are described in TSIMMIS. Section 3 describes the details of the capability...
Article
Full-text available
The database research community is rightly proud of success in basic research, and its remarkable record of technology transfer. Now the field needs to radically broaden its research focus to attack the issues of capturing, storing, analyzing, and presenting the vast array of online data. The database research community should embrace a broader res...
Conference Paper
In data integration systems, queries posed to a mediator need to be translated into a sequence of queries to the underlying data sources. In a heterogeneous environment, with sources of diverse and limited query capabilities, not all the translations are feasible. In this paper, we study the problem of finding feasible and efficient query plans for...
Conference Paper
Full-text available
Association-rule mining has proved a highly successful technique for extracting useful information from very large databases. This success is attributed not only to the appropriateness of the objectives, but to the fact that a number of new query-optimization ideas, such as the "a-priori" trick, make association-rule mining run much faster than mig...
Conference Paper
Association-rule mining has proved a highly successful technique for extracting useful information from very large databases. This success is attributed not only to the appropriateness of the objectives, but to the fact that a number of new query-optimization ideas, such as the “a-priori” trick, make association-rule mining run much faster than mig...
Article
We have made advances in the following areas: Data cubes: these recent data-warehouse products need a way to optimize the use of space by selecting some views to maintain permanently. We have identified the 'monotonicity' property (choosing one view cannot increase the value of materializing another view) as guaranteeing the existence of a polynomial-t...