About
412 Publications · 74,381 Reads
73,510 Citations
Publications (412)
Cloud data lakes have emerged as an inexpensive solution for storing very large amounts of data. The main idea is the separation of the compute and storage layers: cheap cloud storage holds the data, while compute engines run analytics on this data in "on-demand" mode. However, to perform any computation on the data in th...
Open-addressing hash tables, possibly under a double-hashing policy, are regarded as more memory-efficient than linked-list hashing, since the memory used for pointers can instead be used for a longer table; they also offer better expected performance, as the load factor is smaller and fewer collisions are expected. We suggest further eliminating the single pointer...
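As a concrete reference point, here is a minimal open-addressing table under a double-hashing policy in Python. It is an illustrative sketch only: the table size and the second hash function are arbitrary choices, and it does not reproduce the pointer-elimination scheme the abstract goes on to propose.

class DoubleHashTable:
    def __init__(self, capacity=11):
        self.capacity = capacity            # prime, so every probe step cycles the whole table
        self.slots = [None] * capacity

    def _probe(self, key):
        h1 = hash(key) % self.capacity                          # start slot
        h2 = 1 + (hash((key, "step")) % (self.capacity - 1))    # step, never 0
        for i in range(self.capacity):
            yield (h1 + i * h2) % self.capacity

    def insert(self, key, value):
        for slot in self._probe(key):
            if self.slots[slot] is None or self.slots[slot][0] == key:
                self.slots[slot] = (key, value)
                return
        raise RuntimeError("table full; a real implementation would resize")

    def get(self, key):
        for slot in self._probe(key):
            entry = self.slots[slot]
            if entry is None:               # empty slot ends the probe sequence
                return None
            if entry[0] == key:
                return entry[1]
        return None

t = DoubleHashTable()
t.insert("x", 1)
print(t.get("x"))   # 1

Because the capacity is prime and the step is nonzero, each probe sequence visits every slot once; the space a chaining scheme would spend on pointers goes into extra slots instead.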
Computations, such as syndromic surveillance and e-commerce, are executed over datasets collected from different geographical locations. Modern data-processing systems, such as MapReduce/Hadoop or Spark, likewise require collecting the data from the different geographical locations at a single global location before executing an application, and thus...
In recent years, an increasing amount of data is collected in different, and often non-cooperating, databases. The problem of privacy-preserving distributed calculations over separate databases and the related issue of private data release have been intensively investigated. However, despite considerable progress, computational complexity a...
Despite extensive research on cryptography, secure and efficient query processing over outsourced data remains an open challenge. This paper continues along the emerging trend in secure data processing that recognizes that the entire dataset may not be sensitive, and hence, non-sensitivity of data can be exploited to overcome limitations of ex...
Despite extensive research on cryptography, secure and efficient query processing over outsourced data remains an open challenge. This paper continues along the emerging trend in secure data processing that recognizes that the entire dataset may not be sensitive, and hence, non-sensitivity of data can be exploited to overcome limitations of existin...
Despite extensive research on cryptography, secure and efficient query processing over outsourced data remains an open challenge. This poster continues along the emerging trend in secure data processing that recognizes that the entire dataset may not be sensitive, and hence, non-sensitivity of data can be exploited to overcome some of the limitatio...
In this paper we offer an algorithm that computes the multiway join efficiently in MapReduce even when the data is skewed. Handling skew is one of the major challenges in query processing, and computing joins is both important and costly. When data is huge, distributed computational platforms must be used. The algorithm Shares for computing multiway...
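For orientation, here is the reducer-assignment step of the basic Shares scheme for a 3-way chain join R(a,b), S(b,c), T(c,d), sketched in Python. The share values (sb = sc = 4) are arbitrary rather than optimized, and the skew-handling refinement the paper develops is not captured here.

sb, sc = 4, 4          # number of hash buckets for join attributes b and c

def reducers_for(rel, tup):
    """Return the grid cells (hb, hc) that must receive this tuple."""
    if rel == "R":     # R(a,b) knows b, so it is replicated over all c-buckets
        a, b = tup
        return [(hash(b) % sb, hc) for hc in range(sc)]
    if rel == "S":     # S(b,c) knows both join attributes: exactly one reducer
        b, c = tup
        return [(hash(b) % sb, hash(c) % sc)]
    if rel == "T":     # T(c,d) knows c, so it is replicated over all b-buckets
        c, d = tup
        return [(hb, hash(c) % sc) for hb in range(sb)]

print(reducers_for("S", (1, 2)))        # a single cell
print(len(reducers_for("R", (0, 1))))   # 4: replicated across the c dimension

Every joining triple of tuples meets at the one grid cell determined by its b and c values, so each reducer can compute its part of the join locally.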
A MapReduce algorithm can be described by a mapping schema, which assigns inputs to a set of reducers, such that for each required output there exists a reducer that receives all the inputs participating in the computation of this output. Reducers have a capacity that limits the sets of inputs they can be assigned. However, individual inputs may va...
This panel critically examines the state of data: how its growth and ubiquity have confronted computer science, and particularly the database community, with new challenges. These challenges require practitioners and teachers to learn new skills and engage with other disciplines in ways they had not done before. Panelists will examine the impact...
We consider the problem of computing the data-cube marginals of a fixed order k (i.e., all marginals that aggregate over k dimensions), using a single round of MapReduce. The focus is on the relationship between the reducer size (number of inputs allowed at a single reducer) and the replication rate (number of reducers to which an input is sent). W...
In recent years, an increasing amount of data is collected in different, and often non-cooperating, databases. The problem of privacy-preserving distributed calculations over separated databases and the related issue of private data release have been intensively investigated. However, despite considerable progress, computational complexity, du...
We consider the problem of computing the data-cube marginals of a fixed order $k$ (i.e., all marginals that aggregate over $k$ dimensions), using a single round of MapReduce. The focus is on the relationship between the reducer size (number of inputs allowed at a single reducer) and the replication rate (number of reducers to which an input is sent...
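To make the setting concrete, the sketch below enumerates the keys a single tuple contributes to when all order-k marginals are computed naively, one reducer per marginal; the function name and representation are illustrative, not from the paper, whose subject is how packing several marginals per reducer beats this baseline.

from itertools import combinations

def marginal_keys(tup, k):
    """Keys of the order-k marginals this n-dimensional tuple feeds."""
    n = len(tup)
    for agg in combinations(range(n), k):       # dimensions aggregated away
        key = tuple((i, v) for i, v in enumerate(tup) if i not in agg)
        yield agg, key

# A 4-dimensional point feeds C(4,2) = 6 second-order marginals,
# so the naive scheme has replication rate C(n,k):
print(sum(1 for _ in marginal_keys((1, 7, 3, 9), k=2)))   # 6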
The federation of cloud and big data activities is the next challenge where MapReduce should be modified to avoid (big) data migration across remote (cloud) sites. This is exactly our scope of research, where only the very essential data for obtaining the result is transmitted, reducing communication, processing and preserving data privacy as much...
An important parameter to be considered in MapReduce algorithms is the "reducer capacity," which is introduced here for the first time. The reducer capacity is an upper bound on the sum of the sizes of the inputs that are assigned to the reducer. We consider, for the first time, the different sizes of the inputs that are sent to the reducers. Another s...
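The capacity constraint turns reducer assignment into a packing problem. As a point of comparison, here is the textbook first-fit-decreasing heuristic for packing variably sized inputs under a capacity q; it is a hedged sketch of the flavor of the problem, not the paper's mapping-schema algorithms, which must additionally co-locate the inputs of each required output.

def first_fit_decreasing(sizes, q):
    """Pack input sizes into reducers of capacity q, fewest-first heuristic."""
    reducers = []                       # each reducer is a list of input sizes
    for s in sorted(sizes, reverse=True):
        for r in reducers:
            if sum(r) + s <= q:         # first reducer with room takes it
                r.append(s)
                break
        else:
            reducers.append([s])        # no room anywhere: open a new reducer
    return reducers

print(first_fit_decreasing([5, 4, 3, 3, 2, 1], q=8))   # [[5, 3], [4, 3, 1], [2]]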
Handling skew is one of the major challenges in query processing. In distributed computational environments such as MapReduce, uneven distribution of the data to the servers is not desired. One of the dominant measures that we want to optimize in distributed environments is communication cost. In a MapReduce job this is the amount of data that is t...
We consider the problem of 2-way interval join, where we want to find all pairs of overlapping intervals, i.e., intervals that share at least one point in common. We present lower and upper bounds on the replication rate for this problem when it is implemented in MapReduce. We study three cases, where intervals in the input are: (i) unit-length and...
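For the unit-length case, a simple grid-bucketing scheme illustrates where replication comes from: each interval is sent to the integer cells it covers, and overlapping intervals always share a cell. The sketch below is an assumption-laden illustration (cell width, half-open endpoints, and the duplicate-suppression rule are my choices), not the paper's bounds.

import math
from collections import defaultdict

def interval_join(starts):
    """All overlapping pairs among unit intervals [x, x+1)."""
    cells = defaultdict(list)
    for x in starts:
        cells[math.floor(x)].append(x)       # cell containing the left end
        cells[math.floor(x) + 1].append(x)   # cell containing the right end
    out = set()
    for cell, xs in cells.items():
        for i in range(len(xs)):
            for j in range(i + 1, len(xs)):
                a, b = sorted((xs[i], xs[j]))
                # overlap iff b < a+1; report only in the cell of the
                # later left endpoint, so each pair is emitted once
                if b < a + 1 and math.floor(b) == cell:
                    out.add((a, b))
    return out

print(interval_join([0.2, 0.9, 2.5]))   # {(0.2, 0.9)}; 2.5 overlaps neither

Each interval is replicated to two cells, so the replication rate of this naive scheme is 2; the paper's subject is how close to optimal such rates can be.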
A MapReduce algorithm can be described by a mapping schema, which assigns inputs to a set of reducers, such that for each required output there exists a reducer that receives all the inputs that participate in the computation of this output. Reducers have a capacity, which limits the sets of inputs that they can be assigned. However, individual inp...
We study the problem of computing the join of $n$ relations in multiple rounds of MapReduce. We introduce a distributed and generalized version of Yannakakis's algorithm, called GYM. GYM takes as input any generalized hypertree decomposition (GHD) of a query of width $w$ and depth $d$, and computes the query in $O(d)$ rounds and $O(n(\mathrm{IN}^w...
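For orientation, here is the sequential kernel that GYM generalizes: the semijoin reduction of Yannakakis's algorithm over a join tree. The tiny chain query and data are illustrative stand-ins; distributing this over GHDs in O(d) rounds, which is GYM's contribution, is not captured by the sketch.

def semijoin(parent, child, attrs):
    """Keep parent tuples whose projection on attrs appears in child."""
    seen = {tuple(t[a] for a in attrs) for t in child}
    return [t for t in parent if tuple(t[a] for a in attrs) in seen]

# Chain query R(a,b), S(b,c), T(c,d) with join tree rooted at S:
R = [{"a": 1, "b": 2}, {"a": 5, "b": 9}]
S = [{"b": 2, "c": 3}, {"b": 2, "c": 4}]
T = [{"c": 3, "d": 7}]

S = semijoin(S, R, ["b"])   # bottom-up pass: reduce the root by each child
S = semijoin(S, T, ["c"])
print(S)                    # [{'b': 2, 'c': 3}]: dangling tuples are gone

A full run would follow with a top-down pass and the final joins; the point of the reduction is that no intermediate result ever exceeds the output size.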
In Dremel, data is stored as nested relations. The schema for a relation is a tree, all of whose nodes are attributes, and whose leaf attributes hold values. We explore filter and aggregate queries that are given in the Dremel dialect of SQL. Complications arise because of repeated attributes, i.e., attributes that are allowed to have more than one...
The popularity of the Web and Internet commerce provides many extremely large datasets from which information can be gleaned by data mining. This book focuses on practical algorithms that have been used to solve key problems in data mining and which can be used on even the largest datasets. It begins with a discussion of the map-reduce framework, a...
The all-pairs problem is an input-output relationship where each output corresponds to a pair of inputs, and each pair of inputs has a corresponding output. It models similarity joins where no simplification of the search for similar pairs, e.g., locality-sensitive hashing, is possible, and each input must be compared with every other input to dete...
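A standard grouping scheme makes the cost structure of all-pairs visible: split the n inputs into g groups and let one reducer handle each pair of groups, so every pair of inputs meets exactly once and each input is replicated to g-1 reducers. This sketch (the tie-breaking conventions for within-group pairs are my own) is the baseline such papers measure against, not the abstract's bounds.

from itertools import combinations
from collections import defaultdict

def all_pairs(inputs, g):
    """Emit every unordered pair exactly once; assumes g >= 2."""
    groups = defaultdict(list)
    for x in inputs:
        groups[hash(x) % g].append(x)
    pairs = []
    for i, j in combinations(range(g), 2):
        # cross pairs between groups i and j meet only at reducer {i, j}
        pairs += [(a, b) for a in groups[i] for b in groups[j]]
        if j == i + 1:                 # group i's own pairs live at {i, i+1}
            pairs += list(combinations(groups[i], 2))
        if i == 0 and j == g - 1:      # last group's own pairs live at {0, g-1}
            pairs += list(combinations(groups[j], 2))
    return pairs

print(len(all_pairs(range(10), g=5)))   # C(10,2) = 45 pairs, each once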
Recently, a great deal of interest in Big Data has arisen, driven mainly by a widespread number of research problems strongly related to real-life applications and systems, such as representing, modeling, processing, querying and mining massive, distributed, large-scale repositories (mostly of an unstructured nature). Inspired by this main tre...
As MapReduce/Hadoop grows in importance, we find more exotic applications being written this way. Not every program written for this platform performs as well as we might wish. There are several reasons why a MapReduce program can underperform expectations. One is the need to balance the communication cost of transporting data from the mappers to t...
An introduction to designing algorithms for the MapReduce framework for parallel processing of big data.
The theme of this paper is how to find all instances of a given "sample" graph in a larger "data graph," using a single round of map-reduce. For the simplest sample graph, the triangle, we improve upon the best known such algorithm. We then examine the general case, considering both the communication cost between mappers and reducers and the total...
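The commonly used bucketing scheme for one-round triangle finding gives a feel for the cost model (the abstract's improved algorithm is not reproduced here): hash nodes into b buckets, create one reducer per multiset of three buckets, and replicate each edge to every reducer whose triple contains both endpoints' buckets, so every triangle is seen whole somewhere. The bucket count is an arbitrary illustrative choice.

from itertools import combinations_with_replacement

b = 4                       # number of node buckets (illustrative)

def bucket(v):
    return hash(v) % b

def reducers_for_edge(u, v):
    bu, bv = bucket(u), bucket(v)
    return [trip for trip in combinations_with_replacement(range(b), 3)
            if bu in trip and bv in trip]

# Each edge goes to O(b) of the O(b^3) reducers:
print(len(reducers_for_edge(1, 2)))   # 4 of the 20 bucket-triples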
In this paper we study the tradeoff between parallelism and communication cost in a map-reduce computation. For any problem that is not "embarrassingly parallel," the finer we partition the work of the reducers so that more parallelism can be extracted, the greater will be the total communication between mappers and reducers. We introduce a model o...
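The tradeoff can be made concrete on the all-pairs problem, a standard example in this line of work. Assuming a reducer may receive at most $q$ of the $n$ inputs, each reducer an input attends covers at most $q-1$ of the $n-1$ pairs that input belongs to, which bounds the replication rate $r$ (reducers per input) from below:

\[
  r \;\ge\; \frac{n-1}{q-1},
  \qquad
  \text{total communication} \;=\; r\,n \;\ge\; \frac{n(n-1)}{q-1}.
\]

So halving the reducer size $q$, which buys more parallelism, roughly doubles the communication: exactly the tradeoff such a model formalizes.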
A significant amount of recent research work has addressed the problem of solving various data management problems in the cloud. The major algorithmic challenges in map-reduce computations involve balancing a multitude of factors such as the number of machines available for mappers/reducers, their memory requirements, and communication cost (total...
Fuzzy/similarity joins have been widely studied in the research community and extensively used in real-world applications. This paper proposes and evaluates several algorithms for finding all pairs of elements from an input set that meet a similarity threshold. The computation model is a single MapReduce job. Because we allow only one MapReduce rou...
In this paper, we design and analyze parallel algorithms for skyline queries. The skyline of a multidimensional set consists of the points for which no other point exists that is at least as good along every dimension. As a framework for parallel computation, we use both the MP model proposed in (Koutris and Suciu, PODS 2011), which requires that t...
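For reference, the skyline itself is easy to state in code; the sequential filter below (with "smaller is better" assumed for every dimension) is the specification the paper's parallel algorithms compute, not those algorithms themselves.

def dominates(p, q):
    """p dominates q: at least as good everywhere, strictly better somewhere."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def skyline(points):
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

print(skyline([(1, 4), (2, 2), (3, 3), (4, 1)]))   # (3, 3) is dominated by (2, 2)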
Implementations of map-reduce are being used to perform many operations on very large data. We examine strategies for joining several relations in the map-reduce environment. Our new approach begins by identifying the "map-key," the set of attributes that identify the Reduce process to which a Map process must send a particular tuple. Each attribut...
Implementations of map-reduce are being used to perform many operations on very large data. We explore alternative ways that a system could use the environment and capabilities of map-reduce implementations such as Hadoop. In particular, we look at strategies for combining the natural join of several relations. The general strategy we em...
While classic data management focuses on the data itself, research on Business Processes considers also the context in which this data is generated and manipulated, namely the processes, the users, and the goals that this data serves. This allows ...
We survey the recent wave of extensions to the popular map-reduce systems, including those that have begun to address the implementation of recursive queries using the same computing environment as map-reduce. A central problem is that recursive tasks cannot deliver their output only at the end, which makes recovery from failures much more complica...
Implementing recursive algorithms on computing clusters presents a number of new challenges. In particular, we consider the endgame problem: later rounds of a recursion often transfer only small amounts of data, causing high overhead for interprocessor communication. One way to deal with the endgame problem is to use an algorithm that reduces the n...
We consider the problem of recommending the best set of k items when there is an inherent ordering between items, expressed as a set of prerequisites (e.g., the movie 'Godfather I' is a prerequisite of 'Godfather II'). Since this general problem is computationally intractable, we develop 3 approximation algorithms to solve this problem for various...
There has been considerable past work studying data integration and uncertain data in isolation. We develop the theory of local-as-view (LAV) data integration when the sources being integrated are uncertain. We motivate two distinct settings for uncertain-data integration. We then define containment of uncertain databases in these settings, which a...
The cluster-computing environment typified by Hadoop, the open-source implementation of map-reduce, is receiving serious attention as the way to execute queries and other operations on very large-scale data. Datalog execution presents several unusual issues for this environment. We discuss the best way to execute a round of seminaive evaluation on a...
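To fix ideas, here is seminaive evaluation for the canonical recursive Datalog program, transitive closure: path(x,y) :- arc(x,y) and path(x,y) :- path(x,z), arc(z,y). Each round joins only the newly derived facts (the "delta") with arc; distributing exactly this round is the step under discussion. The sequential sketch is mine, not the paper's execution strategy.

def seminaive_tc(arcs):
    """Transitive closure by seminaive evaluation."""
    path = set(arcs)
    delta = set(arcs)
    while delta:
        # join only the newly derived facts with the base relation
        new = {(x, w) for (x, y) in delta for (z, w) in arcs if y == z}
        delta = new - path          # keep only genuinely new facts
        path |= delta
    return path

print(sorted(seminaive_tc({(1, 2), (2, 3), (3, 4)})))
# [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]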
We address schema design in uncertain databases. Since uncertain data is relational in nature, decomposition becomes a key issue in design. Decomposition relies on dependency theory, and primarily on functional dependencies. We study the theory of functional dependencies (FDs) for uncertain relations. We define several kinds of horizonta...
We study the problem of generating efficient, equivalent rewritings using views to compute the answer to a query. We take the closed-world assumption, in which views are materialized from base relations, rather than views describing sources in terms of abstract predicates, as is common when the open-world assumption is used. In the closed-world mod...
This book provides the foundation for understanding the theory and practice of compilers.
A number of ideas concerning information-integration tools can be thought of as constructing answers to queries using views that represent the capabilities of information sources. We review the formal basis of these techniques, which are closely related to containment algorithms for conjunctive queries and/or Datalog programs. Then we compare the a...
Wrappers provide access to heterogeneous information sources by converting application queries into source specific queries or commands. In this paper we present a wrapper implementation toolkit that facilitates rapid development of wrappers. We focus on the query translation component of the toolkit, called the converter. The converter takes as in...
The database research conducted at Lowell, with a focus on the integration of text, data, and code, the fusion of information from heterogeneous data sources, and information privacy, is discussed. The object-oriented (OO) and object-relational (OR) database management systems (DBMS) showed how text and other data types can be added to a DBMS. Several goals ment...
Gradiance On-Line Accelerated Learning (GOAL) is a system for creating and automatically grading homeworks, programming laboratories, and tests. Through the concept of "root questions," Gradiance encourages students to solve complete problems, even though the homework appears to be in a multiple-choice format. Students are offered a hint or advice...
The education industry has a very poor record of productivity gains. In this brief article, I outline some of the ways the teaching of a college course in database systems could be made more efficient, and staff time used more productively. These ideas carry over to other programming-oriented courses, and many of them apply to any academic subject...
As database system research evolves, there are several enduring themes. One, of course, is how we deal with the largest possible amounts of data. A less obvious theme is optimization: it is an essential ingredient of all modern forms of database system. Because we deal with large volumes of data, we are often forced to process that data in regular ways....
The goal of the Tsimmis Project is to develop tools that facilitate the rapid integration of heterogeneous information sources that may include both structured and unstructured data. This paper gives an overview of the project, describing components that extract properties from unstructured objects, that translate information into a common object model, th...
We consider the problem of analyzing market-basket data and present several important contributions. First, we present a new algorithm for finding large itemsets which uses fewer passes over the data than classic algorithms, and yet uses fewer candidate itemsets than methods based on sampling. We investigate the idea of item reordering, which can i...
The problem of answering queries using views has been studied extensively due to its relevance in a wide variety of data-management applications. In these applications, we often need to select a subset of views to maintain due to limited resources. In this paper, we show that traditional query containment is not a good basis for deciding whether or...
The TSIMMIS system [1] integrates data from multiple heterogeneous sources and provides users with seamless integrated views of the data. It translates a user query on the integrated views into a set of source queries and postprocessing steps that compute the answer to the user query from the results of the source queries. TSIMMIS uses...
In data-integration systems, the queries supported by a mediator are affected by the query-processing limitations of the sources being integrated. Existing mediation systems employ a variety of mechanisms to describe the query-processing capabilities of sources. However, these systems do not compute the capabilities of the mediators based on the cap...
Several commercial applications, such as online comparison shopping and process automation, require integrating information that is scattered across multiple websites or XML documents. Much research has been devoted to this problem, resulting in several research prototypes and commercial implementations. Such systems rely on wrappers that provide r...
Data Structures is a first book on algorithms and data structures, using an object-oriented approach. The target audience for the book is a second-year CS class introducing fundamental data structures and their associated algorithms. This second ...
Association-rule mining has heretofore relied on the condition of high support to do its work efficiently. In particular, the well-known a-priori algorithm is only effective when the only rules of interest are relationships that occur very frequently. However, there are a number of applications, such as data mining, identification of similar Web do...
The standard model for association-rule mining involves a set of “items” and a set of “baskets.” The baskets contain items that some customer has purchased at the same time. The problem is to find pairs, or perhaps larger sets, of items that frequently appear together in baskets. We mention the principal approaches to efficient, large-scale discove...
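At the heart of those approaches is the a-priori pruning idea: a pair can be frequent only if both of its items are frequent alone. A minimal two-pass sketch (the function name and toy baskets are illustrative):

from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, support):
    item_counts = Counter(i for b in baskets for i in b)          # pass 1: items
    frequent = {i for i, c in item_counts.items() if c >= support}
    pair_counts = Counter()                                       # pass 2: only
    for b in baskets:                                             # surviving pairs
        for p in combinations(sorted(i for i in b if i in frequent), 2):
            pair_counts[p] += 1
    return {p: c for p, c in pair_counts.items() if c >= support}

baskets = [{"milk", "bread"}, {"milk", "bread", "beer"}, {"milk", "beer"}]
print(frequent_pairs(baskets, support=2))
# {('bread', 'milk'): 2, ('beer', 'milk'): 2}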
Many applications compute aggregate functions (such as COUNT, SUM) over an attribute (or set of attributes) to find aggregate values above some specified threshold. We call such queries iceberg queries because the number of above-threshold results is often very small (the tip of an iceberg), relative to the large amount of input data (the iceberg)....
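One idea from this setting, coarse counting, can be sketched briefly: a first pass counts hash buckets instead of individual values, and since a bucket's count upper-bounds the count of every value in it, only values in heavy buckets need exact counting in a second pass. Bucket count, hash, and data below are illustrative choices, not the paper's tuned algorithms.

from collections import Counter

def iceberg(values, threshold, n_buckets=1024):
    buckets = Counter(hash(v) % n_buckets for v in values)    # pass 1: coarse
    exact = Counter(v for v in values                         # pass 2: candidates
                    if buckets[hash(v) % n_buckets] >= threshold)
    return {v: c for v, c in exact.items() if c >= threshold}

data = ["a"] * 5 + ["b"] * 2 + ["c"]
print(iceberg(data, threshold=3))   # {'a': 5}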
Dynamic Miss-Counting algorithms are proposed, which find all implication and similarity rules with confidence pruning but without support pruning. To handle data sets with a large number of columns, we propose dynamic pruning techniques that can be applied during data scanning. DMC counts the number of rows in which each pair of columns disagree i...
Association-rule mining has proved a highly successful technique for extracting useful information from very large databases. This success is attributed not only to the appropriateness of the objectives, but to the fact that a number of new query-optimization ideas, such as the "a-priori" trick, make association-rule mining run much faster than mig...
Existing data-integration systems based on the mediation architecture employ a variety of mechanisms to describe the query-processing capabilities of sources. However, these systems do not compute the capabilities of the mediators based on the capabilities of the sources they integrate. In this paper, we propose a framework to capture a rich varie...
When an application is built, an underlying data model is chosen to make that application effective. Frequently, other applications need the same data, only modeled differently. The naïve solution of copying the underlying data and remodeling it is costly in terms of storage and makes data maintenance and evolution impossible. View mechanisms are a tech...
Semistructured data has no absolute schema fixed in advance and its structure may be irregular or incomplete. Such data commonly arises in sources that do not impose a rigid structure (such as the World-Wide Web) and when data is combined from several heterogeneous sources. Data models and query languages designed for well structured data are ina...
Mediators are used for integration of heterogeneous information sources. In this paper we present a system for declaratively specifying mediators. It is targeted for integration of sources with unstructured or semi-structured data and/or sources with changing schemas. In the paper we illustrate the main features of the Mediator Specification Langua...
The goal of the Tsimmis Project is to develop tools that facilitate the rapid integration of heterogeneous information sources that may include both structured and unstructured data. This paper gives an overview of the project, describing components that extract properties from unstructured objects, that translate information into a common object m...
In this paper, we show how the TSIMMIS mediator takes into account the capabilities of the sources to generate feasible query plans for user queries. Section 2 explains how the mediator processes user queries in general, and also discusses the way in which source capabilities are described in TSIMMIS. Section 3 describes the details of the capability...
The database research community is rightly proud of success in basic research, and its remarkable record of technology transfer. Now the field needs to radically broaden its research focus to attack the issues of capturing, storing, analyzing, and presenting the vast array of online data. The database research community should embrace a broader res...
In data integration systems, queries posed to a mediator need to be translated into a sequence of queries to the underlying data sources. In a heterogeneous environment, with sources of diverse and limited query capabilities, not all the translations are feasible. In this paper, we study the problem of finding feasible and efficient query plans for...
We have made advances in the following areas: Data cubes: these recent data-warehouse products need a way to optimize the use of space by selecting some views to maintain permanently. We have identified the 'monotonicity' property (choosing one view cannot increase the value of materializing another view) as guaranteeing the existence of a polynomial-t...