Mourad Ouzzani

Qatar Computing Research Institute, Ad Dawḩah, Baladīyat ad Dawḩah, Qatar

Publications (122) · 46.83 Total Impact

  • ABSTRACT: Inequality joins, which join relational tables on inequality conditions, are used in various applications. While there has been a wide range of optimization methods for joins in database systems, from algorithms such as sort-merge join and band join, to various indices such as B+-tree, R∗-tree and Bitmap, inequality joins have received little attention and queries containing such joins are usually very slow. In this paper, we introduce fast inequality join algorithms. We put columns to be joined in sorted arrays and we use permutation arrays to encode positions of tuples in one sorted array w.r.t. the other sorted array. In contrast to sort-merge join, we use space-efficient bit-arrays that enable optimizations, such as Bloom filter indices, for fast computation of the join results. We have implemented a centralized version of these algorithms on top of PostgreSQL, and a distributed version on top of Spark SQL. We have compared against well-known optimization techniques for inequality joins and show that our solution is more scalable and several orders of magnitude faster.
    No preview · Conference Paper · Jan 2016
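To make the idea above concrete, here is a hypothetical toy sketch of an inequality self-join with two predicates (r.X < s.X and r.Y > s.Y): tuples are scanned in one sort order, a permutation array maps each tuple to its position in the other sort order, and a bit-array over that order marks the tuples already seen. It assumes distinct values and enumerates results naively; the published algorithms also handle ties, non-self joins, Bloom-filter-style indices over the bit-array, and a distributed Spark SQL variant.

```python
def inequality_self_join(tuples):
    """tuples: list of (tid, x, y); return pairs (r_tid, s_tid) with r.x < s.x and r.y > s.y."""
    n = len(tuples)
    order_x = sorted(range(n), key=lambda i: tuples[i][1])      # scan order: ascending X
    order_y = sorted(range(n), key=lambda i: tuples[i][2])      # ascending Y
    rank_y = {idx: r for r, idx in enumerate(order_y)}          # permutation array: tuple -> Y position
    seen = [False] * n                                          # bit-array over Y positions
    result = []
    for s_idx in order_x:                                       # every tuple seen earlier has a smaller X
        s_rank = rank_y[s_idx]
        for pos in range(s_rank + 1, n):                        # already-seen tuples with a larger Y
            if seen[pos]:
                result.append((tuples[order_y[pos]][0], tuples[s_idx][0]))
        seen[s_rank] = True
    return result

rows = [("t1", 100, 9), ("t2", 140, 7), ("t3", 80, 12)]
print(inequality_self_join(rows))   # [('t3', 't1'), ('t1', 't2'), ('t3', 't2')]
```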
  • ABSTRACT: Declarative rules, such as functional dependencies, are widely used for cleaning data. Several systems take them as input for detecting errors and computing a "clean" version of the data. To support domain experts in specifying these rules, several tools have been proposed to profile the data and mine rules. However, existing discovery techniques have traditionally ignored the time dimension. Recurrent events, such as persons reported in locations, have a duration in which they are valid, and this duration should be part of the rules or the cleaning process would simply fail. In this work, we study the rule discovery problem for temporal web data. Such a discovery process is challenging because of the nature of web data; extracted facts are (i) sparse over time, (ii) reported with delays, and (iii) often reported with errors over the values because of inaccurate sources or non-robust extractors. We handle these challenges with a new discovery approach that is more robust to noise. Our solution uses machine learning methods, such as association measures and outlier detection, for the discovery of the rules, together with an aggressive repair of the data in the mining step itself. Our experimental evaluation over real-world data from Recorded Future, an intelligence company that monitors over 700K Web sources, shows that temporal rules improve the quality of the data, with an increase of the average precision in the cleaning process from 0.37 to 0.84, and a 40% relative increase in the average F-measure.
    Full-text · Article · Dec 2015 · Proceedings of the VLDB Endowment
  • ABSTRACT: Identifying similarities in large datasets is an essential operation in several applications such as bioinformatics, pattern recognition, and data integration. To make a relational database management system similarity-aware, the core relational operators have to be extended. While similarity-awareness has been introduced in database engines for relational operators such as joins and group-by, little has been achieved for relational set operators, namely Intersection, Difference, and Union. In this paper, we propose to extend the semantics of relational set operators to take into account the similarity of values. We develop efficient query processing algorithms for evaluating them, and implement these operators inside an open-source database system, namely PostgreSQL. By extending several queries from the TPC-H benchmark to include predicates that involve similarity-based set operators, we perform extensive experiments that demonstrate up to three orders of magnitude speedup in performance over equivalent queries that only employ regular operators.
    No preview · Article · Nov 2015
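A hypothetical one-column illustration of the similarity-aware semantics: a value of R survives a similarity Intersection if some value of S lies within a user-given threshold. The sketch below only pins down the semantics; the paper's contribution is evaluating such operators efficiently inside PostgreSQL.

```python
import bisect

def sim_intersect(r_vals, s_vals, eps):
    """Return the values of R that have at least one S value within eps."""
    s_sorted = sorted(s_vals)
    out = []
    for v in r_vals:
        lo = bisect.bisect_left(s_sorted, v - eps)           # first S value >= v - eps
        if lo < len(s_sorted) and s_sorted[lo] <= v + eps:    # is it also <= v + eps?
            out.append(v)
    return out

print(sim_intersect([1.0, 5.0, 9.7], [0.8, 9.9], eps=0.5))   # [1.0, 9.7]
```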
  • ABSTRACT: We tackle the problem of automatically filtering studies while preparing Systematic Reviews (SRs), which normally entails manually inspecting thousands of studies to identify the few to be included. The problem is modeled as an imbalanced data classification task where the cost of misclassifying the minority class is higher than the cost of misclassifying the majority class. This work introduces a novel method for representing systematic reviews based not only on lexical features but also on word clustering and citation features. This novel representation is shown to outperform previously used features in representing systematic reviews, regardless of the classifier. Our work utilizes a random forest classifier with the novel features to accurately predict included studies with high recall. The parameters of the random forest are automatically configured using heuristic methods, thus allowing us to provide a product that is usable in real scenarios. Experiments on a dataset containing 15 systematic reviews that were prepared by health care professionals show that our approach can achieve high recall while helping the SR author save time.
    No preview · Article · Oct 2015 · Machine Learning
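As a rough, hypothetical sketch of the classification step (not the paper's exact pipeline or features), a scikit-learn random forest can be biased toward recall on the rare "include" class by weighting it more heavily; the paper's representation additionally adds word-cluster and citation features and tunes the forest automatically.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy citations and labels; real reviews have thousands of records and very few positives.
docs = ["randomized trial of drug A in adults", "in vitro study of compound B",
        "protocol of a drug A trial", "editorial on screening guidelines"]
included = [1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(docs)                    # purely lexical features here
clf = RandomForestClassifier(n_estimators=200,
                             class_weight={0: 1, 1: 10},     # missing an included study costs more
                             random_state=0).fit(X, included)
print(clf.predict(X))
```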
  • ABSTRACT: The ubiquity of location-aware devices, e.g., smartphones and GPS devices, has led to a plethora of location-based services in which huge amounts of geotagged information need to be efficiently processed by large-scale computing clusters. This demo presents AQWA, an adaptive and query-workload-aware data partitioning mechanism for processing large-scale spatial data. Unlike existing cluster-based systems, e.g., SpatialHadoop, that apply static partitioning of spatial data, AQWA has the ability to react to changes in the query-workload and data distribution. A key feature of AQWA is that it does not assume prior knowledge of the query-workload or data distribution. Instead, AQWA reacts to changes in both the data and the query-workload by incrementally updating the partitioning of the data. We demonstrate two prototypes of AQWA deployed over Hadoop and Spark. In both prototypes, we process spatial range and k-nearest-neighbor (kNN, for short) queries over large-scale spatial datasets, and we explore the performance of AQWA under different query-workloads.
    Full-text · Conference Paper · Sep 2015
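The flavor of the workload-aware decision can be sketched as follows (a hypothetical simplification, not AQWA's actual cost model or thresholds): charge each partition the product of the records it holds and the queries that touch it, and split a partition only when the split reduces that charge by more than the repartitioning overhead.

```python
def partition_cost(num_records, num_overlapping_queries):
    # A partition is expensive when it is both large and frequently touched by queries.
    return num_records * num_overlapping_queries

def should_split(parent, left, right, split_overhead):
    # parent/left/right are (num_records, num_overlapping_queries) tuples.
    saving = partition_cost(*parent) - (partition_cost(*left) + partition_cost(*right))
    return saving > split_overhead

# A hot 1M-record partition whose queries concentrate on one half:
print(should_split((1_000_000, 500), (600_000, 480), (400_000, 20),
                   split_overhead=2_000_000))   # True
```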
  • Article: KATARA
    Xu Chu · John Morcos · Ihab F. Ilyas · Mourad Ouzzani · Paolo Papotti · Nan Tang · Yin Ye

    No preview · Article · Aug 2015 · Proceedings of the VLDB Endowment
  • Full-text · Article · Aug 2015 · Proceedings of the VLDB Endowment
  • ABSTRACT: Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general-purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, with no requirement to be aware of the underlying distributed platform. BigDansing translates these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized join operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems by up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.
    Full-text · Conference Paper · Jun 2015
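For a sense of what a rule looks like, here is a hypothetical functional dependency zipcode -> city written procedurally as a detect() over tuple pairs, with a naive all-pairs enumeration; BigDansing's point is that such rules are expressed through a small set of logical operators and then compiled into optimized, distributed physical plans rather than run as a nested loop.

```python
def violates_fd(t1, t2):
    """True if the pair violates the functional dependency zipcode -> city."""
    return t1["zipcode"] == t2["zipcode"] and t1["city"] != t2["city"]

rows = [{"zipcode": "10001", "city": "New York"},
        {"zipcode": "10001", "city": "NYC"},
        {"zipcode": "60601", "city": "Chicago"}]

violations = [(i, j) for i in range(len(rows))
              for j in range(i + 1, len(rows)) if violates_fd(rows[i], rows[j])]
print(violations)   # [(0, 1)]
```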
  • ABSTRACT: While syntactic transformations require the application of a formula on the input values, such as unit conversion or date format conversions, semantic transformations, such as zip code to city, require a look-up in some reference data. We recently presented DataXFormer, a system that leverages Web tables, Web forms, and expert sourcing to cover a wide range of transformations. In this demonstration, we present the user interaction with DataXFormer and show scenarios of how it can be used to transform data, and explore the effectiveness and efficiency of several approaches for transformation discovery, leveraging about 112 million tables and online sources.
    Full-text · Conference Paper · Jun 2015
  • ABSTRACT: Data-intensive Web applications usually require integrating data from Web sources at query time. The sources may refer to the same real-world entity in different ways and some may even provide outdated or erroneous data. An important task is to recognize and merge the records that refer to the same real-world entity at query time. Most existing duplicate detection and fusion techniques work in the off-line setting and do not meet the online constraint. There are at least two aspects that differentiate online duplicate detection and fusion from its off-line counterpart. (i) The latter assumes that the entire data is available, while the former cannot make such an assumption. (ii) Several query submissions may be required to compute the "ideal" representation of an entity in the online setting. This paper presents a general framework for the online setting based on an iterative record-based caching technique. A set of frequently requested records is deduplicated off-line and cached for future reference. Newly arriving records in response to a query are deduplicated jointly with the records in the cache, presented to the user, and appended to the cache. Experiments with real and synthetic data show the benefit of our solution over traditional record linkage techniques applied to an online setting.
    Full-text · Conference Paper · Apr 2015
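A hypothetical sketch of the caching loop described above: clusters of records already believed to co-refer are kept in a cache; records arriving for a new query are matched against the cache, fused into an existing cluster or added as a new one, and the cache keeps growing for later queries. The string matcher is a stand-in for a real record-linkage similarity function.

```python
from difflib import SequenceMatcher

def same_entity(a, b, threshold=0.8):
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio() >= threshold

cache = []   # each entry is a cluster of records believed to describe one real-world entity

def answer_query(fresh_records):
    for rec in fresh_records:
        for cluster in cache:
            if any(same_entity(rec, member) for member in cluster):
                cluster.append(rec)          # fuse with a known entity
                break
        else:
            cache.append([rec])              # previously unseen entity
    return cache

answer_query([{"name": "Jane A. Doe"}, {"name": "J. Smith"}])
print(answer_query([{"name": "Jane Doe"}]))  # 'Jane Doe' joins the 'Jane A. Doe' cluster
```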
  • ABSTRACT: Data curation activities in collaborative databases mandate that collaborators interact until they converge and agree on the content of their data. Typically, updates by a member of the collaboration are made visible to all collaborators for comments, but at the same time are pending the approval or rejection of the data custodian, e.g., the principal scientist or investigator (PI). In current database technologies, approval and authorization of updates is based solely on the identity of the user, e.g., via the SQL GRANT and REVOKE commands. However, in collaborative environments, the updated data is open to collaborators for discussion and further editing, and is finally approved or rejected by the PI based on the content of the data and not on the identity of the updater. In this paper, we introduce a cloud-based collaborative database system that promotes and enables collaboration and data curation scenarios. We realize content-based update approval and history tracking of updates inside HBase, a distributed and scalable open-source cluster-based database. The design and implementation, as well as a detailed performance study of several approaches for update approval, are presented and contrasted in the paper.
    No preview · Article · Mar 2015
  • ABSTRACT: Data transformation is a crucial step in data integration. While some transformations, such as liters to gallons, can be easily performed by applying a formula or a program on the input values, others, such as zip code to city, require sifting through a repository containing explicit value mappings. There are already powerful systems that provide formulae and algorithms for transformations. However, the automated identification of reference datasets to support value mapping remains largely unresolved. The Web is home to millions of tables, many of which contain explicit value mappings. This is in addition to value mappings hidden behind Web forms. In this paper, we present DataXFormer, a transformation engine that leverages Web tables and Web forms to perform transformation tasks. In particular, we describe an inductive, filter-refine approach for identifying explicit transformations in a corpus of Web tables and an approach to dynamically retrieve and wrap Web forms. Experiments show that the combination of both resource types covers more than 80% of transformation queries formulated by real-world users.
    Full-text · Conference Paper · Jan 2015
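A hypothetical, heavily simplified rendering of the filter-refine idea over Web tables: given a few example pairs, score every column pair of every candidate table by how many examples it reproduces, then use the best mapping to transform the remaining values. The real system filters candidates with an index over roughly 112 million tables, iterates inductively, and also wraps Web forms.

```python
examples = {"10001": "New York", "60601": "Chicago"}   # known x -> y example pairs
to_transform = ["94103"]

web_tables = [
    [["10001", "New York"], ["94103", "San Francisco"], ["60601", "Chicago"]],
    [["red", "rouge"], ["blue", "bleu"]],
]

def best_mapping(tables, examples):
    best, best_cover = None, 0
    for table in tables:
        for cx in range(len(table[0])):                 # candidate source column
            for cy in range(len(table[0])):             # candidate target column
                if cx == cy:
                    continue
                mapping = {row[cx]: row[cy] for row in table}
                cover = sum(mapping.get(x) == y for x, y in examples.items())
                if cover > best_cover:
                    best, best_cover = mapping, cover
    return best

mapping = best_mapping(web_tables, examples)
print([mapping.get(v) for v in to_transform])   # ['San Francisco']
```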
  • Ahmed M. Aly · Walid G. Aref · Mourad Ouzzani
    ABSTRACT: The ubiquity of location-aware devices and smartphones has unleashed an unprecedented proliferation of location-based services that require processing queries with both spatial and relational predicates. Many algorithms and index structures already exist for processing k-Nearest-Neighbor (kNN, for short) predicates, either solely or when combined with textual keyword search. Unfortunately, there has not been enough study on how to efficiently process queries where kNN predicates are combined with general relational predicates, i.e., ones that have selects, joins, and group-by's. One major challenge is that because the kNN is a ranking operation, applying a relational predicate before or after a kNN predicate in a query evaluation pipeline (QEP, for short) can result in different outputs, and hence leads to different query semantics. In particular, this renders classical relational query optimization heuristics, e.g., pushing selects below joins, inapplicable. This paper presents various query optimization heuristics for queries that involve combinations of kNN select/join predicates and relational predicates. The proposed optimizations can significantly enhance the performance of these queries while preserving their semantics. Experimental results that are based on queries from the TPC-H benchmark and real spatial data from OpenStreetMap demonstrate that the proposed optimizations can achieve orders of magnitude enhancement in query performance.
    Full-text · Conference Paper · Jan 2015
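The semantic issue is easy to see with a hypothetical worked example: filtering before or after a kNN operator gives different answers for the same query, so the classic "push selects down" heuristic cannot be applied blindly.

```python
import math

restaurants = [{"name": "A", "loc": (0.0, 1.0), "rating": 3},
               {"name": "B", "loc": (0.0, 2.0), "rating": 5},
               {"name": "C", "loc": (0.0, 5.0), "rating": 5}]
focal = (0.0, 0.0)   # query focal point

def knn(rows, k):
    return sorted(rows, key=lambda r: math.dist(r["loc"], focal))[:k]

k = 2
plan1 = [r["name"] for r in knn(restaurants, k) if r["rating"] == 5]               # kNN, then select
plan2 = [r["name"] for r in knn([r for r in restaurants if r["rating"] == 5], k)]  # select, then kNN
print(plan1, plan2)   # ['B'] ['B', 'C'] -- the two plans answer different questions
```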
  • X. Chu · J. Morcos · I.F. Ilyas · M. Ouzzani · P. Papotti · N. Tang · Y. Ye
    ABSTRACT: Data cleaning with guaranteed reliability is hard to achieve without accessing external sources, since the truth is not necessarily discoverable from the data at hand. Furthermore, even in the presence of external sources, mainly knowledge bases and humans, effectively leveraging them still faces many challenges, such as aligning heterogeneous data sources and decomposing a complex task into simpler units that can be consumed by humans. We present Katara, a novel end-to-end data cleaning system powered by knowledge bases and crowdsourcing. Given a table, a kb, and a crowd, Katara (i) interprets the table semantics w.r.t. the given kb; (ii) identifies correct and wrong data; and (iii) generates top-k possible repairs for the wrong data. Users will have the opportunity to experience the following features of Katara: (1) Easy specification: Users can define a Katara job with a browser-based specification; (2) Pattern validation: Users can help the system to resolve the ambiguity of different table patterns (i.e., table semantics) discovered by Katara; (3) Data annotation: Users can play the role of internal crowd workers, helping Katara annotate data. Moreover, Katara will visualize the annotated data as correct data validated by the kb, correct data jointly validated by the kb and the crowd, or erroneous tuples along with their possible repairs.
    No preview · Chapter · Jan 2015
  • Ahmed M. Aly · Walid G. Aref · Mourad Ouzzani
    ABSTRACT: Advances in geo-sensing technology have led to an unprecedented spread of location-aware devices. In turn, this has resulted in a plethora of location-based services in which huge amounts of spatial data need to be efficiently consumed by spatial query processors. For a spatial query processor to properly choose among the various query processing strategies, the cost of the spatial operators has to be estimated. In this paper, we study the problem of estimating the cost of the spatial k-nearest-neighbor (k-NN, for short) operators, namely, k-NN-Select and k-NN-Join. Given a query that has a k-NN operator, the objective is to estimate the number of blocks that are going to be scanned during the processing of this operator. Estimating the cost of a k-NN operator is challenging for several reasons. For instance, the cost of a k-NN-Select operator is directly affected by the value of k, the location of the query focal point, and the distribution of the data. Hence, a cost model that captures these factors is relatively hard to realize. This paper introduces cost estimation techniques that maintain a compact set of catalog information that can be kept in main-memory to enable fast estimation via lookups. A detailed study of the performance and accuracy trade-off of each proposed technique is presented. Experimental results using real spatial datasets from OpenStreetMap demonstrate the robustness of the proposed estimation techniques.
    Full-text · Conference Paper · Jan 2015
  • ABSTRACT: The unprecedented spread of location-aware devices has resulted in a plethora of location-based services in which huge amounts of spatial data need to be efficiently processed by large-scale computing clusters. Existing cluster-based systems for processing spatial data employ static data-partitioning structures that cannot adapt to data changes, and that are insensitive to the query workload. Hence, these systems are incapable of consistently providing good performance. To close this gap, we present AQWA, an adaptive and query-workload-aware mechanism for partitioning large-scale spatial data. AQWA does not assume prior knowledge of the data distribution or the query workload. Instead, as data is consumed and queries are processed, the data partitions are incrementally updated. With extensive experiments using real spatial data from Twitter, and various workloads of range and k-nearest-neighbor queries, we demonstrate that AQWA can achieve an order of magnitude enhancement in query performance compared to the state-of-the-art systems.
    Full-text · Conference Paper · Jan 2015
  • ABSTRACT: The SQL group-by operator plays an important role in summarizing and aggregating large datasets in a data analytics stack. While the standard group-by operator, which is based on equality, is useful in several applications, allowing similarity-aware grouping provides a more realistic view on real-world data that could lead to better insights. The Similarity SQL-based Group-By operator (SGB, for short) extends the semantics of the standard SQL Group-by by grouping data with similar but not necessarily equal values. While existing similarity-based grouping operators efficiently materialize this approximate semantics, they primarily focus on one-dimensional attributes and treat multidimensional attributes independently. However, correlated attributes, such as in spatial data, are processed independently, and hence, groups in the multidimensional space are not detected properly. To address this problem, we introduce two new SGB operators for multidimensional data. The first operator is the clique (or distance-to-all) SGB, where all the tuples in a group are within some distance from each other. The second operator is the distance-to-any SGB, where a tuple belongs to a group if the tuple is within some distance from any other tuple in the group. We implement and test the new SGB operators and their algorithms inside PostgreSQL. The overhead introduced by these operators proves to be minimal and the execution times are comparable to those of the standard Group-by. The experimental study, based on TPC-H and social check-in data, demonstrates that the proposed algorithms can achieve up to three orders of magnitude enhancement in performance over baseline methods developed to solve the same problem.
    Full-text · Article · Dec 2014 · IEEE Transactions on Knowledge and Data Engineering
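A hypothetical sketch of the distance-to-any semantics on 2-d points: a tuple joins a group if it is within eps of any current member, so groups end up being the connected components of the eps-neighborhood graph; the clique (distance-to-all) variant would additionally require every pair within a group to be within eps.

```python
import math

def sgb_distance_to_any(points, eps):
    groups = []                                              # list of groups (lists of points)
    for p in points:
        near = [g for g in groups if any(math.dist(p, q) <= eps for q in g)]
        merged = [p] + [q for g in near for q in g]          # p bridges every group it touches
        groups = [g for g in groups if g not in near] + [merged]
    return groups

pts = [(0, 0), (0, 1), (0, 2), (10, 10)]
print(sgb_distance_to_any(pts, eps=1.5))
# [[(0, 2), (0, 1), (0, 0)], [(10, 10)]] -- (0, 0) and (0, 2) are 2.0 apart,
# so a clique SGB with the same eps would not put them in one group.
```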
  • ABSTRACT: Similarity group-by (SGB, for short) has been proposed as a relational database operator to match the needs of emerging database applications. Many SGB operators that extend SQL have been proposed in the literature, e.g., similarity operators in the one-dimensional space. These operators have various semantics. Depending on how these operators are implemented, some of the implementations may lead to different groupings of the data. Hence, if SQL code is ported from one database system to another, it is not guaranteed that the code will produce the same results. In this paper, we investigate the various semantics for the relational similarity group-by operators in the multi-dimensional space. We define the class of order-independent SGB operators that produce the same results regardless of the order in which the input data is presented to them. Using the notion of interval graphs borrowed from graph theory, we prove that, for certain SGB operators, there exist order-independent implementations. For each of these operators, we provide a sample algorithm that is order-independent. Also, we prove that for other SGB operators, there does not exist an order-independent implementation, and hence these SGB operators are ill-defined and should not be adopted in extensions to SQL to realize similarity group-by. In this paper, we introduce an SGB operator, namely SGB-All, for grouping multi-dimensional data using similarity. SGB-All forms groups such that a data item, say O, belongs to a group, say G, if and only if O is within a user-defined threshold from all other data items in G. In other words, each group in SGB-All forms a clique of nearby data items in the multi-dimensional space. We prove that SGB-All is order-independent, i.e., there is at least one algorithm for each option that is independent of the presentation order of the input data.
    Full-text · Article · Dec 2014
  • Zbys Fedorowicz · Hossam Hammady · Mourad Ouzzani
    ABSTRACT: Preliminary filtering of searches is one of the most time-consuming aspects of systematic reviewing. Cochrane Review authors use a variety of approaches (manual and electronic), but no single method satisfactorily fulfills the principal requirements of speed with accuracy. Objectives: Pilot testing of Rayyan (rayyan.qcri.org) focused on usability, assessment of how accurately the tool performed against manual methods, and evaluation of the added benefit of the prediction feature. Methods: Searches from two published Cochrane Reviews (1030 and 273 records, respectively) were used to test the app (December 2013 to March 2014). Included studies had been previously selected manually for the reviews; original searches were uploaded into Rayyan. Results: One recently updated Cochrane Review (273 records) was used as a taster, allowing a quick overview of the look and feel of Rayyan and for early feedback on usability to be addressed by the developers. The second Cochrane Review (1030 records) required several iterations to identify the 11 trials that had previously been selected manually. The suggestions and hints, based on the prediction rules, appeared as the testing progressed beyond five included studies. The selection process was responsive and effective; the options undecided / included / excluded / suggested were clearly displayed. Search functions included limiters for relevance / title / date. Innovative features include: word clouds as graphical indicators of key terms; a translation option linked to Google Translate to enable a quick translation or forwarding to a translator; similarity-based exploration of studies; and labelling of studies, including reasons for exclusion. Key functionality includes the unambiguous way in which studies could be viewed in context together with the completed selections, and how undecided studies could be fed back into the system and were then highlighted as hints. Conclusion: Rayyan (beta-testing phase) is responsive and largely intuitive to use, with significant potential to lighten the load of review authors by speeding up this tedious part of the process.
    No preview · Conference Paper · Oct 2014
  • ABSTRACT: In this paper, we address the problem of managing complex dependencies among the database items that involve human actions, e.g., conducting a wet-lab experiment or taking manual measurements. If y = F(x) where F is a human action that cannot be coded inside the DBMS, then whenever x changes, y remains invalid until F is externally performed and its output result is reflected into the database. Many application domains, e.g., scientific applications in biology, chemistry, and physics, contain multiple such derivations and dependencies that involve human actions. In this paper, we propose HandsOn DB, a prototype database engine for managing dependencies that involve human actions while maintaining the consistency of the derived data. HandsOn DB includes the following features: (1) semantics and syntax for interfaces through which users can register human activities and express the dependencies among the data items on these activities, (2) mechanisms for invalidating and revalidating the derived data, and (3) new operator semantics that alert users when the returned query results contain potentially invalid data, and enable evaluating queries on either valid data only, or both valid and potentially invalid data. Performance results demonstrate the feasibility and practicality of realizing HandsOn DB.
    Full-text · Article · Sep 2014 · IEEE Transactions on Knowledge and Data Engineering
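A hypothetical, much-simplified sketch of the invalidation mechanism the abstract describes: a derived item depends on a base item through a human action, so updating the base item marks the derived item potentially invalid, queries flag it as such, and it becomes valid again only when the action's result is re-entered. All names and the flat dictionary "database" are made up for illustration.

```python
deps = {"calibration_curve": ("sample_weight", "re-run wet-lab calibration")}  # derived -> (source, action)
db = {"sample_weight": 4.2, "calibration_curve": 0.93}
invalid = set()

def update(item, value):
    db[item] = value
    for derived, (source, action) in deps.items():
        if source == item:
            invalid.add(derived)          # stale until the human action is redone

def revalidate(item, value):
    db[item] = value                      # human performed the action and entered the new result
    invalid.discard(item)

def query(item):
    flag = " (potentially invalid)" if item in invalid else ""
    return f"{item} = {db[item]}{flag}"

update("sample_weight", 4.7)
print(query("calibration_curve"))         # calibration_curve = 0.93 (potentially invalid)
revalidate("calibration_curve", 0.88)
print(query("calibration_curve"))         # calibration_curve = 0.88
```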

Publication Stats

1k Citations
46.83 Total Impact Points

Institutions

  • 2012-2016
    • Qatar Computing Research Institute
      Ad Dawḩah, Baladīyat ad Dawḩah, Qatar
  • 2011
    • Qatar Foundation
      Ad Dawḩah, Baladīyat ad Dawḩah, Qatar
  • 1970-2011
    • Purdue University
      • Department of Computer Science
      • Cyber Center
      West Lafayette, Indiana, United States
  • 1999
    • Queensland University of Technology
      • Information Systems School
      Brisbane, Queensland, Australia