Article

A Demo of the Data Civilizer System


Abstract

Finding the data relevant to a specific task among the numerous data sources available in any organization is daunting. This is not only because of the number of possible data sources where the data of interest resides, but also because the data is scattered all over the enterprise and is typically dirty and inconsistent. In practice, data scientists routinely report that the majority (more than 80%) of their effort is spent finding, cleaning, integrating, and accessing the data of interest to the task at hand. We propose to demonstrate Data Civilizer to ease the pain faced in analyzing data "in the wild". Data Civilizer is an end-to-end big data management system, running in large enterprises, with components for data discovery, data integration and stitching, data cleaning, and querying data from a large variety of storage engines.
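As a loose illustration of the end-to-end pipeline described in this abstract (discovery, integration and stitching, cleaning, and querying), the following minimal Python sketch wires toy versions of those stages together over pandas DataFrames. All function names and the toy data are hypothetical and are not part of Data Civilizer's actual API.

```python
# Hypothetical sketch of a discover -> stitch -> clean -> query flow.
# None of these names come from Data Civilizer itself; they only illustrate
# the shape of the pipeline described in the abstract.
import pandas as pd

def discover_tables(sources, keyword):
    """Return tables whose column names mention the keyword (toy discovery)."""
    return [t for t in sources if any(keyword in c.lower() for c in t.columns)]

def stitch(tables, key):
    """Join discovered tables on a shared key column (toy integration)."""
    result = tables[0]
    for t in tables[1:]:
        result = result.merge(t, on=key, how="outer")
    return result

def clean(df):
    """Drop exact duplicates and rows missing the join key (toy cleaning)."""
    return df.drop_duplicates().dropna(subset=["employee_id"])

sources = [
    pd.DataFrame({"employee_id": [1, 2, 2], "employee_name": ["Ann", "Bo", "Bo"]}),
    pd.DataFrame({"employee_id": [1, 2], "salary": [95000, 87000]}),
]
tables = discover_tables(sources, "employee")
print(clean(stitch(tables, "employee_id")))
```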


... We then present several reduction operations that are used to enforce schema complements when schema augmentation yields a row multiplication in the augmented dataset. These operations extend previous contributions on schema augmentation and schema complement (e.g., [12]-[14]) to the case of analytic datasets. (See Sections 3.1 to 3.3 in Chapter 3.) ...
... Let $T_{ct}$ and $T_{cand}$ be respectively the completion table and candidate completion table of $T$ and $T'_0$, as defined in Definition 3.14. ...
... Let $T_{ct}^{Q_1}$ and $T_{cand}^{Q_1}$ be respectively the completion table and candidate completion table of $Q_1$ and $T'_{Q_1} = T_0 \bowtie Q_1$, as defined in Definition 3.14. Proving (15) is now the same as proving: ...
Thesis
The production of analytic datasets is a significant big data trend and has gone well beyond the scope of traditional IT-governed dataset development. Analytic datasets are now created by data scientists and data analysts using big data frameworks and agile data preparation tools. However, it remains difficult for a data analyst to start from a dataset at hand and customize it with additional attributes coming from other existing datasets. This thesis presents a new solution for business users and data scientists who want to augment the schema of analytic datasets with attributes coming from other semantically related datasets:
• We introduce attribute graphs as a novel, concise, and natural way to represent literal functional dependencies over hierarchical dimension level types to infer unique dimension and fact table identifiers.
• We give formal definitions for schema augmentation, schema complement, and merge query in the context of analytic tables. We then introduce several reduction operations to enforce schema complements when schema augmentation yields a row multiplication in the augmented dataset.
• We define formal quality criteria and algorithms to control the correctness, non-ambiguity, and completeness of generated schema augmentations and schema complements.
• We describe the implementation of our solution as a REST service within the SAP HANA platform and provide a detailed description of our algorithms.
• We evaluate the performance of our algorithms to compute unique identifiers in dimension and fact tables and analyze the effectiveness of our REST service using two application scenarios.
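The reduction operations mentioned above can be pictured with a small example: when the augmenting table has several rows per join key, a naive join multiplies rows, and a reduction (here a simple aggregation, which is only one plausible stand-in for the thesis's operators) restores a schema-complement-like augmentation. The table and column names below are hypothetical.

```python
# Hypothetical sketch of schema augmentation with a reduction step, in the
# spirit of the thesis above. Table/column names and the choice of aggregation
# are illustrative only, not the thesis's actual operators.
import pandas as pd

sales = pd.DataFrame({"store": ["S1", "S1", "S2"], "revenue": [100, 150, 80]})
# Candidate source for augmentation: several contact rows per store, so a
# naive join would multiply the sales rows.
contacts = pd.DataFrame({"store": ["S1", "S1", "S2"],
                         "phone": ["111", "222", "333"]})

naive = sales.merge(contacts, on="store")           # 5 rows: row multiplication
# Reduction: collapse the augmenting table to one row per join key first,
# so the augmentation behaves like a schema complement.
reduced = contacts.groupby("store", as_index=False).agg({"phone": "first"})
augmented = sales.merge(reduced, on="store")        # 3 rows: cardinality preserved

print(len(sales), len(naive), len(augmented))       # 3 5 3
```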
... These systems aim to speed up data analysis procedures for large datasets. However, as reported by many data scientists [29], only 20% of their time is spent doing the desired data analysis tasks. Data scientists need to spend 80% of their time, sometimes even more, on data preparation, which is a slow, difficult, and tedious process. ...
... If there is more than one subgraph ID in sid_list, the subgraph is split into subgraphs whose IDs are from sid_list (lines 26-27). If there is only one subgraph ID in sid_list, the subgraph does not need to be split and the only subgraph ID becomes its new subgraph ID (lines 28-29). If sid_list is empty, the subgraph is simply deleted (line 31). ...
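A minimal sketch of the three-way case analysis described in this excerpt, assuming subgraphs are kept in a dictionary from subgraph ID to node set; the node-to-new-ID assignment function is a hypothetical stand-in for whatever the cited algorithm actually uses.

```python
# Hypothetical sketch of the case analysis in the excerpt: split when sid_list
# has several IDs, relabel when it has exactly one, delete when it is empty.
# The node-to-new-ID assignment is an assumption, not the paper's logic.
def update_subgraph(subgraphs, old_sid, sid_list, assign_new_sid):
    """subgraphs: dict mapping subgraph ID -> set of nodes."""
    nodes = subgraphs.pop(old_sid)          # remove the old subgraph entry
    if len(sid_list) > 1:                   # split into several subgraphs
        for node in nodes:
            new_sid = assign_new_sid(node)  # must return an ID from sid_list
            subgraphs.setdefault(new_sid, set()).add(node)
    elif len(sid_list) == 1:                # no split needed: just relabel
        subgraphs[sid_list[0]] = nodes
    # else: sid_list is empty -> the subgraph is simply dropped

subgraphs = {0: {"a", "b", "c"}}
update_subgraph(subgraphs, 0, [1, 2], lambda n: 1 if n < "c" else 2)
print(subgraphs)   # {1: {'a', 'b'}, 2: {'c'}}
```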
Thesis
We are living in a big data world, where data is being generated in high volume, high velocity and high variety. Big data brings enormous values and benefits, so data analytics has become a critically important driver of business success across all sectors. However, if the data is not analyzed fast enough, the benefits of big data will be limited or even lost. Despite the existence of many modern large-scale data analysis systems, data preparation, the most time-consuming process in data analytics, has not yet received sufficient attention. In this thesis, we study the problem of how to accelerate data preparation for big data analytics. In particular, we focus on two major data preparation steps, data loading and data cleaning. As the first contribution of this thesis, we design DiNoDB, a SQL-on-Hadoop system which achieves interactive-speed query execution without requiring data loading. Modern applications involve heavy batch processing jobs over large volumes of data and at the same time require efficient ad-hoc interactive analytics on temporary data generated in those batch processing jobs. Existing solutions largely ignore the synergy between these two aspects and require loading the entire temporary dataset to achieve interactive queries. In contrast, DiNoDB avoids the expensive data loading and transformation phase. The key innovation of DiNoDB is to piggyback the creation of metadata on the batch processing phase, which DiNoDB then exploits to expedite the interactive queries. The second contribution is a distributed stream data cleaning system, called Bleach. Existing scalable data cleaning approaches rely on batch processing to improve data quality and are by nature very time-consuming. We instead target stream data cleaning, in which data is cleaned incrementally in real time. Bleach is the first qualitative stream data cleaning system, achieving both real-time violation detection and data repair on a dirty data stream. It relies on efficient, compact, and distributed data structures to maintain the state necessary to clean data, and also supports rule dynamics. Our experimental evaluations demonstrate that the two resulting systems, DiNoDB and Bleach, both achieve excellent performance compared to state-of-the-art approaches and can help data scientists significantly reduce the time they spend on data preparation.
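As a rough illustration of the stream-cleaning idea, the sketch below detects and repairs violations of a single functional-dependency-style rule (zip determines city) incrementally over a record stream. It is only a toy rendering of the problem setting; Bleach's actual state layout, distribution, and repair strategy are not reproduced.

```python
# Hypothetical sketch of incremental violation detection for a functional
# dependency (zip -> city) on a record stream. Bleach's real data structures,
# distribution, and repair logic are not shown here.
from collections import Counter, defaultdict

state = defaultdict(Counter)   # zip -> frequency of each city seen so far

def clean_record(record):
    zip_code, city = record["zip"], record["city"]
    state[zip_code][city] += 1
    majority_city, _ = state[zip_code].most_common(1)[0]
    if city != majority_city:              # violation: repair to majority value
        return {**record, "city": majority_city, "repaired": True}
    return {**record, "repaired": False}

stream = [
    {"zip": "02139", "city": "Cambridge"},
    {"zip": "02139", "city": "Cambridge"},
    {"zip": "02139", "city": "Boston"},    # violates zip -> city
]
for rec in stream:
    print(clean_record(rec))
```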
... Botnet detection techniques based on ML have been considered the most effective. Feature extraction is an important step in ML techniques that accounts for about 80% of the time spent before the data is processed; it helps to minimize data dimensionality and improve the accuracy of ML models [15]. Flow-based features are the features most often focused on in bot detection research. ...
Article
Full-text available
The number of botnet malware attacks on Internet devices has grown at an equivalent rate to the number of Internet devices that are connected to the Internet. Bot detection using machine learning (ML) with flow-based features has been extensively studied in the literature. Existing flow-based detection methods involve significant computational overhead and do not completely capture network communication patterns that might reveal other features of malicious hosts. Recently, graph-based bot detection methods using ML have gained attention to overcome these limitations, as graphs provide a real representation of network communications. The purpose of this study is to build a botnet malware detection system utilizing centrality measures for graph-based botnet detection and ML. We propose BotSward, a graph-based bot detection system that is based on ML. We apply efficient centrality measures, namely Closeness Centrality (CC), Degree Centrality (DC), and PageRank (PR), and compare them with others used in the state of the art. The efficiency of the proposed method is verified on the available Czech Technical University 13 dataset (CTU-13). The CTU-13 dataset contains 13 real botnet traffic scenarios that are connected to a command-and-control (C&C) channel and that cause malicious actions such as phishing, distributed denial-of-service (DDoS) attacks, spam attacks, etc. BotSward is robust to zero-day attacks, suitable for large-scale datasets, and is intended to produce better accuracy than state-of-the-art techniques. The proposed BotSward solution achieved 99% accuracy in botnet attack detection with a false positive rate as low as 0.0001%.
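The sketch below illustrates, under toy assumptions, the feature pipeline the abstract describes: build a communication graph, compute the three named centrality measures per host with networkx, and train a generic classifier on them. The graph, the labels, and the choice of RandomForestClassifier are illustrative and are not taken from the BotSward paper.

```python
# Hypothetical sketch: compute closeness, degree, and PageRank centrality per
# host on a toy communication graph and feed them to a generic classifier.
import networkx as nx
from sklearn.ensemble import RandomForestClassifier

edges = [("h1", "h2"), ("h1", "h3"), ("h2", "h3"), ("h4", "h1"), ("h4", "h5")]
G = nx.DiGraph(edges)

closeness = nx.closeness_centrality(G)
degree = nx.degree_centrality(G)
pagerank = nx.pagerank(G)

hosts = sorted(G.nodes())
X = [[closeness[h], degree[h], pagerank[h]] for h in hosts]
y = [0, 0, 0, 1, 0]   # toy labels: 1 = bot, aligned with sorted host order

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
print(dict(zip(hosts, clf.predict(X))))
```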
... Finally, this step often does some form of normalization and weighting in order to ensure that the further processing steps are acting on the right data. In the academic and commercial world, there are a number of technologies such as Data Civilizer [31], DataXFormer [32], Data Tamer [33], and Data Wrangler [34] that provide semi-automatic techniques to simplify data curation. ...
Preprint
Artificial Intelligence (AI) has the opportunity to revolutionize the way the United States Department of Defense (DoD) and Intelligence Community (IC) address the challenges of evolving threats, data deluge, and rapid courses of action. Developing an end-to-end artificial intelligence system involves parallel development of different pieces that must work together in order to provide capabilities that can be used by decision makers, warfighters and analysts. These pieces include data collection, data conditioning, algorithms, computing, robust artificial intelligence, and human-machine teaming. While much of the popular press today surrounds advances in algorithms and computing, most modern AI systems leverage advances across numerous different fields. Further, while certain components may not be as visible to end-users as others, our experience has shown that each of these interrelated components play a major role in the success or failure of an AI system. This article is meant to highlight many of these technologies that are involved in an end-to-end AI system. The goal of this article is to provide readers with an overview of terminology, technical details and recent highlights from academia, industry and government. Where possible, we indicate relevant resources that can be used for further reading and understanding.
Article
Full-text available
Data analysis often uses data sets that were collected for different purposes. Indeed, new insights are often obtained by combining data sets that were produced independently of each other, for example by combining data from outside an organization with internal data resources. As a result, there is a need to discover, clean, integrate and restructure data into a form that is suitable for an intended analysis. Data preparation, also known as data wrangling, is the process by which data is transformed from its existing representation into a form that is suitable for analysis. In this paper, we review the state of the art in data preparation by: (i) describing functionalities that are central to data preparation pipelines, specifically profiling, matching, mapping, format transformation and data repair; and (ii) presenting how these capabilities surface in different approaches to data preparation, that involve programming, writing workflows, interacting with individual data sets as tables, and automating aspects of the process. These functionalities and approaches are illustrated with reference to a running example that combines open government data with web-extracted real estate data.
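As a small, hedged illustration of the profiling functionality listed above, the following sketch computes per-column statistics (null rate, distinct count, inferred type) of the kind that downstream matching and repair steps can consume; the column names echo the real estate running example, but the data is invented.

```python
# Minimal sketch of column profiling: null rate, distinct count, inferred type.
# Columns and values are illustrative only.
import pandas as pd

df = pd.DataFrame({
    "postcode": ["M1 1AA", "M1 1AB", None, "M1 1AA"],
    "price": [250000, 180000, 320000, 250000],
})

profile = pd.DataFrame({
    "null_rate": df.isna().mean(),
    "distinct": df.nunique(),
    "dtype": df.dtypes.astype(str),
})
print(profile)
```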
Article
Schema mappings enable declarative and executable specification of transformations between different schematic representations of application concepts. Most work on mapping generation has assumed that the source and target schemas are well defined, e.g., with declared keys and foreign keys, and that the mapping generation processes exist to support the data engineer in the labour-intensive process of producing a high-quality integration. However, organizations increasingly have access to numerous independently produced datasets, e.g., in a data lake, with a requirement to produce rapid, best-effort integrations, without extensive manual effort. As a result, there is a need to generate mappings in settings without declared relationships, and thus on the basis of inferred profiling data, and over large numbers of sources. Our contributions include a dynamic programming algorithm for exploring the space of potential mappings, and techniques for propagating profiling data through mappings, so that the fitness of candidate mappings can be estimated. The paper also describes how the resulting mappings can be used to populate single and multi-relation target schemas. Experimental results show the effectiveness and scalability of the approach in a variety of synthetic and real-world scenarios.
Chapter
The digital revolution, rapidly decreasing storage cost, and remarkable results achieved by state-of-the-art machine learning (ML) methods are driving widespread adoption of ML approaches. While notable recent efforts to benchmark ML methods for canonical tasks exist, none of them address the challenges arising with the increasing pervasiveness of end-to-end ML deployments. The challenges involved in successfully applying ML methods in diverse enterprise settings extend far beyond efficient model training.
Chapter
This paper presents a Tensor-based Data Model (TDM) for polystore systems, meant to address two major, closely related issues in big data analytics architectures, namely logical data independence and data impedance mismatch. The TDM is an expressive model that subsumes traditional data models, allows linking the different data models of various data stores, and facilitates data transformations by using operators with clearly defined semantics. Our contribution is twofold. First, we add the notion of a schema for the tensor mathematical object using typed associative arrays. Second, we define a set of operators to manipulate data through the TDM. To validate our approach, we first show how our TDM is integrated into a given polystore architecture. We then describe use cases of real analyses using the TDM and its operators in the context of the 2017 French presidential election.
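One possible reading of "a schema for the tensor mathematical object using typed associative arrays" is sketched below: an associative array whose dimensions carry names and key types, plus one simple operator. The class and the election-flavoured toy data are illustrative only and do not reproduce the TDM's actual definitions.

```python
# Hypothetical sketch of a typed associative array: a tensor-like object whose
# dimensions have names and key types, loosely inspired by the TDM description.
class TypedAssociativeArray:
    def __init__(self, dims):
        self.dims = dims          # e.g. {"candidate": str, "region": str}
        self.cells = {}           # key tuple -> value

    def set(self, keys, value):
        assert all(isinstance(k, t) for k, t in zip(keys, self.dims.values()))
        self.cells[tuple(keys)] = value

    def project(self, dim_name):
        """Sum values over all dimensions except dim_name (a simple operator)."""
        idx = list(self.dims).index(dim_name)
        out = {}
        for keys, v in self.cells.items():
            out[keys[idx]] = out.get(keys[idx], 0) + v
        return out

votes = TypedAssociativeArray({"candidate": str, "region": str})
votes.set(("A", "Paris"), 120)
votes.set(("A", "Lyon"), 80)
votes.set(("B", "Paris"), 95)
print(votes.project("candidate"))   # {'A': 200, 'B': 95}
```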
Article
Data collection is a major bottleneck in machine learning and an active research topic in multiple communities. There are largely two reasons data collection has recently become a critical issue. First, as machine learning is becoming more widely-used, we are seeing new applications that do not necessarily have enough labeled data. Second, unlike traditional machine learning, deep learning techniques automatically generate features, which saves feature engineering costs, but in return may require larger amounts of labeled data. Interestingly, recent research in data collection comes not only from the machine learning, natural language, and computer vision communities, but also from the data management community due to the importance of handling large amounts of data. In this survey, we perform a comprehensive study of data collection from a data management point of view. Data collection largely consists of data acquisition, data labeling, and improvement of existing data or models. We provide a research landscape of these operations, provide guidelines on which technique to use when, and identify interesting research challenges. The integration of machine learning and data management for data collection is part of a larger trend of Big data and Artificial Intelligence (AI) integration and opens many opportunities for new research.
Conference Paper
Schema mappings enable declarative and executable specification of transformations between different schematic representations of application concepts. Most work on mapping generation has assumed that the source and target schemas are well defined, e.g., with declared keys and foreign keys, and that the mapping generation processes exist to support the data engineer in the labour-intensive process of producing a high-quality integration. However, organizations increasingly have access to numerous independently produced data sets, e.g., in a data lake, with a requirement to produce rapid, best-effort integrations, without extensive manual effort. This paper introduces Dynamap, a mapping generation algorithm for such settings, where metadata about sources and the relationships between them is derived from automated data profiling, and where there may be many alternative ways of combining source tables. Our contributions include a dynamic programming algorithm for exploring the space of potential mappings, and techniques for propagating profiling data through mappings, so that the fitness of candidate mappings can be estimated. Experimental results show the effectiveness and scalability of the approach in a variety of synthetic and real-world scenarios.
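The dynamic-programming search over candidate mappings can be pictured with the toy sketch below, which scores partial join plans using fake profiling statistics and keeps the best plan per subset of source tables. The fitness function and plan representation are assumptions for illustration, not Dynamap's actual algorithm.

```python
# Hypothetical sketch of a dynamic program over candidate join plans.
# The "profiling data" and fitness estimate below are toy assumptions.
from itertools import combinations

# Toy profiling data: estimated fraction of rows preserved by each pairwise join.
join_quality = {frozenset({"people", "jobs"}): 0.9,
                frozenset({"people", "addresses"}): 0.7,
                frozenset({"jobs", "addresses"}): 0.2}

tables = ["people", "jobs", "addresses"]

def pair_score(a, b):
    return join_quality.get(frozenset({a, b}), 0.0)

# best[S] = (score, plan) for the best way to combine the subset S of tables
best = {frozenset({t}): (1.0, t) for t in tables}
for size in range(2, len(tables) + 1):
    for subset in map(frozenset, combinations(tables, size)):
        for t in subset:
            rest = subset - {t}
            if rest not in best:
                continue
            rest_score, rest_plan = best[rest]
            # Estimate fitness of joining t onto the partial plan.
            score = rest_score * max(pair_score(t, r) for r in rest)
            if score > best.get(subset, (-1.0, None))[0]:
                best[subset] = (score, f"({rest_plan} JOIN {t})")

print(best[frozenset(tables)])
```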
Conference Paper
Organizations such as GE are heavily invested in applying advanced data-intensive machine learning (ML) techniques towards continually improving the performance of complex physical assets and industrial operations. This paper highlights some unique industrial data management challenges and demonstrates the need for and benefits of a knowledge-driven approach to data management that complements existing efforts by the ML systems community. Specifically, we present a novel software abstraction called a NodeGroup for accessing (within some domain-specific context) heterogeneous data so that the development of end-to-end ML-driven applications is further streamlined. We present our preliminary use of NodeGroups for ML applications within a prototype (additive) manufacturing data platform at GE.
Conference Paper
SQL queries encapsulate the knowledge of their authors about the usage of the queried data sources. This knowledge also contains aspects that cannot be inferred by analyzing the contents of the queried data sources alone. Due to the complexity of analytical SQL queries, specialized mechanisms are necessary to enable the user-friendly formulation of meta-queries over an existing query log. Currently existing approaches do not sufficiently consider syntactic and semantic aspects of queries along with contextual information. During our demonstration, conference participants learn how to use the latest release of OCEANLog, a framework for analyzing SQL query logs. Our demonstration encompasses several scenarios. Participants can explore an existing query log using domain-specific graph traversal expressions, set up continuous subscriptions for changes in the graph, create time-based visualizations of query results, configure an OCEANLog instance, and learn how to decide which specific graph database to use. We also provide them with access to the native meta-query mechanisms of a DBMS to further emphasize the benefits of our graph-based approach.
Conference Paper
Analytical SQL queries are a valuable source of information. Query log analysis can provide insight into the usage of datasets and uncover knowledge that cannot be inferred from source schemas or content alone. To unlock this potential, flexible mechanisms for meta-querying are required. Syntactic and semantic aspects of queries must be considered along with contextual information. We present an extensible framework for analyzing SQL query logs. Query logs are mapped to a multi-relational graph model and queried using domain-specific traversal expressions. To enable concise and expressive meta-querying, semantic analyses are conducted on normalized relational algebra trees with accompanying schema lineage graphs. Syntactic analyses can be conducted on the corresponding query texts and abstract syntax trees. Additional metadata makes it possible to inspect the temporal and social context of each query. In this demonstration, we show how query log analysis with our framework can support data source discovery and facilitate collaborative data science. The audience can explore an exemplary query log to locate queries relevant to a data analysis scenario, conduct graph analyses on the log, and assemble a customized log-monitoring dashboard.
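A minimal sketch of the graph representation of a query log, assuming queries, tables, and authors become nodes with typed edges, and a simple meta-query answered by traversal; OCEANLog's actual graph model and traversal language are not reproduced here.

```python
# Hypothetical sketch of a query-log graph: queries, tables and authors become
# nodes, and typed edges record "references" and "authored_by" relationships.
import networkx as nx

g = nx.MultiDiGraph()
log = [
    {"id": "q1", "author": "alice", "tables": ["sales", "stores"]},
    {"id": "q2", "author": "bob",   "tables": ["sales"]},
]
for q in log:
    g.add_edge(q["id"], q["author"], kind="authored_by")
    for t in q["tables"]:
        g.add_edge(q["id"], t, kind="references")

# Meta-query: which queries touch the "sales" table, and who wrote them?
for query, table, data in g.in_edges("sales", data=True):
    if data["kind"] == "references":
        authors = [v for _, v, d in g.out_edges(query, data=True)
                   if d["kind"] == "authored_by"]
        print(query, "->", authors)
```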
Conference Paper
Data wrangling, the multi-faceted process by which the data required by an application is identified, extracted, cleaned and integrated, is often cumbersome and labor intensive. In this paper, we present an architecture that supports a complete data wrangling lifecycle, orchestrates components dynamically, builds on automation wherever possible, is informed by whatever data is available, refines automatically produced results in the light of feedback, takes into account the user's priorities, and supports data scientists with diverse skill sets. The architecture is demonstrated in practice for wrangling property sales and open government data.
Article
Full-text available
We classify data quality problems that are addressed by data cleaning and provide an overview of the main solution approaches. Data cleaning is especially required when integrating heterogeneous data sources and should be addressed together with schema-related data transformations. In data warehouses, data cleaning is a major part of the so-called ETL process. We also discuss current tool support for data cleaning.
Conference Paper
Many emerging applications, from domains such as healthcare and oil & gas, require several data processing systems for complex analytics. This demo paper showcases our system, a framework that provides multi-platform task execution for such applications. It features a three-layer data processing abstraction and a new query optimization approach for multi-platform settings. We will demonstrate the strengths of the system by using real-world scenarios from three different applications, namely machine learning, data cleaning, and data fusion.
Article
The Web contains a vast amount of structured information such as HTML tables, HTML lists and deep-web databases; there is enormous potential in combining and re-purposing this data in creative ways. However, integrating data from this relational web raises several challenges that are not addressed by current data integration systems or mash-up tools. First, the structured data is usually not published cleanly and must be extracted (say, from an HTML list) before it can be used. Second, due to the vastness of the corpus, a user can never know all of the potentially-relevant databases ahead of time (much less write a wrapper or mapping for each one); the source databases must be discovered during the integration process. Third, some of the important information regarding the data is only present in its enclosing web page and needs to be extracted appropriately. This paper describes Octopus, a system that combines search, extraction, data cleaning and integration, and enables users to create new data sets from those found on the Web. The key idea underlying Octopus is to offer the user a set of best-effort operators that automate the most labor-intensive tasks. For example, the Search operator takes a search-style keyword query and returns a set of relevance-ranked and similarity-clustered structured data sources on the Web; the Context operator helps the user specify the semantics of the sources by inferring attribute values that may not appear in the source itself, and the Extend operator helps the user find related sources that can be joined to add new attributes to a table. Octopus executes some of these operators automatically, but always allows the user to provide feedback and correct errors. We describe the algorithms underlying each of these operators and experiments that demonstrate their efficacy.
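The operator shapes described in this abstract can be sketched as follows; the bodies are toy stand-ins (keyword matching, a naive venue inference, a pandas join) rather than Octopus's actual search, extraction, and integration algorithms, and all table names are invented.

```python
# Hypothetical sketch of the Search / Context / Extend operator signatures.
# The corpus, inference rule, and join are toy stand-ins, not Octopus's logic.
import pandas as pd

CORPUS = {
    "vldb_program_committee": pd.DataFrame({"name": ["Ann", "Bo"]}),
    "sigmod_program_committee": pd.DataFrame({"name": ["Cy"]}),
}

def search(keyword):
    """Return corpus tables whose name matches the keyword (toy relevance)."""
    return {k: v for k, v in CORPUS.items() if keyword in k}

def context(table, source_name):
    """Add an attribute inferred from the enclosing page, here the venue name."""
    venue = source_name.split("_")[0].upper()      # toy inference from the name
    return table.assign(venue=venue)

def extend(table, key, extra):
    """Join a related source to add new attributes for each key value."""
    return table.merge(extra, on=key, how="left")

tables = search("program_committee")
affiliations = pd.DataFrame({"name": ["Ann", "Bo", "Cy"],
                             "affiliation": ["MIT", "QCRI", "UW"]})
for name, t in tables.items():
    print(extend(context(t, name), "name", affiliations))
```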
The Data Civilizer System
D. Deng, R. C. Fernandez, Z. Abedjan, S. Wang, A. Elmagarmid, I. F. Ilyas, S. Madden, M. Ouzzani, M. Stonebraker, and N. Tang. The Data Civilizer System. In CIDR, 2017.