Conference Paper · PDF Available

ConceptMix: Self-Service Analytical Data Integration based on the Concept-Oriented Model

Abstract and Figures

Data integration as well as other data wrangling tasks account for a great deal of the difficulties in data analysis and frequently constitute the most tedious part of the overall analysis process. We describe a new system, ConceptMix, which radically simplifies analytical data integration for a broad range of non-IT users who do not possess deep knowledge in mathematics or statistics. ConceptMix relies on a novel unified data model, called the concept-oriented model (COM), which provides formal background for its functionality.

... Self-service analytics is one of the most significant trends in the BI industry over the last few years. It aims to give users the ability to solve analytical tasks with little or no help from IT [24]. Self-service tools are intended for users such as data enthusiasts, business users, data artisans, and analysts. ...
... COM has been described at the conceptual level as well as syntactically using the concept-oriented query language (COQL) [30, 29, 25], with limited formalization. COM has also been implemented in two systems: a self-service tool for analytical data integration, ConceptMix [24], and a framework for data wrangling and agile data transformations, DataCommandr [22]. The main contribution of this paper is that we propose a new formalization of COM in terms of functions, sets and tuples. ...
... In this section, we describe one possible application of COM which has been implemented in a framework for agile data transformations and manipulations, called DataCommandr [22]. It is the data processing engine behind ConceptMix [24], a tool for self-service data transformations. Data manipulations in DataCommandr are described using the Concept-Oriented Expression Language (COEL). ...
Preprint
Full-text available
The plethora of existing data models and specific data modeling techniques is not only confusing but leads to complex, eclectic and inefficient designs of systems for data management and analytics. The main goal of this paper is to describe a unified approach to data modeling, called the concept-oriented model (COM), using functions as a basis for its formalization. COM tries to answer the question of what data is and to rethink the basic assumptions underlying this and related notions. Its main goal is to unify major existing views on data (generality), using only a few main notions (simplicity) which are very close to how data is used in real life (naturalness).
... Such a course of action would allow the curation of reports to be sourced to the crowd, while IT users with insufficient knowledge of the business domain would no longer be responsible for assessing the quality of the reports or their components. In summary, we identified a trade-off in the area of governing SSBIA between a top-down environment (e.g., Mayer et al., 2015; Schuff et al., 2018) that includes a pre-defined technical infrastructure, which typically lacks flexibility, and a bottom-up environment (e.g., Abelló et al., 2013; Savinov, 2014) that gives users full flexibility with the risk of making wrong assumptions. ...
... (1) Understand the Trade-Off between Top-Down and Bottom-Up SSBIA Capabilities. Firstly, as users bring diverse knowledge to analytical tasks, SSBIA tools need to address diverse capabilities ranging from data analysis with pre-defined sets of data (e.g., Schuff et al., 2018) to data integration, preparation, and modeling (e.g., Abelló et al., 2013; Savinov, 2014). The technical infrastructure should enable all user groups to conduct their analytical tasks based on the underlying data. ...
Conference Paper
Full-text available
Self-Service Business Intelligence and Analytics (SSBIA) is an emerging approach and trend that enables casual business users to prepare and analyze data with easy-to-use Business Intelligence and Analytics (BIA) systems, without relying on expert support or power users, and to perform their (complex) analytical tasks more easily and faster than before. Despite a strong interest of scholars and practitioners in SSBIA, the understanding of its underlying characteristics is limited. Furthermore, there is no structured and systematic form in which SSBIA research can be classified. Against this backdrop, this article showcases the current state of the art of SSBIA research along four key areas in the field: (1) perspectives on SSBIA, (2) user roles involved, (3) required expertise, and (4) supported levels of self-service. Analyzing 60 articles, our main contribution resides in the synopsis of SSBIA literature in these four areas. For instance, we illustrate that there are three perspectives on SSBIA: artefact-centric (45% of analyzed studies), user-centric (82%), and governance-centric (25%). On the basis of our analysis, we suggest promising avenues that will support scholars in their future endeavors in the field of SSBIA (e.g., understanding the trade-off between top-down and bottom-up capabilities).
... Inference in COM relies not only on set operations but also on operations with functions (deriving new functions from existing functions). This changes the way data is processed, and this approach was implemented in several systems such as Prosto, Lambdo, and Bistro for general-purpose data processing, and DataCommandr [6] and ConceptMix [8] for data wrangling. ...
Preprint
Full-text available
In this paper we argue that representing entity properties by tuple attributes, as evangelized in most set-oriented data models, is a controversial method conflicting with the principle of tuple immutability. As a principled solution to this problem of tuple immutability on one hand and the need to modify tuple attributes on the other hand, we propose to use mathematical functions for representing entity properties. In this approach, immutable tuples are intended for representing the existence of entities while mutable functions (mappings between sets) are used for representing entity properties. In this model, called the concept-oriented model (COM), functions are made first-class elements along with sets, and both functions and sets are used to represent and process data in a simpler and more natural way in comparison to purely set-oriented models.
... We describe the Bistro toolkit only to illustrate one possible implementation of COM and do not discuss such (important) aspects as physical data organization, dependencies and topology of operations, incremental evaluation, optimization of function evaluation, etc. This approach to data modeling and processing was also used for self-service data integration and analysis [11,12]. ...
Preprint
Full-text available
We describe a new logical data model, called the concept-oriented model (COM). It uses mathematical functions as first-class constructs for data representation and data processing as opposed to using exclusively sets in conventional set-oriented models. Functions and function composition are used as primary semantic units for describing data connectivity instead of relations and relation composition (join), respectively. Grouping and aggregation are also performed by using (accumulate) functions providing an alternative to group-by and reduce operations. This model was implemented in an open source data processing toolkit examples of which are used to illustrate the model and its operations. The main benefit of this model is that typical data processing tasks become simpler and more natural when using functions in comparison to adopting sets and set operations.
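The abstract above contrasts function composition with joins and accumulate functions with group-by. The following is a minimal, hypothetical sketch of these ideas in plain Python; the record layouts and helper names are illustrative assumptions, not the toolkit's actual API.

```python
# Illustrative sketch of function-oriented data processing (assumed
# names, not the real COM/Bistro API): link functions and function
# composition replace joins; accumulate functions replace group-by.

orders = [
    {"id": 1, "customer": "alice", "amount": 10.0},
    {"id": 2, "customer": "bob",   "amount": 20.0},
    {"id": 3, "customer": "alice", "amount": 5.0},
]
customers = [{"name": "alice", "city": "Berlin"},
             {"name": "bob",   "city": "Paris"}]

# A "link" function maps each order to its customer record
# (the role a join plays in set-oriented models).
customer_of = {o["id"]: next(c for c in customers
                             if c["name"] == o["customer"])
               for o in orders}

# Function composition: order -> customer -> city, without a join.
city_of_order = {oid: c["city"] for oid, c in customer_of.items()}

# An "accumulate" function aggregates linked records per customer
# (the role group-by plays): total amount spent by each customer.
total = {c["name"]: 0.0 for c in customers}
for o in orders:
    total[o["customer"]] += o["amount"]

print(city_of_order)  # {1: 'Berlin', 2: 'Paris', 3: 'Berlin'}
print(total)          # {'alice': 15.0, 'bob': 20.0}
```

The point of the sketch is that connectivity (which customer an order belongs to) and aggregation are both expressed as mappings between sets rather than as set operations producing new tables.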
... In the overall analysis process, various data pre-processing tasks can account for most of the difficulties, and therefore choosing a technology for efficient development and execution of such scripts is of very high importance. This process is frequently referred to as data wrangling, which is defined as "iterative data exploration and transformation that enables analysis" (Kandel, 2011; Savinov, 2014). ...
Conference Paper
Full-text available
This paper describes an approach to detecting anomalous behavior of devices by analyzing their event data. Devices from a fleet are assumed to be connected to the Internet, sending log data to the server. The task is to analyze this data by automatically detecting unusual behavioral patterns. Another goal is to provide analysis templates that are easy to customize and that can be applied to many different use cases as well as data sets. For anomaly detection, this log data passes through three stages of processing: feature generation, feature aggregation, and analysis. It has been implemented as a cloud service which exposes its functionality via a REST API. The core functions are implemented in a workflow engine which makes it easy to describe these three stages of data processing. The developed cloud service also provides a user interface for visualizing anomalies. The system was tested on several real data sets, such as data generated by autonomous lawn mowers, where it produced meaningful results using the standard template and only a few parameters.
... DataCommandr is the data processing engine behind ConceptMix (Savinov, 2014a). Although both are based on the same theoretical foundation (the concept-oriented model of data), these systems are targeted at different problems and have different implementations. ...
Conference Paper
Full-text available
In this paper, we describe a novel approach to data integration, transformation and analysis, called DataCommandr. Its main distinguishing feature is that it is based on operations with columns, rather than operations with tables as in the relational model or operations with cells as in spreadsheet applications. This data processing model is free of typical set operations such as join, group-by and map-reduce, which are difficult to comprehend and slow at run time. Due to its ability to easily describe rather complex transformations and its high performance on analytical workflows, this approach can be viewed as an alternative to existing technologies in the area of ad-hoc and agile data analysis.
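The column-oriented idea in the abstract above can be sketched in a few lines of Python: a derived column is defined as a function of existing columns and evaluated element-wise. The `derive` helper and column names are illustrative assumptions, not COEL syntax.

```python
# Hypothetical sketch of column-oriented derivation (not actual COEL):
# new columns are computed from existing columns, with no table-level
# join, group-by or map-reduce operations involved.

price = [10.0, 20.0, 30.0]  # existing column
qty   = [1, 2, 3]           # existing column

def derive(fn, *cols):
    """Evaluate fn element-wise over the given columns."""
    return [fn(*vals) for vals in zip(*cols)]

# A derived column is a function of other columns.
amount = derive(lambda p, q: p * q, price, qty)
print(amount)  # [10.0, 40.0, 90.0]
```

Defining transformations per column keeps each derivation small and independently understandable, which is the property the abstract attributes to agile, ad-hoc analysis.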
Article
Full-text available
Self-service business intelligence and analytics (SSBIA) empowers non-IT users to create reports and analyses independently. SSBIA methods and processes are discussed in the context of an increasing number of application scenarios. However, previous research on SSBIA has made distinctions among these scenarios only to a limited extent. These scenarios include a wide variety of activities ranging from simple data retrieval to the application of complex algorithms and methods of analysis. The question of which dimensions are suitable for differentiating SSBIA application scenarios remains unanswered. In this article, we develop a taxonomy to distinguish among SSBIA applications more effectively by analyzing the relevant scientific literature and current SSBIA tools as well as by conducting a case study in a company. Both researchers and practitioners can use this taxonomy to describe and analyze SSBIA scenarios in further detail. In this way, the opportunities and challenges associated with SSBIA application can be identified more clearly. In addition, we conduct a cluster analysis based on the SSBIA tools thus analyzed. We identify three archetypes that describe typical SSBIA tools. These archetypes identify the application scenarios that are addressed most frequently by SSBIA tool providers. We conclude by highlighting the limitations of this research and suggesting an agenda for future research.
Chapter
Full-text available
This study performed a content analysis of 30 peer-reviewed scientific publications (1996–2016) that describe algorithm models applied to data wrangling in Big Data. The analysis explores and evaluates the algorithm models applied across data applications in the area of data wrangling. Data wrangling unifies messy and complex data through a planned procedure of clustering and grouping untidy and intricate data sets for easy access, in order to surface trends useful for business and company planning. Data wrangling serves not only businesses but also individuals, business users who consume data directly in reports, and pipelines that further process data by streaming it into targets such as data warehouses and data lakes. Such streaming procedures are exceptionally useful for small and large businesses around the world that use data continuously to identify emerging trends, structures and schemes, and to sustain and customize their business by streaming data into warehouses, in other words data storage pools. The study found that major data applications rely on commonly used statistical measures and algorithms, while the information technology area faces notable security challenges. Data wrangling algorithms used in different data applications, such as medical data, textual data, financial data, topological data, governmental data, educational science and galaxy data, could use clustering methods, which proved more effective than others. The study identified significant comparisons and contrasts between algorithms across data applications and evaluated them to identify methods superior to others.
Moreover, it shows that medical data is used significantly in the Big Data research area. Our results show that data wrangling with clustering algorithms can solve medical data storage issues, and clustering algorithms are frequently used to cluster data sets in order to extract information from raw data. Fifty percent of the literature found clustering algorithms beneficial as a data wrangling method across the different data applications analyzed. Based on the analysis of clustering algorithms, suggestions are made for their use in medical data applications for data wrangling purposes.
Chapter
The existing object-oriented approach to building systems places at center stage a software object that implements the functions assigned to it, which leads to a mismatch between the resulting solutions and the domain. This applies both to information systems in general and to databases in particular. The object approach uses concepts with a concrete sense, which greatly limits the variability of systems, including configurable systems. Solving this problem requires a transition to a subject-oriented approach, at the center of which is placed the process of obtaining the object. This approach requires working with concepts whose sense is implicit and is specified in different ways depending on the situation. In turn, storage under a subject-oriented approach should provide functionality extended in comparison with a database. The paper proposes a model of subject-oriented storage of concept sense for configurable information systems and a method for its implementation.
Article
Full-text available
Tableau is a commercial business intelligence (BI) software tool that supports interactive, visual analysis of data. Armed with a visual interface to data and a focus on usability, Tableau enables a wide audience of end-users to gain insight into their datasets. The user experience is a fluid process of interaction in which exploring and visualizing data takes just a few simple drag-and-drop operations (no programming or DB experience necessary). In this context of exploratory, ad-hoc visual analysis, we describe a novel approach to integrating large, heterogeneous data sources. We present a new feature in Tableau called data blending, which gives users the ability to create data visualization mashups from structured, heterogeneous data sources dynamically without any upfront integration effort. Users can author visualizations that automatically integrate data from a variety of sources, including data warehouses, data marts, text files, spreadsheets, and data cubes. Because our data blending system is workload driven, we are able to bypass many of the pain-points and uncertainty in creating mediated schemas and schema-mappings in current pay-as-you-go integration systems.
Article
Full-text available
We present the concept-oriented model (COM) and demonstrate how its three main structural principles — duality, inclusion and partial order — naturally account for various typical data modeling issues. We argue that elements should be modeled as identity-entity couples and describe how a novel data modeling construct, called concept, can be used to model simultaneously two orthogonal branches: identity modeling and entity modeling. We show that it is enough to have one relation, called inclusion, to model value extension, hierarchical address spaces (via reference extension), inheritance and containment. We also demonstrate how partial order relation represented by references can be used for modeling multidimensional schemas, containment and domain-specific relationships.
Conference Paper
Full-text available
In spite of its fundamental importance, inference has not been an inherent function of multidimensional models and analytical applications. These models are mainly aimed at numeric analysis where the notion of inference is not well defined. In this paper we define inference using only multidimensional terms like axes and coordinates as opposed to using logic-based approaches. We propose an inference procedure which is based on a novel formal setting of nested partially ordered sets with operations of projection and de-projection.
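The projection and de-projection operations mentioned above can be illustrated with a minimal Python sketch over a single reference (orders pointing to customers), which induces a partial order between the two sets. The data and function names are illustrative assumptions, not the paper's formal notation for nested partially ordered sets.

```python
# Hedged illustration of projection/de-projection along a reference.
# Each order references a customer: orders are "lesser" elements,
# customers "greater" elements in the induced partial order.

order_customer = {"o1": "alice", "o2": "bob", "o3": "alice"}

def project(orders):
    """Move up the partial order: orders -> referenced customers."""
    return {order_customer[o] for o in orders}

def deproject(custs):
    """Move down the partial order: customers -> referencing orders."""
    return {o for o, c in order_customer.items() if c in custs}

print(project({"o1", "o3"}))   # {'alice'}
print(deproject({"alice"}))    # {'o1', 'o3'}
```

In this multidimensional reading, projecting a subset of orders yields the coordinates (customers) they share, and de-projecting a customer recovers all orders at that coordinate, which is the inference step performed without any logic-based machinery.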
Article
We present a vision of next-generation visual analytics services. We argue that these services should have three related capabilities: support visual and interactive data exploration as they do today, but also suggest relevant data to enrich visualizations, and facilitate the integration and cleaning of that data. Most importantly, they should provide all these capabilities seamlessly in the context of an uninterrupted data analysis cycle. We present the challenges and opportunities in building next-generation visual analytics services.
Article
We report the opinions expressed by well-known database researchers on the future of the relational model and SQL during a panel at the International Workshop on Non-Conventional Data Access (NoCoDa 2012), held in Florence, Italy in October 2012 in conjunction with the 31st International Conference on Conceptual Modeling. The panelists include: Paolo Atzeni (Università Roma Tre, Italy), Umeshwar Dayal (HP Labs, USA), Christian S. Jensen (Aarhus University, Denmark), and Sudha Ram (University of Arizona, USA). Quotations from movies are used as a playful though effective way to convey the dramatic changes that database technology and research are currently undergoing.
Article
Analytics enables businesses to increase the efficiency of their activities and ultimately increase their profitability. As a result, it is one of the fastest growing segments of the database industry. There are two usages of the word analytics. The first refers to a set of algorithms and technologies, inspired by data mining, computational statistics, and machine learning, for supporting statistical inference and prediction. The second is equally important: analytical thinking. Analytical thinking is a structured approach to reasoning and decision making based on facts and data. Most of the recent work in the database community has focused on the first, the algorithmic and systems problems. The people behind these advances comprise a new generation of data scientists who have either the mathematical skills to develop advanced statistical models, or the computer skills to develop or implement scalable systems for processing large, complex datasets. The second aspect of analytics -- supporting the analytical thinker -- although equally important and challenging, has received much less attention. In this talk, I will describe recent advances in making both forms of analytics accessible to a broader range of people, whom I call data enthusiasts. A data enthusiast is an educated person who believes that data can be used to answer a question or solve a problem. These people are not mathematicians or programmers, and only know a bit of statistics. I'll review recent work on building easy-to-use, yet powerful, visual interfaces for working with data, and the analytical database technology needed to support these interfaces.
Chapter
In the paper we describe a novel query language, called the concept-oriented query language (COQL), and demonstrate how it can be used for data modeling and analysis. The query language is based on a novel construct, called concept, and two relations between concepts, inclusion and partial order. Concepts generalize conventional classes and are used for describing domain-specific identities. The inclusion relation generalizes inheritance and is used for describing hierarchical address spaces. The partial order among concepts is used to define two main operations: projection and de-projection. We demonstrate how these constructs are used to solve typical tasks in data modeling and analysis such as logical navigation, multidimensional analysis and inference.