Robert Wrembel
Poznan University of Technology · Institute of Computing Science

dr hab. inż. (PhD, DSc)

About

137 Publications · 25,686 Reads
1,123 Citations
Introduction
My current research focuses on methods for optimizing data processing workflows in integration systems (a.k.a. ETL processes in data warehousing, DPPs in data science).

Publications (137)
Chapter
The ability to query vast amounts of historical data for statistical analysis and reporting is provided by data warehouses. They significantly facilitate Business Intelligence for effective decision-making. In recent years, great progress has been made in movement-monitoring devices, such as smartphones and GPS receivers. The storing and managing of spatio-...
Chapter
Effective heat energy demand prediction is essential in combined heat and power systems. The algorithms considered so far do not sufficiently take into account computational costs and ease of implementation in industrial systems. However, computational cost is of key importance in edge and IoT systems, where prediction algorithms are constantly upd...
Conference Paper
The design and implementation of agro-ecology IoT applications is a non-trivial task since the data processed in such applications are typically complex and heterogeneous. Moreover, these applications are implemented using different systems and technologies, over complex IoT communication network layers (edge, fog, cloud). The existing system desig...
Chapter
This paper contributes a reference architecture of a reusable infrastructure for scientific experiments on data processing and data integration. The architecture is based on containerization and is integrated with an external machine learning cloud service to build performance models.
Chapter
Highly heterogeneous and fast-arriving large amounts of data, also known as Big Data, induced the development of novel data management technologies. In this paper, the members of the IFIP Working Group 2.6 share their expertise in some of these technologies, focusing on: recent advancements in data integration, metadata management, data quality, g...
Article
This paper presents an exhaustive and unified repository of judgment documents, called ECHR-DB, based on the European Court of Human Rights. The need for such a repository is explained through the prism of the researcher, the data scientist, the citizen, and the legal practitioner. Contrary to many open data repositories, the full creation proces...
Book
This volume, LNCS 12925, constitutes the papers of the 23rd International Conference on Big Data Analytics and Knowledge Discovery, held in September 2021. Due to the COVID-19 pandemic, it was held virtually. The 12 full papers presented together with 15 short papers in this volume were carefully reviewed and selected from a total of 71 submissions. The p...
Article
Usually, data in data warehouses (DWs) are stored using the notion of the multidimensional (MD) model. Often, DWs change in content and structure due to several reasons, like, for instance, changes in a business scenario or technology. For accurate decision-making, a DW model must allow storing and analyzing time-varying data. This paper addresses...
Chapter
A de facto technological standard of data science is based on notebooks (e.g., Jupyter), which provide an integrated environment to execute data workflows in different languages. However, from a data engineering point of view, this approach is typically inefficient and unsafe, as most of the data science languages process data locally, i.e., in wor...
Chapter
Optimizing Data Processing Pipelines (DPPs) is challenging in the context of both data warehouse and data science architectures. Few approaches to this problem have been proposed so far. The most challenging issue is to build a cost model of the whole DPP, especially if user-defined functions (UDFs) are used. In this paper, we address...
Chapter
This paper presents an exhaustive and unified dataset based on the European Court of Human Rights judgments since its creation. The interest of such a database is explained through the prism of the researcher, the data scientist, the citizen, and the legal practitioner. Contrary to many datasets, the creation process, from the collection of raw data...
Chapter
A data source integration layer, commonly called extract-transform-load (ETL), is one of the core components of information systems. It is applicable to standard data warehouse (DW) architectures as well as to data lake (DL) architectures. The ETL layer runs processes that ingest, transform, integrate, and upload data into a DW or DL. The ETL layer...
Chapter
Data sources (DSs) being integrated in a data warehouse frequently change their structures. As a consequence, in many cases, an already deployed ETL process stops its execution, generating errors. Since the number of deployed ETL processes may reach tens of thousands and structural changes in DSs are frequent, being able to (semi-)automatically r...
Book
This book constitutes thoroughly reviewed and selected papers presented at Workshops and Doctoral Consortium of the 24th East-European Conference on Advances in Databases and Information Systems, ADBIS 2020, the 24th International Conference on Theory and Practice of Digital Libraries, TPDL 2020, and the 16th Workshop on Business Intelligence and B...
Book
This book constitutes the thoroughly refereed short papers of the 24th European Conference on Advances in Databases and Information Systems, ADBIS 2020, held in August 2020. ADBIS 2020 was to be held in Lyon, France; however, due to the COVID-19 pandemic, the conference was held online. The 18 presented short research papers were carefully reviewe...
Book
This book constitutes the proceedings of the 24th European Conference on Advances in Databases and Information Systems, ADBIS 2020, held in Lyon, France, in August 2020. The 13 full papers presented were carefully reviewed and selected from 82 submissions. The papers cover a wide range of topics from different areas of research in database and inf...
Article
Data science requires constructing data processing pipelines (DPPs), which span diverse phases such as data integration, cleaning, pre-processing, and analysis. However, current solutions lack a strong data engineering perspective. As a consequence, DPPs are error-prone, inefficient w.r.t. human effort, and inefficient w.r.t. execution time. We clai...
Chapter
Metadata discovery is a prominent contributor towards understanding the semantics of data, relationships between data, and fundamental data features for the purpose of data management, query processing, and data integration. Metadata discovery is constantly evolving with the help of data profiling and manual annotators, resulting in various good qu...
Chapter
Today’s ETL tools provide capabilities for developing custom code as user-defined functions (UDFs) to extend the expressiveness of standard ETL operators. However, the custom code of a UDF may execute inefficiently due to poor implementation (e.g., the lack of parallel processing or adequate data structures). In this paper, we addres...
Article
This paper introduces the Information Systems special issue, including the four best papers submitted to DOLAP 2018. Additionally, the 20th anniversary of DOLAP motivated the analysis of DOLAP topics, as follows. First, the DOLAP topics of the recent 5 years were compared with those of VLDB, SIGMOD, and ICDE. Next, the DOLAP topics were analyzed with...
Book
This book constitutes the thoroughly refereed short papers, workshops and doctoral consortium papers of the 23rd European Conference on Advances in Databases and Information Systems, ADBIS 2019, held in Bled, Slovenia, in September 2019. The 19 short research papers and the 5 doctoral consortium papers were carefully reviewed and selected from 103...
Preprint
Metadata presents a medium for connection, elaboration, examination, and comprehension of relativity between two datasets. Metadata can be enriched to calculate the existence of a connection between different disintegrated datasets. In order to do so, the very first task is to attain a generic metadata representation for domains. This representatio...
Chapter
A concrete classification algorithm may perform differently on datasets with different characteristics, e.g., it might perform better on a dataset with continuous attributes rather than with categorical attributes, or the other way around. Typically, in order to improve the results, datasets need to be pre-processed. Taking into account all the pos...
Article
Big Data technology has discarded traditional data modeling approaches as no longer applicable to distributed data processing. It is, however, largely recognized that Big Data impose novel challenges in data and infrastructure management. Indeed, multiple components and procedures must be coordinated to ensure a high level of data quality and acces...
Article
Data pre-processing is one of the most time consuming and relevant steps in a data analysis process (e.g., classification task). A given data pre-processing operator (e.g., transformation) can have positive, negative or zero impact on the final result of the analysis. Expert users have the required knowledge to find the right pre-processing operato...
Article
This paper shows how big data analysis opens a range of research and technological problems and calls for new approaches. We start with defining the essential properties of big data and discussing the main types of data involved. We then survey the dedicated solutions for storing and processing big data, including a data lake, virtual integration,...
Article
In this paper, we discuss the state of the art and current trends in designing and optimizing ETL workflows. We explain the existing techniques for: (1) constructing a conceptual and a logical model of an ETL workflow, (2) its corresponding physical implementation, and (3) its optimization, illustrated by examples. The discussed techniques are anal...
Conference Paper
Business Intelligence (BI) systems help organisations to monitor the fulfillment of business goals by means of tracking various Key Performance Indicators (KPIs). Data Warehouses (DWs) supply data to compute KPIs and therefore, are an important component of any BI system. While designing a DW to monitor KPIs, the following two important questions a...
Conference Paper
Predictive analytics provides organisations with insights about future outcomes. Despite the hype around it, not many organizations are using it. Organisations still rely on the descriptive insights provided by the traditional business intelligence (BI) solutions. The barriers to adopt predictive analytics solutions are that businesses struggle to...
Conference Paper
Completeness as one of the four major dimensions of data quality is a pervasive issue in modern databases. Although data imputation has been studied extensively in the literature, most of the research is focused on inference-based approach. We propose to harness Web tables as an external data source to effectively and efficiently retrieve missing d...
Article
A data mining algorithm may perform differently on datasets with different characteristics, e.g., it might perform better on a dataset with continuous attributes rather than with categorical attributes, or the other way around. Typically, a dataset needs to be pre-processed before being mined. Taking into account all the possible pre-processing ope...
Book
This book constitutes the thoroughly refereed short papers, workshops and doctoral consortium papers of the 21st East European Conference on Advances in Databases and Information Systems, ADBIS 2017, held in Nicosia, Cyprus, in September 2017. The 25 full and 4 short workshop papers and the 12 short papers of the main conference were caref...
Conference Paper
Ubiquitous devices and applications generate data that are naturally ordered by time. Thus, elementary data items can form sequences. The most popular way of analyzing sequences is searching for patterns. To this end, sequential pattern discovery techniques were proposed in some research contributions and implemented in a few database systems, e.g.,...
Conference Paper
Multiple applications, e.g., energy consumption meters and temperature or pressure sensors, generate series of discrete data. Such data have two characteristics, namely: they are naturally ordered by time and are frequently represented as intervals. Most of the research contributions, commercial software, or prototypes either (1) allow to analyze set o...
Conference Paper
A data mining algorithm may perform differently on datasets with different characteristics, e.g., it might perform better on a dataset with continuous attributes rather than with categorical attributes, or the other way around. As a matter of fact, a dataset usually needs to be pre-processed. Taking into account all the possible pre-processing oper...
Conference Paper
A Data Warehouse (DW) is one of the main components of every BI system. It has been convincingly argued that the success of BI projects can be strongly affected by the Requirements Engineering (RE) phase, when the requirements of a DW are captured. Multiple RE methods for DWs have been proposed which have goal models in the core of their approach....
Conference Paper
This paper proposes an approach to represent and analyze the content of workflow logs in a data warehouse. When analyzing workflow logs one big problem arises: typically, an underlying workflow model consists of loops (frequently interleaving), often implemented by using goto-statements. These structures increase the number of possible execution pa...
Conference Paper
Ubiquitous devices and applications generate data whose natural feature is order. Most of the commercial software and research prototypes for data analytics allow analyzing set-oriented data, neglecting their order. However, by analyzing both data and their order dependencies, one can discover new business knowledge. Few solutions in this field h...
Book
This book constitutes the refereed proceedings of the 5th IFIP TC 5 International Conference on Computer Science and Its Applications, CIIA 2015, held in Saida, Algeria, in May 2015. The 56 revised papers presented were carefully reviewed and selected from 225 submissions. The papers are organized in the following four research tracks: computationa...
Conference Paper
The ability to analyze data organized as sequences of events or intervals has become important, since such data are now ubiquitous in modern applications. In this paper, we propose a formal model and briefly discuss a prototypical implementation for processing interval data in an OLAP style. The fundamental constructs of the formal model include: events...
Article
Numerous modern applications generate huge sets of data whose natural feature is order, e.g., sensor installations, RFID devices, workflow systems, website monitors, health care applications. By analyzing the data and their order dependencies, one can acquire new knowledge. However, today's commercial BI technologies and research prototypes allo...
Article
One of the important research and technological issues in data warehouse performance is the optimization of analytical queries. Most of the research has been focusing on optimizing such queries by means of materialized views, data and index partitioning, as well as various index structures including: join indexes, bitmap join indexes, multidimensi...
Article
In a typical BI infrastructure, data, extracted from operational data sources, is transformed, cleansed, and loaded into a data warehouse by a periodic ETL process, typically executed on a nightly basis, i.e., a full day's worth of data is processed and loaded during off-hours. However, it is desirable to have fresher data for business insights at...
Book
This volume is the second one of the 16th East-European Conference on Advances in Databases and Information Systems (ADBIS 2012), held on September 18-21, 2012, in Poznań, Poland. The first volume was published in the LNCS series. This volume includes 27 research contributions, selected out of 90. The contributions cover a wide spectrum of topic...
Conference Paper
Nowadays, business intelligence technologies allow analyzing mainly set-oriented data, without considering order dependencies between data. Few approaches to analyzing data of sequential order have been proposed so far. Nonetheless, for storing and manipulating sequential data, the approaches use either the relational data model or its extensions. W...
Conference Paper
In this paper, we propose a benchmark, called RTDW-bench, for testing the performance of a real-time data warehouse. The benchmark is based on TPC-H. In particular, RTDW-bench permits verifying whether an already deployed RTDW is able to handle, without any delays, a transaction stream of a given arrival rate. The benchmark also includes an algorithm fo...
Conference Paper
Online or real-time BI has remained elusive despite significant efforts by academic and industrial research. Some of the most prominent problems in accomplishing faster turnaround are related to the data ingest. The process of extracting data from source systems, transforming, and loading (ETL) it is often bottlenecked by architectural choices and...
Article
One of the important research and technological problems in data warehouse query optimization concerns star queries. So far, most of the research focused on optimizing such queries by means of join indexes, bitmap join indexes, or various multidimensional indexes. These structures neither support navigation well along dimension hierarchies nor opti...
Chapter
Data stored in a data warehouse (DW) are retrieved and analyzed by complex analytical applications, often expressed by means of star queries. Such queries often scan huge volumes of data and are computationally complex. For this reason, an acceptable (or good) DW performance is one of the important features that must be guaranteed for DW users. Goo...
Article
Bitmap indexes are data structures applied to indexing attributes in databases and data warehouses. A drawback of a bitmap index is that its size increases when the domain of an indexed attribute increases. As a consequence, for wide domains, the size of a bitmap index is too large to be efficiently processed. Hence, various techniques of compressi...
Chapter
A data warehouse architecture (DWA) has been developed for the purpose of integrating data from multiple heterogeneous, distributed, and autonomous external data sources (EDSs) as well as for providing means for advanced analysis of integrated data. The major components of this architecture include: an external data source (EDS) layer, and extracti...
Conference Paper
Bitmap indexes are one of the basic data structures applied to query optimization in data warehouses. The size of a bitmap index strongly depends on the domain of an indexed attribute, and for wide domains it is too large to be efficiently processed. For this reason, various techniques of compressing bitmap indexes have been proposed. Typically, co...
Conference Paper
One of the important research and technological problems in data warehousing is the optimization of star queries. So far, most of the research focused on optimizing such queries by means of join indexes and bitmap join indexes. In this paper we propose an index, called Time-HOBI, for optimizing star queries and computing aggregates along dimension...
Chapter
Methods of designing a data warehouse (DW) usually assume that its structure is static. In practice, however, a DW structure changes among others as the result of the evolution of external data sources and changes of the real world represented in a DW. The most advanced research approaches to this problem are based on temporal extensions and versio...
Conference Paper
In this paper we propose a hierarchically organized bitmap index (HOBI) for optimizing star queries that filter data and compute aggregates along a dimension hierarchy. HOBI is created on a dimension hierarchy. The index is composed of hierarchically organized bitmap indexes, one bitmap index for one dimension level. It supports range predicates on...
Article
In this paper we propose a technique of compressing bitmap indexes for application in data warehouses. This technique, called run-length Huffman (RLH), is based on run-length encoding and on Huffman encoding. Additionally, we present a variant of RLH, called RLH-N. In RLH-N a bitmap is divided into N-bit words that are compressed by RLH. RLH and RL...
Article
Methods of designing a data warehouse (DW) usually assume that its structure is static. In practice, however, a DW structure changes among others as the result of the evolution of external data sources and changes of the real world represented in a DW. The most advanced research approaches to this problem are based on temporal extensions and versio...
Article
The data warehouse (DW) technology is developed in order to support the integration of external data sources (EDSs) for the purpose of advanced data analysis by On-Line Analytical Processing (OLAP) applications. Since contents and structures of integrated EDSs may evolve in time, the content and schema of a DW must evolve too in order to correctly...
Article
Method materialization is a promising data access optimization technique for multiple applications, including, in particular object programming languages with persistence, object databases, distributed computing systems, object-relational data warehouses, multimedia data warehouses, and spatial data warehouses. A drawback of this technique is that...