Eike Schallehn’s research while affiliated with Otto-von-Guericke University Magdeburg and other places


Publications (6)


Exploring Spark-SQL-Based Entity Resolution Using the Persistence Capability: 14th International Conference, BDAS 2018, Held at the 24th IFIP World Computer Congress, WCC 2018, Poznan, Poland, September 18-20, 2018, Proceedings
  • Chapter
  • Full-text available

August 2018 · 91 Reads · 2 Citations

Communications in Computer and Information Science

Eike Schallehn · [...]

Performance Comparison of Three Spark-Based Implementations of Parallel Entity Resolution: DEXA 2018 International Workshops, BDMICS, BIOKDD, and TIR, Regensburg, Germany, September 3–6, 2018, Proceedings

August 2018 · 171 Reads · 3 Citations

Communications in Computer and Information Science

During the last decade, several big data processing frameworks have emerged that enable users to analyze large-scale data with ease. These frameworks make it easier for developers to handle distributed programming, failures, and data partitioning. Entity resolution is a typical application that requires big data processing frameworks, since its time complexity grows quadratically with the input size. In recent years, Apache Spark has become a popular big data framework, providing a flexible programming model that supports in-memory computation. Spark offers three APIs: RDDs, which give users core low-level data access, and the high-level DataFrame and Dataset APIs, which are part of the Spark SQL library and benefit from query optimization. Given their different features, the choice of API can be expected to influence the resulting performance of applications. However, few studies offer experimental measurements characterizing the effect of this choice. In this paper we evaluate the performance impact of the API choice for the specific application of parallel entity resolution under two different scenarios, with the goal of offering practical guidelines for developers.
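The quadratic cost the abstract mentions is the reason blocking techniques are central to parallel entity resolution. The following is a hypothetical pure-Python sketch (not the paper's Spark implementation; the record fields and blocking key are invented) contrasting naive all-pairs comparison with a simple blocking scheme that only compares records sharing a key:

```python
from itertools import combinations

def naive_pairs(records):
    # O(n^2): every record is compared with every other record.
    return list(combinations(records, 2))

def blocked_pairs(records, key):
    # Group records by a blocking key; only compare within each block.
    blocks = {}
    for r in records:
        blocks.setdefault(key(r), []).append(r)
    pairs = []
    for block in blocks.values():
        pairs.extend(combinations(block, 2))
    return pairs

people = [
    {"name": "Anna Schmidt", "zip": "39104"},
    {"name": "Ana Schmidt",  "zip": "39104"},
    {"name": "Bernd Meier",  "zip": "10115"},
    {"name": "B. Meier",     "zip": "10115"},
]

print(len(naive_pairs(people)))                        # 6 candidate pairs
print(len(blocked_pairs(people, lambda r: r["zip"])))  # 2 candidate pairs
```

In a Spark setting, the same grouping step would be expressed as a shuffle (e.g., grouping by key), regardless of whether the RDD or DataFrame API is used.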


Analyzing data quality issues in research information systems via data profiling

March 2018 · 432 Reads · 78 Citations

International Journal of Information Management

The success or failure of a research information system (RIS) in a scientific institution largely depends on the quality of the data available as a basis for the RIS applications. Even the most sophisticated Business Intelligence (BI) tools (reporting, etc.) are worthless when they display incorrect, incomplete, or inconsistent data. An integral part of every RIS is therefore the integration of data from the operative systems. Before starting the integration (ETL) process for a source system, a thorough analysis of the source data is required. With the support of a data quality check, the causes of quality problems can usually be detected. The corresponding analyses are performed with data profiling, which provides a good picture of the state of the data. In this paper, data profiling methods are presented that give an overview of the quality of the data in the source systems before their integration into the RIS. With the help of data profiling, scientific institutions can not only evaluate their research information and report on its quality, but also examine the dependencies and redundancies between data fields and correct them more effectively within their RIS.
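The kind of pre-ETL profiling the abstract describes typically starts with per-column statistics. A minimal illustrative sketch (the function and field names are invented, not from the paper) that reports completeness, distinctness, and value-pattern frequencies for one source column:

```python
from collections import Counter
import re

def profile_column(values):
    # Treat None and empty strings as missing.
    non_null = [v for v in values if v not in (None, "")]
    # Abstract each value into a pattern: digits -> 9, letters -> A.
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v)) for v in non_null
    )
    return {
        "count": len(values),
        "completeness": len(non_null) / len(values),
        "distinct": len(set(non_null)),
        "patterns": patterns.most_common(3),
    }

# A "publication year" column with missing and malformed entries.
years = ["2017", "2018", "2018", "", "18", None]
report = profile_column(years)
print(report["distinct"])   # 3
print(report["patterns"])   # [('9999', 3), ('99', 1)]
```

The pattern histogram immediately exposes the malformed two-digit year, which is exactly the sort of inconsistency a profiling pass is meant to surface before integration.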


Table 2: Classification based on the general criteria, grouped by programming models
Table 3: Effectiveness consideration of parallel DBMS ER
Table 4: Effectiveness consideration of Spark-based ER
Table 5: Effectiveness consideration of MapReduce-based ER
Table 6: Efficiency consideration of parallel DBMS ER
(+3 more)

Cloud-Scale Entity Resolution: Current State and Open Challenges

March 2018 · 292 Reads · 16 Citations

Entity resolution (ER) is the process of identifying records in information systems that refer to the same real-world entity. Because data volumes have grown enormously over the past two decades, parallel techniques are called upon to satisfy the ER requirements of high performance and scalability. The development of parallel ER has reached a relatively mature stage and has found its way into several applications. In this work, we first comprehensively survey the state of the art of parallel ER approaches. From this overview, we then extract classification criteria for parallel ER, and classify and compare the approaches based on these criteria. Finally, we identify open research questions and challenges and discuss potential solutions and further research directions in this field.


A Self-tuning Framework for Cloud Storage Clusters

September 2015 · 22 Reads

Lecture Notes in Computer Science

The well-known problems of tuning and self-tuning data management systems are amplified in Cloud environments, which promise self-management along with properties like elasticity and scalability. The intricate characteristics of Cloud storage systems, such as their modular, distributed, and multi-layered architecture, add to the complexity of the tuning and self-tuning process. In this paper, we provide an architecture for a self-tuning framework for Cloud data storage clusters. The framework consists of components that observe and model certain performance criteria and a decision model that adjusts tuning parameters according to specified requirements. As part of its implementation, we provide an overview of the benchmarking and performance modeling components along with experimental results.
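The observe/decide/adjust cycle described in the abstract can be sketched as a simple feedback loop. This is a hypothetical stand-in, not the framework's actual API: the metric (latency), the tuning parameter (a cache size), and the workload model are all invented for illustration.

```python
def tune(observe_latency_ms, target_ms, cache_mb, step_mb=64,
         min_mb=64, max_mb=1024, rounds=5):
    """Crude decision model: grow the cache while latency misses the
    target, shrink it back when there is ample headroom."""
    for _ in range(rounds):
        latency = observe_latency_ms(cache_mb)
        if latency > target_ms and cache_mb < max_mb:
            cache_mb = min(max_mb, cache_mb + step_mb)   # too slow: add cache
        elif latency < 0.5 * target_ms and cache_mb > min_mb:
            cache_mb = max(min_mb, cache_mb - step_mb)   # overprovisioned
    return cache_mb

# Toy workload model standing in for the benchmarking component:
# latency falls as the cache grows.
model = lambda mb: 20000 / mb

print(tune(model, target_ms=50, cache_mb=128))  # 448
```

A real decision model would of course replace the toy workload model with measurements from the benchmarking and performance-modeling components the paper describes.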


Citations (2)


... ER schemes may be evaluated from multiple perspectives [8] [9]: (1) Effectiveness or performance of clustering (for example in terms of recall and precision); (2) Efficiency, or the number of queries required per sample to achieve this performance; (3) Operation and scalability, i.e., whether the scheme is adaptive or non-adaptive, whether it runs online or in batch, and whether it is parallelizable; and, (4) Genericity, or how and whether the scheme may be applied to different scenarios. For example, the Jaccard similarity function is popular when only dealing with textual data. ...
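The citing passage names Jaccard similarity as a popular choice for textual data; a minimal token-set version, for illustration only:

```python
def jaccard(a: str, b: str) -> float:
    # Similarity of two strings as token sets: |A ∩ B| / |A ∪ B|.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

print(jaccard("Otto von Guericke University", "University Otto von Guericke"))  # 1.0
print(jaccard("data profiling", "data cleaning"))                               # ~0.33
```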

Reference:

How to Query an Oracle? Efficient Strategies to Label Data
Cloud-Scale Entity Resolution: Current State and Open Challenges

... Data profiling can help address DQ issues such as noise and outliers, inconsistent data, duplicate data, and missing values (Azeroual et al., 2018) (Taleb et al., 2019). Class imbalance issues develop during classification because of the data's minimal class representation, and as a result, the dominant class is favorably viewed by the classifiers. ...

Analyzing data quality issues in research information systems via data profiling
  • Citing Article
  • March 2018

International Journal of Information Management