Article

Problems, methods, and challenges in comprehensive data cleansing

Authors: Heiko Müller and Johann-Christoph Freytag

Abstract

Cleansing data from impurities is an integral part of data processing and maintenance. This has led to the development of a broad range of methods intended to enhance the accuracy and thereby the usability of existing data. This paper presents a survey of data cleansing problems, approaches, and methods. We classify the various types of anomalies occurring in data that have to be eliminated, and we define a set of quality criteria that comprehensively cleansed data has to accomplish. Based on this classification we evaluate and compare existing approaches for data cleansing with respect to the types of anomalies handled and eliminated by them. We also describe in general the different steps in data cleansing, specify the methods used within the cleansing process, and give an outlook on research directions that complement the existing systems.
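To make the anomaly classes concrete, the short sketch below runs a few typical checks with pandas; the table, column names, and validation rules are invented for illustration and are not taken from the paper.

```python
import pandas as pd

# Hypothetical customer records; columns and rules are illustrative only.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.org", "not-an-email", None, "d@example.org"],
    "age": [34, -5, 28, 41],
})

# Missing values (coverage anomaly)
missing = df.isna().sum()

# Duplicate keys (uniqueness/integrity anomaly)
duplicate_keys = df[df.duplicated(subset="customer_id", keep=False)]

# Syntactic anomaly: values that do not match a simple e-mail pattern
bad_format = df[~df["email"].fillna("").str.match(r"[^@\s]+@[^@\s]+\.[^@\s]+")]

# Semantic/domain anomaly: values outside a plausible range
out_of_domain = df[(df["age"] < 0) | (df["age"] > 120)]

print(missing, duplicate_keys, bad_format, out_of_domain, sep="\n\n")
```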
... Many research endeavors delve into the realm of data cleaning and outlier detection algorithms, yet our focus rests on selecting the most crucial ones that encapsulate the breadth of research and studies in this domain. The following is a selection of notable articles that contribute to this discourse: E Rahm et al. [6] delve into the intricacies of data cleaning: problems and current approaches, while H Müller et al. [7] provide insights into problems, methods, and challenges in comprehensive data cleansing. The work of Xu et al. [8] sheds light on data cleaning in the process industries, while Chu et al. [9] expound upon data cleaning: overview and emerging challenges. ...
... Then for each subset we will split them into small x_train/x_test and y_train/y_test datasets and reshape our data into the same shape as the original data (which is 1,120). We see that the 9th dataset contains a single row, which automatically means that it's an outlier. We will also consider (3,5,7,8,9,11,12,13,14) as a group of outliers based on the size difference between them and the other subsets. After that is done, we simply select all rows that don't contain outliers and summarize the shape of our dataset. Then we will fit our model and evaluate it with the y_hat column and calculate the validation score between the predicted y_eval and the actual y_test. ...
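The excerpt above describes its workflow only in prose; the following minimal scikit-learn sketch is a hypothetical reconstruction of it. The subset formation (k-means), the size threshold, and the linear model are all assumptions, since the excerpt does not say how its subsets are built.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                      # placeholder features
y = X @ rng.normal(size=20) + rng.normal(size=1000)  # placeholder target

# Partition the data into subsets (here via k-means, an assumption).
labels = KMeans(n_clusters=15, n_init=10, random_state=0).fit_predict(X)

# Treat very small subsets as groups of outliers, mirroring the size-based
# argument in the excerpt, and keep only rows from the remaining subsets.
sizes = np.bincount(labels)
keep = np.isin(labels, np.where(sizes >= 20)[0])
X_clean, y_clean = X[keep], y[keep]
print("shape after outlier removal:", X_clean.shape)

# Fit and evaluate on the cleaned data.
X_train, X_test, y_train, y_test = train_test_split(X_clean, y_clean, random_state=0)
model = LinearRegression().fit(X_train, y_train)
y_hat = model.predict(X_test)
print("validation score:", r2_score(y_test, y_hat))
```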
Article
Data cleaning, also referred to as data cleansing, constitutes a pivotal phase in data processing subsequent to data collection. Its primary objective is to identify and eliminate incomplete data, duplicates, outdated information, anomalies, missing values, and errors. The influence of data quality on the effectiveness of machine learning (ML) models is widely acknowledged, prompting data scientists to dedicate substantial effort to data cleaning prior to model training. This study accentuates critical facets of data cleaning and the utilization of outlier detection algorithms. Additionally, our investigation encompasses the evaluation of prominent outlier detection algorithms through benchmarking, seeking to identify an efficient algorithm boasting consistent performance. As the culmination of our research, we introduce an innovative algorithm centered on the fusion of Isolation Forest and clustering techniques. By leveraging the strengths of both methods, this proposed algorithm aims to enhance outlier detection outcomes. This work endeavors to elucidate the multifaceted importance of data cleaning, underscored by its symbiotic relationship with ML models. Furthermore, our exploration of outlier detection methodologies aligns with the broader objective of refining data processing and analysis paradigms. Through the convergence of theoretical insights, algorithmic exploration, and innovative proposals, this study contributes to the advancement of data cleaning and outlier detection techniques in the realm of contemporary data-driven environments.
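The abstract does not spell out how the fusion of Isolation Forest and clustering works, so the snippet below is only a plausible illustration of combining the two ideas, not the authors' algorithm: Isolation Forest supplies a global anomaly flag, k-means supplies a distance-to-centroid view, and a point is reported only when both agree.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest

# Synthetic data with a few injected anomalies.
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.0, random_state=42)
X = np.vstack([X, np.random.default_rng(42).uniform(-15, 15, size=(10, 2))])

# 1) Global view: Isolation Forest flags.
iso = IsolationForest(n_estimators=200, contamination="auto", random_state=42)
iso_flags = iso.fit_predict(X) == -1           # True where the forest flags an outlier

# 2) Local view: distance to the nearest cluster centre.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
dist_flags = dist > np.percentile(dist, 97.5)  # farthest 2.5% from their centre

# Fuse both views: flag a point only when the two methods agree.
outliers = iso_flags & dist_flags
print(f"{outliers.sum()} points flagged by both detectors")
```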
... Data cleaning deals with removing anomalies from a dataset in order to ensure its quality [4,11,17,22]. Detecting and repairing data with errors is one of the major challenges in data analysis, and failing to do so can result in unreliable decisions and poor study results [4,25]. For this reason, before attempting to analyze the data collected through the survey, we searched for errors and inconsistencies. ...
Preprint
Full-text available
Data collection is pervasively bound to our digital lifestyle. A recent study by the IDC reports that the growth of the data created and replicated in 2020 was even higher than in the previous years due to pandemic-related confinements to an astonishing global amount of 64.2 zettabytes of data. While not all the produced data is meant to be analyzed, there are numerous companies whose services/products rely heavily on data analysis. That is to say that mining the produced data has already revealed great value for businesses in different sectors. But to be able to fully realize this value, companies need to be able to hire professionals that are capable of gleaning insights and extracting value from the available data. We hypothesize that people nowadays conducting data-science-related tasks in practice may not have adequate training or formation. So in order to be able to fully support them in being productive in their duties, e.g. by building appropriate tools that increase their productivity, we first need to characterize the current generation of data scientists. To contribute towards this characterization, we conducted a public survey to fully understand who is doing data science, how they work, what are the skills they hold and lack, and which tools they use and need.
... This search method immensely helped to delimit the collection of required information and reduce the data that generated noise in the data cleaning and subsequent training. Therefore, it is possible to consider data cleaning as the totality of operations performed on the data to eliminate anomalies and obtain an accurate and unique representation (20). One of the major drawbacks was the high degree of duplicity in the information, since when searching by area of coverage and keywords, the information contained repeated results; this significantly increased the amount of information in the database. ...
Article
Full-text available
Context: This work aims to design and create a community-based early warning model as an alternative for the mitigation of disasters caused by stream overflow in Barranquilla (Colombia). This model is based on contributions from social networks, which are consulted through their API and filtered according to their location. Methods: With the information collected, cleaning and debugging are performed. Then, through natural language processing techniques, the texts are tokenized and vectorized, aiming to find the vector similarity between the processed texts and thus generate a classification. Results: The texts classified as dealing with stream overflow are processed again to obtain a location or assign a default one, in order for them to be georeferenced on a map that allows associating the risk zone and visualizing it in a web application to monitor and reduce the potential damage to the population. Conclusions: Three classification algorithms were selected (random forest, extra trees, and k-neighbors) to determine the best classifier. These three algorithms exhibited the best performance and R² for the data processed in the regressions. These algorithms were trained, with the k-neighbors algorithm exhibiting the best performance.
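A minimal sketch of the tokenize/vectorize/similarity/classify pipeline described above, assuming scikit-learn; the example posts are invented, and TF-IDF plus cosine similarity is an assumption, since the abstract does not name a specific vectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import KNeighborsClassifier

# Toy, hypothetical posts; real input would come from the social-network API.
posts = [
    "the stream overflowed and flooded the street",
    "heavy rain, water rising near the creek",
    "great concert downtown tonight",
    "traffic jam on the main avenue",
]
labels = [1, 1, 0, 0]   # 1 = stream-overflow related, 0 = unrelated

# Tokenize and vectorize the texts.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(posts)

# Vector similarity between the processed texts (cosine similarity matrix).
similarity = cosine_similarity(X)

# A k-neighbors classifier over the TF-IDF vectors, one of the three candidates.
clf = KNeighborsClassifier(n_neighbors=1).fit(X, labels)
new_post = vectorizer.transform(["the creek is overflowing again"])
print(clf.predict(new_post))   # expected: [1]
```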
... Data quality and data cleansing are closely intertwined. Data cleansing is a procedure that is performed on existing data to eliminate errors and enhance its suitability [7]. This involves identifying, detecting, and correcting errors through a prescribed process. ...
Article
Full-text available
The world is entering the digital age, where advancements in computer technology have resulted in the emergence of data-driven applications in the healthcare sector. Data science is a multidisciplinary domain that employs scientific methodologies such as data mining techniques, machine learning algorithms, and big data to derive knowledge and insights from many types of structured and unstructured data. The healthcare business produces extensive datasets containing valuable information regarding patient demographics, treatment regimens, and tumor sizes. This study will examine the procedures of data purification, data mining, data preparation, and data analysis employed in healthcare applications, using methods of literature review and analysis. Efficiently managing and analyzing data can greatly benefit the healthcare industry. By leveraging data science, the application of data-driven decision-making in healthcare holds the potential to enhance the quality of healthcare services.
... Data cleansing (also called data cleaning, data scrubbing, or data wrangling) is the process of detecting and correcting dirty data, which is typically a prerequisite for interactive visualization. Müller and Freytag (2003) describe data cleansing as a four-step process: data auditing, workflow specification, workflow execution, and post-processing/control. ...
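Purely as an illustration of how those four phases could be wired together, here is a small skeleton in pandas; the function bodies are trivial placeholders and do not reproduce the methods discussed in the paper.

```python
import pandas as pd

def audit(df: pd.DataFrame) -> dict:
    """Data auditing: detect anomalies and summarize them."""
    return {"missing": int(df.isna().sum().sum()),
            "duplicates": int(df.duplicated().sum())}

def specify_workflow(findings: dict) -> list:
    """Workflow specification: choose cleansing operations based on the audit."""
    steps = []
    if findings["missing"]:
        steps.append(lambda d: d.dropna())
    if findings["duplicates"]:
        steps.append(lambda d: d.drop_duplicates())
    return steps

def execute_workflow(df: pd.DataFrame, steps: list) -> pd.DataFrame:
    """Workflow execution: apply the specified operations in order."""
    for step in steps:
        df = step(df)
    return df

def postprocess_and_control(df: pd.DataFrame) -> bool:
    """Post-processing/control: verify the result; a failed check would trigger a new cycle."""
    return audit(df) == {"missing": 0, "duplicates": 0}

df = pd.DataFrame({"a": [1, 1, None, 3], "b": ["x", "x", "y", "z"]})
cleaned = execute_workflow(df, specify_workflow(audit(df)))
print(postprocess_and_control(cleaned))  # True
```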
Chapter
Full-text available
This chapter investigates in detail the characteristics of time and time-oriented data. Design aspects for modeling time and time-oriented data are introduced and discussed using examples. The chapter also sheds some light on data quality.
... Each approach has its advantages and limitations, and the choice will depend on the scope and complexity of the dataset to be cleaned and improved. On the other hand, authors such as Müller & Freytag (2003) establish an initial "data auditing" phase and a final "post-processing and control" phase. If tuples that were not corrected initially are found during this last phase, a new cycle of the data cleaning process begins, starting with auditing the data and looking for characteristics in exceptional data. ...
Article
Full-text available
The quality of data used in analysis and decision-making is of vital importance; however, studies that provide clear steps for carrying out this work with the Python programming language are scarce. The objective of this research is therefore to develop a guide for evaluating and improving data quality using the Python programming language. This research follows a qualitative approach, since the methodology for the process is described and applied to a practical case, which is measured through quality characteristics such as accuracy, completeness, freedom from errors, and added value. The results indicate that, by applying the proposed 12-step methodology in Python, the data meet the required quality characteristics.
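As a hedged illustration of measuring such quality characteristics in Python (this is not the 12-step guide from the article, and the dataset and domain rule are invented):

```python
import pandas as pd

# Hypothetical dataset; the 12 steps of the cited guide are not reproduced here.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "country": ["EC", "EC", "XX", None],   # "XX" is not in the allowed domain
    "revenue": [100.0, None, 250.0, 80.0],
})
valid_countries = {"EC", "CO", "PE"}

# Completeness: share of non-missing cells.
completeness = df.notna().to_numpy().mean()

# Free-of-errors / accuracy proxy: share of country codes passing a domain check.
accuracy = df["country"].isin(valid_countries).mean()

print(f"completeness = {completeness:.2%}, accuracy (country) = {accuracy:.2%}")
```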
Article
This review paper explores the impact of data analytics on guiding product development processes from conception to launch. It synthesizes findings from existing literature to outline how data-driven strategies can optimize each phase of product development, thereby enhancing efficiency and effectiveness in meeting market demands. The review spans various industries, highlighting the universality of data analytics applications in product innovation. The paper details how data analytics facilitates better decision-making through predictive insights into market trends and consumer preferences, which are crucial for defining product specifications and features. It also examines the role of data in refining production processes, ensuring quality control, and customizing marketing strategies to target potential customer segments effectively. Additionally, the review considers the benefits of continuous data evaluation during the product testing phase, enabling quicker adjustments and improvements. The findings indicate that data analytics significantly shortens the product development timeline and increases the likelihood of market success. Organizations leveraging data-driven insights from the outset of product development gain a competitive edge by creating more aligned and responsive products. The paper recommends broader adoption of robust data analytics tools and practices across industries to maximize product development outcomes.
Article
Full-text available
Vast amounts of life sciences data reside today in specialized data sources, with specialized query processing capabilities. Data from one source often must be combined with data from other sources to give users the information they desire. There are database middleware systems that extract data from multiple sources in response to a single query. IBM's DiscoveryLink is one such system, targeted to applications from the life sciences industry. DiscoveryLink provides users with a virtual database to which they can pose arbitrarily complex queries, even though the actual data needed to answer the query may originate from several different sources, and none of those sources, by itself, is capable of answering the query. We describe the DiscoveryLink offering, focusing on two key elements, the wrapper architecture and the query optimizer, and illustrate how it can be used to integrate the access to life sciences data from heterogeneous data sources.
Article
Full-text available
We classify data quality problems that are addressed by data cleaning and provide an overview of the main solution approaches. Data cleaning is especially required when integrating heterogeneous data sources and should be addressed together with schema-related data transformations. In data warehouses, data cleaning is a major part of the so-called ETL process. We also discuss current tool support for data cleaning.
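As a generic, hypothetical illustration of cleaning inside an ETL-style flow, not of the taxonomy or tools surveyed in the article, the sketch below extracts records from two differently structured sources, transforms and deduplicates them, and loads the result into an in-memory SQLite table standing in for the warehouse.

```python
import sqlite3
import pandas as pd

# Extract: two heterogeneous sources with different column names (made up here).
src_a = pd.DataFrame({"cust": ["Ann Lee", "Bob Roy"], "phone": ["555-0101", "5550102"]})
src_b = pd.DataFrame({"customer_name": ["ann lee", "Cia Dow"], "tel": ["555 0101", "555-0103"]})

# Transform: map to one schema, normalize values, then remove duplicates.
a = src_a.rename(columns={"cust": "name"})
b = src_b.rename(columns={"customer_name": "name", "tel": "phone"})
merged = pd.concat([a, b], ignore_index=True)
merged["name"] = merged["name"].str.title().str.strip()
merged["phone"] = merged["phone"].str.replace(r"[^0-9]", "", regex=True)
merged = merged.drop_duplicates()

# Load: write the cleaned table into the warehouse stand-in.
with sqlite3.connect(":memory:") as conn:
    merged.to_sql("customers", conn, index=False)
    print(pd.read_sql("SELECT * FROM customers", conn))
```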
Article
Extraction-Transformation-Loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization and insertion into a data warehouse. Literature and personal experience have guided us to conclude that the problems concerning the ETL tools are primarily problems of complexity, usability and price. To deal with these problems we provide a uniform metamodel for ETL processes, covering the aspects of data warehouse architecture, activity modeling, contingency treatment and quality management. The ETL tool we have developed, namely Arktos, is capable of modeling and executing practical ETL scenarios by providing explicit primitives for the capturing of common tasks. Arktos provides three ways to describe an ETL scenario: a graphical point-and-click front end and two declarative languages: XADL (an XML variant), which is more verbose and easy to read and SADL (an SQL-like language) which has a quite compact syntax and is, thus, easier for authoring.
Article
Integration of data sources opens up possibilities for new and valuable applications of data that cannot be supported by the individual sources alone. Unfortunately, many data integration projects are hindered by the inherent heterogeneities in the sources to be integrated. In particular, differences in the way that real world data is encoded within sources can cause a range of difficulties, not least of which is that the conflicting semantics may not be recognised until the integration project is well under way. Once identified, semantic conflicts of this kind are typically dealt with by configuring a data transformation engine, that can convert incoming data into the form required by the integrated system. However, determination of a complete and consistent set of data transformations for any given integration task is far from trivial. In this paper, we explore the potential application of techniques for integrity enforcement in supporting this process. We describe the design of a data reconciliation tool (LITCHI) based on these techniques that aims to assist taxonomists in the integration of biodiversity data sets. Our experiences have highlighted several limitations of integrity enforcement when applied to this real world problem, and we describe how we have overcome these in the design of our system.
Conference Paper
Integration and analysis of data from different sources have to deal with several problems resulting from potential heterogeneities. The activities addressing these problems are called data preparation and are supported by various available tools. However, these tools process mostly in a batch-like manner, not supporting the iterative and explorative nature of the integration and analysis process. The authors present a framework for important data preparation tasks based on a multidatabase language. This language offers features for solving common integration and cleaning problems as part of query processing. Combining data preparation mechanisms and multidatabase query facilities permits applying and evaluating different integration and cleaning strategies without explicit loading and materialization of data. The paper introduces the language concepts and discusses their application for individual tasks of data preparation
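The idea of performing cleaning as part of query processing, without a separate load and materialization step, can be illustrated generically with SQLite; the SQL below is plain SQL over two attached databases and is not the multidatabase language proposed in the paper, and the schemas and data are invented.

```python
import sqlite3

# Two SQLite files stand in for heterogeneous sources.
sources = [("src_a.db", [("Ann Lee", "a@ex.org "), ("Bob Roy", "b@ex.org")]),
           ("src_b.db", [("ANN LEE", "a@ex.org"), ("Cia Dow", "c@ex.org")])]
for name, rows in sources:
    conn = sqlite3.connect(name)
    with conn:
        conn.execute("CREATE TABLE IF NOT EXISTS people(name TEXT, email TEXT)")
        conn.execute("DELETE FROM people")
        conn.executemany("INSERT INTO people VALUES (?, ?)", rows)
    conn.close()

# Query both sources at once; normalization and duplicate elimination happen
# inside the query, with no explicit loading or materialization of the data.
conn = sqlite3.connect("src_a.db")
conn.execute("ATTACH DATABASE 'src_b.db' AS b")
cleaned = conn.execute("""
    SELECT DISTINCT lower(trim(name)) AS name, lower(trim(email)) AS email
    FROM (SELECT name, email FROM people
          UNION ALL
          SELECT name, email FROM b.people)
""").fetchall()
print(cleaned)   # the Ann Lee record from both sources collapses to a single row
conn.close()
```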
Article
A broad spectrum of data is available on the Web in distinct heterogeneous sources, stored under different formats (a specific database vendor format, XML, LaTeX documents, etc.). As the number of systems that utilize this data grows, the importance of data conversion mechanisms increases greatly. We present here an overview of a French-Israeli research project aimed at developing tools to simplify the intricate task of data translation. The solution is based on a middleware data model to which various data sources are mapped, and a declarative language for specifying translations within the middleware model. A complementary schema-based mechanism is used to automate some of the translation. Some particular aspects of the solution are detailed in [3, 7, 10].