Table 1 - uploaded by Fábio Lobato
Example of a dataset with missing values.

Source publication
Conference Paper
Data analysis plays an important role in our Information Era; however, most statistical and machine learning algorithms were not developed to tackle the ubiquitous issue of missing values. In pattern classification, several strategies have been proposed to handle this problem, where missing data imputation is the most used one, which can be view...

Context in source publication

Context 1
... said previously, missing values can be defined as the absence of information in instances, which brings harmful consequences to the validity of the subsequent analyses. Table 1 shows an example of a dataset with missing values. Instances 1-5 are commonly called "complete cases" because they have no missing values, while instances 6-8 are called "incomplete cases" because they have missing values, usually represented by "?". ...
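The split between complete and incomplete cases described in the excerpt above can be illustrated with a minimal Python sketch. The toy rows and the helper name `split_cases` are hypothetical and are not taken from Table 1; only the "?" marker for a missing value comes from the excerpt.

```python
# Hypothetical toy dataset (not the contents of Table 1).
# "?" marks a missing value, as in the excerpt above.
MISSING = "?"

dataset = [
    [5.1, 3.5, "a"],
    [4.9, MISSING, "b"],
    [6.2, 2.9, "a"],
    [MISSING, 3.1, MISSING],
]


def split_cases(rows, missing=MISSING):
    """Return (complete, incomplete): rows without and with missing values."""
    complete, incomplete = [], []
    for row in rows:
        if any(value == missing for value in row):
            incomplete.append(row)
        else:
            complete.append(row)
    return complete, incomplete


complete_cases, incomplete_cases = split_cases(dataset)
print(len(complete_cases), "complete,", len(incomplete_cases), "incomplete")
```

Discarding `incomplete_cases` outright corresponds to case deletion; imputation methods instead try to estimate the "?" entries.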

Similar publications

Article
Radar working state recognition is the basis of cognitive electronic countermeasures. Aiming at the problem that the traditional supervised recognition technology is difficult to obtain prior information and process the incremental signal data stream, an unsupervised and incremental recognition method is proposed. This method is based on a backprop...

Citations

... Priya and Kuppuswami (2012), a GA is used to impute missing values in discrete features. In Patil and Bichkar (2010) and Lobato et al. (2015b), GA-based imputation is also used for classification with missing values. In Gautam and Ravi (2015), two imputation methods are proposed by combining particle swarm optimization (PSO), auto-associative extreme learning machine, and evolving clustering method. ...
Article
Incompleteness is one of the problematic data quality challenges in real-world machine learning tasks. A large number of studies have been conducted for addressing this challenge. However, most of the existing studies focus on the classification task and only a limited number of studies for symbolic regression with missing values exist. In this work, a new imputation method for symbolic regression with incomplete data is proposed. The method aims to improve both the effectiveness and efficiency of imputing missing values for symbolic regression. This method is based on genetic programming (GP) and weighted K-nearest neighbors (KNN). It constructs GP-based models using other available features to predict the missing values of incomplete features. The instances used for constructing such models are selected using weighted KNN. The experimental results on real-world data sets show that the proposed method outperforms a number of state-of-the-art methods with respect to the imputation accuracy, the symbolic regression performance, and the imputation time.
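As a rough illustration of the weighted KNN component mentioned in this abstract, the sketch below imputes a missing value directly as an inverse-distance-weighted average of its nearest neighbours. This is a simplification and an assumption: in the abstract, weighted KNN selects the instances used to build GP models, and the GP step is not reproduced here. The function names, the use of `None` for missing values, and the toy data are all hypothetical.

```python
import math


def distance(a, b, skip):
    """Euclidean distance over features observed in both rows, ignoring index `skip`."""
    shared = [(x, y) for i, (x, y) in enumerate(zip(a, b))
              if i != skip and x is not None and y is not None]
    if not shared:
        return math.inf  # no common observed feature: rows are not comparable
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(rows, target_row, feature, k=2):
    """Estimate rows[target_row][feature] from the k nearest rows that observe it."""
    query = rows[target_row]
    candidates = []
    for i, row in enumerate(rows):
        if i == target_row or row[feature] is None:
            continue
        d = distance(query, row, feature)
        if math.isfinite(d):
            candidates.append((d, row[feature]))
    candidates.sort(key=lambda pair: pair[0])
    nearest = candidates[:k]
    if not nearest:
        raise ValueError("no usable neighbours")
    # Inverse-distance weights: closer neighbours contribute more.
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)


# Hypothetical data: None marks the value to impute.
data = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 11.0],
    [5.0, 6.0, 50.0],
    [1.05, 2.05, None],
]
estimate = knn_impute(data, target_row=3, feature=2, k=2)
print(round(estimate, 2))
```

The distant third row is excluded by `k=2`, so the estimate falls between the values of the two nearby rows.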
... Moreover, studies reported that the best predictive accuracy did not necessarily lead to the lowest classification bias. This suggests that the modelling task, i.e. the following data-processing step, is imperative in the evaluation of the performance of any imputation method adopted for handling the missing data [19]. ...
Article
There is a growing interest in mining and handling of big data, which has been rapidly accumulating in the repositories of bioprocess industries. Biopharmaceutical industries are no exception; the implementation of advanced process control strategies based on multivariate monitoring techniques in biopharmaceutical production gave rise to the generation of large amounts of data. Real-time measurements of critical quality and performance attributes collected during production can be highly useful to understand and model biopharmaceutical processes. Data mining can facilitate the extraction of meaningful relationships pertaining to these bioprocesses, and predict the performance of future cultures. This review evaluates the suitability of various metaheuristic methods available for data pre-processing, which would involve the handling of missing data, the visualisation of the data, and dimension reduction; and for data processing, which would focus on modelling of the data and the optimisation of these models in the context of biopharmaceutical process development. The advantages and the associated challenges of employing different methodologies in pre-processing and processing of the data are discussed. In light of these evaluations, a summary guideline is proposed for handling and analysis of the data generated in biopharmaceutical process development.
... The authors examined the performance of their method by using three algorithms that represent three groups of classification techniques: 1) rule induction learning, 2) approximate models and 3) lazy learning. The classifiers used were C4.5, Naïve Bayes, and kNN, respectively, and the authors applied the Wilcoxon signed-rank test to evaluate the imputation scheme [11]. Similarly, another imputation method was proposed for text mining applications, which was applied and investigated using only the C4.5 classifier; however, no comparison with other imputation methods was performed. ...
... Given the complexity of estimating missing data, many authors have chosen to invest in evolutionary algorithms to build models for this application. [11], for example, proposed a Genetic Algorithm to treat missing values in classification datasets, making use of both numeric and categorical data. Also, [12] invested in a Genetic Programming Algorithm to treat missing values, achieving great results in both prediction and classification accuracy. ...
... c ← best individual regression function
for each example x ∈ a do
  if x is missing then ...
Conference Paper
Time series have been used in several applications such as process control, environmental monitoring, financial analysis and scientific research. However, in the presence of missing data, their analysis may become more complex due to a strong break of correlation among samples. Therefore, this work proposes an imputation method for time series using Genetic Programming (GP) and Lagrange Interpolation. The heuristic adopted builds an interpretable regression model that explores time series statistical features such as mean, variance and auto-correlation. It also makes use of interrelation among multivariate time series to estimate missing values. Results show that the proposed method is promising, being capable of imputing data without losing the dataset's statistical properties, as well as allowing a better understanding of the missing data pattern from the obtained interpretable model.
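The Lagrange interpolation ingredient named in this abstract can be sketched in isolation as follows. This is only a minimal illustration under stated assumptions: the GP regression model is omitted, and the choice of interpolating through up to two observed samples on each side of a gap (the `window` parameter), the function names, and the toy series are hypothetical rather than taken from the paper.

```python
def lagrange(points, t):
    """Evaluate the Lagrange polynomial through `points` [(t_i, y_i), ...] at time t."""
    if not points:
        raise ValueError("at least one observed point is required")
    total = 0.0
    for i, (ti, yi) in enumerate(points):
        term = yi
        for j, (tj, _) in enumerate(points):
            if j != i:
                term *= (t - tj) / (ti - tj)
        total += term
    return total


def fill_gaps(series, window=2):
    """Replace None entries using up to `window` observed samples on each side."""
    observed = [(t, y) for t, y in enumerate(series) if y is not None]
    filled = list(series)
    for t, y in enumerate(series):
        if y is None:
            before = [p for p in observed if p[0] < t][-window:]
            after = [p for p in observed if p[0] > t][:window]
            filled[t] = lagrange(before + after, t)
    return filled


# Hypothetical series: samples of t**2 with one missing entry at t = 3.
series = [0.0, 1.0, 4.0, None, 16.0, 25.0]
filled = fill_gaps(series)
print(filled)
```

Keeping the window small limits the polynomial degree, which matters because high-degree Lagrange polynomials can oscillate strongly between samples.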
Conference Paper
Time series have been used in several applications such as process control, environment monitoring, financial analysis and scientific research. However, in the presence of missing data, their analysis may become more complex due to a strong break of correlation among samples. Therefore, this work proposes an imputation method for time series using Genetic Programming (GP) and Lagrange Interpolation. The heuristic adopted builds an interpretable regression model that explores time series statistical features such as mean, variance and auto-correlation. It also makes use of interrelation among multivariate time series to estimate missing values. Results show that the proposed method is promising, being capable of imputing data without losing the dataset's statistical properties, as well as allowing a better understanding of the missing data pattern from the obtained interpretable model.
... [Cai et al. 2015] treated imputation as an optimization problem using tensor factorization to build a model to regress values, and [Figueroa Garcia et al. 2010] Although, due to the complexity of estimating missing data, many authors chose to invest in evolutionary algorithms to build models for this application. [Lobato et al. 2015b], for example, proposed a Genetic Algorithm to treat missing values in classification datasets, making use of both numeric and categorical data. [Patil and Bichkar 2010] also studied missing data in pattern classification databases. ...
Article
Missing data is a considerable problem in knowledge extraction, where the completeness and the quality of the data play a major role in data analysis. In many applications, ignoring the records with missing values may adversely affect the prediction process and create a significant bias in the resulting data. Therefore, Missing Data Imputation (MDI) has become mandatory to tackle the negative consequences of the presence of missing data. However, different features respond differently to data imputation: imputing some features can enhance the learning process, while imputing others may lead to worse results, depending on the feature properties. This paper proposes the use of evolutionary algorithms to evaluate the usefulness of the imputation of each feature for the performance of the prediction model, in order to select the best subset of incomplete features that can enhance the learning process and maximize the prediction power of the model after it has been handled properly. This paper proposes a new approach for handling missing values while performing feature selection simultaneously to enhance the model's learning performance and reduce the negative consequences of imputation. The performance of the proposed method was evaluated using 10 benchmark datasets under 10-fold cross-validation. The results were compared with five classical imputation methods (mean, median, multiple imputation, expectation maximization, and K-nearest neighbours). The proposed methodology significantly outperformed all other methods in terms of accuracy, sensitivity, specificity, geometric mean, and the area under the curve. Moreover, the effectiveness of the proposed method was compared against three recent evolutionary-based imputation methods, where the proposed methodology outperformed other methods in terms of accuracy in 75% of the datasets.
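Two of the classical baselines listed in this abstract, mean and median imputation, are simple enough to sketch with the Python standard library. The helper name `impute_column`, the use of `None` for missing entries, and the toy column are assumptions made for illustration; the evolutionary feature-selection method itself is not reproduced here.

```python
from statistics import mean, median


def impute_column(values, strategy="mean"):
    """Fill None entries of one numeric feature with its observed mean or median."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values")
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Hypothetical feature column with two missing entries and one large value.
ages = [20, None, 30, 100, None]
mean_filled = impute_column(ages, "mean")
median_filled = impute_column(ages, "median")
print(mean_filled)
print(median_filled)
```

The two strategies can disagree noticeably on skewed features, which is one reason the effect of imputation can differ from feature to feature.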
Conference Paper
The growth of video surveillance devices increases the rate of streaming data. However, even when working in a Fog Computing environment, these smart devices may fail to collect information, producing missing or invalid data. This issue can affect the user's quality of experience, because the PTZ controller may lose track of the target object. Therefore, this paper presents Singular Spectrum Analysis (SSA) as a method to replace missing values in this complex environment of intelligent surveillance cameras. Within the time series field, SSA is characterized by performing a non-parametric spectral estimation with spatial-temporal correlations. Values that were not correctly monitored were estimated accurately by SSA, allowing the tracking of a suspect object.