ICoSM2007 Paper accepted on 15th May 2007
COMPARISON OF LINEAR INTERPOLATION METHOD AND MEAN
METHOD TO REPLACE THE MISSING VALUES IN ENVIRONMENTAL
DATA SET
1Norazian Mohamed Noor, 2Mohd Mustafa Al Bakri Abdullah, 3Ahmad Shukri Yahaya, 3Nor Azam
Ramli
1School of Environmental Engineering, 2School of Material Engineering, Universiti Malaysia Perlis, P.O
Box 77, d/a Pejabat Pos Besar, 01007 Kangar, Perlis, Malaysia.
3School of Civil Engineering, Engineering Campus, Universiti Sains Malaysia, 14300 Nibong Tebal,
Pulau Pinang, Malaysia.
Abstract
Missing data is a very frequent problem in many scientific fields, including environmental research. Missing values usually arise from machine failure, routine maintenance, changes in the siting of monitors and human error. Incomplete datasets can cause bias due to systematic differences between observed and unobserved data. It is therefore important to find the best way of estimating missing values so that the analysed data are of high quality. In this study, two methods were used to estimate missing values in an environmental data set and their performances were compared: the linear interpolation method and the mean method. Annual hourly monitoring data for PM10 were used to generate simulated missing values. Five randomly simulated missing data patterns, of 5%, 10%, 15%, 25% and 40% missing data, were generated to evaluate the accuracy of the imputation techniques under different missing data conditions. Three performance indicators, namely the mean absolute error (MAE), the root mean squared error (RMSE) and the coefficient of determination (R2), were calculated to describe the goodness of fit of the two methods. Of the two methods applied, the linear interpolation method was found to give better results than the mean method for all percentages of missing data considered.
Introduction
Data collected in air pollution monitoring, such as PM10, sulphur dioxide, ozone and carbon monoxide, are obtained from automated monitoring stations. These data usually contain missing values due to machine failure, routine maintenance, changes in the siting of monitors and human error. Incomplete datasets can cause bias due to systematic differences between observed and unobserved data. It is therefore important to find the best way to estimate these missing values so that the analysed data are of high quality. Incomplete data matrices are problematic: incomplete datasets may lead to results that differ from those that would have been obtained from a complete dataset (Hawthorne and Elliott, 2005). Three major problems may arise when dealing with incomplete data. First, there is a loss of information and, as a consequence, a loss of efficiency. Second, there are complications related to data handling, computation and analysis, due to irregularities in the data structure and the impossibility of using standard software. Third, and most importantly, there may be bias due to systematic differences between observed and unobserved data. One approach to solving incomplete data problems is the adoption of imputation techniques (Junninen et al., 2004). Thus, this study compared the performance of the linear interpolation method (an imputation technique) against substitution of the mean value for replacing missing values in an environmental data set.
Material and Methods
Data
Annual hourly monitoring records for PM10 in Seberang Perai, Penang were selected for the simulation of missing data. The test dataset consisted of particulate matter (PM10) concentrations on a time scale of one record per hour (hourly averaged) for one year. Table 1 summarises the particulate matter (PM10) data.
Simulation of Missing Data
Five randomly simulated missing data patterns were used to evaluate the accuracy of the imputation techniques under different missing data conditions. The simulated patterns were divided into three degrees of complexity: small, medium and large. The patterns of missing data simulation are presented in Table 2.
Table 1 Descriptive statistics of PM10 data

Valid data            8757
Missing data             3
Mode                  45.0
Standard deviation    58.5
Minimum value          8.0
Maximum value        718.0
Table 2 The patterns of missing data simulation

Degree of Complexity    Percentage of Missing Data (%)
Small                   5, 10
Medium                  15, 25
Large                   40
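The simulation of the missing data patterns above can be sketched as follows. This is a minimal illustration, assuming the hourly PM10 record is held as a NumPy array; `simulate_missing` and the synthetic stand-in series are hypothetical, not the authors' code or data:

```python
import numpy as np

def simulate_missing(series, fraction, seed=0):
    """Return a copy of `series` with `fraction` of its points removed
    (set to NaN) at randomly chosen positions, plus the chosen indices."""
    rng = np.random.default_rng(seed)
    data = series.astype(float).copy()
    n_missing = int(round(fraction * data.size))
    idx = rng.choice(data.size, size=n_missing, replace=False)
    data[idx] = np.nan
    return data, idx

# Synthetic stand-in for one year of hourly PM10 readings (8760 values).
hourly_pm10 = np.random.default_rng(1).uniform(8.0, 120.0, size=8760)

for pct in (0.05, 0.10, 0.15, 0.25, 0.40):
    gapped, _ = simulate_missing(hourly_pm10, pct)
    print(f"{pct:.0%} pattern: {np.isnan(gapped).sum()} simulated gaps")
```

Because the withheld true values are known, the imputed values can later be scored against them with the performance indicators described below.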
Computational Methods
a) Linear Interpolation Method
The equation of the linear interpolation function is (Chapra and Canale, 1998):
f(x) = f(x_0) + \frac{f(x_1) - f(x_0)}{x_1 - x_0} (x - x_0)    (1)

where x is the independent variable, x_0 and x_1 are known values of the independent variable, and f(x) is the value of the dependent variable for a value x of the independent variable.
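As a concrete sketch of Eq. (1) applied to gap filling, the following uses NumPy's `np.interp` to connect each gap's nearest known neighbours with a straight line. This is an illustrative implementation, not code from the paper:

```python
import numpy as np

def linear_interpolate(data):
    """Fill NaN gaps by straight-line interpolation between the nearest
    known neighbours on either side, per Eq. (1). Gaps at the very ends
    of the series take the nearest known value (np.interp's behaviour)."""
    data = data.astype(float).copy()
    x = np.arange(data.size)
    known = ~np.isnan(data)
    data[~known] = np.interp(x[~known], x[known], data[known])
    return data

print(linear_interpolate(np.array([10.0, np.nan, np.nan, 40.0, 50.0])))
# -> [10. 20. 30. 40. 50.]
```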
b) Mean Method
This method replaces all missing values with the mean of all available data. Thus the equation is (Yahaya et al., 2005):

\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i    (2)

where n is the number of available data points and y_i are the data points.
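Eq. (2) in code form might look like the sketch below (a hypothetical helper; every gap receives the same overall mean of the available data):

```python
import numpy as np

def mean_impute(data):
    """Replace every NaN with the mean of all available observations, Eq. (2)."""
    data = data.astype(float).copy()
    data[np.isnan(data)] = np.nanmean(data)
    return data

print(mean_impute(np.array([10.0, np.nan, 30.0, np.nan, 50.0])))
# -> [10. 30. 30. 30. 50.]
```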
Performance Indicators
Several performance indicators were used to describe the goodness of the imputation methods used in this research. The imputed data and the observed data were compared to select the best method for estimating missing values. Three performance indicators were used, namely the mean absolute error (MAE), the root mean squared error (RMSE) and the coefficient of determination (R2).
a) Mean Absolute Error (MAE)
The mean absolute error (MAE) is evaluated by the equation (Junninen et al., 2004):
MAE = \frac{1}{N} \sum_{i=1}^{N} |P_i - O_i|    (3)

where N is the number of imputations, O_i are the observed data points and P_i the imputed data points. The MAE ranges from 0 to infinity, and a perfect fit is obtained when MAE equals 0.
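A minimal sketch of Eq. (3), scoring the imputed values against the withheld observations:

```python
import numpy as np

def mae(observed, imputed):
    """Mean absolute error over the N imputed positions, Eq. (3)."""
    observed = np.asarray(observed, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    return float(np.mean(np.abs(imputed - observed)))

print(mae([20.0, 30.0], [22.0, 27.0]))  # -> 2.5
```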
b) Root Mean Squared Error (RMSE)
The root mean squared error is computed by (Junninen et al., 2004):

RMSE = \left[ \frac{1}{N} \sum_{i=1}^{N} (P_i - O_i)^2 \right]^{1/2}    (4)

where N is the number of imputations, O_i are the observed data points and P_i the imputed data points. The RMSE gives the error the same dimensionality as the actual and predicted values. A smaller RMSE indicates better model performance.
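Eq. (4) can likewise be sketched as:

```python
import numpy as np

def rmse(observed, imputed):
    """Root mean squared error over the N imputed positions, Eq. (4)."""
    observed = np.asarray(observed, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    return float(np.sqrt(np.mean((imputed - observed) ** 2)))

print(rmse([20.0, 30.0], [23.0, 26.0]))  # -> ~3.536
```

Because the errors are squared before averaging, the RMSE penalises large individual imputation errors more heavily than the MAE does.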
c) Coefficient of Determination (R2)
The coefficient of determination (R2) takes on values between 0 and 1, with values closer to 1 implying a
better fit. The equation of coefficient of determination (R2) is given as follows (Junninen et al., 2004):
R^2 = \left[ \frac{1}{N} \sum_{i=1}^{N} \frac{(P_i - \bar{P})(O_i - \bar{O})}{\sigma_P \sigma_O} \right]^2    (5)

where N is the number of imputations, O_i are the observed data points, P_i the imputed data points, \bar{P} is the average of the imputed data, \bar{O} is the average of the observed data, \sigma_P is the standard deviation of the imputed data and \sigma_O is the standard deviation of the observed data.
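A sketch of Eq. (5) follows. Note that when the imputed values are all identical, as with the mean method, \sigma_P is zero and the ratio is undefined, which is why mean imputation scores so poorly on this indicator:

```python
import numpy as np

def r_squared(observed, imputed):
    """Coefficient of determination, Eq. (5): the squared normalised
    covariance between imputed and observed values (population statistics,
    matching the 1/N convention of Eqs. (3)-(5))."""
    O = np.asarray(observed, dtype=float)
    P = np.asarray(imputed, dtype=float)
    cov = np.mean((P - P.mean()) * (O - O.mean()))
    return float((cov / (P.std() * O.std())) ** 2)

# A perfectly linear relation between imputed and observed gives R^2 = 1.
print(r_squared([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # -> 1.0 (up to rounding)
```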
Results and Discussion
Figure 1 plots the performance of the linear interpolation and mean methods in replacing the simulated PM10 data. From Figure 1, the linear interpolation method clearly gives the best results for all percentages of missing values, while the mean method contributes very large errors by comparison. The R2 values of the linear interpolation method range from 0.69 to 0.86 across all percentages of missing values, whereas the mean method gives 0.00 for all percentages. This is consistent with Junninen et al. (2004), who reported that substituting mean values for missing data disrupts the inherent structure of the data and leads to large errors in the correlation matrix, thus degrading the performance of statistical modelling.
Conclusions
This paper compared the linear interpolation method and the mean method for estimating missing values. The study was carried out to show that substitution of mean values degrades the statistical performance of the data. Hourly PM10 data for one year were used to compare the performance of the methods, with simulated missing values categorised as small, medium and large complexity. Three performance indicators were calculated to select the best method for replacing the missing values: the mean absolute error (MAE), the root mean squared error (RMSE) and the coefficient of determination (R2). On these indicators, for all degrees of complexity, the best method was found to be the linear interpolation method.
Figure 1: Performance indicators for two methods
References
[1] Chapra, S.C. and Canale, R.P. (1998) Numerical Methods for Engineers. Singapore: McGraw-Hill.
[2] Hawthorne, G. and Elliott, P. (2005) Imputing Cross-Sectional Missing Data: Comparison of Common Techniques. Australian and New Zealand Journal of Psychiatry, 39, pp. 583-590.
[3] Junninen, H., Niska, H., Tuppurainen, K., Ruuskanen, J. and Kolehmainen, M. (2004) Methods for Imputation of Missing Values in Air Quality Data Sets. Atmospheric Environment, 38, pp. 2895-2907.
[4] Yahaya, A.S., Ramli, N.A. and Yusof, N.F. (2005) Effects of Estimating Missing Values on Fitting Distributions. International Conference on Quantitative Sciences and Its Applications, 6-8 December 2005, Penang, Malaysia: Universiti Utara Malaysia.
[Figure 1: three panels plotting RMSE, MAE and R2 against the percentage of missing values (5, 10, 15, 25 and 40%), each comparing the linear interpolation and mean methods.]