A Comparison of Methods for Missing Data Treatment in Building Sensor Data
Mehdi Pazhoohesh, Zoya Pourmirza, Sara Walker
School of Engineering, Newcastle University, Newcastle, UK
e-mail: Mehdi.pazhoohesh@ncl.ac.uk, Zoya.pourmirza@newcastle.ac.uk, Sara.walker@newcastle.ac.uk
Abstract—Data collection is a fundamental component in the study of energy and buildings. Errors and inconsistencies in the data collected from the test environment can negatively influence the energy consumption modelling of a building and other control and management applications. This paper addresses a gap in the current study of missing data treatment. It presents a comparative study of eight methods for imputing missing values in building sensor data. The dataset used in this study consists of real data collected from our test bed, a living lab at Newcastle University. Once the data imputation process is completed, we use the Mean Absolute Error and Root Mean Squared Error metrics to evaluate the difference between the imputed values and the real values. To achieve more accurate and robust results, this process has been repeated 1000 times, and the average of the 1000 simulations is reported in this paper. Finally, it is concluded that it is necessary to identify the percentage of missing data before selecting the proper imputation method, in order to achieve the best result.
Keywords—energy and building data; data imputation; missing value; KNN; MCMC; MAE; RMSE
I. INTRODUCTION
Nowadays, data collection is a key process in the study of energy and buildings. For instance, building energy controls and retrofit analysis are two applications that collect large amounts of data from installed sensors. In addition, data collected from buildings have been used to model the energy consumption of buildings through software such as EnergyPlus [1].
However, significant discrepancies between the simulated and measured energy consumption of buildings are the motivation to focus more on analysing data collected through extensive sensor networks.
A. Related Work and Gap Analysis
Different calibration techniques, such as Bayesian calibration [2], [3] and systematic evidence-based approaches [4], have been used to uncover discrepancies between simulated and measured energy consumption of buildings. However, a considerable amount of data is usually missing for reasons such as low signal-to-noise ratio, measurement error, malfunctioning sensors, power outages at the sensors or network failure, which can lead to data analysis problems. Hence, estimating missing values plays a significant role in the calibration of building energy models as a pre-processing step. Moreover, the evaluation and prediction of a building's energy consumption through statistical and data mining methods require time-series data, in which missing values can significantly influence the analysis results, further emphasizing the importance of missing value estimation. Different approaches are used to deal with missing values in most scientific research domains, such as biology [5], medicine [6] and climate science [7]. However, there are limited studies dealing with missing data in building energy systems. One approach is to delete all missing values and analyse the behaviour of the building based on the available data. The issue with this method is that it may leave very few observations and a very small dataset from which to model the behaviour of the building [8], [9]. Another approach is mean imputation, where missing data are replaced with the mean value of the variable [8], [10]. This method distorts the distribution of the variable as well as the relationships between variables, and can result in large errors between predicted and actual values. A third method of treating missing data is to replace missing values with some constant (e.g., zero); this has been used in applications that cannot tolerate gaps in the data [5]. Although a variety of techniques have been developed to treat missing values with statistical prediction in other fields, there is a lack of research on substituting missing values that would provide guidelines for making the most appropriate methodological choice for energy and building related data. In this study, we compare eight different imputation methods, namely, Markov chain Monte Carlo (MCMC) [11], Hmisc aregImpute [12], K-nearest neighbours (KNN) [13], simple mean, expectation-maximization [14], [15], random value, regression and stochastic regression [15], to find which method best fits energy and building datasets. The comparison was performed on a real lighting dataset collected over a 6-month period, under a Missing Completely at Random (MCAR) assumption, using Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) as evaluation criteria for estimating missing values in building data.
II. METHOD
A. Study Site
The data used in this study are lighting time-series data as the main dataset and corresponding occupancy data as the supportive dataset, collected from the 3rd floor of the Urban Science Building, Newcastle University, United Kingdom (Figure 1). Data collection took place between February 2018 and July 2018 at 1-minute intervals. The collected data were averaged to obtain half-hourly values, giving 7968 data points.

2019 the 7th International Conference on Smart Energy Grid Engineering
978-1-7281-2440-7/19/$31.00 ©2019 IEEE
Figure 1. USB building.
B. Selection of Imputation Method
In order to conduct this study, we have selected eight imputation methods, which are among the most well-known techniques and cover a range of statistical strategies, from simple methods to multiple imputation. These techniques are mean, random, the K-nearest neighbours algorithm (KNN), aregImpute (Hmisc) in R, Markov chain Monte Carlo (MCMC) [15], the expectation-maximization (EM) algorithm [11], regression and stochastic regression. Here we briefly discuss each technique. The mean method replaces each missing value with the mean of all known values of that variable.
The random technique predicts each missing value at random, bounded by the maximum and minimum values of the dataset.
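The two baseline strategies above can be sketched as follows; this is a minimal NumPy illustration (not the code used in the study), and the toy array is hypothetical:

```python
import numpy as np

def mean_impute(series):
    """Replace NaNs with the mean of the observed values."""
    filled = series.copy()
    filled[np.isnan(filled)] = np.nanmean(series)
    return filled

def random_impute(series, rng=None):
    """Replace NaNs with uniform draws between the observed min and max."""
    rng = rng or np.random.default_rng(0)
    filled = series.copy()
    mask = np.isnan(filled)
    lo, hi = np.nanmin(series), np.nanmax(series)
    filled[mask] = rng.uniform(lo, hi, size=mask.sum())
    return filled

data = np.array([0.2, np.nan, 0.4, 0.6, np.nan])
print(mean_impute(data))    # NaNs replaced by 0.4, the observed mean
print(random_impute(data))  # NaNs replaced by draws from [0.2, 0.6]
```

As the paper notes, mean imputation keeps the overall level but compresses the variance, while random imputation preserves the range but ignores any temporal structure.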
The nearest neighbour algorithm [16] is a nonparametric method that replaces a missing value of a variable by averaging the non-missing values of its neighbours. In this method, the K nearest neighbours are selected to predict the missing value, and each of these neighbours has the same influence. Depending on the number of selected neighbours (the K value), the estimated value can vary significantly. Hence, choosing the proper number of neighbours has a great influence on the prediction. In this paper, the effect of different values of the parameter K on estimation accuracy is discussed.
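For a univariate time series, neighbours can be taken as the temporally closest observed points; the following sketch (our illustration under that assumption, with equal weights as described above; the half-hourly values are hypothetical) shows the idea:

```python
import numpy as np

def knn_time_impute(series, k=2):
    """Impute each NaN with the unweighted mean of its k temporally
    nearest non-missing observations."""
    filled = series.copy()
    obs_idx = np.flatnonzero(~np.isnan(series))
    for i in np.flatnonzero(np.isnan(series)):
        # pick the k observed indices closest in time to the gap
        nearest = obs_idx[np.argsort(np.abs(obs_idx - i))[:k]]
        filled[i] = series[nearest].mean()
    return filled

half_hourly = np.array([0.30, 0.32, np.nan, 0.40, 0.42])
print(knn_time_impute(half_hourly, k=2))  # gap filled with (0.32+0.40)/2
```

With half-hourly data, k = 2 corresponds to roughly an hourly boundary around the gap, which matches how K values are interpreted later in the paper.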
The aregImpute function in the Hmisc library [12] replaces a missing value using predictive mean matching, computed by optionally weighted probability sampling from similar cases. In the aregImpute function, missing values for any parameter are estimated from the other parameters. In this paper, occupancy data are used as the supportive variable for estimating the missing values in the lighting dataset.
Markov Chain Monte Carlo (MCMC) is an iterative algorithm based on chained equations that uses an imputation model specified separately for each variable, with the other variables as predictors. The MCMC method is used to generate pseudo-random draws and provides several imputed datasets. MCMC requires either MAR or MCAR data and can be applied to both arbitrary and monotone patterns of missing data. A Markov chain is a sequence of random variables in which the probability of each element depends only on the value of the previous one.
In MCMC simulation, by constructing a Markov chain whose stationary distribution is the distribution of interest, one can obtain a sample from the desired distribution by repeatedly simulating steps of the chain. Refer to Schafer [17] for a detailed discussion of this method.
In the regression imputation method, missing values are replaced with predicted scores from a regression equation. Although the imputed data are computed using information from the observed data, only one representative value is considered for each group of missing data, which may result in weakened variance. Another method, inspired by the regression concept, is stochastic regression. This method aims to reduce the bias through an additional step that augments each predicted score with a residual term. Therefore, each missing value is replaced with a different imputed number [15].
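The difference between the two regression-based methods can be sketched as follows; this is our illustration on synthetic data (the occupancy/lighting relationship and all values are hypothetical), not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
# toy data: lighting (y, with gaps) linearly related to occupancy (x)
x = rng.uniform(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, 200)
missing = rng.random(200) < 0.2
y_obs = np.where(missing, np.nan, y)

# fit a regression line on the complete cases only
ok = ~missing
b1, b0 = np.polyfit(x[ok], y_obs[ok], 1)
resid_sd = np.std(y_obs[ok] - (b0 + b1 * x[ok]))

# deterministic regression imputation: points fall exactly on the line
y_reg = np.where(missing, b0 + b1 * x, y_obs)
# stochastic regression: add a random residual to restore variance
y_stoch = np.where(missing, b0 + b1 * x + rng.normal(0, resid_sd, 200), y_obs)

dev_reg = y_reg[missing] - (b0 + b1 * x[missing])
dev_stoch = y_stoch[missing] - (b0 + b1 * x[missing])
print(np.allclose(dev_reg, 0), np.std(dev_stoch) > 0)  # True True
```

The printout confirms the text: deterministic regression places every imputed value exactly on the fitted line (zero deviation), whereas the stochastic variant scatters imputed values around it, restoring residual variance.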
III. DISCUSSION
A. Evaluation Criteria
To evaluate the imputed values, the mean absolute error (MAE) and root mean square error (RMSE) were computed over the given period for the imputed lighting data.
These metrics are used to compare the eight imputation algorithms. RMSE represents the sample standard deviation of the differences between the actual and estimated values:
$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^{2}}$$

MAE measures the average magnitude of the errors in a set of predictions:

$$\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|$$

where $n$ denotes the number of test samples, $y_i$ represents the $i$-th target value, and $\hat{y}_i$ stands for the predicted value for the $i$-th test sample.
RMSE and MAE both indicate how close the modelled and observed values are. Because RMSE takes the square root of the average squared error, it gives a relatively high weight to large errors. Therefore, it is appropriate when penalizing large errors is desirable.
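Both metrics follow directly from the formulas above; a minimal sketch (the toy vectors are hypothetical) also shows why RMSE penalizes a single large error more heavily than MAE:

```python
import numpy as np

def mae(actual, predicted):
    """Mean absolute error: average magnitude of the errors."""
    return np.mean(np.abs(actual - predicted))

def rmse(actual, predicted):
    """Root mean squared error: square root of the average squared error."""
    return np.sqrt(np.mean((actual - predicted) ** 2))

actual = np.array([1.0, 2.0, 3.0, 4.0])
predicted = np.array([1.0, 2.0, 3.0, 8.0])  # one large error of 4
print(mae(actual, predicted))   # 1.0
print(rmse(actual, predicted))  # 2.0
```

The single error of 4 contributes 4/4 = 1.0 to MAE but sqrt(16/4) = 2.0 to RMSE, illustrating the heavier weighting of large errors.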
B. Estimation Process
The process of the analysis is depicted in Figure 2. Due to the large size of the original dataset with one-minute intervals, a half-hourly dataset was generated by averaging each 30 minutes of data; we call this the calibrated dataset. Under the assumption of Missing Completely at Random (MCAR), 10%, 20% and 30% of missing data were generated from the calibrated dataset. Afterwards, the missing data were imputed using the eight methods. In the next step, the difference between the substituted values and the real values was computed with the RMSE and MAE metrics. To provide a more accurate comparison, the missing-value generation step and the corresponding imputation algorithms were run for 1000 simulations, and the average over the 1000 simulations was used for the final evaluation.
Figure 2. Principle of the analysis.
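The MCAR mask-impute-evaluate loop described above can be sketched as follows; this is an illustrative reconstruction (the synthetic series and the mean-imputation baseline are our assumptions, not the study's data or code):

```python
import numpy as np

def mcar_mask(n, fraction, rng):
    """Mark `fraction` of the n points as missing, completely at random."""
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=int(fraction * n), replace=False)] = True
    return mask

def evaluate(series, impute_fn, fraction, n_sims=1000, seed=0):
    """Average RMSE of `impute_fn` over repeated MCAR deletions."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_sims):
        mask = mcar_mask(len(series), fraction, rng)
        corrupted = series.copy()
        corrupted[mask] = np.nan          # generate missing values
        imputed = impute_fn(corrupted)    # fill them back in
        # RMSE computed only on the points that were deleted
        errors.append(np.sqrt(np.mean((imputed[mask] - series[mask]) ** 2)))
    return np.mean(errors)

# e.g. mean imputation on a synthetic half-hourly series, 10% MCAR
series = np.sin(np.linspace(0, 20, 500)) + 1.0
mean_impute = lambda s: np.where(np.isnan(s), np.nanmean(s), s)
print(round(evaluate(series, mean_impute, 0.10, n_sims=50), 3))
```

Averaging the RMSE over many random deletion patterns, as the paper does with 1000 simulations, removes the dependence of the result on any single draw of the MCAR mask.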
C. Result Analysis
As mentioned before, in the KNN method the number of selected neighbours plays an important role. As the percentage of missing data increases, a larger K value gives the best KNN results. In other words, when about 10% of the data are missing, the value closest in time to the missing data point is the best value for imputation (Figure 3(a)). However, as the missing percentage increases, e.g. for 20% missing data, the optimal K is 2; in other words, an hourly boundary gives the best result (Figure 3(b)). For the 30% missing dataset, the best K was 4, meaning that a boundary of 2 hours results in better imputation of the missing data (Figure 3(c)). The trend of the best K value against the missing percentage is depicted in Figure 4. This figure also shows that increasing the percentage of missing data results in a higher RMSE, which can be considered a logical confirmation of the principle of our analysis.
Figure 3. Relationship between missing percentage and best K value: (a) 10%, (b) 20% and (c) 30% missing data.
Figure 4. Trend of K values.
Figure 5 illustrates the comparison of all methods in terms of the computed RMSE.
It should be mentioned that, to simplify the evaluation, for the KNN method the average RMSE over each set of missing data (10%, 20% and 30%) is used in this comparison.
For 10 percent missing data (Figure 5(a)), the Random, Stochastic regression and MCMC techniques produced the highest errors based on the root mean square analysis. With a remarkable gap, KNN shows less error than the other methods. The AregImpute, Mean, Regression and EM techniques achieve approximately the same RMSE.
For 20 percent missing values in the dataset (Figure 5(b)), approximately the same pattern of RMSE values is observed. KNN performs best, while the MCMC, Random and Stochastic regression methods perform worst. The RMSE for the AregImpute technique increased slightly compared with the Mean, Regression and EM methods.
For the dataset with 30 percent missing values (Figure 5(c)), the KNN, Regression and Mean techniques prove the most suitable when a higher percentage of values is missing. There is a significant error increase for the EM algorithm on this dataset. The KNN and Random methods are, respectively, the best and approximately the worst methods.
Figure 5. Comparison of methods based on RMSE: (a) 10%, (b) 20% and (c) 30% missing data.
The evaluation of the Mean Absolute Errors is depicted in Figure 6. Figure 6(a) shows that KNN has remarkably lower error than the other methods. The computed MAE for the 20 percent missing dataset (Figure 6(b)) and the 30 percent missing dataset (Figure 6(c)) shows that the KNN technique again achieves the lowest error.
Figure 6. Comparison of methods based on MAE: (a) 10%, (b) 20% and (c) 30% missing data.
IV. EVALUATION AND OUTCOME
The objective of this research is to highlight the importance of the method used in the energy and buildings field to treat missing values. This paper shows that it is important to identify the percentage of missing data before selecting the proper method. In this research, eight popular imputation techniques were applied to the generated datasets with 10, 20 and 30 percent missing values. The results show that for 10% missing data, KNN achieves the best accuracy in predicting missing values. Moreover, the best value for K (the number of neighbours) was found to be one or two, which means that in this research the best values for replacing missing data in the 10 percent dataset are the next 30 minutes or the next hour of recorded data.
For the 20% missing data, KNN again shows the best results. For this dataset, it is also concluded that the best value for K corresponds to the next 30 minutes or the next hour of data.
For the dataset with 30% missing data, KNN again achieves the best result. However, the best value for K increased to 4, which means that the next two hours of data are more suitable for imputing the current missing value.
Therefore, it is concluded that increasing the percentage of missing data requires more neighbours to estimate the missing data.
Additionally, the results of this study showed that the lighting data depend more on time than on other variables such as occupancy. One reason the authors identified is the topology of the sensors: the test bed area was equipped with seven occupancy sensors but only one lighting meter. Therefore, the occupancy value used for the imputation was the average of these data over each 30-minute interval.
The findings of this research are limited to the lighting variable, which is strongly time-dependent. In future work, we will investigate other building parameters. Also, the tested building is an educational building; further investigation is required for other types of building.
ACKNOWLEDGEMENT
The research reported in this paper was supported by
Building as a Power Plant: The use of buildings to provide
demand response project, funded by the Engineering and
Physical Sciences Research Council under Programme Grant
EP/P034241/1, and the Active Building Centre (ABC),
supported by Industrial Strategy Challenge Fund under
Programme Grant EP/S016627/1.
REFERENCES
[1] US. Department of Energy, “EnergyPlus:Engineering Reference,”
2016.
[2] A. Chong and K. Lam, “Uncertainty analysis and parameter
estimation of HVAC systems in building energy models,” In 14th
Conference of International Building Performance Simulation
Association, Hyderabad, India, 2015.
[3] Y. Heo, R. Choudhary and G. Augenbroe, “Calibration of building
energy models for retrofit analysis under uncertainty,” Energy and
Buildings, vol. 47, pp. 550-560, 2012.
[4] P. Raftery, M. Keane and J. O’Donnell, “Calibrating whole building
energy models: An evidence-based methodology,” Energy and
Buildings, vol. 43, no. 9, pp. 2356-2364, 2011.
[5] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R.
Tibshirani, D. Botstein and R. Altman, “Missing value estimation
methods for DNA microarrays,” Bioinformatics, vol. 17, no. 6, pp.
520-525, 2001.
[6] H. Lewis, “Missing Data in Clinical Trials,” The New England
Journal of Medicine, vol. 367, pp. 2557-2558, 2012.
[7] T. Schneider, “Analysis of incomplete climate data: Estimation of
mean values and covariance matrices and imputation of missing
values,” Journal of climate, vol. 14, no. 5, pp. 853-871, 2001.
[8] A. Gelman and J. Hill, Data analysis using regression and
multilevel/hierarchical models, Cambridge university press, 2006.
[9] C. Robinson, B. Dilkina, J. Hubbs, W. Zhang, S. Guhathakurta, M.
Brown and R. Pendyala, “Machine learning approaches for estimating
commercial building energy consumption,” Applied energy, vol. 208,
pp. 889-904, 2017.
[10] D. Cabrera and H. Zareipour, “Data association mining for
identifying lighting energy waste patterns in educational institutes,”
Energy and Buildings, vol. 62, pp. 210-216, 2013.
[11] F. Nelwamondo, S. Mohamed and T. Marwala, “Missing data: A
comparison of neural network and expectation maximization
techniques,” Current Science, vol. 93, pp. 1514-1521, 2007.
[12] F. Harrell, “Hmisc (v 3.0-12): Harrell miscellaneous library for R statistical software,” R package (v 2.2-3), 2006.
[13] J. Leek, E. Monsen, A. Dabney and J. Storey, “EDGE: extraction and analysis of differential gene expression,” Bioinformatics, vol. 22, no. 4, pp. 507-508, 2005.
[14] C. Musil, C. Warner, P. Yobas and S. Jones, “A comparison of
imputation techniques for handling missing data,” Western Journal of
Nursing Research, vol. 24, no. 7, pp. 815-829, 2002.
[15] D. Schunk, “A Markov chain Monte Carlo algorithm for multiple imputation in large surveys,” Advances in Statistical Analysis, vol. 92, pp. 101-114, 2008.
[16] T. Cover and H. Peter, “Nearest neighbor pattern classification,”
IEEE transactions on information theory, vol. 13, no. 1, pp. 21-27,
1967.
[17] J. Schafer, Analysis of incomplete multivariate data, New York:
Chapman and Hall/CRC, 1997.
259
... Le domaine du bâtiment ne fait pas exception. Des méthodes d'imputation y ont par exemple été appliquées pour traiter des données sur les ambiances thermique et lumineuse, et sur la consommation énergétique de systèmes (Chong et al. 2016;Pazhoohesh et al. 2019;Cho et al. 2020). ...
... Paradoxalement, les auteurs partent souvent d'une base de données complète, i.e. sans données manquantes, et en suppriment un pourcentage plus ou moins important (5 à 40 %), en appliquant l'un des trois mécanismes, pour obtenir une base incomplète (Chong et al. 2016;Pazhoohesh et al. 2019;Cho et al. 2020;Okafor et Delaney 2021). Cette approche permet, lors du calcul des indicateurs de performance, de comparer les écarts entre les versions complète et incomplète de la base. ...
... En complément, ils conseillent d'utiliser des variables décalées pour améliorer les performances des méthodes. Le paramétrage des méthodes dépend aussi du pourcentage de données manquantes selon Pazhoohesh et al. (2019), qui ont observé que le nombre de voisins de la méthode KNN doit augmenter avec la part de données manquantes pour conserver de bonnes performances pour leurs données MCAR. ...
Conference Paper
Full-text available
Le traitement des données issues de bâtiments connectés doit permettre d'optimiser leur gestion énergétique, tout en garantissant un niveau de confort aux occupants. Les données collectées peuvent toutefois être incomplètes du fait de défaillances lors de l'acquisition des mesures. Cela compromet le traitement de l'information. Des méthodes, dites d'imputation, sont alors à appliquer pour consolider les données. Cet article propose un état de l'art sur les méthodes de gestion des données manquantes et d'évaluation de la qualité de l'imputation. Neuf méthodes d'imputation sont ensuite appliquées au cas de données d'ambiance d'un appartement T2, pour lequel le statut de présence de l'occupant est connu. Les méthodes sont comparées, d'une part en étudiant la qualité de l'imputation sur ces séries temporelles multivariées et d'autre part en évaluant la performance des méthodes sur la tâche finale, i.e. la classification du statut de présence. Il ressort de ces comparaisons que la performance de la tâche finale est peu affectée par la performance des méthodes d'imputation dans notre cas. MOTS-CLÉS : Bâtiments connectés, Imputation de données, Occupation. ABSTRACT. The processing the data from smart building is a way to both optimise their energy management and provide a high level of comfort to the occupants. Because of various possible failures in the data collection process, the information gathered can be incomplete or incorrect. In such case, so-called data imputation methods should be used in order to make possible the data processing. This article reviews methods to deal with missing data and to assess the performance of the imputation. Nine of these methods are applied to the data collected from an indoor environment-monitoring sensor located in an apartment for which the presence status of the occupant is known. 
The methods are compared based on the imputation quality of the multivariate time series as well as based on the performance of the final classification task, i.e. classifying the occupancy status. For this case study, it turns out that the performance imputation task has little impact on the performance of the final task.
... A simple random sampling was then applied to identify the patients' file numbers as sample units from both clinics. The study variables (like age of patient, and BMI among others) were extracted from several previous related studies concerning breast cancer [6], [13], and [14]. ...
... The idea based on this approach is to use a mean value of each non-missing variable to fill in missed values for all observations [13]. The mean imputation technique is more appropriate when the amount of missingness is small whilst the size of the sample is large. ...
... A non-parametric approach used to impute missing data by averaging its neighbouring observed data [13]. ...
Article
Background: Clinical datasets are at risk of having missing data for several reasons including patients’ failure to attend clinical measurements and measurement recorder’s defects. Missing data can significantly affect the analysis and results might be doubtful due to bias caused by omission incomplete records during analysis especially if a dataset is small. This study aims to compare several imputation methods in terms of efficiency in filling-in missing data so as to increase prediction and classification accuracy in breast cancer dataset. Methodology: Five imputation methods namely series mean, k-nearest neighbour, hot deck, predictive mean matching, expected maximisation via bootstrapping, and multiple imputation by chained equations were applied to replace the missing values to the real breast cancer dataset. The efficiency of imputation methods was compared by using the Root Mean Square Errors and Mean Absolute Errors to obtain a suitable complete dataset. Binary logistic regression and linear discrimination classifiers were applied to the imputed dataset to compare their efficacy on classification and discrimination. Results: The evaluation of imputation methods revealed that the predictive mean matching method was better off compared to other imputation methods. In addition, the binary logistic regression and linear discriminant analyses yield almost similar values on overall classification rates, sensitivity and specificity. Conclusion: The predictive mean matching imputation showed higher accuracy in estimating and replacing missing data values in a real breast cancer dataset under the study. It is a more effective and good approach to handle missing data. We recommend replacing missing data by using predictive mean matching since it is a plausible approach toward multiple imputations for numerical variables. It improves estimation and prediction accuracy over the use complete-case analysis especially when percentage of missing data is not very small.
... The study concluded that when a Gated Recurrent Units (GRU) architecture is properly set up "it pulled signi icantly ahead of non-deep learning methods" [2]. Pazhoohesh et al.(2019) [3]found that for datasets where 10% to 30 % of the data is missing, the KNN algorithm does great compared to eight other methods. Poloczek et al. 2014 [4] analysed the use of KNN regression and LOCF and found that both did well for the study, but that KNN regression outperformed other methods. ...
... The study concluded that when a Gated Recurrent Units (GRU) architecture is properly set up "it pulled signi icantly ahead of non-deep learning methods" [2]. Pazhoohesh et al.(2019) [3]found that for datasets where 10% to 30 % of the data is missing, the KNN algorithm does great compared to eight other methods. Poloczek et al. 2014 [4] analysed the use of KNN regression and LOCF and found that both did well for the study, but that KNN regression outperformed other methods. ...
... There are limited studies to clarify how to deal with missing data in BMS datasets. Previous research has focused on lighting and occupancy [3] data or created a generic framework for imputing data from multiple sensors [5]. In the case of (Zhang,2020) it is advised that a more generic plug-n-play framework is to be further studied. ...
Conference Paper
Full-text available
Completeness of data is vital for the decision making and forecasting on Building Management Systems (BMS) as missing data can result in biased decision making down the line. This study creates a guideline for imputing the gaps in BMS datasets by comparing four methods: K Nearest Neighbour algorithm (KNN), Recurrent Neural Network (RNN), Hot Deck (HD) and Last Observation Carried Forward (LOCF). The guideline contains the best method per gap size and scales of measurement. The four selected methods are from various backgrounds and are tested on a real BMS and meteorological dataset. The focus of this paper is not to impute every cell as accurately as possible but to impute trends back into the missing data. The performance is characterised by a set of criteria in order to allow the user to choose the imputation method best suited for its needs. The criteria are: Variance Error (VE) and Root Mean Squared Error (RMSE). VE has been given more weight as its ability to evaluate the imputed trend is better than RMSE. From preliminary results, it was concluded that the best K‐values for KNN are 5 for the smallest gap and 100 for the larger gaps. Using a genetic algorithm the best RNN architecture for the purpose of this paper was determined to be Gated Recurrent Units (GRU). The comparison was performed using a different training dataset than the imputation dataset. The results show no consistent link between the difference in Kurtosis or Skewness and imputation performance. The results of the experiment concluded that RNN is best for interval data and HD is best for both nominal and ratio data. There was no single method that was best for all gap sizes as it was dependent on the data to be imputed.
... The study used discrimination and calibration measures to assess usefulness of the prognostic model. On the other hand, the comparison of imputation algorithms for building sensor data across several percentage of missing data was conducted by comparing the differences between real and imputed values through the use of Root Mean Squared Error and Mean Absolute Error estimates; its conclusion emphasized the necessity of identifying percentage of missingness prior to selecting proper imputation technique so as to reach plausible results [14]. Moreover, imputation techniques were used to evaluate the performance of model via discrimination, calibration, and effectiveness of classifiers in relation to time used to build a model in estimating the risk of unprovoked venous thromboembolism recurrence [15]. ...
... A non-parametric approach used to impute missing data by averaging its neighbouring observed data [14]. The approach is donor-based in which imputed values are either measured as a single records in the dataset (1-NN) or as an average value obtained from k records (k-NN) [31]. ...
Article
Full-text available
Several studies have been conducted to classify various real life events but few are in medical fields; particularly about breast recurrence under statistical techniques. To our knowledge, there is no reported comparison of statistical classification accuracy and classifiers’ discriminative ability on breast cancer recurrence in presence of imputed missing data. Therefore, this article aims to fill this analysis gap by comparing the performance of binary classifiers (logistic regression, linear and quadratic discriminant analysis) using several datasets resulted from imputation process using various simulation conditions. Our study aids the knowledge about how classifiers’ accuracy and discriminative ability in classifying a binary outcome variable are affected by the presence of imputed numerical missing data. We simulated incomplete datasets with 15, 30, 45 and 60% of missingness under Missing At Random (MAR) and Missing Completely At Random (MCAR) mechanisms. Mean imputation, hot deck, k-nearest neighbour, multiple imputations via chained equation, expected-maximisation, and predictive mean matching were used to impute incomplete datasets. For each classifier, correct classification accuracy and area under the Receiver Operating Characteristic (ROC) curves under MAR and MCAR mechanisms were compared. The linear discriminant classifier attained the highest classification accuracy (73.9%) based on mean-imputed data at 45% of missing data under MCAR mechanism. As a classifier, the logistic regression based on predictive mean matching imputed-data yields the greatest areas under ROC curves (0.6418) at 30% missingness while k-nearest neighbour tops the value (0.6428) at 60% of missing data under MCAR mechanism.
... The advent of sensor technologies and the integration of smart systems have intensified the need for precise short-term electricity consumption predictions within these structures [2]. As the energy sector seeks to navigate this era of heightened data availability, a critical challenge arises in ensuring accurate forecasting when faced with limited access to detailed building information [3,4]. This limitation amplifies the attractiveness of data-driven Machine Learning (ML) models [5], which can adeptly navigate uncertainties and hidden patterns in energy consumption data. ...
Article
This paper addresses the critical issue of missing data in power demand time series by emphasizing the relevance of imputation-based approaches in data-driven technologies. A comparative analysis of imputation methods is performed, where the reference from the state of the art is selected as K-Nearest Neighbors (KNN) applied in the time domain. Two innovative methods are proposed. The former method is defined as Historical Data Informed Regression Technique (H-DIRT) and is based on incorporating historical data for setting up a multivariate linear regression and then imputing through the estimated relation between the missing power demand measurement and the historical data. When the available historical data are insufficient, the algorithm proceeds by averaging or by a linear interpolation between the first available measurement before and after the missing value. The latter proposed method is defined as Seasonal KNN (SKNN) and is based on enriching the data set with features related to yearly, seasonal, weekly and daily trends and then proceeding by baseline KNN. Experiments are set up with random and continuous data clipping, even with rather extreme pruning (up to 70% of the data). The results in general demonstrate a significant improvement in imputation accuracy compared to the state of the art. In general, the SKNN method provides more accurate results and better captures the statistical features of the data set to impute. However, if the share of data to impute is not too large, the H-DIRT method provides comparable accuracy at a much lower computational cost. Hence, this study presents an easily implementable and computationally affordable approach for improving, in various contexts, the state of the art in power demand data imputation. It establishes a foundation for future exploration into trends, seasonal factors, and external variables influencing power load parameters.
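The seasonal-enrichment idea behind SKNN can be sketched as follows. This is a toy illustration under assumptions of ours, not the paper's implementation: the cyclical sine/cosine encoding of hour and weekday, the distance metric, and the demand values are all hypothetical.

```python
import math

def seasonal_features(hour_of_day, day_of_week):
    """Encode time of day and day of week as cyclical features, so that
    hour 23 lies close to hour 0 (one plausible form of enrichment)."""
    return [
        math.sin(2 * math.pi * hour_of_day / 24),
        math.cos(2 * math.pi * hour_of_day / 24),
        math.sin(2 * math.pi * day_of_week / 7),
        math.cos(2 * math.pi * day_of_week / 7),
    ]

def sknn_impute(series, missing_t, k=3):
    """series: list of (hour, weekday, demand-or-None).  Impute the value
    at index missing_t by averaging the k observed points nearest in the
    enriched seasonal feature space."""
    h0, d0, _ = series[missing_t]
    target = seasonal_features(h0, d0)
    donors = []
    for t, (h, d, y) in enumerate(series):
        if t == missing_t or y is None:
            continue
        donors.append((math.dist(target, seasonal_features(h, d)), y))
    donors.sort(key=lambda x: x[0])
    chosen = donors[:k]
    return sum(y for _, y in chosen) / len(chosen)

# Hypothetical demand readings: 9 a.m. Monday is missing; the same hour
# on nearby weekdays is a better donor than evening readings.
demand = [(9, 0, None), (9, 1, 50.0), (9, 2, 52.0), (21, 1, 30.0), (21, 2, 31.0)]
print(sknn_impute(demand, 0, k=2))  # → 51.0
```

The morning readings (50.0, 52.0) dominate the neighbourhood because the evening points differ sharply in the hour-of-day features, which is the intended effect of the enrichment.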
... Majidpour et al [10] perform a comparison between five techniques, Constant (zero), Mean, Median, Maximum Likelihood, and Multiple Imputation, applied to compensate for missing values in Electric Vehicle (EV) charging data; the Constant (zero) and Median techniques presented the best results. Pazhoohesh et al [11] perform a comparative study of eight techniques for imputation of missing values in building sensor data, namely Monte Carlo Markov Chain (MCMC), Hmisc aregImpute, K-Nearest Neighbours (KNN), Simple Mean, Expectation-Maximization, Random Value, Regression and Stochastic Regression; the authors concluded that one needs to identify the percentage of missing data before selecting the appropriate imputation technique in order to achieve the best result. ...
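The evaluation protocol behind comparisons like these, artificially removing a known share of values, imputing, and scoring against the withheld truth with MAE and RMSE, can be sketched for the Simple Mean method. This is a toy run on synthetic data, not the authors' code; the series, missing fraction, and seed are arbitrary choices:

```python
import math
import random

def mean_impute(values):
    """Replace each None with the mean of the observed values
    (the 'Simple Mean' approach)."""
    observed = [v for v in values if v is not None]
    m = sum(observed) / len(observed)
    return [m if v is None else v for v in values]

def mae_rmse(true_vals, imputed_vals, missing_idx):
    """Score only at the artificially removed positions."""
    errs = [imputed_vals[i] - true_vals[i] for i in missing_idx]
    mae = sum(abs(e) for e in errs) / len(errs)
    rmse = math.sqrt(sum(e * e for e in errs) / len(errs))
    return mae, rmse

# Toy run: remove 20% of a synthetic series at random, impute, score.
random.seed(0)
truth = [20 + math.sin(i / 4) for i in range(50)]
missing = random.sample(range(50), 10)
with_gaps = [None if i in missing else v for i, v in enumerate(truth)]
mae, rmse = mae_rmse(truth, mean_impute(with_gaps), missing)
print(f"MAE={mae:.3f} RMSE={rmse:.3f}")
```

Repeating this loop many times with fresh random gaps (the paper above averages 1000 runs) and varying the missing fraction is what allows the methods to be ranked per missing-data percentage.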
Article
The electricity sector has added plenty of new technologies in recent years. Smart Grids are characterized by the use of monitoring and communication technologies throughout almost the whole system. The application and use of such new technologies triggers a significant growth in the volume of data, increasing the amount of errors and missing data and thus hindering the analysis. In this context, this paper performs the modeling, implementation, validation and comparative analysis of four data imputation techniques: K-Nearest Neighbor, Median Imputation, Last Observation Carried Forward, and Makima. The aim is to verify whether they could be applied to the electric segment, more specifically to the Smart Grid environment. The database used in the research is obtained from the electricity utility CEEE and its underground substations, located in southern Brazil. Following this, five simulation scenarios are created and one data set is removed, based on pre-established criteria. Finally, the techniques are applied and the new database is compared with the original one. From the simulation results, the technique that presented the best results is Makima; it is validated as robust enough to be applied in the Smart Grid environment, especially for electrical data missing from an electric power substation.
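Of the four techniques compared in that study, Last Observation Carried Forward is the simplest to state precisely. A minimal sketch follows; the handling of a leading gap (falling back to the first observed value) is one common convention and an assumption of ours, not necessarily what the study used:

```python
def locf_impute(series, fallback=None):
    """Last Observation Carried Forward: fill each None with the most
    recently observed value.  Leading Nones fall back to the first
    observed value (assumed convention), or to `fallback` if the
    series is entirely missing."""
    first_obs = next((v for v in series if v is not None), fallback)
    out, last = [], first_obs
    for v in series:
        if v is not None:
            last = v
        out.append(last)
    return out

print(locf_impute([None, 3.0, None, None, 5.0, None]))
# → [3.0, 3.0, 3.0, 3.0, 5.0, 5.0]
```

LOCF is attractive for substation telemetry because it is O(n) and needs no model, but it flattens any trend inside a gap, which is why interpolation-style methods such as Makima can outperform it on smooth signals.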
... E-infrastructure is facing upcoming challenges for the future energy system, such as the incorporation of diverse and new data types and data sources. These may bring difficulties in terms of data storage, data communication [19], data interoperability, cyber-interdependencies [20], missing data treatment [21], and coordination of security policy, which could be mitigated by studying the bridges and connectors between the energy sector and computational and e-infrastructure. Additionally, the e-infrastructure of the energy system suffers from cyber security challenges caused by interconnectivity between sensors and controllers. ...
Article
Research and development are critical for driving economic growth. To realise the UK government’s Industrial Strategy, we develop an energy research and innovation infrastructure roadmap and landscape for the energy sector looking to the long term (2030). This study is based on a picture of existing UK infrastructure on energy. It shows the links between the energy sector and other sectors, the distribution of energy research and innovation infrastructures, the age of these infrastructures, where most of the energy research and innovation infrastructures are hosted, and the distribution of energy research and innovation infrastructures according to their legal structure. Next, this study identifies the roadmap of energy research and innovation infrastructures by 2030, based on a categorisation of the energy sector into seven sub-sectors. Challenges and future requirements are explored for each of the sub-sectors, encompassing fossil fuels and nuclear energy to renewable energy sources and hydrogen, and from pure science to applied engineering. The study discusses the potential facilities to address these challenges within each sub-sector. It explores the e-infrastructure and data needs for the energy sector and provides a discussion on other sectors of the economy that energy research and innovation infrastructures contribute to. Some of the key messages identified in this study are the need for further large-scale initiatives and large demonstrators of multi-vector energy systems, the need for multi-disciplinary research and innovation, and the need for greater data sharing and cyber-physical demonstrators. Finally, this work will serve as an important study to provide guidance for future investment strategy for the energy sector.
Conference Paper
Building performance simulation has the potential to quantitatively evaluate design alternatives and various energy conservation measures for retrofit projects. However, before design strategies can be evaluated, accurate modeling of existing conditions is crucial. This paper extends current model calibration practice by presenting a probabilistic method for estimating uncertain parameters in HVAC systems for whole building energy modeling. Using Markov Chain Monte Carlo (MCMC) methods, probabilistic estimates of the parameters in two HVAC models were generated for use in EnergyPlus. Demonstrated through a case study, the proposed methodology provides predictions that more accurately match observed data than base case models that are developed using default values, typical assumptions and rules of thumb.
Article
A significant portion of the energy consumption in post-secondary educational institutes is for lighting classrooms. The occupancy patterns in post-secondary educational institutes are not stable and predictable, and thus, alternative solutions may be required to match energy consumption and occupancy in order to increase energy efficiency. In this paper, we report an experimental research on quantifying and understanding lighting energy waste patterns in a post-secondary educational institute. Data has been collected over a full academic year in three typical classrooms. Data association mining, a powerful data mining tool, is applied to the data in order to extract association rules and explore lighting waste patterns. The simulation results show that if the waste patterns are avoided, significant savings, as high as 70% of the current energy use, are achievable.
Article
Motivation: Gene expression microarray experiments can generate data sets with multiple missing expression values. Unfortunately, many algorithms for gene expression analysis require a complete matrix of gene array values as input. For example, methods such as hierarchical clustering and K-means clustering are not robust to missing data, and may lose effectiveness even with a few missing values. Methods for imputing missing data are needed, therefore, to minimize the effect of incomplete data sets on analyses, and to increase the range of data sets to which these algorithms can be applied. In this report, we investigate automated methods for estimating missing data. Results: We present a comparative study of several methods for the estimation of missing values in gene microarray data. We implemented and evaluated three methods: a Singular Value Decomposition (SVD) based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average. We evaluated the methods using a variety of parameter settings and over different real data sets, and assessed the robustness of the imputation methods to the amount of missing data over the range of 1–20% missing values. We show that KNNimpute appears to provide a more robust and sensitive method for missing value estimation than SVDimpute, and both SVDimpute and KNNimpute surpass the commonly used row average method (as well as filling missing values with zeros). We report results of the comparative experiments and provide recommendations and tools for accurate estimation of missing microarray data under a variety of conditions.
Article
Building energy consumption makes up 40% of the total energy consumption in the United States. Given that energy consumption in buildings is influenced by aspects of urban form such as density and floor-area-ratios (FAR), understanding the distribution of energy intensities is critical for city planners. This paper presents a novel technique for estimating commercial building energy consumption from a small number of building features by training machine learning models on national data from the Commercial Buildings Energy Consumption Survey (CBECS). Our results show that gradient boosting regression models perform the best at predicting commercial building energy consumption, and can make predictions that are on average within a factor of 2 from the true energy consumption values (with an r2 score of 0.82). We validate our models using the New York City Local Law 84 energy consumption dataset, then apply them to the city of Atlanta to create aggregate energy consumption estimates. In general, the models developed only depend on five commonly accessible building and climate features, and can therefore be applied to diverse metropolitan areas in the United States and to other countries through replication of our methodology.
Article
Retrofitting existing buildings is urgent given the increasing need to improve the energy efficiency of the existing building stock. This paper presents a scalable, probabilistic methodology that can support large scale investments in energy retrofit of buildings while accounting for uncertainty. The methodology is based on Bayesian calibration of normative energy models. Based on CEN-ISO standards, normative energy models are light-weight, quasi-steady state formulations of heat balance equations, which makes them appropriate for modeling large sets of buildings efficiently. Calibration of these models enables improved representation of the actual buildings and quantification of uncertainties associated with model parameters. In addition, the calibrated models can incorporate additional uncertainties coming from retrofit interventions to generate probabilistic predictions of retrofit performance. Probabilistic outputs can be straightforwardly translated to quantify risks of under-performance associated with retrofit interventions. A case study demonstrates that the proposed methodology with the use of normative models can correctly evaluate energy retrofit options and support risk conscious decision-making by explicitly inspecting risks associated with each retrofit option.
Book
Contents: Introduction; Assumptions; EM and Inference by Data Augmentation; Methods for Normal Data; More on the Normal Model; Methods for Categorical Data; Loglinear Models; Methods for Mixed Data; Further Topics; Appendices; References; Index.