A Comparison of Methods for Missing Data Treatment in Building Sensor Data
Mehdi Pazhoohesh, Zoya Pourmirza, Sara Walker
School of Engineering, Newcastle University, Newcastle, UK
Abstract—Data collection is a fundamental component in the
study of energy and buildings. Errors and inconsistencies in
the data collected from test environment can negatively
influence the energy consumption modelling of a building and
other control and management applications. This paper
addresses the gap in the current study of missing data
treatment. It presents a comparative study of eight methods for
imputing missing values in building sensor data. The dataset
used in this study consists of real data collected from our test
bed, a living lab at Newcastle University. When the
data imputation process is completed, we used Mean Absolute
Error, and Root Mean Squared Error methods to evaluate the
difference between the imputed values and real values. In
order to achieve more accurate and robust results, this process
has been repeated 1000 times, and the average of the 1000
simulations is presented in this paper. Finally, it is concluded that it is
necessary to identify the percentage of missing data before
selecting the proper imputation method, in order to achieve the
best result.
Keywords: energy and building data; data imputation;
missing value; KNN; MCMC; MAE; RMSE
Nowadays, data collection is a key process in the study of
energy and buildings. For instance, building energy control
and retrofit analysis are two applications that rely on collecting
large amounts of data from installed sensors. In addition, data
collected from buildings have been used for modelling the
energy consumption of buildings through software
such as EnergyPlus [1].
However, significant discrepancies between the simulated
and measured energy consumption of buildings are the
motivation to focus more on analysing data collected through
extensive sensor networks.
A. Related Work and Gap Analysis
Different calibration techniques such as Bayesian
calibration [2], [3] and systematic evidence-based
approaches [4] have been used to uncover discrepancies
between simulated and measured energy consumption of
buildings. However, a considerable amount of data is
usually missing for reasons such as low signal-to-noise
ratio, measurement error, sensor malfunction, power
outages at the sensors or network failure, which can
lead to data analysis problems. Hence, estimating
missing values plays a significant role in the calibration of
building energy models as a pre-processing step. Moreover,
evaluating and predicting a building's energy consumption
through statistical and data mining methods requires time-
series data in which missing values can significantly
influence the analysis results, further emphasizing the
importance of missing value estimation. Different
approaches are used to deal with missing values in most
scientific research domains, such as biology [5], medicine [6]
and climate science [7]. However, there are limited studies
dealing with missing data in building energy systems. One
approach is to delete all missing values and analyse the
behaviour of the building based on the available data. The issue
with this method is that very few observations may remain,
leaving a very small dataset from which to model the
behaviour of the building [8], [9]. Another
approach is mean imputation, where missing data are
replaced with the mean value of the variable [8], [10]. This
method distorts the distribution of the variable and the
relationships between variables, and can result in large errors
between predicted and actual values. A third method
is replacing missing values with some constant (e.g., zero),
which has been used in applications that cannot tolerate
gaps in the data [5]. Although a variety of techniques have been
developed to treat missing values with statistical prediction
in other fields, there is a lack of research on
substituting missing values that would provide guidelines
for making the most appropriate methodological choice for
energy and building related data. In this study, we
compare eight different imputation methods, namely, Markov
chain Monte Carlo (MCMC) [11], Hmisc aregImpute [12],
K-nearest neighbours (KNN) [13], simple Mean,
Expectation-Maximization [14], [15], Random value,
Regression and Stochastic regression [15], to find
which method best fits energy and building datasets.
The comparison was performed on a real lighting dataset
collected over a six-month period, under a Missing
Completely at Random (MCAR) assumption, based on
Mean Absolute Error (MAE) and Root Mean Squared Error
(RMSE) evaluation criteria for estimating missing values in
building data.
A. Study Site
The data used for this study are lighting time-series data as
the main dataset and corresponding occupancy data as the
supportive dataset, both collected from the 3rd floor of the
Urban Sciences Building, Newcastle University, United
Kingdom (Figure 1). Data collection took place between
February 2018 and July 2018 at 1-minute intervals. The
collected data were averaged to obtain half-hourly values
with 7968 data points.
2019 the 7th International Conference on Smart Energy Grid Engineering
978-1-7281-2440-7/19/$31.00 ©2019 IEEE
Figure 1. USB building.
B. Selection of Imputation Method
In order to conduct this study, we have selected eight
imputation methods, which are among the most well-known
techniques and cover a range of statistical strategies, from
simple single-value substitution to multiple imputation. These
techniques are Mean, Random, the K-nearest neighbours algorithm
(KNN), aregImpute (Hmisc) in R, Markov chain Monte
Carlo (MCMC) [15], the expectation-maximization (EM)
algorithm [11], Regression and Stochastic regression.
Here we briefly discuss each technique. The Mean
method replaces each missing value with the mean of all
known values of that variable.
The Random technique predicts each missing value by
drawing randomly between the minimum and maximum
values of the dataset.
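These two baselines can be written in a few lines; the sketch below (our illustration, not the authors' code) uses numpy and assumes missing entries are encoded as NaN:

```python
import numpy as np

rng = np.random.default_rng(42)

def mean_impute(x):
    """Replace each NaN with the mean of the observed values."""
    out = x.copy()
    out[np.isnan(out)] = np.nanmean(out)   # nanmean ignores the gaps
    return out

def random_impute(x, rng=rng):
    """Replace each NaN with a uniform draw between the observed min and max."""
    out = x.copy()
    mask = np.isnan(out)
    out[mask] = rng.uniform(np.nanmin(out), np.nanmax(out), size=mask.sum())
    return out

x = np.array([3.0, np.nan, 5.0, 7.0, np.nan])
print(mean_impute(x))    # NaNs become 5.0, the mean of 3, 5 and 7
```

Note that mean imputation assigns the same value to every gap, which is exactly the variance-distortion problem discussed above.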
The nearest neighbour algorithm [16] is a nonparametric
method that replaces a missing value by averaging the
non-missing values of its neighbours. In this method, the
K nearest neighbours are selected to predict the missing
value, and each of these neighbours has the same influence.
Depending on the number of selected neighbours (the K
value), the estimated value can vary significantly. Hence,
choosing the proper number of neighbours has a great
influence on the prediction. In this paper, the effect of
different values of the parameter K on estimation accuracy
is discussed.
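A minimal sketch of this temporal KNN scheme, assuming a univariate series with NaN gaps and index distance as the measure of time distance (the function name and toy data are ours):

```python
import numpy as np

def knn_time_impute(x, k):
    """Fill each NaN with the mean of its k nearest observed
    neighbours in time (distance = difference in index)."""
    out = x.copy()
    obs_idx = np.flatnonzero(~np.isnan(x))
    for i in np.flatnonzero(np.isnan(x)):
        # the k observed indices closest in time to position i
        nearest = obs_idx[np.argsort(np.abs(obs_idx - i))[:k]]
        out[i] = x[nearest].mean()
    return out

x = np.array([1.0, np.nan, 3.0, 4.0, np.nan, 6.0])
print(knn_time_impute(x, k=2))   # [1. 2. 3. 4. 5. 6.]
```

With k = 1 on a half-hourly series, the nearest recorded reading is copied, which matches the "30-minute boundary" interpretation of K used later in the paper.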
The aregImpute function in the Hmisc library [12]
replaces the missing value with predictive mean
matching, computed by optional weighted
probability sampling from similar cases. In the aregImpute
function, missing values for any parameter are estimated
based on the other parameters. In this paper, occupancy data
are used as the supportive variable for estimating missing
values in the lighting dataset.
Markov chain Monte Carlo (MCMC) is an iterative
algorithm based on chained equations that uses an
imputation model specified separately for each variable,
involving the other variables as predictors. MCMC
is used to generate pseudo-random draws and provides
several imputed datasets. It requires either MAR or MCAR
data and can be applied to both arbitrary and monotone
patterns of missing data. A Markov chain is a sequence of
random variables in which the probability of each element
depends only on the value of the previous one.
In MCMC simulation, by constructing a Markov chain
whose stationary distribution is the distribution of interest,
one can obtain a sample of the desired distribution by
repeatedly simulating steps of the chain.
Refer to Schafer [17] for a detailed discussion of this method.
In the regression imputation method, missing values
are replaced with the predicted score from a regression
equation. Although the imputed data are computed using
information from the observed data, only one representative
value is produced for each group of missing data,
which weakens the variance. Another method
inspired by the regression concept is stochastic
regression. This method aims to reduce the bias with an
additional step that augments each predicted score
with a residual term, so each missing value is replaced
with a different imputed number [15].
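The contrast between the two regression variants can be sketched as follows; the simple linear model, variable names and toy data are our assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def regression_impute(y, x, stochastic=False, rng=rng):
    """Fit y ~ a + b*x on the complete cases, then fill missing y with
    the regression prediction; with stochastic=True, a residual drawn
    from the fitted noise distribution is added so the imputed values
    retain variance instead of collapsing onto the regression line."""
    obs = ~np.isnan(y)
    b, a = np.polyfit(x[obs], y[obs], 1)       # slope, intercept
    pred = a + b * x
    resid_sd = np.std(y[obs] - pred[obs])
    out = y.copy()
    miss = np.isnan(y)
    out[miss] = pred[miss]
    if stochastic:
        out[miss] += rng.normal(0.0, resid_sd, size=miss.sum())
    return out

x = np.arange(6, dtype=float)                  # e.g. a time index
y = np.array([0.0, 2.1, np.nan, 5.9, 8.2, np.nan])
print(regression_impute(y, x))                 # NaNs -> ~4.05 and ~10.11
print(regression_impute(y, x, stochastic=True))
```

Running the deterministic version repeatedly always yields the same fills; the stochastic version perturbs each fill differently, which is the bias-reduction step described above.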
A. Evaluation Criteria
To evaluate the imputation, mean absolute error (MAE) and
root mean square error (RMSE) were computed over the
given period for the imputed lighting data.
These metrics are used to compare the eight imputation
algorithms. RMSE represents the sample standard deviation
of the difference between actual and estimated values:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}$$

MAE measures the average magnitude of the errors in a
set of predictions:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

where $n$ denotes the number of test samples, $y_i$ represents
the $i$th target value, and $\hat{y}_i$ stands for the predicted value
for the $i$th test sample.
RMSE and MAE both indicate how close the modelled
and observed values are. Because RMSE takes the square root
of the average squared error, it gives relatively high weight to
large errors. It is therefore appropriate when penalizing
large errors is desirable.
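Both criteria are straightforward to compute; the sketch below (our illustration) shows how a single large error inflates RMSE more than MAE:

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error: penalises large errors more heavily."""
    return np.sqrt(np.mean((actual - predicted) ** 2))

def mae(actual, predicted):
    """Mean absolute error: average magnitude of the errors."""
    return np.mean(np.abs(actual - predicted))

actual    = np.array([1.0, 2.0, 3.0, 4.0])
predicted = np.array([1.0, 2.0, 3.0, 8.0])   # one large error
print(mae(actual, predicted))    # 1.0
print(rmse(actual, predicted))   # 2.0 -- the single large error dominates
```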
B. Estimation Process
The process of the analysis is depicted in Figure 2. Due
to the large size of the original dataset with one-minute
intervals, a half-hourly dataset was generated by averaging
each 30 minutes of data; we call this the calibrated dataset.
Under the assumption of "Missing Completely at Random"
(MCAR), versions with 10%, 20% and 30% missing data
were generated from the calibrated dataset. Afterwards, the
missing data were imputed using the eight methods. In the
next step, the difference between the substituted values and
the real values was computed with RMSE and MAE. To
provide a more accurate comparison, the missing value
generation step and the corresponding imputation algorithms
were run for 1000 simulations, and the average of the 1000
simulations was used for the final evaluations.
Figure 2. Principle of the analysis.
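The pipeline of Figure 2 — MCAR deletion, imputation, scoring, and averaging over repeated simulations — can be sketched end to end as follows. Synthetic data stand in for the lighting series and only the simple Mean method is shown, so the numbers are illustrative only; all names are ours:

```python
import numpy as np

rng = np.random.default_rng(7)

def make_mcar(x, frac, rng):
    """Delete a fraction of the values completely at random (MCAR)."""
    out = x.copy()
    idx = rng.choice(len(x), size=int(frac * len(x)), replace=False)
    out[idx] = np.nan
    return out, idx

def mean_impute(x):
    """Simplest of the eight methods: fill NaNs with the observed mean."""
    out = x.copy()
    out[np.isnan(out)] = np.nanmean(out)
    return out

def rmse(a, b):
    return np.sqrt(np.mean((a - b) ** 2))

# synthetic stand-in for the 7968-point half-hourly lighting series
data = rng.normal(50.0, 10.0, size=500)

for frac in (0.1, 0.2, 0.3):
    scores = []
    for _ in range(1000):                       # 1000 repeated simulations
        masked, idx = make_mcar(data, frac, rng)
        filled = mean_impute(masked)
        scores.append(rmse(data[idx], filled[idx]))
    print(f"{frac:.0%} missing: avg RMSE = {np.mean(scores):.2f}")
```

Scoring only the deleted positions (`idx`) mirrors the paper's evaluation: the imputed values are compared against the real values that were withheld.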
C. Result Analysis
As mentioned before, in the KNN method the number
of selected neighbours plays an important role. As the
percentage of missing data increases, a bigger K value gives
the best KNN results. In other words, when about 10% of the
data are missing, the value closest to the missing data point
is the best value for imputation (Figure 3(a)). However, with
more missing data, i.e. 20% missing, the optimum was
achieved with a K value of 2; in other words, an hourly
boundary gave the best result (Figure 3(b)). For the 30%
missing dataset, the best K was 4, which means a boundary
of two hours resulted in better imputation of the missing data
(Figure 3(c)). The trend of the best K value against the
missing percentage is depicted in Figure 4. From this figure,
it is also evident that increasing the percentage of missing
data results in a higher RMSE value, which can be
considered a logical confirmation of the principle of our
analysis.
Figure 3. Relationship between missing percentage and best K value.
Figure 4. Trend of K values.
Figure 5 illustrates the comparison of all methods in
terms of computed RMSE.
It should be mentioned that to simplify the evaluation, for
KNN method, the average of RMSE for each set of missing
data (10%, 20% and 30%) is considered for this comparison.
For 10 percent missing data (Figure 5(a)), Random and
Stochastic regression and MCMC techniques achieved the
highest percentage of error based on root mean square
analysis. With a remarkable gap, KNN shows less error than
other methods. AregImpute, Mean, regression and EM
techniques achieve the same RMSE, approximately.
(Figure 3 panel titles: K vs RMSE for 10%, 20% and 30% missing data.
Figure 4 axes: K value against missing data percentage, for K = 1, 2, 4, 6, 8 and 10.)
For 20 percent missing values in the dataset (Figure 5(b)),
approximately the same pattern of RMSE values was
observed. KNN performs best, while MCMC, Random
and Stochastic regression perform worst. The RMSE for the
aregImpute technique increased slightly compared with the
Mean, Regression and EM methods.
For the dataset with 30 percent missing values (Figure 5(c)),
the KNN, Regression and Mean techniques prove the most
suitable when a higher percentage of values is missing.
There is a significant increase in error for the EM
algorithm on this dataset. The KNN and Random methods
are, respectively, the best and approximately the worst methods.
Figure 5. Comparison of methods based on RMSE.
The evaluation of Mean Absolute Error is depicted in
Figure 6. Figure 6(a) shows that KNN has remarkably lower
error than the other methods. The MAE computed for the 20
percent missing dataset (Figure 6(b)) and the 30 percent missing
dataset (Figure 6(c)) shows that the KNN technique achieves the
lowest error.
Figure 6. Comparison of methods based on MAE.
The objective of this research is to highlight the
importance of the method used in the energy and
building fields to treat missing values. This paper shows
that it is important to identify the percentage of missing data
before selecting the proper method. In this research, eight
popular imputation techniques were applied to generated
datasets with 10, 20 and 30 percent missing values. The
results show that for 10% missing data, KNN achieves
better accuracy in predicting missing values. Moreover,
the best value for K (the number of neighbours) was found
to be one or two, which means that in this research the best
values for replacing missing data in the 10 percent dataset
are the next 30 minutes or the next hour of recorded data.
For the 20% missing data, KNN again shows the best
results. For this dataset, it is also concluded that the best
value for K corresponds to the next 30 minutes or the next
hour to fill the missing values.
For the dataset with 30% missing data, KNN again
achieves the best result. However, the best value for K
increased to 4, which means the next two hours of data are
more suitable for imputing the current missing value.
Therefore, it is concluded that increasing the percentage
of missing data requires more neighbours to estimate the
missing data.
Additionally, the results of this study showed that the
lighting data depend more on time than on other variables
such as occupancy. One reason the authors identified is the
topology of the sensors: the test bed area was equipped with
seven occupancy sensors but only one lighting meter.
Therefore, the occupancy value used for the imputation was
the average of these data in each 30-minute interval.
The findings of this research are limited to the lighting
variable, which is strongly time-dependent. In future, we will
investigate other parameters in buildings. Also, the building
tested is an educational building; further investigations are
required for other types of buildings.
The research reported in this paper was supported by
Building as a Power Plant: The use of buildings to provide
demand response project, funded by the Engineering and
Physical Sciences Research Council under Programme Grant
EP/P034241/1, and the Active Building Centre (ABC),
supported by Industrial Strategy Challenge Fund under
Programme Grant EP/S016627/1.
[1] U.S. Department of Energy, "EnergyPlus: Engineering Reference,"
[2] A. Chong and K. Lam, “Uncertainty analysis and parameter
estimation of HVAC systems in building energy models,” In 14th
Conference of International Building Performance Simulation
Association, Hyderabad, India, 2015.
[3] Y. Heo, R. Choudhary and G. Augenbroe, “Calibration of building
energy models for retrofit analysis under uncertainty,” Energy and
Buildings, vol. 47, pp. 550-560, 2012.
[4] P. Raftery, M. Keane and J. O’Donnell, “Calibrating whole building
energy models: An evidence-based methodology,” Energy and
Buildings, vol. 43, no. 9, pp. 2356-2364, 2011.
[5] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R.
Tibshirani, D. Botstein and R. Altman, “Missing value estimation
methods for DNA microarrays,” Bioinformatics, vol. 17, no. 6, pp.
520-525, 2001.
[6] H. Lewis, “Missing Data in Clinical Trials,” The New England
Journal of Medicine, vol. 367, pp. 2557-2558, 2012.
[7] T. Schneider, “Analysis of incomplete climate data: Estimation of
mean values and covariance matrices and imputation of missing
values,” Journal of climate, vol. 14, no. 5, pp. 853-871, 2001.
[8] A. Gelman and J. Hill, Data analysis using regression and
multilevel/hierarchical models, Cambridge university press, 2006.
[9] C. Robinson, B. Dilkina, J. Hubbs, W. Zhang, S. Guhathakurta, M.
Brown and R. Pendyala, “Machine learning approaches for estimating
commercial building energy consumption,” Applied energy, vol. 208,
pp. 889-904, 2017.
[10] D. Cabrera and H. Zareipour, “Data association mining for
identifying lighting energy waste patterns in educational institutes,”
Energy and Buildings, vol. 62, pp. 210-216, 2013.
[11] F. Nelwamondo, S. Mohamed and T. Marwala, “Missing data: A
comparison of neural network and expectation maximization
techniques,” Current Science, vol. 93, pp. 1514-1521, 2007.
[12] F. Harrell, "Hmisc (v 3.0-12): Harrell miscellaneous library for R
statistical software," R package (v 2.2-3), 2006.
[13] J. Leek, E. Monsen, A. Dabney and J. Storey, "EDGE: extraction and
analysis of differential gene expression," Bioinformatics, vol. 22, no.
4, pp. 507-508, 2005.
[14] C. Musil, C. Warner, P. Yobas and S. Jones, “A comparison of
imputation techniques for handling missing data,” Western Journal of
Nursing Research, vol. 24, no. 7, pp. 815-829, 2002.
[15] D. Schunk, "A Markov chain Monte Carlo algorithm for multiple
imputation in large surveys," Advances in Statistical Analysis, vol. 92,
pp. 101-114, 2008.
[16] T. Cover and P. Hart, "Nearest neighbor pattern classification,"
IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21-27, 1967.
[17] J. Schafer, Analysis of incomplete multivariate data, New York:
Chapman and Hall/CRC, 1997.
... Le domaine du bâtiment ne fait pas exception. Des méthodes d'imputation y ont par exemple été appliquées pour traiter des données sur les ambiances thermique et lumineuse, et sur la consommation énergétique de systèmes (Chong et al. 2016;Pazhoohesh et al. 2019;Cho et al. 2020). ...
... Paradoxalement, les auteurs partent souvent d'une base de données complète, i.e. sans données manquantes, et en suppriment un pourcentage plus ou moins important (5 à 40 %), en appliquant l'un des trois mécanismes, pour obtenir une base incomplète (Chong et al. 2016;Pazhoohesh et al. 2019;Cho et al. 2020;Okafor et Delaney 2021). Cette approche permet, lors du calcul des indicateurs de performance, de comparer les écarts entre les versions complète et incomplète de la base. ...
... En complément, ils conseillent d'utiliser des variables décalées pour améliorer les performances des méthodes. Le paramétrage des méthodes dépend aussi du pourcentage de données manquantes selon Pazhoohesh et al. (2019), qui ont observé que le nombre de voisins de la méthode KNN doit augmenter avec la part de données manquantes pour conserver de bonnes performances pour leurs données MCAR. ...
Conference Paper
Full-text available
Le traitement des données issues de bâtiments connectés doit permettre d'optimiser leur gestion énergétique, tout en garantissant un niveau de confort aux occupants. Les données collectées peuvent toutefois être incomplètes du fait de défaillances lors de l'acquisition des mesures. Cela compromet le traitement de l'information. Des méthodes, dites d'imputation, sont alors à appliquer pour consolider les données. Cet article propose un état de l'art sur les méthodes de gestion des données manquantes et d'évaluation de la qualité de l'imputation. Neuf méthodes d'imputation sont ensuite appliquées au cas de données d'ambiance d'un appartement T2, pour lequel le statut de présence de l'occupant est connu. Les méthodes sont comparées, d'une part en étudiant la qualité de l'imputation sur ces séries temporelles multivariées et d'autre part en évaluant la performance des méthodes sur la tâche finale, i.e. la classification du statut de présence. Il ressort de ces comparaisons que la performance de la tâche finale est peu affectée par la performance des méthodes d'imputation dans notre cas. MOTS-CLÉS : Bâtiments connectés, Imputation de données, Occupation. ABSTRACT. The processing the data from smart building is a way to both optimise their energy management and provide a high level of comfort to the occupants. Because of various possible failures in the data collection process, the information gathered can be incomplete or incorrect. In such case, so-called data imputation methods should be used in order to make possible the data processing. This article reviews methods to deal with missing data and to assess the performance of the imputation. Nine of these methods are applied to the data collected from an indoor environment-monitoring sensor located in an apartment for which the presence status of the occupant is known. 
The methods are compared based on the imputation quality of the multivariate time series as well as based on the performance of the final classification task, i.e. classifying the occupancy status. For this case study, it turns out that the performance imputation task has little impact on the performance of the final task.
... A simple random sampling was then applied to identify the patients' file numbers as sample units from both clinics. The study variables (like age of patient, and BMI among others) were extracted from several previous related studies concerning breast cancer [6], [13], and [14]. ...
... The idea based on this approach is to use a mean value of each non-missing variable to fill in missed values for all observations [13]. The mean imputation technique is more appropriate when the amount of missingness is small whilst the size of the sample is large. ...
... A non-parametric approach used to impute missing data by averaging its neighbouring observed data [13]. ...
Background: Clinical datasets are at risk of having missing data for several reasons including patients’ failure to attend clinical measurements and measurement recorder’s defects. Missing data can significantly affect the analysis and results might be doubtful due to bias caused by omission incomplete records during analysis especially if a dataset is small. This study aims to compare several imputation methods in terms of efficiency in filling-in missing data so as to increase prediction and classification accuracy in breast cancer dataset. Methodology: Five imputation methods namely series mean, k-nearest neighbour, hot deck, predictive mean matching, expected maximisation via bootstrapping, and multiple imputation by chained equations were applied to replace the missing values to the real breast cancer dataset. The efficiency of imputation methods was compared by using the Root Mean Square Errors and Mean Absolute Errors to obtain a suitable complete dataset. Binary logistic regression and linear discrimination classifiers were applied to the imputed dataset to compare their efficacy on classification and discrimination. Results: The evaluation of imputation methods revealed that the predictive mean matching method was better off compared to other imputation methods. In addition, the binary logistic regression and linear discriminant analyses yield almost similar values on overall classification rates, sensitivity and specificity. Conclusion: The predictive mean matching imputation showed higher accuracy in estimating and replacing missing data values in a real breast cancer dataset under the study. It is a more effective and good approach to handle missing data. We recommend replacing missing data by using predictive mean matching since it is a plausible approach toward multiple imputations for numerical variables. It improves estimation and prediction accuracy over the use complete-case analysis especially when percentage of missing data is not very small.
... The study used discrimination and calibration measures to assess usefulness of the prognostic model. On the other hand, the comparison of imputation algorithms for building sensor data across several percentage of missing data was conducted by comparing the differences between real and imputed values through the use of Root Mean Squared Error and Mean Absolute Error estimates; its conclusion emphasized the necessity of identifying percentage of missingness prior to selecting proper imputation technique so as to reach plausible results [14]. Moreover, imputation techniques were used to evaluate the performance of model via discrimination, calibration, and effectiveness of classifiers in relation to time used to build a model in estimating the risk of unprovoked venous thromboembolism recurrence [15]. ...
... A non-parametric approach used to impute missing data by averaging its neighbouring observed data [14]. The approach is donor-based in which imputed values are either measured as a single records in the dataset (1-NN) or as an average value obtained from k records (k-NN) [31]. ...
Full-text available
Several studies have been conducted to classify various real life events but few are in medical fields; particularly about breast recurrence under statistical techniques. To our knowledge, there is no reported comparison of statistical classification accuracy and classifiers’ discriminative ability on breast cancer recurrence in presence of imputed missing data. Therefore, this article aims to fill this analysis gap by comparing the performance of binary classifiers (logistic regression, linear and quadratic discriminant analysis) using several datasets resulted from imputation process using various simulation conditions. Our study aids the knowledge about how classifiers’ accuracy and discriminative ability in classifying a binary outcome variable are affected by the presence of imputed numerical missing data. We simulated incomplete datasets with 15, 30, 45 and 60% of missingness under Missing At Random (MAR) and Missing Completely At Random (MCAR) mechanisms. Mean imputation, hot deck, k-nearest neighbour, multiple imputations via chained equation, expected-maximisation, and predictive mean matching were used to impute incomplete datasets. For each classifier, correct classification accuracy and area under the Receiver Operating Characteristic (ROC) curves under MAR and MCAR mechanisms were compared. The linear discriminant classifier attained the highest classification accuracy (73.9%) based on mean-imputed data at 45% of missing data under MCAR mechanism. As a classifier, the logistic regression based on predictive mean matching imputed-data yields the greatest areas under ROC curves (0.6418) at 30% missingness while k-nearest neighbour tops the value (0.6428) at 60% of missing data under MCAR mechanism.
... Majidpour et al [10] perform a comparison between five techniques: Constant (zero), Mean, Median, Maximum Likelihood, and Multiple Imputation applied to compensate for missing values in Electric Vehicle (EV) charging, both Constant (zero) and Median techniques presented the best results. Pazhoohesh et al [11] perform a comparative study of eight techniques for imputation of missing values in building sensor data, that is, Monte Carlo Markov Chain (MCMC), Hmisc aregImpute, K-Nearest Neighbours (KNN), Simple Mean, Expectation-Maximization, Random Value, Regression and Stochastic Regression, the authors got to the conclusion that one needs to identify the percentage of missing data before selecting the appropriate imputation technique in order to achieve the best result. ...
Full-text available
The electricity sector has added plenty of new technologies in recent years. Smart Grids are characterized by the use of monitoring and communication technologies almost in whole system. The application and use of such new technologies triggers a significant growth in the data number, increasing the amount of errors and missing data, thus hindering the analysis. In this context, this paper performs the modeling, implementation, validation and comparative analysis of four data imputation techniques: K-Nearest Neighbor, Median Imputation, Last Observation Carried Forward, and Makima. The aim is to verify if they could be applied to the electric segment - more specifically to the Smart Grids environment. The database used in the research is obtained from the electricity utility CEEE and its underground substations, located in southern Brazil. Following this, five simulation scenarios are created and one data set is removed, based on pre-established criteria. Finally, the techniques are applied and the new database is compared with the original one. From the simulation results, the technique which presented the best results is Makima, it is validated as robust to be applied in the Smart Grids environment, especially in electrical data missing from an electric power substation.
... E-infrastructure is facing upcoming challenges for future energy system, such as the incorporation of diverse and new data types and data sources. These may bring difficulties in terms of data storage, data communication [19], data interoperability, cyberinterdependencies [20], missing data treatment [21], and coordination of security policy, that could be mitigated by studying in the bridges and connectors amongst energy sector and computational and e-infrastructure. Additionally, e-infrastructure of energy system suffers from cyber security challenges caused by interconnectivity between sensors and controllers. ...
Full-text available
Research and development are critical for driving economic growth. To realise the UK government’s Industrial Strategy, we develop an energy research and innovation infrastructure roadmap and landscape for the energy sector looking to the long term (2030). This study is based on a picture of existing UK infrastructure on energy. It shows the links between the energy sector and other sectors, the distribution of energy research and innovation infrastructures, the age of these infrastructures, where most of the energy research and innovation infrastructures are hosted, and the distribution of energy research and innovation infrastructures according to their legal structure. Next, this study identifies the roadmap of energy research and innovation infrastructures by 2030, based on a categorisation of the energy sector into seven subsectors. Challenges and future requirements are explored for each of the sub-sectors, encompassing fossil fuels and nuclear energy to renewable energy sources and hydrogen, and from pure science to applied engineering. The study discusses the potential facilities to address these challenges within each sub-sector. It explores the e-infrastructure and data needs for the energy sector and provides a discussion on other sectors of the economy that energy research and innovation infrastructures contribute to. Some of the key messages identified in this study are the need for further large-scale initiative and large demonstrators of multi-vector energy systems, the need for multi-disciplinary research and innovation, and the need for greater data sharing and cyber-physical demonstrators. Finally, this work will serve as an important study to provide guidance for future investment strategy for the energy sector.
While clustering has been commonly used to profile building electricity consumption data, its application to HVAC system data is relatively rare. Based on the operation data of a residential ground source heat pump (GSHP) system, a pre-processing procedure was set up, including acquisition, cleaning, missing-data fill-in, and standardization. Then hierarchical clustering was used, based on the dynamic time warping (DTW) distance calculation method. A new index, the sum of squares of errors based on DTW, was used to determine the best cluster number. The patterns were extracted based on the clustering results. The heat exchange on the user side during one cooling season was processed as a case study. We obtained 5 valid clusters and extracted patterns from each. For the studied case, the result based on the DTW method yields a better homogeneity compared to the Euclidean method. The method is then applied to five years of operation data, where 9 and 6 patterns were obtained for the cooling and heating seasons, respectively. They differ mainly in shape, and the cooling patterns fluctuate more. Overall, clustering and pattern extraction can reduce the data dimension and provide a data-driven view of how the system supplies the cooling and heating.
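The DTW distance at the core of this clustering step can be sketched with the classic dynamic-programming recurrence. The function name and toy sequences below are illustrative, not from the cited work:

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences,
    using the standard O(n*m) dynamic-programming recurrence with
    absolute-difference local cost."""
    n, m = len(a), len(b)
    INF = float("inf")
    # d[i][j] = DTW cost of aligning a[:i] with b[:j]
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: match, insertion, deletion
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

Unlike the Euclidean distance, DTW lets a point map to several points of the other series, so time-shifted but similar-shaped load profiles score as close; `dtw_distance([0, 0, 1], [0, 1])` is 0.0 even though the sequences differ pointwise.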
Diagnosing data or detecting objects in medical images is an important part of image segmentation, especially for features that are difficult to identify in MRI, such as low-grade tumors or cerebral spinal fluid (CSF) leaks in the brain. The aim of the study is to address the problems of detecting low-grade tumors and CSF in the brain, which is difficult in magnetic resonance imaging (MRI) images, and to improve the efficiency and execution time of medical image segmentation. Tumor and CSF segmentation uses trained light field database (LFD) datasets of MRI images. This research proposes a new framework, a hybrid k-Nearest Neighbors (k-NN) model, that combines a hybridization of Graph Cut and Support Vector Machine (GCSVM) with a Hidden Markov Model of the k-Means Clustering Algorithm (HMMkC). Four different methods are used in this research: (1) SVM, (2) GrabCut segmentation, (3) HMM, and (4) the k-means clustering algorithm. In phase one, the framework performs classification with SVM and the Graph Cut algorithm to create the maximum margin distance; GrabCut segmentation, an application of the graph cut algorithm, is used to extract the data with the help of the scale-invariant feature transform. In phase two, the framework segments the low-grade tumors and CSF using a method adapted from HMMkC and extracts the tumor or CSF information by GCHMkC, including iterative conditional maximizing mode (ICMM) with identification of the distance range. A comparative evaluation against existing techniques is also performed. In conclusion, the proposed model gives better results than existing techniques, and it helps both laypeople and doctors identify brain conditions more easily. In future, the model will be applied to other brain-related diseases.
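The k-NN component of the hybrid framework above can be sketched in its plainest form, a majority vote among the nearest labelled points. The data, labels, and function name below are a made-up toy, not the cited framework:

```python
from collections import Counter

def knn_predict(train, labels, point, k=3):
    """Plain k-nearest-neighbours classification: label `point` with
    the majority class among its k closest training points, using
    squared Euclidean distance (monotone in the true distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, point)), lab)
        for row, lab in zip(train, labels)
    )
    votes = Counter(lab for _, lab in dists[:k])
    return votes.most_common(1)[0][0]

# toy 2-D "pixel features" with two classes
train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["background", "background", "background",
          "lesion", "lesion", "lesion"]
```

For example, `knn_predict(train, labels, (0.5, 0.5))` returns `"background"` because all three nearest neighbours carry that label.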
Low-grade tumors and CSF fluid in the brain, and their symptoms, usually require image segmentation for their detection in brain images. This research uses a systematic literature review (SLR) process to analyse the different segmentation approaches for detecting low-grade tumors and the presence of CSF fluid in the brain. The work investigates how to evaluate and detect tumors and CSF fluid; how to improve segmentation for tumor detection through graph cut hidden Markov model of k-means clustering algorithm (GCHMkC) techniques and parameters; how to extract missing values in the k-NN algorithm through the correlation matrix of a hybrid k-NN algorithm with time lag and discrete Fourier transformation (DFT) techniques and parameters; and how to convert non-linear data into a linear transformation using LE-LPP and time-complexity techniques and parameters.
Building performance simulation has the potential to quantitatively evaluate design alternatives and various energy conservation measures for retrofit projects. However, before design strategies can be evaluated, accurate modeling of existing conditions is crucial. This paper extends current model calibration practice by presenting a probabilistic method for estimating uncertain parameters in HVAC systems for whole building energy modeling. Using Markov Chain Monte Carlo (MCMC) methods, probabilistic estimates of the parameters in two HVAC models were generated for use in EnergyPlus. Demonstrated through a case study, the proposed methodology provides predictions that more accurately match observed data than base case models that are developed using default values, typical assumptions and rules of thumb.
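The MCMC idea behind such probabilistic parameter estimates can be sketched with a random-walk Metropolis sampler on a deliberately tiny problem: inferring the mean of normal observations with known spread and a flat prior. The function, data, and tuning constants below are illustrative assumptions, not the paper's EnergyPlus calibration:

```python
import math
import random

def metropolis_mean(data, sigma=1.0, steps=5000, prop=0.5, seed=0):
    """Random-walk Metropolis sampler for the mean of normal data with
    known sigma and a flat prior: propose a Gaussian step, accept with
    probability min(1, likelihood ratio), and record the chain."""
    rng = random.Random(seed)

    def log_lik(mu):
        return -sum((x - mu) ** 2 for x in data) / (2 * sigma ** 2)

    mu = 0.0  # arbitrary starting point for the chain
    samples = []
    for _ in range(steps):
        cand = mu + rng.gauss(0, prop)
        # accept/reject on the log scale to avoid underflow
        if math.log(rng.random()) < log_lik(cand) - log_lik(mu):
            mu = cand
        samples.append(mu)
    return samples

data = [2.1, 1.9, 2.3, 2.0, 1.8]
post = metropolis_mean(data)
# discard burn-in, then summarise the posterior by its chain average
est = sum(post[1000:]) / len(post[1000:])
```

The chain average lands near the sample mean (2.02 here), but unlike a point estimate the retained samples also quantify the parameter's uncertainty, which is the point of calibrating building models this way.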
A significant portion of the energy consumption in post-secondary educational institutes is for lighting classrooms. The occupancy patterns in post-secondary educational institutes are not stable and predictable, and thus alternative solutions may be required to match energy consumption and occupancy in order to increase energy efficiency. In this paper, we report experimental research on quantifying and understanding lighting energy waste patterns in a post-secondary educational institute. Data has been collected over a full academic year in three typical classrooms. Data association mining, a powerful data mining tool, is applied to the data in order to extract association rules and explore lighting waste patterns. The simulation results show that if the waste patterns are avoided, significant savings, as high as 70% of the current energy use, are achievable.
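Association rule mining of the kind described can be sketched as a brute-force pass over itemset supports, keeping rules A → B whose support and confidence clear chosen thresholds. The transaction data and thresholds below are a made-up toy, not the paper's classroom dataset:

```python
from itertools import combinations

def association_rules(transactions, min_support=0.5, min_confidence=0.8):
    """Brute-force rule miner: enumerate itemsets, compute support as
    the fraction of transactions containing the itemset, and keep
    rules antecedent -> consequent meeting both thresholds."""
    n = len(transactions)
    sets = [frozenset(t) for t in transactions]
    items = sorted({i for t in sets for i in t})

    def support(itemset):
        return sum(1 for t in sets if itemset <= t) / n

    rules = []
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            full = frozenset(combo)
            s = support(full)
            if s < min_support:
                continue
            for k in range(1, size):
                for ante in combinations(combo, k):
                    a = frozenset(ante)
                    conf = s / support(a)  # P(full | antecedent)
                    if conf >= min_confidence:
                        rules.append((set(a), set(full - a), round(conf, 2)))
    return rules

# toy sensor logs: lights left on while the room is unoccupied
logs = [{"lights_on", "unoccupied"},
        {"lights_on", "unoccupied"},
        {"lights_on", "occupied"},
        {"lights_on", "unoccupied"}]
rules = association_rules(logs)
```

On this toy log the single surviving rule is {unoccupied} → {lights_on} with confidence 1.0, exactly the kind of waste pattern the study extracts at scale (real miners use Apriori-style pruning rather than full enumeration).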
Motivation: Gene expression microarray experiments can generate data sets with multiple missing expression values. Unfortunately, many algorithms for gene expression analysis require a complete matrix of gene array values as input. For example, methods such as hierarchical clustering and K-means clustering are not robust to missing data, and may lose effectiveness even with a few missing values. Methods for imputing missing data are needed, therefore, to minimize the effect of incomplete data sets on analyses, and to increase the range of data sets to which these algorithms can be applied. In this report, we investigate automated methods for estimating missing data. Results: We present a comparative study of several methods for the estimation of missing values in gene microarray data. We implemented and evaluated three methods: a Singular Value Decomposition (SVD) based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average. We evaluated the methods using a variety of parameter settings and over different real data sets, and assessed the robustness of the imputation methods to the amount of missing data over the range of 1–20% missing values. We show that KNNimpute appears to provide a more robust and sensitive method for missing value estimation than SVDimpute, and both SVDimpute and KNNimpute surpass the commonly used row average method (as well as filling missing values with zeros). We report results of the comparative experiments and provide recommendations and tools for accurate estimation of missing microarray data under a variety of conditions.
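The weighted K-nearest-neighbour imputation compared here can be sketched for rows of readings with `None` marking a gap: find the rows most similar on the shared observed columns, then fill the gap with their inverse-distance-weighted average. The function and data below are an illustrative toy, not the KNNimpute implementation:

```python
def knn_impute(matrix, row, col, k=2):
    """Fill matrix[row][col] (None = missing) with the inverse-distance
    weighted average of that column across the k rows most similar to
    `row`, where similarity is RMS distance over columns both rows
    actually observed."""
    target = matrix[row]
    candidates = []
    for r, other in enumerate(matrix):
        if r == row or other[col] is None:
            continue
        # compare only on columns observed in both rows
        shared = [(a, b) for a, b in zip(target, other)
                  if a is not None and b is not None]
        if not shared:
            continue
        dist = (sum((a - b) ** 2 for a, b in shared) / len(shared)) ** 0.5
        candidates.append((dist, other[col]))
    candidates.sort(key=lambda p: p[0])
    nearest = candidates[:k]
    # closer neighbours get larger weight; epsilon guards exact matches
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)

# toy sensor rows; row 3 is missing its middle reading
data = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.1],
    [10.0, 20.0, 30.0],
    [1.0, None, 3.0],
]
filled = knn_impute(data, row=3, col=1, k=2)
```

Here the nearest neighbour matches row 3 exactly on the observed columns, so the imputed value is pulled almost entirely to its reading of 2.0, while the dissimilar third row contributes nothing.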
Building energy consumption makes up 40% of the total energy consumption in the United States. Given that energy consumption in buildings is influenced by aspects of urban form such as density and floor-area-ratios (FAR), understanding the distribution of energy intensities is critical for city planners. This paper presents a novel technique for estimating commercial building energy consumption from a small number of building features by training machine learning models on national data from the Commercial Buildings Energy Consumption Survey (CBECS). Our results show that gradient boosting regression models perform the best at predicting commercial building energy consumption, and can make predictions that are on average within a factor of 2 of the true energy consumption values (with an r2 score of 0.82). We validate our models using the New York City Local Law 84 energy consumption dataset, then apply them to the city of Atlanta to create aggregate energy consumption estimates. In general, the models developed only depend on five commonly accessible building and climate features, and can therefore be applied to diverse metropolitan areas in the United States and to other countries through replication of our methodology.
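Gradient boosting regression of the kind used above can be sketched in miniature: repeatedly fit a weak learner (here a one-feature stump) to the current residuals and add a damped copy to the ensemble. The data, function names, and single-feature restriction below are illustrative assumptions, not the paper's CBECS models:

```python
def fit_stump(x, residuals):
    """Best single-split regression stump on one feature (minimise SSE):
    returns (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    return best[1:]

def gradient_boost(x, y, rounds=50, lr=0.3):
    """Gradient boosting for squared loss: start from the mean, then at
    each round fit a stump to the residuals and shrink it by `lr`."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        t, lm, rm = fit_stump(x, resid)
        stumps.append((t, lm, rm))
        pred = [p + lr * (lm if xi <= t else rm)
                for p, xi in zip(pred, x)]

    def predict(xi):
        return base + sum(lr * (lm if xi <= t else rm)
                          for t, lm, rm in stumps)
    return predict

# toy: energy use roughly doubles above a floor-area threshold
x = [1, 2, 3, 4, 5, 6]
y = [10, 10, 10, 20, 20, 20]
model = gradient_boost(x, y)
```

Each round shrinks the remaining residual by the learning-rate factor, so after 50 rounds the ensemble reproduces the step in the toy data almost exactly; libraries add depth-limited trees, subsampling, and regularisation on top of this loop.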
Retrofitting existing buildings is urgent given the increasing need to improve the energy efficiency of the existing building stock. This paper presents a scalable, probabilistic methodology that can support large scale investments in energy retrofit of buildings while accounting for uncertainty. The methodology is based on Bayesian calibration of normative energy models. Based on CEN-ISO standards, normative energy models are light-weight, quasi-steady state formulations of heat balance equations, which makes them appropriate for modeling large sets of buildings efficiently. Calibration of these models enables improved representation of the actual buildings and quantification of uncertainties associated with model parameters. In addition, the calibrated models can incorporate additional uncertainties coming from retrofit interventions to generate probabilistic predictions of retrofit performance. Probabilistic outputs can be straightforwardly translated to quantify risks of under-performance associated with retrofit interventions. A case study demonstrates that the proposed methodology with the use of normative models can correctly evaluate energy retrofit options and support risk conscious decision-making by explicitly inspecting risks associated with each retrofit option.
Contents: Introduction; Assumptions; EM and Inference by Data Augmentation; Methods for Normal Data; More on the Normal Model; Methods for Categorical Data; Loglinear Models; Methods for Mixed Data; Further Topics; Appendices; References; Index.
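The model-based treatment of missing normal data covered in this book can be illustrated with a toy iterative scheme: fill each missing y with the regression E[y|x] implied by the current bivariate moment estimates, re-estimate, and repeat. Strictly this is iterated regression imputation (it omits full EM's conditional-variance correction in the M-step); the function and data below are illustrative assumptions:

```python
def em_impute(pairs, iters=50):
    """Iteratively fill missing y-values (None) in (x, y) pairs using
    the regression E[y|x] = my + (cov/varx) * (x - mx) implied by the
    current moment estimates over the completed data. A simplified,
    EM-style scheme without the conditional-variance correction."""
    xs = [x for x, _ in pairs]
    ys = [y if y is not None else 0.0 for _, y in pairs]
    missing = [i for i, (_, y) in enumerate(pairs) if y is None]
    n = len(pairs)
    for _ in range(iters):
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
        varx = sum((x - mx) ** 2 for x in xs) / n
        beta = cov / varx
        # E-step analogue: replace gaps with the conditional mean
        for i in missing:
            ys[i] = my + beta * (xs[i] - mx)
    return ys

# toy sensor pairs, y is roughly 2x; the fourth y is missing
pairs = [(1, 2.0), (2, 4.1), (3, 5.9), (4, None), (5, 10.0)]
filled = em_impute(pairs)
```

Starting from a crude fill of 0.0, the imputed value converges geometrically to the fixed point of the regression (close to 8 here), since a point lying exactly on the fitted line no longer moves the fit.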