Citation: Duarte, O.G.; Rosero, J.A.; Pegalajar, M.d.C. Data Preparation and Visualization of
Electricity Consumption for Load Profiling. Energies 2022, 15, 7557.
https://doi.org/10.3390/en15207557
Academic Editor: Abu-Siada Ahmed
Received: 7 September 2022
Accepted: 8 October 2022
Published: 13 October 2022
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open
access article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (https://creativecommons.org/licenses/by/4.0/).
energies
Article
Data Preparation and Visualization of Electricity Consumption
for Load Profiling
Oscar G. Duarte 1,*,†, Javier A. Rosero 1,† and María del Carmen Pegalajar 2,†
1 Facultad de Ingeniería, Universidad Nacional de Colombia, Bogotá 111321, Colombia
2 Escuela Técnica Superior de Ingenierías Informática y de Telecomunicaciones, Universidad de Granada,
18014 Granada, Spain
*Correspondence: ogduartev@unal.edu.co; Tel.: +57-6013165180
† These authors contributed equally to this work.
Abstract: The construction of daily electricity consumption profiles is a common practice for user
characterization and segmentation tasks. As in any data analysis project, a data preparation stage
is necessary to obtain these load profiles. This article explores to what extent the selection of
the data preparation technique impacts load profiling. The techniques discussed are used in the
following tasks: standardization, construction of data, dimensionality reduction and data enrichment.
The analysis reveals that data preparation has a strong influence on the result, and it identifies
the need to make the data preparation process explicit in each report. In particular, it highlights
that the most usual default standardization process, column standardization, is not adequate for the
preparation of energy consumption profiles.
Keywords: energy profiling; data preparation; data visualization; enrichment of energy data
1. Introduction
The study of electrical energy consumption is a prolific field. The need to predict
demand has been its main driver throughout the history of power systems. However,
in recent years other applications have appeared that require new ways of analyzing
consumption. The clearest example is the Demand Management sector, where a better
understanding of users' behaviors is necessary in order to design strategies that
modify those behaviors.
This fact, together with the availability of more precise and frequent measurements of
consumption, has promoted the development of research and applications that aim to
characterize users' consumption, either for prediction or for classification. In
the near future, in those Smart Grids that use the Internet of Things paradigm, it is expected
that more sophisticated characterization tasks will be needed; it will be possible to offer demand
response programs not just to users but to sets of users at an appliance level [1].
Consumptions can be analyzed as data series on different time scales. Based on the
sampling scale of the available measurements, measurements are sometimes grouped
on hourly, daily, weekly, monthly or annual scales, depending on the type of analysis
to be carried out. However, when we want to understand the behaviors that explain
consumption, it is useful to get a daily visualization of such consumption. In other words,
it is useful to study the variability of consumption throughout the 24 h of the day. To do
this, the consumption of the same day is organized in load curves. Each curve is an ordered
series of pairs of data (Time, Consumption) in a day.
From a procedural point of view, we start from a measurement record of the type
[DATE, HOUR, POWER]
that must be cleaned and processed to generate a data table such
as Table 1, in which each row is a load curve. It is usual to use a time resolution of one hour,
with 24 slots, although it is not mandatory. Table 1 can be represented by the array $X$:
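$$
X =
\begin{bmatrix}
X_{1,1} & X_{1,2} & \cdots & X_{1,m} \\
X_{2,1} & X_{2,2} & \cdots & X_{2,m} \\
\vdots & \vdots & \ddots & \vdots \\
X_{n,1} & X_{n,2} & \cdots & X_{n,m}
\end{bmatrix}
\tag{1}
$$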
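Table 1. Measurements organized by days and hours.

| DAY \ HOUR | 00 | 01 | 02 | ··· | 22 | 23 |
|---|---|---|---|---|---|---|
| 1 | X(1, 00) | X(1, 01) | X(1, 02) | ··· | X(1, 22) | X(1, 23) |
| 2 | X(2, 00) | X(2, 01) | X(2, 02) | ··· | X(2, 22) | X(2, 23) |
| ⋮ | ⋮ | ⋮ | ⋮ | | ⋮ | ⋮ |
| n | X(n, 00) | X(n, 01) | X(n, 02) | ··· | X(n, 22) | X(n, 23) |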
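
As an illustration of the step from [DATE, HOUR, POWER] records to a table like Table 1, the following is a minimal sketch using pandas. The column names, the sample values and the helper `build_load_matrix` are assumptions made for this example; cleaning and imputation are not covered here.

```python
import pandas as pd

# Hypothetical hourly records in the [DATE, HOUR, POWER] layout (made-up values).
records = pd.DataFrame({
    "DATE": ["2016-01-01"] * 3 + ["2016-01-02"] * 3,
    "HOUR": [0, 1, 2, 0, 1, 2],
    "POWER": [910.0, 905.0, 900.0, 950.0, 940.0, 930.0],
})

def build_load_matrix(df):
    """Return a table with one row per day and one column per hour."""
    # pivot places each POWER reading at (DATE, HOUR); missing hours become NaN.
    return df.pivot(index="DATE", columns="HOUR", values="POWER").sort_index()

load_table = build_load_matrix(records)
X = load_table.to_numpy()  # array with n rows (days) and m columns (time slots)
print(load_table)
```

With real data the table would have 24 columns; any NaN left by missing readings would still have to be handled in the cleaning stage.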
Load profiling can be understood as the search for similarities in a set of load curves. If
the curves come from different users, load profiling is the search for groups of consumers
with similar energy consumption patterns (similar load curves). If the curves come from
the same user, load profiling is the search for groups of days in which the user has similar
energy consumption patterns.
As in any data analysis project, to obtain the load profiles a stage of data preparation
is necessary. Some of the usual tasks at this stage are record selection, data cleaning,
missing data imputation, outlier detection, data standardization, data construction, and
enrichment with other sources [2]. Data preparation and data visualization are two tasks
that go hand in hand with one another. Visualization guides preparation and preparation
allows visualization. However, most papers do not report the details of the data preparation
stage, while subsequent stages are often well documented. In addition, there is a lack of
information in the literature regarding the analysis of the effect that the application of one
or another form of preprocessing can have on the final results.
This article addresses precisely that issue. Our research question has been stated as: to
what extent does the selection of the data preparation technique impact load profiling?
2. Literature Review
Load profiling was defined in [3] by the International Energy Agency (IEA) as “the
study of the consumption habits of consumers to estimate the amount of power they use at
various times of the day and for which they are billed”. It has been a key instrument in the
technical and economic analysis of power systems, distribution systems, electricity markets
and demand response programs. However, this definition is not of practical use today, due
to the changes that have occurred in the last two decades. In fact, as stated in [4], we
do not need to estimate the amount of power because we can measure it, and consumers do not
just use energy: some of them also produce and store it. Another interesting change is
the appearance of small local energy markets in which the diversity of consumption habits
is quite different from that in large markets. According to [4], load profiling should follow nine
principles that guide the procedure; some of them are related to the preparation of data
and the others to the modeling itself.
To get an idea of the importance of load profiling, consider the eDream project of the
European Commission (eDream: enabling new Demand REsponse Advanced, Market oriented
and Secure technologies, solutions and business models). In [5], the authors describe
“the techniques and methodologies for extracting load and generation profiles of prosumers
and for dividing the prosumers portfolio in clusters”. This document is one of the deliverables
of the project, because load profiling is conceived as a key component to enable
effective demand response programs, a necessary condition to enable new energy markets.
A recent example of load profiling is found in [6], in which the consumption profiles of
university buildings are studied by applying a wavelet decomposition to differentiate
high and low frequency variations. In [7], the average load profiles for different months
and years are analyzed, as a visualization strategy of the long-term dynamics of demand
in Spain. Ref. [8] proposes a strategy that combines clustering by k-means and feature
extraction by random forest to classify the consumption of residential users. Load profiling is
not just useful to analyze energy consumption, but also to study the grid itself; for example,
in [9], the performance of a distributed energy management system is evaluated by comparing
the load profile at some points of the network with and without the system.
Some works use synthetic load profiles (i.e., not obtained from real data); in [10], comparative
measures are proposed that are expected to indicate the representativeness or
similarity between synthetic and measured electricity load profile data.
A systematic review of data preprocessing methods used in load profiling can be
found in [11]. It is based on documents published between 2010 and 2021 and available
in the IEEE Xplore and ScienceDirect databases. The authors found that previous reviews
focused on techniques used in collecting and applying smart meter data, but not on data
preprocessing (for example, [12,13]). Moreover, they found that many technical works
in the literature are silent about the techniques used in critical tasks such as data cleaning. They
decided to study three types of preprocessing tasks: (1) missing data treatment, (2) outlier
detection and (3) data normalization.
Only a few studies make explicit reference to data preparation. For example, in [14], an
empirical study is carried out on the effect of data preparation on wind energy forecasting;
specifically, the effect of including or not including the available data that is outside the
range of use of the searched model is studied, and it is concluded that the decision strongly
affects the performance of the obtained model. In [15], a procedure for preparing data
from municipal buildings is proposed; the main purpose is to make comparable the data
from buildings with different conditions of thermal suitability and meters that record
monthly consumption of electricity, natural gas and heat, among others. Ref. [16] proposes a
consumption prediction method that combines a data preparation stage with the use of
the Long Short-Term Memory (LSTM) technique; as part of the data preparation process,
features are standardized using a MaxMin scaler. The data are treated as time series data,
without building daily consumption profiles. In [17], the effect of temperature and relative
humidity on the peak load is analyzed, as well as the effect of some disturbances on load
profiling; in the data preparation stage, the data of the public holidays are removed from
the dataset.
It is well known that the selection of the data preparation techniques may impact
the results in any data analysis project [18–20]. However, there is a gap in the literature
regarding its impact on load profiling.
3. Methodology
To address our research question, we consider four data preparation tasks:
• standardization;
• construction of data;
• dimensionality reduction;
• data enrichment.
Each task can be performed with different techniques. We organized our experiments
in two stages: in stage 1, we conduct similar experiments in order to compare the
techniques of the same task; in stage 2, we compare all the techniques of all the tasks in a
single experiment.
The experiments in stage 1 follow the same procedure. Consider a specific task $T$ that
can be conducted using $p$ different techniques over the raw dataset $D$; in such conditions,
the steps in our experiments are the following:
1. From the raw dataset $D$, the daily load profiles are constructed.
2. We execute task $T$ with all the techniques and obtain new datasets $D_1, D_2, \cdots, D_p$.
3. We classify the data of every dataset into three categories, using the k-means algorithm ($k = 3$).
4. We compare and visualize the classifications and discuss the results.
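
The steps above can be sketched as follows. This is only an illustrative outline under stated assumptions: `kmeans` is a plain NumPy version of Lloyd's algorithm written for this example (the text does not say which k-means implementation was used), `run_stage1` and the two placeholder techniques are hypothetical names, and the profiles are random toy data.

```python
import numpy as np

def kmeans(data, k=3, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each row to the nearest center (squared Euclidean distance).
        dist = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Recompute the centers; keep the old center if a cluster is empty.
        new_centers = np.array([
            data[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def run_stage1(profiles, techniques, k=3):
    """Steps 2 and 3: apply every technique and cluster each resulting dataset."""
    return {name: kmeans(prepare(profiles), k=k)
            for name, prepare in techniques.items()}

# Toy stand-in for the daily load profiles of step 1: 30 days x 24 hours.
rng = np.random.default_rng(1)
profiles = rng.uniform(500.0, 2000.0, size=(30, 24))

# Two placeholder techniques for some task T (hypothetical choices).
techniques = {
    "identity": lambda X: X,
    "matrix_minmax": lambda X: (X - X.min()) / (X.max() - X.min()),
}
labels_by_technique = run_stage1(profiles, techniques)
```

Step 4 then compares the label vectors stored in `labels_by_technique`, one per technique.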
We use a classification problem to compare the techniques because load profiling is itself a
problem of classifying load curves. We use the k-means algorithm due to its popularity
and simplicity. In total, five experiments have been developed in stage 1, because the data
enrichment task was analyzed in two scenarios.
In stage 2 we conduct an experiment using all the classifications obtained in stage 1. We
estimate the possible changes in the income of a utility when implementing a fee scheme of
the Time Of Use (TOU) type. As the estimation is based on the classification, this experiment
allows us to see the effect of data preparation on this specific application.
The rest of this article is organized as follows: Sections 4–7 are dedicated to stage 1,
Section 8 to stage 2 and Section 9 to the conclusions. All the experiments have been
conducted using the standard Python libraries for data processing and visualization
(pandas, numpy, matplotlib).
Dataset
In this paper we have used real measurements of the energy consumption of a university
campus (Campus Bogotá, Universidad Nacional de Colombia). The dataset contains the
hourly electricity consumption of the whole campus for 779 days between 2016 and 2019.
Figure 1 shows the average profile (average power consumption for every hour) and the
corresponding box plot.
From the average profile we notice a low and almost flat behavior in the first 6 h, and
a concave profile with a peak around midday. From the box plot we notice that the dispersion
changes throughout the day and is also very high around midday.
[Figure: two panels titled "Average profile of original data" (average profile and hourly box plot); x-axis: Hour; y-axis: Consumption.]
Figure 1. Average profile and corresponding box plots in kWh. Blue: mean. Orange: median.
4. Task 1. Standardization
Standardization is a process that transforms the original numeric data into equivalent
data within a new interval. It is a change of scale, usually conducted in order to make
two or more variables numerically comparable. The most popular techniques use affine
transformations over every single variable, with different effects:
• Standard scaling: the new dataset has mean = 0.0 and variance = 1.0
• MaxMin scaling: the new dataset has min = 0.0 and max = 1.0
However, other techniques may be applied over the matrix $X$ of Equation (1). In this
paper we use the following options, based on the MaxMin scaler:
1. Matrix standardization: here we choose the extreme values of the whole matrix $X$ to
define the scaler. Every term $X_{i,j}$ in $X$ is transformed into:
$$
X^{m}_{i,j} = \frac{X_{i,j} - mn}{mx - mn}, \qquad mn = \min_{i,j}(X_{i,j}), \qquad mx = \max_{i,j}(X_{i,j})
\tag{2}
$$
2. Standardization by columns: every column has an independent standardization. For
every column $j$ we choose its own extreme values, taken over all the rows $i$, to define the
scaler. Every term $X_{i,j}$ in $X$ is transformed into:
$$
X^{c}_{i,j} = \frac{X_{i,j} - mn_j}{mx_j - mn_j}, \qquad mn_j = \min_{i}(X_{i,j}), \qquad mx_j = \max_{i}(X_{i,j})
\tag{3}
$$
3. Standardization by rows: recall that every row of $X$ holds the energy consumption of
every hour of one day. Therefore, the sum of a row is the daily consumption
and we can obtain the fraction of the daily consumption for every hour:
$$
\tilde{X}_{i,j} = \frac{X_{i,j}}{S_i}, \qquad S_i = \sum_{j=1}^{m} X_{i,j}
\tag{4}
$$
A more natural standardization can be obtained if we compare that fraction with the
average consumption, in other words, with the fraction of a flat profile:
$$
\tilde{X}^{r}_{i,j} = \frac{X_{i,j}/S_i}{1/m}, \qquad \tilde{X}^{r}_{i,j} = \frac{X_{i,j}}{\bar{X}_i}, \qquad \bar{X}_i = \frac{1}{m}\sum_{j=1}^{m} X_{i,j}
\tag{5}
$$
Values of $\tilde{X}^{r}_{i,j}$ lie within the interval $[0, m]$. It is very unusual to reach the maximum
value $m$, because it is only possible for a singleton profile, that is, a profile with all the
consumptions equal to zero except in one hour. Although this is not impossible, it is more
usual that $\tilde{X}^{r}_{i,j}$ is around 1, the value of a flat profile.
In order to get a new set of values within the interval $[0, 1]$, an additional standardization
by columns is done:
$$
X^{r}_{i,j} = \frac{\tilde{X}^{r}_{i,j} - mn_j}{mx_j - mn_j}, \qquad mn_j = \min_{i}(\tilde{X}^{r}_{i,j}), \qquad mx_j = \max_{i}(\tilde{X}^{r}_{i,j})
\tag{6}
$$
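4. Extended standardization by rows: the standardization by rows allows the comparison
of the shapes of the profiles. However, the information on the daily consumption is lost.
If we need to keep this information, in order to compare low and high consumptions,
we can add one column to $X^{r}$ built from the column vector $\bar{X}$, which contains the mean
daily consumption, in its standardized version $\bar{X}^{rx}$:
$$
\bar{X} =
\begin{bmatrix} \bar{X}_1 \\ \bar{X}_2 \\ \vdots \\ \bar{X}_n \end{bmatrix},
\qquad
\bar{X}^{rx} =
\begin{bmatrix} \bar{X}^{rx}_1 \\ \bar{X}^{rx}_2 \\ \vdots \\ \bar{X}^{rx}_n \end{bmatrix}
\tag{7}
$$
where
$$
\bar{X}_i = \frac{1}{m}\sum_{j=1}^{m} X_{i,j}, \qquad
\bar{X}^{rx}_i = \frac{\bar{X}_i - mn}{mx - mn}, \qquad
mn = \min_{i}(\bar{X}_i), \qquad mx = \max_{i}(\bar{X}_i)
\tag{8}
$$
The new matrix being:
$$
X^{rx} = \begin{bmatrix} X^{r} & \bar{X}^{rx} \end{bmatrix}
\tag{9}
$$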
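
The four options can be sketched with NumPy as follows. This is a minimal illustration, not the code used in the experiments: the function names and the toy matrix are assumptions, the extended variant assumes that the appended column is the min-max scaled daily mean, and the helper `minmax` assumes that no scaled range is constant (otherwise it divides by zero).

```python
import numpy as np

def minmax(a, axis=None):
    """Min-max scaling to [0, 1] over the whole array or along one axis."""
    lo = a.min(axis=axis, keepdims=True)
    hi = a.max(axis=axis, keepdims=True)
    return (a - lo) / (hi - lo)  # assumes hi > lo everywhere

def standardize_matrix(X):
    return minmax(X)                               # extremes of the whole matrix

def standardize_columns(X):
    return minmax(X, axis=0)                       # extremes of each column

def standardize_rows(X):
    ratio = X / X.mean(axis=1, keepdims=True)      # each value over its daily mean
    return minmax(ratio, axis=0)                   # then scaled column by column

def standardize_rows_extended(X):
    daily_mean = X.mean(axis=1, keepdims=True)
    extra = minmax(daily_mean, axis=0)             # scaled daily mean, one column
    return np.hstack([standardize_rows(X), extra])

# Toy matrix: 4 days x 4 slots; the last row is twice the first (same shape).
X = np.array([[1.0, 2.0, 3.0, 6.0],
              [8.0, 4.0, 2.0, 2.0],
              [4.0, 4.0, 4.0, 4.0],
              [2.0, 4.0, 6.0, 12.0]])

Xm = standardize_matrix(X)
Xc = standardize_columns(X)
Xr = standardize_rows(X)
Xrx = standardize_rows_extended(X)
```

In this toy case the first and last rows of `Xr` are identical, because the two days have the same shape, while the extra column of `Xrx` separates them by their level of consumption.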
4.1. Experiment 1
We have applied the standardizations explained in Section 4 to our dataset. Figure 2
shows the average profiles of the resulting datasets obtained with each technique. As the
average profiles of the two standardizations by rows (extended and not extended) are the same,
we have plotted only one of them. It is clear that the techniques are not equivalent. To visualize
the effect of the additional column of the extended standardization by rows, we have plotted in
Figure 3 the histogram of the daily consumption, before and after standardization.
[Figure: "Average standardized profiles"; x-axis: Hour; y-axis: Consumption (0 to 1); series: Matrix, By columns, By rows and extended by rows.]
Figure 2. Average profiles of the standardized records.
[Figure: two histograms of daily consumption, one on the original scale (0 to 2000) and one on the standardized scale (0 to 1).]
Figure 3. Histogram of daily consumption, before and after the extended standardization by rows.
To illustrate the effect that standardizations can have on the classification process,
the k-means clustering algorithm (with k = 3) has been applied to each of the normalized
datasets. After performing the clustering, the prototype profiles of each cluster have been
built in the original space. The result is shown in Figure 4, where it is clear that each
standardization leads to a different result.
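
Building the prototypes in the original space can be sketched as follows, assuming that a prototype is the average of the original (unscaled) load curves assigned to a cluster. The function name, the toy profiles and the label vector are made up for illustration; in practice the labels would come from clustering the standardized data.

```python
import numpy as np

def cluster_prototypes(X_original, labels, k=3):
    """Mean load curve of each cluster, computed from the unscaled profiles."""
    # Assumes every cluster 0..k-1 has at least one member.
    return np.vstack([X_original[labels == c].mean(axis=0) for c in range(k)])

# Toy example: 4 days x 3 slots and labels obtained on some standardized copy.
X_original = np.array([[100.0, 200.0, 300.0],
                       [300.0, 400.0, 500.0],
                       [10.0, 20.0, 30.0],
                       [1000.0, 1000.0, 1000.0]])
labels = np.array([0, 0, 1, 2])

prototypes = cluster_prototypes(X_original, labels)
```

Each row of `prototypes` is then a curve in the original units and can be plotted against the hour of the day.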
[Figure: four panels titled Matrix, By columns, By rows and Extended by rows; x-axis: Hour; y-axis: Consumption.]
Figure 4. Average profiles for three different standardizations.
The effect is not only observed in the prototypes of the clusters, but also in the resulting
classification. To quantify this impact, we have calculated how many classifications coincide
in each standardization pair. With these values, the heatmap shown in Figure 5 has
been built.
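
One possible way to compute such a coincidence value is sketched below. The text does not state how cluster labels of two classifications were matched, so this example assumes that the labels of one classification are renamed in every possible way and the best agreement is kept; `coincidence` is a hypothetical helper.

```python
import numpy as np
from itertools import permutations

def coincidence(labels_a, labels_b, k=3):
    """Largest fraction of equal labels over all relabelings of labels_b."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    best = 0.0
    for perm in permutations(range(k)):
        relabeled = np.array(perm)[labels_b]  # cluster c of b is renamed perm[c]
        best = max(best, float(np.mean(relabeled == labels_a)))
    return best

# Toy label vectors for six days under two preparations (made-up values).
a = [0, 0, 1, 1, 2, 2]
b = [2, 2, 0, 0, 1, 0]
score = coincidence(a, b)
```

For k = 3 only six relabelings exist, so the exhaustive search is cheap; for larger k an assignment algorithm would be a more practical choice.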
| | Matrix | By columns | By rows | Extended by rows |
|---|---|---|---|---|
| Matrix | 1 | | | |
| By columns | 0.7 | 1 | | |
| By rows | 0.97 | 0.68 | 1 | |
| Extended by rows | 0.55 | 0.79 | 0.54 | 1 |

Figure 5. Heatmap of classification coincidences for different standardizations.
The effect on the optimal number of groups for each standardization has also been
explored. Figure 6 shows the elbow curves, rescaled to the interval [0, 1], for each case.
[Figure: "Elbow curve"; x-axis: Number of clusters (1 to 9); y-axis: Normalized score (0 to 1); series: Matrix, By columns, By rows, Extended by rows.]
Figure 6. Elbow curves for different standardizations.
4.2. Discussion
• Matrix standardization produces exactly the same classification results as applying
k-means to the original data. This makes sense, because that standardization maps
all the data with the same maximum and minimum values.
• Standardization by columns is problematic. By treating each column independently,
it breaks the link between adjacent consumptions on the same day and therefore
distorts the profiles. This is of special relevance because many machine learning
tools use this technique by default. In general, standardization routines are
implemented under the premise that each column of the dataset is a feature independent
of the others, so that independent standardizations make sense. This premise does not
hold for a matrix such as $X$ in Equation (1), because its columns are consumptions
of the same day.
• Standardization by rows preserves the shape of consumption profiles and is therefore
suitable for those applications where the goal is to identify the shape of the profiles.
• In the heatmap of Figure 5, a high coincidence is observed between the classifications
obtained with the matrix and by-row standardizations. This suggests that
both standardizations preserve the shape of the profiles very well.
• The effect on the prototypes of expanding the standardization by rows with the
information on the average consumption is very interesting. When comparing the three
profiles of each of the two standardizations by rows, it is observed that the profile
of lower consumption is better delineated, describing what may be a
fundamentally nocturnal consumption (typical of public lighting, for example).
• In Figure 6, it can be seen that the curve corresponding to the extended standardization
by rows lies below the others and its shape suggests that the optimal number of
groups is greater. This allows us to affirm that the extended standardization by
rows seems to reveal more differences between the profiles, which is to be expected,
since in addition to the shape of the profile it contains information on the volume
of consumption.
5. Task 2. Construction of Data
Construction of data is the computation of new features that replace the original
attributes [2]. This task is usually based on knowledge of the meaning of the original data
as well as of the new calculated data, and is guided by the application objectives. As some
applications of load profiling are related to the shape of the profiles, we propose here
to calculate numerical indicators that describe geometric properties of each of them and
build a new dataset with them. We use the following geometric indicators:
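$$
\begin{aligned}
\text{Sum:} \quad & S_i = \sum_{j=1}^{m} X_{i,j} \\
\text{Mean:} \quad & \bar{X}_i = \frac{1}{m} S_i \\
\text{Peak value:} \quad & mx_i = \max_{j}(X_{i,j}) \\
\text{Standard deviation:} \quad & \sigma_i = \sqrt{\frac{1}{m} \sum_{j=1}^{m} (X_{i,j} - \bar{X}_i)^2} \\
\text{Kurtosis:} \quad & g_i = \frac{1}{m \sigma_i^4} \sum_{j=1}^{m} (X_{i,j} - \bar{X}_i)^4
\end{aligned}
\tag{10}
$$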
When using these indicators it is important to take into account some considerations:
• Sum $S_i$ and mean $\bar{X}_i$ contain the same information and are therefore redundant.
• Peak value $mx_i$ is especially useful in certain applications where it is important to study
extreme behaviors; such is the case of network congestion analysis or transmission
line overheating studies.
• Standard deviation $\sigma_i$ and kurtosis $g_i$ are two different indicators of the shape.
$\sigma_i$ measures how different the profile is from that of a homogeneous (flat)
consumption, while $g_i$ measures how much the values are concentrated around the
same time.
• It is usual to use $(g_i - 3)$ instead of $g_i$, because the value of 3 corresponds to that of a
normal distribution. In this way, the sign of $(g_i - 3)$ tells us whether the profile is
more or less concentrated than that of a normal curve.
These are not the only possible indicators, of course. The number of peaks, the center
of gravity, and in general any shape parameter can also be used.
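
A minimal NumPy sketch of these indicators is given below. The function name and the toy profiles are assumptions for illustration. Kurtosis is computed here as the fourth standardized moment, the convention under which a normal distribution gives 3, and the standard deviation uses the 1/m form; a perfectly flat profile has zero deviation, so its kurtosis is undefined.

```python
import numpy as np

def geometric_indicators(X):
    """Sum, mean, peak, standard deviation and kurtosis of every row of X."""
    total = X.sum(axis=1)
    mean = X.mean(axis=1)
    peak = X.max(axis=1)
    centered = X - mean[:, None]
    std = np.sqrt((centered ** 2).mean(axis=1))      # population form (1/m)
    kurt = (centered ** 4).mean(axis=1) / std ** 4   # fourth standardized moment
    return {"SUM": total, "MEAN": mean, "PEAK": peak,
            "STD-DEV": std, "KURTOSIS": kurt}

# Toy profiles: one concentrated in a single slot, one alternating.
X = np.array([[0.0, 0.0, 4.0, 0.0],
              [1.0, 3.0, 1.0, 3.0]])
indicators = geometric_indicators(X)
```

Library routines may differ in convention: some return the excess kurtosis (the value minus 3) or apply bias corrections, which matters when comparing numerical values.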
5.1. Experiment 2
For each of the daily profiles, the following indicators have been computed: (a) Mean,
(b) Peak, (c) Standard deviation and (d) Kurtosis.
Figure 7 shows the histograms of the indicators. To explore possible correlations
between them, the paired point clouds are shown in Figure 8.
[Figure: four histograms titled MEAN, PEAK, STD-DEV and KURTOSIS, each on a horizontal scale from 0 to 1.]
Figure 7. Histograms of geometric indicators.
[Figure: 4 × 4 grid of point clouds, one for each pair of the indicators MEAN, PEAK, STD-DEV and KURTOSIS.]
Figure 8. Correlogram of geometric indicators.
The k-means algorithm has been applied with k = 3 for each indicator separately and
for each possible combination of 2, 3 or 4 indicators. The results are shown in Figure 9.
The heatmap of Figure 10 has also been constructed, which shows the coincidences in
the classification between each pair of cases. The elbow curve has also been constructed for
each of the cases studied, as shown in Figure 11.
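
The set of cases can be enumerated with the standard library, as in the short sketch below. The variable names are illustrative; each tuple lists the indicator columns that would form the feature matrix passed to k-means.

```python
from itertools import combinations

indicators = ["MEAN", "PEAK", "STD-DEV", "KURTOSIS"]

# Every non-empty subset: 4 single indicators, 6 pairs, 4 triples and 1 full set.
cases = [combo
         for size in range(1, len(indicators) + 1)
         for combo in combinations(indicators, size)]

for combo in cases:
    print("-".join(combo))
```

This yields 15 cases, from the single indicator MEAN to the full set of four.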
[Figure: 15 panels, one for each combination of the indicators MEAN, PEAK, STD-DEV and KURTOSIS, from the single indicators to the full set; x-axis: Hour; y-axis: Consumption.]
Figure 9. Cluster prototypes for each combination of geometric indicators. Each color represents a
single cluster.
| Case | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1. MEAN | 1 | | | | | | | | | | | | | | |
| 2. PEAK | 0.88 | 1 | | | | | | | | | | | | | |
| 3. STD-DEV | 0.85 | 0.94 | 1 | | | | | | | | | | | | |
| 4. KURTOSIS | 0.44 | 0.36 | 0.35 | 1 | | | | | | | | | | | |
| 5. MEAN-PEAK | 0.9 | 0.98 | 0.93 | 0.37 | 1 | | | | | | | | | | |
| 6. MEAN-STD-DEV | 0.95 | 0.93 | 0.9 | 0.4 | 0.95 | 1 | | | | | | | | | |
| 7. MEAN-KURTOSIS | 1 | 0.88 | 0.85 | 0.44 | 0.9 | 0.95 | 1 | | | | | | | | |
| 8. PEAK-STD-DEV | 0.88 | 0.99 | 0.94 | 0.36 | 0.97 | 0.92 | 0.88 | 1 | | | | | | | |
| 9. PEAK-KURTOSIS | 0.88 | 1 | 0.94 | 0.36 | 0.98 | 0.93 | 0.88 | 0.99 | 1 | | | | | | |
| 10. STD-DEV-KURTOSIS | 0.85 | 0.94 | 1 | 0.35 | 0.93 | 0.9 | 0.85 | 0.94 | 0.94 | 1 | | | | | |
| 11. MEAN-PEAK-STD-DEV | 0.9 | 0.99 | 0.94 | 0.37 | 0.99 | 0.95 | 0.9 | 0.98 | 0.99 | 0.94 | 1 | | | | |
| 12. MEAN-PEAK-KURTOSIS | 0.9 | 0.98 | 0.93 | 0.37 | 1 | 0.95 | 0.9 | 0.97 | 0.98 | 0.93 | 0.99 | 1 | | | |
| 13. MEAN-STD-DEV-KURTOSIS | 0.95 | 0.93 | 0.9 | 0.4 | 0.95 | 1 | 0.95 | 0.92 | 0.93 | 0.9 | 0.95 | 0.95 | 1 | | |
| 14. PEAK-STD-DEV-KURTOSIS | 0.88 | 0.99 | 0.94 | 0.36 | 0.97 | 0.92 | 0.88 | 1 | 0.99 | 0.94 | 0.98 | 0.97 | 0.92 | 1 | |
| 15. MEAN-PEAK-STD-DEV-KURTOSIS | 0.9 | 0.99 | 0.94 | 0.37 | 0.99 | 0.95 | 0.9 | 0.98 | 0.99 | 0.94 | 1 | 0.99 | 0.95 | 0.98 | 1 |

Figure 10. Heatmap of classification coincidences between clustering of each combination of geometric
indicators.
[Figure: "Elbow curve"; x-axis: Number of clusters (1 to 9); y-axis: Normalized score (0 to 1); one series for each of the 15 combinations of the indicators MEAN, PEAK, STD-DEV and KURTOSIS.]
Figure 11. Elbow curves for classifications with the combinations of geometric indicators.
5.2. Discussion
• The histograms in Figure 7 show that kurtosis is unimodal, while the other three indicators
are bimodal. This suggests that kurtosis is an inadequate geometric indicator
for the data studied. This statement is also supported by the heatmap in Figure 10,
where this indicator behaves very differently from the others.
Kurtosis and skewness have been used as indicators in the analysis of electricity
markets [21–23]. However, their power to describe load profile features remains
a research topic.
• The point clouds in Figure 8 show that hidden structures in the data can be discovered
through the geometric indicators. A pair of indicators, such as the mean and
the standard deviation, makes it possible to visualize data clusters.
• The point clouds also reveal that the shape of the clusters is not simple, which
suggests that the clustering method used (k-means with the Euclidean distance) is
not the most appropriate. In this exercise we have decided not to change the method,
to keep the focus on the preparation and visualization of the data and not on its
subsequent processing.
• Most of the prototype profiles shown in Figure 9 are similar to those in Figure 4, in
particular those of standardization by rows. The main exception corresponds to those
obtained with kurtosis.
• It is important to consider the advantage of applying processing techniques on a space
of dimension 2 or 3, instead of one of dimension 24. At least two major advantages
can be identified: (a) reduction of processing requirements and (b) reduction of
the amount of data required for a given statistical significance.
6. Task 3. Dimensionality Reduction (PCA)
Matrix $X$ in Equation (1) is a set of points in a space of dimension 24. The use of geometric
indicators makes it possible to analyze the profiles in spaces of much smaller dimension with
good results. In this section we apply the well-known technique of Principal Component
Analysis (PCA) with the same purpose of reducing the dimensionality of the problem.
There are other energy consumption problems in which PCA has been successfully used;
for example, to improve the performance of non-intrusive load monitoring methods [24].
6.1. Experiment 3
To explore the number of dimensions to which the original set could be reduced, we
have plotted in Figure 12 the fraction of the variance explained as a function of the number
of extracted components. It is observed that very few components (3, for example) manage
to explain a large amount of the variance of the original data set.
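
The fraction of explained variance can be obtained from a singular value decomposition, as in the sketch below. This is an illustration under stated assumptions rather than the code used in the experiment: the helper names are hypothetical, the profiles are synthetic (a common daily shape scaled by a random level plus noise), and the columns are centered but not rescaled before the decomposition.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of the total variance carried by each principal component."""
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

def project(X, n_components):
    """Coordinates of every profile on the first principal components."""
    centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Synthetic profiles: 50 days x 24 hours.
rng = np.random.default_rng(0)
level = rng.uniform(0.5, 1.5, size=(50, 1))
shape = np.sin(np.linspace(0.0, np.pi, 24))
X = 1000.0 * level * shape + rng.normal(0.0, 10.0, size=(50, 24))

ratio = explained_variance_ratio(X)
cumulative = np.cumsum(ratio)   # curve of the kind plotted against the components
scores = project(X, 3)          # reduced dataset that can be passed to k-means
```

In this synthetic case almost all the variance lies in the first component by construction; for real data the cumulative curve has to be inspected.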
[Figure: x-axis: Components (0 to 14); y-axis: Fraction of the variance explained (0 to 1).]
Figure 12. Fraction of the variance explained with PCA.
We have applied PCA and then k-means with sets of 1, 2, 3, 6, 9 and 12 components.
Figure 13 shows the profiles of the obtained prototypes (in the original space). The heatmap
of Figure 14 has been constructed with the coincidences in the classifications obtained.
[Figure: six panels titled 1, 2, 3, 6, 9 and 12 (number of components); x-axis: Hour; y-axis: Consumption.]
Figure 13. Cluster prototypes for PCA. Each color represents a single cluster.
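| Components | 1 | 2 | 3 | 6 | 9 | 12 |
|---|---|---|---|---|---|---|
| 1 | 1 | | | | | |
| 2 | 0.98 | 1 | | | | |
| 3 | 0.97 | 0.99 | 1 | | | |
| 6 | 0.97 | 0.98 | 0.99 | 1 | | |
| 9 | 0.97 | 0.98 | 0.99 | 1 | 1 | |
| 12 | 0.97 | 0.98 | 0.99 | 1 | 1 | 1 |

Figure 14. Heatmap of classification coincidences for PCA.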
6.2. Discussion
• Figures 13 and 14 show the potential of the PCA technique. A single dimension is
enough to achieve results comparable to those obtained in the previous sections.
• The fact that it is feasible to reduce the information of the 24 time slots to very few
dimensions gives mathematical meaning to the study of daily consumption profiles,
instead of treating them as independent readings.
7. Task 4. Data Enrichment
Data enrichment is the integration of data that is not in our dataset but in other sources.
It is usual for energy consumption measurements to be recorded with instruments that
simultaneously measure other physical variables. In this section, we enrich our dataset
with some of those variables and analyze the effect of incorporating them. To do this, we
consider two different scenarios, both in a three-wire, three-phase system:
• Scenario A: active, reactive and apparent power measurements are available.
• Scenario B: in addition to the above measurements, line current and line voltage
measurements are available.
7.1. Experiment 4. Scenario A
Denoting by P, Q and S the measured three-phase active, reactive and apparent power, it
is possible to calculate the three-phase power factor PF as:
$$PF = \frac{P}{S} = \frac{\sqrt{S^2 - Q^2}}{S}$$
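As a quick numerical check of the two expressions for the power factor, the sketch below evaluates both for one hypothetical reading. The numbers and the function name `power_factor` are assumptions for illustration; the second expression matches the first only when S² = P² + Q² holds for the measured values.

```python
import math

def power_factor(p, q, s):
    """Return the power factor computed as P/S and as sqrt(S^2 - Q^2)/S."""
    pf_direct = p / s
    # max(..., 0.0) guards against a tiny negative value caused by rounding
    pf_from_q = math.sqrt(max(s * s - q * q, 0.0)) / s
    return pf_direct, pf_from_q

# Hypothetical reading: P = 1200, Q = 150, with S consistent with both.
p, q = 1200.0, 150.0
s = math.hypot(p, q)
pf_direct, pf_from_q = power_factor(p, q, s)
print(round(pf_direct, 4), round(pf_from_q, 4))
```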
Figure 15 shows the mean profile and the box plot for each of the variables P, Q, S and
PF. The apparent power follows a profile very similar to that of the
active power (which is entirely expected in a system such as the one measured), while the
power factor and the reactive power behave very differently.
Figure 15. Average profile and box plots of the extended variables in Scenario A (panels: P, S, Q, PF). Blue: mean.
Orange: median.
To explore the information in the four variables, the histograms (Figure 16) and the
point clouds between each pair of variables (Figure 17) have been constructed. Because the
information has been expanded on the hourly readings, before constructing
Table 1, a color code is used in the point clouds to indicate the hour of the day to
which each point corresponds.
Figure 16. Histograms of the extended variables in Scenario A (panels: P, S, Q, PF).
Figure 17. Correlograms of the extended variables in Scenario A. Each color represents the hour of
the day.
By applying k-means to each variable separately, the profiles shown in Figure 18
have been obtained. To compare the effect of each of these variables, the labeling matches
for each pair of variables have been computed. The result is displayed in Table 2. Because
the order of the labels produced by k-means is random and each
workspace is different, what must be analyzed is how dispersed the data are in each matrix:
a matrix with many zeros indicates that the two classifications largely agree up to a relabeling.
Table 2. Coincidence table between classifications obtained with the extended variables in Scenario A. Each cell is the 3 × 3 matrix of label coincidences for the corresponding pair of variables.
                P              S              Q             PF
P     252   0   0 |   0   0 252 |  74   0 178 | 117   1 134
        0 313   0 | 313   0   0 |  91 221   1 |  34 103 176
        0   0 214 |   2 212   0 | 157  12  45 |  70   3 141
S       0 313   2 | 315   0   0 |  92 222   1 |  34 103 178
        0   0 212 |   0 212   0 | 156  11  45 |  70   3 139
      252   0   0 |   0   0 252 |  74   0 178 | 117   1 134
Q      74  91 157 |  92 156  74 | 322   0   0 |  59  57 206
        0 221  12 | 222  11   0 |   0 233   0 |  14  46 173
      178   1  45 |   1  45 178 |   0   0 224 | 148   4  72
PF    117  34  70 |  34  70 117 |  59  14 148 | 221   0   0
        1 103   3 | 103   3   1 |  57  46   4 |   0 107   0
      134 176 141 | 178 139 134 | 206 173  72 |   0   0 451
Figure 18. Prototype clusters obtained with the extended variables in Scenario A (panels: P, S, Q, PF; x-axis: hour; y-axis: consumption). Each color represents a single cluster.
Using a metric based on the number of zeros in each table, the information shown
in Table 3 has been constructed. This table shows that each of
the extended variables yields a different classification; in other words,
each of the variables contains information that allows a different classification. The most
similar variables are P and S, which makes sense since these are generally loads with a
high power factor.
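The exact formula of the zero-based metric is not given in the text. The sketch below builds a coincidence matrix from two labelings and scores it as the number of zero entries divided by k(k − 1), the number of zeros in a perfect one-to-one match. This normalization is an assumption: it was chosen because it reproduces the values 1.00, 0.17 and 0.00 reported in Table 5 for the corresponding matrices of Table 4. The function names are hypothetical.

```python
def coincidence_table(labels_a, labels_b, k):
    """k x k matrix whose entry [i][j] counts records labeled i by A and j by B."""
    table = [[0] * k for _ in range(k)]
    for a, b in zip(labels_a, labels_b):
        table[a][b] += 1
    return table

def zero_score(table):
    """Zeros in the table divided by k(k - 1); 1.0 means a one-to-one label match.
    Assumed normalization, not stated in the article."""
    k = len(table)
    zeros = sum(1 for row in table for value in row if value == 0)
    return zeros / (k * (k - 1))

# Matrices copied from Table 4 (Scenario B).
p_vs_p = [[248, 0, 0], [0, 315, 0], [0, 0, 216]]
p_vs_q = [[0, 175, 73], [221, 1, 93], [12, 48, 156]]
p_vs_dil = [[1, 63, 184], [173, 70, 72], [4, 71, 141]]

print(zero_score(p_vs_p), round(zero_score(p_vs_q), 2), zero_score(p_vs_dil))

# Building a table from two toy labelings.
toy = coincidence_table([0, 0, 1, 2, 2], [1, 1, 0, 2, 0], 3)
```

Counting zeros does not depend on the order of the labels, which is why the metric is insensitive to the random labeling produced by k-means.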
Table 3. Dispersion table in the coincidences. Scenario A.
7.2. Experiment 5. Scenario B
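Denoting
• $V_{AB}$, $V_{BC}$, $V_{CA}$ the three measured line voltages,
• $I_A$, $I_B$, $I_C$ the three measured line currents,
it is possible to calculate the unbalance of voltages and currents as follows:
$$\delta_I = \frac{\max(|I - I_m|)}{I_m}, \qquad I_m = \frac{I_A + I_B + I_C}{3}$$
$$\delta_V = \frac{\max(|V - V_m|)}{V_m}, \qquad V_m = \frac{V_{AB} + V_{BC} + V_{CA}}{3} \tag{11}$$
where the maximum is taken over the three measured line values.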
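A minimal sketch of Equation (11) follows: the unbalance is the largest absolute deviation from the mean of the three line values, divided by that mean. The readings and the function name `unbalance` are hypothetical and serve only to illustrate the calculation.

```python
def unbalance(values):
    """Maximum absolute deviation from the mean, relative to the mean (Eq. 11)."""
    mean = sum(values) / len(values)
    return max(abs(v - mean) for v in values) / mean

# Hypothetical line readings for one time slot.
delta_i = unbalance([10.0, 10.5, 9.5])       # line currents I_A, I_B, I_C
delta_v = unbalance([208.0, 207.0, 209.0])   # line voltages V_AB, V_BC, V_CA
print(delta_i, round(delta_v, 4))
```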
It is also possible to obtain the phasor values of the line currents and voltages and
build a model of the consumer as an unbalanced three-phase load in Delta connection (see
Appendix A). We denote these quantities as follows:
• Line voltage phasors: $V_{AB}$, $V_{BC}$, $V_{CA}$.
• Line current phasors: $I_A$, $I_B$, $I_C$.
• Phase current phasors: $I_{AB}$, $I_{BC}$, $I_{CA}$.
• Phase impedances: $Z_{AB}$, $Z_{BC}$, $Z_{CA}$.
Each of the 12 complex quantities in the previous list can be represented in rectangular
form (real part and imaginary part) or in polar form (magnitude and angle). For that reason,
it is possible to calculate 4 numerical values for each of the 12 quantities (48 in total).
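To make the count of four numerical values per complex quantity concrete, the following sketch extracts the rectangular and polar representations of a single phasor with the standard `cmath` module. The impedance value and the function name `four_values` are assumptions for illustration.

```python
import cmath
import math

def four_values(z):
    """Real part, imaginary part, magnitude and angle (in degrees) of a complex quantity."""
    magnitude, angle = cmath.polar(z)
    return {
        "real": z.real,
        "imag": z.imag,
        "magnitude": magnitude,
        "angle_deg": math.degrees(angle),
    }

# Hypothetical phase impedance with magnitude 10 and angle 30 degrees.
z_ab = cmath.rect(10.0, math.radians(30.0))
vals = four_values(z_ab)
print(vals)
```

Applying the same function to the 12 quantities listed above would give the 48 candidate variables mentioned in the text.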
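In this experiment, it was decided to build a data table with the following variables:
• $P$: the measured three-phase active power;
• $Q$: the measured three-phase reactive power;
• $\delta_V$: the voltage unbalance (labeled dVL in the figures and tables);
• $\delta_I$: the current unbalance (labeled dIL in the figures and tables);
• $\hat{Z}_m$: the average angle of the phase impedances (labeled ANG in the figures and tables).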
Figure 19 shows the average profile and the box plot for each of the variables $P$, $Q$,
$\delta_V$, $\delta_I$ and $\hat{Z}_m$.
Figure 19. Average profile and box plots of the extended variables in Scenario B (panels: P, Q, ANG, dIL, dVL). Blue: mean. Orange:
median.
To explore the information in the five variables, the histograms (Figure 20) and the
point clouds between each pair of variables (Figure 21) have been constructed. As in
Scenario A, a color code has also been used to indicate the hour of the day to which each
point corresponds.
Figure 20. Histograms of the extended variables in Scenario B.
Figure 21. Correlograms of the extended variables in Scenario B. Each color represents the hour of
the day.
By applying k-means to each variable separately, the profiles shown in Figure 22
were obtained. To compare the effect of each of these variables, the labeling matches for
each pair of variables have been computed. The result is shown in Table 4, and
the corresponding measure of dispersion in Table 5. The variables of Scenario B are
even more independent of each other than those of Scenario A.
Table 4. Coincidence table between classifications. Scenario B. Each cell is the 3 × 3 matrix of label coincidences for the corresponding pair of variables.
                P              Q            ANG            dIL            dVL
P     248   0   0 |   0 175  73 | 118 130   0 |   1  63 184 |  99 115  34
        0 315   0 | 221   1  93 | 186  31  98 | 173  70  72 |  74 143  98
        0   0 216 |  12  48 156 | 130  83   3 |   4  71 141 |  66 119  31
Q       0 221  12 | 233   0   0 | 187   5  41 | 120  52  61 |  55 104  74
      175   1  48 |   0 224   0 |  47 173   4 |   2  69 153 |  85 109  30
       73  93 156 |   0   0 322 | 200  66  56 |  56  83 183 |  99 164  59
ANG   118 186 130 | 187  47 200 | 434   0   0 | 102 113 219 | 140 205  89
      130  31  83 |   5 173  66 |   0 244   0 |   8  71 165 |  86 123  35
        0  98   3 |  41   4  56 |   0   0 101 |  68  20  13 |  13  49  39
dIL     1 173   4 | 120   2  56 | 102   8  68 | 178   0   0 |  29  85  64
       63  70  71 |  52  69  83 | 113  71  20 |   0 204   0 |  69 101  34
      184  72 141 |  61 153 183 | 219 165  13 |   0   0 397 | 141 191  65
dVL    99  74  66 |  55  85  99 | 140  86  13 |  29  69 141 | 239   0   0
      115 143 119 | 104 109 164 | 205 123  49 |  85 101 191 |   0 377   0
       34  98  31 |  74  30  59 |  89  35  39 |  64  34  65 |   0   0 163
Figure 22. Prototype clusters obtained with the extended variables in Scenario B (panels: P, Q, ANG, dIL, dVL; x-axis: hour; y-axis: consumption). Each color represents
a single cluster.
Table 5. Dispersion table in the coincidences obtained with the extended variables in Scenario B.
      P     Q     ANG   dIL   dVL
P     1.00  0.17  0.17  0.00  0.00
Q     0.17  1.00  0.00  0.00  0.00
ANG   0.17  0.00  1.00  0.00  0.00
dIL   0.00  0.00  0.00  1.00  0.00
dVL   0.00  0.00  0.00  0.00  1.00
7.3. Discussion
As expected, the incorporation of new information allows better analyses of consumption.
The purpose of this experiment has been to illustrate how the additional measurements
that are often available along with consumption records can be used to enrich the profiles.
However, the type of additional information that may be useful depends both on the type
of problem addressed and on the origin of the measurements. The situation differs depending on
whether all the records correspond to a single user or to a set of users.
Some common examples of information that helps describe the context in which consumption
takes place are:
• For a single user: day of the week, day of the year, business day or holiday, and
weather conditions. As an illustration, Figure 23 shows the incidence of the day of
the year and the day of the week on the daily energy consumption, for the dataset of
this article.
• For several users in the same geographic location: economic activity and financial capacity.
• For several users in different places: climate and weather conditions.
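For the single-user case, calendar context can be derived directly from the date of each daily record. The sketch below uses only the Python standard library; the function name `calendar_features` and the chosen date are assumptions for illustration, and holidays would need an external calendar that is not modeled here.

```python
from datetime import date

def calendar_features(d):
    """Context variables for one daily record (1 = Monday, as in Figure 23)."""
    return {
        "day_of_week": d.isoweekday(),          # 1 = Monday ... 7 = Sunday
        "day_of_year": d.timetuple().tm_yday,   # 1 ... 365 (366 in leap years)
        "is_weekend": d.isoweekday() >= 6,      # holidays are not detected here
    }

features = calendar_features(date(2022, 10, 13))
print(features)
```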
Moreover, the importance of measurements other than energy consumption is underestimated.
Grid unbalance and harmonic distortion are easy to obtain with
most common smart meters. It is well known that both phenomena, unbalance and
distortion, directly affect the energy efficiency of the grid [25–27]. The enrichment of
the data set may be useful if we want to design, for example, investment incentives for users
in order to solve power quality issues [28,29] or to locate Distribution Static Compensators
(DSTATCOM) [30].
Figure 23. Incidence of the day of the year and the day of the week in energy consumption (x-axis: day of the year; y-axis: daily energy consumption; colors 1 to 7: day of the week).
1 = Monday, 2 = Tuesday, etc.
8. Experiment 6. Analysis of a Fee Scheme
In order to assess the possible effect of each of the data preparation techniques studied,
we use them in an application: the analysis of a fee scheme of the Time Of Use (TOU) type.
To do this, we assume that each of the daily consumption records of the
dataset corresponds to the average profile of a user of a certain company (a utility). The
company is interested in analyzing the effects that a change from a flat-rate scheme to a
TOU-type scheme could have on its income. The following procedure is applied:
1. The average consumption profile is obtained, that is, the profile shown on the left of
Figure 1.
2. Based on the average profile, a TOU-type fee scheme is designed. The procedure is
explained in Section 8.1.
3. The users are classified based on their profiles into two types:
(a) A: users with a variable consumption throughout the 24 h of the day.
(b) B: users with a constant consumption throughout the 24 h of the day.
4. $\Delta_{op}$ is calculated: the relative change that would occur in the company's income if each user
selects the fee scheme (flat or TOU) that suits them best.
5. $\Delta_{cl}$ is calculated: the relative change that would occur in the company's income if each user is
assigned the fee scheme according to the classification of step 3.
The above analysis depends on the results of the classification process of step 3,
which in turn depend on the data preparation process. This analysis is performed
in Section 8.2.
8.1. Design of the Fee Scheme
For the construction of the TOU fee scheme, we use the ideas formulated in [31], which
we reproduce below. The purpose is to obtain a fee scheme $f = [f_1, f_2, \cdots, f_N]$, where
$f_i$ is the fee charged for energy consumption in the day slot $i$. In this exercise hourly slots
are assumed, and therefore $N = 24$.
It is established as an additional condition that the daily cost of energy for a given
consumption profile $p = [p_1, p_2, \cdots, p_N]$ must be $C$, equal to what would correspond to a
certain known flat rate, that is:
$$C = \sum_{i=1}^{N} f_i p_i \tag{12}$$
To do this, we start from a prototype fee scheme $\bar{f} = [\bar{f}_1, \bar{f}_2, \cdots, \bar{f}_N]$ that defines the
desired shape of $f$. The operation to obtain $f$ from $\bar{f}$ is an affine scaling:
$$f_i = f_{\min} + \frac{\bar{f}_i - \bar{f}_{\min}}{\bar{f}_{\max} - \bar{f}_{\min}} (f_{\max} - f_{\min})$$
where the subscripts max and min identify the maximum and minimum values of the schemes.
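Under these conditions, it is shown in [31] that (12) is satisfied if and only if
$$f_{\min} = \frac{C}{P_p + (\gamma - 1) P_v}, \qquad f_{\max} = \gamma f_{\min} \tag{13}$$
with $\gamma$ a design factor and
$$P_p = \sum_{i=1}^{N} p_i, \qquad P_v = \sum_{i=1}^{N} \frac{\bar{f}_i - \bar{f}_{\min}}{\bar{f}_{\max} - \bar{f}_{\min}} p_i$$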
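On the other hand, the prototype scheme $\bar{f}$ was built from the profile $p$
as follows:
$$\bar{f}_i = \begin{cases} 0.0 & \text{if } 0.0 \le \bar{p}_i < 0.2 \\ 0.1 & \text{if } 0.2 \le \bar{p}_i < 0.8 \\ 1.0 & \text{if } 0.8 \le \bar{p}_i \le 1.0 \end{cases} \qquad \bar{p}_i = \frac{p_i - p_{\min}}{p_{\max} - p_{\min}}, \quad i = 1, 2, \cdots, N \tag{14}$$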
Using a flat fee of arbitrary value 100, a design factor $\gamma = 4$, and the average profile of
our dataset, the fee scheme obtained is shown in Figure 24.
Figure 24. Fee schemes (series: flat and TOU; x-axis: hour; y-axis: fee).
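The design of Equations (12) to (14) can be sketched in a few lines. The code below is an illustrative implementation under stated assumptions, not the authors' software: the 24-value profile is hypothetical, the function names are invented for the example, and the daily cost C is taken as the flat fee times the total consumption.

```python
def prototype_scheme(p):
    """Eq. (14): three-level prototype built from the min-max normalized profile."""
    p_min, p_max = min(p), max(p)              # assumes the profile is not constant
    proto = []
    for p_i in p:
        x = (p_i - p_min) / (p_max - p_min)
        if x < 0.2:
            proto.append(0.0)
        elif x < 0.8:
            proto.append(0.1)
        else:                                  # 0.8 or more, including the peak itself
            proto.append(1.0)
    return proto

def tou_fees(p, flat_fee, gamma):
    """Eqs. (12)-(13): scale the prototype so that the daily cost equals the flat-rate cost."""
    proto = prototype_scheme(p)
    lo, hi = min(proto), max(proto)
    w = [(f_bar - lo) / (hi - lo) for f_bar in proto]    # normalized prototype
    cost = flat_fee * sum(p)                             # C, assumed flat fee x total energy
    p_p = sum(p)
    p_v = sum(w_i * p_i for w_i, p_i in zip(w, p))
    f_min = cost / (p_p + (gamma - 1) * p_v)
    f_max = gamma * f_min
    return [f_min + w_i * (f_max - f_min) for w_i in w]

# Hypothetical average profile (24 hourly values).
profile = [900, 880, 870, 860, 870, 900, 1000, 1150, 1300, 1400, 1450, 1480,
           1500, 1490, 1470, 1450, 1400, 1350, 1300, 1200, 1100, 1000, 950, 920]
fees = tou_fees(profile, flat_fee=100, gamma=4)
tou_cost = sum(f * p for f, p in zip(fees, profile))
print(round(min(fees), 2), round(max(fees), 2), round(tou_cost, 2))
```

By construction, the TOU cost of the design profile equals its flat-rate cost, and the ratio between the highest and the lowest fee equals the design factor.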
8.2. Results and Analysis
To analyze the effect of data preparation, we have considered the following cases:
• No preparation. The grouping has been conducted on the original dataset.
• The four standardization options presented in Section 4.
• All the possible combinations of geometric indicators presented in Section 5. In
total there are 15 combinations.
• Dimensionality reduction by PCA, with 1, 2 and 3 components.
• Expanded information with each of the indicators presented in Sections 7.1 and 7.2.
The above listing generates 29 ways to prepare the data before performing the classification
process of step 3.
The total income is the sum of the individual incomes $R_k$ corresponding to the
$M$ users:
$$R = \sum_{k=1}^{M} R_k$$
and $R_k$ depends on the consumption of the individual $k$ and the fee $f$ that is applied:
$$R_k = \sum_{i=1}^{N} f_i p_{i,k}$$
We define:
$R_p$: income using a flat fee for all the users.
$R_{op}$: income using the fee that best fits every user.
$R_{cl}$: income using the fee chosen according to the user classification.
The fee assignment for the calculation of $R_{cl}$ has been carried out as follows:
• For type A users: TOU fee.
• For type B users: flat fee.
The relative changes $\Delta_{op}$ and $\Delta_{cl}$ in steps 4 and 5 are obtained as:
$$\Delta_{op} = \frac{R_{op} - R_p}{R_p}, \qquad \Delta_{cl} = \frac{R_{cl} - R_p}{R_p}$$
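The two relative changes can be illustrated with a toy case. The sketch below assumes that the fee that best fits a user is the one giving that user the lower daily cost; the three-slot profiles, the fee values and the function names are hypothetical.

```python
def income(fees, profile):
    """Income from one user: sum of fee times consumption over the time slots."""
    return sum(f * p for f, p in zip(fees, profile))

def relative_changes(profiles, flat, tou, is_type_a):
    """Return (delta_op, delta_cl) with respect to the all-flat income."""
    r_p = sum(income(flat, p) for p in profiles)
    # Each user picks the cheaper scheme (assumed meaning of "best fit").
    r_op = sum(min(income(flat, p), income(tou, p)) for p in profiles)
    # Type A users get the TOU fee, type B users get the flat fee.
    r_cl = sum(income(tou if a else flat, p) for p, a in zip(profiles, is_type_a))
    return (r_op - r_p) / r_p, (r_cl - r_p) / r_p

# Toy example with three time slots and three users.
flat = [100, 100, 100]
tou = [50, 50, 200]
profiles = [[1, 1, 4], [2, 2, 2], [4, 1, 1]]
is_type_a = [True, False, True]          # variable consumption: users 1 and 3

delta_op, delta_cl = relative_changes(profiles, flat, tou, is_type_a)
print(round(delta_op, 4), round(delta_cl, 4))
```

Under this assumption $\Delta_{op}$ can never be positive and never exceeds $\Delta_{cl}$, since each user's self-selected cost is at most the cost under any assigned scheme.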
Figure 25 shows the relative changes $\Delta_{op}$ and $\Delta_{cl}$ for each of the analyzed cases.
$\Delta_{op}$ is independent of the classification process and is therefore useful as a reference value to
compare the effect of different data preparation processes.
Figure 25. Relative changes of the income (x-axis: relative change, from −0.02 to 0.02; series: best fit for user, according to classification). Cases shown: Original; Matrix; B. columns; B. rows; Extended b. rows; the 15 combinations of MEAN, PEAK, STD-DEV and KURTOSIS; PCA-1, PCA-2 and PCA-3; P-S, P-Q, P-PF, P-PHI, P-dIL and P-dVL.
In Figure 25, it can be seen that if the users are free to select the fee scheme, the total
income would decrease (this is the leftmost value, −0.0267 for the data of the
experiment). On the other hand, if the schemes are allocated according to the result
of the classification, the income would increase. This higher value is expected to act as
an economic incentive for users to modify their consumption patterns and allow for a
flatter curve.
The incentive is strongly affected by the data preparation process. In particular, the
geometric indicator STD-DEV (standard deviation) achieves
the best value, while KURTOSIS (kurtosis) has the poorest performance. This is not
surprising, because the standard deviation is a very good measure of how flat the profile
is, and therefore it is a very good indicator of the expected impact of the TOU fee scheme.
Curiously, the simultaneous use of these two indicators (STD-DEV-KURTOSIS) also performs
very well.
9. Conclusions
It is evident that the preparation of the data affects the final result of the analysis. This
is why the data preparation process should be made explicit in each report and not be
ignored as a matter of little added value. It is particularly important to note that the most
common default standardization process, standardization by columns, is not suitable for
preparing energy consumption profiles, because it distorts the shape of load curves. We
emphasize that there is a strong relation between columns.
Principal Component Analysis has been shown to be particularly efficient with our
dataset and should always be considered. However, if some kind of interpretability of
the intermediate results is required, some geometric characteristics such as the mean and
standard deviation are a good alternative. These two options, when working in low-
dimensional spaces, are attractive for the preparation of massive data.
On the other hand, the enrichment of the data with information from other measurements,
or with external data that explains the context of these measurements, is an area of
work that has yet to be standardized. An effort by the academic community to facilitate
the comparability of the multiple investigations carried out would benefit the
development of Energy Data Science.
Although in the experiment on the analysis of a fee scheme a good performance was
obtained for a certain type of data preparation (the use of the standard deviation), this
result is not a recommendation to use it: quite the contrary. What we are stating
is that the best data preparation process will depend on each application, and possibly
on each dataset; therefore, we want to emphasize the need to report it properly in each
academic communication.
Author Contributions: Conceptualization, O.G.D. and M.d.C.P.; methodology, O.G.D.; software,
O.G.D.; validation, J.A.R.; writing—review and editing, O.G.D. All authors have read and agreed to
the published version of the manuscript.
Funding: This project has partial support from the Asociación Universitaria Iberoamericana de
Posgrados (AUIP), mobility scholarship for postdoctoral stays at Andalusian universities, 2022 from
the Ministerio de Ciencia e Innovación (Spain) (Research Project PID2020-112495RB-C21) and from
the I+D+i FEDER 2020 project B-TIC-42-UGR20.
Data Availability Statement: All data used in this paper are available at https://github.com/
ogduartev/energyDataScience/tree/main/data/campus.
Acknowledgments: The authors thank the engineer Alvaro Alfonso Zambrano, who provided the
raw dataset.
Conflicts of Interest: The authors declare no conflict of interest.
Energies 2022,15, 7557 26 of 30
Appendix A. Obtaining Phasors from Line Values in Three-Phase Delta Systems
Appendix A.1. Problem Statement
In the system of Figure A1 we have the following measurements:
• Line voltage magnitudes: VAB ,VBC,VCA.
• Line current magnitudes: IA,IB,IC.
• Total three-phase active power consumed by the load: P3ϕ.
• Total three-phase reactive power consumed by the load: Q3ϕ.
• Total three-phase apparent power consumed by the load: S3ϕ.
We need to compute:
1. Line voltage phasors: VAB,VBC,VCA .
2. Line current phasors: IA,IB,IC.
3. Phase current phasors: IAB,IBC,ICA .
4. Phase impedances: ZAB,ZBC ,ZCA.
Figure A1. Problem statement (delta-connected load with phase impedances ZAB, ZBC, ZCA between nodes A, B, C and line currents IA, IB, IC).
Figure A2. Angles and sides of a triangle (angles A, B, C; sides a, b, c).
Appendix A.2. Angles in a Three-Phase System
The first two problems are similar. In both cases the magnitudes of a three-phase system
of variables (albeit balanced