Date of publication xxxx 00, 0000, date of current version 13.9.2023.
Digital Object Identifier xx.xxxx/ACCESS.202x.xxxxxxx
Energy Efficient Deep Multi-Label ON/OFF
Classification of Low Frequency Metered Home
Appliances
ANŽE PIRNAT1, BLAŽ BERTALANIČ1, GREGOR CERAR1, MIHAEL MOHORČIČ1 AND CAROLINA FORTUNA1
1Jozef Stefan Institute, Jamova cesta 39, 1000 Ljubljana (e-mails: ap6928@student.uni-lj.si, {blaz.bertalanic, gregor.cerar, mihael.mohorcic,
carolina.fortuna}@ijs.si)
This work was funded in part by the Slovenian Research Agency under the grant P2-0016. This project has received funding from the European Union's Horizon Europe Framework Programme under grant agreement No 872525 (BD4OPEM).
ABSTRACT
Non-intrusive load monitoring (NILM) is the process of obtaining appliance-level data from a single metering point measuring the total electricity consumption of a household or a business. Appliance-level data can be directly used for demand response applications and energy management systems, as well as for raising awareness and motivating improvements in energy efficiency. Recently, classical machine learning and deep learning (DL) techniques became very popular and proved highly effective for NILM classification, but with growing complexity these methods face significant computational and energy demands during both training and operation. In this paper, we introduce a novel DL model aimed at enhanced multi-label classification of NILM with improved computation and energy efficiency. We also propose an evaluation methodology for the comparison of different models, using data synthesized from measurement datasets so as to better represent real-world scenarios. Compared to the state-of-the-art, the proposed model has its energy consumption reduced by more than 23 % while providing on average approximately 8 percentage points of performance improvement when evaluated on data derived from the REFIT and UK-DALE datasets. We also show a 12 percentage point performance advantage of the proposed DL-based model over a random forest model, and observe performance degradation as the number of devices in the household increases: with each additional 5 devices, the average performance degrades by approximately 7 percentage points.
INDEX TERMS non-intrusive load monitoring (NILM), deep learning (DL), convolutional recurrent neural
network (CRNN), multi-label classification, load profiling
I. INTRODUCTION
CLIMATE change represents a formidable challenge, and mitigating its impacts requires a concerted effort to keep the increase in global average temperature below 1.5 °C relative to pre-industrial levels. Electrical energy production is estimated to contribute more than 40 % of the total CO2 equivalent produced by humankind¹,². Consequently, some of the necessary steps in mitigating climate change are to reduce energy consumption, and subsequently its production, as well as to increase the share of renewable energy sources³
1https://tinyurl.com/electricity-production-CO2-1 (accessed 4.3.2024)
2https://tinyurl.com/electricity-production-CO2-2 (accessed 4.3.2024)
3https://tinyurl.com/renewable-energy-doubled (accessed 4.3.2024)
that produce far less CO2 equivalent than traditional power plants burning fossil fuels. However, renewable energy sources mostly depend on external conditions such as wind and sun, and are thus less predictable, posing a challenge to the stability and reliability of the electrical power system [1]. To address this problem we have to work with the concept of demand response, i.e., shifting electrical power consumption to better match demand with supply [2]. Because of demand response, efforts are being made to monitor and manage energy consumption more effectively in residential buildings, which makes monitoring the activity of devices (ON/OFF events) relevant [3].
Monitoring each device separately is costly and invasive
VOLUME 11, 2023 i
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3382830
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
since it requires the installation of an electricity meter on each appliance. As an alternative, non-intrusive load monitoring (NILM) supported with disaggregation methods is able to reach the same result with just one electricity meter per household and is thus much more economical [4]. NILM is the process of obtaining appliance-level data from a single metering point measuring the total electricity consumption of a household or a business. By subsequent processing, it is possible to decompose NILM data into individual components, and by classification we can determine the state (ON/OFF) of devices and thus monitor their activity for demand response applications. In Europe, households consume 27.4 % of all electricity produced⁴. Thus, cutting down on their consumption would play an important role in reducing our carbon footprint. As several research studies have shown, when given real-time feedback on their electricity savings, residents achieve a more comprehensive understanding of their electrical consumption and develop more energy-aware behaviour; consequently, they consume 12 % less electricity than they would otherwise [5]. Classification on NILM data can provide such feedback on device activity and thereby contribute to these savings.
In the described application areas for classification on NILM data, more than one device can be active at a time. Thus, the best approach to determine the activity states of the appliances is multi-label classification, where the state of each appliance is used as a class label and the recorded readings from the household's single meter serve as input samples. Multi-label classification has been attempted on NILM data with numerous methods that can be divided into two categories. The first category includes single-channel source separation techniques such as matrix factorization [6], sparse coding [7], dictionary learning [8], and non-negative tensor factorization [9], while the second category comprises machine learning approaches such as support vector machines (SVM) [10], random forests (RF) [11], decision trees (DT) [12] and deep neural networks [13]–[16].
A. CONTRIBUTIONS AND PAPER ORGANIZATION
In this paper we are concerned with energy-efficient ON/OFF classification of NILM data aimed at decreasing the overall energy consumption, which includes investigating whether more complex and accurate DL approaches outweigh simpler and less power-hungry classical ML approaches. The reduction in energy consumption, motivated by the aforementioned ecological reasons, stems from a reduction in computational cost, which is highly encouraged by the cloud computing community [17] and seen as a necessity in the field [18]. We propose a new DL architecture, inspired by the VGG family of architectures and RNNs, for multi-label device activity classification on NILM data. We show that its performance is better than that of state-of-the-art CNNs, CRNNs and classical ML algorithms, while being much more energy efficient than the state-of-the-art DL models.
4https://tinyurl.com/home-consumption-statistic (accessed 4.3.2024)
Classical ML and DL models such as those used in [16], [19]–[23] and [24] are trained and evaluated with only five particular devices, which occur commonly across different datasets and the houses within them. This method is well suited for comparison between models, but fails to reflect realistic performance as it fully disregards all other devices. Those 5 devices are chosen because each draws a large proportion of energy and they represent various power signatures. As stated in [25], this can be especially problematic since all devices with smaller energy consumption are thus disregarded, while modern homes tend to have many of them. Moreover, other problem types on NILM data, such as disaggregation and appliance classification, already employ far more than 5 devices [13], [26]–[28]. The methodology should thus be extended to more devices, and this should be done according to the specifics of the dataset to make it as close to real-world examples as possible.
To this end, this work examines a novel methodology, depicted in Figure 1, that more accurately represents the performance of models in practical use cases. The extended methodology complements existing methodologies [16], [19]–[24] by creating a so-called Realistic evaluation dataset and a Sensitivity evaluation dataset. The existing methodologies typically assume that there are 5 devices per household and any of these can be active in the observation window. In the Realistic evaluation dataset we allow households to have a varying number of devices, say 5, 10, 15, ... N devices, of which any random combination can be active. In the Sensitivity evaluation dataset we further extend the possible combinations by allowing a household to have a varying number of devices (i.e. 5, 10, 15, ... N), but we study separately what happens when only 1, only 2, etc. are active at a time, thus enabling the assessment of the sensitivity to various combinations of devices.
We evaluate our model on both the established evaluation methodology and the proposed evaluation methodology to validate its performance.
Our main contributions are as follows:
• We propose a novel Convolutional transpose Recurrent Neural Network (CtRNN) architecture focused on reduced computational complexity, which offers superior performance compared to the existing state-of-the-art architectures, with an average improvement of approximately 8 percentage points on mixed datasets derived from REFIT and UK-DALE and more than 23 % lower energy consumption, making it a more sustainable solution.
• We propose a novel evaluation methodology that begins with a dataset analysis and involves generating two groups of mixed datasets that are utilized for both training and testing. By taking into account the unique properties of the original dataset when generating mixed datasets, our approach results in a more realistic evaluation of model performance, more closely reflecting real-world scenarios.
Figure 1: The proposed extended evaluation methodology with the Sensitivity Evaluation (SE) and Realistic Evaluation (RE) groups of mixed datasets, compared to the existing methodology [16], [19]–[24]. In the existing methodology, D1 = {fridge, washing machine, dishwasher, microwave, kettle} and a random 0 to 5 of the 5 devices are active per sample. In SE, a household with, e.g., 10 devices has exactly 1, or exactly 2, ..., or exactly Nmax.AD of them active per sample; in RE, a household with, e.g., 10 devices has a random 1 to N̄max.AD of them active per sample. Nmax.AD and N̄max.AD are obtained through dataset analysis.
• We perform a comprehensive analysis taking into account the performance and energy efficiency of the compared approaches for NILM ON/OFF classification. We observe an average drop of approximately 7 percentage points in F1 score for each 5 devices added to the household.
This paper is organized as follows. Section II analyzes related work, while Section III provides the problem statement and elaborates on methodological aspects. Section IV presents the proposed model and Section V provides a comprehensive evaluation, with Section VI concluding the paper.
II. RELATED WORK
In this section, we present related work focusing on multi-
label classification on NILM with the use of classical machine
learning (ML) algorithms and deep learning (DL) techniques.
To provide a comprehensive overview of the state-of-the-art in this area, we have compiled Table 1 summarizing selected important references to prior work, including the type of problem addressed, the approach used, the number and names of the datasets utilized, and the number of devices involved in each study. However, in the following subsections, when discussing some of these specific aspects, we also refer to further relevant works.
A. NILM PROBLEM TYPE
The second column in Table 1 demonstrates that state-of-the-art approaches for NILM can be categorized into three distinct types: disaggregation, ON/OFF classification, and appliance classification.

Table 1: Comparison of results from related works.

Work | Problem Type | Approach Type | Approach | Datasets | Devices no.
Tabatabaei et al. [29] | ON/OFF classification | Classic ML | RAkEL, MLkNN | REDD (LF) | up to 5
Raiker et al. [26] | Disaggregation | Classic ML | fHMM | DRED, AMPds, BLUED, UK-DALE, WHITED, PLAID (LF&HF) | up to 11
Wu et al. [11] | ON/OFF classification | Classic ML | RF | BLUED (HF) | 5
Singh et al. [30] | ON/OFF classification | Classic ML | SRC | REDD, Pecan Street (LF) | 4
Çimen et al. [31] | Disaggregation | DL | AAE | UK-DALE, REDD (LF) | 5
Ciancetta et al. [27] | Appliance class. | DL | CNN | BLUED (HF) | 34
Chen et al. [28] | Appliance class. | DL | TSCNN | PLAID, WHITED (HF) | up to 15
Zhou et al. [32] | Appliance class. | DL | SNN | PLAID (HF) | 11
Yin et al. [33] | Appliance class. | DL | DCNN | custom (HF) | up to 5
Tanoni et al. [16] | ON/OFF classification | DL | CRNN | REFIT, UK-DALE (LF) | 5
Langevin et al. [19] | ON/OFF classification | DL | CRNN | REFIT, UK-DALE (LF) | 5
This work | ON/OFF classification | DL | CtRNN | REFIT, UK-DALE (LF) | up to 54

The disaggregation problem is focused on decomposing the NILM signal into individual components that correspond to distinct power signatures of active appliances [26], [31]. The ON/OFF classification of appliances aims to determine which devices are active and which inactive in an aggregated power signal [11], [16], [19], [29], [30]. The appliance classification problem assumes that the disaggregated signals are accessible and intends to classify the devices that generated each unique power signature extracted from the NILM signal [27], [28], [32], [33].
The focus of this paper is on the ON/OFF classification
problem type, which pertains to the identification of the ac-
tivity state of individual appliances from an aggregated power
signal without requiring prior disaggregation.
B. METHODS FOR SOLVING NILM PROBLEMS
As the analysis of the related work shows, the approaches to solving NILM involve either a two-stage process, in which disaggregation is performed first and then followed by classification for automatic appliance identification, or a one-stage process in which ON/OFF classification is done directly on aggregated data. In the last few years, several classic ML and DL methods have been proposed in this area.
Classic ML algorithms utilized in the reviewed work are Random k-labELsets (RAkEL) [11], [29], [34], factorial Hidden Markov Models (fHMM) [26], Random Forest (RF) [11], [35], Sparse Representation based Classification (SRC) [30], Classification And Regression Trees (CART) [35], Extra Trees (ET) [35], k-Nearest Neighbors (kNN) [11], [35], Linear Discriminant Analysis (LDA) [35] and Naïve Bayes (NB) [35].
The latest state-of-the-art approaches for NILM are based on DL algorithms, as shown in Table 1. In the reviewed works, Convolutional Neural Networks (CNN) [27], [28], [33] and Convolutional Recurrent Neural Networks (CRNN) [15], [16], [19] are the most common choice. However, a variety of other algorithms are also used, e.g. Adversarial Autoencoders (AAE) [31], a custom architecture TTRNet [13] and Spiking Neural Networks (SNN) [32].
Utilizing the fully supervised learning method, Wu et al.
[11] conducted an experiment to evaluate various classical
machine learning algorithms for multi-label classification of
NILM data for load identification. Their findings indicate
that Random Forest (RF) outperforms other learning algorithms. Similarly, Rehmani et al. [35] demonstrated that computationally intensive deep learning (DL) algorithms, such as CNN and RNN, were not required for their particular datasets, as classical machine learning algorithms such as kNN and RF already yielded an accuracy of 99 %. However, the openly available and well-documented REFIT and UK-DALE datasets, recently used in many reference works as well as in this study, do not yield suitable performance with classical machine learning and are therefore used with DL models.
To address the high computational complexity and energy
consumption associated with DL models, we designed a novel
DL architecture based on the principles established in our
previous work [36]. Additionally, to ensure consistency with
the latest research, we compared our approach to works by
Langevin et al. [19] and Tanoni et al. [16], who also utilized
the same datasets of REFIT and UK-DALE. Furthermore, we
also considered the findings of Ahajjam et al. [37], who dis-
covered that the optimal signal length varies across datasets,
and hence, we adopted the same signal length as Tanoni et al.
[16].
C. NILM DATA FOR ML MODEL TRAINING
As per the 5th column in Table 1, the related works employed low-frequency (LF) and high-frequency (HF) datasets. The European Union and UK technical specifications suggest the use of LF smart meters with a sampling rate of around 10 seconds⁵ for units installed in typical households. To circumvent the need to purchase and install new HF smart meters, and instead utilize the existing LF meters whose readings are already available via the COSEM interface classes and the OBIS Object Identification System⁶, this paper proposes the development of an ON/OFF classification model for LF meters.
Typically, the number of devices employed in different
works is fixed, with the exception of Raiker et al. [26], Chen
et al. [28] and Yin et al. [33], who utilized up to 11, 15 and
5https://tinyurl.com/SMIP-E2E-SMETS2 (accessed 4.3.2024)
6https://tinyurl.com/COSEM-interface-classes (accessed 4.3.2024)
Figure 2: Classification of devices as active or inactive based on the household NILM data using a classical ML or DL model to obtain a score s_i for each device; devices with s_i > 0.5 are classified as active, the others as inactive.
5 devices, respectively. In this work, however, we utilize a
flexible range of up to 54 devices.
D. ENERGY CONSUMPTION
The energy consumption of the hardware used for running DL models has only recently become a growing concern in the community. In [38],
Hsueh conducted an analysis of the energy consumption of
ML algorithms and found that convolutional layers, operat-
ing in three dimensions, consume significantly more power
compared to fully connected layers, which operate in two
dimensions. Verhelst et al. [39] delved into the complexity
of CNNs and explored hardware optimization techniques,
particularly for the Internet of Things (IoT) and embedded de-
vices. Another study by Garcia et al. [40] surveyed the energy
consumption of various models and proposed a taxonomy
of power estimation models at both software and hardware
levels. They also discussed existing approaches for estimating
energy consumption, noting that using the number of weights
alone is not accurate enough. They suggested that a more
precise calculation of energy consumption requires the cal-
culation of either FLOPs or multiply-accumulate operations.
III. PROBLEM STATEMENT AND METHODOLOGY
A. PROBLEM STATEMENT
The objective of this study is to identify which devices are currently active. The total electrical power p consumed by a household at any given moment t is calculated as the sum of the power used by each electrical device, denoted as p_i(t), where there are N_d devices in total, as defined in Eq. 1. Additionally, measurement noise (including any unidentified residual devices) e(t) is also taken into account. The status indicator s_i(t) determines the activity of each device, where s_i(t) = 0 indicates that the device is inactive and s_i(t) = 1 indicates that the device is active at the given moment t.

p(t) = \sum_{i=1}^{N_d} s_i(t) p_i(t) + e(t)    (1)
To solve the problem and thus estimate the status indicator s_i(t) for each device, we can employ classical ML or DL for multi-label classification of devices. Devices are classified as active if the corresponding status indicator s_i(t) predicted
Figure 3: Probability distribution of the number of active devices in a 6-hour time window in the (a) REFIT and (b) UK-DALE datasets.
by the model exceeds 0.5, as illustrated in Figure 2. The cardinality of the set s representing all the possible active devices, denoted by |s|, indicates the number of labels that need to be recognized. In the context of this paper, the value of |s| varies between experiments, as explained in Sections III-B1 and III-B2.
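The 0.5-threshold decision rule illustrated in Figure 2 can be sketched as follows. This is a minimal illustration only; the per-device scores below are hypothetical stand-ins for a real model's sigmoid outputs.

```python
# Minimal sketch of the multi-label ON/OFF decision rule: a model produces a
# score s_i in [0, 1] per device, and devices with s_i > 0.5 are declared
# active. The scores here are hypothetical, not outputs of the actual CtRNN.

def classify_on_off(scores, threshold=0.5):
    """Map per-device sigmoid scores to ON (1) / OFF (0) labels."""
    return [1 if s > threshold else 0 for s in scores]

scores = [0.91, 0.12, 0.55, 0.49, 0.80]  # one score per device D1..D5
labels = classify_on_off(scores)
print(labels)  # -> [1, 0, 1, 0, 1]
```

Note that a score exactly at the threshold is classified as inactive, matching the strict inequality s_i > 0.5 used in the figure.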
Figure 4: The proposed Convolutional transpose Recurrent Neural Network (CtRNN) architecture, inspired by the VGG family of architectures. The figure lists the layer stack from the input (120000, 2550, 1) through four blocks of Conv1D (k=3, f=64/128/256/512) and AvgPooling (k=2, s=2) layers, a TransConv1D (k=3, f=512) layer, a GRU (64) layer, two FC (4096) layers, and an FC output layer of size [5, 10, 15, 20, 54], where "k" signifies the kernel, "f" represents the number of filters, and "s" denotes the stride value.
B. METHODOLOGY
To this end, this work examines a novel methodology, depicted in Figure 1, that more accurately represents the performance of models in practical use cases. We approached the problem by first analysing the dataset to identify the maximum number of active devices Nmax.AD in the time windows that the model is trained on, as well as the average maximum number N̄max.AD. We then generated two distinct groups of mixed datasets, each comprising different sets of active and inactive devices. The first group contained mixed datasets each with a fixed number of active devices (up to Nmax.AD), whereas the second group included mixed datasets with a varying number of active devices between 1 and N̄max.AD. To generate the groups of mixed datasets we used the REFIT [41] and UK-DALE [42] low-frequency datasets, also used by Tanoni et al. [16] and Langevin et al. [19].
We propose a methodology⁷ that aims to assess classical ML and DL models in a realistic scenario, which differs from the approaches commonly taken in recent works [16], [20]–[23] and [24] that use only 5 distinct devices for evaluation. This limited number does not represent the typical diversity of devices encountered in real-world settings. For instance, the analysis of the REFIT and UK-DALE datasets in Figure 3 reveals the presence of up to 9 and 26 active devices, respectively. Therefore, our methodology considers a wider range of devices for a more accurate evaluation of DL models in realistic conditions.
1) Sensitivity Evaluation Group
The Sensitivity Evaluation (SE) is a group of multiple mixed datasets that cover cases where there are 5, 10, 15 or 20 devices in total (DiT) in the household and 1, 2, ..., Nmax.AD of them are active devices (AD). The number of DiT is used universally across all datasets, while the maximum number of AD (Nmax.AD) is used as a parameter depending on the maximum number of active devices in the time window that we are training our model on.
Mixed datasets in the SE group provide an insight into how the model under test performs depending on the number of AD in four general cases of DiT, i.e. 5, 10, 15 and 20. In the case of the UK-DALE dataset we also give an insight into a case of 54 DiT, which significantly exceeds the maximum number of DiT in the REFIT dataset.
7https://github.com/anzepirnat/CtRNN/
In our case we used input windows of 2550 samples for training; at the sampling rate of REFIT and UK-DALE this results in an approximately 6-hour time window. In the given time window the number of AD ranges from 1 to 9 for REFIT and from 2 to 26 for UK-DALE, as shown in Figure 3, thus Nmax.AD = 9 for REFIT and Nmax.AD = 26 for UK-DALE.
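As a quick sanity check of the window length, the arithmetic can be sketched as follows. The 8-second sampling interval used here is an assumption (REFIT's nominal rate), not a figure stated in the text.

```python
# Back-of-the-envelope check of the input window length: 2550 samples at an
# assumed 8-second sampling interval (REFIT's nominal rate; an assumption).
samples = 2550
interval_s = 8  # assumed sampling interval in seconds
window_hours = samples * interval_s / 3600
print(round(window_hours, 2))  # -> 5.67, i.e. roughly a 6-hour window
```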
2) Realistic Evaluation Group
The Realistic Evaluation (RE) group is an extension of the methodology employed in many recent works [16], [19]–[24], which account for only 5 distinct devices. We propose a group of multiple mixed datasets that cover cases where there are 5, and also 10, 15 and 20, devices in total (DiT) in the household, chosen at random. We generate mixed datasets with an equal mix of all possible numbers of ADs. Thus, we generated 4 mixed datasets, each containing samples with 1, 2, ..., N̄max.AD ADs. Such an RE group presents a more practical evaluation of the model by simulating a real-world scenario in which households utilize a varying number of active devices rather than a fixed one.
In our case the average maximum number of active devices (N̄max.AD) was 8 for REFIT and 14 for UK-DALE, as supported by the results in Figure 3. Therefore, the training data comprised time windows with ADs ranging from 1 to 8 and 1 to 14 for REFIT and UK-DALE, respectively. However, in cases where there were fewer devices in the household than active devices, the range of ADs was set to 1 to DiT-1. For instance, when there were 5 DiT in the household, the range of ADs was from 1 to 4, as was the case for both REFIT and UK-DALE datasets. Similarly, when there were 10 devices in the household, the range was from 1 to 9 ADs. Lastly, it should be noted that we used an 80:20 split for the training and evaluation parts of all datasets in this research.
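The RE-style sample generation described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the device signals and names are synthetic placeholders, not real REFIT/UK-DALE traces, and the authors' actual generation code lives in their repository (footnote 7).

```python
import random

# Illustrative sketch of RE-style mixed-sample generation: pick a household
# with `dit` devices in total (DiT), draw a number of active devices (AD)
# uniformly from 1..min(avg_max_ad, dit - 1), and sum the chosen device
# signals into one aggregate. Device signals here are synthetic placeholders.

def make_mixed_sample(device_signals, dit, avg_max_ad, rng):
    devices = rng.sample(sorted(device_signals), dit)    # DiT in the household
    n_active = rng.randint(1, min(avg_max_ad, dit - 1))  # AD for this sample
    active = set(rng.sample(devices, n_active))
    length = len(next(iter(device_signals.values())))
    aggregate = [sum(device_signals[d][t] for d in active) for t in range(length)]
    labels = {d: int(d in active) for d in devices}      # multi-label targets
    return aggregate, labels

rng = random.Random(0)
signals = {f"dev{i}": [float(i)] * 4 for i in range(10)}  # toy constant signals
agg, labels = make_mixed_sample(signals, dit=5, avg_max_ad=8, rng=rng)
print(1 <= sum(labels.values()) <= 4)  # AD is between 1 and DiT-1 -> True
```

For a 5-DiT household with N̄max.AD = 8, the effective AD range collapses to 1..4, mirroring the DiT-1 rule described in the text.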
IV. PROPOSED NEURAL NETWORK ARCHITECTURE
To solve the problem defined in Section III-A we introduce
the novel CtRNN architecture based on the VGG family.
Architectures from this family, adapted for time series data,
have previously proved successful in NILM disaggregation
tasks [43]. Additionally, the hyper-parameters of the architecture were determined empirically, following the principles derived from our prior work [36] on balancing prediction performance against computational complexity. The computational complexity of a network, measured in floating-point operations (FLOPs), is determined by the number of layers L and the computational complexity of each separate layer F_l, as described in Eq. 2. Throughout our empirical design phase, we explored networks with L ∈ {16, ..., 22} layers.

N_FLOPs = \sum_{l=1}^{L} F_l    (2)
In addition to the VGG adaptation, our architecture also includes a transposed convolutional (TCNN) layer and a gated recurrent unit (GRU) layer. The transposed convolutional layer increases the temporal resolution of features while reducing the number of features from the previous layer [13], while the GRU is utilised to better capture the temporal correlation in the time series and has shown great potential in solving NILM-related tasks [16], [19]. The combination of CNN, TCNN and GRU layers enables the architecture to better capture the spatial and temporal correlations within the time series data.
The resulting architecture is illustrated in Figure 4, where each layer is depicted with its type and hyper-parameters. The architecture comprises four blocks, each consisting of two convolutional layers and one average pooling layer. The number of filters in each block doubles, starting from 64. Following the convolutional blocks, there is a TCNN layer and a GRU layer. Prior to the output layer, there are two fully-connected layers with 4096 nodes each. The number of nodes in the output layer is adjusted to meet the specific requirements and ranges from 5 to 54, depending on the dataset used. All layers utilize the ReLU activation function, except for the output layer, which employs the sigmoid activation function.
A. COMPUTATION EFFICIENCY CONSIDERATIONS
As the purpose of using NILM is to reduce energy consump-
tion, it is logical to ensure that the process itself is as energy-
efficient as possible. Thus, our goal was to design a deep
learning architecture that surpasses the state of the art not only
in performance but also in terms of energy efficiency.
In order to assess the energy consumption of the architec-
ture, it is necessary to calculate its complexity. This typically
involves adding up the total number of FLOPs required for
each layer.
We estimate the complexity of the most energy-consuming
layers, namely the convolutional, pooling, and fully-
connected layers, using the equations presented by Pirnat et
al. in [36]. In addition, we calculate the complexity of the
GRU layer using the equation proposed in [44].
We use equations from [36] to estimate the energy con-
sumption of the proposed architecture and for comparison
also for one popular reference architecture from the VGG
family of architectures, i.e., VGG11 [45], and for the two
architectures used as a reference in performance evaluation,
i.e., TanoniCRNN [16] and VAE-NILM [19]. This is done
under the assumption that the architecture is trained and used on
an Nvidia A100 graphics card and that each kWh of electricity
produced results in 250 g of CO2-equivalent emissions (as is
the case for Slovenia).
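The conversion from complexity to energy and emissions can be sketched as follows. The GPU efficiency constant and the 3x forward/backward factor are placeholder assumptions, not the calibrated A100 figures from [36]; only the 250 g/kWh emission factor is taken from the text above.

```python
def training_energy_joules(flops_per_pass, n_samples, n_epochs,
                           gpu_flops_per_joule=6.2e10):
    """Rough training-energy estimate: total FLOPs (forward pass plus a
    backward pass assumed to cost roughly twice the forward pass) divided
    by an assumed GPU efficiency in FLOPs per joule (placeholder value)."""
    total_flops = 3 * flops_per_pass * n_samples * n_epochs
    return total_flops / gpu_flops_per_joule

def co2_grams(energy_joules, g_co2_per_kwh=250):
    """Convert energy to CO2-equivalent emissions; 1 kWh = 3.6e6 J and
    250 g/kWh is the Slovenian grid figure used in the paper."""
    return energy_joules / 3.6e6 * g_co2_per_kwh
```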
B. EVALUATION DATASETS AND TRAINING PARAMETERS
We first compared the performance of our model to
VGG11 [45], to the model created by Tanoni et al. [16]
adapted to fully supervised DL, and to the results achieved
by Langevin et al. [19]. This comparison was done using
the standard evaluation methodology defined in [16]
that comprises a total of 5 distinct devices: fridge, washing
machine, dish washer, microwave and kettle. These 5 devices
were also used by many recent works [16], [19]–[24], which
did not exactly specify the number of active devices; we
therefore chose to reproduce the setup in [16]. Samples with
varying numbers of active devices, from 1 to 4, are randomly
interspersed throughout the mixed dataset. To create the training
and evaluation parts of the dataset, we used an 80:20 split.
For this comparison we used a learning rate of 0.0003 and
20 epochs for our model while for the TanoniCRNN model
we adopted the parameters specified as optimal in [16], which
include the same number of epochs and a different learning
rate of 0.002. Moreover, we used the same batch size of 128
for both models.
Subsequently, we compared the performance of our model
with that of VGG11 and RF on the two groups of mixed
datasets described in Section III-B. We chose VGG11 as a
benchmark because VGG architectures are adopted in recent
works [46], [47] for classification in NILM due to their
effectiveness, and VGG11 is the closest match in terms of
complexity. RF was chosen as a benchmark because it was
reported to be the best classical ML algorithm for ON/OFF
classification on NILM data in a previous study [11].
Training parameters used for the SE Group are presented in
Table 2 for CtRNN and VGG11 as follows. Each sub-table
includes information about the parameters corresponding to
one combination of model and dataset: the first addresses
CtRNN with REFIT, the second CtRNN with UK-DALE,
while the third and fourth describe VGG11 with REFIT and
UK-DALE, respectively. For each model-dataset combination,
parameters are provided for training data with various DiT,
namely 5, 10, ..., 54, for the SE Group described in Sec-
tion III-B1. The grouped columns include the parameter
values of the architectures, namely the BS, LR and E. BS
denotes the batch size, representing the total number of
training samples that are fed to the architecture at a time. LR
stands for the learning rate, a parameter that determines
the step size at each iteration of the architecture. Finally,
E signifies the number of epochs, one epoch representing
one pass of the training data through the architecture. As can
be seen in the table, the three parameters differ across the
architecture-dataset, DiT and AD combinations. For instance,
for CtRNN with REFIT, 5 DiT and 1 AD, the BS is 512, the
LR is 10^-4 and E is 40.
In summary, the batch size used for training both models
was predominantly set to 128, with some variations of 256
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3382830
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Table 2: Table of training parameters for CtRNN and VGG11, for SE Group mixed datasets derived from REFIT and UK-DALE (BS - batch size; LR - learning rate; E - no. of epochs).

CtRNN REFIT
        | 1 AD             | 2 AD               | 3, 4 AD            | 5, 6, 7, 8 AD      | 9 AD
        | BS   LR       E  | BS   LR        E   | BS   LR        E   | BS   LR        E   | BS   LR        E
 5 DiT  | 512  10^-4    40 | 256  5×10^-4   20  | 128  5×10^-4   20  | /                  | /
10 DiT  | 512  10^-4    40 | 256  5×10^-4   20  | 128  5×10^-4   20  | 128  5×10^-4   20  | 128  10^-4     20
15 DiT  | 512  10^-4    40 | 256  5×10^-4   20  | 128  5×10^-4   20  | 128  5×10^-4   20  | 128  5×10^-4   20
20 DiT  | 512  10^-4    40 | 256  5×10^-4   20  | 128  5×10^-4   20  | 128  5×10^-4   20  | 128  5×10^-4   20

CtRNN UK-DALE
        | 2, 3, 4 AD        | 5, 9 AD           | 11, 13 AD         | 14 AD             | 15, 16, 17 AD     | 24, 26 AD
        | BS   LR       E   | BS   LR       E   | BS   LR       E   | BS   LR       E   | BS   LR       E   | BS   LR       E
 5 DiT  | 128  3×10^-4  20  | /                 | /                 | /                 | /                 | /
10 DiT  | 128  3×10^-4  20  | 128  3×10^-4  20  | /                 | /                 | /                 | /
15 DiT  | 128  3×10^-4  20  | 128  3×10^-4  20  | 128  5×10^-4  20  | 128  5×10^-4  20  | /                 | /
20 DiT  | 128  3×10^-4  20  | 128  3×10^-4  20  | 128  5×10^-4  20  | 128  10^-4    20  | 128  10^-4    20  | /
54 DiT  | 128  10^-4    20  | 128  10^-4    20  | 128  3×10^-4  20  | 128  3×10^-4  20  | 128  3×10^-4  20  | 128  3×10^-4  20

VGG11 REFIT
        | 1 AD             | 2 AD             | 3, 4 AD          | 5, 6, 7, 8, 9 AD
        | BS   LR      E   | BS   LR      E   | BS   LR      E   | BS   LR      E
 5 DiT  | 512  10^-4   50  | 256  10^-4   20  | 128  10^-4   20  | /
10 DiT  | 512  10^-4   50  | 256  10^-4   20  | 128  10^-4   20  | 128  10^-4   20
15 DiT  | 512  10^-4   50  | 256  10^-4   20  | 128  10^-4   20  | 128  10^-4   20
20 DiT  | 512  10^-4   50  | 256  10^-4   20  | 128  10^-4   20  | 128  10^-4   20

VGG11 UK-DALE
        | 2, 3, 4 AD       | 5, 9 AD          | 11, 13, 14 AD    | 15, 16, 17 AD     | 24, 26 AD
        | BS   LR      E   | BS   LR      E   | BS   LR      E   | BS   LR       E   | BS   LR       E
 5 DiT  | 128  10^-4   20  | /                | /                | /                 | /
10 DiT  | 128  10^-4   20  | 128  10^-4   20  | /                | /                 | /
15 DiT  | 128  10^-4   20  | 128  10^-4   20  | 128  10^-4   20  | /                 | /
20 DiT  | 128  10^-4   20  | 128  10^-4   20  | 128  10^-4   20  | 128  5×10^-5  20  | /
54 DiT  | 128  10^-4   20  | 128  10^-4   20  | 128  10^-4   20  | 128  5×10^-5  20  | 128  5×10^-5  20
or 512. The epoch count was set to 20 for both models in
most cases, but it was also set to 50 for VGG11 and 40
for CtRNN in certain scenarios, because they benefited from
more training passes. The learning rate ranged between
10^-4 and 5×10^-5 for VGG11 and between 5×10^-4 and
5×10^-5 for CtRNN, since both required a larger or smaller
step size in certain scenarios. The batch size and the number
of epochs were similar to [16] and fine-tuned through an
empirical process. The learning rate was also empirically
tuned.
The training parameters used for the RE Group had less
variation, always using a batch size of 128 and 20 epochs;
the study can therefore be replicated with the numbers
provided in this paragraph. The learning rate for CtRNN was
0.0003 on both the REFIT and UK-DALE datasets; for VGG11
it was 0.0001 on both datasets. In all tests, the batch size is
selected as the largest we could run or the one that gives the
best results, chosen through trial and error. It is equal for both
models in a given test, as performance varies only slightly
with its size.
C. METRICS
We evaluate the performance with the average weighted F1
score (F1score_w), since a performance evaluation based on a
simple arithmetic mean of the F1 score would fail to provide
an accurate reflection of the overall performance, because
our mixed datasets are generated in a way that does not give
each device equal representation. The use of weights ensures
that all devices affect the average score proportionally to how
often they appear in the particular mixed dataset.
F1score_w = Σ_{i=1}^{N_d} F1score_i × Weight_i    (3)
Average weighted F1 score is based on three metrics: true
positive (TP), false positive (FP), and false negative (FN). TP
represents the cases where the device is correctly classified as
active, FP represents the cases where the device is incorrectly
classified as active, and FN represents the cases where the
device is incorrectly classified as inactive.
Using these metrics, we calculate the precision, Precision =
TP / (TP + FP), and the recall, Recall = TP / (TP + FN), and
from these we derive the F1 score, F1score = 2 × (Precision ×
Recall) / (Precision + Recall). To obtain the average weighted
F1 score defined in Eq. 3, we calculate the F1 score for each
device and then take the average based on its weight, Weight =
SSD / SAD, which is determined by the support for the
specified device (SSD) and the support of all devices (SAD).
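The metric above can be sketched directly from its definitions in Eq. 3. Device names and counts below are illustrative; the weight of a device is computed as its support (TP + FN) over the total support.

```python
def f1_score(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_f1(per_device):
    """Eq. (3): support-weighted average of per-device F1 scores.
    per_device maps device name -> (tp, fp, fn); a device's weight is its
    support (tp + fn) divided by the support of all devices."""
    total_support = sum(tp + fn for tp, _, fn in per_device.values())
    return sum(f1_score(tp, fp, fn) * (tp + fn) / total_support
               for tp, fp, fn in per_device.values())

# Illustrative counts for two hypothetical devices with equal support.
score = weighted_f1({"kettle": (8, 2, 2), "microwave": (5, 5, 5)})
```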
V. PERFORMANCE EVALUATION
Using the proposed methodology with two groups of mixed
datasets, we carried out comprehensive performance evalua-
tion of CtRNN DL architecture and benchmarked it against
selected state-of-the-art architectures in terms of energy effi-
ciency and accuracy of determining the status of devices, as
described in the following.
A. ENERGY CONSUMPTION
The results of our energy consumption evaluation, performed
according to the considerations in Section II-D and method-
ology described in Section IV-B, for training different ar-
chitectures for each mixed dataset from RE Group or SE
Table 3: Energy used in training the proposed CtRNN model in comparison to VGG11, TanoniCRNN and VAE-NILM.

NN                 | parameters   | FLOPs       | energy
CtRNN              | 19.6×10^6    | 0.85×10^9   | 1.51 MJ
VGG11 [45]         | 185.6×10^6   | 1.21×10^9   | 2.15 MJ
TanoniCRNN [16]    | 0.75×10^6    | 1.11×10^9   | 1.97 MJ
VAE-NILM [19]      | 3.8×10^6     | 0.42×10^9   | 13.2 - 263 MJ*

*For VAE-NILM, it is only possible to compute a range of values based on the reported number of epochs [19], between 5 and 100.
Figure 5: Energy used for making predictions with the proposed model in comparison to VGG11, TanoniCRNN and VAE-NILM (energy in MJ on the y-axis versus the number of predictions, 0-10 M, on the x-axis).
Group using batch size 128 are displayed in Table 3. The rows
of the table list the neural-network architectures considered
in this work, namely the proposed CtRNN and the VGG11,
TanoniCRNN and VAE-NILM baselines selected in Section
IV. The first column of the table displays the number of
internal parameters for each of the NNs, which represents the
total number of weights and biases in the NN. The second
column lists the number of FLOPs, i.e., the number
of floating point operations needed for a pass through the
NN. The third column shows the energy consumed during
the training of the models. The values in the second and third
columns are calculated as explained in Section IV-A. Despite
TanoniCRNN having the lowest number of parameters, this
does not result in the lowest number of FLOPs or the lowest
energy consumption. Similarly, although VAE-NILM exhibits
the lowest number of FLOPs, it has the highest energy
consumption. These factors are clearly influenced by additional
training parameters, such as the number of epochs and the batch
size required to achieve satisfactory results. The proposed model
outperforms the state-of-the-art TanoniCRNN in terms of energy
consumption, with 23.3 % less energy consumed on a mixed
dataset tested on five commonly used devices. In addition,
compared to VGG11 on the SE and RE groups, the proposed model
demonstrates 29.7 % less energy consumed.
In addition to the energy used during training, the energy
consumed for making predictions can also be significant
when the number of requests for predictions is high, as
depicted in Figure 5. On the x-axis the figure plots the number
of predictions from 0 to 10 million, while on the y-axis it plots
the consumed energy in megajoules. The results show that in
making 10 M predictions our model consumes 41.8 MJ, while
TanoniCRNN, VGG11 and VAE-NILM consume 54.4 MJ,
59.5 MJ and 11.5 MJ, respectively. VAE-NILM consumes
notably less energy than the other models, which can be
attributed to Langevin et al. [19] using a window of 1024
samples for training, while we used 2550 for all others, and
to its lower number of FLOPs. The figure also shows that for
more than a million predictions, the energy consumed exceeds
the energy used for training the models, with the exception of
VAE-NILM, for which only a range can be computed based
on the number of epochs provided in [19].
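Assuming the linear scaling visible in Figure 5, the break-even point between cumulative inference energy and training energy can be checked with a back-of-the-envelope calculation; the MJ figures are taken from Table 3 and the 10 M-prediction totals quoted above.

```python
def breakeven_predictions(train_energy_mj, inference_mj_per_10m):
    """Number of predictions after which cumulative inference energy
    exceeds training energy, assuming inference energy grows linearly
    with the number of predictions (as in Figure 5)."""
    per_prediction_mj = inference_mj_per_10m / 10e6
    return train_energy_mj / per_prediction_mj

# (training energy MJ, energy MJ for 10 M predictions), from Table 3 / Figure 5.
models = {
    "CtRNN": (1.51, 41.8),
    "TanoniCRNN": (1.97, 54.4),
    "VGG11": (2.15, 59.5),
}
```

For all three models the break-even falls at roughly 0.36 million predictions, consistent with the statement that beyond a million predictions the inference energy exceeds the training energy.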
B. RESULTS ON MIXED DATASET WITH 5 COMMONLY
USED DEVICES
Comparison with Tanoni [16] and Langevin [19] on datasets
of 5 devices derived from the REFIT and UK-DALE demon-
strates superior performance of our proposed model with a
significant gap in F1 score as can be seen in Table 4. The
first column of the table lists the devices, columns 2-4 list the
F1 scores for the models trained on the REFIT dataset, while
columns 5-7 list those for the models trained on the UK-DALE
dataset. For each of the datasets, the subcolumns list the evaluated
architectures, namely CtRNN, TanoniCRNN and VAE-NILM.
Rows 2-6 present per device F1 scores while row 7 provides
the weighted average of the F1 score. It can be seen from the
table that the proposed model displays superior performance
on all devices on both datasets with the exception of fridge
in UK-DALE where its even with TanoniCRNN. That is
also evident from the average weighted F1-score in the final
row. Specifically, on the REFIT derived dataset, our model
achieves an average weighted F1 score of 91% compared to
83 % and 78 % obtained by TanoniCRNN and VAE-NILM,
respectively, an improvement of 8 and 13 percentage points.
On the UK-DALE derived dataset, our model outperforms
TanoniCRNN and VAE-NILM by 7 and 26 percentage points,
respectively, achieving an average weighted F1 score of 94%
compared to their 87 % and 68 %.
A closer analysis of the two best models from Table 4
shows, according to the second row of the table, that our
approach slightly outperforms the approach of [16] in recog-
nizing the fridge class with an F1 of 0.93 compared to 0.92 on
the REFIT dataset, while both approaches work perfectly on
the UK-DALE dataset. However, the fridge class of devices
is easier to identify as its consumption pattern is periodic
and is the most pronounced of all appliances. The real dif-
ference in performance between the two models is seen in the
detection of appliances with short consumption intervals. In
row 3 of Table 4, our method outperforms TanoniCRNN [16]
in detecting washing machines on the REFIT dataset by 5
percentage points, and is 11 percentage points better on the
UK-DALE dataset. When detecting dishwashers in row 4,
our method is 8 percentage points better than [16] on the
Table 4: Average weighted F1 score results for CtRNN compared to TanoniCRNN [16] and VAE-NILM [19] on mixed dataset of 5 devices derived from REFIT and UK-DALE.

                  | REFIT                                        | UK-DALE
devices           | CtRNN   TanoniCRNN [16]   VAE-NILM [19]      | CtRNN   TanoniCRNN [16]   VAE-NILM [19]
fridge            | 0.93    0.92              0.85               | 1.00    1.00              0.81
washing machine   | 0.89    0.84              0.78               | 0.92    0.81              0.74
dish washer       | 0.88    0.80              0.84               | 0.87    0.86              0.65
microwave         | 0.89    0.71              0.59               | 0.96    0.80              0.32
kettle            | 0.95    0.87              0.87               | 0.93    0.86              0.87
weighted avg      | 0.91    0.83              0.78               | 0.94    0.87              0.68
REFIT dataset, while it is 1 percentage point better on the UK-
DALE dataset. The largest difference in performance can be
observed in row 5 for the microwave class. Here, our method
with an F1 of 0.89 is significantly superior compared to [16],
with an F1 of 0.71 on the REFIT dataset. Something similar
can be observed on the UK-DALE dataset, where our method
achieves an F1 score of 0.96 compared to the F1 score of 0.80
for [16]. If we look at the kettle class in row 6 of Table 4, we
can again see that our method outperforms [16] by 8 and 7
percentage points in both the REFIT and UK-DALE datasets,
respectively.
These results show that our proposed architecture is
significantly better at detecting appliances with shorter
consumption duration, compared to [16]. The reason is that our
architecture design is superior at detecting both spatial and
temporal correlation within the signal. Spatial correlations are
detected by the convolutional layers, while temporal correlations
are captured by the GRU layers of the architecture described in
Section IV and depicted in Figure 4.
C. RESULTS WITH SE GROUP MIXED DATASETS
The results of comparing the performance of CtRNN, VGG11
and RF models on Sensitivity Evaluation Group mixed
datasets are displayed in Figure 6. Figures 6 a-c exhibit
heatmaps presenting the evaluation results of the models on
the REFIT dataset, whereas Figures 6.e-g depict heatmaps
displaying the evaluation results on the UK-DALE dataset.
Figure 6d and Figure 6h, on the other hand, illustrate the
probabilities of obtaining correct results by random guess,
calculated with Eq. 4.

P_GroupSE = 1 / C(DiT, N_AD),    (4)

where C(·, ·) denotes the binomial coefficient.
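Eq. 4 can be computed directly with the binomial coefficient; the function name below is ours.

```python
from math import comb

def p_group_se(dit, n_ad):
    """Eq. (4): probability of a correct random guess when exactly n_ad of
    dit devices are active -- one favourable outcome out of C(dit, n_ad)
    possible device subsets of that size."""
    return 1 / comb(dit, n_ad)

# For 5 devices in total: guessing 1 active device succeeds with p = 0.2,
# guessing 2 active devices with p = 0.1 (cf. the 5 DiT curve in Figure 7a).
```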
As can be seen from Figures 6a-d, for the REFIT dataset,
the accuracy of the model is affected both by the increase
in the number of active devices (AD) and by the increase
in the total number of devices (DiT) that can be active at
the same time. The average performance degradation of our
model is 9.2 percentage points per each 5 DiT added on the
REFIT dataset and 5.4 percentage points on the UK-DALE
dataset. However, as can be seen for all three utilised classifiers
and the theoretical calculation, the classification performance
increases when the number of AD approaches the number of
DiT. All three classifier models significantly outperform the
random classifier, with our approach CtRNN being the best
of the three. Looking at
the third row of heatmaps, depicting results for 15 DiT
in Figures 6a-c, it can be seen that our approach achieves
scores above 71 %, while the models based on VGG11 and RF
achieve scores above 57 % and 56 %, respectively. All models
significantly outperform random chance, which reaches values
as low as 0.02 %.
Similar observations can be made in Figures 6e-h
for the UK-DALE dataset. All three models significantly
outperform the random model, and again our approach
achieves the highest score across all tested scenarios. Looking
at the third row of heatmaps, depicting results for 15 DiT
in Figures 6e-g, it can be seen that our approach achieves
slightly lower accuracy scores, by up to 0.02, compared to the
VGG11 and RF algorithms when the number of AD approaches
the number of DiT. The reason is that our approach is less
prone to overfitting than the other two approaches, which is
supported by the fact that for up to 11 AD our approach
significantly outperforms the other two approaches, by up to
18 percentage points.
To summarize the difference between the results of the
different models, we calculated the average improvement (I)
across all mixed datasets in the SE Group using Eq. 5. Our
model outperforms the VGG11 model by 11.03 and 9.4
percentage points on the REFIT and UK-DALE derived
datasets, respectively. Compared to the RF model, our model
achieves an even greater improvement, with 14.15 and 13.88
percentage points on the REFIT and UK-DALE derived
datasets, respectively.

I = (1 / N_datasets) Σ_{n=1}^{N_datasets} (F1score_w,CtRNN,n − F1score_w,X,n),    (5)

where X stands for the model being compared against.
From the analysis of heatmaps in Figure 6 we notice that
once the number of AD surpasses 50 % of DiT in the mixed
dataset, the chance of correct classification increases. This
trend is clearly visible for both REFIT and UK-DALE in
the lines with 5, 10 and 15 DiT and less so in the line
with 20 DiT and for UK-DALE in the line with 54 DiT. To
better understand the trend, we calculate the probability of
correctly classifying devices by random guess in SE Group
mixed datasets with Eq. 4 and depict the results in Figure 7.
The x-axis represents the number of active devices (AD)
Figure 6: Results from RF, VGG11 and CtRNN on the SE Group of mixed datasets. Panels (a)-(c) show average weighted F1 score heatmaps for CtRNN, VGG11 and RF on REFIT and panels (e)-(g) the same on UK-DALE, with rows giving the number of devices in total (DiT) and columns the number of active devices (AD); panels (d) and (h) show the corresponding random-guess probabilities.
Figure 7: Probability of accurately determining the status (ON/OFF) of devices through random chance, plotted on a logarithmic scale against the number of active devices (AD), with one curve per DiT value: (a) REFIT, (b) UK-DALE. The impact of this chance is reflected in the results from RF, VGG11 and CtRNN on the SE Group of mixed datasets, which are partially numerically aligned with these probability lines.
out of the devices in total (DiT), while the y-axis represents the
probability of guessing the results correctly. We display a
curve for each number of DiT in the SE Group, thus showcasing
the probability for all employed combinations of AD and DiT.
Consider the curve for 5 DiT in Figure 7a. It can be seen that
when randomly picking 1 AD out of 5 DiT, the likelihood
of the guess being correct is 20 %. Next, the likelihood of
correctly guessing 2 AD out of 5 DiT is 10 %, 3 out of 5 is
10 %, while 4 out of 5 is 20 %. The results in the figures show that, for a
small number of AD or a number of AD that is comparable
with the number of DiT, the random guess works the best,
while for the cases when the number of AD is about 50% of
the number of DiT it yields the worst results. That is because
probability is expressed with combinations as shown in Eq.
4, since the order of predicted active devices doesn’t matter.
Looking at Figure 7 we notice similar decrease and increase
in performance as previously seen in rows of the heatmaps in
Figure 6.
D. RESULTS WITH RE GROUP MIXED DATASETS
The results of comparing the performance of the CtRNN, VGG11
and RF models on RE Group mixed datasets are displayed in
Figure 8. Figures 8a-c show heatmaps presenting results
from the evaluation on REFIT, while Figures 8e-g show
heatmaps with results from the evaluation on UK-DALE.
Figures 8d and 8h contain heatmaps with the probabilities
of obtaining correct results randomly, calculated with Eq. 6.
We observe that with the increase in DiT the accuracy
of classification decreases in all cases, which is consistent
with our prior observations related to performance being
connected with the proportion between ADs and DiTs. The
average performance degradation of our model is 9 percentage
points per each 5 DiT added on the REFIT dataset and 4.3
percentage points on the UK-DALE dataset. Random probability
of correct classification, calculated by Eq. 6 is much lower
compared to the accuracy of the models. For example, on
REFIT dataset in the row with 15 DiT, our model achieves a
score of 71 %, VGG11 achieves 58 % and RF achieves 63%,
whereas the random probability is rounded to 0 %.
P_GroupRE = 1 / (2^DiT − C(DiT, 0) − Σ_{k=N_AD+1}^{DiT} C(DiT, k))    (6)
We calculate the average improvement over the entire RE
Group of mixed datasets using Eq. 5. Our model reaches
results that are 11.32 percentage points better than VGG11
and 9.22 percentage points better than RF on the REFIT derived
dataset, and 8.07 percentage points better than VGG11 and
9.46 percentage points better than RF on the UK-DALE derived
dataset.
E. END PERFORMANCE COMPARISON
Assuming a quick technology selection for an application
requiring ON/OFF classification would need to be performed
based only on the end performance, type of ML and number
of devices, we summarize in Table 5 the required information.
To compile Table 5, we take the related ON/OFF classification
works summarized in Table 1 and analyzed in Section
II, and we extract the best final results from the respective
papers, except for Tanoni in the fourth row of the table, where
we present the results reported in this work. The exception
for Tanoni is due to the fact that the original work [16]
employs weak supervision, so the results are slightly worse
than those achieved with supervised learning in all other works.
For fairness to Tanoni, we re-ran their experiments in a fully
supervised manner and report those results.
The first column of the related works in Table 5 lists the
considered ON/OFF classification works, the second lists the
type of ML approach (classical or deep), the third provides
the specific method, the fourth the dataset and result when
training with it, the fifth lists the type of sampling of the
energy data while the last lists the number of considered
devices. From the results it can be seen that Wu et al. achieved
the highest F1 score of 98 %; however, this was achieved on the
HF dataset BLUED. When a signal is sampled with higher
Figure 8: Results from RF, VGG11 and CtRNN on the RE Group of mixed datasets. Each heatmap reports one average weighted F1 score per DiT row (5/10/15/20 DiT), with the number of AD ranging from 1 up to 4, 8, 9 or 14 depending on the dataset. On REFIT (panels a-c): CtRNN 91/84/71/64 %, VGG11 86/74/58/48 %, RF 79/78/63/55 %; on UK-DALE (panels e-g): CtRNN 98/93/88/85 %, VGG11 94/87/80/71 %, RF 92/83/79/72 %; the random-guess probabilities (panels d and h) are 3.33/0.1/~0/~0 %.
Table 5: Summary of related work for multi-label classification on NILM.

Work                     | Approach Type  | Approach  | Reported Dataset  | avg. F1  | Evaluation Dataset  | avg. F1  | Type     | Devices no.
Tabatabaei et al. [29]   | Classic ML     | MLkNN     | REDD              | 0.528    | /                   | /        | LF       | 5
Wu et al. [11]           | Classic ML     | RF        | BLUED             | 0.98     | /                   | /        | HF       | 5
Singh et al. [30]        | Classic ML     | SRC       | REDD              | 0.70     | Pecan Street        | 0.71     | LF, LF   | 4
Tanoni et al. [16]       | DL             | CRNN      | REFIT             | 0.83     | UK-DALE             | 0.87     | LF, LF   | 5
Langevin et al. [19]     | DL             | CRNN      | REFIT             | 0.78     | UK-DALE             | 0.68     | LF, LF   | 5
This work                | DL             | CtRNN     | REFIT             | 0.91     | UK-DALE             | 0.94     | LF, LF   | 5
frequency, higher definition data is available, and it is therefore
easier to recognize its shape compared to a signal that is
sampled with less granularity. When compared to other similar
models developed on LF data, our model reached scores
around 92.5 %, surpassing the others. TanoniCRNN ranked
second on the LF datasets, with scores around 85 %, while
Langevin et al. and Singh et al. had similar scores of
approximately 70 %. The result reported by Tabatabaei et al.
ranked the lowest, at 53 %.
F. LIMITATIONS AND FUTURE WORK
The limitations of the study presented in this paper are
twofold. First, we show that the approach is not robust to
the increasing number of devices that characterizes modern
households. Unlike prior works, we quantify the drop in
performance, which we mostly attribute to the imbalanced
nature of the available datasets, in which some devices occur
more frequently than others. Second, the empirical design of
the proposed architecture could be automated to find a superior
architecture with respect to both performance and energy
efficiency.
While there are already a number of works on ON/OFF
classification that improve on prior ones, including this study,
we see three main lines of future work as follows.
Benchmark dataset
Machine learning communities typically rely on benchmark
datasets that are used in all model evaluations. A good
direction for future work is to take the existing datasets
that are suitable for ON/OFF classification and generate a
harmonized set suitable for training. Besides the harmonized
set, balanced versions could also be created, using statistical
oversampling or under-sampling methods such as ADASYN
and AllKNN. A more general simulator than the one in [48]
could also be developed to permit complementing measured data.
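As a minimal illustration of the balancing idea, a naive random-oversampling sketch is shown below. It is a stand-in for the more sophisticated ADASYN and AllKNN methods mentioned above (available in the imbalanced-learn library), and assumes single-label groups for simplicity rather than the multi-label NILM setting.

```python
import random

def random_oversample(samples, labels):
    """Naive random oversampling: duplicate samples of under-represented
    labels until every label matches the majority count."""
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_label.values())
    out_samples, out_labels = [], []
    for label, group in by_label.items():
        # Keep the originals and add random duplicates up to the target count.
        picks = group + [random.choice(group) for _ in range(target - len(group))]
        out_samples += picks
        out_labels += [label] * target
    return out_samples, out_labels
```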
Generic model development
Current ON/OFF classification models cannot simply be
downloaded and used in a real application. While there is recent
work on transferring models across households, it has
limitations, most notably a drop in performance [15]. For
text, the GPT breakthrough has shown that an architecture
can be trained on large amounts of unlabelled data to capture
sufficient knowledge and then further trained on labelled data
to better structure that knowledge. The same could be done
for ON/OFF classification, where a model trained on large
amounts of general time series data could be adapted for the
domain at hand. The development of such a model could also
significantly lower the adoption barrier of this technology.
Role in smart energy management
According to [5], households consume 12 % less energy if they
receive specific feedback on the consumption of individual
devices. Furthermore, increasingly automated energy
management systems may rely on machine learning models to
detect appliances and forecast usage. Quantifying the
relationship between the performance of an ML model (e.g., F1
score, MSE), its energy consumption and the energy its
decisions help save is a worthy line of research. For instance,
if on average the household consumption drops by 12 % [5]
xiii
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3382830
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
and the model on average recognizes only 94% of the devices
correctly, as in this study, what would be the impact on the
trust and behaviour of the user? Also, misclassifying an air
conditioning device that consumes significant energy may
have a higher cost than misclassifying a water heater.
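This trade-off can be made concrete with a back-of-the-envelope calculation. All figures except the 12% savings from [5] and the 94% recognition rate from this study are assumed purely for illustration:

```python
# Hypothetical household: 3500 kWh/year baseline consumption (assumed).
baseline_kwh = 3500.0

# Feedback on individual devices saves 12% on average [5]; assume the
# savings only materialize for correctly recognized devices (94% in
# this study), scaling the benefit accordingly.
savings_kwh = baseline_kwh * 0.12 * 0.94

# Assumed yearly energy cost of running the NILM model for one
# household (training amortized plus inference) -- illustrative only.
model_cost_kwh = 5.0

net_benefit_kwh = savings_kwh - model_cost_kwh
```

Under these assumptions the model enables roughly 395 kWh of savings per household per year, so its own consumption is negligible; the calculation also shows why per-appliance misclassification costs (air conditioner vs. water heater) deserve separate weighting.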
VI. CONCLUSIONS
In this paper, we proposed CtRNN, a new DL architecture used in our models, paying special attention to improving its energy efficiency during training and operation as well as its performance compared to the state of the art and other similar models. In developing the architecture, we used a typical VGG-family architecture as a starting point, adapted it to time series data, and reduced its computational complexity by reducing the number of convolutional layers in some blocks, replacing one block with a single transposed convolution layer, and adding a GRU layer. We benchmarked the proposed model against other similar models, showing that it is possible to develop new DL models for NILM ON/OFF classification that provide a major improvement in both performance and energy efficiency, which results in lower energy consumption.
We also proposed a new methodology with two tests that more realistically assess the performance of NILM ON/OFF classification algorithms. They use groups of multiple mixed datasets, derived from measurement datasets with the specifics of real-world use cases in mind. One group covers the numbers of active devices commonly used within the time window of the learning samples in separate datasets. The other group covers a mixed number of devices, from 1 to the average maximum number of devices used within the time window of the learning sample. Our findings demonstrate that the proposed methodology is necessary to obtain results that reflect more realistic situations. The obtained results indicate that the commonly used testing methodology can lead to overly optimistic conclusions, underscoring the importance of a more rigorous evaluation framework.
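The gist of synthesizing such mixed test samples can be sketched as follows. The per-appliance traces and window length here are invented placeholders (the actual tests are derived from the REFIT and UK-DALE measurement datasets), and `synthesize_window` is a hypothetical helper:

```python
import numpy as np

def synthesize_window(appliance_traces, n_active, rng):
    """Build one multi-label test sample: the aggregate of n_active
    randomly chosen appliance traces plus the ON/OFF label vector."""
    n_appliances, window_len = appliance_traces.shape
    active = rng.choice(n_appliances, size=n_active, replace=False)
    aggregate = appliance_traces[active].sum(axis=0)
    labels = np.zeros(n_appliances, dtype=int)
    labels[active] = 1
    return aggregate, labels

rng = np.random.default_rng(0)
# Placeholder per-appliance power traces: 10 appliances, 60 samples each.
traces = rng.uniform(0, 200, size=(10, 60))

# First group: a fixed, commonly observed number of active devices.
agg, lab = synthesize_window(traces, n_active=5, rng=rng)

# Second group: the number of active devices mixed from 1 up to a
# per-dataset maximum (here assumed to be 7).
agg2, lab2 = synthesize_window(traces, n_active=rng.integers(1, 8), rng=rng)
```

Evaluating a model on many such windows, rather than on isolated single-appliance samples, is what exposes the over-optimism of the conventional testing methodology.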
Moreover, as part of the performance evaluation we also compared DL approaches with the best classical ML approach according to related work, and we concluded that DL approaches have a much higher performance potential. In our experiments, we observed on average an approximately 12 percentage point advantage of our model over the best classical ML approach.
References
[1] A. Q. Al-Shetwi, M. Hannan, K. P. Jern, M. Mansur, and T. Mahlia, "Grid-connected renewable energy sources: Review of the recent integration requirements and control methods," Journal of Cleaner Production, vol. 253, p. 119831, 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0959652619347018
[2] J. Aghaei and M.-I. Alizadeh, "Demand response in smart electricity grids equipped with renewable energy sources: A review," Renewable and Sustainable Energy Reviews, vol. 18, pp. 64–72, 2013. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1364032112005205
[3] R. Gopinath, M. Kumar, C. Prakash Chandra Joshua, and K. Srinivas, "Energy management using non-intrusive load monitoring techniques – state-of-the-art and future research directions," Sustainable Cities and Society, vol. 62, p. 102411, 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2210670720306326
[4] G. Hart, "Nonintrusive appliance load monitoring," Proceedings of the IEEE, vol. 80, no. 12, pp. 1870–1891, 1992.
[5] K. Ehrhardt-Martinez, K. A. Donnelly, S. Laitner et al., "Advanced metering initiatives and residential feedback programs: a meta-review for household electricity-saving opportunities," American Council for an Energy-Efficient Economy, Washington, DC, 2010.
[6] A. Rahimpour, H. Qi, D. Fugate, and T. Kuruganti, "Non-intrusive energy disaggregation using non-negative matrix factorization with sum-to-k constraint," IEEE Trans. on Power Systems, vol. 32, no. 6, pp. 4430–4441, 2017.
[7] J. Kolter, S. Batra, and A. Ng, "Energy disaggregation via discriminative sparse coding," Advances in Neural Information Processing Systems, vol. 23, 2010.
[8] S. Singh and A. Majumdar, "Deep sparse coding for non-intrusive load monitoring," IEEE Trans. on Smart Grid, vol. 9, no. 5, pp. 4669–4678, 2017.
[9] M. Figueiredo, B. Ribeiro, and A. de Almeida, "Electrical signal source separation via nonnegative tensor factorization using on site measurements in a smart home," IEEE Trans. on Instrumentation and Measurement, vol. 63, no. 2, pp. 364–373, 2013.
[10] K. T. Chui, M. D. Lytras, and A. Visvizi, "Energy sustainability in smart cities: Artificial intelligence, smart monitoring, and optimization of energy consumption," Energies, vol. 11, no. 11, p. 2869, 2018.
[11] X. Wu, Y. Gao, and D. Jiao, "Multi-label classification based on random forest algorithm for non-intrusive load monitoring system," Processes, vol. 7, no. 6, 2019. [Online]. Available: https://www.mdpi.com/2227-9717/7/6/337
[12] B. Buddhahai, W. Wongseree, and P. Rakkwamsuk, "A non-intrusive load monitoring system using multi-label classification approach," Sustainable Cities and Society, vol. 39, pp. 621–630, 2018. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2210670717315366
[13] M. Zhou, S. Shao, X. Wang, Z. Zhu, and F. Hu, "Deep learning-based non-intrusive commercial load monitoring," Sensors, vol. 22, no. 14, 2022. [Online]. Available: https://www.mdpi.com/1424-8220/22/14/5250
[14] L. Massidda, M. Marrocu, and S. Manca, "Non-intrusive load disaggregation by convolutional neural network and multilabel classification," Applied Sciences, vol. 10, no. 4, 2020. [Online]. Available: https://www.mdpi.com/2076-3417/10/4/1454
[15] B. Bertalanič and C. Fortuna, "Carmel: Capturing spatio-temporal correlations via time-series sub-window imaging for home appliance classification," Engineering Applications of Artificial Intelligence, vol. 127, p. 107318, 2024.
[16] G. Tanoni, E. Principi, and S. Squartini, "Multi-label appliance classification with weakly labeled data for non-intrusive load monitoring," IEEE Trans. on Smart Grid, pp. 1–1, 2022.
[17] M. W. Asres, L. Ardito, and E. Patti, "Computational cost analysis and data-driven predictive modeling of cloud-based online-NILM algorithm," IEEE Transactions on Cloud Computing, vol. 10, no. 4, pp. 2409–2423, 2022.
[18] G.-F. Angelis, C. Timplalexis, S. Krinidis, D. Ioannidis, and D. Tzovaras, "NILM applications: Literature review of learning approaches, recent developments and challenges," Energy and Buildings, vol. 261, p. 111951, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0378778822001220
[19] A. Langevin, M.-A. Carbonneau, M. Cheriet, and G. Gagnon, "Energy disaggregation using variational autoencoders," Energy and Buildings, vol. 254, p. 111623, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0378778821009075
[20] Y. Pan, K. Liu, Z. Shen, X. Cai, and Z. Jia, "Sequence-to-subsequence learning with conditional GAN for power disaggregation," in 2020 IEEE International Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 3202–3206.
[21] M. D'Incecco, S. Squartini, and M. Zhong, "Transfer learning for non-intrusive load monitoring," IEEE Trans. on Smart Grid, vol. 11, no. 2, pp. 1419–1429, 2020.
[22] L. Wang, S. Mao, B. M. Wilamowski, and R. M. Nelms, "Pre-trained models for non-intrusive appliance load monitoring," IEEE Trans. on Green Communications and Networking, vol. 6, no. 1, pp. 56–68, 2022.
[23] C. Zhang, M. Zhong, Z. Wang, N. Goddard, and C. Sutton, "Sequence-to-point learning with neural networks for non-intrusive load monitoring," in 32nd AAAI Conf. on Artificial Intelligence (AAAI-18), 2018. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/11873
[24] J. Kelly and W. Knottenbelt, "Neural NILM: Deep neural networks applied to energy disaggregation," in 2nd ACM International Conf. on Embedded Systems for Energy-Efficient Built Environments (BuildSys '15). New York, NY, USA: Association for Computing Machinery, 2015, pp. 55–64. [Online]. Available: https://doi.org/10.1145/2821650.2821672
[25] ——, "Neural NILM: Deep neural networks applied to energy disaggregation," in Proceedings of the 2nd ACM International Conference on Embedded Systems for Energy-Efficient Built Environments, 2015, pp. 55–64.
[26] G. A. Raiker, U. Loganathan, S. Agrawal, A. S. Thakur, K. Ashwin, J. P. Barton, M. Thomson et al., "Energy disaggregation using energy demand model and IoT-based control," IEEE Trans. on Industry Applications, vol. 57, no. 2, pp. 1746–1754, 2020.
[27] F. Ciancetta, G. Bucci, E. Fiorucci, S. Mari, and A. Fioravanti, "A new convolutional neural network-based system for NILM applications," IEEE Trans. on Instrumentation and Measurement, vol. 70, pp. 1–12, 2021.
[28] J. Chen, X. Wang, X. Zhang, and W. Zhang, "Temporal and spectral feature learning with two-stream convolutional neural networks for appliance recognition in NILM," IEEE Trans. on Smart Grid, vol. 13, no. 1, pp. 762–772, 2022.
[29] S. M. Tabatabaei, S. Dick, and W. Xu, "Toward non-intrusive load monitoring via multi-label classification," IEEE Trans. on Smart Grid, vol. 8, no. 1, pp. 26–40, 2017.
[30] S. Singh and A. Majumdar, "Non-intrusive load monitoring via multi-label sparse representation-based classification," IEEE Trans. on Smart Grid, vol. 11, no. 2, pp. 1799–1801, 2020.
[31] H. Çimen, E. J. Palacios-Garcia, N. Çetinkaya, J. C. Vasquez, and J. M. Guerrero, "A dual-input multi-label classification approach for non-intrusive load monitoring via deep learning," in 2020 Zooming Innovation in Consumer Technologies Conference (ZINC), 2020, pp. 259–263.
[32] Z. Zhou, Y. Xiang, H. Xu, Y. Wang, and D. Shi, "Unsupervised learning for non-intrusive load monitoring in smart grid based on spiking deep neural network," Journal of Modern Power Systems and Clean Energy, vol. 10, no. 3, pp. 606–616, 2022.
[33] H. Yin, K. Zhou, and S. Yang, "Non-intrusive load monitoring by load trajectory and multi feature based on DCNN," IEEE Trans. on Industrial Informatics, pp. 1–12, 2023.
[34] D. Li, K. Sawyer, and S. Dick, "Disaggregating household loads via semi-supervised multi-label classification," in 2015 Annual Conf. of the North American Fuzzy Information Processing Society (NAFIPS), 2015, pp. 1–5.
[35] M. A. A. Rehmani, S. Aslam, S. R. Tito, S. Soltic, P. Nieuwoudt, N. Pandey, and M. D. Ahmed, "Power profile and thresholding assisted multi-label NILM classification," Energies, vol. 14, no. 22, 2021. [Online]. Available: https://www.mdpi.com/1996-1073/14/22/7609
[36] A. Pirnat, B. Bertalanič, G. Cerar, M. Mohorčič, M. Meža, and C. Fortuna, "Towards sustainable deep learning for wireless fingerprinting localization," in IEEE International Conference on Communications (ICC 2022), 2022, pp. 3208–3213.
[37] M. A. Ahajjam, C. Essayeh, M. Ghogho, and A. Kobbane, "On multi-label classification for non-intrusive load identification using low sampling frequency datasets," in 2021 IEEE International Instrumentation and Measurement Technology Conference (I2MTC), 2021, pp. 1–6.
[38] G. Hsueh, Carbon Footprint of Machine Learning Algorithms. Senior Projects Spring 2020, no. 296, 2020. [Online]. Available: https://digitalcommons.bard.edu/senproj_s2020/296
[39] M. Verhelst and B. Moons, "Embedded deep neural network processing: Algorithmic and processor techniques bring deep learning to IoT and edge devices," IEEE Solid-State Circuits Magazine, vol. 9, no. 4, pp. 55–65, 2017.
[40] E. García-Martín, C. F. Rodrigues, G. Riley, and H. Grahn, "Estimation of energy consumption in machine learning," Journal of Parallel and Distributed Computing, vol. 134, pp. 75–88, 2019. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0743731518308773
[41] D. Murray, L. Stankovic, and V. Stankovic, "An electrical load measurements dataset of United Kingdom households from a two-year longitudinal study," Scientific Data, vol. 4, no. 1, pp. 1–12, 2017. [Online]. Available: https://doi.org/10.1038/sdata.2016.122
[42] J. Kelly and W. Knottenbelt, "The UK-DALE dataset, domestic appliance-level electricity demand and whole-house demand from five UK homes," Scientific Data, vol. 2, no. 1, pp. 1–14, 2015.
[43] D. García-Pérez, D. Pérez-López, I. Díaz-Blanco, A. González-Muñiz, M. Domínguez-González, and A. A. Cuadrado Vega, "Fully-convolutional denoising auto-encoders for NILM in large non-residential buildings," IEEE Trans. on Smart Grid, vol. 12, no. 3, pp. 2722–2731, 2021.
[44] M. Zhang, W. Wang, X. Liu, J. Gao, and Y. He, "Navigating with graph representations for fast and scalable decoding of neural language models," Advances in Neural Information Processing Systems, vol. 31, 2018.
[45] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in 3rd International Conference on Learning Representations (ICLR 2015). Computational and Biological Learning Society, 2015.
[46] W. Kong, Z. Y. Dong, B. Wang, J. Zhao, and J. Huang, "A practical solution for non-intrusive type II load monitoring based on deep learning and post-processing," IEEE Trans. on Smart Grid, vol. 11, no. 1, pp. 148–160, 2020.
[47] D. Yang, X. Gao, L. Kong, Y. Pang, and B. Zhou, "An event-driven convolutional neural architecture for non-intrusive load monitoring of residential appliance," IEEE Trans. on Consumer Electronics, vol. 66, no. 2, pp. 173–182, 2020.
[48] C. Klemenjak, C. Kovatsch, M. Herold, and W. Elmenreich, "A synthetic energy dataset for non-intrusive load monitoring in households," Scientific Data, vol. 7, no. 1, p. 108, 2020.
ANŽE PIRNAT was born in Ljubljana, Slovenia, in 2001. He received the bachelor's degree in electrical engineering in 2023 from the Faculty of Electrical Engineering, University of Ljubljana. He is currently pursuing a master's degree in electrical engineering and working as a research intern at the Jožef Stefan Institute. His work and interests include ML, DL, AI, feature stores, and MLOps.
BLAŽ BERTALANIČ is a research assistant (junior researcher) at the Department of Communication Systems and a third-year PhD student at the Faculty of Electrical Engineering, University of Ljubljana. His main research interests are in time series analysis, including machine learning techniques for motif detection in time series and techniques for dimensionality expansion of time series, resulting in an IF > 10 journal publication in his first year of PhD studies. He has co-authored 10+ peer-reviewed scientific publications and is the chair of IEEE YP Slovenia. He is currently involved in the H2020 INERGY, H2020 BD4NRG, and HE ENERSHARE projects, and is also a work package leader on the HE NANCY project.
GREGOR CERAR received his Bachelor's (2013) and Master's (2016) degrees from the Faculty of Electrical Engineering of the University of Ljubljana, where he completed the Telecommunications study programme, and the Ph.D. degree (2021) in Information and Communication Technologies from the Jožef Stefan International Postgraduate School, with the Department of Communication Systems, Jožef Stefan Institute. He is currently a research associate with the Department of Communication Systems, Jožef Stefan Institute.
MIHAEL MOHORČIČ is head of the Department
of Communication Systems and Scientific Advisor
at the Jozef Stefan Institute, and associate pro-
fessor at the Jozef Stefan International Postgrad-
uate School. His research and working experience
include development and performance evaluation
of network protocols and architectures for mobile
and wireless communication systems, and resource
management in terrestrial, stratospheric and satel-
lite networks. His recent research interest is fo-
cused on cognitive radio networks, cross-layer protocol design and optimiza-
tion, “smart” applications of wireless sensor networks, dynamic composition
of communication services and wireless experimental testbeds. He participated in several COST actions and FP projects covering terrestrial and satellite mobile communications, stratospheric telecommunication systems and wireless sensor networks, as well as in basic and applied national projects. He is a Senior Member of IEEE (VTS and ComSoc).
CAROLINA FORTUNA is a Senior Research Fellow at the Jozef Stefan Institute and leads SensorLab. In 2017 she was visiting the Infolab at Stanford University, USA; she was a postdoc at Ghent University in Belgium in 2014-2015; and she finished her PhD on composable communication services using symbolic AI in Ljubljana, Slovenia. Her research focuses on developing the next generation of smart infrastructures that surround us and improve the quality of our lives. Leveraging advanced computational techniques, such as machine learning and symbolic AI, her group designs new cutting-edge components and systems. She has led, in various roles, teams for 7 EU-funded projects, published over 100 peer-reviewed scientific works, and contributed to community work as a track chair, TPC member and reviewer.
... This method requires less data and no need for data augmentation (DA) approaches. Reference [17] proposes a novel Convolutional transpose Reccurrent Neural Network (CtRNN) architecture focusing on reduced computational complexity and improving energy efficiency. Reference [18] proposes the few-shot Transfer learning (TL) based on metalearning and relational network to improve the load recognition generalization performance, which does not require complex inference and recurrent structures. ...
Article
Full-text available
The loads that have several working states cannot be accurately distinguished by the conventional Non-Intrusive Load Monitoring (NILM) methods. This paper proposed an improved NILM method based on the Resnet18 Convolutional Neural Network (CNN) and Support Vector Machine (SVM) algorithm to address the misidentification of multi-state appliances. The V-I trajectories of loads are at first classified with Resnet18. Then, load features with low redundancy is obtained through the Max-Relevance and Min-Redundancy (mRMR) feature selection algorithm from various operating states of loads that were not successfully classified. The SVM algorithm is developed for two-stage identification to achieve high accuracy of classification for identifying the multi-state appliances quickly. This proposed NILM method can significantly improve the accuracy of identification for multi-state loads. Finally, the Plaid dataset is acquired to validate the effectiveness and accuracy of the proposed method.
Article
Full-text available
Due to growing population and technological advances, global electricity consumption is increasing. Although CO2 emissions are projected to plateau or slightly decrease by 2025 due to the adoption of clean energy sources, they are still not decreasing enough to mitigate climate change. The residential sector makes up 25% of global electricity consumption and has potential to improve efficiency and reduce CO2 footprint without sacrificing comfort. However, a lack of uniform consumption data at the household level spanning multiple regions hinders large-scale studies and robust multi-region model development. This paper introduces a multi-region dataset compiled from publicly available sources and presented in a uniform format. This data enables machine learning tasks such as disaggregation, demand forecasting, appliance ON/OFF classification, etc. Furthermore, we develop an RDF knowledge graph that characterizes the electricity consumption of the households and contextualizes it with household-related properties enabling semantic queries and interoperability with other open knowledge bases like Wikidata and DBpedia. This structured data can be utilized to inform various stakeholders towards data-driven policy and business development.
Article
Full-text available
Demand Response (DR) has become a key strategy for enhancing energy system sustainability and reducing costs. Deep Learning (DL) has emerged as crucial for managing DR's complexity and large data volumes, enabling near real-time decision-making. DL techniques can effectively tackle challenges such as selecting responsive users, understanding consumption behaviours, optimizing pricing, monitoring and controlling devices, engaging more consumers in DR schemes, and determining fair remuneration for participants. This research work presents an integrated architecture for smart grid energy management, combining a Deep Attention-Enhanced Sequence-to-Sequence Model (AES2S) with Energy-Aware Optimized Reinforcement Learning (EAORL). The objective is to design a system that performs non-intrusive load monitoring and optimizes demand response to enhance energy efficiency while maintaining user comfort. The AES2S module accurately performs appliance state identification and load disaggregation using convolutional layers, Enhanced Sequence-to-Sequence Model networks, and an attention mechanism. The EAORL module employs a multi-agent system, where each agent uses a Deep Q-Learning Network to learn optimal policies for adjusting energy consumption in response to grid conditions and user demand. The system uses an Iterative Policy Update mechanism, where agents update their policies sequentially, ensuring stable and effective learning. The integration ensures seamless data flow, with AES2S outputs enhancing EAORL state representations. Validated in a simulated smart grid environment, the architecture dynamically adjusts energy consumption, demonstrating significant improvements in energy efficiency, cost reduction, and user comfort. Evaluation metrics confirm the system's effectiveness, making AES2S-EAORL a robust solution for smart grid energy management and demand response optimization.
Article
Full-text available
This article proposes a non-intrusive load monitoring (NILM) framework based on a deep convolutional neural network (DCNN) to profile each household appliance on/off status and the residential power consumption. It uses only load trajectory, which can overcome the limitations of existing voltage-current trajectory NILM techniques. The DCNN architecture with a load trajectory as the input enables the NILM to directly analyze the electricity consumption at the appliance-level. Meanwhile, the temporal feature transferring procedure improves load monitoring performance and extends its application range include monitoring appliances based on multiple and combined characteristics. Furthermore, the power variation augmentation technique enhances the load signature uniqueness. The fusion of temporal and power variation features provides rich identification information for NILM and improves the accuracy of appliance identification. Experimental results demonstrate that the proposed NILM framework is effective and superior for enhancing demand side management and energy efficiency.
Article
Full-text available
This paper investigates the intelligent load monitoring problem with applications to practical energy management scenarios in smart grids. As one of the critical components for paving the way to smart grids' success, an intelligent and feasible non-intrusive load monitoring (NILM) algorithm is urgently needed. However, most recent researches on NILM have not dealt with practical problems when applied to power grid, i.e., ① limited communication for slow-change systems; ② requirement of low-cost hardware at the users' side; and ③ inconvenience to adapt to new households. Therefore, a novel NILM algorithm based on biology-inspired spiking neural network (SNN) has been developed to overcome the existing challenges. To provide intelligence in NILM, the developed SNN features an unsupervised learning rule, i.e., spike-time dependent plasticity (STDP), which only requires the user to label one instance for each appliance while adapting to a new household. To upgrade the feasibility in NILM, the designed spiking neurons mimic the mechanism of human brain neurons that can be constructed by a resistor-capacitor (RC) circuit. In addition, a distributed computing system has been designed that divides the SNN into two parts, i.e., smart outlets and local servers. Since the information flows as sparse binary vectors among spiking neurons in the developed SNN-based NILM, the high-frequency data can be easily compressed as the spike times, and are sent to the local server with limited communication capability, whereas it is unable to handle the traditional NILM. Finally, a series of experiments are conducted using a benchmark public dataset. Meanwhile, the effectiveness of developed SNN-based NILM can be demonstrated through comparisons with other emerging NILM algorithms such as the convolutional neural networks.
Article
Full-text available
Non-Intrusive Load Monitoring consists in estimating the power consumption or the states of the appliances using electrical parameters acquired from a single metering point. State-of-the-art approaches are based on deep neural networks, and for training, they require a significant amount of data annotated at the sample level, defined as strong labels. This paper presents an appliance classification method based on a Convolutional Recurrent Neural Network trained with weak supervision. Learning is formulated as a Multiple-Instance Learning problem, and the network is trained on labels provided for an entire segment of the aggregate power, defined as weak labels. Weak labels are coarser annotations that are intrinsically less costly to obtain compared to strong labels. An extensive experimental evaluation has been conducted on the UK-DALE and REFIT datasets comparing the proposed approach to three benchmark methods. The results obtained for different amounts of strongly and weakly labeled data and mixing UK-DALE and REFIT confirm the effectiveness of weak labels compared to fully supervised and semi-supervised benchmarks methods.
Article
Full-text available
Commercial load is an essential demand-side resource. Monitoring commercial loads helps not only commercial customers understand their energy usage to improve energy efficiency but also helps electric utilities develop demand-side management strategies to ensure stable operation of the power system. However, existing non-intrusive methods cannot monitor multiple commercial loads simultaneously and do not consider the high correlation and severe imbalance among commercial loads. Therefore, this paper proposes a deep learning-based non-intrusive commercial load monitoring method to solve these problems. The method takes the total power signal of the commercial building as input and directly determines the state and power consumption of several specific appliances. The key elements of the method are a new neural network structure called TTRNet and a new loss function called MLFL. TTRNet is a multi-label classification model that can autonomously learn correlation information through its unique network structure. MLFL is a loss function specifically designed for multi-label classification tasks, which solves the imbalance problem and improves the monitoring accuracy for challenging loads. To validate the proposed method, experiments are performed separately in seen and unseen scenarios using a public dataset. In the seen scenario, the method achieves an average F1 score of 0.957, which is 7.77% better than existing multi-label classification methods. In the unseen scenario, the average F1 score is 0.904, which is 1.92% better than existing methods. The experimental results show that the method proposed in this paper is both effective and practical.
Article
Full-text available
This paper presents a critical approach to the non-intrusive load monitoring (NILM) problem, by thoroughly reviewing the experimental framework of both legacy and state-of-the-art studies. Some of the most widely used NILM datasets are presented and their characteristics, such as sampling rate and measurements availability are presented and correlated with the performance of NILM algorithms. Feature engineering approaches are analyzed, comparing the hand-made with the automatic feature extraction process, in terms of complexity and efficiency. The evolution of the learning approaches through time is presented, making an effort to assess the contribution of the latest state-of-the-art deep learning models to the problem. Performance evaluation methods and evaluation metrics are demonstrated and it is attempted to define the necessary requirements for the conduction of fair evaluation across different methods and datasets. NILM limitations are highlighted and future research directions are suggested.
Conference Paper
Full-text available
Location based services, already popular with end users, are now inevitably becoming part of new wireless infras-tructures and emerging business processes. The increasingly popular Deep Learning (DL) artificial intelligence methods perform very well in wireless fingerprinting localization based on extensive indoor radio measurement data. However, with the increasing complexity these methods become computationally very intensive and energy hungry, both for their training and subsequent operation. Considering only mobile users, estimated to exceed 7.4 billion by the end of 2025, and assuming that the networks serving these users will need to perform only one localization per user per hour on average, the machine learning models used for the calculation would need to perform 65 × 10 12 predictions per year. Add to this equation tens of billions of other connected devices and applications that rely heavily on more frequent location updates, and it becomes apparent that localization will contribute significantly to carbon emissions unless more energy-efficient models are developed and used. This motivated our work on a new DL-based architecture for indoor localization that is more energy efficient compared to related state-of-the-art approaches while showing only marginal performance degradation. A detailed performance evaluation shows that the proposed model produces only 58 % of the carbon footprint while maintaining 98.7 % of the overall performance compared to state of the art model external to our group. Additionally, we elaborate on a methodology to calculate the complexity of the DL model and thus the CO 2 footprint during its training and operation. Index Terms-localization, fingerprinting, wireless, deep learning (DL), neural network (NN), carbon footprint, energy efficiency , green communications I. 
I Location-based services (LBS) are software services that take into account a geographic location and even context of an entity [1] in order to adjust the content, information or functionality delivered. Entities can be people, animals, plants, assets and any other object. Perhaps the most widely used LBS is the Global Positioning System (GPS), which integrates data from satellite navigation systems and cell towers [2] and is used daily in navigation systems. Another popular application of LBS is locating tagged items and assets in indoor environments. With 5G systems, accurate localization is no longer only important for the provision of more relevant information to the end user, but also for optimal operation and management of the network, e.g. for creating and steering the beams of antenna array-based radio heads [3]. As discussed in [3], the poor performance of fundamental geometry-based techniques in challenging indoor environments characterized by non line-of-sight (NLoS) and/or multipath propagation can be significantly improved by using higher mmWave frequency bands and steer-able multiple-input multiple-output (MIMO) antennas along with advanced techniques such as cooperative localization, machine learning (ML) and user tracking. Given the ubiquitous presence of wireless networks and the associated availability of radio-frequency (RF) measurements, ML methods promise the highest accuracy, albeit at a higher deployment cost. In particular, in the offline training phase, ML methods use available RF measurements to create a fingerprint database of the wireless environment, hence we refer to this localization approach as wireless fingerprinting. The fingerprint database is then used in the online localization phase to compare the real-time RF measurement with the stored (measured or estimated) values associated with exact or estimated locations. 
Recent advances in Deep Learning (DL) [4] have enabled particularly accurate localization, and such models trained with large amounts of data are considered the most promising enablers for future LBS. However, the development and use of DL models involves additional technical complexity, increased energy consumption and corresponding environmental impacts. Recently, the impact of such technologies has received increased attention from regulators and the public, triggering related research activities [5]. One way to reduce the environmental impact of power-hungry AI technology is to increase the proportion of electricity from clean energy sources such as wind, solar and hydro. However, this must be complemented by further efforts to optimize energy consumption relative to the performance of existing and emerging technologies. Studies on estimating the energy consumption of ML models [6] show that the increasing complexity of models, manifested in the number of weights, the type of layers and their respective parameters, affects both their performance and energy efficiency. In DL architectures, one way to optimize the use of energy is to reduce the size of the filters, also referred to as kernels, which are matrices used to extract features from the image. For these filters, we can adjust the amount of movement over the image via the stride. Another way is to adjust pools, i.e. layers that resize the output of a filter and thus reduce the number of parameters passed to subsequent layers, making a model lighter and faster.
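The effect of kernel size, stride and pooling on model size can be sketched with simple counting functions. This is a minimal illustration only; the kernel sizes, channel counts and input dimensions below are arbitrary examples, not taken from any model discussed in the text.

```python
def conv2d_params(in_ch, out_ch, kernel):
    """Number of trainable parameters (weights + biases) of a 2D conv layer."""
    return out_ch * (in_ch * kernel * kernel + 1)

def conv2d_out_dim(size, kernel, stride=1, padding=0):
    """Spatial output dimension of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

# Shrinking the kernel from 5x5 to 3x3 cuts this layer's parameters sharply.
big = conv2d_params(1, 32, 5)    # 32 * (1*25 + 1) = 832
small = conv2d_params(1, 32, 3)  # 32 * (1*9 + 1)  = 320

# Stride 2 (or an equivalent 2x2 pool) halves each spatial dimension,
# quartering the activations passed on to subsequent layers.
full = conv2d_out_dim(28, 3, stride=1, padding=1)     # 28
strided = conv2d_out_dim(28, 3, stride=2, padding=1)  # 14
```

Both levers reduce the number of multiply-accumulate operations per forward pass, which is what ultimately drives the energy cost of training and inference.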
Article
Next-generation power systems aim at optimizing the energy consumption of household appliances by utilising computationally intelligent techniques, referred to as load monitoring. Non-intrusive load monitoring (NILM) is considered to be one of the most cost-effective methods for load classification. The objective is to segregate the energy consumption of individual appliances from their aggregated energy consumption. The extracted energy consumption of individual devices can then be used to achieve demand-side management and energy saving through optimal load management strategies. Machine learning (ML) has been popularly used to solve many complex problems including NILM. With the availability of the energy consumption datasets, various ML algorithms have been effectively trained and tested. However, most of the current methodologies for NILM employ neural networks only for a limited operational output level of appliances and their combinations (i.e., only for a small number of classes). On the contrary, this work depicts a more practical scenario where over a hundred different combinations were considered and labelled for the training and testing of various machine learning algorithms. Moreover, two novel concepts—i.e., thresholding/occurrence per million (OPM) along with power windowing—were utilised, which significantly improved the performance of the trained algorithms. All the trained algorithms were thoroughly evaluated using various performance parameters. The results demonstrate the effectiveness of the thresholding and OPM concepts in classifying concurrently operating appliances using ML.
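The thresholding step underlying such multi-label classification can be sketched as a simple mapping from per-appliance power readings to ON/OFF labels. The appliance names and wattage thresholds below are illustrative assumptions, not the values used in the cited work.

```python
# Illustrative per-appliance ON thresholds in watts (hypothetical values).
THRESHOLDS = {"kettle": 2000, "fridge": 50, "tv": 20}

def on_off_labels(readings, thresholds):
    """Map per-appliance power readings (W) to a multi-label ON/OFF vector:
    an appliance is labelled ON (1) when its reading exceeds its threshold."""
    return {app: int(readings.get(app, 0.0) > thr)
            for app, thr in thresholds.items()}

labels = on_off_labels({"kettle": 2150.0, "fridge": 3.2, "tv": 95.0}, THRESHOLDS)
# labels == {'kettle': 1, 'fridge': 0, 'tv': 1}
```

Each combination of concurrently operating appliances then corresponds to one binary label vector, which is how a single household state can fall into one of over a hundred classes.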
Article
Energy management systems (EMS), as enablers of more efficient energy consumption, monitor and manage appliances to help residents be more energy efficient and thus more frugal. Recent appliance detection and identification techniques for such systems rely on machine learning. However, machine learning solutions for appliance classification on existing low-frequency household metering have not yet been thoroughly investigated. In this paper, we propose CARMEL, a new approach for identifying home appliances from load monitoring in building EMS based on a new data representation technique and a new model that leverages spatio-temporal correlations in the new representation. The proposed data representation technique performs dimensionality expansion of time series that scales linearly rather than quadratically and, together with the proposed model, outperforms the state-of-the-art image transformation models by 5 percentage points. Evaluation on 5 different low-frequency household metering datasets, considering 29 appliances in total, shows that the proposed representation and the corresponding resource-aware deep learning architecture (1) achieve an average weighted F1 score of 0.92 and (2) require only 230 labeled samples and 3x fewer epochs to transfer to new households.
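The linear-versus-quadratic scaling distinction can be made concrete by comparing output sizes of two expansion styles. This is a generic sketch: the quadratic case mimics image transforms that form pairwise products (e.g. Gramian-style fields), while the linear case stacks a few derived channels. The specific channels chosen here are illustrative assumptions, not CARMEL's actual representation.

```python
import numpy as np

def quadratic_expansion(x):
    """Image-style transform: pairwise outer product, O(N^2) output cells."""
    return np.outer(x, x)

def linear_expansion(x):
    """Linear-scaling alternative: stack a fixed number of derived channels,
    O(N * channels) cells. Channels (value, first difference, running mean,
    sign of difference) are illustrative only."""
    diff = np.diff(x, prepend=x[0])
    running_mean = np.cumsum(x) / np.arange(1, len(x) + 1)
    return np.stack([x, diff, running_mean, np.sign(diff)])

x = np.arange(1.0, 101.0)           # a 100-sample metering window
print(quadratic_expansion(x).size)  # 10000 cells, grows as N^2
print(linear_expansion(x).size)     # 400 cells, grows as N
```

For typical low-frequency metering windows of hundreds to thousands of samples, this difference dominates both memory use and the compute needed by the downstream model.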
Article
Energy disaggregation (a.k.a. non-intrusive load monitoring, NILM), a single-channel blind source separation problem, aims to decompose the mains signal, which records the whole-house electricity consumption, into appliance-wise readings. This problem is difficult because it is inherently unidentifiable. Recent approaches have shown that the identifiability problem could be reduced by introducing domain knowledge into the model. Deep neural networks have been shown to be a promising approach for these problems, but sliding windows are necessary to handle the long sequences which arise in signal processing problems, which raises issues about how to combine predictions from different sliding windows. In this paper, we propose sequence-to-point learning, where the input is a window of the mains and the output is a single point of the target appliance. We use convolutional neural networks to train the model. Interestingly, we systematically show that the convolutional neural networks can inherently learn the signatures of the target appliances, which are automatically added into the model to reduce the identifiability problem. We applied the proposed neural network approaches to real-world household energy data, and show that the methods achieve state-of-the-art performance, improving two standard error measures by 84% and 92%.
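The training pairs used by sequence-to-point learning can be sketched as follows: each input is a sliding window of the mains, and the target is the appliance reading at that window's midpoint. This is a minimal data-preparation sketch under simplified assumptions (no normalization or padding), not the cited paper's full pipeline.

```python
import numpy as np

def seq2point_pairs(mains, appliance, window):
    """Build (window, midpoint) training pairs for sequence-to-point learning.

    mains and appliance are aligned 1D power series; for each sliding window
    of the mains, the target is the appliance reading at the window midpoint,
    so no combining of overlapping window predictions is needed at inference.
    """
    half = window // 2
    n = len(mains) - window + 1
    X = np.array([mains[i:i + window] for i in range(n)])
    y = np.array([appliance[i + half] for i in range(n)])
    return X, y

mains = np.arange(10.0)        # toy aggregate signal
appliance = 0.5 * mains        # toy target-appliance signal
X, y = seq2point_pairs(mains, appliance, window=5)
# X.shape == (6, 5); y == [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
```

Because each window predicts exactly one point, overlapping windows each contribute a distinct target sample rather than competing predictions for the same interval.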
Article
Non-intrusive load monitoring (NILM) is a technique that uses a single sensor to measure the total power consumption of a building. Using an energy disaggregation method, the consumption of individual appliances can be estimated from the aggregate measurement. Recent disaggregation algorithms have significantly improved the performance of NILM systems. However, the generalization capability of these methods to different houses as well as the disaggregation of multi-state appliances are still major challenges. In this paper we address these issues and propose an energy disaggregation approach based on the variational autoencoder framework. The probabilistic encoder makes this approach an efficient model for encoding information relevant to the reconstruction of the target appliance consumption. In particular, the proposed model accurately generates more complex load profiles, thus improving the power signal reconstruction of multi-state appliances. Moreover, its regularized latent space improves the generalization capabilities of the model across different houses. The proposed model is compared to state-of-the-art NILM approaches on the UK-DALE and REFIT datasets, and yields competitive results. The mean absolute error is reduced by 18% on average across all appliances compared to the state-of-the-art. The F1-score increases by more than 11%, showing improvements in the detection of the target appliance in the aggregate measurement.
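The two metrics reported across these NILM studies, mean absolute error on the reconstructed power and F1-score on ON/OFF detection, can be computed from a predicted and a ground-truth appliance trace as below. The power values and the 50 W ON threshold in the example are illustrative assumptions.

```python
def mae(y_true, y_pred):
    """Mean absolute error between true and predicted appliance power (W)."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def f1_on_off(y_true, y_pred, threshold):
    """F1-score for ON/OFF detection: a sample counts as ON when its power
    exceeds the threshold; F1 is the harmonic mean of precision and recall."""
    t = [p > threshold for p in y_true]
    p = [q > threshold for q in y_pred]
    tp = sum(a and b for a, b in zip(t, p))
    fp = sum((not a) and b for a, b in zip(t, p))
    fn = sum(a and (not b) for a, b in zip(t, p))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

y_true = [0, 0, 100, 100, 0]   # toy ground-truth appliance trace (W)
y_pred = [5, 0, 90, 10, 0]     # toy disaggregated estimate (W)
# mae(y_true, y_pred) == 21.0; f1_on_off(y_true, y_pred, 50) ~= 0.667
```

MAE penalizes errors in the reconstructed power signal itself, while F1 rewards correctly detecting when the target appliance is running, which is why papers typically report both.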