On Designing Data Models for Energy Feature Stores


Abstract

The digitization of the energy infrastructure enables new, data-driven applications that are often supported by machine learning models. However, domain-specific data transformations, pre-processing and management in modern data-driven pipelines are yet to be addressed. In this paper we perform a first-time study on data models, energy feature engineering and feature management solutions for developing ML-based energy applications. We first propose a taxonomy for designing data models suitable for energy applications, then analyze feature engineering techniques able to transform the data model into features suitable for ML model training, and finally analyze available designs for feature stores. Using a short-term forecasting dataset, we show the benefits of designing richer data models and engineering the features on the performance of the resulting models. Finally, we benchmark three complementary feature management solutions, including an open-source feature store.
Gregor Cerar∗†, Blaž Bertalanič∗†, Anže Pirnat, Andrej Čampa, Carolina Fortuna
Department of Communication Systems, Jožef Stefan Institute, Slovenia.
These authors contributed equally
Corresponding author: gregor.cerar@ijs.si
Index Terms—energy data model, feature store, energy feature
management, machine learning
I. INTRODUCTION
With the transformation of the traditional power grid into
the smart grid, the complexity of the system continues to
evolve [1], driven by the penetration of smart meters (SM),
energy management systems (EMS) and other intelligent
electronic devices (IED), especially at the low voltage (LV)
level of the grid. IEDs, together with EMSes, enable an
innovative set of energy [2] and non-energy applications [3].
EMSes enable the control of various assets in homes or
buildings with limited knowledge of grid status. Example
energy applications are energy cost optimization, matching
consumption with self-production from renewable energy
sources (RES), and helping the distribution system operator
(DSO) or aggregator reach their predictive performance curves.
On the DSO side of the LV grid, the main challenges are
reliability and latency. When control is performed at the
substation level, complete observability of the LV grid
for that substation is of great importance. The data collected
from all SMs in the grid of one substation would provide
enough information to plan for and minimize the congestion
that may occur during peak demand hours, or the excessive
RES production that could lead to power quality issues
(e.g. over-voltage).
In the medium voltage (MV) and high voltage (HV)
networks, wide-area measurement systems (WAMS) already
monitor and collect data. However, the data is collected only
to provide observability and to efficiently handle critical
situations that might result in catastrophic events, in the worst
case power outages. Since reliability is the most important
factor, the penetration of auxiliary services in the MV and
HV grid is low. However, the data collected in the LV grid
could be processed and used to enrich the data collected at
the MV and HV levels. The enriched data can be used to create
a limited control loop that extends from the observability of
the HV grid to control and smaller adjustments in the
LV grid, all the way down to the prosumer.
While IEDs, WAMS and EMSes have been around for a
long time, their increased penetration and the resulting amount
of generated data are triggering the adoption of big data and
machine learning techniques that are integrated in applications
serving all segments of the grid [4], [5]. Data-driven machine
learning (ML) models differ from traditional statistical
models in that they are able to automatically learn an
underlying distribution. However, to achieve that, a well-defined
knowledge discovery process (KDP) needs to be followed [6].
The main steps of the KDP are 1) data analysis, 2) data
preparation (pre-processing), 3) model training and evaluation
and 4) model deployment [7], as also represented in Figure 1.
In the past, such a process and the enabling tools were familiar
only to a limited number of domain experts, and the process
involved intense manual effort. However, in the last five years,
coordinated efforts have been made by the private and public
sectors to democratize AI and model development [8] to
empower less specialized users.
The democratization process involves a division of labour
and an automation-like approach applied to the KDP, as
elaborated in more detail in [7], where rather than a domain
expert executing the process in Figure 1 step by step from
start to end, they only need to control the process at a few
key steps. For instance, to develop a home energy consumption
prediction model, the users need to carefully select the
relevant data, also referred to as the data model, engineer the
desired features, configure the desired pipeline by selecting
the ML methods to be applied, and then select the best model
to be deployed to production. Such automation is enabled by
machine learning operations (MLOps) [7] and is being piloted
in projects such as I-NERGY¹ and MATRYCS² [9].
¹ https://i-nergy.eu
² https://matrycs.eu
arXiv:2205.04267v1 [cs.AI] 9 May 2022
Fig. 1: End-to-end infrastructure for machine learning model development and management.
More recently, the authors in [10] proposed a “unification
of machine learning features” in which a common/unified
data preparation phase is best automated by feature stores
(see phase 2 in Figure 1). Such an approach further reduces
the time spent on the most time-consuming phase of ML
application development, benefiting data scientists, engineers,
and stakeholders. While most of the MLOps automation steps
are generic across domains, the ones concerned with data
analysis and data preparation (pre-processing) have domain
specifics and may significantly impact the model fairness
and performance [11]. Assume aggregated home consumption
is ingested from a smart meter and sent as is to train an
ML algorithm. In such a case, the model will learn the likely
distribution of the values in the metering data and predict
future energy consumption based on that, similar to the work
in [12]. However, if additional weather data were also used
for training, the model would learn to associate lower energy
consumption with sunny days and high temperatures, thus
yielding superior performance. It is common practice in the
literature to use such additional data. For instance, the authors
of [13] estimated consumption using timestamp (month, hour,
week of year, day of week, season), energy consumption, weather
(condition, severity, temperature, humidity) and energy price. In
[14], the ML-based estimation was done using timestamp,
electricity contract type, energy consumption and city area.
In this paper we perform a first-time study on data models,
energy feature engineering and feature management solutions
for developing ML-based energy applications. This is the
first study that formalizes energy data modelling and shows
feature importance and model performance trade-offs while
also benchmarking feature management solutions. The
contributions of this paper are:
• A taxonomy for designing data models suitable for energy
  applications and identification of relevant sources for the
  data categories in the taxonomy.
• An analysis of feature generation techniques and an
  evaluation of their impact.
• An analysis of available solutions for realizing energy
  feature stores and the benchmarking of three solutions.
The rest of the paper is structured as follows. Section II
elaborates on the proposed taxonomy for designing energy
data models, Section III analyzes the feature generation tech-
niques, Section IV discusses feature stores and available solu-
tions while Section V details the evaluation and benchmarks.
Finally, Section VI concludes the paper.
II. TAXONOMY FOR DATA MODEL DESIGN
In this section, we propose a taxonomy that identifies and
structures various types of data related to energy applications.
Based on this taxonomy, data models can be designed and im-
plemented in database-like systems or feature storing systems
for ML model training. The proposed taxonomy is depicted
in Figure 2 and distinguishes three large categories: domain
specific, contextual and behavioural.
A. Domain specific
Domain-specific features are measurements of energy
consumption and production collected by IEDs installed at
various points of the energy grid. Additionally, information
associated with energy-related appliances, such as the type
of heating (e.g., heat pump, gas furnace, electric fireplace),
can be represented as meta-data. In Figure 2 we identify
PV power plant generation, electric vehicles (EVs), wind
power generation, and household consumption. PV power plant
data include battery/super-capacitor capacity, voltage, current,
power, and energy measurements and can be found in at least
two publicly available datasets as listed in Table I. These
datasets may also include metadata such as plant id, source
key, geographical location, and measurements that fall under
the contextual features group, such as air temperature. Wind
power generation datasets contain the generated power and
sometimes electrical capacity.
With the recent spike in the popularity of EVs, power grids
need to be designed or upgraded to accommodate their
significant energy capacity and power draw. In the EVs
column of Table I, we list good-quality publicly accessible
datasets related to EVs. The datasets consider, for instance,
battery capacity, charge rate, and discharge rate.
For household measurements, consumption may be mea-
sured by a single or several smart meters, thus being aggre-
gated or per (set of) appliances. Depending on the dataset, they
contain active, reactive, and apparent power, current, phase,
and voltage. In some cases, meta-data about the geolocation,
orientation, or size of the house may also be present. There
are a number of good quality publicly accessible datasets, as
can be seen from the fourth column of Table I.
B. Contextual features
We refer to contextual features as measured data that is
not directly collected by the energy measuring device(s); such
data [39] may nevertheless be critical in developing better
energy consumption or production estimates. In Figure 2 we identify
1) weather data such as wind speed/direction, temperature,
relative humidity, pressure, cloud coverage and visibility, 2)
building properties such as type of insulation, year of building,
type of property, orientation, area density and 3) time related
features such as part of the day and daytime duration.
Fig. 2: A taxonomy of features relevant for energy application data model design.
[Figure: tree rooted at "Applicable features" with three branches. Domain specific features: PV power plant measurements, electric vehicles, household measurements, wind power plant measurements. Contextual features: weather conditions, building properties, time, geolocation. Behavioral features include wealth class, social activities, age, work schedule, personal hygiene, cooking, number of inhabitants, gender, geolocation and culture.]
Weather datasets, such as the ones listed in the fifth column
of Table I, usually provide numbers related to temperature and
precipitation. Some also provide more climate elements like
fog and hail, wind speed, humidity, pressure and sunshine/solar
data. They also all provide the geographical location and the
time period of measurement.
C. Behavioral features
Behavioral features refer to aspects related to people’s
behavior such as work schedule, age, wealth class and hygiene
habits as per Figure 2. Studies [20], [25], [27] show that
electricity consumption is influenced by the behavior of the
inhabitants. Influencing factors include, for instance, personal
hygiene, age, origins, work habits, cooking habits, social
activities [20], and wealth class [25]. People of different ages
have diverse electricity consumption patterns because of the
difference in sleep habits and lifestyles in general. Social activ-
ities such as holidays, near-holidays, weekdays, and weekends
also make a difference in the electricity consumption in homes,
offices, and other buildings. Wealth class also influences
electricity consumption because it impacts the lifestyle. If we
are talking about a home, personal hygiene also has a major
impact. Showering or bathing can take a considerable amount
of warm water, depending on the showering frequency, shower
duration, and how much water the shower head pours each
minute. Besides that, we have to consider electrical devices
such as a fan, hairdryer, or infrared heater.
Gender of the inhabitants also plays a role as men and
women have different preferences when it comes to air tem-
perature, water temperature [27] and daily routines.
While such data is useful for estimating and optimizing
energy consumption, it raises ethical and privacy concerns.
To some extent, such data can be collected through ques-
tionnaires, studies, or simulations to describe the behavioral
specifics of users, groups, or communities.
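As an illustration of how the three categories of the taxonomy could be carried into a data model, the following minimal Python sketch groups a single observation into domain specific, contextual and behavioral parts. All class names, field names and units are hypothetical and chosen only for this example; they are not prescribed by the taxonomy or by any of the datasets in Table I.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class DomainSpecific:
    # Measurements collected by IEDs plus appliance meta-data (Section II-A).
    active_power_kw: float
    voltage_v: Optional[float] = None
    heating_type: Optional[str] = None  # e.g. "heat pump"


@dataclass
class Contextual:
    # Data not collected by the energy measuring device itself (Section II-B).
    temperature_c: Optional[float] = None
    relative_humidity: Optional[float] = None
    house_orientation: Optional[str] = None


@dataclass
class Behavioral:
    # Aspects related to the behavior of the inhabitants (Section II-C).
    is_holiday: bool = False
    is_weekend: bool = False
    n_inhabitants: Optional[int] = None


@dataclass
class EnergyRecord:
    timestamp: datetime
    household_id: str
    domain: DomainSpecific
    contextual: Contextual = field(default_factory=Contextual)
    behavioral: Behavioral = field(default_factory=Behavioral)


def flatten(record: EnergyRecord) -> dict:
    """Flatten a nested record into one row with category-prefixed keys."""
    row = {"timestamp": record.timestamp, "household_id": record.household_id}
    for group in ("domain", "contextual", "behavioral"):
        for key, value in asdict(getattr(record, group)).items():
            row[f"{group}.{key}"] = value
    return row


record = EnergyRecord(
    timestamp=datetime(2020, 1, 1, 8),
    household_id="h1",
    domain=DomainSpecific(active_power_kw=1.2, heating_type="heat pump"),
)
row = flatten(record)
```

Keeping the category as a prefix of each column name makes it easy to add or remove a whole category of features when comparing data models.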
III. FEATURE GENERATION TECHNIQUES
TABLE I: Datasets/studies for the feature taxonomy.
Columns PVs, EVs, Wind and Household are domain specific features; columns Weather and Building are contextual features.
PVs         | EVs         | Wind             | Household      | Weather          | Building | Behavioral features
UKPN [15]   | V2G [16]    | SandovalGER [17] | BLOND-50 [18]  | UKPN [15]        | HUE [19] | Social [20]
Kannal [21] | emobpy [22] | LafazGER [23]    | FIRED [24]     | KannalIN [21]    |          | Wealth [25]
            |             |                  | UK-DALE [26]   | SandovalGER [17] |          | Gender [27]
            |             |                  | REFIT [28]     | LafazGER [23]    |          |
            |             |                  | ECO [29]       | ECN [30]         |          |
            |             |                  | REDD [31]      | MIDAS [32]       |          |
            |             |                  | iAWE [33]      | KellerGER [34]   |          |
            |             |                  | COMBED [35]    | KukrejaIN [36]   |          |
            |             |                  | HUE [19]       |                  |          |
            |             |                  | HVAC [37]      |                  |          |
            |             |                  | Synthetic [38] |                  |          |
The data model designed for ML model training based on
features from the taxonomy proposed in Section II typically
consists of a mix of raw and engineered features, as
conceptually illustrated in the upper part of Figure 3. Feature
engineering typically requires domain knowledge, experience
and experimentation, and tends to be time consuming. New
features can be produced by transforming the raw domain specific,
behavioral and/or contextual data through linear or non-linear
operations, such as summation, subtraction and absolute value,
statistical processing, e.g. min, max, avg or moment
computation, dimensionality reduction or expansion techniques, as
well as through feature interaction (cross-feature), an
interaction of one feature with its historical values, or a
combination of multiple interactions.
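The transformations listed above can be sketched with pandas as follows. This is a minimal illustration on made-up hourly values; the column names (energy, production, temperature) and the derived feature names are assumptions of this example, not names used in the evaluated dataset.

```python
import pandas as pd

# Hypothetical hourly readings; names and values are illustrative only.
raw = pd.DataFrame(
    {
        "energy": [0.5, 0.7, 0.6, 1.2, 0.9],       # kWh consumed in the hour
        "production": [0.0, 0.1, 0.4, 0.3, 0.0],   # kWh produced in the hour
        "temperature": [4.0, 5.0, 7.5, 8.0, 6.0],  # outdoor temperature, deg C
    },
    index=pd.date_range("2020-01-01 08:00", periods=5, freq=pd.Timedelta(hours=1)),
)

features = pd.DataFrame(index=raw.index)
# Linear interaction: subtraction of two raw signals.
features["net_energy"] = raw["energy"] - raw["production"]
# Non-linear transformation: absolute hour-to-hour change.
features["energy_abs_diff"] = raw["energy"].diff().abs()
# Interaction of a feature with its historical values: value one hour earlier.
features["energy_lag_1h"] = raw["energy"].shift(1)
# Cross-feature: product of two different raw signals.
features["energy_x_temperature"] = raw["energy"] * raw["temperature"]
```

The first row of the lagged and differenced features is empty because no earlier observation exists; such rows are usually dropped or imputed before training.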
A. Statistical techniques
The most intuitive way to start building features from raw
energy time series signals is to search for their statistical
properties. Many researchers, such as [40], [41], suggest using
statistical features to improve pattern recognition in time
series. The most straightforward examples are computing sum-
maries such as max, min, mean, standard deviation, variance,
median, percentiles, etc. Tools such as TSFresh³ [42] offer an
abundance of statistical feature extraction options that can help
automate the process of generating features for storing and
model development. Recently, with the increased performance
and popularity of deep learning algorithms, statistical features
are less popular for building machine learning models, but are
still very useful for dataset analysis and insight.
Statistical features can be used for solving various
problems in the energy domain. For example, [43], [44] used
these features to forecast energy usage in smart grids and
residential areas, respectively. Another example is the use of
statistical features to detect appliances in the Non-Intrusive
Load Monitoring (NILM) problem [45], while [46] proposed
the use of statistical features for power consumption anomaly
detection.
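A minimal sketch of such statistical features, computed here with plain pandas rolling windows rather than a dedicated extraction tool, is given below. The synthetic trace and the column names are assumptions of this example; the 10-hour window mirrors the window size reported for the statistical features in the legend of Figure 4.

```python
import numpy as np
import pandas as pd

# Synthetic hourly consumption in kWh, a stand-in for a smart meter trace.
rng = np.random.default_rng(seed=0)
index = pd.date_range("2020-01-01", periods=48, freq=pd.Timedelta(hours=1))
energy = pd.Series(rng.uniform(0.2, 2.0, size=len(index)), index=index, name="energy")

window = 10  # hours
rolling = energy.rolling(window=window, min_periods=window)
stats = pd.DataFrame(
    {
        "energy": energy,
        "energy_mean": rolling.mean(),
        "energy_std": rolling.std(),  # sample standard deviation (ddof=1)
        "energy_min": rolling.min(),
        "energy_max": rolling.max(),
        "energy_median": rolling.median(),
    }
)
# The first window - 1 rows have no complete window and are dropped.
stats = stats.dropna()
```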
³ https://tsfresh.com/
B. Dimensionality reduction techniques
In general, machine learning approaches provide better
results when more, possibly independent, features that
are directly or indirectly related to a task are added. These
can expose previously unknown correlations or patterns to
learn from, leading to better overall performance. However,
increasing the number of input features can also increase
computational complexity, which is related to the “curse of
dimensionality” [47]. Computational complexity then slows down
the inference process [48].
To balance the performance/complexity trade-off,
dimensionality reduction techniques help reduce the number of
input features while keeping as much variation as possible.
These techniques are concerned with finding a smaller set
of new variables, each of which can be a combination of the
raw data. Some of the common dimensionality reduction techniques
are Principal Component Analysis (PCA) [49], t-distributed
Stochastic Neighbor Embedding (t-SNE) [50], Locality
Preserving Projections (LPP) [51], and autoencoders (AEs) [52]
in the case of deep neural networks (DNNs). While LPP⁴ does
not have an official implementation, both PCA and t-SNE
are part of the scikit-learn Python library [53]. AEs can be
constructed using open-source libraries such as PyTorch⁵ [54]
or TensorFlow/Keras⁶ [55].
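As a minimal sketch of dimensionality reduction with scikit-learn, the example below applies PCA to a synthetic feature matrix in which six correlated features are driven by two hidden factors. The data, the 95% variance threshold and the prior standardization step are assumptions made for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix: 200 samples, 6 features driven by 2 hidden factors.
rng = np.random.default_rng(seed=0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 6))

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the fewest components that explain
# at least that fraction of the variance (requires the full SVD solver).
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

explained = pca.explained_variance_ratio_.sum()
```

Here the reduced matrix has at most two columns, since almost all variance in the six input features comes from the two hidden factors.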
Dimensionality reduction techniques are widely used in
the energy domain for solving different sets of problems. PCA
is one of the most used techniques and has been utilised in
forecasting of energy production [56], [57], energy
consumption estimation [58], and NILM disaggregation and appliance
classification [59], [60]. Although the other presented methods
are usually utilized for visualization of high-dimensional
data, some researchers proposed t-SNE as a dimensionality
reduction method for NILM [61]. The DNN AE approach was
proposed by [62] to reduce the dimension of input data for
later energy production forecasting.
⁴ https://github.com/jakevdp/lpproj
⁵ https://pytorch.org/
⁶ https://www.tensorflow.org/
C. Dimensionality expansion techniques
Although dimensionality reduction techniques strive to
avoid the curse of dimensionality, a recent trend is
to actually expand the dimensions of time series (TS) data.
Following the significant breakthroughs in image
recognition [63] and image object detection [64] in the last
decade, Wang et al. [65] proposed to use these algorithms to
solve TS classification problems, which first requires the TS
traces to be transformed into an image-like format. They proposed
two new image-like representations of TS data, the Gramian Angular
Field (GAF) and the Markov Transition Field (MTF), which are
now part of the open-source library pyts [66]. In the same
paper they showed that using image recognition deep learning
algorithms on transformed TS data improved the classification
results. Before Wang et al., Silva et al. [67] proposed a
different TS representation, called Recurrence Plots (RP) [68],
to solve a TS classification problem. The idea of expanding TS
dimensions is to incorporate additional information into the
TS representation. Although in different ways, both GAF
and RP capture the temporal correlation between points
within a time series, while an MTF represents a field of
transition probabilities for a TS trace. All three transformations
produce a square image representation of the input time series.
Recently, another high dimensionality expansion technique
was proposed, where subsections of TS were transformed into
GAF and stacked together into a video-like format [69].
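To make the GAF transformation concrete, the sketch below implements its summation variant (GASF) directly in NumPy, following the usual definition: the series is rescaled to [-1, 1], each value is encoded as an angle via arccos, and the image entry (i, j) is the cosine of the sum of the two angles. The function name and the synthetic 24-point trace are assumptions of this example; it is a hand-written illustration rather than the implementation used in the cited works.

```python
import numpy as np


def gramian_angular_summation_field(series: np.ndarray) -> np.ndarray:
    """Return the n x n GASF image of a 1-D series of length n."""
    x = np.asarray(series, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        raise ValueError("series must not be constant")
    # Rescale to [-1, 1] so that arccos is defined for every value.
    x_scaled = np.clip(2.0 * (x - lo) / (hi - lo) - 1.0, -1.0, 1.0)
    # Polar encoding: each value becomes an angle in [0, pi].
    phi = np.arccos(x_scaled)
    # GASF[i, j] = cos(phi_i + phi_j), computed by broadcasting.
    return np.cos(phi[:, None] + phi[None, :])


# Hypothetical daily profile with 24 hourly values.
hourly = 1.0 + 0.5 * np.sin(np.linspace(0.0, 2.0 * np.pi, 24))
image = gramian_angular_summation_field(hourly)
```

A series of length n thus becomes an n x n image, which is the expansion in dimension that allows image classification models to be applied.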
In the energy domain, the presented transformations are most
commonly, but not exclusively, considered for solving classification
problems, such as device detection in NILM. The usefulness
of the GAF transformation was shown by [69]–[71], who selected
GAF to distinguish between appliance types in NILM. Similarly,
[72], [73] proposed a disaggregation
method using GAF that subsequently determines whether an
appliance is turned ON or OFF. An example of using RP
for NILM device classification was shown by [74]. Though
it is not strictly a dimensionality expansion technique, [75]
proposed a weighted pixelated image representation of the
voltage–current trajectory (VI) to detect different types of
appliances in the dataset. Dimensionality
expansion techniques are also used in the energy domain for
forecasting [76], anomaly detection in measuring equipment
[77] and power consumption estimation [78].
IV. FEATURE STORE
In modern data infrastructures, the features produced us-
ing the techniques discussed in Section III, to be used for
model training, are managed by feature stores. These are data
management services that harmonize data processing steps
producing features for different pipelines, making it more cost-
effective and scalable compared to traditional approaches [10],
[79].
As depicted in Figure 1 and described in Section I, a feature
store ingests data, transforms it according to the instructions
kept in a feature store registry, and serves features. To better
illustrate how the feature transformation in Figure 1 works, we
present feature transformation as a data flow in Figure 3.
As shown in Figure 3, a flow starts where a feature store
takes the raw data discussed in Section II. Within a feature
store, the data is transformed according to the instructions in
a feature store registry. Here, the data is transformed into
primitive features or derived into more complex features as
discussed in Section III. A
feature can be built out of single or multiple sources or
features. The flow ends with feature serving, where features
are passed to the model(s), each of which requires a different
set of features to work correctly. Many different features can
coexist simultaneously in a feature store, and a model may not
use all features.
Fig. 3: The “data flow” of transformation(s) in feature store.
Model(s) in staging can be developed in parallel with
deployed model(s).
Certain feature stores, such as those listed in Table II, are
specialized in time-series data. With those, the ML model can
request data from any point in time, and the specialized feature
store will retrieve valid data for that timestamp and ensure
there is no data leakage.
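The idea of a point-in-time lookup can be sketched with a backward as-of join in pandas: for each requested timestamp, only the most recent feature value available at or before that timestamp is returned. The table layout and names below are hypothetical and do not reflect the internals of any feature store in Table II.

```python
import pandas as pd

# Hypothetical feature values, stamped with the time they became available.
features = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2020-01-01 00:00",
                "2020-01-01 01:00",
                "2020-01-01 02:00",
                "2020-01-01 03:00",
            ]
        ),
        "household_id": ["h1", "h1", "h1", "h1"],
        "energy_mean": [0.50, 0.55, 0.70, 0.65],
    }
)

# Points in time for which a model requests features.
requests = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2020-01-01 01:30", "2020-01-01 03:00"]),
        "household_id": ["h1", "h1"],
    }
)

point_in_time = pd.merge_asof(
    requests.sort_values("timestamp"),
    features.sort_values("timestamp"),
    on="timestamp",
    by="household_id",
    direction="backward",  # never use values from the future
    allow_exact_matches=True,
)
```

The request at 01:30 receives the value recorded at 01:00 rather than the later one at 02:00, which is what prevents leakage of future information into training data.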
A feature store’s registry contains instructions on how every
feature should be produced. These descriptions (or recipes) de-
fine what ingredients from available data sources are required
to build every feature through transformation(s). Once a new
feature is introduced into a registry, it becomes immediately
available to other pipelines and workflows. Because of this,
it fits nicely into existing continuous integration development
processes and decouples feature engineering and model devel-
opment.
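The role of the registry can be illustrated with a toy in-memory version in which each feature name maps to a recipe, here a plain Python function over the raw data. The class FeatureRegistry, its methods and the feature names are hypothetical and greatly simplified compared to the registries of real feature stores.

```python
from typing import Callable, Dict, List

import pandas as pd

Recipe = Callable[[pd.DataFrame], pd.Series]


class FeatureRegistry:
    """Toy in-memory registry mapping feature names to recipes."""

    def __init__(self) -> None:
        self._recipes: Dict[str, Recipe] = {}

    def register(self, name: str, recipe: Recipe) -> None:
        if name in self._recipes:
            raise ValueError(f"feature {name!r} is already registered")
        self._recipes[name] = recipe

    def serve(self, raw: pd.DataFrame, names: List[str]) -> pd.DataFrame:
        """Build only the requested features from the raw data."""
        built = {name: self._recipes[name](raw) for name in names}
        return pd.DataFrame(built, index=raw.index)


registry = FeatureRegistry()
registry.register("energy", lambda raw: raw["energy"])
registry.register("energy_mean_3h", lambda raw: raw["energy"].rolling(3).mean())
registry.register(
    "is_weekend",
    lambda raw: pd.Series(raw.index.dayofweek >= 5, index=raw.index),
)

raw = pd.DataFrame(
    {"energy": [0.4, 0.6, 0.8, 1.0]},
    index=pd.date_range("2020-01-04 22:00", periods=4, freq=pd.Timedelta(hours=1)),
)
# Two models request different subsets of the registered features.
model_a_features = registry.serve(raw, ["energy", "energy_mean_3h"])
model_b_features = registry.serve(raw, ["energy", "is_weekend"])
```

Once a recipe is registered, any pipeline can request the feature by name, which is the decoupling of feature engineering from model development described above.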
Table II summarizes existing feature stores. It can be
seen that there are three open source stores, namely Feast,
Hopsworks and Butterfree, and several proprietary ones. Besides
containing recipes for automatically generating features and
functionality for re-generation and feature serving, feature
stores also decrease the engineering effort of connecting
to various data storage and delivery technologies. It can be
seen from the third column of the table that they include
connectors to support fast interconnection with various storage
solutions (BigQuery, S3, Postgres...) and streaming platforms
(Kafka, Spark). As per columns four and five, it can be
seen that all open source stores support offline and online
storage, such as the public cloud providers’ BigQuery, Azure, S3
and Snowflake or open source solutions such as PostgreSQL
and Cassandra. For the proprietary feature store solutions the
choice of technologies is sometimes unclear, as in the case of
Iguazio, Molecula and Rasgo. As can be seen from the sixth
column of the table, the open source feature stores can be
deployed locally and also in the public cloud.
TABLE II: List of proprietary & open-source feature store solutions.
Name | Open Source | Data Sources | Offline Storage | Online Storage | Deployment | Misc
Feast | Y | BigQuery, Hive, Kafka, Parquet, Postgres, Redshift, Snowflake, Spark, Synapse | BigQuery, Hive, Pandas, Postgres, Redshift, Snowflake, Spark, Synapse, Trino, custom | DynamoDB, Datastore, Redis, Azure Cache for Redis, Postgres, SQLite, custom | AWS Lambda, Kubernetes, local |
Hopsworks | Y | Flink, Spark, custom Python, Java, or Scala connectors | Azure Data Lake Storage, HopsFS, any SQL with JDBC, Redshift, S3, Snowflake | any SQL with JDBC, Snowflake | AWS, Azure, Google Cloud, local |
Butterfree | Y | Kafka, S3, Spark | S3, Spark Metastore | Cassandra | local | Apache Airflow
SageMaker | N | integrates with the AWS ecosystem (data sources and storage) | | | AWS |
Databricks | Y/N | DataGrip, Azure Data Factory, dbt, DBeaver, Delta Lake, JSON, Kafka, Parquet, Prophecy, Spark, XML, many commercial providers, custom | BigQuery, S3, Azure Data Lake, Snowflake, custom | DBFS, S3, Azure Blob Storage, custom | AWS, Azure, Google Cloud | Airflow, MLFlow, integrates with Tableau; Databricks is an E2E solution and extends beyond just a feature store
Iguazio | N | SQL DBs, unstructured data sources | N/A | N/A | AWS, Azure, Google Cloud | E2E solution
Kaskada | N | Parquet, S3, plain text | Redshift, Snowflake | DynamoDB, Redis | Kaskada | define pipeline with Fenl language
Molecula | N | CSV, Delta Lake, JSON, Kafka, Snowflake | N/A | N/A | ? | E2E solution
Rasgo | N | Azure, BigQuery, Delta Lake, RasgoQL, S3 | Cloud Data Warehouse, S3 | N/A | Snowflake, RASGO | support no-code development
Tecton | N | Kafka, Amazon Kinesis, Redshift, S3, Snowflake | Snowflake, S3 | Redis | Databricks, AWS EMR | E2E solution of Feast
Vertex AI | N | integrates with the Google Cloud ecosystem (data sources and storage) | | | Google Cloud |
V. EVALUATION
In this section we first evaluate aspects related to feature
importance and feature selection for developing a machine
learning model, followed by benchmarks of three feature
management solutions.
Throughout the section we use HUE (the Hourly Usage of
Energy dataset for buildings in British Columbia) dataset [19].
It contains hourly data from 28 households in Canada, col-
lected in different timespans between 2012 and 2020. The
dataset consists of raw data, household metadata, and weather
data (744 000 samples in total). This dataset is suitable for
analyzing and predicting household energy consumption.
A. Feature importance
To assess the importance of the three categories of features
from Figure 2, in addition to the HUE dataset that includes
domain specific, contextual and behavioural features, we also
consider additional contextual features related to solar
radiation and altitude produced by a model [80].
From HUE we have the energy consumption measured
by IEDs and categorical variables related to the type of
heating devices such as forced air gas furnace (FAGF), heat
pump (HP), etc. as domain measurements. Additionally, as
contextual features we consider available metadata related to
the building, such as the id of the residence, house orientation
and type of house, metadata related to the geographical location,
such as region, and meteorological data such as pressure, temperature,
humidity and weather (e.g. cloudy, windy, snow storm). We
also consider behavioural features related to weekdays and
holidays (is holiday, weekday, is weekend). To understand the
importance these features may have in estimating short-term
consumption 1 hour ahead, we train an XGBoost⁷ regressor
and assess the assigned importance.
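The procedure can be sketched as follows on synthetic data. Note the swapped-in technique: to keep the example dependent only on scikit-learn, it uses GradientBoostingRegressor as a stand-in for XGBoost, and its impurity-based importances are normalised to sum to one, so they are not comparable to the scores reported in Figure 4. The synthetic signals and all column names are assumptions of this example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic hourly data with a daily pattern; values are illustrative only.
rng = np.random.default_rng(seed=0)
n = 500
hours = np.arange(n)
temperature = 10 + 8 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 1, n)
energy = (
    1.0
    + 0.5 * np.sin(2 * np.pi * (hours - 6) / 24)
    - 0.02 * temperature
    + rng.normal(0, 0.05, n)
)

data = pd.DataFrame({"energy": energy, "temperature": temperature})
data["energy_mean"] = data["energy"].rolling(10).mean()
data["energy_std"] = data["energy"].rolling(10).std()
data["day_percent"] = (hours % 24) / 24.0
data["noise"] = rng.normal(size=n)          # irrelevant feature for comparison
data["target"] = data["energy"].shift(-1)   # consumption one hour ahead
data = data.dropna()

feature_names = [c for c in data.columns if c != "target"]
model = GradientBoostingRegressor(random_state=0)
model.fit(data[feature_names], data["target"])

# Rank the features by the importance assigned by the trained model.
importance = pd.Series(model.feature_importances_, index=feature_names)
importance = importance.sort_values(ascending=False)
```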
The results of the feature importance as learnt by XGBoost
are presented in Figure 4. It can be seen that the raw instant
energy consumption is the feature that contributes the most to
the energy estimation. The second and sixth most important
features are the mean and standard deviation of the energy,
generated using statistical feature engineering techniques as
discussed in Section III. Contextual features such as solar
azimuth and how much of the 24h in a day has passed
(day percent) are the third and fourth most important features.
It can be noticed that XGBoost considers the instant
energy consumption more than twice as important as its rolling
window average, with a score of 1035 compared to 488. The
importance of the second and third features is between 400
and 500, and the importance of the fourth to seventh features
is also comparable, with values between 300 and 400. Starting with
the eighth feature, the importance decreases more abruptly,
from just below 300 to below 100, while the last 12 features,
mostly related to the type of heating and cooling devices used
as can be seen from the legend of Figure 4, are
less important by an order of magnitude than the first.
Other features, which were omitted from the bar plot, show no
significant importance.
⁷ https://xgboost.readthedocs.io/en/stable/
Fig. 4: Feature importance score for estimating future (1h
ahead) energy consumption.
Legend:
raw
  energy: current hourly energy consumption
statistical
  energy mean: rolling hourly average; window size 10 hours
  energy std: standard deviation; window size 10 hours
weather
  temperature: current outdoor temperature
  humidity: current outdoor relative air humidity
  pressure: current outdoor air pressure
  weather: weather condition (e.g., cloudy, rain, snow)
  solar altitude: Sun altitude (VSOP 87 model)
  solar azimuth: Sun azimuth (VSOP 87 model)
  solar radiation: clear sky radiation (VSOP 87 model)
building properties
  residential id: residential unique ID
  house type: house type (e.g., apartment, duplex, bungalow)
  facing: house orientation (e.g., North, South, Northwest)
  RUs: number of rental units
  SN: special operation regimes
  FAGF: forced air gas furnace
  FPG: gas fireplace
  IFRHG: in-floor radiant heating (gas boiler)
  NAC: no a/c
  FAC: has fixed a/c unit
  PAC: has portable a/c unit
  BHE: baseboard heater (electric)
  IFRHE: in-floor radiant heating (electric)
  WRHIR: water radiant heat (cast iron radiators)
  GEOTH: geothermal
time
  day percent: percentage of the day elapsed
  year percent: percentage of the year elapsed
sociological
  is holiday: is a holiday
  weekday: integer representation of the day of the week
  region: geographic region
TABLE III: Impact of additional feature categories on the
XGBoost regression model.
Feature set            | MSE [kWh] | MAE [kWh] | medAE [kWh]
raw                    | 0.343     | 0.317     | 0.146
++ statistical         | 0.327     | 0.308     | 0.148
++ weather             | 0.293     | 0.292     | 0.148
++ building properties | 0.265     | 0.278     | 0.141
++ time                | 0.260     | 0.275     | 0.138
++ geolocation         | 0.259     | 0.274     | 0.138
++ sociological        | 0.258     | 0.274     | 0.138
B. Impact of features on the model performance.
This section examines how each category of features
contributes to the model’s accuracy. The goal was an accurate
prediction of energy consumption 1 hour ahead. The training
data was shuffled, split using 10-fold cross-validation, and
evaluated 10 times using the XGBoost regressor algorithm.
Every step of the ML pipeline was seeded for a fair comparison.
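The evaluation protocol described above (shuffled, seeded 10-fold cross-validation scored with MSE, MAE and medAE) can be sketched as follows. This is a minimal, dependency-free illustration and not the authors' code: the function names, the toy data and the placeholder mean predictor that stands in for the XGBoost regressor are all assumptions made for the example.

```python
import random
import statistics


def kfold_indices(n_samples, n_splits=10, seed=0):
    """Yield (train, test) index lists from a shuffled, seeded k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed -> identical folds on every run
    folds = [indices[k::n_splits] for k in range(n_splits)]
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx


def regression_errors(y_true, y_pred):
    """Return (MSE, MAE, medAE), the three metrics reported in Table III."""
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    mse = statistics.fmean([e * e for e in abs_err])
    mae = statistics.fmean(abs_err)
    return mse, mae, statistics.median(abs_err)


def cross_validate(y, n_splits=10, seed=0):
    """Average the fold-wise errors of a placeholder mean predictor."""
    fold_scores = []
    for train_idx, test_idx in kfold_indices(len(y), n_splits, seed):
        # Placeholder for model.fit(...) / model.predict(...): predict the training mean.
        prediction = statistics.fmean([y[i] for i in train_idx])
        y_test = [y[i] for i in test_idx]
        fold_scores.append(regression_errors(y_test, [prediction] * len(y_test)))
    return tuple(statistics.fmean(metric) for metric in zip(*fold_scores))


# Toy hourly consumption values in kWh (made up for illustration).
toy_consumption = [0.1 * (hour % 24) for hour in range(240)]
mse, mae, med_ae = cross_validate(toy_consumption, n_splits=10, seed=42)
print(f"MSE={mse:.3f}  MAE={mae:.3f}  medAE={med_ae:.3f}")
```

Because the shuffle is driven by an explicit seed, two runs over different feature sets see exactly the same folds, which is what makes a row-by-row comparison such as Table III fair.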
Table III presents the impact of the feature sets
on model performance. From top to bottom, each row adds a
set of features. The “raw” feature set contains only the
instantaneous energy consumption collected by IEDs. The
“statistical” feature set adds the rolling average and standard
deviation over the last 10 hours. The “weather” feature set adds
attributes regarding outside temperature, humidity, pressure,
weather condition, and theoretical solar altitude, azimuth and
radiation. The “building properties” feature set adds attributes
of each household: house type, house facing direction, number of
EVs, and type of heating system.
The “time” feature set adds the percentage of day, week,
and year elapsed. The “geolocation” set adds geographical
longitude and latitude. Finally, the “sociological” feature set
adds information regarding holidays, weekday, weekend, and
region.
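To make two of these feature sets concrete, the sketch below derives the “statistical” features (rolling average and standard deviation over the last 10 readings) and two of the “time” features (share of the day and of the year elapsed) from hourly readings. It is an illustration under assumptions: the function names are hypothetical, the elapsed shares are returned as fractions in [0, 1) rather than percentages, and the paper's own pipeline is built on Pandas rather than the standard library.

```python
from datetime import datetime
import statistics


def rolling_stats(values, window=10):
    """Rolling mean and sample std over the last `window` readings, current one included."""
    out = []
    for i in range(len(values)):
        recent = values[max(0, i - window + 1): i + 1]
        mean = statistics.fmean(recent)
        std = statistics.stdev(recent) if len(recent) > 1 else 0.0
        out.append((mean, std))
    return out


def day_percent(ts):
    """Fraction of the day elapsed at timestamp `ts`."""
    seconds = ts.hour * 3600 + ts.minute * 60 + ts.second
    return seconds / 86400


def year_percent(ts):
    """Fraction of the year elapsed at timestamp `ts` (leap years handled by the dates)."""
    start = datetime(ts.year, 1, 1)
    end = datetime(ts.year + 1, 1, 1)
    return (ts - start) / (end - start)


# Made-up hourly readings in kWh, one per hour starting at midnight.
readings = [0.30, 0.28, 0.25, 0.27, 0.40, 0.65]
for hour, (mean, std) in enumerate(rolling_stats(readings)):
    ts = datetime(2018, 6, 1, hour)
    print(ts.isoformat(), round(mean, 3), round(std, 3),
          round(day_percent(ts), 3), round(year_percent(ts), 3))
```

Encoding time as an elapsed share keeps the feature numeric and bounded, so a tree-based model can split on it directly.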
Table III shows that adding features to the data
significantly improves the estimation performance. The
improvement can be observed as a consistent decrease of the mean
squared error (MSE) of the prediction (top to bottom).
The first row shows that using (only current) raw values, the
model achieves 0.343 kWh mean squared error. By adding the
“statistical” feature set, MSE decreases to 0.327 kWh. The
most significant improvement is observed when weather data
is added, where MSE drops from 0.327 to 0.293 kWh. By
adding building properties in addition to raw, statistical, and
weather data, MSE further decreases to 0.265 kWh. By adding the
time feature set, MSE decreases to 0.260 kWh. Finally, a minor
improvement comes from the geolocation and sociological feature
sets, where MSE decreases from 0.260 to 0.258 kWh.
We observe that new features, even ones that may at first
glance seem unrelated to energy, can significantly improve
energy consumption prediction performance.
By adding new sets of features, we reduced the MSE from
0.343 kWh to 0.258 kWh, a reduction of about 25%
(equivalently, the raw-only MSE is about 33% higher than the
final one).
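The gains discussed above can be recomputed directly from the MSE column of Table III. The short sketch below does this; the values are copied from the table, while the variable and function names are chosen only for illustration.

```python
# MSE values copied from Table III, in the order the feature sets are added.
mse_by_feature_set = {
    "raw": 0.343,
    "statistical": 0.327,
    "weather": 0.293,
    "building properties": 0.265,
    "time": 0.260,
    "geolocation": 0.259,
    "sociological": 0.258,
}


def relative_reduction(before, after):
    """Share of the original error removed, e.g. 0.25 means 25% lower."""
    return (before - after) / before


names = list(mse_by_feature_set)
# Absolute MSE gain contributed by each newly added feature set.
step_gain = {
    cur: mse_by_feature_set[prev] - mse_by_feature_set[cur]
    for prev, cur in zip(names, names[1:])
}

baseline = mse_by_feature_set["raw"]
final = mse_by_feature_set["sociological"]
overall_reduction = relative_reduction(baseline, final)  # relative to the raw baseline
baseline_excess = baseline / final - 1                    # relative to the final model

for name, gain in step_gain.items():
    print(f"{name:20s} -{gain:.3f}")
print(f"overall reduction: {overall_reduction:.1%}")
print(f"raw MSE exceeds final MSE by: {baseline_excess:.1%}")
```

The two percentages differ only in the reference value: one is measured against the raw baseline, the other against the final model.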
Fig. 5: Benchmarking feature management pipelines.
C. Benchmark in feature processing pipelines
To manage the features, we consider three approaches
and evaluate their relative performance with respect
to basic steps in the processing pipeline as illustrated in
Figure 5. As presented on the top of Figure 5, all three
pipelines share a common pre-processing step, where Python
scripts perform “basic” data cleaning of raw measurements and
metadata and store the data in the structured Apache Parquet format.
Parquet is a common data source format for feature stores, as
can also be seen from Table II.
In the first benchmarked pipeline in Figure 5, we used
Python with Pandas and Data Version Control (DVC) tools.
Pandas is a specialized tool for manipulating tabular data,
and the DVC tool helps define and execute reproducible
multi-step pipelines. The second benchmarked pipeline uses
Apache Spark controlled by pySpark. Spark runs on a single
machine, where the Master and Worker nodes reside in their own
Docker containers, which are interconnected through a
subnetwork. Finally, the third benchmarked pipeline is the Feast
implementation of the feature store presented in Figure 1 that
is used to access features. Feast accesses cold storage (i.e.,
Parquet files) on each request to deliver the requested features.
TABLE IV: The benchmark results of the three pipelines on
the HUE dataset.
Implementation   Time to process   Time to join & enrich   Time to obtain subset
Python (Pandas) <1 s 1707.308 s 0.617 s
Spark (pySpark) - 1050.897 s 6.941 s
Feature Store (Feast) - 1.235 s 20.581 s
Note: Benchmarked on AMD Ryzen 5 3600 (6c/12t), 32GB DDR4
3200MHz, Samsung NVMe storage, Ubuntu Server 20.04
The results of the benchmark are summarized in Table IV.
The common preprocessing step, which deals with “basic” data
cleaning of raw measurements and metadata and stores the data
in the structured Apache Parquet format, takes less than a second.
This relatively short execution time is due to the modest size of
the data, up to 744 000 rows.
The join & enrich steps with Python took the longest to
complete. It took Python 1707 seconds using Pandas to merge
(i.e., SQL LEFT JOIN operation) three tables together and
generate new features. However, pure Python was the fastest
at retrieving a subset of the merged dataset, which took 0.6
seconds. When the intermediate Parquet file with all
the features exceeds the system memory size, this approach
requires extra engineering and may not scale well.
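The join & enrich step amounts to two chained LEFT JOINs over three tables. The sketch below spells out the semantics with plain dictionaries so that it runs without dependencies; in Pandas the same step would be expressed with `DataFrame.merge(..., how="left")`. The table contents, column names and the `left_join` helper are assumptions made for illustration, not the schema used in the benchmark.

```python
def left_join(left_rows, right_rows, keys):
    """LEFT JOIN: keep every left row; attach right columns when the keys match."""
    index = {}
    for row in right_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    right_cols = {col for row in right_rows for col in row} - set(keys)

    joined = []
    for row in left_rows:
        matches = index.get(tuple(row[k] for k in keys))
        if matches:
            joined.extend({**row, **match} for match in matches)
        else:
            # No partner on the right: keep the row and fill the new columns with None.
            joined.append({**row, **{col: None for col in right_cols}})
    return joined


# Hypothetical miniature versions of the three tables.
measurements = [
    {"residential_id": 1, "hour": 0, "energy_kwh": 0.42},
    {"residential_id": 2, "hour": 0, "energy_kwh": 0.80},
]
buildings = [{"residential_id": 1, "region": "A", "house_type": "bungalow"}]
weather = [{"region": "A", "hour": 0, "temperature_c": 4.0}]

enriched = left_join(
    left_join(measurements, buildings, ["residential_id"]),
    weather,
    ["region", "hour"],
)
for row in enriched:
    print(row)
```

Household 2 has no metadata in this toy example, so it survives both joins with `None` in the added columns, which is the behaviour that distinguishes a LEFT JOIN from an inner join.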
The approach with Spark was faster at merging the datasets,
taking 1051 seconds, which is approximately 11 minutes faster
than the pure Python approach. The faster execution is because
most of the tabular data operations on Spark can utilize
multiple threads and multiple workers (distributed). However, the
distributed approach comes with an overhead of synchronization
between workers and controller nodes, especially when
flushing the output to the storage. Because of this overhead,
Spark took longer, 7 seconds, to retrieve the subset of data.
However, Spark would scale better with a large intermediate
dataset. This is because Spark can scale in the number of
workers and utilize distributed filesystems, such as HDFS and
GlusterFS.
The approach with the feature store is a bit different from
the pure Python and Spark approaches. Feast took around one
second to “merge” the datasets. However, Feast
does not “merge” anything at the preparation phase. Instead,
it checks the intermediate files (i.e., Parquet files from the first
stage) and constructs data samples only when requested at the
retrieval stage. The burden of joining data is pushed to the
retrieval phase, which is why Feast requires the longest to
retrieve the subset, at approximately 21 seconds. While this
is the slowest retrieval time, it can be significantly improved
by adding hot storage (e.g., Redis), which the official
documentation expects to be used in production deployments.
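The difference between merging up front and joining at retrieval time, and the role of hot storage, can be illustrated with a small stand-in class. This is a conceptual sketch only: `LazyFeatureView`, its methods and the in-memory dictionary playing the part of hot storage are hypothetical and do not reproduce Feast's API.

```python
class LazyFeatureView:
    """Joins source tables only when features are requested (hypothetical, not Feast's API)."""

    def __init__(self, measurements, buildings):
        self._measurements = list(measurements)  # stand-in for cold storage (Parquet files)
        self._buildings = {b["residential_id"]: b for b in buildings}
        self._hot = {}                           # in-memory stand-in for hot storage
        self.cold_reads = 0

    def get_features(self, residential_id, hour):
        key = (residential_id, hour)
        if key in self._hot:                     # served from hot storage, no join needed
            return self._hot[key]
        self.cold_reads += 1                     # scan cold storage and join on demand
        row = next(
            (m for m in self._measurements
             if m["residential_id"] == residential_id and m["hour"] == hour),
            None,
        )
        if row is None:
            return None
        features = {**row, **self._buildings.get(residential_id, {})}
        self._hot[key] = features
        return features

    def ingest(self, new_measurement):
        """New data is appended to the source table; nothing is re-merged."""
        self._measurements.append(new_measurement)


view = LazyFeatureView(
    measurements=[{"residential_id": 1, "hour": 0, "energy_kwh": 0.42}],
    buildings=[{"residential_id": 1, "house_type": "bungalow"}],
)
first = view.get_features(1, 0)    # joined on request from cold storage
second = view.get_features(1, 0)   # answered from the hot cache
view.ingest({"residential_id": 1, "hour": 1, "energy_kwh": 0.39})
latest = view.get_features(1, 1)   # new sample is available without a re-merge
print(first, view.cold_reads, latest)
```

The preparation phase is cheap because it only registers the sources, the first retrieval pays for the join, and repeated retrievals are served from the cache.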
One significant benefit of a feature store (i.e., Feast) is the
handling of new incoming data. A feature store requires only the
processed files to be updated before new data can be accessed.
The Python and Spark approaches presented here would have to redo
the merging of the intermediate dataset with all features before
new samples are accessible.
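The comparisons drawn from Table IV reduce to simple arithmetic on the reported timings. The sketch below recomputes them; the numbers are copied from the table, the shared pre-processing step (under one second) is left out, and the variable names are chosen only for illustration.

```python
# Timings in seconds, copied from Table IV (shared pre-processing step omitted).
timings = {
    "Python (Pandas)": {"join_enrich": 1707.308, "subset": 0.617},
    "Spark (pySpark)": {"join_enrich": 1050.897, "subset": 6.941},
    "Feature Store (Feast)": {"join_enrich": 1.235, "subset": 20.581},
}

# End-to-end time from cleaned Parquet files to a retrieved subset.
totals = {name: t["join_enrich"] + t["subset"] for name, t in timings.items()}

pandas_join = timings["Python (Pandas)"]["join_enrich"]
spark_join = timings["Spark (pySpark)"]["join_enrich"]
spark_gain_min = (pandas_join - spark_join) / 60

feast_saving = 1 - totals["Feature Store (Feast)"] / totals["Python (Pandas)"]

for name, total in totals.items():
    print(f"{name:24s} {total:9.3f} s")
print(f"Spark merges about {spark_gain_min:.1f} minutes faster than Pandas")
print(f"Feast needs {feast_saving:.1%} less end-to-end time than Pandas")
```

Note that the Feast saving applies to a single preparation followed by a single retrieval; a workload with many retrievals from cold storage would shift the balance, since every retrieval repeats the join.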
VI. CONCLUSIONS
In this paper we presented a study on data models,
energy feature engineering and feature management systems for
developing ML-based energy applications. We first proposed
and presented a taxonomy for data model design of available
features applicable in developing ML applications in the energy
domain. The three main categories of features identified in
the taxonomy are: behavioral, contextual and domain specific.
We then discussed techniques for feature generation and showed
that they can improve the performance of ML models on
an energy consumption forecasting example. More recently,
features have come to be managed by dedicated systems, and we
analyzed existing designs. We also prototyped and evaluated three
complementary feature management solutions and showed that
an open-source feature store solution can significantly reduce
the time needed to develop new data models. Compared to
currently used solutions, a feature store can take up to 99%
less time to process, enrich and obtain the
features needed for production-ready model development.
ACKNOWLEDGMENT
This work was funded by the Slovenian Research Agency
under the Grant P2-0016 and the European Commission under
grant number 872613.
REFERENCES
[1] C. Fortuna, H. Yetgin, and M. Mohorcic, “Ai-enabled life cycle automa-
tion of smart infrastructures,” Industrial Electronics Magazine, 2022.
[2] G. Dileep, “A survey on smart grid technologies and applications,”
Renewable energy, vol. 146, pp. 2589–2625, 2020.
[3] H.-J. Chuang, C.-C. Wang, L.-T. Chao, H.-M. Chou, T.-I. Chien, and
C.-Y. Chuang, “Monitoring the daily life of the elderly using the energy
management system,” in Innovation in Design, Communication and
Engineering. CRC Press, 2020, pp. 101–106.
[4] Y. Wang, Q. Chen, T. Hong, and C. Kang, “Review of smart meter
data analytics: Applications, methodologies, and challenges,” IEEE
Transactions on Smart Grid, vol. 10, no. 3, pp. 3125–3148, 2018.
[5] H. Quan, A. Khosravi, D. Yang, and D. Srinivasan, “A survey of compu-
tational intelligence techniques for wind power uncertainty quantification
in smart grids,” IEEE Transactions on Neural Networks and Learning
Systems, vol. 31, no. 11, pp. 4582–4599, 2020.
[6] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth et al., “Knowledge
discovery and data mining: Towards a unifying framework.” in KDD,
vol. 96, 1996, pp. 82–88.
[7] P. Ruf, M. Madan, C. Reich, and D. Ould-Abdeslam, “Demystifying
mlops and presenting a recipe for the selection of open-source tools,”
Applied Sciences, vol. 11, no. 19, p. 8861, 2021.
[8] B. Allen, S. Agarwal, J. Kalpathy-Cramer, and K. Dreyer, “Democra-
tizing ai,” Journal of the American College of Radiology, vol. 16, no. 7,
pp. 961–963, 2019.
[9] G. Hernández-Moral, S. Mulero-Palencia, V. I. Serna-González,
C. Rodríguez-Alonso, R. Sanz-Jimeno, V. Marinakis, N. Dimitropoulos,
Z. Mylona, D. Antonucci, and H. Doukas, “Big data value chain:
Multiple perspectives for the built environment,” Energies, vol. 14,
no. 15, 2021. [Online]. Available: https://www.mdpi.com/1996-1073/14/15/4624
[10] J. Patel, “Unification of machine learning features,” in 2020 IEEE 44th
Annual Computers, Software, and Applications Conference (COMP-
SAC), 2020, pp. 1201–1205.
[11] G. Cerar, H. Yetgin, M. Mohorčič, and C. Fortuna, “Learning to fairly
classify the quality of wireless links,” in 2021 16th Annual Conference
on Wireless On-demand Network Systems and Services Conference
(WONS). IEEE, 2021, pp. 1–8.
[12] W. Kong, Z. Y. Dong, Y. Jia, D. J. Hill, Y. Xu, and Y. Zhang, “Short-term
residential load forecasting based on lstm recurrent neural network,”
IEEE Transactions on Smart Grid, vol. 10, no. 1, pp. 841–851, 2017.
[13] X. M. Zhang, K. Grolinger, M. A. M. Capretz, and L. Seewald,
“Forecasting residential energy consumption: Single household perspective,”
in 2018 17th IEEE International Conference on Machine Learning and
Applications (ICMLA), 2018, pp. 110–117.
[14] C.-G. Lim and H.-J. Choi, “Deep learning-based analysis on monthly
household consumption for different electricity contracts,” in 2020 IEEE
International Conference on Big Data and Smart Computing (BigComp),
2020, pp. 545–547.
[15] L. Datastore, “Photovoltaic (pv) solar panel energy generation
data,” 2021. [Online]. Available: http://data.europa.eu/88u/dataset/photovoltaic-pv-solar-panel-energy-generation-data
[16] J. Soares, Z. Vale, B. Canizes, and H. Morais, “Multi-objective parallel
particle swarm optimization for day-ahead vehicle-to-grid scheduling,”
in 2013 IEEE Computational Intelligence Applications in Smart Grid
(CIASG), 2013, pp. 138–145.
[17] Sandoval, “Wind power generation data,” 2021. [Online]. Available:
https://www.kaggle.com/datasets/jorgesandoval/wind-power-generation
[18] T. Kriechbaumer and H.-A. Jacobsen, “Blond, a building-level office
environment dataset of typical electrical appliances,” Scientific data,
vol. 5, no. 1, pp. 1–14, 2018.
[19] S. Makonin, “Hue: The hourly usage of energy dataset for
buildings in british columbia,” Tech. Rep., 2018. [Online].
Available: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/N3HGRN
[20] M. Spichakova, J. Belikov, K. Nou, and E. Petlenkov, “Feature engi-
neering for short-term forecast of energy consumption,” 09 2019, pp.
1–5.
[21] Kannal, “Solar power generation data,” 2020. [Online]. Available:
https://www.kaggle.com/datasets/anikannal/solar-power-generation-data
[22] C. Gaete-Morales, “emobpy: application for the german case,” Feb.
2021. [Online]. Available: https://doi.org/10.5281/zenodo.4514928
[23] Lafaz, “Wind energy in germany,” 2019. [Online]. Available:
https://www.kaggle.com/datasets/aymanlafaz/wind-energy-germany
[24] B. Völker, M. Pfeifer, P. M. Scholl, and B. Becker, “Fired: A
fully-labeled high-frequency electricity disaggregation dataset,” in
Proceedings of the 7th ACM International Conference on Systems for
Energy-Efficient Buildings, Cities, and Transportation, ser. BuildSys ’20.
New York, NY, USA: Association for Computing Machinery, 2020, p.
294–297. [Online]. Available: https://doi.org/10.1145/3408308.3427623
[25] C. O. Adika and L. Wang, “Short term energy consumption prediction
using bio-inspired fuzzy systems,” in 2012 North American Power
Symposium (NAPS), 2012, pp. 1–6.
[26] J. Kelly and W. Knottenbelt, “The uk-dale dataset, domestic appliance-
level electricity demand and whole-house demand from five uk homes,”
Scientific data, vol. 2, no. 1, pp. 1–14, 2015.
[27] T. Parkinson, S. Schiavon, R. de Dear, and G. Brager, “Overcooling of
offices reveals gender inequity in thermal comfort,” Scientific reports,
vol. 11, no. 1, pp. 1–7, 2021.
[28] D. Murray, L. Stankovic, and V. Stankovic, “An electrical load mea-
surements dataset of united kingdom households from a two-year
longitudinal study,” Scientific data, vol. 4, no. 1, pp. 1–12, 2017.
[29] C. Beckel, W. Kleiminger, R. Cicchetti, T. Staake, and S. Santini, “The
eco data set and the performance of non-intrusive load monitoring
algorithms,” in Proceedings of the 1st ACM conference on embedded
systems for energy-efficient buildings, 2014, pp. 80–89.
[30] S. Rennie, C. Andrews, S. Atkinson, D. Beaumont, S. Benham, V. Bow-
maker, J. Dick, B. Dodd, C. McKenna, D. Pallett et al., “The uk en-
vironmental change network datasets–integrated and co-located data for
long-term environmental research (1993–2015),” Earth System Science
Data, vol. 12, no. 1, pp. 87–107, 2020.
[31] J. Z. Kolter and M. J. Johnson, “Redd: A public data set for energy
disaggregation research,” in Workshop on data mining applications in
sustainability (SIGKDD), San Diego, CA, vol. 25, no. Citeseer, 2011,
pp. 59–62.
[32] MetOffice, “Midas open: Uk daily temperature data,
v202107,” 2021. [Online]. Available: http://dx.doi.org/10.5285/92e823b277cc4f439803a87f5246db5f
[33] N. Batra, M. Gulati, A. Singh, and M. B. Srivastava, “It’s different:
Insights into home energy consumption in india,” in Proceedings of
the 5th ACM Workshop on Embedded Systems For Energy-Efficient
Buildings, 2013, pp. 1–8.
[34] W. I. Y. Keller and J. D. Keller, “Daily weather data averages
for Germany aggregated over official weather stations,” Jun. 2021.
[Online]. Available: https://doi.org/10.5281/zenodo.5015006
[35] N. Batra, O. Parson, M. Berges, A. Singh, and A. Rogers, “A comparison
of non-intrusive load monitoring methods for commercial and residential
buildings,” 2014.
[36] Kukreja, “Delhi weather data,” 2017. [Online]. Available:
https://www.kaggle.com/datasets/mahirkukreja/delhi-weather-data
[37] G. Stamatescu, “Hvac air handling units: One-year data from
medium-to-large size academic building,” 2019. [Online]. Available:
https://dx.doi.org/10.21227/0kbv-zs06
[38] H. Li, Z. Wang, and T. Hong, “A synthetic building operation dataset,”
Scientific data, vol. 8, no. 1, pp. 1–13, 2021.
[39] M. Sinimaa, M. Spichakova, J. Belikov, and E. Petlenkov, “Feature en-
gineering of weather data for short-term energy consumption forecast,”
in 2021 IEEE Madrid PowerTech, 2021, pp. 1–6.
[40] T. Lambrou, P. Kudumakis, R. Speller, M. Sandler, and A. Linney,
“Classification of audio signals using statistical features on time and
wavelet transform domains,” in Proceedings of the 1998 IEEE Interna-
tional Conference on Acoustics, Speech and Signal Processing, ICASSP
’98 (Cat. No.98CH36181), vol. 6, 1998, pp. 3621–3624 vol.6.
[41] Y. Lei and Z. Wu, “Time series classification based on statistical fea-
tures,” EURASIP Journal on Wireless Communications and Networking,
vol. 2020, no. 1, pp. 1–13, 2020.
[42] M. Christ, N. Braun, J. Neuffer, and A. W. Kempa-Liehr, “Time
series feature extraction on basis of scalable hypothesis tests (tsfresh
– a python package),” Neurocomputing, vol. 307, pp. 72–77, 2018.
[Online]. Available: https://www.sciencedirect.com/science/article/pii/
S0925231218304843
[43] W. Yu, D. An, D. Griffith, Q. Yang, and G. Xu, “Towards statistical
modeling and machine learning based energy usage forecasting in smart
grid,” ACM SIGAPP Applied Computing Review, vol. 15, no. 1, pp.
6–16, 2015.
[44] R. Fagaras, C. Nichiforov, I. Stamatescu, and G. Stamatescu, “Evaluation
of compressed residential energy forecasting models,” in 2021 IEEE
International Conference on Systems, Man, and Cybernetics (SMC),
2021, pp. 1424–1429.
[45] D. Chowdhury, M. Hasan, and M. Z. Rahman Khan, “Statistical features
extraction from current envelopes for non- intrusive appliance load
monitoring,” in 2020 SoutheastCon, 2020, pp. 1–5.
[46] Z. Ouyang, X. Sun, and D. Yue, “Hierarchical time series feature
extraction for power consumption anomaly detection,” in Advanced
Computational Methods in Energy, Power, Electric Vehicles, and Their
Integration, K. Li, Y. Xue, S. Cui, Q. Niu, Z. Yang, and P. Luk, Eds.
Singapore: Springer Singapore, 2017, pp. 267–275.
[47] R. E. Bellman, Adaptive control processes. Princeton university press,
2015.
[48] R. Krishnan, D. Liang, and M. Hoffman, “On the challenges of learning
with inference networks on sparse, high-dimensional data,” in Interna-
tional conference on artificial intelligence and statistics. PMLR, 2018,
pp. 143–151.
[49] H. Abdi and L. J. Williams, “Principal component analysis,” Wiley
interdisciplinary reviews: computational statistics, vol. 2, no. 4, pp. 433–
459, 2010.
[50] L. Van Der Maaten, “Accelerating t-sne using tree-based algorithms,” J.
Mach. Learn. Res., vol. 15, no. 1, p. 3221–3245, jan 2014.
[51] X. He and P. Niyogi, “Locality preserving projections,” Advances in
neural information processing systems, vol. 16, 2003.
[52] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press,
2016, http://www.deeplearningbook.org.
[53] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas,
A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay,
“Scikit-learn: Machine learning in Python,” Journal of Machine
Learning Research, vol. 12, pp. 2825–2830, 2011.
[54] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury,
G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga,
A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani,
S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala,
“Pytorch: An imperative style, high-performance deep learning
library,” in Advances in Neural Information Processing Systems
32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc,
E. Fox, and R. Garnett, Eds. Curran Associates, Inc., 2019,
pp. 8024–8035. [Online]. Available: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
[55] M. Abadi et al., “TensorFlow: Large-scale machine learning on
heterogeneous systems,” 2015, software available from tensorflow.org.
[Online]. Available: http://tensorflow.org/
[56] Y. Zhang, B. Chen, G. Pan, and Y. Zhao, “A novel hybrid model
based on vmd-wt and pca-bp-rbf neural network for short-term wind
speed forecasting,” Energy Conversion and Management, vol. 195,
pp. 180–197, 2019. [Online]. Available: https://www.sciencedirect.com/
science/article/pii/S0196890419305448
[57] L. Ge, Y. Xian, J. Yan, B. Wang, and Z. Wang, “A hybrid model for
short-term pv output forecasting based on pca-gwo-grnn,” Journal of
Modern Power Systems and Clean Energy, vol. 8, no. 6, pp. 1268–1275,
2020.
[58] R. Platon, V. R. Dehkordi, and J. Martel, “Hourly prediction
of a building’s electricity consumption using case-based reasoning,
artificial neural networks and principal component analysis,” Energy
and Buildings, vol. 92, pp. 10–18, 2015. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0378778815000651
[59] R. Machlev, D. Tolkachov, Y. Levron, and Y. Beck, “Dimension re-
duction for nilm classification based on principle component analysis,”
Electric Power Systems Research, vol. 187, p. 106459, 2020.
[60] A. Moradzadeh, O. Sadeghian, K. Pourhossein, B. Mohammadi-Ivatloo,
and A. Anvari-Moghaddam, “Improving residential load disaggregation
for sustainable development of energy via principal component analysis,”
Sustainability, vol. 12, no. 8, p. 3158, 2020.
[61] S. Ghosh, D. K. Panda, S. Das, and D. Chatterjee, “Cross-correlation
based classification of electrical appliances for non-intrusive load mon-
itoring,” in 2021 International Conference on Sustainable Energy and
Future Electric Transportation (SEFET), 2021, pp. 1–6.
[62] D. Kaur, S. N. Islam, and M. A. Mahmud, “A variational autoencoder-
based dimensionality reduction technique for generation forecasting in
cyber-physical smart grids,” in 2021 IEEE International Conference on
Communications Workshops (ICC Workshops), 2021, pp. 1–6.
[63] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” Advances in neural
information processing systems, vol. 25, pp. 1097–1105, 2012.
[64] Z. Zhao, P. Zheng, S. Xu, and X. Wu, “Object detection with deep learn-
ing: A review,” IEEE Transactions on Neural Networks and Learning
Systems, vol. 30, no. 11, pp. 3212–3232, 2019.
[65] Z. Wang and T. Oates, “Encoding time series as images for visual in-
spection and classification using tiled convolutional neural networks,” in
Workshops at the twenty-ninth AAAI conference on artificial intelligence,
vol. 1, 2015.
[66] J. Faouzi and H. Janati, “pyts: A python package for time series
classification,” Journal of Machine Learning Research, vol. 21, no. 46,
pp. 1–6, 2020. [Online]. Available: http://jmlr.org/papers/v21/19-763.
html
[67] D. F. Silva, V. M. D. Souza, and G. E. Batista, “Time series classification
using compression distance of recurrence plots,” in 2013 IEEE 13th
International Conference on Data Mining, 2013, pp. 687–696.
[68] J. Eckmann, S. O. Kamphorst, D. Ruelle et al., “Recurrence plots of
dynamical systems,” World Scientific Series on Nonlinear Science Series
A, vol. 16, pp. 441–446, 1995.
[69] B. Bertalanič, J. Jenko, and C. Fortuna, “Dimensionality expansion
and transfer learning for next generation energy management systems,”
2022. [Online]. Available: https://arxiv.org/abs/2204.02802
[70] M. Mottahedi and S. Asadi, “Non-intrusive load monitoring using imag-
ing time series and convolutional neural networks,” in 16th International
Conference on computing in civil and building engineering, 2016, pp.
705–710.
[71] D. Jia, X. Huang, Z. Du, R. Li, and K. Li, “Identification of electrical
equipment based on two-dimensional time series characteristics of
power,” in IOP Conference Series: Materials Science and Engineering,
vol. 768, no. 6. IOP Publishing, 2020, p. 062019.
[72] L. Kyrkou, C. Nalmpantis, and D. Vrakas, “Imaging time-series for
nilm,” in International Conference on Engineering Applications of
Neural Networks. Springer, 2019, pp. 188–196.
[73] S. R. Tito, A. Ur Rehman, Y. Kim, P. Nieuwoudt, S. Aslam, S. Soltic,
T. T. Lie, N. Pandey, and M. D. Ahmed, “Image segmentation-based
event detection for non-intrusive load monitoring using gramian angular
summation field,” in 2021 IEEE Industrial Electronics and Applications
Conference (IEACon), 2021, pp. 185–190.
[74] A. Faustine and L. Pereira, “Improved appliance classification in
non-intrusive load monitoring using weighted recurrence graph and
convolutional neural networks,” Energies, vol. 13, no. 13, 2020.
[Online]. Available: https://www.mdpi.com/1996-1073/13/13/3374
[75] L. De Baets, J. Ruyssinck, C. Develder, T. Dhaene, and D. Deschrijver,
“Appliance classification using vi trajectories and convolutional neural
networks,” Energy and Buildings, vol. 158, pp. 32–36, 2018.
[Online]. Available: https://www.sciencedirect.com/science/article/pii/
S0378778817312690
[76] Y.-Y. Hong, J. J. F. Martinez, and A. C. Fajardo, “Day-ahead solar
irradiation forecasting utilizing gramian angular field and convolutional
long short-term memory,” IEEE Access, vol. 8, pp. 18741–18753, 2020.
[77] H. Chen, J. Wang, and D. Shi, “Spatial-temporal correlation-concerned
measurement manipulation detection based on gramian angular sum-
mation field and convolutional neural networks,” in 2021 IEEE 4th
International Electrical and Energy Conference (CIEEC), 2021, pp. 1–6.
[78] H. Bousbiat, C. Klemenjak, and W. Elmenreich, “Exploring time series
imaging for load disaggregation,” in Proceedings of the 7th ACM
International Conference on Systems for Energy-Efficient Buildings,
Cities, and Transportation, 2020, pp. 254–257.
[79] T. Kakantousis, A. Kouzoupis, F. Buso, G. Berthou, J. Dowling, and
S. Haridi, “Horizontally scalable ML pipelines with a feature store,” in
Proc. 2nd SysML Conf., Palo Alto, USA, 2019.
[80] I. Reda and A. Andreas, “Solar position algorithm for solar radiation
applications,” Solar energy, vol. 76, no. 5, pp. 577–589, 2004.