Content uploaded by Baptiste Feron

Author content

All content in this area was uploaded by Baptiste Feron on Jul 02, 2018

Content may be subject to copyright.



Probabilistic Forecasting of Household Electrical

Load Using Artiﬁcial Neural Networks

Julian Vossen, Baptiste Feron, Antonello Monti

Institute for Automation of Complex Power Systems, E.ON Energy Research Center, RWTH Aachen University,

Germany

julian.vossen@rwth-aachen.de; [bferon, amonti]@eonerc.rwth-aachen.de

Abstract—For optimizing the usage of electricity, energy man-

agement systems require forecasts of electrical consumption on

a single-household level. As this consumption is subject to high

uncertainties, state-of-the-art point-forecasting methods fail to

provide accurate predictions. In order to overcome this challenge,

this paper incorporates the uncertainty into a probabilistic fore-

cast using density-estimating Artiﬁcial Neural Networks. As such,

Mixture Density Networks (MDN) and Softmax Regression Net-

works (SRN) are implemented and compared on three different

datasets over a broad range of hyper-parameter conﬁgurations.

The evaluation shows that both neural network models generate

reliable forecasts of the probability density over the future

consumption, which signiﬁcantly outperform an unconditional

benchmarking model. Furthermore, the experiments demonstrate

that a decreased dataset granularity and lagged inputs improve the forecasts, while additional calendar inputs and an increased lagged-input length have little effect.

Index Terms—STLF, Smart meter, Neural network, Probabilistic forecasting

NOMENCLATURE

F        Cumulative distribution function
E        Error function used to fit the ANN model
x        ANN input vector
y        ANN output vector with forecasted value
ŷ        Observed or real value
p(y|x)   Probability density of y given x
S        Scoring function
X        Set of conditioning variables
σ        Standard deviation
µ_i      Centroids (MDN)
φ_i      Kernel functions (MDN)
α_i      Mixing coefficients (MDN)

I. INTRODUCTION

THE increase of distributed energy resources is one of

the biggest challenges faced by the energy sector in the

upcoming years due to their volatile and irregular nature. At

the same time, the spread of domestic smart appliances and storage systems enables consumers to become an active part of

the grid. In the future, the domestic ﬂexibility sources could

be controlled through an Energy Management System (EMS) in

order to provide grid services, reduce energy costs or reduce

carbon emissions [1]. In the literature, EMS approaches are mainly based on optimization formulations and require reliable

short-term load forecasts (STLF) [2].

Therefore, research [3], [4], [5], [6] has been focused on

generating STLF on a single-household level using methods

such as Auto-Regressive Moving-Average, Artiﬁcial Neural

Networks (ANNs) and Support Vector Regression. Most com-

monly, these methods output a single point forecast per time

step and are therefore referred to as point-forecasting methods.

However, as consumption on a household level is subject

to high volatility and unpredictable human behaviour, these

point-forecasts bear high errors. In fact, depending on the

dataset used, even advanced models fail to outperform naive

benchmarking methods [4].

As a way of dealing with this high uncertainty, probabilistic

forecasting methods provide information on the distribution

of future values. This can be in the form of intervals with

assigned probabilities or probability density functions (PDFs).

When probabilistic forecasts of the future electrical consump-

tion are available, a stochastic optimization of the consumption

can account for the uncertainty. Thereby, it can be ensured

that the decision strategy is not only locally optimal for

the expected value of the future consumption, but a global

optimum over the predicted distribution of future values [7].

However, literature on probabilistic forecasting of electrical

load on a single-household level is still sparse [8]. To the best of the authors’ knowledge, only two studies are available to date. In [9], density forecasts are generated using

conditional kernel density estimation and in [10] using additive

quantile regression. While ANNs have often been applied for

point-forecasting, their transfer to probabilistic electrical load forecasting is still missing. In addition, the forecasts in

[9] and [10] were evaluated on a dataset of low temporal

resolution. Therefore, the effect of decreasing the dataset

granularity shall be investigated.

Aside from load forecasting, density-estimating ANNs have

been successfully applied to generate probabilistic forecasts in

other domains where forecasts are subject to a high uncertainty

[11], [12]. The objective of this study is to enrich the STLF

literature by

•presenting the implementation of Mixture Density Net-

works (MDN) and Softmax Regression Networks (SRN)

for modelling probability density;

•evaluating and comparing these approaches to an uncon-

ditional benchmarking forecasting method;


Fig. 1: Mixture Density Network: The output of a neural network (input x; outputs α_i(x), µ_i(x), σ_i(x)) parametrizes a Gaussian mixture model yielding p(y|x).

•studying the influence of a wide range of model hyper-parameters (dataset granularity, input configuration, ANN architecture).

This study is structured into three parts. First, we provide a brief overview of the methods used in this study. Secondly,

we describe the setup and implementation of the experiments

conducted in this study and ﬁnally, we provide and discuss

the results of these experiments.

II. METHOD FUNDAMENTALS

In this section we provide a brief overview of the forecasting methods used in this study. Before covering the density-estimating neural networks, the following paragraph provides

a short introduction for those readers unfamiliar with neural

networks. Throughout the rest of this paper, y refers to the forecasted value and x to the lagged inputs used to forecast this value.

A. Artiﬁcial Neural Networks

Neural networks are computing structures, which consist

of interconnected artiﬁcial neurons. An artiﬁcial neuron is

a function that computes a single output by calculating the

weighted sum of its inputs and applying a non-linear activation

function, e.g. exponential, softmax. Many of such neurons are

connected in layers to form a network, whereby the output of

one layer is fed as input to the following layer. By adjusting

the input weights of each neuron, the resulting network can

be ﬁt to map an input vector to an output vector. With mild

assumptions on the activation function, neural networks can

be thought of as universal function approximators. Fitting

the network weights to represent a function given observed

input and output examples can be done by backpropagation.

Thereby, a so-called error function quantifies how effectively the network captures the relation between inputs and outputs of the

training examples. Then, the network weights are iteratively

updated towards the direction of a reducing error function.
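As an illustration of the structure just described, a single neuron and a layer can be sketched in a few lines of NumPy. This is a hypothetical sketch for readers new to neural networks; the names `neuron` and `layer` are ours, not from the paper's implementation:

```python
import numpy as np

def neuron(x, w, b, activation=np.tanh):
    """One artificial neuron: weighted sum of the inputs plus a bias,
    passed through a non-linear activation function."""
    return activation(np.dot(w, x) + b)

def layer(x, W, b, activation=np.tanh):
    """A layer evaluates many neurons in parallel; feeding one layer's
    output as input to the next layer forms the network."""
    return activation(W @ x + b)
```

Fitting then amounts to adjusting `w` and `b` of every neuron so that an error function over the training examples decreases.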

B. Mixture Density Networks

Conventional least-square regression neural networks can

be derived from maximum likelihood by assuming the target

data to be Gaussian distributed [13]. This motivates the idea

of replacing the Gaussian distribution with a mixture model,

which can model generic distribution functions [13]. Hence,

Fig. 2: Softmax Regression Network: The output of a neural network represents the probability of class membership. In this case class membership means the prediction variable y falls into the respective interval.

the probability density of the target data is represented as a

linear combination of kernel functions (Eq. 1)

p(y|x) = Σ_{i=1}^{m} α_i(x) φ_i(y|x)    (1)

where α_i(x) are mixing coefficients conditioned on the input vector x and φ_i(y|x) represents a kernel function. Gaussian kernels (Eq. 2) are used in this study as in [13].

φ_i(y|x) = 1 / (√(2π) σ_i(x)) · exp( −(y − µ_i(x))² / (2σ_i(x)²) )    (2)

with σ_i(x) as the standard deviations conditioned on x. Therefore, the output layer of the neural network resembles a parameter vector [α_i(x), µ_i(x), σ_i(x)]. The architecture of the mixture

density model is shown in Figure 1.

Using appropriate activation functions in the output layer ensures that the network outputs valid parameter vectors.

In this paper, a softmax activation is used for the mixing

coefﬁcients α, and a simple exponential function for the

standard deviations σ, while the means are unrestricted. In

a post-processing step, the probability density is assigned entirely to non-negative load values: the cumulative distribution function is set to zero for negative electrical loads.

The network can be fit to observations using backpropagation. To this end, an error function is defined to quantify, as a single scalar, the quality of the forecasted PDF given the observations. The error function E(y, ŷ) is constructed using the

maximum likelihood criterion by taking the negative logarithm

of the likelihood, also called negative log-likelihood (Eq. 3).

E(y, ŷ) = −ln L(ŷ|x) = −ln p(ŷ|x)    (3)
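Eqs. 1 to 3 can be made concrete with a small NumPy sketch of the MDN output mapping and error function. It is illustrative only (the paper's models are implemented in Keras and TensorFlow) and all function names are our own:

```python
import numpy as np

def mdn_params(raw, m):
    """Map raw network outputs of length 3*m, laid out as
    [alpha_logits, mu, log_sigma], to valid mixture parameters.
    Softmax keeps the mixing coefficients positive and summing to one;
    the exponential keeps the standard deviations positive; means are free."""
    logits, mu, log_sigma = raw[:m], raw[m:2 * m], raw[2 * m:]
    e = np.exp(logits - logits.max())   # numerically stable softmax
    alpha = e / e.sum()
    sigma = np.exp(log_sigma)
    return alpha, mu, sigma

def mdn_density(y, alpha, mu, sigma):
    """p(y|x): mixture of Gaussian kernels (Eq. 1 with kernels of Eq. 2)."""
    phi = np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return float(np.sum(alpha * phi))

def mdn_nll(y_obs, alpha, mu, sigma):
    """Negative log-likelihood error function (Eq. 3)."""
    return -np.log(mdn_density(y_obs, alpha, mu, sigma))
```

The softmax and exponential mappings mirror the output-layer activations described above.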

MDNs have been successfully applied to a wide range of

problems, such as ﬁnancial forecasting [12], weather forecast-

ing [11] or speech synthesis [14], as they can approximate

arbitrary probability distributions. However, there is an alter-

native to MDNs, which approximates the probability density

function at discrete sample points, by binning the output

range of the target variable and applying a softmax activation

function to the network output. This technique is referred to as Softmax Regression Networks (SRNs) and is introduced in the

following section.


C. Softmax Regression Networks

Like the MDNs described above, SRNs can be used to approximate arbitrary probability distributions. Instead of assuming a kernel mixture model parametrized by the neural network

output, each output neuron represents the mean probability

density for a fraction of the output space. These fractions are

referred to as bins (Fig. 2). Applying a softmax function to the output layer normalizes the network output, ensuring that the bin probabilities sum to one. Hence, the output of each neuron can also be interpreted as the probability that the target variable y lies in the respective bin.

Analogous to the MDN, the negative log-likelihood can be used as the error function, which for a softmax output layer with discrete bins y_i becomes:

E(y, ŷ) = −ln L(ŷ|x) = −ln p(y_{argmin_i |y_i − ŷ|} | x)    (4)
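A corresponding NumPy sketch for the SRN, again a hypothetical illustration rather than the paper's Keras implementation, evaluates the binned density and the error function of Eq. 4:

```python
import numpy as np

def srn_forecast(logits, edges):
    """Softmax over the output layer gives bin probabilities summing to one.
    edges: bin boundaries of length len(logits) + 1.
    Returns the probabilities and the mean density per bin."""
    e = np.exp(logits - np.max(logits))   # numerically stable softmax
    probs = e / e.sum()
    density = probs / np.diff(edges)      # mean probability density per bin
    return probs, density

def srn_nll(probs, edges, y_obs):
    """Negative log-likelihood (Eq. 4): probability of the bin whose
    centre is closest to the observation."""
    centres = 0.5 * (edges[:-1] + edges[1:])
    i = int(np.argmin(np.abs(centres - y_obs)))
    return -np.log(probs[i])
```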

D. Benchmarking methods

Two benchmarking models are implemented to compare and

evaluate the proposed ANN models.

First, an unconditional model obtains the overall distribution

of the target variable as a histogram from the training data

and returns this histogram as a forecast for future values.

The model is called unconditional because the forecasted distribution is constant over all forecast steps, independent of any conditioning variables.
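A minimal sketch of this unconditional benchmark, assuming NumPy and a hypothetical function name:

```python
import numpy as np

def unconditional_forecast(train_loads, n_bins=50):
    """Histogram of the training targets, returned as the constant
    density forecast used for every future step. The number of bins
    is an assumption, not taken from the paper."""
    density, edges = np.histogram(train_loads, bins=n_bins, density=True)
    return density, edges
```

With `density=True` the histogram integrates to one, so it can be scored like any other density forecast.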

A second benchmark is used to isolate the effect of mod-

elling a time-dependent density compared to a point-forecast.

Instead of training a separate model with a single output

neuron and mean squared error, the forecast is derived from

the predictive conditional mean of a single-component MDN.

In this way, the information on the distribution from the probabilistic forecast is discarded to result in a comparable point-forecast. To compare the point-forecast with a probabilistic one, a Gaussian distribution with a standard deviation obtained from the residuals in the training data is added around the point-forecast. This density forecast summarizes all the information on the uncertainty that a point-forecast provides and results in a homoscedastic benchmarking model, as the variance of the forecasted density is constant over time.
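The constant-variance benchmark can be sketched as follows (an illustrative fragment with hypothetical names; each forecast step is summarized by a Gaussian mean and sigma pair):

```python
import numpy as np

def homoscedastic_forecast(point_forecasts, train_residuals):
    """Wrap a point-forecast into a density forecast: a Gaussian around
    each forecasted mean, with a constant standard deviation estimated
    from the residuals on the training data (homoscedastic by design)."""
    sigma = float(np.std(train_residuals))
    return [(float(mean), sigma) for mean in point_forecasts]
```

Because sigma is shared by all steps, any gain of the ANN models over this benchmark isolates the value of modelling a time-dependent variance.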

E. Evaluation metrics

A scoring function S(p(y), ŷ) is used to evaluate the quality of probabilistic forecasts by comparing a predicted probability density function p(y) to a scalar observation ŷ. To ensure that

the score motivates forecasting the true distribution over any

other, the score needs to be proper. In this paper, forecast

performance is evaluated using the Continuous Ranked Proba-

bility Score (CRPS, Eq. 5) as it is proper and has two favorable

properties [15]:

•its unit is identical to the forecast variable, which makes

it more descriptive than e.g. the logarithmic score;

•for point-forecasts it becomes the absolute error, which

provides a way to compare probabilistic forecasts and

point-forecasts.

TABLE I: Hyper-parameters evaluated during the grid search

Hyper-parameter                 Values
Number of hidden layers         1, 3, 9
Number of hidden neurons        1, 10, 40, 120, 360
Dataset granularity [min]       1, 5, 30
Length of lagged input [min]    1, 30, 60, 360, 1440
Use calendar inputs             True, False
Forecast horizon [min]          60

The CRPS is defined as:

CRPS(F(y), ŷ) = ∫_{−∞}^{+∞} (F(z) − 𝟙(z − ŷ))² dz    (5)

where F(y) is the cumulative distribution function of the forecast and 𝟙(·) is the Heaviside step function.
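On a discretized CDF, Eq. 5 can be approximated numerically as below (an illustrative sketch; the uniform grid is an assumption). For a degenerate point-forecast CDF the value reduces to the absolute error, matching the second property above:

```python
import numpy as np

def crps(cdf_values, grid, y_obs):
    """Numerical approximation of the CRPS (Eq. 5) on a uniform grid:
    integrate the squared difference between the forecast CDF and the
    Heaviside step located at the observation."""
    heaviside = (grid >= y_obs).astype(float)
    dz = grid[1] - grid[0]
    return float(np.sum((cdf_values - heaviside) ** 2) * dz)
```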

III. MODELING AND DATA

The models output the forecasted distribution over the total

consumption within the forecast interval conditioned on a

ﬁxed number of most recent load observations and calendar

variables. As calendar variables we use the time of the day,

the day of the week and the month of the year encoded as

numbers in the interval [0,1]. The load recordings are scaled

into the same order of magnitude. Training examples are then

constructed based on input sequences, calendar variables and

respective consumptions during the forecast horizon. If there

are missing or invalid recordings during the forecast horizon,

the respective example is excluded from training data as it

would cause the model to learn on invalid values. Both models

are implemented using Keras and TensorFlow. Then, separate

models for each household are trained on the ﬁrst 80% of the

training examples, cross-validated using the following 10%

and tested on the most recent 10% of recordings. To provide

insights on how the choice of hyper-parameters affects the

forecasting performance, both models are evaluated on each

dataset for all combinations of the hyper-parameters provided

in Table I.
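As an illustration of this preprocessing, a plausible [0, 1] calendar encoding and the chronological 80/10/10 split might look as follows. The exact encoding used in the paper is not specified, so this sketch is an assumption:

```python
from datetime import datetime

def calendar_features(ts: datetime):
    """Time of day, day of week and month of year, each scaled to [0, 1].
    One plausible encoding; the paper's exact mapping is not given."""
    return (ts.hour / 23.0, ts.weekday() / 6.0, (ts.month - 1) / 11.0)

def chronological_split(examples, train=0.8, val=0.1):
    """Chronological split: train on the oldest 80% of examples,
    cross-validate on the next 10%, test on the most recent 10%."""
    n = len(examples)
    i, j = int(n * train), int(n * (train + val))
    return examples[:i], examples[i:j], examples[j:]
```

Keeping the split chronological avoids leaking future observations into the training data.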

Fig. 3: Typical domestic electrical load consumption (Smart*

dataset [16]) with a high volatility and changing patterns.

While this kind of grid search is not the most efﬁcient way

to ﬁnd a single good-performing hyper-parameter combina-

tion, it allows us to gain insights into how the hyper-parameters

affect the forecasting performance.
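The exhaustive grid search over Table I can be sketched with `itertools.product`; the dictionary keys are hypothetical names, not identifiers from the paper's code:

```python
from itertools import product

# Hyper-parameter grid from Table I (forecast horizon fixed at 60 min).
grid = {
    "hidden_layers": [1, 3, 9],
    "hidden_neurons": [1, 10, 40, 120, 360],
    "granularity_min": [1, 5, 30],
    "lag_length_min": [1, 30, 60, 360, 1440],
    "calendar_inputs": [True, False],
}

def configurations(grid):
    """Yield every hyper-parameter combination as a keyword dictionary."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))
```

Each yielded configuration would be trained and scored per dataset, which is what makes the per-parameter analyses of Section IV possible.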

A. Data

Numerous datasets have been recorded to support research in load forecasting and disaggregation. In this study,


Fig. 4: Mean CRPS value of the unconditional benchmark and

the presented approaches: Softmax Regression Network (SRN)

and Mixture Density Network (MDN).

Fig. 5: Normalized CRPS for different input granularities and

lengths of lagged input; Decreasing the granularity improves

performance.

the Smart* [16], the UK-DALE [17] and a UCI [18] dataset

(Fig. 3) are used to evaluate the proposed forecasting methods,

as these are publicly available, exhibit a ﬁne granularity (<1

minute) and include relatively long recording periods.

IV. HYPER-PARAMETER EFFECTS

This section highlights the impact of the hyper-parameters

on the forecasting performance. For this purpose, a single

hyper-parameter is varied, while the others are unrestricted.

The mean CRPS is normalized by dividing the CRPS of the

model with a speciﬁc hyper-parameter by the overall best

CRPS when all hyper-parameters are unrestricted. That some

values fall below a normalized CRPS of one is a consequence

of the cross-validation. The best performing model on the

cross-validation data is not necessarily the best performing

model on the test data.

The following part further elaborates on A) the impact of the considered model inputs, B) the optimum network configuration, and C) the impact of the number of mixture components for the MDN model.

A. Model inputs

Figure 5 shows that feeding lagged input data in a ﬁner

granularity improves the forecasting performance. However,

increasing the length of the lagged input does not significantly improve the performance. This indicates that the forecast

mostly depends on the most recent observation before the

forecast horizon. The models do not seem to exploit higher-

order patterns in the data. Hence, the increasing performance

for lower granularities is likely an effect of the most recent

Fig. 6: Comparison of different ANN input conﬁgurations

with the benchmark model for different combinations of lagged

inputs (previous power consumption) and calendar variables.

Fig. 7: Comparison of different model architectures; The

overall best conﬁguration is three hidden layers with 100

neurons each. However, performance gains compared to very

small networks are relatively small and depend on the dataset.

recording more accurately approximating the load during the

forecast horizon rather than the exhibition of ﬁne-granular

patterns in the data. Figure 6 shows the effect of conditioning

the models on different inputs. The best performing model

conditioned on both lagged input and calendar variables

is compared to the best performing models conditioned

on either only lagged input or only calendar variables

and the unconditional benchmark. Models based only on

calendar inputs perform slightly better than the unconditional

benchmark. Conditioning models only on lagged input

signiﬁcantly improves the forecasting performance. However,

including calendar variables in addition to lagged inputs has little effect.

B. Network conﬁguration

Different network conﬁgurations (Fig. 7) were investigated.

Results highlight that a configuration of three hidden layers with 100 neurons each achieves the best overall performance. Still, the gains compared to configurations with only a few hidden neurons are relatively small. This small performance difference indicates that the feed-forward ANN structure barely learns complex features in the input data.

C. MDN: Number of mixture components

The performance of the best MDN with only a single Gaussian component is compared to an MDN with five components


Fig. 8: Comparison of different numbers of mixture components; restricting the predictive density to be Gaussian leads

to a worse performance compared to a density mixed from

ﬁve Gaussian components.

Fig. 9: Forecast comparison between a time-dependent and a constant-variance model with identical conditional means.

(Fig. 8). Increasing the number of considered components

allows the forecasted density function to take more generic

shapes instead of restricting it to be one single Gaussian. This

leads to a consistently better forecasting performance.

V. METHOD EVALUATION

This section evaluates the presented models against an unconditional benchmarking model and a second benchmarking model with a time-constant variance (see Section II-D).

A. Benchmarking model versus MDN and SRN models

The MDN and SRN forecasting models are compared to

the unconditional benchmark (Fig. 4 and 6). The presented

results are obtained using, for each model, the overall best-performing combination of hyper-parameters from Table I.

The results highlight that both ANN models achieve a

similar performance and clearly outperform the unconditional

benchmark on the different datasets considered.

B. Time-dependent versus constant variance model

Both presented ANN models output a time-varying variance

of the predictive densities over the forecast steps (Fig. 9),

which means that these models can capture heteroscedasticity.

Fig. 10: Performance comparison of MDN model with time-

dependent and constant variance.

Fig. 11: Reliability plots for one and ﬁve-component MDN

model. Ideal reliability is indicated by the angle-bisector

(dashed line). Single Gaussian forecasts are less reliable than

the more generic ﬁve-component ones.

Therefore, this section aims at evaluating the added value of

considering a time-dependent variance in terms of the CRPS.

The results demonstrate that the time-dependent variance model always scores better than the constant variance model, even though both models have an identical predictive conditional mean (Fig. 10). The better performance of the time-dependent model indicates that the uncertainty can be well captured and highlights the added value of forecasting methods that capture a time-dependent variance, unlike the state-of-the-art point-forecasting methods.

C. Reliability

The forecast reliability or calibration describes the statistical

compatibility between the forecasted PDF and the realizations.

This means that if the model assigns a probability p to an outcome, the proportion of realizations matching this outcome should converge towards p for a large number of experiments.

This behaviour can be evaluated using reliability plots, where

the frequency of observations falling into the predicted quan-

tile is plotted over the predictive quantile itself (see Fig. 11).
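The reliability plot can be computed from probability integral transform (PIT) values. The following NumPy sketch (hypothetical names) returns the empirical frequencies that are plotted against the predictive quantiles:

```python
import numpy as np

def reliability_curve(pit_values, quantiles):
    """Empirical frequency of realizations falling below each predictive
    quantile. pit_values holds F_t(y_t): the forecast CDF of step t
    evaluated at the realization. For a reliable forecast the curve
    follows the angle bisector (frequency equals quantile)."""
    pit = np.asarray(pit_values)
    return np.array([float(np.mean(pit <= q)) for q in quantiles])
```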

The figure shows that, especially for the longer datasets UK-DALE and UCI, the forecasts deviate little from the ideal reliability line (the angle bisector). This shows that even though the mean absolute errors can be high, the uncertainty in electrical load forecasts is well captured. Furthermore, the


plots indicate that the ﬁve-component MDN is more reliable

than the single-Gaussian model. Overall, the single-Gaussian

model seems to assign too much probability to small values

of the consumption, which is indicated by the deviation from

the angle bisector for low predicted probabilities.

VI. CONCLUSION

Because of the high errors in state-of-the-art single-point

forecasts, the primary objective of this study was to present

a probabilistic forecasting model based on artiﬁcial neural

networks (ANNs) for quantifying the forecast uncertainty. The

second objective was to determine the inﬂuence of different

model input conﬁgurations on the forecasting performance.

Towards these objectives, two different density estimating

ANNs have been implemented: First, a Mixture Density Net-

work (MDN), which approximates the predictive probabil-

ity density function as a mixture of Gaussian kernels and,

second, a Softmax Regression Network (SRN) model, which

approximates the predictive probability density as a discrete

distribution over output bins. Both models were evaluated

over a variety of different conﬁgurations on the Smart*, the

UK-DALE and a UCI dataset, which consist of individual

household electrical load recordings. This evaluation led to

the following conclusions:

First and most importantly, it has been shown that MDNs

and SRNs can generate reliable probabilistic forecasts that sig-

niﬁcantly outperform an unconditional benchmarking model.

Conditioning the models on lagged inputs signiﬁcantly im-

proves the forecasting performance. However, models con-

ditioned on the lagged electrical load of the past 30 or 60

minutes only moderately outperform models conditioned on

solely the most recent lagged electrical load. Lagged input

of more than 60 minutes did not result in better forecasts.

The forecasting performance improves when increasing the

temporal resolution (granularity) of the training data. This is

likely due to the availability of lagged inputs closer to the

forecast horizon rather than the exploitation of higher-order

patterns exhibited by a ﬁner granularity. Conditioning models

on calendar variables (time of the day, day of the week, month

of the year) had no effect on the forecasting performance,

when using lagged inputs. Assuming the predictive distribu-

tions to be Gaussian is restrictive, as it reduces the overall

performance and the reliability of the forecasts.

VII. FUTURE WORK

The feedforward ANNs used in this study were not able

to benefit much from more lagged input, but mostly depended on the most recent electrical load observation. Hence, further

research can focus on trying to increase the gains from more

lagged input, by using more advanced model architectures. In

particular, recurrent neural networks or convolutional neural

networks could be combined with the output layers used in this study to result in e.g. recurrent MDNs, which can then be evaluated against the feedforward networks used here.

REFERENCES

[1] M. Beaudin and H. Zareipour, “Home energy management systems:

A review of modelling and complexity,” Renewable and Sustainable

Energy Reviews, vol. 45, pp. 318–335, 2015.

[2] B. Feron and A. Monti, “An agent based approach for virtual power

plant valuing thermal ﬂexibility in energy markets,” IEEE Powertech

Manchester, 2017.

[3] A. K. Singh, Ibraheem, S. Khatoon, and M. Muazzam, “An overview

of electricity demand forecasting techniques,” in National Conference

on Emerging Trends in Electrical, Instrumentation & Communication

Engineering, vol. 3, no. 3, 2013.

[4] A. Veit, C. Goebel, R. Tidke, C. Doblander, and H.-A. Jacobsen,

“Household electricity demand forecasting - benchmarking state-of-the-

art methods,” in Proceedings of the 5th international conference on

Future energy systems, 2014, pp. 233–234.

[5] H.-T. Yang, J.-T. Liao, and C.-I. Lin, “A load forecasting method for

hems applications,” in IEEE PowerTech Grenoble, 2013.

[6] R. E. Edwards, J. New, and L. E. Parker, “Predicting future hourly

residential electrical consumption: A machine learning case study,”

Energy and Buildings, vol. 49, pp. 591–603, 2012.

[7] T. Gneiting and M. Katzfuss, “Probabilistic forecasting,” Annual Review

of Statistics and Its Application, no. 1, pp. 125–151, 2014.

[8] T. Hong and S. Fan, “Probabilistic electric load forecasting: A tutorial

review,” International Journal of Forecasting, no. 32, pp. 914–938, 2015.

[9] S. Arora and J. W. Taylor, “Forecasting electricity smart meter data using

conditional kernel density estimation,” OMEGA - The International

Journal of Management Science, 2016.

[10] S. B. Taieb, R. Huser, R. J. Hyndman, and M. G. Genton, “Forecasting

uncertainty in electricity smart meter data by boosting additive quantile

regression,” IEEE Transactions on Smart Grid, vol. 7, no. 5, pp. 2448–

2455, 2016.

[11] M. Felder, A. Kaifel, and A. Graves, “Wind power prediction using

mixture density recurrent neural networks.”

[12] D. Ormoneit and R. Neuneier, “Experiments in predicting the german

stock index dax with density estimating neural networks,” in IEEE/IAFE

1996 Conference on Computational Intelligence for Financial Engineer-

ing (CIFEr), 1996.

[13] C. M. Bishop, “Mixture density networks,” Neural Computing Research

Group, Tech. Rep., 1994.

[14] H. Zen and A. Senior, “Deep mixture density networks for acoustic mod-

eling in statistical parametric speech synthesis,” in IEEE International

Conference on Acoustic, Speech and Signal Processing, 2014.

[15] T. Gneiting and A. E. Raftery, “Strictly proper scoring rules, prediction

and estimation,” Journal of the American Statistical Association, 2007.

[16] S. Barker, A. Mishra, D. Irwin, E. Cecchet, P. Shenoy, and J. Albrecht,

“Smart*: An open data set and tools for enabling research in sustain-

able homes,” in Proceedings of the 2012 Workshop on Data Mining

Applications in Sustainability, 2012.

[17] J. Kelly and W. Knottenbelt, “The UK-DALE dataset, domestic

appliance-level electricity demand and whole-house demand from ﬁve

UK homes,” Scientiﬁc Data, vol. 2, no. 150007, 2015.

[18] M. Lichman, “UCI machine learning repository,” 2013. [Online].

Available: https://archive.ics.uci.edu/ml/datasets/