Probabilistic Forecasting of Household Electrical
Load Using Artiﬁcial Neural Networks
Julian Vossen, Baptiste Feron, Antonello Monti
Institute for Automation of Complex Power Systems, E.ON Energy Research Center, RWTH University Aachen,
email@example.com; [bferon, amonti]@eonerc.rwth-aachen.de
Abstract—For optimizing the usage of electricity, energy man-
agement systems require forecasts of electrical consumption on
a single-household level. As this consumption is subject to high
uncertainties, state-of-the-art point-forecasting methods fail to
provide accurate predictions. In order to overcome this challenge,
this paper incorporates the uncertainty into a probabilistic fore-
cast using density-estimating Artificial Neural Networks. To this
end, Mixture Density Networks (MDN) and Softmax Regression Net-
works (SRN) are implemented and compared on three different
datasets over a broad range of hyper-parameter configurations.
The evaluation shows that both neural network models generate
reliable forecasts of the probability density over the future
consumption, which signiﬁcantly outperform an unconditional
benchmarking model. Furthermore, the experiments demonstrate
that a decreased dataset granularity and lagged inputs improve the
forecasts, while using additional calendar inputs and increasing
the length of the lagged inputs has little effect.
Index Terms—STLF, Smart meter, Neural network, Probabilistic forecast.

NOMENCLATURE
F        Cumulative distribution function
E        Error function used to fit the ANN model
x        ANN input vector
y        ANN output vector with forecasted value
ŷ        Observed or real value
p(y|x)   Probability density of y given x
X        Set of conditioning variables
φ_i      Kernel functions (MDN)
α_i      Mixing coefficients (MDN)
I. INTRODUCTION
The increase of distributed energy resources is one of
the biggest challenges faced by the energy sector in the
upcoming years due to their volatile and irregular nature. At
the same time, the spread of domestic smart appliances and
storage systems enables consumers to become an active part of
the grid. In the future, these domestic flexibility sources could
be controlled through an Energy Management System (EMS) in
order to provide grid services, reduce energy costs or reduce
carbon emissions. In the literature, EMS approaches are
mainly based on optimization formulations and require reliable
short-term load forecasts (STLF).
Therefore, research has focused on
generating STLF on a single-household level using methods
such as Auto-Regressive Moving-Average, Artiﬁcial Neural
Networks (ANNs) and Support Vector Regression. Most com-
monly, these methods output a single point forecast per time
step and are therefore referred to as point-forecasting methods.
However, as consumption on a household level is subject
to high volatility and unpredictable human behaviour, these
point-forecasts bear high errors. In fact, depending on the
dataset used, even advanced models fail to outperform naive
benchmarking methods.
As a way of dealing with this high uncertainty, probabilistic
forecasting methods provide information on the distribution
of future values. This can be in the form of intervals with
assigned probabilities or probability density functions (PDFs).
When probabilistic forecasts of the future electrical consump-
tion are available, a stochastic optimization of the consumption
can account for the uncertainty. Thereby, it can be ensured
that the decision strategy is not only locally optimal for
the expected value of the future consumption, but also globally
optimal over the predicted distribution of future values.
However, literature on probabilistic forecasting of electrical
load on a single-household level is still sparse. To the
best of the authors' knowledge, only two such studies are
available so far: in one, density forecasts are generated using
conditional kernel density estimation, and in the other using additive
quantile regression. While ANNs have often been applied for
point-forecasting, transferring ANNs to probabilistic electrical
load forecasting is still missing. In addition, the forecasts in
both studies were evaluated on a dataset of low temporal
resolution. Therefore, the effect of decreasing the dataset
granularity shall be investigated.
Aside from load forecasting, density-estimating ANNs have
been successfully applied to generate probabilistic forecasts in
other domains where forecasts are subject to a high uncertainty
. The objective of this study is to enrich the STLF
literature by:
• presenting the implementation of Mixture Density Networks (MDN)
and Softmax Regression Networks (SRN) for modelling probability
density;
• evaluating and comparing these approaches to an unconditional
benchmarking forecasting method;
Fig. 1: Mixture Density Network: The output of a neural
network parametrizes a Gaussian mixture model.
• studying the influence of a wide range of model
hyper-parameters (dataset granularity, input configuration,
ANN architecture).
This study is structured into three parts. First, we provide a
brief overview of the methods used. Secondly, we describe
the setup and implementation of the experiments conducted
and, finally, we provide and discuss the results of these
experiments.
II. METHOD FUNDAMENTALS
In this section we provide a brief overview of the forecasting
methods used in this study. Before covering the density-estimating
neural networks, the following paragraph provides
a short introduction for those readers unfamiliar with neural
networks. Throughout the rest of this paper, y refers to the
forecasted value and x to the lagged inputs used to forecast it.
A. Artiﬁcial Neural Networks
Neural networks are computing structures, which consist
of interconnected artiﬁcial neurons. An artiﬁcial neuron is
a function that computes a single output by calculating the
weighted sum of its inputs and applying a non-linear activation
function, e.g. exponential or softmax. Many such neurons are
connected in layers to form a network, whereby the output of
one layer is fed as input to the following layer. By adjusting
the input weights of each neuron, the resulting network can
be ﬁt to map an input vector to an output vector. With mild
assumptions on the activation function, neural networks can
be thought of as universal function approximators. Fitting
the network weights to represent a function given observed
input and output examples can be done by backpropagation.
Thereby, a so-called error function quantifies how well the
network captures the relation between the inputs and outputs of
the training examples. Then, the network weights are iteratively
updated in the direction that reduces the error function.
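The computation described above, a weighted sum followed by a non-linear activation stacked in layers, can be sketched as follows. The weights, layer sizes and the ReLU activation are purely illustrative and not those used in the paper:

```python
import numpy as np

def relu(z):
    # A common non-linear activation: max(0, z), applied element-wise
    return np.maximum(0.0, z)

def dense_layer(x, W, b, activation):
    # Each artificial neuron computes activation(w . x + b);
    # a layer stacks many such neurons as the rows of W.
    return activation(W @ x + b)

# Illustrative two-layer network mapping 3 inputs to 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

x = np.array([0.5, -1.0, 2.0])
hidden = dense_layer(x, W1, b1, relu)
output = dense_layer(hidden, W2, b2, lambda z: z)  # linear output layer
```

Fitting such a network means adjusting W1, b1, W2, b2 by backpropagation so that the error function decreases.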
B. Mixture Density Networks
Conventional least-square regression neural networks can
be derived from maximum likelihood by assuming the target
data to be Gaussian distributed . This motivates the idea
of replacing the Gaussian distribution with a mixture model,
which can model generic distribution functions . Hence,
Fig. 2: Softmax Regression Network: The output of a neural
network represents the probability of class membership. In
this case, class membership means the prediction variable y
falls into the respective interval.
the probability density of the target data is represented as a
linear combination of kernel functions:

p(y|x) = Σ_i α_i(x) φ_i(y|x)   (1)

where α_i(x) are mixing coefficients conditioned on the input
vector x and φ_i(y|x) represents a kernel function. Gaussian
kernels are used in this study:

φ_i(y|x) = 1 / (√(2π) σ_i(x)) · exp(−(y − μ_i(x))² / (2 σ_i(x)²))   (2)

with σ_i(x) as standard deviations conditioned on x. Therefore,
the output layer of the neural network represents the parameter
vector [α_i(x), μ_i(x), σ_i(x)]. The architecture of the mixture
density model is shown in Figure 1.
Using respective activation functions in the output layer
ensures that the network outputs valid parameter vectors.
In this paper, a softmax activation is used for the mixing
coefﬁcients α, and a simple exponential function for the
standard deviations σ, while the means are unrestricted. In
a post-processing step, the forecasted density is restricted to
the physically possible non-negative load values: the cumulative
distribution function is set to zero for negative values.
The network can be fit to observations using backpropagation.
For this purpose, an error function is defined that quantifies the
quality of the forecasted PDF, given an observation, as a single
scalar. The error function E(y, ŷ) is constructed using the
maximum likelihood criterion by taking the negative logarithm
of the likelihood, also called the negative log-likelihood:

E(y, ŷ) = −ln L(ŷ|x) = −ln p(ŷ|x)   (3)
MDNs have been successfully applied to a wide range of
problems, such as financial forecasting, weather forecasting
or speech synthesis, as they can approximate
arbitrary probability distributions. However, there is an alternative
to MDNs, which approximates the probability density
function at discrete sample points by binning the output
range of the target variable and applying a softmax activation
function to the network output. This technique is referred to as
Softmax Regression Networks (SRNs) and is introduced in the
following section.
C. Softmax Regression Networks
Like the MDNs described above, SRNs can be used to
approximate arbitrary probability distributions. Instead of
assuming a kernel mixture model parametrized by the neural network
output, each output neuron represents the mean probability
density for a fraction of the output space. These fractions are
referred to as bins (Fig. 2). Normalizing the network output
with a softmax function in the output layer ensures that the sum
over all bins is one. Hence, the output of each
neuron can also be interpreted as the probability that the target
variable y lies in the respective bin.
Analogous to the MDN, the negative log-likelihood can be used
as the error function, which for the softmax output layer with
discrete bins y_i becomes:

E(y, ŷ) = −ln L(ŷ|x) = −ln p(y_i* | x),  i* = arg min_i |y_i − ŷ|   (4)
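A minimal sketch of this error function for an SRN output layer, assuming equally wide bins represented by their centers (the names and bin layout are illustrative):

```python
import numpy as np

def softmax(z):
    # Normalizes raw network outputs so the bin probabilities sum to one
    e = np.exp(z - z.max())
    return e / e.sum()

def srn_nll(logits, bin_centers, bin_width, y_obs):
    # Probability mass per bin from the softmax output layer
    p_bin = softmax(logits)
    # Eq. 4: select the bin whose center y_i is closest to the observation
    i_star = np.argmin(np.abs(bin_centers - y_obs))
    # Mean density in a bin is its probability mass divided by its width
    return -np.log(p_bin[i_star] / bin_width)
```

For a uniform output (all logits equal) over four bins of width one, the loss is −ln(1/4) regardless of where the observation falls.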
D. Benchmarking methods
Two benchmarking models are implemented to compare and
evaluate the proposed ANN models.
First, an unconditional model obtains the overall distribution
of the target variable as a histogram from the training data
and returns this histogram as a forecast for future values.
The model is called unconditional because the forecasted
distribution is constant for all forecast steps, independent of
any conditioning variables.
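This unconditional benchmark can be sketched in a few lines; the bin count is an illustrative choice:

```python
import numpy as np

def fit_unconditional(train_loads, n_bins=20):
    # Overall distribution of the target variable as a normalized
    # histogram obtained from the training data
    density, edges = np.histogram(train_loads, bins=n_bins, density=True)
    return density, edges

def unconditional_forecast(density, edges):
    # The same histogram is returned for every forecast step,
    # independent of any conditioning variables
    return density, edges

train = np.array([0.1, 0.2, 0.2, 0.5, 1.0, 1.5])
density, edges = fit_unconditional(train, n_bins=5)
```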
A second benchmark is used to isolate the effect of mod-
elling a time-dependent density compared to a point-forecast.
Instead of training a separate model with a single output
neuron and mean squared error, the forecast is derived from
the predictive conditional mean of a single-component MDN.
In this way, the information on the distribution from the
probabilistic forecast is discarded to yield a comparable
point-forecast. To compare the point-forecast with a probabilistic
one, a Gaussian distribution with a standard deviation
obtained from the residuals in the training data is added
around the point-forecast. This density forecast summarizes all
information on the uncertainty that a point-forecast provides
and results in a homoscedastic benchmarking model, as the
variance of the forecasted density is constant over time.
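A sketch of this homoscedastic benchmark, assuming point forecasts and training residuals are already available (names are illustrative):

```python
import numpy as np

def constant_variance_forecast(point_forecasts, train_residuals):
    # The spread is estimated once from the training residuals and
    # kept constant over time (homoscedastic benchmark)
    sigma = float(np.std(train_residuals))
    # Each forecast step becomes a Gaussian N(mu, sigma^2)
    # centered on the point forecast
    return [(float(mu), sigma) for mu in point_forecasts]

fc = constant_variance_forecast([1.0, 2.0, 3.0], np.array([-0.5, 0.5]))
```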
E. Evaluation metrics
A scoring function S(p(y),ˆy)is used to evaluate the quality
of probabilistic forecasts, by comparing a predicted probability
density function p(y)to a scalar observation ˆy. To ensure that
the score motivates forecasting the true distribution over any
other, the score needs to be proper. In this paper, forecast
performance is evaluated using the Continuous Ranked Probability
Score (CRPS, Eq. 5), as it is proper and has two favorable
properties:
• its unit is identical to that of the forecast variable, which
makes it more descriptive than e.g. the logarithmic score;
• for point-forecasts it reduces to the absolute error, which
provides a way to compare probabilistic and point-forecasts.
TABLE I: Hyper-parameters evaluated during the grid search

Hyper-parameter                 Values
Number of hidden layers         1, 3, 9
Number of hidden neurons        1, 10, 40, 120, 360
Dataset granularity [min]       1, 5, 30
Length of lagged input [min]    1, 30, 60, 360, 1440
Use calendar inputs             True, False
Forecast horizon [min]          60
The CRPS is defined as:

CRPS(F(y), ŷ) = ∫_{−∞}^{+∞} (F(y) − 1(y ≥ ŷ))² dy   (5)

where F(y) is the cumulative distribution function of the
forecast and 1(·) is the Heaviside step function.
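The integral of Eq. 5 can be approximated numerically on a value grid; the following sketch also illustrates the point-forecast property mentioned above (the grid and the step CDF are illustrative):

```python
import numpy as np

def crps(cdf_values, y_grid, y_obs):
    # Eq. 5: integrate (F(y) - 1(y >= y_obs))^2 over the value grid
    heaviside = (y_grid >= y_obs).astype(float)
    integrand = (cdf_values - heaviside) ** 2
    # Left-rectangle numerical integration over the grid
    return float(np.sum(integrand[:-1] * np.diff(y_grid)))

# For a point forecast (a step-function CDF at 0), the CRPS reduces
# to the absolute error between forecast and observation
y = np.linspace(-5.0, 5.0, 100001)
point_forecast_cdf = (y >= 0.0).astype(float)
score = crps(point_forecast_cdf, y, 1.0)
```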
III. MODELING AND DATA
The models output the forecasted distribution over the total
consumption within the forecast interval conditioned on a
ﬁxed number of most recent load observations and calendar
variables. As calendar variables we use the time of the day,
the day of the week and the month of the year encoded as
numbers in the interval [0,1]. The load recordings are scaled
into the same order of magnitude. Training examples are then
constructed based on input sequences, calendar variables and
respective consumptions during the forecast horizon. If there
are missing or invalid recordings during the forecast horizon,
the respective example is excluded from training data as it
would cause the model to learn on invalid values. Both models
are implemented using Keras and Tensorﬂow. Then, separate
models for each household are trained on the ﬁrst 80% of the
training examples, cross-validated using the following 10%
and tested on the most recent 10% of recordings. To provide
insights into how the choice of hyper-parameters affects the
forecasting performance, both models are evaluated on each
dataset for all combinations of the hyper-parameters provided
in Table I.
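The construction of training examples described above can be sketched as follows. Skipping examples with invalid lagged inputs, in addition to invalid horizon values, is an assumption of this sketch, and all names are illustrative:

```python
import numpy as np
from datetime import datetime, timedelta

def make_examples(load, timestamps, lag_steps, horizon_steps):
    # Build (input, target) pairs: lagged loads plus calendar variables
    # encoded into [0, 1]; the target is the total consumption during
    # the forecast horizon. Examples with NaN recordings are excluded.
    examples = []
    for t in range(lag_steps, len(load) - horizon_steps + 1):
        lagged = load[t - lag_steps:t]
        target = np.sum(load[t:t + horizon_steps])
        if np.isnan(target) or np.isnan(lagged).any():
            continue  # invalid recordings would corrupt training
        ts = timestamps[t]
        # time of day, day of week, month of year in [0, 1]
        calendar = [ts.hour / 23.0, ts.weekday() / 6.0, (ts.month - 1) / 11.0]
        examples.append((np.concatenate([lagged, calendar]), float(target)))
    return examples

load = np.array([1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0])
start = datetime(2014, 1, 6)  # an arbitrary Monday
stamps = [start + timedelta(minutes=i) for i in range(len(load))]
examples = make_examples(load, stamps, lag_steps=2, horizon_steps=2)
```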
Fig. 3: Typical domestic electrical load consumption (Smart*
dataset ) with a high volatility and changing patterns.
While this kind of grid search is not the most efﬁcient way
to ﬁnd a single good-performing hyper-parameter combina-
tion, it allows us to gain insights into how the hyper-parameters
affect the forecasting performance.
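Such a grid search can be sketched as a Cartesian product over the values of Table I (the key names are illustrative):

```python
import itertools

# Hyper-parameter values from Table I; the forecast horizon is fixed
grid = {
    "hidden_layers": [1, 3, 9],
    "hidden_neurons": [1, 10, 40, 120, 360],
    "granularity_min": [1, 5, 30],
    "lag_length_min": [1, 30, 60, 360, 1440],
    "use_calendar": [True, False],
}

def grid_configs(grid):
    # Yield one configuration dict per combination of values
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(grid_configs(grid))
# 3 * 5 * 3 * 5 * 2 = 450 combinations per model and dataset
```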
Numerous datasets have been recorded to support research
in load forecasting and disaggregation. In this study,
Fig. 4: Mean CRPS value of the unconditional benchmark and
the presented approaches: Softmax Regression Network (SRN)
and Mixture Density Network (MDN).
Fig. 5: Normalized CRPS for different input granularities and
lengths of lagged input; decreasing the granularity improves the
forecasting performance.
the Smart*, the UK-DALE and a UCI dataset
(Fig. 3) are used to evaluate the proposed forecasting methods,
as these are publicly available, exhibit a fine granularity (<1
minute) and include relatively long recording periods.
IV. HYPER-PARAMETER EFFECTS
This section highlights the impact of the hyper-parameters
on the forecasting performance. For this purpose, a single
hyper-parameter is varied, while the others are unrestricted.
The mean CRPS is normalized by dividing the CRPS of the
model with a speciﬁc hyper-parameter by the overall best
CRPS when all hyper-parameters are unrestricted. That some
values fall below a normalized CRPS of one is a consequence
of the cross-validation. The best performing model on the
cross-validation data is not necessarily the best performing
model on the test data.
The following part further elaborates on A) the impact
of the considered model inputs, B) the optimum network
configuration and C) the impact of the number of mixture
components for the MDN model.
A. Model inputs
Figure 5 shows that feeding lagged input data at a finer
granularity improves the forecasting performance. However,
increasing the length of the lagged input does not significantly
improve the performance. That indicates that the forecast
mostly depends on the most recent observation before the
forecast horizon. The models do not seem to exploit higher-
order patterns in the data. Hence, the increasing performance
for lower granularities is likely an effect of the most recent
Fig. 6: Comparison of different ANN input configurations
with the benchmark model for different combinations of lagged
inputs (previous power consumption) and calendar variables.
Fig. 7: Comparison of different model architectures; The
overall best conﬁguration is three hidden layers with 100
neurons each. However, performance gains compared to very
small networks are relatively small and depend on the dataset.
recording more accurately approximating the load during the
forecast horizon, rather than of the models exploiting fine-granular
patterns in the data. Figure 6 shows the effect of conditioning
the models on different inputs. The best performing model
conditioned on both lagged input and calendar variables
is compared to the best performing models conditioned
on either only lagged input or only calendar variables
and the unconditional benchmark. Models based only on
calendar inputs perform slightly better than the unconditional
benchmark. Conditioning models only on lagged input
significantly improves the forecasting performance. However,
including calendar variables in addition to lagged inputs has
little effect.
B. Network conﬁguration
Different network conﬁgurations (Fig. 7) were investigated.
The results highlight that a configuration of three hidden layers
with 100 neurons each achieves the best overall performance.
Still, the gains compared to configurations with only a few
hidden neurons are relatively small. This small performance
difference indicates that the feed-forward ANN structure barely
learns complex features in the input data.
C. MDN: Number of mixture components
The performance of the best MDN with only a single Gaussian
component is compared to an MDN with five components
Fig. 8: Comparison of different numbers of mixture components;
restricting the predictive density to be Gaussian leads
to a worse performance compared to a density mixed from
five Gaussian components.
Fig. 9: Forecast comparison between a time-dependent and a
constant variance model with identical conditional means.
(Fig. 8). Increasing the number of considered components
allows the forecasted density function to take more generic
shapes instead of restricting it to be one single Gaussian. This
leads to a consistently better forecasting performance.
V. METHOD EVALUATION
This section evaluates the presented models against an
unconditional benchmarking model and against a second benchmarking
model with a time-constant variance (see Section II-D).
A. Benchmarking model versus MDN and SRN models
The MDN and SRN forecasting models are compared to
the unconditional benchmark (Fig. 4 and 6). The presented
results are obtained comparing the models with the overall
best-performing combination of hyper-parameters from Table I.
The results highlight that both ANN models achieve a
similar performance and clearly outperform the unconditional
benchmark on the different datasets considered.
B. Time-dependent versus constant variance model
Both presented ANN models output a time-varying variance
of the predictive densities over the forecast steps (Fig. 9),
which means that these models can capture heteroscedasticity.
Fig. 10: Performance comparison of MDN model with time-
dependent and constant variance.
Fig. 11: Reliability plots for the one- and five-component MDN
models. Ideal reliability is indicated by the angle bisector
(dashed line). Single-Gaussian forecasts are less reliable than
the more generic five-component ones.
Therefore, this section aims at evaluating the added value of
considering a time-dependent variance in terms of the CRPS.
The results demonstrate that the time-dependent variance
model always scores better than the constant variance model,
even though both models have an identical predictive conditional
mean (Fig. 10). The better performance of the time-dependent
model indicates that the uncertainty can be well
captured and highlights the added value of forecasting
methods that capture a time-dependent variance, unlike
state-of-the-art point-forecasting methods.
C. Forecast reliability
The forecast reliability or calibration describes the statistical
compatibility between the forecasted PDF and the realizations.
This means that if the model assigns a probability p to an
outcome, the proportion of realizations matching this outcome
should converge towards p for a large number of experiments.
This behaviour can be evaluated using reliability plots, where
the frequency of observations falling into the predicted quantile
is plotted over the predictive quantile itself (see Fig. 11).
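The quantities behind such a reliability plot can be sketched as follows, assuming the predictive CDF has already been evaluated at each realization (the PIT values); names are illustrative:

```python
import numpy as np

def reliability_curve(pit_values, quantiles):
    # pit_values: predictive CDF evaluated at each realization.
    # For each predictive quantile q, compute the observed frequency
    # of realizations falling below it; a reliable forecast gives
    # frequency == q, i.e. the curve lies on the angle bisector.
    return np.array([(pit_values <= q).mean() for q in quantiles])

# For perfectly calibrated forecasts, the PIT values are uniform
# on [0, 1] and the curve matches the quantiles
pit = np.linspace(0.0, 1.0, 10001)
qs = np.array([0.1, 0.5, 0.9])
freq = reliability_curve(pit, qs)
```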
The figure shows that, especially for the longer datasets
UK-DALE and UCI, the forecasts show little deviation from the
ideal reliability line (the angle bisector). This shows that even
though the mean absolute errors can be high, the uncertainty
in electrical load forecasts is well captured. Furthermore, the
plots indicate that the ﬁve-component MDN is more reliable
than the single-Gaussian model. Overall, the single-Gaussian
model seems to assign too much probability to small values
of the consumption, which is indicated by the deviation from
the angle bisector for low predicted probabilities.
VI. CONCLUSION
Because of the high errors in state-of-the-art single-point
forecasts, the primary objective of this study was to present
a probabilistic forecasting model based on artiﬁcial neural
networks (ANNs) for quantifying the forecast uncertainty. The
second objective was to determine the inﬂuence of different
model input conﬁgurations on the forecasting performance.
Towards these objectives, two different density estimating
ANNs have been implemented: First, a Mixture Density Net-
work (MDN), which approximates the predictive probabil-
ity density function as a mixture of Gaussian kernels and,
second, a Softmax Regression Network (SRN) model, which
approximates the predictive probability density as a discrete
distribution over output bins. Both models were evaluated
over a variety of different conﬁgurations on the Smart*, the
UK-DALE and a UCI dataset, which consist of individual
household electrical load recordings. This evaluation led to
the following conclusions:
First and most important, it has been shown that MDNs
and SRNs can generate reliable probabilistic forecasts that sig-
niﬁcantly outperform an unconditional benchmarking model.
Conditioning the models on lagged inputs signiﬁcantly im-
proves the forecasting performance. However, models con-
ditioned on the lagged electrical load of the past 30 or 60
minutes only moderately outperform models conditioned on
solely the most recent lagged electrical load. Lagged input
of more than 60 minutes did not result in better forecasts.
The forecasting performance improves when increasing the
temporal resolution (granularity) of the training data. This is
likely due to the availability of lagged inputs closer to the
forecast horizon rather than the exploitation of higher-order
patterns exhibited by a ﬁner granularity. Conditioning models
on calendar variables (time of the day, day of the week, month
of the year) had no effect on the forecasting performance
when lagged inputs are used. Assuming the predictive distributions
to be Gaussian is restrictive, as it reduces both the overall
performance and the reliability of the forecasts.
VII. FUTURE WORK
The feedforward ANNs used in this study were not able
to benefit much from longer lagged input, but mostly depended
on the most recent electrical load observation. Hence, further
research can focus on trying to increase the gains from more
lagged input, by using more advanced model architectures. In
particular, recurrent neural networks or convolutional neural
networks could be combined with the output layers used in
this study to result in e.g. recurrent MDNs. These can be
evaluated against the feedforward ones used in this study.
REFERENCES
[1] M. Beaudin and H. Zareipour, "Home energy management systems: A review of modelling and complexity," Renewable and Sustainable Energy Reviews, vol. 45, pp. 318–335, 2015.
[2] B. Feron and A. Monti, "An agent based approach for virtual power plant valuing thermal flexibility in energy markets," IEEE Powertech
[3] A. K. Singh, Ibraheem, S. Khatoon, and M. Muazzam, "An overview of electricity demand forecasting techniques," in National Conference on Emerging Trends in Electrical, Instrumentation & Communication Engineering, vol. 3, no. 3, 2013.
[4] A. Veit, C. Goebel, R. Tidke, C. Doblander, and H.-A. Jacobsen, "Household electricity demand forecasting - benchmarking state-of-the-art methods," in Proceedings of the 5th International Conference on Future Energy Systems, 2014, pp. 233–234.
[5] H.-T. Yang, J.-T. Liao, and C.-I. Lin, "A load forecasting method for HEMS applications," in IEEE PowerTech Grenoble, 2013.
[6] R. E. Edwards, J. New, and L. E. Parker, "Predicting future hourly residential electrical consumption: A machine learning case study," Energy and Buildings, vol. 49, pp. 591–603, 2012.
[7] T. Gneiting and M. Katzfuss, "Probabilistic forecasting," Annual Review of Statistics and Its Application, no. 1, pp. 125–151, 2014.
[8] T. Hong and S. Fan, "Probabilistic electric load forecasting: A tutorial review," International Journal of Forecasting, no. 32, pp. 914–938, 2015.
[9] S. Arora and J. W. Taylor, "Forecasting electricity smart meter data using conditional kernel density estimation," OMEGA - The International Journal of Management Science, 2016.
[10] S. B. Taieb, R. Huser, R. J. Hyndman, and M. G. Genton, "Forecasting uncertainty in electricity smart meter data by boosting additive quantile regression," IEEE Transactions on Smart Grid, vol. 7, no. 5, pp. 2448–
[11] M. Felder, A. Kaifel, and A. Graves, "Wind power prediction using mixture density recurrent neural networks."
[12] D. Ormoneit and R. Neuneier, "Experiments in predicting the German stock index DAX with density estimating neural networks," in IEEE/IAFE 1996 Conference on Computational Intelligence for Financial Engineering (CIFEr), 1996.
[13] C. M. Bishop, "Mixture density networks," Neural Computing Research Group, Tech. Rep., 1994.
[14] H. Zen and A. Senior, "Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2014.
[15] T. Gneiting and A. E. Raftery, "Strictly proper scoring rules, prediction, and estimation," Journal of the American Statistical Association, 2007.
[16] S. Barker, A. Mishra, D. Irwin, E. Cecchet, P. Shenoy, and J. Albrecht, "Smart*: An open data set and tools for enabling research in sustainable homes," in Proceedings of the 2012 Workshop on Data Mining Applications in Sustainability, 2012.
[17] J. Kelly and W. Knottenbelt, "The UK-DALE dataset, domestic appliance-level electricity demand and whole-house demand from five UK homes," Scientific Data, vol. 2, no. 150007, 2015.
[18] M. Lichman, "UCI machine learning repository," 2013. [Online].