Gaussian Mixture Models for Time Series Modelling, Forecasting, and Interpolation

Emil Eirola¹ and Amaury Lendasse¹,²,³

¹ Department of Information and Computer Science, Aalto University, FI–00076 Aalto, Finland
emil.eirola@aalto.fi
² IKERBASQUE, Basque Foundation for Science, 48011 Bilbao, Spain
³ Computational Intelligence Group, Computer Science Faculty, University of the Basque Country, Paseo Manuel Lardizabal 1, Donostia/San Sebastián, Spain
Abstract. Gaussian mixture models provide an appealing tool for time series modelling. By embedding the time series to a higher-dimensional space, the density of the points can be estimated by a mixture model. The model can directly be used for short-to-medium term forecasting and missing value imputation. The modelling setup introduces some restrictions on the mixture model, which when appropriately taken into account result in a more accurate model. Experiments on time series forecasting show that including the constraints in the training phase particularly reduces the risk of overfitting in challenging situations with missing values or a large number of Gaussian components.
Keywords: time series, missing data, Gaussian mixture model
1 Introduction
A time series is one of the most common forms of data, and has been studied extensively from weather patterns spanning centuries to sensors and microcontrollers operating on nanosecond scales. The features and irregularities of time series can be modelled through various means, such as autocovariance analysis, trend fitting, or frequency-domain methods. From a machine learning perspective, the most relevant tasks tend to be prediction of one or several future data points, or interpolation for filling in gaps in the data. In this paper, we study a model for analysing time series, which is applicable to both tasks.
For uniformly sampled stationary processes, we propose a versatile methodology to model the features of the time series by embedding the data to a high-dimensional regressor space. The density of the points in this space can then be modelled with Gaussian mixture models [1]. Such an estimate of the probability density enables a direct way to interpolate missing values in the time series and conduct short-to-medium term prediction by finding the conditional expectation of the unknown values. Embedding the time series in a higher-dimensional space imposes some restrictions on the possible distribution of points, but these constraints can be accounted for when fitting the Gaussian mixture models.

The suggested framework can readily be extended to situations with several related time series, using exogenous time series to improve the predictions of a target series. Furthermore, any missing values can be handled by the Gaussian mixture model in a natural manner.
This paper is structured as follows. Section 2 presents the procedure for modelling time series by Gaussian mixture models, the constraints on the Gaussian mixture model due to time series data are discussed in Section 3, and some experiments showing the effect of selecting the number of components and introducing missing values are studied in Section 4.
2 Mixture Models for Time Series
Given a time series $z$ of length $n$, corresponding to a stationary process:
$$z_0, z_1, z_2, \ldots, z_{n-2}, z_{n-1},$$
by choosing a regressor length $d$ we can conduct a delay embedding [2] and form the design matrix $X$,
$$X = \begin{pmatrix} z_0 & z_1 & \cdots & z_{d-1} \\ z_1 & z_2 & \cdots & z_d \\ \vdots & \vdots & & \vdots \\ z_{n-d} & z_{n-d+1} & \cdots & z_{n-1} \end{pmatrix} = \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_{n-d} \end{pmatrix}. \qquad (1)$$
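As an illustration (not part of the original paper), this delay embedding takes only a few lines of Python with NumPy; the function name delay_embed is our own:

```python
import numpy as np

def delay_embed(z, d):
    """Design matrix of Eq. (1): row i is (z_i, ..., z_{i+d-1})."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    # n - d + 1 overlapping windows of length d
    return np.array([z[i:i + d] for i in range(n - d + 1)])

# Example: a series of length 5 embedded with regressor length d = 3
X = delay_embed([1.0, 2.0, 3.0, 4.0, 5.0], d=3)
# X == [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
```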
The rows of $X$ can be interpreted as vectors in $\mathbb{R}^d$. We can model the density of these points by a Gaussian mixture model, with the probability density function
$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \qquad (2)$$
where $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is the probability density function of the multivariate normal distribution, $\mu_k$ represents the means, $\Sigma_k$ the covariance matrices, and $\pi_k$ the mixing coefficients for each component $k$ ($0 < \pi_k < 1$, $\sum_{k=1}^{K} \pi_k = 1$).
Given a set of data, the standard approach to training a Gaussian mixture model is the EM algorithm [3,4] for finding a maximum-likelihood fit. The log-likelihood of the $N$ data points is given by
$$\log L(\theta) = \log p(X \mid \theta) = \sum_{i=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \right), \qquad (3)$$
where $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$ is the set of parameters defining the model. The log-likelihood can be maximised by applying the EM algorithm. After some initialisation of parameters, the E-step is to find the expected value of the log-likelihood function, with respect to the conditional distribution of latent variables $Z$ given the data $X$ under the current estimate of the parameters $\theta^{(t)}$:
$$Q(\theta \mid \theta^{(t)}) = \mathrm{E}_{Z \mid X, \theta^{(t)}} \left[ \log L(\theta; X, Z) \right]. \qquad (4)$$
This requires evaluating the probabilities $t_{ik}$ that $x_i$ is generated by the $k$th Gaussian using the current parameter values:
$$t_{ik}^{(t)} = \frac{\pi_k^{(t)} \, \mathcal{N}(x_i \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_{j=1}^{K} \pi_j^{(t)} \, \mathcal{N}(x_i \mid \mu_j^{(t)}, \Sigma_j^{(t)})}. \qquad (5)$$
In the M-step, the expected log-likelihood is maximised:
$$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)}), \qquad (6)$$
which corresponds to re-estimating the parameters with the new probabilities:
$$\mu_k^{(t+1)} = \frac{1}{N_k} \sum_{i=1}^{N} t_{ik}^{(t)} x_i, \qquad (7)$$
$$\Sigma_k^{(t+1)} = \frac{1}{N_k} \sum_{i=1}^{N} t_{ik}^{(t)} (x_i - \mu_k^{(t+1)})(x_i - \mu_k^{(t+1)})^T, \qquad (8)$$
$$\pi_k^{(t+1)} = \frac{1}{N} \sum_{i=1}^{N} t_{ik}^{(t)}. \qquad (9)$$
Here $N_k = \sum_{i=1}^{N} t_{ik}^{(t)}$ is the effective number of samples covered by the $k$th component. The E and M-steps are alternated repeatedly until convergence.
As the algorithm tends to occasionally converge to sub-optimal solutions, the
procedure can be repeated to find the best fit.
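For concreteness, a compact NumPy/SciPy sketch of this EM loop for complete data is given below (our own code, without the missing-value handling or the constraints discussed later; a small ridge term keeps the covariances invertible):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm_em(X, K, n_iter=200, seed=0):
    """Plain EM for a Gaussian mixture (Eqs. 5 and 7-9), complete data only."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Initialise with random responsibilities, then start with an M-step
    T = rng.dirichlet(np.ones(K), size=N)
    for _ in range(n_iter):
        # M-step (Eqs. 7-9)
        Nk = T.sum(axis=0)
        pi = Nk / N
        mu = (T.T @ X) / Nk[:, None]
        Sigma = np.empty((K, d, d))
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (T[:, k, None] * diff).T @ diff / Nk[k]
            Sigma[k] += 1e-6 * np.eye(d)  # ridge for numerical stability
        # E-step (Eq. 5): responsibilities t_ik
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
            for k in range(K)])
        T = dens / dens.sum(axis=1, keepdims=True)
    return pi, mu, Sigma
```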
2.1 Model Structure Selection
The selection of the number of components $K$ is crucial, and has a significant effect on the resulting accuracy. Too few components are not able to model the distribution appropriately, while having too many components causes issues of overfitting.

The number of components can be selected according to the Akaike information criterion (AIC) [5] or the Bayesian information criterion (BIC) [6]. Both are expressed as a function of the log-likelihood of the converged mixture model:
$$\mathrm{AIC} = -2 \log L(\theta) + 2P, \qquad (10)$$
$$\mathrm{BIC} = -2 \log L(\theta) + \log(N)\, P, \qquad (11)$$
where $P = Kd + \frac{1}{2} K d(d+1) + K - 1$ is the number of free parameters. The EM algorithm is run for several different values of $K$, and the model which minimises the chosen criterion is selected. As $\log(N) > 2$ in most cases, BIC more aggressively penalises an increase in $P$, generally resulting in a smaller choice of $K$ than by AIC.
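As a rough illustration of this selection loop for the unconstrained case (our own sketch; scikit-learn's GaussianMixture provides aic() and bic() but supports neither missing values nor the constraints introduced in Section 3):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_K(X, K_max=30, n_init=10, criterion="bic", seed=0):
    """Fit unconstrained mixtures for K = 1..K_max and return the K
    minimising the chosen criterion (Eq. 10 or 11)."""
    best_K, best_score = None, np.inf
    for K in range(1, K_max + 1):
        gmm = GaussianMixture(n_components=K, covariance_type="full",
                              n_init=n_init, random_state=seed).fit(X)
        score = gmm.bic(X) if criterion == "bic" else gmm.aic(X)
        if score < best_score:
            best_K, best_score = K, score
    return best_K
```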
2.2 Forecasting
The model readily lends itself to being used for short-to-medium term time series prediction. For example, if a time series is measured monthly and displays some seasonal behaviour, a Gaussian mixture model could be trained with a regressor size of 24 (two years). This allows us to take the last year's measurements as the first 12 months, and determine the conditional expectation of the following 12 months.
The mixture model provides a direct way to calculate the conditional expectation. Let the input dimensions be partitioned into past values $P$ (known) and future values $F$ (unknown). Then, given a sample $x_i^P$ for which only the past values are known and a prediction is to be made, calculate the probabilities of it belonging to each component
$$t_{ik} = \frac{\pi_k \, \mathcal{N}(x_i^P \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i^P \mid \mu_j, \Sigma_j)}, \qquad (12)$$
where $\mathcal{N}(x_i^P \mid \mu_k, \Sigma_k)$ is the marginal multivariate normal distribution probability density of the observed (i.e., past) values of $x_i$.
Let the means and covariances of each component also be partitioned according to past and future variables:
$$\mu_k = \begin{pmatrix} \mu_k^P \\ \mu_k^F \end{pmatrix}, \qquad \Sigma_k = \begin{pmatrix} \Sigma_k^{PP} & \Sigma_k^{PF} \\ \Sigma_k^{FP} & \Sigma_k^{FF} \end{pmatrix}. \qquad (13)$$
Then the conditional expectation of the future values with respect to the component $k$ is given by
$$\tilde{y}_{ik} = \mu_k^F + \Sigma_k^{FP} (\Sigma_k^{PP})^{-1} (x_i^P - \mu_k^P) \qquad (14)$$
in accordance with [7, Thm. 2.5.1]. The total conditional expectation can now be found as a weighted average of these predictions by the probabilities $t_{ik}$:
$$\hat{y}_i = \sum_{k=1}^{K} t_{ik} \, \tilde{y}_{ik}. \qquad (15)$$
It should be noted that the method directly estimates the full vector of future
values at once, in contrast with most other methods which would separately
predict each required data point.
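The prediction of Eqs. (12)–(15) amounts to a few matrix operations per component; a NumPy/SciPy sketch follows (our own naming; pi, mu, Sigma are the fitted mixture parameters, and P and F are index arrays into the d regressor dimensions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_conditional_mean(x_past, P, F, pi, mu, Sigma):
    """Conditional expectation of the future block F given the past block P."""
    K = len(pi)
    # Eq. (12): responsibilities from the marginal density of the past values
    w = np.array([pi[k] * multivariate_normal.pdf(
            x_past, mu[k][P], Sigma[k][np.ix_(P, P)]) for k in range(K)])
    w /= w.sum()
    # Eq. (14) per component, combined with the weights as in Eq. (15)
    y = np.zeros(len(F))
    for k in range(K):
        S_PP = Sigma[k][np.ix_(P, P)]
        S_FP = Sigma[k][np.ix_(F, P)]
        y += w[k] * (mu[k][F] + S_FP @ np.linalg.solve(S_PP, x_past - mu[k][P]))
    return y

# With d = 24 as in the example above: P = np.arange(12), F = np.arange(12, 24)
```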
2.3 Missing Values and Imputation
The proposed method is directly applicable to time series with missing values.
Missing data in the time series become diagonals of missing values in the design
matrix. The EM-algorithm can in a natural way account for missing values in
the samples [8,9].
An assumption here is that data are Missing-at-Random (MAR) [10]:
$$P(M \mid x_{\mathrm{obs}}, x_{\mathrm{mis}}) = P(M \mid x_{\mathrm{obs}}),$$
i.e., the event $M$ of a measurement being missing is independent of the value it would take ($x_{\mathrm{mis}}$), conditional on the observed data ($x_{\mathrm{obs}}$). The stronger assumption of Missing-Completely-at-Random (MCAR) is not necessary, as MAR is an ignorable missing-data mechanism in the sense that maximum likelihood estimation still provides a consistent estimator [10].
To conduct missing value imputation, the procedure is the same as for prediction in Section 2.2. The only difference is that in this case the index set $P$ contains all known values for a sample (both before and after the target to be predicted), while $F$ contains the missing values that will be imputed.
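Reusing the forecasting sketch above, imputation differs only in how the index sets are chosen, e.g. with NaN marking gaps in an embedded sample (a usage sketch under the same assumptions):

```python
import numpy as np

x = np.array([0.4, np.nan, 0.1, np.nan, 0.7])   # one row of the design matrix
F = np.flatnonzero(np.isnan(x))                 # values to impute
P = np.flatnonzero(~np.isnan(x))                # observed values, before and after
# x[F] = gmm_conditional_mean(x[P], P, F, pi, mu, Sigma)
```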
2.4 Missing-data Padding
When using an implementation of the EM algorithm that is able to handle missing values, it is reasonable to consider every value before and after the recorded time series as missing. This can be seen as "padding" the design matrix $X$ with missing values (marked as '?'), effectively increasing the number of samples available for training from $n - d + 1$ to $n + d - 1$ (cf. Eq. (1)):
$$X = \begin{pmatrix}
? & ? & \cdots & ? & z_0 \\
? & ? & \cdots & z_0 & z_1 \\
\vdots & \vdots & & \vdots & \vdots \\
? & z_0 & \cdots & z_{d-3} & z_{d-2} \\
z_0 & z_1 & \cdots & z_{d-2} & z_{d-1} \\
\vdots & \vdots & & \vdots & \vdots \\
z_{n-d} & z_{n-d+1} & \cdots & z_{n-2} & z_{n-1} \\
z_{n-d+1} & z_{n-d+2} & \cdots & z_{n-1} & ? \\
\vdots & \vdots & & \vdots & \vdots \\
z_{n-1} & ? & \cdots & ? & ?
\end{pmatrix}
= \begin{pmatrix}
x_0 \\ x_1 \\ \vdots \\ x_{d-2} \\ x_{d-1} \\ \vdots \\ x_{n-1} \\ x_n \\ \vdots \\ x_{n+d-2}
\end{pmatrix} \qquad (16)$$
Fitting the mixture model using this padded design matrix has the added advantage that the sample mean and variance of (the observed values in) each column are guaranteed to be equal. The missing-data padding can thus be a useful trick even if the time series itself features no missing values, particularly if only a limited amount of data is available.
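A sketch of this padding in the same style as the earlier embedding code (again our own helper; NaN stands in for the '?' entries):

```python
import numpy as np

def delay_embed_padded(z, d):
    """Design matrix of Eq. (16): pad with d-1 missing values (NaN) on both
    ends, giving n + d - 1 rows instead of n - d + 1."""
    z = np.asarray(z, dtype=float)
    pad = np.full(d - 1, np.nan)
    zp = np.concatenate([pad, z, pad])
    return np.array([zp[i:i + d] for i in range(len(z) + d - 1)])

# A series of length 4 with d = 3 gives a (6, 3) matrix whose first row is
# [nan, nan, z0] and whose last row is [z3, nan, nan].
```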
3 Constraining the Global Covariance
The Gaussian mixture model is ideal for modelling arbitrary continuous distributions. However, embedding a time series to a higher-dimensional space cannot lead to an arbitrary distribution. For instance, the mean and variance for each dimension should equal the mean and variance of the time series. In addition, all second-order statistics, such as covariances, should equal the respective autocovariances of the time series. These restrictions impose constraints on the mixture model, and accounting for them appropriately should lead to a more accurate model when fitting to data.
In the EM algorithm, we estimate means $\mu_k$, covariances $\Sigma_k$, and mixing coefficients $\pi_k$ for each component $k$, and then the global mean and covariance of the distribution defined by the model are
$$\mu = \sum_{k=1}^{K} \pi_k \mu_k, \qquad \Sigma = \sum_{k=1}^{K} \pi_k \left( \Sigma_k + \mu_k \mu_k^T \right) - \mu \mu^T. \qquad (17)$$
However, the global mean and covariance correspond to the mean and autocovariance matrix of the time series. This implies that the global mean for each dimension should be equal. Furthermore, the global covariance matrix should be symmetric and Toeplitz ("diagonal-constant"):
$$\Sigma \approx R_z = \begin{pmatrix}
r_z(0) & r_z(1) & r_z(2) & \cdots & r_z(d-1) \\
r_z(1) & r_z(0) & r_z(1) & \cdots & r_z(d-2) \\
\vdots & \vdots & \vdots & & \vdots \\
r_z(d-1) & r_z(d-2) & r_z(d-3) & \cdots & r_z(0)
\end{pmatrix}$$
where $r_z(l)$ is the autocovariance of the time series $z$ at lag $l$.
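For later use, the global moments of Eq. (17) are easy to compute from the component parameters; a NumPy sketch (the helper global_moments is our own):

```python
import numpy as np

def global_moments(pi, mu, Sigma):
    """Global mean and covariance of the mixture, as in Eq. (17)."""
    pi = np.asarray(pi, dtype=float)
    mu = np.asarray(mu, dtype=float)        # shape (K, d)
    Sigma = np.asarray(Sigma, dtype=float)  # shape (K, d, d)
    m = pi @ mu                             # sum_k pi_k mu_k
    S = np.einsum('k,kij->ij', pi, Sigma + mu[:, :, None] * mu[:, None, :])
    return m, S - np.outer(m, m)
```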
In practice, these statistics usually do not exactly correspond to each other, even when training the model on the missing-data padded design matrix discussed in Section 2.4. Unfortunately, the question of how to enforce this constraint in each M-step has no trivial solution. Forcing every component to have an equal mean and Toeplitz covariance structure on its own is one possibility, but this is far too restrictive.
Our suggestion is to calculate the M-step by Eqs. (7–9), and then modify the parameters as little as possible in order to achieve the appropriate structure. As $\theta = \{\mu_k, \Sigma_k, \pi_k\}_{k=1}^{K}$ contains the parameters for the mixture model, let $T$ be the subset of the parameter space such that all parameter values $\theta \in T$ correspond to a global mean with equal elements, and a Toeplitz covariance matrix by Eq. (17).
When maximising the expected log-likelihood with the constraints, the M-step should be
$$\theta^{(t+1)} = \arg\max_{\theta \in T} Q(\theta \mid \theta^{(t)}), \qquad (18)$$
but this is not feasible to solve exactly. Instead, we solve the conventional M-step
$$\theta' = \arg\max_{\theta} Q(\theta \mid \theta^{(t)}), \qquad (19)$$
and then project this $\theta'$ onto $T$ to find the closest solution
$$\theta^{(t+1)} = \arg\min_{\theta \in T} d(\theta, \theta') \qquad (20)$$
for some interpretation of the distance $d(\theta, \theta')$. If the difference is small, the expected log-likelihood $Q(\theta^{(t+1)} \mid \theta^{(t)})$ should not be too far from the optimal $\max_{\theta \in T} Q(\theta \mid \theta^{(t)})$. As the quantity is not maximised, though it can be observed to increase, this becomes a Generalised EM (GEM) algorithm. As long as an increase is ensured in every iteration, the GEM algorithm is known to have similar convergence properties as the EM algorithm [3,4].
Define the distance function between sets of parameters as follows:
$$d(\theta, \theta') = \sum_{k} \|\mu_k - \mu'_k\|^2 + \sum_{k} \|S_k - S'_k\|_F^2 + \sum_{k} (\pi_k - \pi'_k)^2, \qquad (21)$$
where $S_k = \Sigma_k + \mu_k \mu_k^T$ are the second moments of the distributions of each component and $\|\cdot\|_F$ is the Frobenius norm. Using Lagrange multipliers, it can be shown that this distance function is minimised by the results presented below in Eqs. (22) and (23).
3.1 The Mean
After an iteration of the normal EM algorithm by Eqs. (7–9), find the vector with equal components which is nearest to the global mean $\mu$ as calculated by Eq. (17). This is done by finding the mean $m$ of the components of $\mu$, and calculating the discrepancy $\delta$ of how much the current mean is off from the equal mean:
$$m = \frac{1}{d} \sum_{j=1}^{d} \mu_j, \qquad \delta = \mu - m\mathbf{1},$$
where $\mathbf{1}$ is a vector of ones. Shift the means of each component to compensate, as follows:
$$\mu'_k = \mu_k - \frac{\pi_k}{\sum_{j=1}^{K} \pi_j^2} \, \delta. \qquad (22)$$
As can be seen, components with larger $\pi_k$ take on more of the "responsibility" of the discrepancy, as they contribute more to the global statistics. Any weights which sum to unity would fulfil the constraints, but choosing the weights to be directly proportional to $\pi_k$ minimises the distance in Eq. (21).
3.2 The Covariance
After updating the means $\mu_k$, recalculate the covariances around the updated values as
$$\Sigma_k \leftarrow \Sigma_k + \mu_k \mu_k^T - \mu'_k \mu_k'^T.$$
Then, find the nearest (in Frobenius norm) Toeplitz matrix $R$ by calculating the mean of each diagonal of the global covariance matrix $\Sigma$ (from Eq. (17)):
$$r(0) = \frac{1}{d} \sum_{j=1}^{d} \Sigma_{j,j}, \qquad r(1) = \frac{1}{d-1} \sum_{j=1}^{d-1} \Sigma_{j,j+1}, \qquad r(2) = \frac{1}{d-2} \sum_{j=1}^{d-2} \Sigma_{j,j+2}, \quad \text{etc.}$$
The discrepancy from this Toeplitz matrix is
$$\Delta = \Sigma - R, \quad \text{where} \quad R = \begin{pmatrix}
r(0) & r(1) & r(2) & \cdots & r(d-1) \\
r(1) & r(0) & r(1) & \cdots & r(d-2) \\
\vdots & \vdots & \vdots & & \vdots \\
r(d-1) & r(d-2) & r(d-3) & \cdots & r(0)
\end{pmatrix}.$$
In order to satisfy the constraint of a Toeplitz matrix for the global covariance, the component covariances are updated as
$$\Sigma'_k = \Sigma_k - \frac{\pi_k}{\sum_{j=1}^{K} \pi_j^2} \, \Delta, \qquad (23)$$
the weights being the same as in Eq. (22). Eqs. (22) and (23), together with $\pi'_k = \pi_k$, minimise the distance in Eq. (21) subject to the constraints.
3.3 Heuristic Correction
Unfortunately, the procedure described above seems to occasionally lead to matrices $\Sigma'_k$ which are not positive definite. Hence an additional heuristic correction $c_k$ is applied in such cases to force the matrix to remain positive definite:
$$\Sigma''_k = \Sigma_k - \frac{\pi_k}{\sum_{k=1}^{K} \pi_k^2} \, \Delta + c_k I. \qquad (24)$$
In the experiments, the value $c_k = 1.1\,|\lambda_{k0}|$ is used, where $\lambda_{k0}$ is the most negative eigenvalue of $\Sigma'_k$. The multiplier needs to be larger than unity to avoid making the matrix singular.
A more appealing correction would be to only increase the negative (or zero)
eigenvalues to some acceptable, positive, value. However, this would break the
constraint of a Toeplitz global covariance matrix, and hence the correction must
be applied to all eigenvalues, as is done in Eq. (24) by adding to the diagonal.
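One possible NumPy arrangement of the projection in Eqs. (22)–(24), reusing global_moments from above (our own sketch, not the authors' implementation):

```python
import numpy as np

def project_constraints(pi, mu, Sigma):
    """Project the M-step output toward an equal-element global mean and a
    Toeplitz global covariance, with a heuristic positive-definite fix."""
    pi = np.asarray(pi, dtype=float)
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    K, d = mu.shape
    w = pi / np.sum(pi ** 2)                       # weights shared by Eqs. (22)-(23)

    # Eq. (22): shift component means so the global mean has equal elements
    g_mu, _ = global_moments(pi, mu, Sigma)
    delta = g_mu - g_mu.mean()
    mu_new = mu - np.outer(w, delta)

    # Recompute covariances around the shifted means, then the new global covariance
    Sigma = Sigma + mu[:, :, None] * mu[:, None, :] - mu_new[:, :, None] * mu_new[:, None, :]
    _, g_Sigma = global_moments(pi, mu_new, Sigma)

    # Nearest Toeplitz matrix: average each diagonal of the global covariance
    r = np.array([np.diagonal(g_Sigma, offset=l).mean() for l in range(d)])
    i, j = np.indices((d, d))
    Delta = g_Sigma - r[np.abs(i - j)]

    # Eq. (23), with the heuristic correction of Eq. (24) where needed
    Sigma_new = np.empty_like(Sigma)
    for k in range(K):
        cand = Sigma[k] - w[k] * Delta
        lam_min = np.linalg.eigvalsh(cand).min()
        if lam_min <= 0:
            cand = cand + 1.1 * abs(lam_min) * np.eye(d)
        Sigma_new[k] = cand
    return pi, mu_new, Sigma_new
```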
3.4 Free Parameters
The constraints reduce the number of free parameters relevant to calculating the AIC and BIC. Without constraints, the number of free parameters is
$$P = \underbrace{Kd}_{\text{means}} + \underbrace{\tfrac{1}{2} K d(d+1)}_{\text{covariances}} + \underbrace{K - 1}_{\text{mixing coeffs}},$$
where $K$ is the number of Gaussian components, and $d$ is the regressor length. There are $d - 1$ equality constraints for the mean, and $\frac{1}{2} d(d-1)$ constraints for the covariance, each reducing the number of free parameters by 1. With the constraints, the number of free parameters is then
$$P' = \underbrace{(K-1)d + 1}_{\text{means}} + \underbrace{\tfrac{1}{2}(K-1)d(d+1) + d}_{\text{covariances}} + \underbrace{K - 1}_{\text{mixing coeffs}}.$$
The leading term is reduced from $\frac{1}{2} K d^2$ to $\frac{1}{2}(K-1)d^2$, in effect allowing one additional component for approximately the same number of free parameters.
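The two counts as a small Python helper for plugging into Eqs. (10) and (11) (naming is ours):

```python
def n_free_params(K, d, constrained=False):
    """Number of free parameters: P without constraints, P' with them."""
    if not constrained:
        return K * d + K * d * (d + 1) // 2 + (K - 1)
    return (K - 1) * d + 1 + (K - 1) * d * (d + 1) // 2 + d + (K - 1)

# e.g. K = 10, d = 24: n_free_params(10, 24) == 3249; with constraints, 2950
```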
3.5 Exogenous Time Series or Non-contiguous Lag
If the design matrix is formed in a different way than by taking consecutive values, the restrictions on the covariance matrix will change. Such cases are handled by forcing any affected elements in the matrix to equal the mean of the elements they should equal. This will also affect the number of free parameters.

As this sort of delay embedding may inherently have a low intrinsic dimension, optimising the selection of variables could considerably improve accuracy.
4 Experiments: Time Series Forecasting
To show the effects of the constraints and the number of components on the
prediction accuracy, some experimental results are shown here. The studied time
series is the Santa Fe time series competition data set A: Laser generated data
[11]. The task is set at predicting the next 12 values, given the previous 12.
This makes the regressor size d = 24, and the mixture model fitting is in a
24-dimensional space. The original 1000 points of the time series are used for
training the model, and the continuation (9093 points) as a test set for estimating
the accuracy of the prediction. Accuracy is determined by the mean squared
error (MSE), averaging over the 12 future values for all samples. No variable
selection is conducted, and all 12 variables in the input are used for the model.
The missing-data padded design matrix of Section 2.4 is used for the training,
even when the time series otherwise has no missing values.
4.1 The Number of Components
Gaussian mixture models were trained separately for 1 through 30 components,
each time choosing out of 10 runs the best result in terms of log-likelihood. In
order to provide a perspective on average behaviour, this procedure was repeated
20 times both with and without the constraints detailed in Section 3.
The first two plots in Fig. 1 show the MSE of the prediction on the training and test sets, as an average of the 20 repetitions. It is important to note that the model fitting and selection was conducted by maximising the log-likelihood, and not by attempting to minimise this prediction error. Nevertheless, it can be seen that the training error decreases when adding components, and is consistently lower than the test error, as expected. Notably, the difference between training and test errors is much smaller for the constrained mixture model than the unconstrained one. Also, the training error is consistently decreasing for both models when increasing the number of components, but for the test error this is true only for the constrained model. It appears that the unconstrained model results in overfitting when used with more than 10 components.

Fig. 1. Results on the Santa Fe A Laser time series data, including the average MSE of the 12-step prediction on the training and test sets, and the AIC and BIC values, for both the constrained and unconstrained mixture models for 1 through 30 components.

For 1 to 10 components, there is no notable difference in the test error between the two models, presumably because around 10 components are required for a decent approximation of the density. However, for 10 or more components, the constraints provide a consistent improvement in the forecasting accuracy.
The third and fourth plots in Fig. 1 show the evolution of the average AIC and BIC of the converged model. The line plots show the average value of the
criterion, and the asterisks depict the minimum AIC (or BIC) value (i.e., the
selected model) for each of the 20 runs. As results on the test set are not available
in the model selection phase, the number of components should be chosen based
on these criteria. As the log-likelihood grows much faster for the unconstrained
model, this results in a consistently larger number of components as selected by
both criteria. Comparing the AIC and BIC, it is clear that BIC tends to choose
fewer components, as expected. However, the test MSE for the constrained model
keeps decreasing even until 30 components, suggesting that both criteria may be
exaggerating the penalisation in this case when increasing the model size.
4.2 Missing Data
To study the effect of missing values, the modelling of the Santa Fe Laser time
series is repeated with various degrees of missing data (1% through 50%). In the
training phase, missing data is removed at random from the time series before
forming the padded design matrix. To calculate the testing MSE, missing values are also removed from the inputs (i.e., the past values from which predictions are to be made) at the same probability. The MSE is then calculated as the error between the forecast and the actual time series (with no values removed).

Fig. 2. Results on the Santa Fe A Laser time series data with 10% missing values, including the average MSE of the 12-step prediction on the training and test sets for both the constrained and unconstrained mixture models for 1 through 20 components.
The training and test MSE for 10% missing values are shown in Fig. 2. The behaviour is similar to the corresponding plots in Fig. 1, although the difference in the testing MSE appears more pronounced, and for a lower number of components. This supports the notion that the constraints help against overfitting.
Fig. 3 shows the number of components selected by AIC and BIC, and the corresponding test MSEs, for various degrees of missing values. As expected, the forecasting accuracy deteriorates with an increasing ratio of missing data. The number of components selected by the AIC remains largely constant, and the constrained model consistently performs better. The BIC, on the other hand, seems to select far too few components for the constrained model (the MSE plots in Figs. 1 and 2 suggest five components are far from sufficient), resulting in a reduced forecasting accuracy.
Figs. 1 and 3 reveal largely similar results between using AIC and BIC for
the unconstrained case. However, for the constrained model, BIC is clearly too
restrictive, and using AIC leads to more accurate results.
5 Conclusions
Time series modelling through Gaussian mixture models is an appealing method, capable of accurate short-to-medium term prediction and missing value interpolation. Certain restrictions on the structure of the model arise naturally through the modelling setup, and appropriately including these constraints in the modelling procedure further increases its accuracy. The constraints are theoretically justified, and experiments support their utility. The effect is negligible when there are enough samples or few components such that fitting a mixture model is easy, but in more challenging situations with a large number of components or missing values they considerably reduce the risk of overfitting.
Fig. 3. Results on the Santa Fe A Laser time series data for various degrees of missing
values, including the number of components selected by AIC (left) and BIC (right) and
the resulting MSEs of the corresponding test set predictions.
References
1. McLachlan, G., Peel, D.: Finite Mixture Models. Wiley Series in Probability and
Statistics. John Wiley & Sons, New York (2000)
2. Kantz, H., Schreiber, T.: Nonlinear Time Series Analysis. Cambridge nonlinear
science series. Cambridge University Press (2004)
3. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society. Series B
(Methodological) 39(1) (1977) pp. 1–38
4. McLachlan, G., Krishnan, T.: The EM Algorithm and Extensions. Wiley Series
in Probability and Statistics. John Wiley & Sons, New York (1997)
5. Akaike, H.: A new look at the statistical model identification. IEEE Transactions on Automatic Control 19(6) (1974) 716–723
6. Schwarz, G.: Estimating the dimension of a model. The Annals of Statistics 6(2) (1978) 461–464
7. Anderson, T.W.: An Introduction to Multivariate Statistical Analysis. Third edn.
Wiley-Interscience, New York (2003)
8. Ghahramani, Z., Jordan, M.: Learning from incomplete data. Technical report,
Lab Memo No. 1509, CBCL Paper No. 108, MIT AI Lab (1995)
9. Hunt, L., Jorgensen, M.: Mixture model clustering for mixed data with missing
information. Computational Statistics & Data Analysis 41(3–4) (2003) 429–440
10. Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data. Second edn.
Wiley-Interscience (2002)
11. Gershenfeld, N., Weigend, A.: The Santa Fe time series competition data (1991)
http://www-psych.stanford.edu/~andreas/Time-Series/SantaFe.html.