Gaussian Mixture Models for Time Series
Modelling, Forecasting, and Interpolation
Emil Eirola^1 and Amaury Lendasse^{1,2,3}
^1 Department of Information and Computer Science, Aalto University,
FI–00076 Aalto, Finland
emil.eirola@aalto.fi
^2 IKERBASQUE, Basque Foundation for Science, 48011 Bilbao, Spain
^3 Computational Intelligence Group, Computer Science Faculty, University of the
Basque Country, Paseo Manuel Lardizabal 1, Donostia/San Sebastián, Spain
Abstract. Gaussian mixture models provide an appealing tool for time
series modelling. By embedding the time series into a higher-dimensional
space, the density of the points can be estimated by a mixture model.
The model can be used directly for short-to-medium term forecasting and
missing value imputation. The modelling setup introduces some restric-
tions on the mixture model, which, when appropriately taken into account,
result in a more accurate model. Experiments on time series forecasting
show that including the constraints in the training phase particularly re-
duces the risk of overfitting in challenging situations with missing values
or a large number of Gaussian components.
Keywords: time series, missing data, Gaussian mixture model
1 Introduction
A time series is one of the most common forms of data, and has been studied
extensively from weather patterns spanning centuries to sensors and microcon-
trollers operating on nanosecond scales. The features and irregularities of time
series can be modelled through various means, such as autocovariance analysis,
trend fitting, or frequency-domain methods. From a machine learning perspec-
tive, the most relevant tasks tend to be prediction of one or several future data
points, or interpolation for filling in gaps in the data. In this paper, we study a
model for analysing time series, which is applicable to both tasks.
For uniformly sampled stationary processes, we propose a versatile method-
ology to model the features of the time series by embedding the data into a high-
dimensional regressor space. The density of the points in this space can then be
modelled with Gaussian mixture models [1]. Such an estimate of the probability
density enables a direct way to interpolate missing values in the time series and
conduct short-to-medium term prediction by finding the conditional expecta-
tion of the unknown values. Embedding the time series in a higher-dimensional
space imposes some restrictions on the possible distribution of points, but these
constraints can be accounted for when fitting the Gaussian mixture models.
The suggested framework can readily be extended to situations with several
related time series, using exogenous time series to improve the predictions of a
target series. Furthermore, any missing values can be handled by the Gaussian
mixture model in a natural manner.
This paper is structured as follows. Section 2 presents the procedure for mod-
elling time series by Gaussian mixture models, the constraints on the Gaussian
mixture model due to time series data are discussed in Section 3, and some exper-
iments showing the effect of selecting the number of components and introducing
missing values are studied in Section 4.
2 Mixture Models for Time Series
Given a time series z of length n, corresponding to a stationary process,
\[
z_0, z_1, z_2, \ldots, z_{n-2}, z_{n-1},
\]
by choosing a regressor length d we can conduct a delay embedding [2] and form
the design matrix X,
\[
X = \begin{pmatrix}
z_0 & z_1 & \cdots & z_{d-1} \\
z_1 & z_2 & \cdots & z_d \\
\vdots & \vdots & & \vdots \\
z_{n-d} & z_{n-d+1} & \cdots & z_{n-1}
\end{pmatrix}
= \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_{n-d} \end{pmatrix}. \qquad (1)
\]
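As an illustration, the delay embedding of Eq. (1) can be written in a few lines of NumPy. The following is only a minimal sketch; the function name delay_embed is ours.

```python
import numpy as np

def delay_embed(z, d):
    """Design matrix of Eq. (1): row i is (z_i, ..., z_{i+d-1})."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    if n < d:
        raise ValueError("series shorter than regressor length")
    return np.array([z[i:i + d] for i in range(n - d + 1)])

# Example: embedding a short series with regressor length d = 3
X = delay_embed([1.0, 2.0, 3.0, 4.0, 5.0], d=3)   # shape (n - d + 1, d) = (3, 3)
```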
The rows of X can be interpreted as vectors in R^d. We can model the density of
these points by a Gaussian mixture model, with the probability density function
\[
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad (2)
\]
where \mathcal{N}(x \mid \mu_k, \Sigma_k) is the probability density function of the multivariate
normal distribution, µ_k represents the means, Σ_k the covariance matrices, and
π_k the mixing coefficients for each component k (0 < π_k < 1, \sum_{k=1}^{K} \pi_k = 1).
Given a set of data, the standard approach to training a Gaussian mixture
model is the EM algorithm [3,4] for finding a maximum-likelihood fit. The log-
likelihood of the N data points is given by
\[
\log L(\theta) = \log p(X \mid \theta) = \sum_{i=1}^{N} \log\!\left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \right), \qquad (3)
\]
where θ = {π_k, µ_k, Σ_k}_{k=1}^{K} is the set of parameters defining the model. The
log-likelihood can be maximised by applying the EM algorithm. After some ini-
tialisation of the parameters, the E-step is to find the expected value of the log-
likelihood function, with respect to the conditional distribution of the latent vari-
ables Z given the data X under the current estimate of the parameters θ^{(t)}:
\[
Q(\theta \mid \theta^{(t)}) = \mathrm{E}_{Z \mid X, \theta^{(t)}}\!\left[ \log L(\theta; X, Z) \right]. \qquad (4)
\]
This requires evaluating the probabilities t_{ik} that x_i is generated by the kth
Gaussian using the current parameter values:
\[
t_{ik}^{(t)} = \frac{\pi_k^{(t)} \, \mathcal{N}(x_i \mid \mu_k^{(t)}, \Sigma_k^{(t)})}
{\sum_{j=1}^{K} \pi_j^{(t)} \, \mathcal{N}(x_i \mid \mu_j^{(t)}, \Sigma_j^{(t)})}. \qquad (5)
\]
In the M-step, the expected log-likelihood is maximised:
\[
\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)}), \qquad (6)
\]
which corresponds to re-estimating the parameters with the new probabilities:
\[
\mu_k^{(t+1)} = \frac{1}{N_k} \sum_{i=1}^{N} t_{ik}^{(t)} x_i, \qquad (7)
\]
\[
\Sigma_k^{(t+1)} = \frac{1}{N_k} \sum_{i=1}^{N} t_{ik}^{(t)} \left( x_i - \mu_k^{(t+1)} \right)\left( x_i - \mu_k^{(t+1)} \right)^T, \qquad (8)
\]
\[
\pi_k^{(t+1)} = \frac{1}{N} \sum_{i=1}^{N} t_{ik}^{(t)}. \qquad (9)
\]
Here N_k = \sum_{i=1}^{N} t_{ik}^{(t)} is the effective number of samples covered by the kth
component. The E and M-steps are alternated repeatedly until convergence.
As the algorithm tends to occasionally converge to sub-optimal solutions, the
procedure can be repeated to find the best fit.
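For concreteness, a compact NumPy/SciPy sketch of the EM iteration above is given below. It follows Eqs. (5) and (7)–(9) directly; the initialisation, the fixed iteration count, and the small diagonal regulariser added for numerical stability are simplifications of ours, and missing values are not handled here.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm_em(X, K, n_iter=200, reg=1e-6, seed=0):
    """Fit a K-component Gaussian mixture to the rows of X by EM."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu = X[rng.choice(N, size=K, replace=False)]            # initial means
    Sigma = np.array([np.cov(X.T) + reg * np.eye(d)] * K)   # initial covariances
    pi = np.full(K, 1.0 / K)                                 # initial mixing coefficients
    for _ in range(n_iter):
        # E-step: responsibilities t_ik of Eq. (5)
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
            for k in range(K)])
        t = dens / dens.sum(axis=1, keepdims=True)
        # M-step: Eqs. (7)-(9)
        Nk = t.sum(axis=0)
        mu = (t.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (t[:, k, None] * diff).T @ diff / Nk[k] + reg * np.eye(d)
        pi = Nk / N
    return pi, mu, Sigma
```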
2.1 Model Structure Selection
The selection of the number of components Kis crucial, and has a significant
effect on the resulting accuracy. Too few components are not able to model the
distribution appropriately, while having too many components causes issues of
overfitting.
The number of components can be selected according to the Akaike informa-
tion criterion (AIC) [5] or the Bayesian information criterion (BIC) [6]. Both are
expressed as a function of the log-likelihood of the converged mixture model:
\[
\mathrm{AIC} = -2 \log L(\theta) + 2P, \qquad (10)
\]
\[
\mathrm{BIC} = -2 \log L(\theta) + \log(N)\, P, \qquad (11)
\]
where P = Kd + \tfrac{1}{2} Kd(d+1) + K - 1 is the number of free parameters. The
EM algorithm is run for several different values of K, and the model which
minimises the chosen criterion is selected. As log(N) > 2 in most cases, BIC
more aggressively penalises an increase in P, generally resulting in a smaller
choice of K than by AIC.
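A sketch of this selection loop, reusing the fit_gmm_em sketch from above and the unconstrained parameter count P, could look as follows; the helper names are ours and random restarts per K are omitted for brevity.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, pi, mu, Sigma):
    """Log-likelihood of Eq. (3) for a fitted mixture."""
    dens = np.column_stack([
        pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
        for k in range(len(pi))])
    return float(np.log(dens.sum(axis=1)).sum())

def select_components(X, K_max, criterion="BIC"):
    """Fit mixtures for K = 1..K_max and keep the one minimising AIC or BIC."""
    N, d = X.shape
    best_K, best_score, best_model = None, np.inf, None
    for K in range(1, K_max + 1):
        pi, mu, Sigma = fit_gmm_em(X, K)           # sketch from Section 2
        P = K * d + K * d * (d + 1) / 2 + K - 1    # unconstrained free parameters
        penalty = 2 * P if criterion == "AIC" else np.log(N) * P
        score = -2 * log_likelihood(X, pi, mu, Sigma) + penalty
        if score < best_score:
            best_K, best_score, best_model = K, score, (pi, mu, Sigma)
    return best_K, best_model
```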
2.2 Forecasting
The model readily lends itself to being used for short-to-medium term time series
prediction. For example, if a time series is measured monthly and displays some
seasonal behaviour, a Gaussian model could be trained with a regressor size of
24 (two years). This allows us to take the last year’s measurements as the first 12
months, and determine the conditional expectation of the following 12 months.
The mixture model provides a direct way to calculate the conditional expec-
tation. Let the input dimensions be partitioned into past values P (known) and
future values F (unknown). Then, given a sample x_i^P for which only the past
values are known and a prediction is to be made, calculate the probabilities of
it belonging to each component,
\[
t_{ik} = \frac{\pi_k \, \mathcal{N}(x_i^P \mid \mu_k, \Sigma_k)}
{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i^P \mid \mu_j, \Sigma_j)}, \qquad (12)
\]
where \mathcal{N}(x_i^P \mid \mu_k, \Sigma_k) is the marginal multivariate normal probability
density of the observed (i.e., past) values of x_i.
Let the means and covariances of each component also be partitioned accord-
ing to past and future variables:
\[
\mu_k = \begin{pmatrix} \mu_k^P \\ \mu_k^F \end{pmatrix}, \qquad
\Sigma_k = \begin{pmatrix} \Sigma_k^{PP} & \Sigma_k^{PF} \\ \Sigma_k^{FP} & \Sigma_k^{FF} \end{pmatrix}. \qquad (13)
\]
Then the conditional expectation of the future values with respect to component
k is given by
\[
\tilde{y}_{ik} = \mu_k^F + \Sigma_k^{FP} \left( \Sigma_k^{PP} \right)^{-1} \left( x_i^P - \mu_k^P \right) \qquad (14)
\]
in accordance with [7, Thm. 2.5.1]. The total conditional expectation can now
be found as a weighted average of these predictions by the probabilities t_{ik}:
\[
\hat{y}_i = \sum_{k=1}^{K} t_{ik} \, \tilde{y}_{ik}. \qquad (15)
\]
It should be noted that the method directly estimates the full vector of future
values at once, in contrast with most other methods which would separately
predict each required data point.
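A sketch of this conditional-expectation forecast, with NumPy integer index arrays P and F selecting the past and future dimensions of the regressor, is given below; the function and argument names are ours.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_conditional_mean(x_P, P, F, pi, mu, Sigma):
    """Conditional expectation of the F-block given the P-block, Eqs. (12)-(15)."""
    K = len(pi)
    # Component weights from the marginal density of the observed past, Eq. (12)
    w = np.array([
        pi[k] * multivariate_normal.pdf(x_P, mean=mu[k][P],
                                        cov=Sigma[k][np.ix_(P, P)])
        for k in range(K)])
    w /= w.sum()
    # Weighted sum (Eq. 15) of the per-component conditional means (Eq. 14)
    y = np.zeros(len(F))
    for k in range(K):
        S_PP = Sigma[k][np.ix_(P, P)]
        S_FP = Sigma[k][np.ix_(F, P)]
        y += w[k] * (mu[k][F] + S_FP @ np.linalg.solve(S_PP, x_P - mu[k][P]))
    return y
```

For the monthly example above with d = 24, P would contain the indices 0–11 (the last year's observations) and F the indices 12–23 (the 12 months to predict).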
2.3 Missing Values and Imputation
The proposed method is directly applicable to time series with missing values.
Missing data in the time series become diagonals of missing values in the design
matrix. The EM-algorithm can in a natural way account for missing values in
the samples [8,9].
An assumption here is that the data are Missing-at-Random (MAR) [10]:
\[
P(M \mid x_{\mathrm{obs}}, x_{\mathrm{mis}}) = P(M \mid x_{\mathrm{obs}}),
\]
i.e., the event M of a measurement being missing is independent of the value
it would take (x_mis), conditional on the observed data (x_obs). The stronger as-
sumption of Missing-Completely-at-Random (MCAR) is not necessary, as MAR
is an ignorable missing-data mechanism in the sense that maximum likelihood
estimation still provides a consistent estimator [10].
To conduct missing value imputation, the procedure is the same as for pre-
diction in Section 2.2. The only difference is that in this case the index set P
contains all known values for a sample (both before and after the target to be
predicted), while F contains the missing values that will be imputed.
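As a usage sketch (again with hypothetical names), imputing one row of the design matrix reuses the gmm_conditional_mean sketch from Section 2.2, with the observed entries as P and the missing entries, marked here by NaN, as F.

```python
import numpy as np

def impute_row(x, pi, mu, Sigma):
    """Fill the NaN entries of one design-matrix row by the conditional mean."""
    x = np.asarray(x, dtype=float)
    P = np.where(~np.isnan(x))[0]   # observed indices (before and after the gaps)
    F = np.where(np.isnan(x))[0]    # missing indices to impute
    out = x.copy()
    out[F] = gmm_conditional_mean(x[P], P, F, pi, mu, Sigma)
    return out
```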
2.4 Missing-data Padding
When using an implementation of the EM algorithm that is able to handle
missing values, it is reasonable to consider every value before and after the
recorded time series to be missing. This can be seen as “padding” the design
matrix X with missing values (marked as ‘?’), effectively increasing the number
of samples available for training from n − d + 1 to n + d − 1 (cf. Eq. (1)):
\[
X = \begin{pmatrix}
? & ? & \cdots & ? & z_0 \\
? & ? & \cdots & z_0 & z_1 \\
\vdots & \vdots & & \vdots & \vdots \\
? & z_0 & \cdots & z_{d-3} & z_{d-2} \\
z_0 & z_1 & \cdots & z_{d-2} & z_{d-1} \\
\vdots & \vdots & & \vdots & \vdots \\
z_{n-d} & z_{n-d+1} & \cdots & z_{n-2} & z_{n-1} \\
z_{n-d+1} & z_{n-d+2} & \cdots & z_{n-1} & ? \\
\vdots & \vdots & & \vdots & \vdots \\
z_{n-1} & ? & \cdots & ? & ?
\end{pmatrix}
= \begin{pmatrix}
x_0 \\ x_1 \\ \vdots \\ x_{d-2} \\ x_{d-1} \\ \vdots \\ x_{n-1} \\ x_n \\ \vdots \\ x_{n+d-2}
\end{pmatrix} \qquad (16)
\]
Fitting the mixture model using this padded design matrix has the added advan-
tage that the sample mean and variance of (the observed values in) each column
are guaranteed to be equal. The missing-data padding can thus be a useful trick
even if the time series itself features no missing values, particularly if only a
limited amount of data is available.
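A sketch of the missing-data padding of Eq. (16), with NaN standing in for the ‘?’ entries, extends the delay_embed sketch of Section 2; as before, the function name is ours.

```python
import numpy as np

def padded_delay_embed(z, d):
    """Missing-data padded design matrix of Eq. (16): n + d - 1 rows."""
    z = np.asarray(z, dtype=float)
    pad = np.full(d - 1, np.nan)          # NaN marks a missing value ('?')
    zp = np.concatenate([pad, z, pad])
    return np.array([zp[i:i + d] for i in range(len(zp) - d + 1)])
```

Note that fitting the mixture on this matrix then requires an EM implementation that handles missing values, as discussed above.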
3 Constraining the Global Covariance
The Gaussian mixture model is ideal for modelling arbitrary continuous distri-
butions. However, embedding a time series to a higher-dimensional space cannot
lead to an arbitrary distribution. For instance, the mean and variance for each
dimension should equal the mean and variance of the time series. In addition, all
second-order statistics, such as covariances, should equal the respective autoco-
variances of the time series. These restrictions impose constraints on the mixture
model, and accounting for them appropriately should lead to a more accurate
model when fitting to data.
In the EM algorithm, we estimate means µ_k, covariances Σ_k, and mixing
coefficients π_k for each component k, and then the global mean and covariance
of the distribution defined by the model are
\[
\mu = \sum_{k=1}^{K} \pi_k \mu_k, \qquad
\Sigma = \sum_{k=1}^{K} \pi_k \left( \Sigma_k + \mu_k \mu_k^T \right) - \mu \mu^T. \qquad (17)
\]
However, the global mean and covariance correspond to the mean and autoco-
variance matrix of the time series. This implies that the global mean for each
dimension should be equal. Furthermore, the global covariance matrix should be
symmetric and Toeplitz (“diagonal-constant”):
\[
\Sigma \approx R_z = \begin{pmatrix}
r_z(0) & r_z(1) & r_z(2) & \cdots & r_z(d-1) \\
r_z(1) & r_z(0) & r_z(1) & \cdots & r_z(d-2) \\
\vdots & \vdots & \vdots & & \vdots \\
r_z(d-1) & r_z(d-2) & r_z(d-3) & \cdots & r_z(0)
\end{pmatrix},
\]
where r_z(l) is the autocovariance of the time series z at lag l.
In practice, these statistics usually do not exactly correspond to each other,
even when training the model on the missing-data padded design matrix dis-
cussed in Section 2.4. Unfortunately, the question of how to enforce this con-
straint in each M-step has no trivial solution. Forcing every component to have
an equal mean and Toeplitz covariance structure on its own is one possibility,
but this is far too restrictive.
Our suggestion is to calculate the M-step by Eqs. (7–9), and then modify
the parameters as little as possible in order to achieve the appropriate structure.
As θ = {µ_k, Σ_k, π_k}_{k=1}^{K} contains the parameters for the mixture model, let Ω
be the space of all possible values of θ, and T ⊂ Ω be the subset such that all
parameter values θ ∈ T correspond to a global mean with equal elements and a
Toeplitz global covariance matrix by Eq. (17).
When maximising the expected log-likelihood with the constraints, the M-
step should be
\[
\theta^{(t+1)} = \arg\max_{\theta \in T} Q(\theta \mid \theta^{(t)}), \qquad (18)
\]
but this is not feasible to solve exactly. Instead, we solve the conventional M-step
\[
\theta' = \arg\max_{\theta \in \Omega} Q(\theta \mid \theta^{(t)}), \qquad (19)
\]
and then project this θ' onto T to find the closest solution
\[
\theta^{(t+1)} = \arg\min_{\theta \in T} d(\theta, \theta') \qquad (20)
\]
for some interpretation of the distance d(θ, θ'). If the difference is small, the
expected log-likelihood Q(θ^{(t+1)} | θ^{(t)}) should not be too far from the optimal
max_{θ∈T} Q(θ | θ^{(t)}). As the quantity is not maximised, though it can be observed
to increase, this becomes a Generalised EM (GEM) algorithm. As long as an
increase is ensured in every iteration, the GEM algorithm is known to have
similar convergence properties as the EM algorithm [3,4].
Define the distance function between sets of parameters as follows:
\[
d(\theta, \theta') = \sum_k \left\| \mu_k - \mu_k' \right\|^2
+ \sum_k \left\| S_k - S_k' \right\|_F^2
+ \sum_k \left( \pi_k - \pi_k' \right)^2, \qquad (21)
\]
where S_k = Σ_k + µ_k µ_k^T are the second moments of the distributions of each
component and ‖·‖_F is the Frobenius norm. Using Lagrange multipliers, it can
be shown that this distance function is minimised by the results presented below
in Eqs. (22) and (23).
3.1 The Mean
After an iteration of the normal EM algorithm by Eqs. (7–9), find the vector
with equal components which is nearest to the global mean µ as calculated
by Eq. (17). This is done by finding the mean m of the components of µ, and
calculating the discrepancy δ of how much the current mean is off from the equal
mean:
\[
m = \frac{1}{d} \sum_{j=1}^{d} \mu_j, \qquad \delta = \mu - m\mathbf{1},
\]
where 1 is a vector of ones. Shift the means of each component to compensate,
as follows:
\[
\mu_k' = \mu_k - \frac{\pi_k}{\sum_{j=1}^{K} \pi_j^2}\, \delta \qquad \forall k. \qquad (22)
\]
As can be seen, components with larger π_k take on more of the “responsibility”
of the discrepancy, as they contribute more to the global statistics. Any weights
which sum to unity would fulfil the constraints, but choosing the weights to be
directly proportional to π_k minimises the distance in Eq. (21).
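A minimal sketch of this mean correction (Eq. (22)), assuming the component means are stored as a K × d array and the mixing coefficients as a length-K array, could be:

```python
import numpy as np

def project_means(mu, pi):
    """Shift component means so the global mean has equal elements, Eq. (22)."""
    mu = np.asarray(mu, dtype=float)        # shape (K, d)
    pi = np.asarray(pi, dtype=float)        # shape (K,)
    g = pi @ mu                             # global mean of Eq. (17)
    delta = g - g.mean()                    # discrepancy from the equal mean m*1
    return mu - np.outer(pi, delta) / np.sum(pi ** 2)
```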
3.2 The Covariance
After updating the means µ_k, recalculate the covariances around the updated
values as
\[
\Sigma_k \leftarrow \Sigma_k + \mu_k \mu_k^T - \mu_k' \mu_k'^T \qquad \forall k.
\]
Then, find the nearest (in Frobenius norm) Toeplitz matrix R by calculating the
mean of each diagonal of the global covariance matrix Σ (from Eq. (17)):
\[
r(0) = \frac{1}{d} \sum_{j=1}^{d} \Sigma_{j,j}, \qquad
r(1) = \frac{1}{d-1} \sum_{j=1}^{d-1} \Sigma_{j,j+1}, \qquad
r(2) = \frac{1}{d-2} \sum_{j=1}^{d-2} \Sigma_{j,j+2}, \quad \text{etc.}
\]
The discrepancy ∆ from this Toeplitz matrix is
\[
\Delta = \Sigma - R, \quad \text{where} \quad
R = \begin{pmatrix}
r(0) & r(1) & r(2) & \cdots & r(d-1) \\
r(1) & r(0) & r(1) & \cdots & r(d-2) \\
\vdots & \vdots & \vdots & & \vdots \\
r(d-1) & r(d-2) & r(d-3) & \cdots & r(0)
\end{pmatrix}.
\]
In order to satisfy the constraint of a Toeplitz matrix for the global covariance,
the component covariances are updated as
\[
\Sigma_k' = \Sigma_k - \frac{\pi_k}{\sum_{j=1}^{K} \pi_j^2}\, \Delta \qquad \forall k, \qquad (23)
\]
the weights being the same as in Eq. (22). Eqs. (22) and (23), together with
π_k' = π_k, minimise the distance in Eq. (21) subject to the constraints.
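The corresponding covariance correction (Eq. (23)) can be sketched as follows, assuming the means have already been shifted by project_means and the covariances recentred as above; array layout and names are again ours.

```python
import numpy as np
from scipy.linalg import toeplitz

def project_covariances(Sigma, mu, pi):
    """Adjust component covariances so the global covariance is Toeplitz, Eq. (23)."""
    Sigma, mu, pi = np.asarray(Sigma), np.asarray(mu), np.asarray(pi)
    K, d = mu.shape
    g = pi @ mu
    # Global covariance of Eq. (17)
    S = sum(pi[k] * (Sigma[k] + np.outer(mu[k], mu[k])) for k in range(K))
    S -= np.outer(g, g)
    # Nearest Toeplitz matrix R: average each diagonal of S
    r = np.array([np.diag(S, l).mean() for l in range(d)])
    Delta = S - toeplitz(r)
    return np.array([Sigma[k] - pi[k] / np.sum(pi ** 2) * Delta for k in range(K)])
```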
3.3 Heuristic Correction
Unfortunately the procedure described above seems to occasionally lead to ma-
trices Σ_k' which are not positive definite. Hence an additional heuristic correction
c_k is applied in such cases to force the matrix to remain positive definite:
\[
\Sigma_k'' = \Sigma_k - \frac{\pi_k}{\sum_{j=1}^{K} \pi_j^2}\, \Delta + c_k I \qquad \forall k. \qquad (24)
\]
In the experiments, the value c_k = 1.1 |λ_{k,0}| is used, where λ_{k,0} is the most negative
eigenvalue of Σ_k'. The multiplier needs to be larger than unity to avoid making
the matrix singular.
A more appealing correction would be to only increase the negative (or zero)
eigenvalues to some acceptable, positive, value. However, this would break the
constraint of a Toeplitz global covariance matrix, and hence the correction must
be applied to all eigenvalues, as is done in Eq. (24) by adding to the diagonal.
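In code, the correction of Eq. (24) amounts to checking the eigenvalues of each projected covariance and, if needed, adding a multiple of the identity; a sketch:

```python
import numpy as np

def enforce_positive_definite(Sigma_k, factor=1.1):
    """Apply the heuristic correction of Eq. (24) to one projected covariance."""
    lam_min = np.linalg.eigvalsh(Sigma_k).min()
    if lam_min <= 0:
        # c_k = 1.1 * |most negative eigenvalue|, added to the whole diagonal
        Sigma_k = Sigma_k + factor * abs(lam_min) * np.eye(Sigma_k.shape[0])
    return Sigma_k
```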
3.4 Free Parameters
The constraints reduce the number of free parameters relevant to calculating the
AIC and BIC. Without constraints, the number of free parameters is
\[
P = \underbrace{Kd}_{\text{means}}
+ \underbrace{\tfrac{1}{2} Kd(d+1)}_{\text{covariances}}
+ \underbrace{K-1}_{\text{mixing coeffs}},
\]
where K is the number of Gaussian components, and d is the regressor length.
There are d − 1 equality constraints for the mean, and \tfrac{1}{2} d(d−1) constraints
for the covariance, each reducing the number of free parameters by 1. With the
constraints, the number of free parameters is then
\[
P' = \underbrace{(K-1)d + 1}_{\text{means}}
+ \underbrace{\tfrac{1}{2}(K-1)d(d+1) + d}_{\text{covariances}}
+ \underbrace{K-1}_{\text{mixing coeffs}}.
\]
The leading term is reduced from \tfrac{1}{2} K d^2 to \tfrac{1}{2}(K-1) d^2, in effect allowing one
additional component for approximately the same number of free parameters.
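For reference, the two counts can be computed with a small helper of ours, usable in the AIC/BIC selection sketch of Section 2.1:

```python
def n_free_parameters(K, d, constrained=False):
    """Free-parameter count P (unconstrained) or P' (constrained), Section 3.4."""
    if constrained:
        return (K - 1) * d + 1 + (K - 1) * d * (d + 1) // 2 + d + (K - 1)
    return K * d + K * d * (d + 1) // 2 + (K - 1)
```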
3.5 Exogenous Time Series or Non-contiguous Lag
If the design matrix is formed in a different way than by taking consecutive
values, the restrictions on the covariance matrix change. Such cases are handled
by forcing each affected element of the matrix to equal the mean of the elements
it is constrained to match. This also affects the number of free parameters.
As this sort of delay embedding may inherently have a low intrinsic dimen-
sion, optimising the selection of variables could considerably improve accuracy.
4 Experiments: Time Series Forecasting
To show the effects of the constraints and the number of components on the
prediction accuracy, some experimental results are shown here. The studied time
series is the Santa Fe time series competition data set A: Laser generated data
[11]. The task is to predict the next 12 values, given the previous 12.
This makes the regressor size d = 24, and the mixture model fitting is in a
24-dimensional space. The original 1000 points of the time series are used for
training the model, and the continuation (9093 points) as a test set for estimating
the accuracy of the prediction. Accuracy is determined by the mean squared
error (MSE), averaging over the 12 future values for all samples. No variable
selection is conducted, and all 12 variables in the input are used for the model.
The missing-data padded design matrix of Section 2.4 is used for the training,
even when the time series otherwise has no missing values.
4.1 The Number of Components
Gaussian mixture models were trained separately for 1 through 30 components,
each time choosing out of 10 runs the best result in terms of log-likelihood. In
order to provide a perspective on average behaviour, this procedure was repeated
20 times both with and without the constraints detailed in Section 3.
The first two plots in Fig. 1 show the MSE of the prediction on the training
and test sets, as an average of the 20 repetitions. It is important to note that the
model fitting and selection was conducted by maximising the log-likelihood, and
not by attempting to minimise this prediction error. Nevertheless, it can be seen
that the training error decreases when adding components, and is consistently
lower than the test error, as expected. Notably, the difference between train-
ing and test errors is much smaller for the constrained mixture model than the
unconstrained one. Also, the training error is consistently decreasing for both
models when increasing the number of components, but for the test error this is
true only for the constrained model. It appears that the unconstrained model results
in overfitting when used with more than 10 components. For 1 to 10 components,
Fig. 1. Results on the Santa Fe A Laser time series data, including the average MSE
of the 12-step prediction on the training and test sets, AIC and BIC values for both
the constrained and unconstrained mixture models for 1 through 30 components.
there is no notable difference in the test error between the two models, presum-
ably because around 10 components are required for a decent approximation
of the density. However, for 10 or more components, the constraints provide a
consistent improvement in the forecasting accuracy.
The third and fourth plots in Fig. 1 show the evolution of the average AIC
and BIC of the converged model. The line plots show the average value of the
criterion, and the asterisks depict the minimum AIC (or BIC) value (i.e., the
selected model) for each of the 20 runs. As results on the test set are not available
in the model selection phase, the number of components should be chosen based
on these criteria. As the log-likelihood grows much faster for the unconstrained
model, this results in a consistently larger number of components as selected by
both criteria. Comparing the AIC and BIC, it is clear that BIC tends to choose
fewer components, as expected. However, the test MSE for the constrained model
keeps decreasing even until 30 components, suggesting that both criteria may be
exaggerating the penalisation in this case when increasing the model size.
4.2 Missing Data
To study the effect of missing values, the modelling of the Santa Fe Laser time
series is repeated with various degrees of missing data (1% through 50%). In the
training phase, values are removed at random from the time series before
forming the padded design matrix. To calculate the testing MSE, missing values
Fig. 2. Results on the Santa Fe A Laser time series data with 10% missing values,
including the average MSE of the 12-step prediction on the training and test sets for
both the constrained and unconstrained mixture models for 1 through 20 components.
are also removed from the inputs (i.e., the past values from which predictions
are to be made) with the same probability. The MSE is then calculated as the error
between the forecast and the actual time series (with no values removed).
The training and test MSE for 10% missing values are shown in Fig. 2. The
behaviour is similar to the corresponding plots in Fig. 1, although the difference
in the testing MSE appears more pronounced, and for a lower number of com-
ponents. This supports the notion that the constraints help against overfitting.
Fig. 3 shows the number of components selected by AIC and BIC, and the
corresponding test MSEs, for various degrees of missing values. As expected, the
forecasting accuracy deteriorates with an increasing ratio of missing data. The
number of components selected by the AIC remains largely constant, and the
constrained model consistently performs better. The BIC, on the other hand,
seems to select far too few components for the constrained model (the MSE
plots in Figs. 1 and 2 suggest five components are far from sufficient), resulting
in a reduced forecasting accuracy.
Figs. 1 and 3 reveal largely similar results between using AIC and BIC for
the unconstrained case. However, for the constrained model, BIC is clearly too
restrictive, and using AIC leads to more accurate results.
5 Conclusions
Time series modelling through Gaussian mixture models is an appealing method,
capable of accurate short-to-medium term prediction and missing value interpo-
lation. Certain restrictions on the structure of the model arise naturally through
the modelling setup, and appropriately including these constraints in the mod-
elling procedure further increases its accuracy. The constraints are theoretically
justified, and experiments support their utility. The effect is negligible when
there are enough samples or few components such that fitting a mixture model
is easy, but in more challenging situations with a large number of components
or missing values they considerably reduce the risk of overfitting.
Fig. 3. Results on the Santa Fe A Laser time series data for various degrees of missing
values, including the number of components selected by AIC (left) and BIC (right) and
the resulting MSEs of the corresponding test set predictions.
References
1. McLachlan, G., Peel, D.: Finite Mixture Models. Wiley Series in Probability and
Statistics. John Wiley & Sons, New York (2000)
2. Kantz, H., Schreiber, T.: Nonlinear Time Series Analysis. Cambridge nonlinear
science series. Cambridge University Press (2004)
3. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society. Series B
(Methodological) 39(1) (1977) pp. 1–38
4. McLachlan, G., Krishnan, T.: The EM Algorithm and Extensions. Wiley Series
in Probability and Statistics. John Wiley & Sons, New York (1997)
5. Akaike, H.: A new look at the statistical model identification. IEEE Transactions
on Automatic Control 19(6) (1974) 716–723
6. Schwarz, G.: Estimating the dimension of a model. The Annals of Statistics 6(2)
(1978) 461–464
7. Anderson, T.W.: An Introduction to Multivariate Statistical Analysis. Third edn.
Wiley-Interscience, New York (2003)
8. Ghahramani, Z., Jordan, M.: Learning from incomplete data. Technical report,
Lab Memo No. 1509, CBCL Paper No. 108, MIT AI Lab (1995)
9. Hunt, L., Jorgensen, M.: Mixture model clustering for mixed data with missing
information. Computational Statistics & Data Analysis 41(3–4) (2003) 429–440
10. Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data. Second edn.
Wiley-Interscience (2002)
11. Gershenfeld, N., Weigend, A.: The Santa Fe time series competition data (1991)
http://www-psych.stanford.edu/~andreas/Time-Series/SantaFe.html.