Computational Statistics and Data Analysis 93 (2016) 131–145
General framework and model building in the class of Hidden
Mixture Transition Distribution models
Danilo Bolano a,b,∗, André Berchtold a,c
a National Center of Competence in Research LIVES, Switzerland
b Institute of Demographic and Life Course Studies, University of Geneva, Switzerland
c Institute of Social Sciences, University of Lausanne, Switzerland
article info
Article history:
Received 29 April 2014
Received in revised form 29 August 2014
Accepted 12 September 2014
Available online 22 September 2014
Keywords:
Mixture model
Model selection
Hidden Markov model
Mixture Transition Distribution model
BIC
Panel data
abstract
Modeling time series that present non-Gaussian features plays a central role in many
fields, including finance, seismology, psychology, and life course studies. The Hidden
Mixture Transition Distribution model is an answer to the complexity of such series. The
observed heterogeneity can be induced by one or several latent factors, and each level of
these factors is related to a different component of the observed process. The time series
is then treated as a mixture and the relation between the components is governed by a
Markovian latent transition process. This framework generalizes several specifications that
appear separately in related literature. Both the expectation and the standard deviation of
each component are allowed to be functions of the past of the process. The latent process
can be of any order, and can be modeled using a discrete Mixture Transition Distribution.
The effects of covariates at the visible and hidden levels are also investigated. One of the
main difficulties lies in correctly specifying the structure of the model. Therefore, we propose a hierarchical model selection procedure that exploits the multilevel structure of our approach. Finally, we illustrate the model and the model selection procedure through a real application in social science.
1. Introduction
Real data are often a combination of many different, possibly non-observed causes that lead to apparently unpredictable
behaviors. For example, in the context of longitudinal data, time series may show non-homogeneous behaviors, can switch
between alternative regimes characterized by a low or high variance, can contain extreme values, and the distribution of
future values can take complex multimodal shapes.
The Hidden Mixture Transition Distribution (HMTD) model considered in this study is a general framework to study
time series. The model can be used to describe and analyze the evolution of any continuous variable observed on a set of M
independent sequences that can also vary in length. The model integrates several refinements of the mixture model and the
hidden Markov model. More specifically, it can be used for different purposes, including describing observed data, searching
for a generalizable model, testing hypotheses, prediction, and classifying time series.
Mixture models are a popular and efficient approach, both for cross-sectional data and time series, to describe multimodal
distributions that do not correspond to any specific statistical family. Related literature provides many examples of the
usefulness of such models since the work of Weldon and Pearson at the end of the 19th century. Historically, mixture models
∗ Correspondence to: University of Geneva (CH), bd. du Pont d'Arve 40, Switzerland. Tel.: +41 22 379 98 74.
E-mail addresses: Danilo.Bolano@unige.ch, d_bolano@yahoo.it (D. Bolano), Andre.Berchtold@unil.ch (A. Berchtold).
http://dx.doi.org/10.1016/j.csda.2014.09.011
for continuous-valued time series were introduced in Le et al. (1996) as the Gaussian Mixture Transition Distribution (GMTD) model, building upon the earlier work of Raftery (1985) in the discrete case. See Berchtold and Raftery (2002) for a complete review of the basic principles of Mixture Transition Distribution (MTD) models. The general principle of all MTD-like models for continuous data is to combine different Gaussian distributions (called components) using a mixture model in which the mean of each distribution is a function of the past observed process. The weights associated with each component are interpreted as the probability of that specific component generating the next value of the process. This class of models has been expanded on in several ways, for example, by allowing detailed specifications for the mean of each component (e.g., Wong and Li, 2001), by allowing the variance of each component to depend on its past (Wong and Li, 2001; Berchtold, 2003), and by replacing the fixed probabilities associated with each component with a Markov chain (Bartolucci and Farcomeni, 2010). Then, covariates have been included (Chariatte et al., 2008), Luo and Qiu (2009) used non-Gaussian distributions, and extensions to bivariate data have been suggested (Hassan and Lii, 2006; Hassan and El-Bassiouni, 2013).
Hidden Markov Models (HMMs) form another class of stochastic processes often used to represent and analyze complex
time series in the presence of over-dispersion. They are particularly well suited to analyzing data switching between several
regimes. In its traditional formulation, a HMM combines a hidden first-order Markov chain with different conditionally independent distributions for the observed process. Here, each distribution is related to one of the states of the hidden Markov chain. Many developments to the basic HMM have been proposed, including using high-order Markov chains and removing the conditional independence hypothesis. The latter change allows the observed process to depend on both its past and the hidden process (Wellekens, 1987; Berchtold, 1999, 2002). Under the hypothesis of stationarity, the marginal
distribution of the observations in a HMM is a finite mixture, with the number of components being equal to the number of
states in the unobserved Markov process. Therefore, we may consider the two models as one unique approach. When the
transitions between components are driven by a Markov chain, the mixture transition model becomes a HMM. In addition,
when the observations are allowed to follow an autoregressive process, the HMM becomes a mixture transition model. Some
authors even refer to HMMs as Markov-dependent mixture models (as coined by Leroux, 1992).
It is worth noting that mixture transition and Markovian models are distributional stochastic models, in contrast to
other well-known and widespread approaches that are essentially point processes. The latter group includes the ARMA and
ARIMA models, which are based on autoregressive equations, and the ARCH and GARCH models, which explicitly consider
the variance of the process. See, for example, Box et al. (1994), Hamilton (1994), and Bollerslev et al. (1992) for a complete
review of these models, and Kon (1984) and Kim and Kon (1994) for a comparison of the different models. When trying
to predict the next value of a series, the advantage of the point approach is that the model will provide a clear answer as
one numerical value (associated with a confidence interval). However, the drawback is that, in most cases, the answer is
either inaccurate or completely wrong. Given the high variability of many time series, the expectation of the model is not a
good estimator of the value of the next observation. Even when modeling the variance of the process, there is only a small
probability of accurately predicting the next value. In the probabilistic approach, by contrast, instead of trying to determine the next value of the series, a non-null probability is associated with values (discrete case) or intervals (continuous case) that are possible candidates for the next observation. An adequate probabilistic model thus does not provide a single value, but rather leads to a complete representation of possible futures through a (possibly multimodal) distribution. In that sense, the answer given by this approach generally has a higher probability of helping to make the right decisions, because it shows all possibilities rather than one (probably) wrong value.
Hidden Markov models were historically used for speech recognition (Rabiner, 1989; Baum and Petrie, 1966), but many applications have since been found in other fields, including econometrics (e.g., Elliott et al., 1998; Hayashi, 2004; Netzer et al., 2008) and the biosciences (e.g., Le Strat and Carrat, 1999; Shirley et al., 2010). Mixture models for count data are also quite common in finance, biomedical studies (Schlattmann, 2009), and behavioral studies (under the name of growth mixture modeling, e.g., Muthen, 2001). However, Mixture Transition Models seem to be used almost exclusively in economics and finance (e.g., Wong and Chan, 2005; Frydman and Schuermann, 2008). In fact, despite their unanimously recognized advantages, mixture transition and hidden Markov models are still only sparsely used in the social sciences. This is unfortunate, since the current trend in this field is clearly to switch from cross-sectional to longitudinal surveys, hence the need for advanced methods for modeling longitudinal data showing non-Gaussian distributions.
In addition, even though many developments have been proposed during the past few decades on the basic MTD and
HMM models, there is still a need for a more general framework that integrates these refinements. As a result, this study has
three objectives. First, we define a general framework that integrates the many extensions previously presented separately in the literature (see Section 2), and then we discuss the estimation procedure (see Section 3). Second, as noted by Rabiner
(1989), the number of possibilities offered by hidden models is so large that it becomes difficult to identify an adequate
model structure for a particular research question. Therefore, in Section 4, we outline a search strategy similar to that used
with other multilevel models. Finally, we illustrate the model and the proposed procedure using a real dataset, the US Panel
Study of Income Dynamics (see Section 5).
2. The HMTD model
The Hidden Mixture Transition Distribution (HMTD) model developed in this paper combines a hidden and an observed level. It can be used to describe and analyze the evolution of any continuous variable observed on a set of M independent sequences, which may vary in length.
At the hidden level, a latent discrete variable, S, taking values in the finite set {1, ..., k}, follows a Markov chain of order ℓ ≥ 0 (ℓ = 0 is a special case reducing the model to a simple mixture). Each of these values represents a different hidden state of the process. In addition to its lags, the value taken by S at time t can also be influenced by one or several categorical covariates. Since S is unobserved, we never know its exact value at time t, but we can estimate its distribution and the most likely sequence of states corresponding to each sequence of observations.
At the observed level, we consider a random variable, X, taking values in ℝ. A Gaussian component is associated with each hidden state of the latent process, with an expectation and variance that may depend on both the lags of X and on a set of categorical and/or continuous covariates. Since we only know the distribution of the possible hidden states at time t, the visible model is a mixture of the k Gaussian components.
The model can be estimated by simultaneously using as many independent sequences of observations as required. Each sequence typically corresponds to the observation of a separate subject. Each hidden state and its associated visible component can then be interpreted as one possible behavior of the subjects under investigation, and by multiplying the number of components, we allow the subjects to follow different behaviors. This enables us to capture both the complexity of the overall population and the evolution of each individual over time.
The following subsections describe the two levels of the model in more detail.
2.1. The visible level
Let $\{X_t, t \in \mathbb{N}^*\}$ be a sequence of random variables taking values in $\mathbb{R}$. Let $X_{t-a}^{t-1}$ denote the past observations between time $t-a$ and $t-1$. A frequently used and convenient hypothesis assumes that the probability of $X_t$, given its past, follows a Gaussian distribution, but this hypothesis is generally too simplistic to account for the complexity of real data. A better solution is to assume that the observed time series was generated by $k$ different sub-models, each model being used for one or several parts of the overall time series, and to write the resulting model as a mixture (McLachlan and Peel, 2000). We then have
$$F(x_t \mid x_1^{t-1}) = \sum_{g=1}^{k} \lambda_g(t)\, G_g(x_t \mid x_{t-r_g}^{t-1}, C(t))$$
where $F(x_t \mid x_1^{t-1})$ is the cumulative distribution function of $x_t$, given its past, $G_g(x_t \mid x_{t-r_g}^{t-1})$ is a cumulative distribution function of $x_t$, given a part of its past (from $x_{t-r_g}$ to $x_{t-1}$, $r_g \geq 1$), $C(t)$ represents a set of covariates available at time $t$, and $\lambda_g(t)$ is the weight of the $g$th component at time $t$, with
$$\sum_{g=1}^{k} \lambda_g(t) = 1, \qquad \lambda_g(t) > 0, \quad \forall g, t.$$
Different specifications can be used for the $G_g$. However, we only consider the Gaussian case here because, as pointed out by Rabiner (1989), Gaussian mixtures can approximate almost all continuous density functions as closely as necessary. The $g$th component is then written as
$$G_g(x_t \mid x_{t-r_g}^{t-1}) = \Phi\!\left(\frac{x_t - \mu_{g,t}}{\sigma_{g,t}}\right).$$
In order to explicitly incorporate the dependence between successive observations into the model, the expectation and the standard deviation of each Gaussian component are written as functions of the past. The expectation of the $g$th component is specified by
$$\mu_{g,t} = \varphi_{g,0} + \sum_{i=1}^{p_g} \varphi_{g,i}\, x_{t-i} + \sum_{j=1}^{c_g} \delta_{g,j}\, c_j(t), \qquad p_g \geq 0,\ c_g \geq 0. \tag{2.1}$$
The first part of the equation is an autoregressive model, with the $\varphi_{g,i}$ coefficient associated with the $i$th lag of $X_t$. The second part represents the influence of the covariates: $c_j(t)$ is the $j$th covariate observed at time $t$, and $\delta_{g,j}$ is its associated coefficient. Covariates can be continuous or categorical. As usual, in order to facilitate the interpretation of the results, we suggest recoding the categorical covariates as 0–1 dummy variables. When the number of lags is fixed at zero ($p_g = 0$) and there are no covariates ($c_g = 0$), the expectation of the component reduces to the constant $\varphi_{g,0}$.
Different specifications can be chosen to model the standard deviation of the $g$th component:
$$\sigma_{g,t} = \sqrt{\theta_{g,0} + \sum_{j=1}^{q_g} \theta_{g,j}\, x_{t-j}^2}, \qquad q_g \geq 0, \tag{2.2}$$
$$\sigma_{g,t} = \sqrt{\theta_{g,0} + \sum_{j=1}^{q_g} \theta_{g,j}\, \bigl(x_{t-j} - \bar{x}_{t-q_g}^{\,t-1}\bigr)^2}, \qquad q_g \geq 2, \tag{2.3}$$
$$\sigma_{g,t} = \sqrt{\theta_{g,0} + \sum_{j=1}^{q_g} \theta_{g,j}\, (x_{t-j} - \mu_{g,t})^2}, \qquad q_g \geq 1, \tag{2.4}$$
$$\sigma_{g,t} = \sqrt{\theta_{g,0} + \sum_{j=1}^{q_g} \theta_{g,j}\, e_{g,t-j}^2}, \qquad q_g \geq 1, \tag{2.5}$$
where
$$\bar{x}_{t-q_g}^{\,t-1} = \frac{1}{q_g} \sum_{j=1}^{q_g} x_{t-j}, \qquad e_{g,t-j} = x_{t-j} - \mu_{g,t-j},$$
and
$$\theta_{g,0} > 0\ \ \forall g, \qquad \theta_{g,j} \geq 0\ \ \forall g,\ j = 1, \ldots, q_g.$$
The first three specifications were introduced in Berchtold (2003). The case of a constant standard deviation is included in the first specification by fixing $q_g = 0$. Since this first specification uses only the past squared observations, it should be used only on datasets in which a substantial part of the data is homoscedastic. Indeed, if we compare two time series, $S^a_t$ and $S^b_t$, such that $S^b_t = S^a_t + c$, the two series have the same standard deviation, but the parameters of Eq. (2.2) would take different values. The next two specifications of the standard deviation do not suffer from this problem, because they are both derived from the usual standard deviation formula. The difference lies in the reference point used. Eq. (2.3) directly compares each lag to the empirical mean. The only restriction here is that the number of lags has to be greater than or equal to 2. Eq. (2.4) compares each lag to the expectation of the component. As noted by Wong and Li (2000), the latter approach is somewhat curious, since $\mu_{g,t}$ is not always a good predictor of the next observation when the series is highly variable. However, in practice, this specification has the advantage of ensuring consistency between the two time-dependent elements ($\mu_{g,t}$ and $\sigma_{g,t}$) of each Gaussian component. Finally, Eq. (2.5) is the ARCH specification proposed by Wong and Li (2001).
As for the expectation, it is also possible to let the standard deviation of a component depend on external factors, and to introduce covariates into Eqs. (2.2)–(2.5). Furthermore, when the variability of the studied process is related to its magnitude, using specification (2.4), with the covariates included in the modeling of $\mu_{g,t}$, could prove particularly useful.
In the above equations, $p_g$ denotes the number of lags used when modeling the expectation of the $g$th component, and $q_g$ denotes the corresponding number of lags used in the modeling of the variance. The largest time dependence for each component is then $r_g = \max(p_g, q_g)$. Note that, even if $p_g > 1$ or $q_g > 1$, it is not mandatory to use all lags between 1 and $p_g$ or $q_g$. In practice, it can be interesting to associate some lags with only some components. For instance, in the original GMTD model (Le et al., 1996), the expectation of the $g$th component used only the $g$th lag of the observed variable, $X$. Another possibility is to constrain the sum of several parameters to take a particular value. For instance, Le et al. (1996) proposed imposing the following constraint on a component $g$:
$$\sum_{i=1}^{p_g} \varphi_{g,i} = 1.$$
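To make these specifications concrete, the following minimal sketch (in Python with numpy; the lag and parameter values are hypothetical illustrations, not estimates from this paper) computes the expectation of Eq. (2.1) and the standard deviation of Eq. (2.3) for a single component:

```python
import numpy as np

def component_mean(x_past, phi, delta=None, cov=None):
    """Expectation of one component, Eq. (2.1):
    mu_{g,t} = phi_0 + sum_i phi_i * x_{t-i} + sum_j delta_j * c_j(t).
    x_past[0] is x_{t-1}, x_past[1] is x_{t-2}, and so on."""
    mu = phi[0] + np.dot(phi[1:], x_past[:len(phi) - 1])
    if delta is not None:
        mu += np.dot(delta, cov)
    return mu

def component_sd(x_past, theta):
    """Standard deviation of one component, Eq. (2.3): each lag is
    compared with the empirical mean of the last q_g observations
    (q_g >= 2); theta_0 > 0 and theta_j >= 0 keep the result real."""
    q = len(theta) - 1
    xbar = np.mean(x_past[:q])
    return np.sqrt(theta[0] + np.dot(theta[1:], (x_past[:q] - xbar) ** 2))

# Hypothetical component with p_g = 2 and q_g = 2
x_past = np.array([10.2, 9.8, 10.0, 10.1])                     # x_{t-1}, x_{t-2}, ...
print(component_mean(x_past, phi=np.array([0.5, 0.6, 0.3])))   # 9.56
print(component_sd(x_past, theta=np.array([0.1, 0.4, 0.4])))   # about 0.36
```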
2.2. The hidden level
In the previous subsection, we wrote the weight of the $g$th component as a function of time: $\lambda_g(t)$. Of course, it is possible to remove this dependence and consider
$$\lambda_g(t) = \lambda_g, \quad \forall t.$$
In this case, the HMTD model reduces to a pure mixture model. On the other hand, following Weigend and Shi (2000), there are at least two different possibilities for specifying varying weights. The first one, called ''gated experts'', allows each weight to depend on a set of covariates (Weigend et al., 1995). We then have
$$\lambda_t = \bigl(\lambda_1(c_1(t)), \lambda_2(c_2(t)), \ldots, \lambda_k(c_k(t))\bigr).$$
Since covariates are supposed to evolve over time, the component weights do too.
The second possibility is to let the weights at time $t$ depend on the past through a Markov chain and, in this way, to implement a HMM at the hidden level. We then write
$$\lambda_g(t) = P(S_t = g \mid S_{t-1}, \ldots, S_{t-\ell})$$
where $S_t$ represents the component chosen at time $t$, and $\ell$ is the order of the Markov chain. When the order of the hidden Markov chain is set to zero, the model once again reduces to a pure mixture.
The last two approaches can be combined to let the weights of the components depend on both the past of the latent process and on the covariates. This is accomplished by adding additional terms to the previous equation:
$$\lambda_g(t) = P(S_t = g \mid S_{t-1}, \ldots, S_{t-\ell}, C_1(t), \ldots, C_{c_g}(t)). \tag{2.6}$$
Eq. (2.6), even if still describing transition probabilities, no longer corresponds to a Markov chain. Moreover, the number of parameters to be estimated can become prohibitive, especially when there is more than one lag. A solution is to model the hidden process through a discrete MTD model (Berchtold and Raftery, 2002), as follows:
$$\lambda_g(t) = \sum_{j=1}^{\ell} \psi_j\, q_{s_j,g} + \sum_{h=1}^{c_g} f_h(c_h),$$
where $q_{s_j,g}$ denotes the transition probability from state value $s_j$ at time $t-j$ to state value $g$ at time $t$, and $\psi_j$ is the weight of lag $j$. Then, the set of all transition probabilities can be written as a transition matrix, $Q$:
$$Q = \begin{pmatrix} q_{1,1} & \cdots & q_{1,k} \\ \vdots & \ddots & \vdots \\ q_{k,1} & \cdots & q_{k,k} \end{pmatrix}.$$
Then, $C = \{C_1, \ldots, C_k\}$ is a set of covariates and $f_1, \ldots, f_k$ are transformation functions. When the covariates are categorical, we can write
$$\sum_{h=1}^{c_g} f_h(c_h) = \sum_{h=1}^{c_g} \gamma_h\, q_{c_h,g}$$
where $q_{c_h,g}$ denotes the probability of the transition from the value of the $h$th covariate to the state value $g$ at time $t$, and $\gamma_h$ is the corresponding coefficient. Of course, the preceding equations have to be constrained in order for the results to be probabilities.
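A minimal numerical sketch of this hidden-level MTD (Python/numpy; the matrix, lag weights, and states are hypothetical) computes $\lambda_g(t)$ as the $\psi$-weighted mixture of the rows of $Q$ selected by the past states; provided the $\psi_j$ are non-negative and sum to one, the result is a proper distribution over the $k$ states. The categorical-covariate term would add $\gamma$-weighted rows of the covariate transition matrices in exactly the same way.

```python
import numpy as np

def mtd_weights(Q, psi, past_states):
    """lambda_g(t) = sum_j psi_j * q_{s_{t-j}, g} for all g at once:
    a convex combination of the rows of Q indexed by the past states.
    past_states[0] is the state at time t-1, past_states[1] at t-2, ..."""
    return sum(p * Q[s] for p, s in zip(psi, past_states))

# Hypothetical second-order MTD with k = 3 hidden states
Q = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.1, 0.3, 0.6]])
psi = [0.8, 0.2]                               # weights of lags 1 and 2
lam = mtd_weights(Q, psi, past_states=[0, 2])  # S_{t-1} = 1, S_{t-2} = 3
print(lam, lam.sum())                          # [0.58 0.22 0.20], sums to 1
```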
By allowing all possible transitions between hidden states to occur, we can represent the behavior of the observed time series as precisely as needed. However, it is also sometimes useful to constrain the transition matrix of the hidden level. For instance, consider representing the transitions between successive hidden states using a first-order Markov chain, with no covariates. With four components, the matrix is written as
$$A = \begin{pmatrix} q_{1,1} & q_{1,2} & q_{1,3} & q_{1,4} \\ q_{2,1} & q_{2,2} & q_{2,3} & q_{2,4} \\ q_{3,1} & q_{3,2} & q_{3,3} & q_{3,4} \\ q_{4,1} & q_{4,2} & q_{4,3} & q_{4,4} \end{pmatrix}.$$
An interesting case occurs when the different states are hierarchical. In other words, when a subject enters state $s_i$, he/she can only stay in this state or move to a state $s_j$ such that $s_j > s_i$ (see the left-hand side of Table 1). For example, in behavioral studies, we observe this kind of situation when the phenomenon evolves with the age of the subjects. Young subjects start in the first hidden state and then switch to higher states as they grow older. Hearing capabilities are a good example: young people are supposed to have maximal capabilities, which then decline with age, passing through different stages. Here, the first state would represent people with full hearing capabilities, and the last state could represent those who are deaf. Of course, the probability of remaining in the last state is one. If we want to impose a strict order between the states (one can switch only to the next state), then the transition matrix can be represented as shown on the right-hand side of Table 1. Another useful specification is to set $A$ to the identity matrix. Each state is then absorbing, meaning that it is impossible to leave a state. This implies that each independently observed subject is associated with one, and only one, state during its observation period. The model then classifies the subjects into $k$ mutually exclusive groups based on the sequence of data observed for each subject. Other approaches that use HMMs to classify sequence data are presented in Helske et al. (2010).
From a practical point of view, the constraints discussed above are very easy to set, because the Expectation–Maximization (EM) algorithm used to estimate the parameters of the HMTD model has the property that parameters initialized to zero will not be updated during the optimization process. Hence, it is sufficient to initialize the required parameters to zero.
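The sketch below (Python/numpy, with hypothetical initial values) builds the three constrained structures discussed above; since zero-initialized transition probabilities are never updated by the EM algorithm, these zero patterns persist throughout estimation.

```python
import numpy as np

k = 4

# Hierarchical states (Table 1, left): upper-triangular rows, each row
# normalized to sum to one; the last state is absorbing by construction.
A_hier = np.triu(np.ones((k, k)))
A_hier /= A_hier.sum(axis=1, keepdims=True)

# Strict order (Table 1, right): stay put or move to the next state only.
A_strict = np.zeros((k, k))
for i in range(k - 1):
    A_strict[i, i] = 0.5
    A_strict[i, i + 1] = 0.5
A_strict[k - 1, k - 1] = 1.0

# Classification: the identity matrix makes every state absorbing, so
# each subject is assigned to a single component over its whole sequence.
A_classify = np.eye(k)

print(A_hier, A_strict, A_classify, sep="\n\n")
```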
3. Estimation
Belonging to the class of hidden Markov models, the HMTD model can be estimated using the general framework
proposed by Rabiner (1989), in which three different problems have to be considered:
1. The computation of the (log-)likelihood of the sequence of observed data, given the current model.
2. The identification of the optimal sequence of hidden states, given the current model and the sequence of observed data.
3. The estimation of the optimal model parameters, given the sequence of observed data—that is, the parameters that
maximize the log-likelihood of the model.
Table 1
Transition matrices for hierarchical states.
$$A = \begin{pmatrix} q_{1,1} & q_{1,2} & q_{1,3} & q_{1,4} \\ 0 & q_{2,2} & q_{2,3} & q_{2,4} \\ 0 & 0 & q_{3,3} & q_{3,4} \\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad A = \begin{pmatrix} q_{1,1} & q_{1,2} & 0 & 0 \\ 0 & q_{2,2} & q_{2,3} & 0 \\ 0 & 0 & q_{3,3} & q_{3,4} \\ 0 & 0 & 0 & 1 \end{pmatrix}$$
The first problem is solved through an iterative computation known as the Forward–Backward procedure (Rabiner, 1989).
The second problem, sometimes called the decoding problem, is solved using the Viterbi algorithm (Viterbi, 1967). Finally, the
third problem is solved using a version of the Expectation–Maximization (EM) algorithm especially designed for the Hidden
Markov Model, known as the Baum–Welch algorithm (Baum et al., 1970). Each of these three problems has been studied
extensively and the interested reader can find detailed discussions in Baum et al. (1970), Dempster et al. (1977), Rabiner
(1989), McLachlan and Krishnan (1996), Berchtold (2002) and Berchtold (2004). Even though the framework presented in
this study incorporates many extensions to the basic HMM and MTD models, the estimation procedure remains essentially
the same. Only minor adjustments are necessary in order to explicitly incorporate the relationships between successive
observations and the effect of the covariates.
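As an illustration of the first problem, the following minimal sketch (Python with numpy and scipy; the two-state parameters are hypothetical) implements the scaled Forward recursion for a first-order hidden chain with Gaussian components. The mean and standard deviation of each component are supplied as functions of the component, the time index, and the whole series, so the autoregressive specifications of Section 2.1 can be plugged in directly.

```python
import numpy as np
from scipy.stats import norm

def forward_loglik(x, A, pi, mu_fn, sd_fn):
    """Log-likelihood of one observed sequence (Rabiner's problem 1)
    for a k-state, first-order hidden chain. mu_fn(g, t, x) and
    sd_fn(g, t, x) give the mean and sd of component g at time t.
    Scaling at each step avoids numerical underflow."""
    k, T = len(pi), len(x)
    loglik, alpha = 0.0, pi.copy()
    for t in range(T):
        dens = np.array([norm.pdf(x[t], mu_fn(g, t, x), sd_fn(g, t, x))
                         for g in range(k)])
        alpha = (alpha @ A if t > 0 else alpha) * dens
        c = alpha.sum()              # scaling constant at time t
        loglik += np.log(c)
        alpha /= c
    return loglik

# Hypothetical 2-state example with constant means and sds
A = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
x = np.array([0.1, 0.3, 2.9, 3.1, 0.2])
print(forward_loglik(x, A, pi,
                     mu_fn=lambda g, t, x: [0.0, 3.0][g],
                     sd_fn=lambda g, t, x: 0.5))
```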
Although the maximization of the log-likelihood is not difficult from a purely computational point of view, it is much more difficult to assess the quality of the solution. Different problems can occur during the estimation phase. For instance, even though it has been demonstrated that EM algorithms converge to a maximum of the log-likelihood, there is no way of ensuring that this is the global maximum, rather than a local optimum. Even for quite simple HMTD models with few components and lags, the solution space can be very complex, with many local optima. Different solutions have been proposed to overcome this issue, ranging from several independent runs of the EM algorithm and variants of the basic algorithm (e.g., the Classification EM and the Stochastic EM algorithms) to a combination of the EM algorithm and a gradient-type or genetic algorithm. More details can be found in Biernacki et al. (2000), Böhning (2001) and Berchtold (2004).
Another well-known situation is the possible degeneracy of the likelihood, which can become unbounded, typically when the variance of a component shrinks toward zero. This issue is easy to detect and then to avoid by imposing constraints on the model parameters, especially on the coefficients of the standard deviation. Finally, when a component is fitted to very small subsets of the data, the degeneracy in the maximum likelihood estimation might be due to particular features of those subsets, preventing a good generalization of the results. To avoid these fallacious situations, we again constrain some of the parameters, for instance, by limiting the ratio between the largest and smallest weights of the different components. For further details, refer to Berchtold (2003).
Other difficulties related to mixture models are formulating standard errors and t-statistics, in order to perform a diagnostic of the parameters estimated from the EM algorithm, and computing confidence intervals. The simple use of the expectation is typically not possible, and obtaining the Hessian matrix is not straightforward. In the literature, two classes of methods have been proposed to estimate the variance matrix of maximum likelihood estimators: information-based methods and resampling methods. Information-based methods estimate the variance matrix as the inverse of the information matrix. One way is to approximate the information matrix from the complete(-data) likelihood (e.g., Louis, 1982), so that the mixture membership is treated as known. On the other hand, Dietz and Böhning (1996) proposed approximating the Fisher information using the original data (i.e., using the incomplete-data likelihood). These methods are computationally efficient. However, in both cases, we only have an asymptotic approximation of the covariance matrix, so the sample size must be large enough to guarantee that the asymptotic theory of maximum likelihood holds. More recently, Boldea and Magnus (2009) derived the analytical forms of the score vector and Hessian matrix for multivariate Gaussian mixture models, which are used to estimate the observed information matrix directly.
The bootstrap method is a more popular technique for performing model diagnostics in mixture modeling. In this class
of methods, we can distinguish three main approaches: the parametric bootstrap (Basford et al., 1997; McLachlan and Peel, 2000), the non-parametric bootstrap (Efron, 1979), and a modified version of the non-parametric bootstrap, proposed by
Newton and Raftery (1994), known as the weighted bootstrap. In the latter approach, the data are weighted proportionally to
the number of times that a sample value occurs in the bootstrap re-sample. The parametric and non-parametric bootstrap
methods differ in the way the re-sampling is performed. In the first case, the repeated draws are made from the fitted
mixture (i.e., the mixture with parameters fixed at their estimated values). In the non-parametric approach, the bootstrap
samples are drawn directly from the sampling distribution of the original data. Whatever resampling technique is chosen,
the bootstrap approach estimates the model in each bootstrap sample, then computes the in-sample bootstrap standard
errors of the corresponding parameter estimates using a Monte Carlo approximation (McLachlan and Peel, 2000). Efron and
Tibshirani (1994) showed that 50–100 bootstrap replications are sufficient for reliable standard error estimation.
In the case of resampling-based methods, because of the invariance of the likelihood under permutations of the mixture
components, a label-switching problem may arise (Redner and Walker, 1984). However, according to McLachlan and Peel
(2000), if we use the parameters estimated from the original data as initial values in the EM algorithm in each bootstrap
sample, this problem should remain rare. Nevertheless, the label-switching problem has been extensively studied in the
literature, and different solutions have been proposed (Stephens, 2000).
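As an illustration, the sketch below (Python with numpy and scikit-learn) applies the parametric bootstrap to a plain two-component Gaussian mixture standing in for the full HMTD: fit the model, draw B samples from the fitted mixture, re-estimate on each sample starting from the original estimates to limit label switching, and take the standard deviation of the replicated estimates.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 1, 200)])[:, None]

gm = GaussianMixture(n_components=2, random_state=0).fit(X)

B, boot_means = 100, []                 # 50-100 replications suffice
for b in range(B):                      # (Efron and Tibshirani, 1994)
    Xb, _ = gm.sample(len(X))           # draw from the *fitted* mixture
    # start EM at the original estimates to limit label switching
    gmb = GaussianMixture(n_components=2, weights_init=gm.weights_,
                          means_init=gm.means_).fit(Xb)
    boot_means.append(gmb.means_.ravel())

se = np.array(boot_means).std(axis=0)   # Monte Carlo standard errors
print(gm.means_.ravel(), se)
```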
Longitudinal analyses in the social sciences generally focus more on panel data than on univariate time series. Therefore, when employing a bootstrap method, two further issues have to be addressed. First, we have to choose how to perform the sampling. Several methods of resampling panel data have been proposed in the literature: temporal sampling, where the resampling is performed at each time point; the block bootstrap, which re-samples blocks of consecutive observations; and individual or cross-sectional sampling, which treats each individual sequence as a unique whole unit in order to preserve intra-individual dependence (Kapetanios, 2008).
The second issue is related to longitudinal panel weights. The resampling must account for the sample weights to avoid a biased estimation, in a way similar to the weighted bootstrap proposed by Newton and Raftery (1994). In survey methodologies, resampling methods from stratified populations have been introduced (e.g., Kovar et al., 1988), but these still need to account for complex sampling designs, as well as deal with non-response and longitudinal panel weights.
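A minimal sketch of the individual (cross-sectional) scheme follows (Python/numpy; the weighting is a hypothetical simplification of the weighted bootstrap idea): whole sequences are redrawn with replacement, with selection probabilities proportional to the panel weights, so that intra-individual dependence is preserved.

```python
import numpy as np

def cross_sectional_resample(sequences, rng, weights=None):
    """Draw a bootstrap sample of whole individual sequences, preserving
    intra-individual dependence (Kapetanios, 2008). Optional panel
    weights make each unit's selection probability proportional to its
    weight, in the spirit of Newton and Raftery (1994)."""
    M = len(sequences)
    p = None if weights is None else np.asarray(weights) / np.sum(weights)
    idx = rng.choice(M, size=M, replace=True, p=p)
    return [sequences[i] for i in idx]

# Hypothetical unbalanced panel: sequences of different lengths
rng = np.random.default_rng(1)
panel = [rng.normal(size=T) for T in (10, 14, 36)]
sample = cross_sectional_resample(panel, rng, weights=[1.0, 0.5, 2.0])
print([len(s) for s in sample])
```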
4. Model selection
The great flexibility of the HMTD model can sometimes also be a drawback. For example, it is difficult to decide on the structure of the ''best'' model to fit the data. Should we increase the number of lags, or the number of components? Should we use more covariates? Should we use a non-constant variance? These are just some of the options we need to consider. However, before addressing these questions, it is necessary to consider the concept of a statistical model and the notion of a ''best'' model.
Modeling a time series can be motivated by different objectives, for instance:
1. Obtaining the most precise representation of the data under investigation.
2. Obtaining a model able to represent time series similar to the observed data.
3. Building a model that provides the best possible predictions.
4. Classifying the data into a finite set of mutually exclusive categories.
These different objectives are, of course, related. However, even the notion of a ''best'' model can differ. In general, we consider that model A describes a given time series better than model B does when model A has the greater probability of generating the data. This is equivalent to saying that the (log-)likelihood of model A, given the data, is larger than the (log-)likelihood of model B. On the other hand, since it is almost always possible to increase the (log-)likelihood of a model by simply adding more parameters, the model representing the data in the most accurate way can be very complex in terms of free parameters that need to be estimated. In such a case, this model is perhaps not the most desirable for the purpose of analysis and decision-making. This is why we generally prefer to select a model based on a penalized criterion, considering both the precision of the modeling and its complexity. The Bayesian Information Criterion (BIC, see Schwarz, 1978; Kass and Raftery, 1995; Raftery, 1995) is an example of such a criterion. Moreover, Leroux (1992) proves the consistency of maximum (penalized) likelihood estimators for a mixing distribution obtained from the number of components selected using the AIC or BIC.
For a given model, the BIC is defined as
$$\mathrm{BIC} = -2\,\mathrm{LL} + P \log(N)$$
where LL is the log-likelihood of the model, $P$ is the number of independent parameters, and $N$ is the number of data used to compute the log-likelihood. The model with the smallest BIC is chosen. For the log-likelihood, we can choose between the observed and the complete log-likelihood. In the first case, we consider the contribution of the visible part of the model, given the hidden part, and the BIC explains how well the model fits the data. However, by using the complete log-likelihood, we have the information from both the visible and the hidden part of the model. Since we are interested in assessing the quality of both levels of the HMTD, we use the latter approach.
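The computation is elementary and can be checked against the application of Section 5. For instance, the four-component model in Table 2 has a complete log-likelihood of −42,105.6 with P = 28 parameters estimated on N = 22,496 data points:

```python
import numpy as np

def bic(loglik, n_params, n_data):
    """BIC = -2 LL + P log(N); the model with the smallest BIC wins."""
    return -2.0 * loglik + n_params * np.log(n_data)

# Reproduces (up to rounding) the last row of Table 2 in Section 5.1
print(bic(-42105.6, 28, 22496))   # about 84,491.8, reported as 84,491.7
```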
When we are mainly interested in using the model to predict the probability of future events, we have to consider that the future behavior of the series will, generally, not be the same as its past behavior. Consequently, a model that too precisely describes a particular, but rarely observed, feature of the data can be disadvantageous. In that case, using the BIC can also be interesting, since the precise description of a particular aspect of a time series generally involves many additional parameters.
Another use of mixture models is in classifying data. Here, we assume that each component of the model generates a different kind of data, and we try to allocate each observed data element to one of the components in order to obtain homogeneous clusters. In the context of time series, we usually try to obtain a representation in which groups of several successive data are identified. We then need a model that is sufficiently rough, one that will not be too influenced by relatively small variations in the data. In practice, the mixture model is used to associate, a posteriori, each observation with one component (e.g., McLachlan and Basford, 1988), and in particular, with the component with the largest expectation.
Whatever the final goal of the modeling, we can adopt either a deductive or an inductive approach. In the deductive case, we start from a clear theory, which the statistical model should reflect. For instance, if the theory tells us that three different behaviors can be adopted by the subjects under study, then we should choose a three-component model, one component representing each separate behavior. Next, if the theory tells us that the first behavior is memoryless, while the other two can be explained by their past behavior, then the first component should have a constant mean, while the other two should use one or several lags, and so on. If the theory is sufficiently detailed, then we can use it to completely define the structure of the model.
Once computed, a theory-driven model can be analyzed and its parameters interpreted. Hypotheses can be tested by
computing alternative models that modify aspects of the initial model. These alternative models are then compared to
the initial model based on either likelihood ratio tests (Giudici et al., 2000; Dannemann and Holzmann, 2008) or the BIC. If
alternative models prove to be superior to the first theory-driven model, then the theory may be questioned. The alternative
is to perform an empirical search adopting an inductive approach. In this case, we are not driven by a theory, but instead
search for the model that best explains the data under investigation. Given the large number of different models that could
possibly be specified from the same dataset, the question then arises of how to define an optimal search strategy. A possibility
is to rely on the fact that the HMTD belongs to the family of multilevel models and to adopt an ad hoc strategy, such as the
one proposed by Hox (1995). The idea is to work hierarchically, starting from a very basic model, then adding more advanced
elements one at a time. After each change, the new model is compared to the previous one using a measure such as the BIC.
In the context of a HMTD, this can be summarized as follows (a sketch of the resulting search loop is given after the list):
1. Begin with a two-component model, a first-order hidden transition matrix, constant expectations and variance for both
components, and no covariates.
2. Try to find the optimal number of components by computing additional models with 3, 4, 5, ... components. Note that
since no other explanatory elements are used in the model, the number of components could become very large. In this
situation, a limit (for instance 5 or 10) should be imposed.
3. Add lags to the modeling of the expectation of the visible components.
4. Allow the variance of each component to be non-constant using one of the specifications proposed in Section 2.1.
5. Replace the first-order transition matrix with a high-order transition matrix.
6. Introduce covariates at the visible level.
7. Introduce covariates at the hidden level.
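A minimal sketch of this stepwise search is given below (Python); the fit function, assumed here to estimate an HMTD with the given structure and to return its BIC, is hypothetical, and the covariate steps 6 and 7 would follow the same pattern.

```python
def stepwise_search(data, fit, max_k=5):
    """Hierarchical search of Section 4: start from a simple model and
    add one structural element at a time, keeping a change only if it
    lowers the BIC. `fit(data, **spec)` is a hypothetical estimator
    returning the BIC of the HMTD with the given structure."""
    spec = dict(k=2, p=0, q=0, order=1)       # step 1: the basic model
    best = fit(data, **spec)

    def try_update(key, value):
        nonlocal best, spec
        candidate = fit(data, **{**spec, key: value})
        if candidate < best:                  # smaller BIC is better
            best, spec = candidate, {**spec, key: value}
            return True
        return False

    for k in range(3, max_k + 1):             # step 2: components
        if not try_update("k", k):
            break
    for p in (1, 2, 3, 4):                    # step 3: lags of the mean
        if not try_update("p", p):
            break
    for q in (1, 2):                          # step 4: lags of the sd
        if not try_update("q", q):
            break
    for order in (2, 3):                      # step 5: hidden order
        if not try_update("order", order):
            break
    return spec, best
```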
This strategy calls for several remarks:
• Since we start with a very simple model in which the only explanatory element lies in the number of visible components, this number could be very large at first. Even if it is restricted to a maximal value in step 2, it is likely that introducing additional explanatory elements will later render some of the components useless. A good idea is then to try, at least at the end of each of the aforementioned steps, to reduce the number of components.
• Beginning with a refinement of the visible model before trying the hidden model is based on the fact that we know the structure of the data at the visible level. However, the structure of the hidden level is known only in distribution. Therefore, we have better control of the visible level, making it more useful to begin with a precise model of the observed level.
• Covariates are introduced in the model in the last steps of this strategy. In general, longitudinal data are, at least partially, self-explaining. Therefore, we should begin by using the information contained in the data, as much as possible, before using external information. Of course, if the phenomenon under study is known to be strongly related to a particular external factor, for example gender or age, then this factor can be introduced at an earlier stage.
Finally, always keep in mind that a theory-driven model allows more generalization than does an empirical model. In that sense, and whatever the approach used, one should avoid trying models without a real practical interpretation. The overall ''best'' model might not be the model with the optimal BIC value or the one that optimally fits a sample of data. Instead, it is the model that best helps to understand the phenomenon under study.
5. An empirical application
In this section, we illustrate an application of the HMTD and the selection procedure described in Section 4 on unbalanced panel data. The data are from the core representative sample of the Panel Study of Income Dynamics (PSID). The PSID is the longest-running nationally representative household panel study and offers unique opportunities to conduct life course research and generational studies over four decades. The PSID began collecting socio-economic and demographic data in 1968 on a national sample of US households (about 4800 families), with the objective of studying the dynamics of income and poverty in the United States (Hill, 1991). In order to allow the sample to remain representative of the US population, the PSID follows each family member of the original 1968 sample, even if they move out of the original household and create their own new families. Data were collected yearly until 1997, and every two years thereafter. As of 2009, 8690 households had been interviewed, incorporating 24,385 individuals.
Our application concerns household income from 1968 to 2009 in the United States. The family income used in this example reflects income from any source (labor, assets, transfers, and so on) and from all persons living in the family unit. Until 1979, the value of family income was topcoded at $99,999; in 1980, this limit increased to $999,999, and since 1981, it has been $9,999,999. To correctly compare the evolution over four decades, the data used in this study are adjusted for inflation in 2008 dollars (Consumer Price Index-U series).
During a life course, it is highly likely that a household will suffer a series of independent economic shocks due to the loss of a job, national economic downturns, changes in the family structure, and so on. When analyzing household income dynamics over almost four decades, the impact of these shocks should be relevant, and using an appropriate model becomes crucial. For instance, if we represent the income dynamics with a linear stationary (autoregressive) process, the time taken to recover from an economic shock is assumed to be the same for poor and wealthier households. In the literature, one of the proposed alternatives is a nonlinear model in the classic form of quadratic or cubic functions of lagged income (Lokshin and Ravallion, 2001).
The HMTD model, with different possible specifications for the expectation and the standard deviation of each Gaussian component (see Section 2.1), constitutes a valid alternative to this classic approach. In addition, using the HMTD, we can also simultaneously consider the income stratification over time, in terms of the inter-temporal correlation between successive income levels (the level of income today is naturally a good predictor of its value in subsequent years), as well as the problem of population heterogeneity. Interpreting the hidden variable, S, as different discrete income regimes, issues related to the heterogeneity in income dynamics can be addressed in two ways. The first is to compute directly the most likely sequence of hidden wealth states and the transition probabilities between them. The second is to constrain the hidden transition matrix to the identity matrix, and then to use the HMTD to classify the households into mutually exclusive homogeneous groups according to their income trajectories.

Table 2
Step 1. Finding the optimal number of components.

No. components  p  q  Order HC  No. of parameters  Complete log-likelihood  BIC
2               2  0  1         10                 −62,532.2                125,165
3               2  0  1         18                 −48,134.8                96,449.1
4               2  0  1         28                 −42,105.6                84,491.7

Note: p represents the dependence order of each expectation, and q denotes the number of lags for each standard deviation.
In order to capture the effects of demographic events and relevant changes in household composition on the earning
dynamics process, we include four socio-demographic factors in the model: the size of the household, the age and gender of
the head of the household, and his/her educational level. In this illustrative example, we focus only on the effect of certain
socio-demographic characteristics on the evolution of household income over time, even though other aspects, such as job
history and job sector, may be relevant. The variables are all considered to be time varying to capture the effect of changes
in the household composition on the earning dynamic process. We assume that a household may suffer a negative shock
when a change occurs involving the head of the household. For instance, owing to the gender gap in earnings, having a man
or a woman as the head of the household probably affects the overall income of the family.
In order to show the relevance of the model using medium- and long-term panel data, we only include households that
have been interviewed at least 10 times over the four decades covered by the PSID.
5.1. Model selection procedure
As underlined in the previous section, a crucial point is the model selection procedure used to find the ''best'' possible model. For illustration purposes, we use an inductive, data-driven approach here. Considering the large amount of data available in the PSID database and the large number of models that could be estimated, and to avoid an unnecessarily large computation time, we consider a random subsample of 1000 US households from the core representative sample. The dataset used is made up of 1000 multivariate series of different lengths (from 10 to 36 observations per household) representing the (log transformation of the) household income. To compare the specifications, all models are computed using the same number of observations, with a total of 22,496 data points.
In the following tables, we summarize the model specifications and, in order to assess the quality of each model, we report the complete log-likelihood and the BIC. Following Raftery (1995), a BIC difference of more than five points suggests ''strong'' evidence in favor of the model with the lower BIC.
As shown in Eq. (2.1), it is possible to assign a different number of lags when modeling the expectation of each component.
This allows components to have a shorter or longer memory, according to a specific research question, but for convenience,
we set the same number of lags for all components in this application.
In applied research, particularly in the social sciences, researchers might be more interested in an interpretable and generalizable model than in the numerically optimal one. A possibility is to set some constraints on the number of components and lags in order to obtain an interpretable model, and also to avoid computing too many different models with different specifications, thereby saving computational time. These constraints can be based, for instance, on previous investigations or on theoretical assumptions.
In this illustration, we assume that there are no more than four types of income regimes (i.e., the maximum number of components is four), representing, for instance, four types of income shocks: idiosyncratic or aggregate, transitory or permanent.
We also assume that, from an economic point of view, the number of lags when modeling the expectation of each component should be between two and four. Considering only a time dependence of order one, the current income may be over-influenced by negative short-term shocks. On the other hand, due to the business cycle, we assume that income shocks are not likely to persist after four years (e.g., Hamilton, 1989).
The idea of the proposed model selection procedure is to work hierarchically, starting from the simplest possible model. So, we first try to find the optimal number of components (i.e., income regimes) without explanatory variables, using a first-order hidden transition matrix and a constant variance.
Looking at the quality of the models shown in Table 2, using a hidden chain of order one and two lags for the expectation, the BIC always improves when increasing the number of components. Then, considering the constraint discussed above, the HMTD with the lowest BIC is the one with four components (BIC of 84,491.7).
By increasing the number of independent parameters through additional components, the log-likelihood always improves, but these additional components may not bring useful information to the model, resulting in a very marginal change
Table 3
Step 2. Add lags for the expectation.
No. components p q Order HC No. of parameters Complete log-likelihood BIC
4 2 0 1 28 −42,105.6 84,491.7
4 3 0 1 32 −35,065.2 70,451.1
4 4 0 1 36 −27,871.1 56,103
Table 4
Step 3. Add lags to the modeling of the standard deviation.
No. components p q Order HC No. of parameters Complete log-likelihood BIC
4 4 0 1 36 −27,871 56,103
4 4 1 1 40 −26,642.2 53,685.2
4 4 2 1 44 −26,635.1 53,711.1
Table 5
Recursive step: Complexity reduction.
No. components p q Order HC No. of parameters Complete log-likelihood BIC
4 4 1 1 40 −26,642.2 53,685.2
4 3 1 1 36 −27,570 55,501
4 3 0 1 32 −35,065.2 70,451.1
3 4 1 1 27 −42,348.1 84,966.8
3 4 0 1 24 −42,426.9 85,094.4
Table 6
Step 4. Order of the hidden chain.
No. components p q Order HC No. of parameters Complete log-likelihood BIC
4 4 1 0 27 −47,669.8 95,610.3
4 4 1 1 40 −26,642.2 53,685.2
4 4 1 2 76 −24,320.9 49,403.4
4 4 1 3 220 −24,258.1 50,720.8
in the log-likelihood. Since, at the beginning of the selection procedure, the only explanatory element is the number of components, the number of free parameters, and hence the penalization term in the BIC, is relatively low, making it easier to improve the BIC. For this reason, and to avoid unnecessarily complicated models, it is a good idea to set some constraints on the components. This might be a limit on the maximum number of components, as in this example, or a threshold on the membership probabilities. In this way, an additional component that represents only a few cases will not be included in the model selection procedure.
In the second step, after having fixed the number of components at four, we increase the time dependence of the expectation of each component. As mentioned before, we also make an assumption on the maximum number of lags for the expectation: from an economic perspective, we assume the existence of a business cycle such that economic shocks should not persist after a cycle of four years, so the maximum number of lags we consider for the expectation is four. Moreover, since 1997, the PSID data have been collected every two years, so considering four time points for the expectation, we actually cover a period of four to eight years.
According to the BIC reported in Table 3, the most adequate model uses a time dependence of order four for the expectation (BIC = 56,103).
Once we have found the best combination of the number of components and lags for the expectation, we introduce the dependence between successive observations in the specification of the standard deviation. Then, we increase the order of the hidden transition matrix. For the specification of the standard deviation, we use Eq. (2.3), comparing each lag to the empirical mean of the component. As a result of the nonlinearity, during the estimation of the standard deviations, the log-likelihood sometimes diverges, owing to the difficulty of identifying good initial conditions for the optimization procedure. Therefore, we include at most two lags. Table 4 shows that the best solution is to include only one lag (BIC = 53,685.2).
Since this step-by-step procedure does not compare all possible models, it might not lead directly to the best possible
solution. Before including a higher dependence at the hidden level, it can be interesting to try to reduce the complexity of
the model by reducing the number of components and/or the number of lags. In this particular case, any simplification of
the model leads to a worse solution, as shown in Table 5, which summarizes different models in reverse order of complexity.
Turning now to the hidden level (Table 6), we notice first that the HMTD performs better than a pure mixture model: the model with a hidden chain of order zero has a BIC almost double that of the HMTD of order one. On the other hand, by increasing the hidden order from one to two, we obtain a significantly better model (BIC = 49,403.4); however, using a third-order model is useless.
Table 7
Recursive step: Complexity reduction.
No. components p q Order HC No. of parameters Complete log-likelihood BIC
4 4 1 2 76 −24,320.9 49,403.4
4 4 0 2 72 −24,470.8 49,663.1
4 3 1 2 72 −24,619.2 49,959.8
4 3 0 2 68 −25,388.1 51,457.6
3 4 1 2 39 −29,075 58,540.8
3 4 0 2 36 −30,488.1 61,337
Once we have fixed the order of the hidden process at two, and before including the effects of external factors, we can again try to reduce the complexity of the model (see Table 7). However, as before, this step does not lead to a better BIC, so we keep the previous model.
The last step of the suggested procedure is to introduce the effect of external covariates. The covariates may be included
at the visible level, in the specification of the mean and standard deviation, and/or at the hidden level by influencing the
transition probabilities between hidden states. As evident from Eq. (2.1), it is possible to analyze the effect of a covariate on
one single component instead of on the overall mixture. This is useful to overcome problems of multi-collinearity, because
using the same covariate on multiple components might give redundant information to the model. Observing the effect of
the covariate on only one component might also be useful when we use the mixture models to classify data or when the
analysis is theory-driven, for instance, when we search for the effect of external factors on a specific group. In our example,
we include only one covariate at a time when modeling the expectation of each component. We consider four covariates. Two are categorical: having a higher level of education (H. Edu) and having a man as the head of the household (Male). The other two are continuous: the age of the head of the household (Age) and the size of the family (H. Size). Table 8 reports the quality of the more relevant models. The best model is the one using all four covariates, each covariate being associated with one of the four components (BIC = 44,946.8).
5.2. Model interpretation
In this section, we present an approach to interpret the results of a HMTD. To sum up, for the visible level, we have a
mixture of four components, with four lags for the expectation of each component, four covariates, and a time dependence of
order one for the standard deviation. The hidden level is driven by a second-order transition matrix. The dependent variable
is the (log transformation of the) income levels of a random sub-sample of 1000 US households from the core representative
sample of the PSID database. Four covariates are considered: the size of the household, and the age, gender, and educational
level of the head of the household.
The transition matrix between hidden states is as follows (in reduced form), where each row gives the distribution of $S_t$ conditional on the pair $(S_{t-2}, S_{t-1})$:

S_{t-2}  S_{t-1}  S_t=1   S_t=2   S_t=3   S_t=4
1        1        0.8200  0.0246  0.1403  0.0152
2        1        0.7594  0.0010  0.2198  0.0198
3        1        0.8661  0.0190  0.0917  0.0233
4        1        0.9195  0.0440  0.0149  0.0216
1        2        0.6566  0.0072  0.3148  0.0214
2        2        0.3368  0.0000  0.6437  0.0195
3        2        0.3987  0.0069  0.5685  0.0259
4        2        0.7504  0.0016  0.2261  0.0219
1        3        0.8629  0.0067  0.1029  0.0275
2        3        0.8542  0.0007  0.1029  0.0422
3        3        0.8240  0.0173  0.1318  0.0269
4        3        0.9031  0.0053  0.0726  0.0190
1        4        0.8563  0.0037  0.0271  0.1128
2        4        0.8692  0.0000  0.0462  0.0846
3        4        0.8620  0.0026  0.0239  0.1115
4        4        0.9482  0.0049  0.0191  0.0278
Here, $Q$ represents the general second-order transition process between hidden states, given the values taken by the hidden variable in the two previous periods ($S_{t-2}$, $S_{t-1}$). The distribution of the first hidden state ($\pi_1$) and the distribution of the second hidden state, conditional on the first ($\pi_{2|1}$), are
$$\pi_1 = (0.7805,\ 0.0091,\ 0.1716,\ 0.0388), \qquad \pi_{2|1} = \begin{pmatrix} 0.8412 & 0.0221 & 0.1167 & 0.0200 \\ 0.5356 & 0.0039 & 0.4383 & 0.0222 \\ 0.8611 & 0.0075 & 0.1025 & 0.0289 \\ 0.8839 & 0.0028 & 0.0291 & 0.0842 \end{pmatrix}.$$
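As a consistency check on these figures (a simple computation, not reported in the paper), the marginal distribution of the second hidden state is obtained by combining $\pi_1$ and $\pi_{2|1}$, and again gives about 84% to the first state:

```python
import numpy as np

pi1 = np.array([0.7805, 0.0091, 0.1716, 0.0388])
pi2_1 = np.array([[0.8412, 0.0221, 0.1167, 0.0200],
                  [0.5356, 0.0039, 0.4383, 0.0222],
                  [0.8611, 0.0075, 0.1025, 0.0289],
                  [0.8839, 0.0028, 0.0291, 0.0842]])
print(pi1 @ pi2_1)   # marginal of S_2: about [0.843 0.019 0.114 0.024]
```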
Table 8
Step 5. Introducing external covariates.
k p q Order HC No. of parameters Covariate Complete log-likelihood BIC
4 4 1 2 76 −24,320.9 49,403.4
4 4 1 2 77 Male −24,126.8 49,025.2
4 4 1 2 77 H. Edu −24,167.5 49,106.7
4 4 1 2 77 H. Size −24,310.8 49,393.1
4 4 1 2 78 Male, Age −22,285.1 45,351.6
4 4 1 2 78 Male, H. Size −22,320.8 45,423.3
4 4 1 2 79 Male, Age, H. Size −22,161.1 45,113.8
4 4 1 2 80 Male, H. Edu, Age, H. Size −22,072.5 44,946.8
Note:kis the number of components; pthe dependence order of each expectation; qthe number of lags for each standard deviation.
Covariate: Male: Man as the head of the household; H. Edu: Higher level of education; Age: Age of the head of the household; H. Size: Size of household.
Fig. 1. Graphical representation of hidden states.
Both the transition matrix and the distributions of the hidden states indicate the predominance of the first state (and, thus, of the first component at the visible level of the model). The process starts in the first hidden state in 78% of cases. Considering the overall series, the first hidden state is used 93.44% of the time, the second 0.67%, the third 4.27%, and the fourth 1.61%. For a graphical representation of the most common transitions, see Fig. 1 (lower right corner). With the support of the R package TraMineR (Gabadinho et al., 2011), we can graphically represent the most likely sequences of hidden states and their distributions, as sketched below.
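As a hedged sketch, and assuming the decoded hidden states are stored in a matrix hidden_states with one row per household and one column per wave (the object name is ours, purely illustrative), the plots of Fig. 1 can be reproduced along these lines:

library(TraMineR)
## Build a state sequence object from the decoded hidden states (1 to 4),
## one row per household, one column per wave.
seq_obj <- seqdef(hidden_states, alphabet = 1:4,
                  labels = paste("State", 1:4))
seqdplot(seq_obj)    # cross-sectional distribution of the hidden states
seqHtplot(seq_obj)   # cross-sectional (transversal) entropy over time
seqiplot(seq_obj)    # a sample of individual state sequences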
The cross-sectional entropy plotted in Fig. 1 (lower left corner) shows an increasing variability in the distribution of the hidden states among the households over time, with two periods of significant change. The first occurs at the beginning of the series, with the 1970s oil crisis, and the second in the early 1990s, with the economic downturn of 1989-1992. Another change in the entropy is observed with the 2008 financial crisis. For the latter two periods, the observed increase in variability may also be due to a high level of attrition: at the last time point, only 383 of the original 1000 sequences remain.
The estimated parameters of Eqs. (2.1) and (2.3) are shown in Table 9, along with bootstrapped standard errors and p-values for statistical significance based on 1000 bootstrap replications (Efron and Tibshirani, 1994); the resampling scheme is sketched below.
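The bootstrap scheme itself is standard: households are resampled with replacement and the model is re-estimated on each replicate. A minimal sketch in R follows, where fit_hmtd stands for a hypothetical estimation routine returning the vector of parameter estimates, and data and n_par are assumed to be available; none of these names belongs to an existing package.

set.seed(1)
B <- 1000                                   # number of bootstrap replications
## data: one row per household (assumed available); n_par: number of parameters
boot_est <- matrix(NA, nrow = B, ncol = n_par)
for (b in 1:B) {
  idx <- sample(seq_len(nrow(data)), replace = TRUE)   # resample households
  boot_est[b, ] <- fit_hmtd(data[idx, , drop = FALSE]) # hypothetical refit
}
boot_se <- apply(boot_est, 2, sd)           # bootstrapped standard errors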
The modeling of the expectation of the first two components is dominated by the first lag, while the second lag dominates the third component; the third and fourth lags are the most important in the last component, but with opposite effects (positive for the third lag and negative for the fourth).
The most common income regime, the first component, which represents more than 90% of cases, shows a slightly positive overall trend in income dynamics, given the positive and relatively high coefficients for the intercept (ˆϕ1,0 = 3.0965) and for the first and last lags (ˆϕ1,1 = 0.6789 and ˆϕ1,4 = 0.2298). As mentioned before, the second state is nevertheless quite rare (it is used only 0.67% of the time) and represents a fluctuating regime: this component is associated both with an increasing trend in the short run (e.g., ˆϕ2,1 = 3.3473) and with negative shocks. Finally, the third and fourth components imply lower income levels and are generally associated with short-run negative shocks.
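To fix ideas, and assuming that the covariate enters the expectation additively (the notation below is ours and may differ slightly from Eq. (2.1)), the fitted conditional expectation of the first component can be written from the estimates of Table 9 as

\hat{\mu}_{1,t} = 3.0965 + 0.6789\,x_{t-1} - 0.0493\,x_{t-2} - 0.2085\,x_{t-3} + 0.2298\,x_{t-4} + 0.3681\,\mathrm{Male},

so that, under this specification, having a man as head of the household shifts the predicted log-income upward by 0.3681 at every time point.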
Among the covariates (ˆδ), gender and educational level seem to have the largest impact on earnings dynamics (see Table 9, components one and four, respectively). In particular, the results show the presence of a gender bias: having a man as the head of the household has a positive impact on the income of the family. Not surprisingly, having a high level of education
Table 9
Estimated parameters for each component.

              g = 1               g = 2               g = 3               g = 4
ˆϕg,0      3.0965 (0.0819)    -3.2835 (0.3610)    -0.7567 (0.0815)    0.7426 (1.2834)
           [<0.0001]          [0.0040]            [<0.0001]           [0.9269]
ˆϕg,1      0.6789 (0.0148)    3.3473 (0.0997)     -0.4644 (0.0177)    -0.1578 (0.1034)
           [<0.0001]          [<0.0001]           [0.0020]            [0.2162]
ˆϕg,2      -0.0493 (0.0099)   -0.6291 (0.0562)    2.3375 (0.0367)     1.2624 (0.379)
           [0.0020]           [<0.0001]           [<0.0001]           [0.0060]
ˆϕg,3      -0.2085 (0.0166)   0.0160 (0.0727)     -1.5414 (0.1703)    3.3831 (0.4038)
           [<0.0001]          [0.5986]            [0.0020]            [0.0020]
ˆϕg,4      0.2298 (0.099)     -1.5077 (0.0884)    0.7424 (0.1776)     -3.9031 (0.4863)
           [<0.0001]          [<0.0001]           [0.0040]            [0.0020]
ˆδ1,Male   0.3681 (0.0547)
           [<0.0001]
ˆδ2,H.Size                    0.0713 (0.0071)
                              [0.2022]
ˆδ3,Age                                           0.0085 (0.0115)
                                                  [<0.0001]
ˆδ4,H.Edu                                                             1.6369 (0.7056)
                                                                      [0.0400]
ˆθg,0      0.2364 (0.1850)    0.4523 (0.6901)     0.2498 (0.0474)     8.9801 (2.4186)
           [<0.0001]          [<0.0001]           [<0.0001]           [<0.0001]
ˆθg,1      7.8083 (0.3035)    9.5959 (0.2929)     9.2314 (0.3728)     7.8483 (1.0098)
           [<0.0001]          [<0.0001]           [<0.0001]           [<0.0001]

Note: Bootstrapped standard errors in parentheses; p-values for statistical significance in brackets, based on 1000 bootstrap replications.
Covariates: Male: man as the head of the household; Age: age of the head of the household; H. Size: size of the household; H. Edu: higher level of education.
is positively associated with income growth. As expected, the magnitude of the coefficients of the two continuous variables is lower, but also positive. Increasing the age of the head of the household increases the overall household income (third component). Note that we included the covariates through a linear additive term. Thus, we neglected the possibility that the relationship between age and earnings follows an inverted U-shaped pattern, as is well known in the literature; that is, earnings could increase in the early years, reach a peak around middle age, and decline thereafter. These kinds of effects could be captured by using polynomial relationships to model the link between the continuous variables and the variable of interest, as sketched below. Finally, the size of the family does not seem to have a statistically significant impact on income dynamics.
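As a purely illustrative sketch (the δ coefficients below are hypothetical and were not estimated), a quadratic age profile in the expectation of component g could be written as

\mu_{g,t} = \varphi_{g,0} + \sum_{l=1}^{p} \varphi_{g,l}\, x_{t-l} + \delta_{g,1}\,\mathrm{Age}_{t} + \delta_{g,2}\,\mathrm{Age}_{t}^{2},

where a negative \delta_{g,2} would reproduce the inverted U-shaped earnings profile discussed above.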
6. Discussion
In this paper, we described a general probabilistic framework for modeling continuous time series, simultaneously integrating many extensions previously presented separately in related literature. We also discussed how to search for a useful model among the possible solutions offered by the combination of lags and covariates.
Due to its flexibility, the HMTD model seems particularly suitable for longitudinal analyses in the social sciences and related fields. The model is able to account for the observed heterogeneity in the population and can explain the observed trajectories, making it useful for predicting the next observation in a series or for probabilistic clustering. As in the Latent Class Growth Model (LCGM), each level of the discrete latent variable may represent a group or a subtype of cases. However, using the HMTD, we have two alternatives. First, we can set the transition matrix, Q, as a diagonal matrix to identify distinct subgroups following a similar pattern over the whole series, as in the LCGM. Second, we can consider the latent states to be different subpopulations, without imposing any constraints on the transition matrix, and allow individuals to move between latent classes at each time point (see the sketch below).
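As a minimal sketch in R (the object names are ours and purely illustrative), the two uses differ only in how the transition matrix is constrained before estimation:

k <- 4
## Clustering mode (LCGM-like): a diagonal transition matrix forbids any
## move between latent classes; each sequence stays in its initial class.
Q_cluster <- diag(k)
## Regime-switching mode: a freely estimated stochastic matrix, here
## initialized uniformly and then updated by the estimation algorithm.
Q_free <- matrix(1 / k, nrow = k, ncol = k)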
The drawback of the flexibility of the HMTD model is the difficulty in finding the correct specification of the model structure: the number of components and lags, the use of covariates at the hidden and/or observed levels, the modeling of the standard deviation of each component, and so on. Given the large number of different models that could possibly be estimated, we proposed an ad hoc hierarchical strategy: starting from the simplest possible model, advanced elements are added one at a time, and at each step the models are compared using information criteria such as the BIC (a numerical sketch is given below). Finally, we illustrated the model and the suggested model selection procedure using a real dataset: using the US Panel Study of Income Dynamics, we analyzed the trajectories of household income in the United States over four decades.
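For completeness, a small R sketch of the comparison criterion follows. The sample size n is not reported in Table 8; the value below is the one implied by the printed log-likelihoods and BICs, and is therefore an approximation of ours.

## BIC = -2 * logL + np * log(n), with np the number of free parameters
bic <- function(logL, np, n) -2 * logL + np * log(n)
n <- 22480              # assumed: value implied by the figures of Table 8
bic(-22161.1, 79, n)    # ~ 45,113.8 (three covariates)
bic(-22072.5, 80, n)    # ~ 44,946.7 (retained model with four covariates)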
Another issue related to the complexity of the model is the estimation procedure. The EM algorithm allows easy estimation of the parameters, but it can be quite unstable in high-dimensional settings with many local optima. Therefore,
further investigation is required to test variants of the EM algorithm, such as the Classification EM, the Stochastic EM, or genetic algorithms.
Future work should investigate empirically, and in more detail, the criteria used to identify the number of components and the model specification. Numerical criteria such as the BIC can identify an optimal model from sample data during the estimation process. However, in empirical applications, we are often more interested in interpretable and generalizable results than in strictly optimal ones. For instance, adding a component that represents only a few outliers or extreme cases may be interesting from a theoretical point of view, but not necessarily from a practical perspective: these extreme cases might be due to errors in data entry or drop-out, or might represent small, negligible sub-populations. Further analyses could address this issue by introducing measures other than the BIC that combine the adequacy of the model with the interpretability of the results. Finally, further research should also consider the case of continuous-time data rather than discrete panel data.
Acknowledgments
This publication benefited from the support of the Swiss National Centre of Competence in Research LIVES "Overcoming vulnerability: life course perspectives", which is financed by the Swiss National Science Foundation. The authors are grateful
to the Swiss National Science Foundation for its financial assistance. We also thank the AE and the two referees for their
helpful comments.
References
Bartolucci, F., Farcomeni, A., 2010. A note on the mixture transition distribution and hidden Markov models. J. Time Ser. Anal. 31 (2), 132–138.
Basford, K.E., Greenway, D., McLachlan, G.J., Peel, D., 1997. Standard errors of fitted means under normal mixture models. Comput. Statist. 12, 1–17.
Baum, L.E., Petrie, T., 1966. Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Stat. 37 (6), 1554–1563.
Baum, L.E., Petrie, T., Soules, G., Weiss, N., 1970. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains.
Ann. Math. Stat. 41 (1), 164–171.
Berchtold, A., 1999. The double chain Markov model. Comm. Statist. Theory Methods 28 (11), 2569–2589.
Berchtold, A., 2002. High-order extensions of the double chain Markov model. Stoch. Models 18 (2), 193–227.
Berchtold, A., 2003. Mixture transition distribution (MTD) modeling of heteroscedastic time series. Comput. Statist. Data Anal. 41, 399–411.
Berchtold, A., 2004. Optimisation of mixture models: comparison of different strategies. Comput. Statist. 19, 385–406.
Berchtold, A., Raftery, A., 2002. The mixture transition distribution model for high-order Markov chains and non-Gaussian time series. Statist. Sci. 17 (3),
328–359.
Biernacki, C., Celeux, G., Govaert, G., 2000. Stratégies algorithmiques pour maximiser la vraisemblance dans les modèles de mélange. In: Actes des XXXII
Journées de Statistique.
Böhning, D., 2001. The potential of recent developments in nonparametric mixture distributions. In: Proceedings of the 10th International Symposium on
Applied Stochastic Models and Data Analysis.
Boldea, O., Magnus, J.R., 2009. Maximum likelihood estimation of the multivariate normal mixture model. J. Amer. Statist. Assoc. 104 (488), 1539–1549.
Bollerslev, T., Chou, R.Y., Kroner, K.F., 1992. ARCH modeling in finance: a review of the theory and empirical evidence. J. Econometrics 52, 5–59.
Box, G.E., Jenkins, G.M., Reinsel, G.C., 1994. Time Series Analysis, Forecasting and Control. Prentice Hall.
Chariatte, V., Berchtold, A., Akré, C., Michaud, P.-A., Suris, J.-C., 2008. Missed appointments in an outpatient clinic for adolescents, an approach to predict
the risk of missing. J. Adolesc. Health 43 (1), 38–45.
Dannemann, J., Holzmann, H., 2008. Likelihood ratio testing for hidden Markov models under non-standard conditions. Scand. J. Statist. 35 (2), 309–321.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39 (1), 1–38.
Dietz, E., Bohning, D., 1996. Statistical inference based on a general model of unobserved heterogeneity. In: Fahrmeir, L., Francis, F., Gilchrist, R., Tutz, G.
(Eds.), Advances in GLIM and Statistical Modeling. In: Lecture Notes in Statistics, Springer, Berlin, Heidelberg, pp. 75–82.
Efron, B., 1979. Bootstrap methods: another look at the jackknife. Ann. Statist. 7 (1), 1–26.
Efron, B., Tibshirani, R.J., 1994. An Introduction to the Bootstrap. CRC Press.
Elliott, R.J., Hunter, W.C., Jamieson, B.M., 1998. Drift and volatility estimation in discrete time. J. Econom. Dynam. Control 22, 209–218.
Frydman, H., Schuermann, T., 2008. Credit rating dynamics and Markov mixture models. J. Bank. Finance 32, 1062–1075.
Gabadinho, A., Ritschard, G., Müller, N.S., Studer, M., 2011. Analyzing and visualizing state sequences in R with TraMineR. J. Stat. Softw. 40 (4).
Giudici, P., Rydén, T., Vandekerkhove, P., 2000. Likelihood-ratio tests for hidden Markov models. Biometrics 56 (3), 742–747.
Hamilton, J.D., 1989. A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica 57 (2), 357–384.
Hamilton, J.D., 1994. Time Series Analysis. Princeton University Press.
Hassan, M.Y., El-Bassiouni, M.Y., 2013. Modelling Poisson marked point processes using bivariate mixture transition distributions. J. Stat. Comput. Simul.
83 (8), 1440–1452.
Hassan, M.Y., Lii, K.-S., 2006. Modeling marked point processes via bivariate mixture transition distribution models. J. Amer. Statist. Assoc. 101 (475),
1241–1252.
Hayashi, T., 2004. A discrete-time model of high-frequency stock returns. Quant. Finance 4, 140–150.
Helske, J., Eerola, M., Tabus, I., 2010. Minimum description length based hidden Markov model clustering for life sequence analysis. In: Proceedings of the
Third Workshop on Information Theoretic Methods in Science and Engineering.
Hill, M., 1991. The Panel Study of Income Dynamics: A User’s Guide. SAGE Publications.
Hox, J.J., 1995. Applied Multilevel Analysis. TT-Publikaties, Amsterdam.
Kapetanios, G., 2008. A bootstrap procedure for panel data sets with many cross-sectional units. Econom. J. 11 (2), 377–395.
Kass, R.E., Raftery, A.E., 1995. Bayes factors. J. Amer. Statist. Assoc. 90 (430), 773–795.
Kim, D., Kon, S.J., 1994. Alternative models for the conditional heteroscedasticity of stock returns. J. Bus. 67 (4), 563–598.
Kon, S.J., 1984. Models of stock returns: a comparison. J. Finance 39 (1), 147–165.
Kovar, J.G., Rao, J.N.K., Wu, C.F.J., 1988. Bootstrap and other methods to measure errors in survey estimates. Canad. J. Statist. 16, 25.
Le, N.D., Martin, D.R., Raftery, A.E., 1996. Modelling flat stretches, bursts, and outliers in time series using mixture transition distribution models. J. Amer.
Statist. Assoc. 91 (436), 1504–1515.
Leroux, B.G., 1992. Consistent estimation of a mixing distribution. Ann. Statist. 20 (3), 1350–1360.
Le Strat, Y., Carrat, F., 1999. Monitoring epidemiologic surveillance data using hidden Markov models. Stat. Med. 18 (24), 3463–3478.
Lokshin, M., Ravallion, M., 2001. Household income dynamics in two transition economies. World Bank 1–40.
Louis, T.A., 1982. Finding the observed information matrix when using the EM algorithm. J. R. Stat. Soc. Ser. B 44 (2), 226–233.
Luo, J., Qiu, H.-B., 2009. Parameter estimation of the WMTD model. Appl. Math. J. Chinese Univ. 24 (4), 379–388.
McLachlan, G.J., Basford, K.E., 1988. Mixture Models: Inference and Applications to Clustering. Marcel Dekker Inc.
McLachlan, G.J., Krishnan, T., 1996. The EM Algorithm and Extensions. John Wiley & Sons, New York.
McLachlan, G., Peel, D., 2000. Finite Mixture Models. In: Wiley Series in Probability and Statistics.
Muthen, B.O., 2001. Second-generation structural equation modeling with combination of categorical and continuous latent variables: new opportunities
for latent class/latent growth modeling. In: Collins, L.M., Sayer, A. (Eds.), New Methods for the Analysis for Change. American Psychological Association,
Washington, DC, pp. 291–322.
Netzer, O., Lattin, J.M., Srinivasan, V., 2008. A hidden Markov model of customer relationship dynamics. Mark. Sci. 27 (2), 185–204.
Newton, M.A., Raftery, A.E., 1994. Approximate Bayesian inference with the weighted likelihood bootstrap. J. R. Stat. Soc. Ser. B 56 (1), 3–48.
Rabiner, L.R., 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 257–286.
Raftery, A.E., 1985. A model for high-order Markov chains. J. R. Stat. Soc. Ser. B 47 (3), 528–539.
Raftery, A.E., 1995. Bayesian model selection in social research. Sociol. Methodol. 25, 111–163.
Redner, R.A., Walker, H.F., 1984. Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev. 26, 195–239.
Schwarz, G.E., 1978. Estimating the dimension of a model. Ann. Statist. 6 (2), 461–464.
Schlattmann, P., 2009. Medical Applications of Finite Mixture Models. In: Statistics for Biology and Health. Springer.
Shirley, K.E., Small, D.S., Lynch, K.G., Maisto, S.A., Oslin, D.W., 2010. Hidden Markov models for alcoholism treatment trial data. Ann. Appl. Stat. 4 (1),
366–395.
Stephens, M., 2000. Dealing with label switching in mixture models. J. R. Stat. Soc. Ser. B Stat. Methodol. 62 (4), 795–809.
Viterbi, A.J., 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inform. Theory 13 (2), 260–269.
Weigend, A.S., Mangeas, M., Srivastava, A.N., 1995. Nonlinear gated experts for time series: discovering regimes and avoiding overfitting. Int. J. Neural Syst.
6 (4), 373–399.
Weigend, A.S., Shi, S., 2000. Predicting daily probability distributions of S&P500 returns. J. Forecast. 19, 375–392.
Wellekens, C., 1987. Explicit time correlation in hidden Markov models for speech recognition. In: Proceedings ICASSP. pp. 384–386.
Wong, C.S., Chan, W.S., 2005. Mixture Gaussian time series modelling of long-term market returns. N. Am. Actuar. J.
Wong, C.S., Li, W.K., 2000. On a mixture autoregressive model. J. R. Stat. Soc. Ser. B 62, 95–115.
Wong, C.S., Li, W.K., 2001. On a mixture autoregressive conditional heteroscedastic model. J. Amer. Statist. Assoc. 96, 982–995.