Computational Statistics and Data Analysis 93 (2016) 131–145

Computational Statistics and Data Analysis

journal homepage: www.elsevier.com/locate/csda

General framework and model building in the class of Hidden Mixture Transition Distribution models

Danilo Bolano a,b,∗, André Berchtold a,c

a National Center of Competence in Research LIVES, Switzerland

b Institute of Demographic and Life Course Studies, University of Geneva, Switzerland

c Institute of Social Sciences, University of Lausanne, Switzerland

article info

Article history:

Received 29 April 2014

Received in revised form 29 August 2014

Accepted 12 September 2014

Available online 22 September 2014

Keywords:

Mixture model

Model selection

Hidden Markov model

Mixture Transition Distribution model

BIC

Panel data

abstract

Modeling time series that present non-Gaussian features plays a central role in many

fields, including finance, seismology, psychology, and life course studies. The Hidden

Mixture Transition Distribution model is an answer to the complexity of such series. The

observed heterogeneity can be induced by one or several latent factors, and each level of

these factors is related to a different component of the observed process. The time series

is then treated as a mixture and the relation between the components is governed by a

Markovian latent transition process. This framework generalizes several specifications that

appear separately in related literature. Both the expectation and the standard deviation of

each component are allowed to be functions of the past of the process. The latent process

can be of any order, and can be modeled using a discrete Mixture Transition Distribution.

The effects of covariates at the visible and hidden levels are also investigated. One of the

main difficulties lies in correctly specifying the structure of the model. Therefore, we propose

a hierarchical model selection procedure that exploits the multilevel structure of our

approach. Finally, we illustrate the model and the model selection procedure through a real

application in social science.

©2014 Elsevier B.V. All rights reserved.

1. Introduction

Real data are often a combination of many different, possibly non-observed causes that lead to apparently unpredictable

behaviors. For example, in the context of longitudinal data, time series may show non-homogeneous behaviors, can switch

between alternative regimes characterized by a low or high variance, can contain extreme values, and the distribution of

future values can take complex multimodal shapes.

The Hidden Mixture Transition Distribution (HMTD) model considered in this study is a general framework to study

time series. The model can be used to describe and analyze the evolution of any continuous variable observed on a set of M

independent sequences that can also vary in length. The model integrates several refinements of the mixture model and the

hidden Markov model. More specifically, it can be used for different purposes, including describing observed data, searching

for a generalizable model, testing hypotheses, prediction, and classifying time series.

Mixture models are a popular and efficient approach, both for cross-sectional data and time series, to describe multimodal distributions that do not correspond to any specific statistical family. The literature provides many examples of the usefulness of such models since the work of Weldon and Pearson at the end of the 19th century. Historically, mixture models for continuous-valued time series were introduced in Le et al. (1996) as the Gaussian Mixture Transition Distribution (GMTD) model, building upon the earlier work of Raftery (1985) in the discrete case. See Berchtold and Raftery (2002) for a complete review of the basic principles of Mixture Transition Distribution (MTD) models. The general principle of all MTD-like models for continuous data is to combine different Gaussian distributions (called components) using a mixture model in which the mean of each distribution is a function of the past observed process. The weight associated with each component is interpreted as the probability that this component generates the next value of the process. This class of models has been extended in several ways, for example, by allowing detailed specifications for the mean of each component (e.g., Wong and Li, 2001), by allowing the variance of each component to depend on its past (Wong and Li, 2001; Berchtold, 2003), and by replacing the fixed probabilities associated with each component by a Markov chain (Bartolucci and Farcomeni, 2010). In addition, covariates have been included (Chariatte et al., 2008), Luo and Qiu (2009) used non-Gaussian distributions, and extensions to bivariate data have been suggested (Hassan and Lii, 2006; Hassan and El-Bassiouni, 2013).

∗ Correspondence to: University of Geneva (CH), bd. du Pont d’Arve 40, Switzerland. Tel.: +41 22 379 98 74.

E-mail addresses: Danilo.Bolano@unige.ch, d_bolano@yahoo.it (D. Bolano), Andre.Berchtold@unil.ch (A. Berchtold).

http://dx.doi.org/10.1016/j.csda.2014.09.011

0167-9473/© 2014 Elsevier B.V. All rights reserved.

Hidden Markov Models (HMMs) form another class of stochastic processes often used to represent and analyze complex

time series in the presence of over-dispersion. They are particularly well suited to analyzing data switching between several

regimes. In its traditional formulation, a HMM combines a hidden first-order Markov chain with different conditionally

independent distributions for the observed process. Here each distribution is related to one of the states of the Hidden

Markov chain. Many developments to the basic HMM have been proposed, including using high-order Markov chains and

removing the conditional independence hypothesis. The latter change allows the observed process to depend on both its

past and the hidden process (Wellekens, 1987; Berchtold, 1999, 2002). Under the hypothesis of stationarity, the marginal

distribution of the observations in a HMM is a finite mixture, with the number of components being equal to the number of

states in the unobserved Markov process. Therefore, we may consider the two models as one unique approach. When the

transitions between components are driven by a Markov chain, the mixture transition model becomes a HMM. In addition,

when the observations are allowed to follow an autoregressive process, the HMM becomes a mixture transition model. Some

authors even refer to HMMs as Markov-dependent mixture models (as coined by Leroux, 1992).

It is worth noting that mixture transition and Markovian models are distributional stochastic models, in contrast to

other well-known and widespread approaches that essentially produce point predictions. The latter group includes the ARMA and

ARIMA models, which are based on autoregressive equations, and the ARCH and GARCH models, which explicitly consider

the variance of the process. See, for example, Box et al. (1994), Hamilton (1994), and Bollerslev et al. (1992) for a complete

review of these models, and Kon (1984) and Kim and Kon (1994) for a comparison of the different models. When trying

to predict the next value of a series, the advantage of the point approach is that the model will provide a clear answer as

one numerical value (associated with a confidence interval). However, the drawback is that, in most cases, the answer is

either inaccurate or completely wrong. Given the high variability of many time series, the expectation of the model is not a

good estimator of the value of the next observation. Even when modeling the variance of the process, there is only a small

probability of accurately predicting the next value. However, in the probabilistic approach, instead of trying to determine

the next value of the series, a non-null probability is associated with values (discrete case) or intervals (continuous case),

which are then possible candidates to be the next data. Then, an adequate probabilistic model does not provide a single

value, but rather leads to a complete representation of possible futures through a (possibly multimodal) distribution. In that

sense, the answer given by this approach generally has a higher probability of helping to make the right decisions, because

it shows all possibilities rather than one (probably) wrong value.

Hidden Markov models were historically used for speech recognition (Rabiner, 1989; Baum and Petrie, 1966), but many

applications have since been found in other fields, including econometrics (e.g., Elliott et al., 1998; Hayashi, 2004; Netzer

et al., 2008) and the biosciences (e.g., Le Strat and Carrat, 1999; Shirley et al., 2010). Mixture models for count data are

also quite common in finance and biomedical studies (Schlattmann, 2009) and behavioral studies (under the name of growth

mixture modeling, e.g., Muthen, 2001). However, Mixture Transition Models seem to be used almost exclusively in economics

and finance (e.g., Wong and Chan, 2005; Frydman and Schuermann, 2008). In fact, despite their unanimously recognized

advantages, mixture transition and hidden models are still only sparsely used in social sciences. This is unfortunate, since

the current trend in this field is clearly to switch from cross-sectional to longitudinal surveys, hence the need for advanced

methods for modeling longitudinal data showing non-Gaussian distributions.

In addition, even though many developments have been proposed during the past few decades on the basic MTD and

HMM models, there is still a need for a more general framework that integrates these refinements. As a result, this study has

three objectives. First, we define a general framework that integrates the many extensions presented previously separately

in the literature (see Section 2), and then we discuss the estimation procedure (see Section 3). Second, as noted by Rabiner

(1989), the number of possibilities offered by hidden models is so large that it becomes difficult to identify an adequate

model structure for a particular research question. Therefore, in Section 4, we outline a search strategy similar to that used

with other multilevel models. Finally, we illustrate the model and the proposed procedure using a real dataset, the US Panel

Study of Income Dynamics (see Section 5).

2. The HMTD model

The Hidden Mixture Transition Distribution (HMTD) model developed in this paper combines a hidden and an observed

level. It can be used to describe and analyze the evolution of any continuous variable observed on a set of M independent

sequences, which may vary in length.

At the hidden level, a latent discrete variable, S, taking values in the finite set {1, ..., k}, follows a Markov chain of order ℓ ≥ 0 (ℓ = 0 is a special case reducing the model to a simple mixture). Each of these values represents a different hidden state of the process. In addition to its lags, the value taken by S at time t can also be influenced by one or several categorical covariates. Since S is unobserved, we never know its exact value at time t, but we can estimate its distribution and the most likely sequence of states corresponding to each sequence of observations.

At the observed level, we consider a random variable, X, taking values in ℝ. A Gaussian component is associated with each hidden state of the latent process, with an expectation and variance that may depend on both the lags of X and on a set of categorical and/or continuous covariates. Since we only know the distribution of the possible hidden states at time t, the visible model is a mixture of the k Gaussian components.

The model can be estimated by simultaneously using as many independent sequences of observations as required.

Each sequence typically corresponds to the observation of a separate subject. Each hidden state and its associated visible

component can then be interpreted as one possible behavior of the subjects under investigation. By multiplying the

number of components, we allow the subjects to follow different behaviors. This enables us to capture both the complexity

of the overall population and the evolution of each individual over time.

The following subsections describe the two levels of the model in more detail.

2.1. The visible level

Let $\{X_t, t \in \mathbb{N}^*\}$ be a sequence of random variables taking values in $\mathbb{R}$. Let $X_{t-a}^{t-1}$ denote the past observations between time $t-a$ and $t-1$. A frequently used and convenient hypothesis assumes that the probability of $X_t$, given its past, follows a Gaussian distribution, but this hypothesis is generally too simplistic to account for the complexity of real data. A better solution is to assume that the observed time series was generated by $k$ different sub-models, each model being used for one or several parts of the overall time series, and to write the resulting model as a mixture (McLachlan and Peel, 2000). We then have

$$F(x_t \mid x_1^{t-1}) = \sum_{g=1}^{k} \lambda_g(t)\, G_g(x_t \mid x_{t-r_g}^{t-1}, C(t))$$

where $F(x_t \mid x_1^{t-1})$ is the cumulative distribution function of $x_t$, given its past, $G_g(x_t \mid x_{t-r_g}^{t-1})$ is a cumulative distribution function of $x_t$, given a part of its past (from $x_{t-r_g}$ to $x_{t-1}$, $r_g \ge 1$), $C(t)$ represents a set of covariates available at time $t$, and $\lambda_g(t)$ is the weight of the $g$th component at time $t$, with

$$\sum_{g=1}^{k} \lambda_g(t) = 1, \qquad \lambda_g(t) > 0, \quad \forall g, t.$$

Different specifications can be used for the Gs. However, we only consider the Gaussian case here because, as pointed

out by Rabiner (1989), Gaussian mixtures can approximate almost all continuous density functions as closely as necessary.

The $g$th component is then written as

$$G_g(x_t \mid x_{t-r_g}^{t-1}) = \Phi\left(\frac{x_t - \mu_{g,t}}{\sigma_{g,t}}\right).$$

In order to explicitly incorporate the dependence between successive observations into the model, the expectation and the standard deviation of each Gaussian component are written as functions of the past. The expectation of the $g$th component is specified by

$$\mu_{g,t} = \varphi_{g,0} + \sum_{i=1}^{p_g} \varphi_{g,i}\, x_{t-i} + \sum_{j=1}^{c_g} \delta_{g,j}\, c_j(t), \qquad p_g \ge 0,\ c_g \ge 0. \tag{2.1}$$

The first part of the equation is an autoregressive model, with the coefficient $\varphi_{g,i}$ associated with the $i$th lag of $X_t$. The second part represents the influence of the covariates: $c_j(t)$ is the $j$th covariate observed at time $t$, and $\delta_{g,j}$ is its associated coefficient. Covariates can be continuous or categorical. As usual, in order to facilitate the interpretation of the results, we suggest recoding the categorical covariates as 0–1 dummy variables. When the number of lags is fixed at zero ($p_g = 0$) and there are no covariates ($c_g = 0$), the expectation in Eq. (2.1) reduces to the constant $\varphi_{g,0}$.
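As an illustration of how Eq. (2.1) and the mixture distribution fit together, the following Python sketch computes the expectation of two components and the resulting mixture CDF at one time point. The function names, the coefficient values, and the weights are hypothetical and chosen only for the example; the standard deviations are held constant here for simplicity.

```python
import math

import numpy as np


def component_mean(phi, x_past, delta=(), covariates=()):
    """Expectation of one component, following Eq. (2.1).

    phi[0] is the intercept and phi[i] multiplies x_{t-i};
    x_past[0] is x_{t-1}, x_past[1] is x_{t-2}, and so on.
    """
    phi = np.asarray(phi, dtype=float)
    lags = np.asarray(x_past, dtype=float)[: len(phi) - 1]
    ar_part = float(np.dot(phi[1:], lags))
    cov_part = float(np.dot(np.asarray(delta, dtype=float),
                            np.asarray(covariates, dtype=float)))
    return float(phi[0]) + ar_part + cov_part


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def mixture_cdf(x_t, weights, means, sds):
    """Mixture CDF: sum over g of lambda_g(t) * Phi((x_t - mu_g) / sigma_g)."""
    return sum(w * normal_cdf((x_t - m) / s)
               for w, m, s in zip(weights, means, sds))


# Hypothetical values, for illustration only.
x_past = [1.0, 0.5]                                            # x_{t-1}, x_{t-2}
mu_1 = component_mean([0.2, 0.5], x_past)                      # p_1 = 1, no covariate
mu_2 = component_mean([1.0, 0.3, 0.4], x_past, [2.0], [1.0])   # p_2 = 2, one dummy
prob = mixture_cdf(2.0, [0.6, 0.4], [mu_1, mu_2], [1.0, 2.0])
print(mu_1, mu_2, prob)
```

Each component may use a different number of lags, which is why the lag vector is truncated to the length of the coefficient vector.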

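Different specifications can be chosen to model the standard deviation of the $g$th component:

$$\sigma_{g,t} = \sqrt{\theta_{g,0} + \sum_{j=1}^{q_g} \theta_{g,j}\, x_{t-j}^2}, \qquad q_g \ge 0, \tag{2.2}$$

$$\sigma_{g,t} = \sqrt{\theta_{g,0} + \sum_{j=1}^{q_g} \theta_{g,j} \left(x_{t-j} - \bar{x}_{t-q_g}^{t-1}\right)^2}, \qquad q_g \ge 2, \tag{2.3}$$

$$\sigma_{g,t} = \sqrt{\theta_{g,0} + \sum_{j=1}^{q_g} \theta_{g,j} \left(x_{t-j} - \mu_{g,t}\right)^2}, \qquad q_g \ge 1, \tag{2.4}$$

$$\sigma_{g,t} = \sqrt{\theta_{g,0} + \sum_{j=1}^{q_g} \theta_{g,j}\, e_{g,t-j}^2}, \qquad q_g \ge 1, \tag{2.5}$$

where

$$\bar{x}_{t-q_g}^{t-1} = \frac{1}{q_g} \sum_{j=1}^{q_g} x_{t-j}, \qquad e_{g,t-j} = x_{t-j} - \mu_{g,t-j}$$

and

$$\theta_{g,0} > 0,\ \forall g, \qquad \theta_{g,j} \ge 0,\ \forall g,\ j = 1, \dots, q_g.$$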

The first three specifications were introduced in Berchtold (2003). The case of a constant standard deviation is included in the first specification by fixing $q_g = 0$. Since this first specification uses only the past squared observations, it should be used only on datasets in which a substantial part of the data is homoscedastic. Indeed, if we compare two time series, $S_t^a$ and $S_t^b$, such that $S_t^b = S_t^a + c$, the two series have the same standard deviation, but the parameters of Eq. (2.2) would take different values. The next two specifications of the standard deviation do not suffer from this problem because they are both derived from the usual standard deviation formula. The difference lies in the reference point used. Eq. (2.3) directly compares each lag to the empirical mean. The only restriction here is that the number of lags has to be greater than or equal to 2. Eq. (2.4) compares each lag to the expectation of the component. As noted by Wong and Li (2000), the latter approach is somewhat curious, since $\mu_{g,t}$ is not always a good predictor of the next observation when the series is highly variable. However, in practice, this specification has the advantage of ensuring consistency between the two time-dependent elements ($\mu_{g,t}$ and $\sigma_{g,t}$) of each Gaussian component. Finally, Eq. (2.5) is the ARCH specification proposed by Wong and Li (2001).
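The level-shift argument can be checked numerically. The sketch below implements the four specifications under the assumption that each one is the square root of the corresponding sum (that is, the sum models the variance); the function names and the values of the θ coefficients are hypothetical. Adding a constant to the series changes the value given by Eq. (2.2) but leaves Eq. (2.3) unchanged.

```python
import numpy as np


def _lags(theta, x_past):
    """Keep the q_g most recent values; x_past[0] is x_{t-1}."""
    return np.asarray(x_past, dtype=float)[: len(theta) - 1]


def sd_squared_lags(theta, x_past):
    """Eq. (2.2): past squared observations."""
    x = _lags(theta, x_past)
    return float(np.sqrt(theta[0] + np.dot(theta[1:], x ** 2)))


def sd_empirical_mean(theta, x_past):
    """Eq. (2.3): squared deviations from the mean of the q_g lags (q_g >= 2)."""
    x = _lags(theta, x_past)
    return float(np.sqrt(theta[0] + np.dot(theta[1:], (x - x.mean()) ** 2)))


def sd_component_mean(theta, x_past, mu_t):
    """Eq. (2.4): squared deviations from the component expectation mu_{g,t}."""
    x = _lags(theta, x_past)
    return float(np.sqrt(theta[0] + np.dot(theta[1:], (x - mu_t) ** 2)))


def sd_arch(theta, past_residuals):
    """Eq. (2.5): past squared residuals e_{g,t-j} = x_{t-j} - mu_{g,t-j}."""
    e = _lags(theta, past_residuals)
    return float(np.sqrt(theta[0] + np.dot(theta[1:], e ** 2)))


# Hypothetical coefficients: theta_0 > 0, theta_j >= 0, q_g = 3.
theta = [0.5, 0.2, 0.1, 0.3]
x_past = np.array([2.0, 3.0, 5.0])    # x_{t-1}, x_{t-2}, x_{t-3}
shift = 10.0                          # the constant c of the text

print(sd_squared_lags(theta, x_past), sd_squared_lags(theta, x_past + shift))
print(sd_empirical_mean(theta, x_past), sd_empirical_mean(theta, x_past + shift))
```

Specification (2.4) is shift-invariant in the same way only if the expectation passed as `mu_t` shifts with the series, which is the consistency property mentioned in the text.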

As with the modeling of the expectation, it is also possible to let the standard deviation of a component depend on external factors, and to introduce covariates into Eqs. (2.2)–(2.5). Furthermore, when the variability of the studied process is related to its magnitude, using specification (2.4), with the covariates included in the modeling of $\mu_{g,t}$, could prove particularly useful.

In the above equations, $p_g$ denotes the number of lags used when modeling the expectation of the $g$th component, and $q_g$ denotes the corresponding number of lags used in the modeling of the variance. The largest time dependence for each component is then $r_g = \max(p_g, q_g)$. Note that, even if $p_g > 1$ or $q_g > 1$, it is not mandatory to use all lags between 1 and $p_g$ or $q_g$. In practice, it can be interesting to associate some lags with only some components. For instance, in the original GMTD model (Le et al., 1996), the expectation of the $g$th component used only the $g$th lag of the observed variable, $X$. Another possibility is to constrain the sum of several parameters to take a particular value. For instance, Le et al. (1996) proposed imposing the following constraint on a component $g$:

$$\sum_{i=1}^{p_g} \varphi_{g,i} = 1.$$

2.2. The hidden level

In the previous subsection, we wrote the weight of the $g$th component as a function of time: $\lambda_g(t)$. Of course, it is possible to remove this dependence and consider

$$\lambda_g(t) = \lambda_g, \quad \forall t.$$

In this case, the HMTD model reduces to a pure mixture model. On the other hand, following Weigend and Shi (2000), there are at least two different possibilities for specifying varying weights. The first one, called ‘‘gated experts’’, allows each weight to depend on a set of covariates (Weigend et al., 1995). We then have

$$\lambda_t = (\lambda_1(c_1(t)), \lambda_2(c_2(t)), \dots, \lambda_k(c_k(t))).$$

Since covariates are supposed to evolve over time, the component weights do too.

The second possibility is to let the weights at time $t$ depend on the past through a Markov chain and, in this way, to implement a HMM at the hidden level. We then write

$$\lambda_g(t) = P(S_t = g \mid S_{t-1}, \dots, S_{t-\ell})$$

where $S_t$ represents the component chosen at time $t$, and $\ell$ is the order of the Markov chain. When the order of the hidden Markov chain is set to zero, the model once again reduces to a pure mixture.

The last two approaches can be combined to let the weights of the components depend on both the past of the latent process and on the covariates. This is accomplished by adding additional terms to the previous equation:

$$\lambda_g(t) = P(S_t = g \mid S_{t-1}, \dots, S_{t-\ell}, C_1(t), \dots, C_{c_g}(t)). \tag{2.6}$$

Eq. (2.6), even if still describing transition probabilities, no longer corresponds to a Markov chain. Moreover, the number of parameters to be estimated can become prohibitive, especially when there is more than one lag. A solution is to model the hidden process through a discrete MTD model (Berchtold and Raftery, 2002), as follows:

$$\lambda_g(t) = \sum_{j=1}^{\ell} \psi_j\, q_{s_j, g} + \sum_{h=1}^{c_g} f_h(c_h),$$

where $q_{s_j,g}$ denotes the transition probability from state value $s_j$ at time $t-j$ to state value $g$ at time $t$, and $\psi_j$ is the weight of lag $j$. Then, the set of all transition probabilities can be written as a transition matrix, $Q$:

$$Q = \begin{pmatrix} q_{1,1} & \cdots & q_{1,k} \\ \vdots & \ddots & \vdots \\ q_{k,1} & \cdots & q_{k,k} \end{pmatrix}.$$

Then, $C = \{C_1, \dots, C_k\}$ is a set of covariates and $f_1, \dots, f_k$ are transformation functions. When the covariates are categorical, we can write

$$\sum_{h=1}^{c_g} f_h(c_h) = \sum_{h=1}^{c_g} \gamma_h\, q_{c_h, g}$$

where $q_{c_h,g}$ denotes the probability of the transition from the value of the $h$th covariate to the state value $g$ at time $t$, and $\gamma_h$ is the corresponding coefficient. Of course, the preceding equations have to be constrained in order for the results to be probabilities.
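To make the discrete MTD specification of the hidden level concrete, the following Python sketch computes the distribution of the next hidden state from the lagged states, in the case without covariates. The function names and the numerical values are hypothetical. The check that the lag weights and the rows of Q are probability distributions is one simple way of enforcing the constraint mentioned above, since a convex combination of probability vectors is again a probability vector.

```python
import numpy as np


def mtd_state_distribution(Q, psi, past_states):
    """Return lambda_g(t) for all g: sum over j of psi_j * Q[s_{t-j}, g].

    past_states[0] is s_{t-1}, past_states[1] is s_{t-2}, and so on,
    coded as 0-based row indices of Q.
    """
    Q = np.asarray(Q, dtype=float)
    psi = np.asarray(psi, dtype=float)
    if np.any(psi < 0) or not np.isclose(psi.sum(), 1.0):
        raise ValueError("lag weights psi must be non-negative and sum to one")
    if np.any(Q < 0) or not np.allclose(Q.sum(axis=1), 1.0):
        raise ValueError("each row of Q must be a probability distribution")
    dist = np.zeros(Q.shape[1])
    for j, state in enumerate(past_states[: len(psi)]):
        dist += psi[j] * Q[state]
    return dist


def n_free_parameters(k, order):
    """Free parameters of a full order-`order` chain versus the MTD model."""
    full_chain = k ** order * (k - 1)
    mtd = k * (k - 1) + (order - 1)
    return full_chain, mtd


# Hypothetical two-state example with two lags.
Q = [[0.9, 0.1],
     [0.3, 0.7]]
psi = [0.7, 0.3]
weights = mtd_state_distribution(Q, psi, past_states=[0, 1])
print(weights, n_free_parameters(3, 3))
```

The second function illustrates why the MTD form is attractive: each additional lag adds a single weight instead of multiplying the number of rows of the transition matrix by k.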

By allowing all possible transitions between hidden states to occur, we can represent the behavior of the observed time series as precisely as needed. However, it is also sometimes useful to constrain the transition matrix of the hidden level. For instance, consider representing the transitions between successive hidden states using a first-order Markov chain, with no covariates. With four components, the matrix is written as

$$A = \begin{pmatrix} q_{1,1} & q_{1,2} & q_{1,3} & q_{1,4} \\ q_{2,1} & q_{2,2} & q_{2,3} & q_{2,4} \\ q_{3,1} & q_{3,2} & q_{3,3} & q_{3,4} \\ q_{4,1} & q_{4,2} & q_{4,3} & q_{4,4} \end{pmatrix}.$$

An interesting case occurs when the different states are hierarchical. In other words, when a subject enters state $s_i$, he/she

can only stay in this state or move to a state $s_j$ such that $s_j > s_i$ (see the left-hand side of Table 1). For example, in behavioral

studies, we observe this kind of situation when the phenomenon evolves with the age of the subjects. Young subjects start

in the first hidden state and then switch to higher states as they grow older. Hearing capabilities are a good example. Young

people are supposed to have maximal capabilities, which then decline with age, passing through different stages.

Here, the first state would represent people with full hearing capabilities, and the last state could represent those who are

deaf. Of course, the probability of remaining in the last state is one. If we want to impose a strict order between the states (one

can switch only to the next state), then the transition matrix can be represented as is shown on the right-hand side of Table 1.

Another useful specification is to set $A$ to the identity matrix. Each state is then absorbing, meaning that it is impossible to

leave a state. This implies that each independently observed subject is associated with one, and only one state during its

observation period. The model then classifies the subjects into $k$ mutually exclusive groups based on the sequence of data

observed for each subject. Other approaches that use HMMs to classify sequence data are presented in Helske et al. (2010).

From a practical point of view, the constraints discussed above are very easy to set, because the Expectation–Maximization (EM) algorithm used to estimate the parameters of the HMTD model has the property that parameters initialized to zero will not be updated during the optimization process. Hence, it is sufficient to initialize the required parameters to zero.
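The zero-preservation property can be seen in the re-estimation formula for the transition probabilities, where the expected number of transitions from state i to state j contains the current value of the transition probability as a factor. The following Python sketch performs one such update for a plain Gaussian hidden Markov model with constant means and standard deviations, which is a simplification of the HMTD model; all names and numerical values are hypothetical.

```python
import numpy as np


def gaussian_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def reestimate_transitions(x, pi, A, mu, sd):
    """One Baum-Welch update of A, using a scaled forward-backward pass."""
    x, pi, A = np.asarray(x, float), np.asarray(pi, float), np.asarray(A, float)
    mu, sd = np.asarray(mu, float), np.asarray(sd, float)
    T, k = len(x), len(pi)
    B = gaussian_pdf(x[:, None], mu[None, :], sd[None, :])   # B[t, g]

    alpha, beta, scale = np.zeros((T, k)), np.ones((T, k)), np.zeros(T)
    alpha[0] = pi * B[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[t + 1] * beta[t + 1])) / scale[t + 1]

    # Expected number of i -> j transitions; note the factor A[i, j].
    xi_sum = np.zeros((k, k))
    for t in range(T - 1):
        xi_sum += (alpha[t][:, None] * A
                   * (B[t + 1] * beta[t + 1])[None, :] / scale[t + 1])
    return xi_sum / xi_sum.sum(axis=1, keepdims=True)


# Hypothetical strictly ordered chain with three states.
A_init = np.array([[0.8, 0.2, 0.0],
                   [0.0, 0.7, 0.3],
                   [0.0, 0.0, 1.0]])
x = [0.1, -0.2, 0.3, 1.9, 2.2, 2.1, 4.0, 3.8, 4.1]
A_new = reestimate_transitions(x, pi=[0.7, 0.2, 0.1], A=A_init,
                               mu=[0.0, 2.0, 4.0], sd=[1.0, 1.0, 1.0])
print(A_new)
```

The initial distribution gives a positive probability to every state so that no row of expected counts is entirely zero, which would make the row normalization undefined in this small sketch.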

3. Estimation

Belonging to the class of hidden Markov models, the HMTD model can be estimated using the general framework

proposed by Rabiner (1989), in which three different problems have to be considered:

1. The computation of the (log-)likelihood of the sequence of observed data, given the current model.

2. The identification of the optimal sequence of hidden states, given the current model and the sequence of observed data.

3. The estimation of the optimal model parameters, given the sequence of observed data, that is, the parameters that maximize the log-likelihood of the model.

Table 1

Transition matrices for hierarchical states.

$$A = \begin{pmatrix} q_{1,1} & q_{1,2} & q_{1,3} & q_{1,4} \\ 0 & q_{2,2} & q_{2,3} & q_{2,4} \\ 0 & 0 & q_{3,3} & q_{3,4} \\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad A = \begin{pmatrix} q_{1,1} & q_{1,2} & 0 & 0 \\ 0 & q_{2,2} & q_{2,3} & 0 \\ 0 & 0 & q_{3,3} & q_{3,4} \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

The first problem is solved through an iterative computation known as the Forward–Backward procedure (Rabiner, 1989).

The second problem, sometimes called the decoding problem, is solved using the Viterbi algorithm (Viterbi, 1967). Finally, the

third problem is solved using a version of the Expectation–Maximization (EM) algorithm especially designed for the Hidden

Markov Model, known as the Baum–Welch algorithm (Baum et al., 1970). Each of these three problems has been studied

extensively and the interested reader can find detailed discussions in Baum et al. (1970), Dempster et al. (1977), Rabiner

(1989), McLachlan and Krishnan (1996), Berchtold (2002) and Berchtold (2004). Even though the framework presented in

this study incorporates many extensions to the basic HMM and MTD models, the estimation procedure remains essentially

the same. Only minor adjustments are necessary in order to explicitly incorporate the relationships between successive

observations and the effect of the covariates.
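The first two problems can be sketched in a few lines of Python for a plain Gaussian hidden Markov model. This is a simplified stand-in for the HMTD model: means and standard deviations are constant, there are no covariates, and the hidden chain is of order one. The function names and the numerical values are hypothetical. The forward recursion is rescaled at each step to avoid numerical underflow, and the Viterbi recursion works on the log scale.

```python
import numpy as np


def log_gaussian(x, mu, sd):
    """Log-density of each observation under each component: shape (T, k)."""
    x = np.asarray(x, dtype=float)[:, None]
    mu = np.asarray(mu, dtype=float)[None, :]
    sd = np.asarray(sd, dtype=float)[None, :]
    return -0.5 * np.log(2.0 * np.pi) - np.log(sd) - 0.5 * ((x - mu) / sd) ** 2


def forward_loglik(x, pi, A, mu, sd):
    """Problem 1: log-likelihood of the observed sequence (scaled forward pass)."""
    A = np.asarray(A, dtype=float)
    B = np.exp(log_gaussian(x, mu, sd))
    alpha = np.asarray(pi, dtype=float) * B[0]
    loglik = 0.0
    for t in range(len(B)):
        if t > 0:
            alpha = (alpha @ A) * B[t]
        c = alpha.sum()
        loglik += np.log(c)
        alpha = alpha / c
    return float(loglik)


def viterbi(x, pi, A, mu, sd):
    """Problem 2: most likely sequence of hidden states (0-based labels)."""
    with np.errstate(divide="ignore"):       # log(0) = -inf for forbidden moves
        log_pi = np.log(np.asarray(pi, dtype=float))
        log_A = np.log(np.asarray(A, dtype=float))
    log_B = log_gaussian(x, mu, sd)
    T, k = log_B.shape
    delta = log_pi + log_B[0]
    back = np.zeros((T, k), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + log_A        # cand[i, j]: best path ending i -> j
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]


# Hypothetical two-regime series.
x = [0.1, -0.3, 0.2, 5.1, 4.8, 5.3]
pi, A = [0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]]
mu, sd = [0.0, 5.0], [1.0, 1.0]
print(forward_loglik(x, pi, A, mu, sd), viterbi(x, pi, A, mu, sd))
```

In the HMTD setting, the same recursions would apply with the emission densities evaluated at time-varying means and standard deviations, in line with the minor adjustments mentioned above.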

Although the maximization of the log-likelihood is not difficult from a purely computational point of view, it is much

more difficult to assess the quality of the solution. Different problems can occur during the estimation phase. For instance,

even though it has been demonstrated that EM algorithms converge to a maximum of the log-likelihood, there is no way

of ensuring that this is the global maximum, rather than a local optimum. Even for quite simple HMTD models with few

components and lags, the solution space can be very complex with many local optima. Different solutions have been proposed

to overcome this issue, ranging from several independent runs of the EM algorithm and variants of the basic algorithm

(e.g., the Classification EM and the Stochastic EM algorithm) to a combination of the EM algorithm and a gradient-type or

genetic algorithm. More details can be found in Biernacki et al. (2000), Böhning (2001) and Berchtold (2004).
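The simplest of these remedies, several independent runs of the EM algorithm from different starting points, can be sketched as follows. For brevity, the example uses a plain Gaussian mixture without time dependence rather than a full HMTD model, a fixed number of iterations instead of a convergence test, and a lower bound on the standard deviations (the hypothetical `sd_floor` argument) to keep the likelihood bounded. All names and settings are assumptions made for the illustration.

```python
import numpy as np


def mixture_density(x, w, mu, sd):
    """Weighted component densities, shape (n, k)."""
    z = (x[:, None] - mu) / sd
    return w * np.exp(-0.5 * z ** 2) / (sd * np.sqrt(2.0 * np.pi))


def em_gaussian_mixture(x, k, rng, n_iter=200, sd_floor=1e-3):
    """One EM run started from k randomly chosen observations."""
    x = np.asarray(x, dtype=float)
    w = np.full(k, 1.0 / k)
    mu = rng.choice(x, size=k, replace=False)
    sd = np.full(k, x.std())
    for _ in range(n_iter):
        dens = mixture_density(x, w, mu, sd)
        resp = dens / dens.sum(axis=1, keepdims=True)         # E-step
        nk = resp.sum(axis=0)                                 # M-step
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        sd = np.maximum(sd, sd_floor)
    loglik = float(np.log(mixture_density(x, w, mu, sd).sum(axis=1)).sum())
    return loglik, w, mu, sd


def best_of_restarts(x, k, n_starts=10, seed=0):
    """Keep the run with the largest log-likelihood among n_starts runs."""
    rng = np.random.default_rng(seed)
    runs = [em_gaussian_mixture(x, k, rng) for _ in range(n_starts)]
    return max(runs, key=lambda run: run[0])


# Simulated data with two well-separated groups (illustration only).
data_rng = np.random.default_rng(1)
x = np.concatenate([data_rng.normal(0.0, 1.0, 200),
                    data_rng.normal(6.0, 1.0, 200)])
loglik, w, mu, sd = best_of_restarts(x, k=2)
print(loglik, w, np.sort(mu), sd)
```

Keeping the best of several runs does not guarantee the global maximum, but it reduces the risk of reporting a poor local optimum.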

Another well-known issue is the possible degeneracy of the log-likelihood, which becomes unbounded when the standard deviation of a component collapses toward zero. This issue is easy to detect, and it can be avoided by imposing constraints on the model parameters, especially on the coefficients of the standard deviation. Finally, when a component is fitted to very small subsets of the data, the degeneracy in the maximum likelihood estimation might be due to particular features of those subsets, preventing a good generalization of the results. To avoid such spurious solutions, we again constrain some of the parameters, for instance, by limiting the ratio between the largest and smallest weights of the different components. For further details, refer to Berchtold (2003).

Other difficulties related to mixture models are formulating standard errors and t-statistics, in order to perform diagnostics

of the parameters estimated from the EM algorithm, and computing confidence intervals. The simple use of the

expectation is typically not possible, and obtaining the Hessian matrix is not straightforward. In the literature, two classes

of methods have been proposed to estimate the variance matrix of maximum likelihood estimators: information-based

methods and resampling methods. Information-based methods propose estimating the variance matrix as the inverse of

the information matrix. One way is to approximate the information matrix from the complete(-data) likelihood (e.g., Louis,

1982) so the mixture membership is treated as known. On the other hand, Dietz and Böhning (1996) proposed approximating

the Fisher information using the original data (i.e., using the incomplete-data likelihood). These methods are computationally

efficient. However, in both cases, we only have an asymptotic approximation of the covariance matrix, so the sample

size must be large enough to guarantee that the asymptotic theory of maximum likelihood holds. More recently, Boldea

and Magnus (2009) derived the analytical forms of the score vector and Hessian matrix for multivariate Gaussian mixture

models, which are used to estimate the observed information matrix directly.

The bootstrap method is a more popular technique for performing model diagnostics in mixture modeling. In this class of methods, we can distinguish three main approaches: the parametric bootstrap (Basford et al., 1997; McLachlan and Peel, 2000), the non-parametric bootstrap (Efron, 1979), and a modified version of the non-parametric bootstrap, proposed by Newton and Raftery (1994), known as the weighted bootstrap. In the latter approach, the data are weighted proportionally to the number of times that a sample value occurs in the bootstrap re-sample. The parametric and non-parametric bootstrap methods differ in the way the re-sampling is performed. In the first case, the repeated draws are made from the fitted mixture (i.e., the mixture with parameters fixed at their estimated values). In the non-parametric approach, the bootstrap samples are drawn directly from the empirical distribution of the original data. Whatever resampling technique is chosen, the bootstrap approach estimates the model in each bootstrap sample, then computes the in-sample bootstrap standard errors of the corresponding parameter estimates using a Monte Carlo approximation (McLachlan and Peel, 2000). Efron and Tibshirani (1994) suggest that 50–100 bootstrap replications are generally sufficient for standard error estimation.

In the case of resampling-based methods, because of the invariance of the likelihood under permutations of the mixture

components, a label-switching problem may arise (Redner and Walker, 1984). However, according to McLachlan and Peel

(2000), if we use the parameters estimated from the original data as initial values in the EM algorithm in each bootstrap

sample, this problem should remain rare. Nevertheless, the label-switching problem has been extensively studied in the

literature, and different solutions have been proposed (Stephens, 2000).

Longitudinal analyses in the social sciences generally focus more on panel data than on univariate time series. Therefore,

when employing a bootstrap method, two further issues have to be addressed. First, we have to choose how to perform

the sampling. Several methods of resampling panel data have been proposed in the literature: temporal sampling, where

the resampling is performed at each time point; the block bootstrap, which re-samples blocks of consecutive observations;

and individual or cross-sectional sampling, which treats each individual sequence as a whole unit in order to preserve

intra-individual dependence (Kapetanios, 2008).
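The individual (cross-sectional) resampling scheme can be sketched as follows: whole sequences are drawn with replacement, the model is re-estimated on each bootstrap sample, and the standard deviation of the estimates across samples gives the bootstrap standard error. In this Python sketch, a pooled first-order autoregressive coefficient stands in for the much heavier HMTD estimation, sample weights are ignored, and all names, the simulated panel, and the settings are hypothetical.

```python
import numpy as np


def pooled_ar1_coefficient(sequences):
    """Least-squares slope of x_t on x_{t-1}, pooled over all sequences.

    Stand-in estimator; a real application would re-fit the full model here.
    """
    prev = np.concatenate([s[:-1] for s in sequences])
    curr = np.concatenate([s[1:] for s in sequences])
    prev_c = prev - prev.mean()
    return float(np.dot(prev_c, curr - curr.mean()) / np.dot(prev_c, prev_c))


def individual_bootstrap_se(sequences, estimator, n_boot=100, seed=0):
    """Bootstrap standard error obtained by resampling whole sequences."""
    rng = np.random.default_rng(seed)
    m = len(sequences)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, m, size=m)      # draw subjects with replacement
        estimates[b] = estimator([sequences[i] for i in idx])
    return float(estimates.std(ddof=1))


# Simulated panel: 60 subjects with sequences of unequal length.
sim_rng = np.random.default_rng(42)
panel = []
for _ in range(60):
    length = int(sim_rng.integers(5, 11))
    seq = np.zeros(length)
    for t in range(1, length):
        seq[t] = 0.5 * seq[t - 1] + sim_rng.normal()
    panel.append(seq)

estimate = pooled_ar1_coefficient(panel)
se = individual_bootstrap_se(panel, pooled_ar1_coefficient)
print(estimate, se)
```

Because each subject enters or leaves a bootstrap sample as a block, the dependence between the successive observations of that subject is left intact.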

The second issue is related to longitudinal panel weights. The resampling must account for the sample weights to avoid

a biased estimation, in a way similar to the weighted bootstrap proposed by Newton and Raftery (1994). In survey methodologies,

resampling methods from stratified populations have been introduced (e.g., Kovar et al., 1988), but these still need

to account for complex sampling designs, as well as deal with non-response and longitudinal panel weights.

4. Model selection

The great flexibility of the HMTD model can sometimes also be a drawback. For example, it is difficult to decide on the

structure for the ‘‘best’’ model to fit the data. Should we increase the number of lags, or the number of components? Should

we use more covariates? Should we use a non-constant variance? These are just some of the options we need to consider.

However, before addressing these questions, it is necessary to consider the concept of a statistical model and the notion of

a ‘‘best’’ model.

Modeling a time series can be motivated by different objectives, for instance:

1. Obtaining the most precise representation of the data under investigation.

2. Obtaining a model able to represent time series similar to the observed data.

3. Building a model that provides the best possible predictions.

4. Classifying the data into a finite set of mutually exclusive categories.

These different objectives are, of course, related. However, even the notion of a ‘‘best’’ model can differ. In general, we consider that model A describes a given time series better than model B does when model A has the greater probability of generating the data. This is equivalent to saying that the (log-)likelihood of model A, given the data, is larger than the (log-)likelihood of model B. On the other hand, since it is almost always possible to increase the (log-)likelihood of a model by simply adding more parameters, the model representing the data in the most accurate way can be very complex in terms of free parameters that need to be estimated. In such a case, this model is perhaps not the most desirable for the purpose of analysis and decision-making. This is why we generally prefer to select a model based on a penalized criterion, considering both the precision of the modeling and the complexity. The Bayesian Information Criterion (BIC, see Schwarz, 1978; Kass and Raftery, 1995; Raftery, 1995) is an example of such a criterion. Moreover, Leroux (1992) proves the consistency of maximum (penalized) likelihood estimators for a mixing distribution obtained from the number of components selected using the AIC or BIC.

For a given model, the BIC is defined as

\[ \mathrm{BIC} = -2\,LL + P \log(N) \]

where LL is the log-likelihood of the model, P is the number of independent parameters, and N is the number of data points used

to compute the log-likelihood. The model with the smallest BIC is chosen. For the log-likelihood, we can choose between

the observed and the complete log-likelihood. In the first case, we consider the contribution of the visible part of the model,

given the hidden part, and the BIC explains how well the model fits the data. However, by using the complete log-likelihood,

we have the information from both the visible and the hidden part of the model. Since we are interested in assessing the

quality of both levels of the HMTD, we use the latter approach.

When we are mainly interested in using the model to predict the probability of future events, we have to consider that the

future behavior of the series will, generally, not be the same as its past behavior. Consequently, a model that describes too precisely

a particular but rarely observed feature of the data can be disadvantageous. In that case, using the BIC can also be useful,

since the precise description of a particular aspect of a time series generally involves many additional parameters.

Another use of mixture models is in classifying data. Here, we assume that each component of the model generates

a different kind of data, and we try to allocate each observed data element to one of the components in order to obtain

homogeneous clusters. In the context of time series, we usually try to obtain a representation in which groups of several

successive data are identified. We then need a model that is sufficiently coarse that it is not unduly influenced by relatively

small variations in the data. In practice, the mixture model is used to associate, a posteriori, each observation to one

component (e.g., McLachlan and Basford, 1988), and in particular, to the component with the largest expectation.
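The a posteriori allocation can be illustrated with a static Gaussian mixture. This is a simplified sketch, not the HMTD itself: it assumes that the allocation rule means choosing the component with the largest posterior membership probability, it ignores the hidden chain and the lagged expectations, and all parameter values are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_probs(x, weights, means, sds):
    """Posterior membership probabilities of x for each mixture component."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(joint)
    return [j / total for j in joint]

def classify(x, weights, means, sds):
    """Index (from 0) of the component with the largest posterior probability."""
    probs = posterior_probs(x, weights, means, sds)
    return max(range(len(probs)), key=probs.__getitem__)

# Hypothetical two-component mixture on a log-income scale.
weights, means, sds = [0.7, 0.3], [10.0, 8.0], [0.5, 1.0]
labels = [classify(x, weights, means, sds) for x in (10.2, 7.5)]
```

Here the observation 10.2 is allocated to the first component and 7.5 to the second, which yields the kind of mutually exclusive clusters described above.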

Whatever the final goal of the modeling, we can adopt either a deductive or an inductive approach. In the deductive case,

we start from a clear theory, which the statistical model should reflect. For instance, if the theory tells us that three different

behaviors can be adopted by the subjects under study, then we should choose a three-component model, one component to represent each

separate behavior. Next, the theory tells us that the first behavior is memoryless, while the other two can be explained by

their past behavior. The first component should then have a constant mean, while the other two should use one or several

lags, and so on. If the theory is sufficiently detailed, then we can use it to completely define the structure of the model.

Once computed, a theory-driven model can be analyzed and its parameters interpreted. Hypotheses can be tested by

computing alternative models that modify aspects of the initial model. These alternative models are then compared to

the initial model based on either likelihood ratio tests (Giudici et al., 2000; Dannemann and Holzmann, 2008) or the BIC. If

alternative models prove to be superior to the first theory-driven model, then the theory may be questioned. The alternative

is to perform an empirical search adopting an inductive approach. In this case, we are not driven by a theory, but instead


search for the model that best explains the data under investigation. Given the large number of different models that could

possibly be specified from the same dataset, the question then arises of how to define an optimal search strategy. A possibility

is to rely on the fact that the HMTD belongs to the family of multilevel models and to adopt an ad hoc strategy, such as the

one proposed by Hox (1995). The idea is to work hierarchically, starting from a very basic model, then adding more advanced

elements one at a time. After each change, the new model is compared to the previous one using a measure such as the BIC.

In the context of a HMTD, this can be summarized as follows:

1. Begin with a two-component model, a first-order hidden transition matrix, constant expectations and variance for both

components, and no covariates.

2. Try to find the optimal number of components by computing additional models with 3, 4, 5, ... components. Note that

since no other explanatory elements are used in the model, the number of components could become very large. In this

situation, a limit (for instance 5 or 10) should be imposed.

3. Add lags to the modeling of the expectation of the visible components.

4. Allow the variance of each component to be non-constant using one of the specifications proposed in Section 2.1.

5. Replace the first-order transition matrix with a higher-order transition matrix.

6. Introduce covariates at the visible level.

7. Introduce covariates at the hidden level.

This strategy calls for several remarks:

• Since we start with a very simple model in which the only explanatory element lies in the number of visible components,

this number could be very large at first. Even if it is restricted to a maximal value in step 2, it is likely that introducing

additional explanatory elements will later render some of the components useless. A good idea is then to try, at least at

the end of each of the aforementioned steps, to reduce the number of components.

• Beginning with a refinement of the visible model before trying the hidden model is based on the fact that we know the

structure of the data at the visible level. However, the structure of the hidden level is known only in distribution. Therefore,

we have better control of the visible level, making it more useful to begin with a precise model of the observed level.

• Covariates are introduced in the model in the last steps of this strategy. In general, longitudinal data are, at least partially,

self-explaining. Therefore, we should begin by using the information contained in the data, as much as possible, before

using external information. Of course, if the phenomenon under study is known to be strongly related to a particular

external factor, for example gender or age, then this factor can be introduced at an earlier stage.

Finally, always keep in mind that a theory-driven model allows more generalization than does an empirical model. In that

sense, and whatever the approach used, one should avoid trying models without a real practical interpretation. The overall

‘‘best’’ model might not be the model with the optimal BIC value or the one that optimally fits a sample of data. Instead, it

is the model that best allows us to understand the phenomenon under study.

5. An empirical application

In this section, we illustrate an application of the HMTD and the selection procedure described in Section 4 on unbalanced

panel data. The data are from the core representative sample of the Panel Study of Income Dynamics (PSID). The PSID is the

longest running nationally representative household panel study and allows unique opportunities to conduct life course

research and generational studies over four decades. The PSID began collecting socio-economic and demographic data in

1968 on a national sample of US households (about 4800 families) with the objective of studying the dynamics of income

and poverty in the United States (Hill, 1991). In order to allow the sample to remain representative of the US population, the

PSID follows each family member of the original 1968 sample, even if they move out of the original household and create

their own new families. Interviews took place yearly until 1997 and every two years thereafter. As of 2009, 8690 households had

been interviewed, incorporating 24,385 individuals.

Our application concerns household income from 1968 to 2009 in the United States. The family income used in this

example reflects income from any source (labor, assets, transfers, and so on) and from all persons living in the family unit.

Since 1979, the value of family income was topcoded at $99,999. In 1980, this increased to $999,999, and since 1981, has

been $9,999,999. To correctly compare the evolution over four decades, the data used in this study are adjusted for inflation

in 2008 dollars (Consumer Price Index-U series).

During a life course, it is highly likely that a household will suffer a series of independent economic shocks due to the loss

of a job, national economic downturns, changes in the family structure, and so on. When household income dynamics are analyzed

over almost four decades, the impact of these shocks should be relevant. Thus, using an appropriate model becomes crucial.

For instance, if we represent the income dynamics with a linear stationary (autoregressive) process, the time taken to

recover from an economic shock is assumed to be the same for poor and wealthier households. In the literature, one of

the alternatives proposed is to use a nonlinear model in the classic way of quadratic or cubic functions of lagged income

(Lokshin and Ravallion, 2001).

The HMTD model, with different possible specifications for the expectation and the standard deviation of each Gaussian

component (see Section 2.1), constitutes a valid alternative to this classic approach. In addition, using the HMTD, we can

also simultaneously consider the income stratification over time in terms of the serially inter-temporal correlation between


Table 2

Step 1. Finding the optimal number of components.

| No. components | p | q | Hidden chain order | No. of parameters | Complete log-likelihood | BIC |
|---|---|---|---|---|---|---|
| 2 | 2 | 0 | 1 | 10 | −62,532.2 | 125,165 |
| 3 | 2 | 0 | 1 | 18 | −48,134.8 | 96,449.1 |
| 4 | 2 | 0 | 1 | 28 | −42,105.6 | 84,491.7 |

Note: p represents the dependence order of each expectation, and q denotes the number of lags for each standard deviation.

successive income levels (the level of income today is naturally a good predictor of its value in subsequent years), as well

as the problem of population heterogeneity. If the hidden variable, S, is interpreted as a set of discrete income regimes, issues

related to the heterogeneity in income dynamics can be addressed in two ways. The first is to compute the most likely sequence

of hidden wealth states directly, and the transition probabilities between them. The second is to constrain the hidden

transition matrix to the identity matrix, and then to use the HMTD to classify the households into mutually exclusive

homogeneous groups according to their income trajectories.

In order to capture the effects of demographic events and relevant changes in household composition on the earning

dynamics process, we include four socio-demographic factors in the model: the size of the household, the age and gender of

the head of the household, and his/her educational level. In this illustrative example, we focus only on the effect of certain

socio-demographic characteristics on the evolution of household income over time, even though other aspects, such as job

history and job sector, may be relevant. The variables are all considered to be time varying to capture the effect of changes

in the household composition on the earning dynamic process. We assume that a household may suffer a negative shock

when a change occurs involving the head of the household. For instance, owing to the gender gap in earnings, having a man

or a woman as the head of the household probably affects the overall income of the family.

In order to show the relevance of the model using medium- and long-term panel data, we only include households that

have been interviewed at least 10 times over the four decades covered by the PSID.

5.1. Model selection procedure

As underlined in the previous section, a crucial point is the model selection procedure we use to find the ‘‘best’’ possible

model. For illustration purposes, we use an inductive data-driven approach here. Considering the large amount of data

available in the PSID database and the large number of models that could be estimated, and to avoid an unnecessarily large

computation time, we consider a random subsample of 1000 US households from the core representative sample. The dataset

used is made up of 1000 multivariate series of different lengths (from 10 to 36 observations for each household) representing

the (log transformation of the) household income. To compare the specifications, all models are computed using the same

number of observations with a total of 22,496 data points.

In the following tables, we summarize the model specifications and, in order to assess the quality of each model, we report

the complete log-likelihood and the BIC. Following Raftery (1995), a BIC difference of more than five points suggests ‘‘strong’’

evidence in favor of the model with the lower BIC.
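The BIC definition of Section 4 can be checked numerically. The sketch below assumes the natural logarithm and N = 22,496, the number of data points stated above; the function name `bic` and the dictionary labels are illustrative. The log-likelihoods and parameter counts are those reported in Table 6 for the four-component model with p = 4 and q = 1.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """BIC = -2 LL + P log(N), using the natural logarithm."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

N_OBS = 22496  # data points used for every model

# (complete log-likelihood, number of parameters), taken from Table 6
models = {
    "hidden order 1": (-26642.2, 40),
    "hidden order 2": (-24320.9, 76),
    "hidden order 3": (-24258.1, 220),
}

scores = {name: bic(ll, p, N_OBS) for name, (ll, p) in models.items()}
best = min(scores, key=scores.get)  # the model with the smallest BIC
```

Under these assumptions the recomputed values agree with the tabulated BIC to within rounding, and the second-order model has the smallest BIC.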

As shown in Eq. (2.1), it is possible to assign a different number of lags when modeling the expectation of each component.

This allows components to have a shorter or longer memory, according to a specific research question, but for convenience,

we set the same number of lags for all components in this application.

In applied research, in particular in the social sciences, researchers might be more interested in an interpretable and generalizable

model than in the numerically optimal one. A possibility is to set some constraints on the number of components and

lags in order to obtain an interpretable model and to avoid computing too many models with different specifications,

which saves computation time. These constraints can be based, for instance, on previous investigations or on theoretical

assumptions.

In this illustration, we assume that there are no more than four types of income regimes (i.e. the maximum number of

components is four) representing for instance four types of income shocks: idiosyncratic or aggregate, transitory or permanent.

We also make the assumption that, from an economic point of view, the number of lags when modeling the expectation

for each component should be between two and four. Considering only a time dependence of order one, the current income

may be over-influenced by negative short-term shocks. On the other hand, owing to the business cycle, we assume that income

shocks are not likely to persist after four years (e.g. Hamilton, 1989).

The idea of the proposed model selection procedure is to work hierarchically, starting from the simplest possible model.

So, we try first to find the optimal number of components (i.e. income regimes) without explanatory variables using a first-

order hidden transition matrix and a constant variance.

Looking at the quality of the models shown in Table 2, and using a hidden chain of order one and two lags for the expectation,

the BIC always improves when increasing the number of components. Then, considering the constraint discussed

above, the HMTD with the lowest BIC is the one with four components (i.e., a BIC of 84,491.7).

By increasing the number of independent parameters through additional components, the log-likelihood always improves,

but these additional components may not bring useful information to the model, resulting in a very marginal change


Table 3

Step 2. Add lags for the expectation.

| No. components | p | q | Hidden chain order | No. of parameters | Complete log-likelihood | BIC |
|---|---|---|---|---|---|---|
| 4 | 2 | 0 | 1 | 28 | −42,105.6 | 84,491.7 |
| 4 | 3 | 0 | 1 | 32 | −35,065.2 | 70,451.1 |
| 4 | 4 | 0 | 1 | 36 | −27,871.1 | 56,103 |

Table 4

Step 3. Add lags to the modeling of the standard deviation.

| No. components | p | q | Hidden chain order | No. of parameters | Complete log-likelihood | BIC |
|---|---|---|---|---|---|---|
| 4 | 4 | 0 | 1 | 36 | −27,871 | 56,103 |
| 4 | 4 | 1 | 1 | 40 | −26,642.2 | 53,685.2 |
| 4 | 4 | 2 | 1 | 44 | −26,635.1 | 53,711.1 |

Table 5

Recursive step: Complexity reduction.

| No. components | p | q | Hidden chain order | No. of parameters | Complete log-likelihood | BIC |
|---|---|---|---|---|---|---|
| 4 | 4 | 1 | 1 | 40 | −26,642.2 | 53,685.2 |
| 4 | 3 | 1 | 1 | 36 | −27,570 | 55,501 |
| 4 | 3 | 0 | 1 | 32 | −35,065.2 | 70,451.1 |
| 3 | 4 | 1 | 1 | 27 | −42,348.1 | 84,966.8 |
| 3 | 4 | 0 | 1 | 24 | −42,426.9 | 85,094.4 |

Table 6

Step 4. Order of the hidden chain.

| No. components | p | q | Hidden chain order | No. of parameters | Complete log-likelihood | BIC |
|---|---|---|---|---|---|---|
| 4 | 4 | 1 | 0 | 27 | −47,669.8 | 95,610.3 |
| 4 | 4 | 1 | 1 | 40 | −26,642.2 | 53,685.2 |
| 4 | 4 | 1 | 2 | 76 | −24,320.9 | 49,403.4 |
| 4 | 4 | 1 | 3 | 220 | −24,258.1 | 50,720.8 |

in the log-likelihood. Since at the beginning of the selection procedure the only explanatory element is the number of components,

the number of free parameters, and hence the penalization term in the BIC, is relatively low, making

it easier to improve the BIC. For this reason, and to avoid unnecessarily complicated models, it is a good idea to set some

constraints on the components. This might be a limit on the maximum number of components, as in the example, or a

threshold on the membership probabilities. In this way, an additional component that represents only a few cases will not

be included in the model selection procedure.

In the second step, after having fixed the number of components at four, we then increase the time dependence for the

expectation of each component. As mentioned before, we also make an assumption on the maximum number of lags for the

expectation. From an economic perspective, we assume the existence of a business cycle such that economic shocks should

not persist after a cycle of four years. So the maximum number of lags we consider for the expectation is four. Moreover,

since 1997 the PSID data have been collected every two years, so four time points for the expectation actually

cover a period of four to eight years.

According to the BIC reported in Table 3, the most adequate model uses a time dependence of order four for the expectation

(i.e., a BIC = 56,103).

Once we have found the best combination of the number of components and lags for the expectation, we introduce the

dependence between successive observations in the specification of the standard deviation. Then, we increase the order of

the hidden transition matrix. For the specification of the standard deviation, we use Eq. (2.3), comparing each lag to the

empirical mean of the component. As a result of the nonlinearity, during the estimation of the standard deviations, the log-

likelihood sometimes diverges owing to the difficulty of identifying good initial conditions for the optimization procedure.

Therefore, we include at most two lags. Table 4 shows that the best solution is to include only one lag (i.e., a BIC = 53,685.2).

Since this step-by-step procedure does not compare all possible models, it might not lead directly to the best possible

solution. Before including a higher dependence at the hidden level, it can be interesting to try to reduce the complexity of

the model by reducing the number of components and/or the number of lags. In this particular case, any simplification of

the model leads to a worse solution, as shown in Table 5, which summarizes different models in decreasing order of complexity.

Turning now to the hidden level (Table 6), we notice first that the HMTD performs better than a pure mixture model. The

model with a hidden chain of order zero has a BIC (95,610.3) almost double that of the HMTD with a hidden chain of order one (53,685.2). On the

other hand, by increasing the hidden order from one to two, we obtain a significantly better model (i.e., a BIC = 49,403.4);

however, a third-order model brings no further improvement, since its BIC (50,720.8) is higher.
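The rejection of the third-order chain can be understood by counting parameters. The sketch below assumes that a hidden chain of order f over k states is estimated with a full transition matrix, that is, k^f rows with k − 1 free probabilities each. This is an inference from Table 6, where the parameter counts grow by 36 and then by 144, and not a statement taken from the text; the function name is illustrative.

```python
import math

def transition_params(k, order):
    """Free probabilities in a full transition matrix of the given order."""
    return (k ** order) * (k - 1)

counts = {f: transition_params(4, f) for f in (1, 2, 3)}
increments = (counts[2] - counts[1], counts[3] - counts[2])

# Moving from order two to order three (Table 6, N = 22,496):
extra_penalty = increments[1] * math.log(22496)   # added BIC penalty
fit_gain = 2.0 * (24320.9 - 24258.1)              # reduction in -2 LL
```

Under this assumption the increments (36 and 144) match the differences between the parameter counts in Table 6 (76 − 40 and 220 − 76). The added penalty for the third-order chain far exceeds the gain in fit, so the BIC increases.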


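Table 7

Recursive step: Complexity reduction.

| No. components | p | q | Hidden chain order | No. of parameters | Complete log-likelihood | BIC |
|---|---|---|---|---|---|---|
| 4 | 4 | 1 | 2 | 76 | −24,320.9 | 49,403.4 |
| 4 | 4 | 0 | 2 | 72 | −24,470.8 | 49,663.1 |
| 4 | 3 | 1 | 2 | 72 | −24,619.2 | 49,959.8 |
| 4 | 3 | 0 | 2 | 68 | −25,388.1 | 51,457.6 |
| 3 | 4 | 1 | 2 | 39 | −29,075 | 58,540.8 |
| 3 | 4 | 0 | 2 | 36 | −30,488.1 | 61,337 |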

Once we have fixed the order of the hidden process at two, and before including the effects of external factors, we can

again try to reduce the complexity of the model (see Table 7). However, as before, this step did not lead to a better BIC, so

we stay with the previous model.
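The forward part of the search (Steps 1 to 4) can be replayed from the reported results. In this sketch the BIC values are copied from Tables 2 to 6 and looked up instead of being obtained by fitting models; the names `REPORTED_BIC` and `best_along` are illustrative, and the recursive complexity-reduction checks of Tables 5 and 7 are left out.

```python
# BIC values from Tables 2 to 6, keyed by
# (components, lags p for the expectation, lags q for the sd, hidden order).
REPORTED_BIC = {
    (2, 2, 0, 1): 125165.0, (3, 2, 0, 1): 96449.1, (4, 2, 0, 1): 84491.7,
    (4, 3, 0, 1): 70451.1, (4, 4, 0, 1): 56103.0,
    (4, 4, 1, 1): 53685.2, (4, 4, 2, 1): 53711.1,
    (4, 4, 1, 0): 95610.3, (4, 4, 1, 2): 49403.4, (4, 4, 1, 3): 50720.8,
}

def best_along(current, position, candidates, table):
    """Vary one element of the specification and keep the lowest-BIC option."""
    options = [current[:position] + (c,) + current[position + 1:]
               for c in candidates]
    return min(options, key=table.__getitem__)

spec = (2, 2, 0, 1)  # starting point of Step 1
steps = [
    (0, (2, 3, 4)),     # Step 1: number of components (at most four)
    (1, (2, 3, 4)),     # Step 2: lags for the expectation (two to four)
    (2, (0, 1, 2)),     # Step 3: lags for the standard deviation
    (3, (0, 1, 2, 3)),  # Step 4: order of the hidden chain
]
for position, candidates in steps:
    spec = best_along(spec, position, candidates, REPORTED_BIC)
```

The loop ends at four components, four lags for the expectation, one lag for the standard deviation, and a second-order hidden chain, which is the model retained before covariates are introduced.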

The last step of the suggested procedure is to introduce the effect of external covariates. The covariates may be included

at the visible level, in the specification of the mean and standard deviation, and/or at the hidden level by influencing the

transition probabilities between hidden states. As evident from Eq. (2.1), it is possible to analyze the effect of a covariate on

one single component instead of on the overall mixture. This is useful to overcome problems of multi-collinearity, because

using the same covariate on multiple components might give redundant information to the model. Observing the effect of

the covariate on only one component might also be useful when we use the mixture models to classify data or when the

analysis is theory-driven, for instance, when we search for the effect of external factors on a specific group. In our example,

we include only one covariate at a time when modeling the expectation of each component. We consider four covariates.

Two are categorical: having a higher level of education (H. Edu) and the gender of the head of the household being male

(Male). The other two are continuous: the age of the respondent (Age) and the size of the family (H. Size). Table 8 reports the

quality of the more relevant models. The best model is that using all four covariates, each covariate being associated with

one of the four components (i.e., a BIC = 44,946.8).

5.2. Model interpretation

In this section, we present an approach to interpret the results of a HMTD. To sum up, for the visible level, we have a

mixture of four components, with four lags for the expectation of each component, four covariates, and a time dependence of

order one for the standard deviation. The hidden level is driven by a second-order transition matrix. The dependent variable

is the (log transformation of the) income levels of a random sub-sample of 1000 US households from the core representative

sample of the PSID database. Four covariates are considered: the size of the household, and the age, gender, and educational

level of the head of the household.

The transition matrix between hidden states is as follows (in reduced form):

\[
Q = \begin{array}{cc|cccc}
 & & \multicolumn{4}{c}{S_t} \\
S_{t-2} & S_{t-1} & 1 & 2 & 3 & 4 \\ \hline
1 & 1 & 0.8200 & 0.0246 & 0.1403 & 0.0152 \\
2 & 1 & 0.7594 & 0.0010 & 0.2198 & 0.0198 \\
3 & 1 & 0.8661 & 0.0190 & 0.0917 & 0.0233 \\
4 & 1 & 0.9195 & 0.0440 & 0.0149 & 0.0216 \\
1 & 2 & 0.6566 & 0.0072 & 0.3148 & 0.0214 \\
2 & 2 & 0.3368 & 0.0000 & 0.6437 & 0.0195 \\
3 & 2 & 0.3987 & 0.0069 & 0.5685 & 0.0259 \\
4 & 2 & 0.7504 & 0.0016 & 0.2261 & 0.0219 \\
1 & 3 & 0.8629 & 0.0067 & 0.1029 & 0.0275 \\
2 & 3 & 0.8542 & 0.0007 & 0.1029 & 0.0422 \\
3 & 3 & 0.8240 & 0.0173 & 0.1318 & 0.0269 \\
4 & 3 & 0.9031 & 0.0053 & 0.0726 & 0.0190 \\
1 & 4 & 0.8563 & 0.0037 & 0.0271 & 0.1128 \\
2 & 4 & 0.8692 & 0.0000 & 0.0462 & 0.0846 \\
3 & 4 & 0.8620 & 0.0026 & 0.0239 & 0.1115 \\
4 & 4 & 0.9482 & 0.0049 & 0.0191 & 0.0278
\end{array}
\]

Here, Q represents the general second-order transition process between hidden states, given the value assumed by the

hidden variable in the two previous periods $(S_{t-2}, S_{t-1})$. The distribution of the first hidden state ($\pi_1$) and the second hidden

state, conditional on the first ($\pi_{2|1}$), are

\[
\pi_1 = (0.7805;\ 0.0091;\ 0.1716;\ 0.0388), \qquad
\pi_{2|1} = \begin{pmatrix}
0.8412 & 0.0221 & 0.1167 & 0.0200 \\
0.5356 & 0.0039 & 0.4383 & 0.0222 \\
0.8611 & 0.0075 & 0.1025 & 0.0289 \\
0.8839 & 0.0028 & 0.0291 & 0.0842
\end{pmatrix}.
\]
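The reduced form of Q can be read as a lookup table indexed by the two previous hidden states. The sketch below stores the estimates above in that form and computes the probability of a short hidden path. It assumes that row i of $\pi_{2|1}$ is the distribution of the second state given that the first state is i; the function name `path_probability` is illustrative.

```python
# Rows of Q keyed by (S_{t-2}, S_{t-1}); entries are P(S_t = 1), ..., P(S_t = 4).
Q = {
    (1, 1): [0.8200, 0.0246, 0.1403, 0.0152], (2, 1): [0.7594, 0.0010, 0.2198, 0.0198],
    (3, 1): [0.8661, 0.0190, 0.0917, 0.0233], (4, 1): [0.9195, 0.0440, 0.0149, 0.0216],
    (1, 2): [0.6566, 0.0072, 0.3148, 0.0214], (2, 2): [0.3368, 0.0000, 0.6437, 0.0195],
    (3, 2): [0.3987, 0.0069, 0.5685, 0.0259], (4, 2): [0.7504, 0.0016, 0.2261, 0.0219],
    (1, 3): [0.8629, 0.0067, 0.1029, 0.0275], (2, 3): [0.8542, 0.0007, 0.1029, 0.0422],
    (3, 3): [0.8240, 0.0173, 0.1318, 0.0269], (4, 3): [0.9031, 0.0053, 0.0726, 0.0190],
    (1, 4): [0.8563, 0.0037, 0.0271, 0.1128], (2, 4): [0.8692, 0.0000, 0.0462, 0.0846],
    (3, 4): [0.8620, 0.0026, 0.0239, 0.1115], (4, 4): [0.9482, 0.0049, 0.0191, 0.0278],
}
PI_1 = [0.7805, 0.0091, 0.1716, 0.0388]
PI_2_GIVEN_1 = [
    [0.8412, 0.0221, 0.1167, 0.0200],
    [0.5356, 0.0039, 0.4383, 0.0222],
    [0.8611, 0.0075, 0.1025, 0.0289],
    [0.8839, 0.0028, 0.0291, 0.0842],
]

def path_probability(states):
    """Probability of a hidden path of length two or more (states coded 1 to 4)."""
    first, second = states[0], states[1]
    prob = PI_1[first - 1] * PI_2_GIVEN_1[first - 1][second - 1]
    # From the third position on, each state depends on the two preceding ones.
    for two_back, one_back, current in zip(states, states[1:], states[2:]):
        prob *= Q[(two_back, one_back)][current - 1]
    return prob

stay_in_first = path_probability([1, 1, 1])
```

Each row sums to one up to rounding, and the probability of remaining in the first state over the first three periods is the product 0.7805 × 0.8412 × 0.8200.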


Table 8

Step 5. Introducing external covariates.

| k | p | q | Hidden chain order | No. of parameters | Covariate | Complete log-likelihood | BIC |
|---|---|---|---|---|---|---|---|
| 4 | 4 | 1 | 2 | 76 | none | −24,320.9 | 49,403.4 |
| 4 | 4 | 1 | 2 | 77 | Male | −24,126.8 | 49,025.2 |
| 4 | 4 | 1 | 2 | 77 | H. Edu | −24,167.5 | 49,106.7 |
| 4 | 4 | 1 | 2 | 77 | H. Size | −24,310.8 | 49,393.1 |
| 4 | 4 | 1 | 2 | 78 | Male, Age | −22,285.1 | 45,351.6 |
| 4 | 4 | 1 | 2 | 78 | Male, H. Size | −22,320.8 | 45,423.3 |
| 4 | 4 | 1 | 2 | 79 | Male, Age, H. Size | −22,161.1 | 45,113.8 |
| 4 | 4 | 1 | 2 | 80 | Male, H. Edu, Age, H. Size | −22,072.5 | 44,946.8 |

Note: k is the number of components; p the dependence order of each expectation; q the number of lags for each standard deviation.

Covariate: Male: Man as the head of the household; H. Edu: Higher level of education; Age: Age of the head of the household; H. Size: Size of household.

Fig. 1. Graphical representation of hidden states.

Both the transition matrix and the distributions of the hidden states indicate the predominance of the first state (and,

thus, of the first component at the visible level of the model). The process starts in the first hidden state in 78% of cases. Considering

the overall series, the first hidden state is used 93.44% of the time, the second hidden state 0.67%, the third 4.27%, and the

fourth 1.61%. For a graphical representation of the most common transitions, see Fig. 1 (lower, right corner). With the support

of the R package TraMineR (Gabadinho et al., 2011), we can graphically represent the most likely sequence of hidden states

and their distributions.

The cross-sectional entropy plotted in Fig. 1 (lower, left corner) shows an increasing variability in the distribution of the

hidden states among the households over time, and two periods of significant change. The first occurs at the beginning of

the series, with the 1970s oil crisis, and the second in the early 1990s, with the economic downturns (1989–1992). Another

change in the entropy is observed with the 2008 financial crisis. For the latter two periods, the observed increase in variability

may also be due to a high level of attrition. For the last time point, we have only 383 sequences left of the original 1000.
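The cross-sectional entropy at a given time point is the Shannon entropy of the distribution of hidden states across households at that time. The sketch below is a generic implementation, not the TraMineR routine, whose normalization may differ. As an illustrative input only, it uses the overall shares of the four hidden states quoted above rather than a distribution observed at a single time point.

```python
import math

def entropy(distribution, normalize=False):
    """Shannon entropy (natural log) of a discrete distribution.

    With normalize=True the value is divided by log(number of states),
    so that a uniform distribution has entropy 1.
    """
    h = -sum(p * math.log(p) for p in distribution if p > 0)
    return h / math.log(len(distribution)) if normalize else h

# Overall shares of the four hidden states (93.44%, 0.67%, 4.27%, 1.61%).
overall_shares = [0.9344, 0.0067, 0.0427, 0.0161]
h_overall = entropy(overall_shares, normalize=True)
```

A distribution concentrated on one state has low entropy, so an entropy that rises over time indicates that households become more evenly spread across the hidden states.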

The estimated parameters of Eqs. (2.1) and (2.3) are shown in Table 9 along with the bootstrapped standard errors and

the p-values for statistical significance using 1000 bootstrap replications (Efron and Tibshirani, 1994).

The modeling of the expectation of the first two components is dominated by the first lag, while the second lag dominates

the third component, and the third and fourth lags are the most important in the last component, but with opposite effects

(positive for the third lag and negative for the fourth lag).

The most common income regime, the first component, which represents more than 90% of cases, shows a generally slightly

positive trend in income dynamics, given the positive and relatively high coefficients for the intercept ($\hat{\phi}_{1,0} = 3.0965$) and

the first and last lags ($\hat{\phi}_{1,1} = 0.6789$ and $\hat{\phi}_{1,4} = 0.2298$). As mentioned before, the second state is quite rare (it is used

only 0.67% of the time) and it represents a fluctuating regime. This component is associated both with an increasing trend in

the short run (e.g., $\hat{\phi}_{2,1} = 3.3473$) and with negative shocks. Finally, the third and fourth components imply lower income

levels and are generally associated with short-run negative shocks.

Among the covariates ($\hat{\delta}$), gender and educational level seem to have the largest impact on earnings dynamics (see Table 9,

components one and four, respectively). In particular, the results show the presence of gender bias. Having a man as the

head of the household has a positive impact on the income of the family. Not surprisingly, having a high level of education


Table 9

Estimated parameters for each component.

| Parameter | g = 1 | g = 2 | g = 3 | g = 4 |
|---|---|---|---|---|
| $\hat{\phi}_{g,0}$ | 3.0965 (0.0819) [<0.0001] | −3.2835 (0.3610) [0.0040] | −0.7567 (0.0815) [<0.0001] | 0.7426 (1.2834) [0.9269] |
| $\hat{\phi}_{g,1}$ | 0.6789 (0.0148) [<0.0001] | 3.3473 (0.0997) [<0.0001] | −0.4644 (0.0177) [0.0020] | −0.1578 (0.1034) [0.2162] |
| $\hat{\phi}_{g,2}$ | −0.0493 (0.0099) [0.0020] | −0.6291 (0.0562) [<0.0001] | 2.3375 (0.0367) [<0.0001] | 1.2624 (0.379) [0.0060] |
| $\hat{\phi}_{g,3}$ | −0.2085 (0.0166) [<0.0001] | 0.0160 (0.0727) [0.5986] | −1.5414 (0.1703) [0.0020] | 3.3831 (0.4038) [0.0020] |
| $\hat{\phi}_{g,4}$ | 0.2298 (0.099) [<0.0001] | −1.5077 (0.0884) [<0.0001] | 0.7424 (0.1776) [0.0040] | −3.9031 (0.4863) [0.0020] |
| $\hat{\delta}_{1,\mathrm{Male}}$ | 0.3681 (0.0547) [<0.0001] | | | |
| $\hat{\delta}_{2,\mathrm{H.Size}}$ | | 0.0713 (0.0071) [0.2022] | | |
| $\hat{\delta}_{3,\mathrm{Age}}$ | | | 0.0085 (0.0115) [<0.0001] | |
| $\hat{\delta}_{4,\mathrm{H.Edu}}$ | | | | 1.6369 (0.7056) [0.0400] |
| $\hat{\theta}_{g,0}$ | 0.2364 (0.1850) [<0.0001] | 0.4523 (0.6901) [<0.0001] | 0.2498 (0.0474) [<0.0001] | 8.9801 (2.4186) [<0.0001] |
| $\hat{\theta}_{g,1}$ | 7.8083 (0.3035) [<0.0001] | 9.5959 (0.2929) [<0.0001] | 9.2314 (0.3728) [<0.0001] | 7.8483 (1.0098) [<0.0001] |

Note: Bootstrapped standard errors are reported in parentheses. P-values for statistical significance in brackets based on 1000 bootstrap replications.

Covariate: Male: Man as the head of the household; Age: Age of the head of the household; H. Size: Size of household; H. Edu: Higher level of education.

is positively associated with income growth. As expected, the magnitude of the coefficients of the two continuous variables

is lower, but also positive. Increasing the age of the head of the household increases the overall household income (second

component). Note that we included the covariates through a linear additive term. Thus, we neglected the possibility that

the relationship between age and earnings follows an inverted U-shape pattern, as is well known in the literature. In other

words, earnings could increase in the early years, reach a peak around middle age, and then decline thereafter. These kinds

of effects could be captured by using polynomial relationships to model the link between the continuous variables and the

variable of interest. Finally, the size of the family seems not to have a statistically significant impact on the income dynamics.

6. Discussion

In this paper we described a general probabilistic framework for modeling continuous time-series, and we integrated

simultaneously many extensions previously presented separately in related literature. We also discussed searching for a

useful model from among the possible solutions offered by the combination of lags and covariates.

Due to its flexibility, the HMTD model seems to be particularly suitable for longitudinal analyses in the social sciences

and related fields. The model is able to consider the observed heterogeneity in the population and can explain the observed

trajectories, making it useful when predicting the next observation in a series or for probabilistic clustering. As in the Latent

Class Growth model (LCGM), each level of the discrete latent variable may represent a group or a subtype of cases. However,

using the HMTD, we have two alternatives. First, we can set the transition matrix, Q, as a diagonal matrix to identify distinct

subgroups following a similar pattern in the whole series, as in the LCGM. Second, we can consider the latent states to be

different subpopulations, without including any constraints on the transition matrix, and allow individuals to move between

latent classes at each time point.

The drawback of the flexibility of the HMTD is the difficulty in finding the correct specification of the model structure: the

number of components and lags, the use of covariates at the hidden and/or observed levels, modeling the standard deviation

of each component, and so on. Given the large number of different models that could be estimated, we proposed

an ad hoc hierarchical strategy. Starting from the simplest possible model, advanced elements are added one at a time, and

at each step the models are compared using information criteria such as the BIC. Finally, we illustrated the model and the

suggested model selection procedure using a real dataset. Using the US Panel Study of Income Dynamics, we analyzed the

trajectories of household income in the United States over four decades.
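
The hierarchical strategy can be sketched as a greedy forward search in which an extension is kept only if it lowers the BIC, computed as -2 log L + k log n. The sketch below is an assumption about how such a loop might be organized, not the authors' implementation: `forward_search` is a hypothetical helper, and `toy_fit` is a stand-in that returns invented log-likelihoods instead of estimating an HMTD model.

```python
import math


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: -2 log L + k log n (lower is better)."""
    return -2.0 * loglik + n_params * math.log(n_obs)


def forward_search(fit, base_spec, extensions, n_obs):
    """Add one extension at a time; keep it only if the BIC decreases."""
    current = dict(base_spec)
    loglik, k = fit(current)
    best = bic(loglik, k, n_obs)
    for name, value in extensions:
        candidate = {**current, name: value}
        loglik, k = fit(candidate)
        score = bic(loglik, k, n_obs)
        if score < best:
            current, best = candidate, score
    return current, best


def toy_fit(spec):
    """Stand-in for a real estimation: returns (log-likelihood, n_params).

    All numbers are invented for illustration.
    """
    loglik, k = -1500.0, 3
    if spec.get("components", 1) == 2:
        loglik, k = loglik + 120.0, k + 4
    if spec.get("mean_lags", 0) == 1:
        loglik, k = loglik + 40.0, k + 2
    if spec.get("hidden_covariate", False):
        loglik, k = loglik + 1.0, k + 2
    return loglik, k


base = {"components": 1, "mean_lags": 0}
steps = [("components", 2), ("mean_lags", 1), ("hidden_covariate", True)]
selected, best_bic = forward_search(toy_fit, base, steps, n_obs=1000)
```

In this toy run the second component and the lag on the mean are retained, while the hidden-level covariate is rejected because its small gain in log-likelihood does not offset the penalty for two extra parameters.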

Another issue related to the complexity of the model is the estimation procedure. The EM algorithm allows an easy es-

timation of the parameters, but it might be quite unstable in high-dimensional settings with many local optima. Therefore,

further investigation is required to test variants of the EM algorithm, such as Classification EM or Stochastic EM, as well as

alternatives such as genetic algorithms.
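
The sensitivity of EM to its starting point can be illustrated with a much simpler model than the HMTD. The sketch below runs plain EM for a univariate Gaussian mixture from several random initializations and keeps the run with the highest log-likelihood. Multiple random restarts are a common safeguard against local optima, used here as a stand-in; they are not one of the variants named above, and all function names and data are assumptions made for the example.

```python
import numpy as np


def _log_joint(x, w, mu, sigma):
    """Log of weight times Gaussian density, shape (n_obs, n_components)."""
    z = (x[:, None] - mu) / sigma
    return np.log(w) - np.log(sigma) - 0.5 * np.log(2 * np.pi) - 0.5 * z ** 2


def _logsumexp_rows(a):
    """Numerically stable log of the row-wise sum of exp(a)."""
    m = a.max(axis=1, keepdims=True)
    return m + np.log(np.exp(a - m).sum(axis=1, keepdims=True))


def em_fit(x, k, rng, n_iter=200, tol=1e-8):
    """Plain EM for a k-component Gaussian mixture from a random start."""
    mu = rng.choice(x, size=k, replace=False)  # initial means drawn from the data
    sigma = np.full(k, x.std())
    w = np.full(k, 1.0 / k)
    loglik = _logsumexp_rows(_log_joint(x, w, mu, sigma)).sum()
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each observation.
        log_p = _log_joint(x, w, mu, sigma)
        resp = np.exp(log_p - _logsumexp_rows(log_p))
        # M-step: weighted updates (small constants guard against empty components).
        nk = resp.sum(axis=0) + 1e-12
        w = nk / nk.sum()
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        sigma = np.sqrt(np.maximum(var, 1e-6))
        new_loglik = _logsumexp_rows(_log_joint(x, w, mu, sigma)).sum()
        converged = new_loglik - loglik < tol
        loglik = new_loglik
        if converged:
            break
    return loglik, w, mu, sigma


def best_of_restarts(x, k, n_starts, seed=0):
    """Run EM from several random starts and keep the highest log-likelihood."""
    rng = np.random.default_rng(seed)
    fits = [em_fit(x, k, rng) for _ in range(n_starts)]
    return max(fits, key=lambda f: f[0]), [f[0] for f in fits]


# Synthetic data: two well-separated groups (hypothetical values).
data_rng = np.random.default_rng(1)
x = np.concatenate([data_rng.normal(-3, 1, 300), data_rng.normal(3, 1, 300)])
(best_ll, w, mu, sigma), all_ll = best_of_restarts(x, k=2, n_starts=10)
```

Comparing the entries of `all_ll` shows whether different starting points reach the same optimum; in high-dimensional models with many local optima they often do not, which is the instability referred to above.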

Future work should empirically investigate the criteria used to identify the number of components and the model spec-

ification in more detail. Numerical criteria such as the BIC can identify an optimal model from sample data during the esti-

mation process. However, in empirical applications, we are more interested in interpretable and generalizable results than

in strictly optimal ones. For instance, adding a component that represents only a few outliers or extreme cases

is interesting from a theoretical point of view, but not necessarily from a practical perspective. These extreme cases might

be due to errors in data entry, drop-out cases, or might represent small, negligible sub-populations. Further analysis can ad-

dress this issue by introducing measures other than the BIC to combine the adequacy of the model with the interpretability

of the results. Finally, further research should also consider the case of continuous time data rather than discrete panel data.

Acknowledgments

This publication benefited from the support of the Swiss National Centre of Competence in Research LIVES Overcoming

vulnerability: life course perspectives, which is financed by the Swiss National Science Foundation. The authors are grateful

to the Swiss National Science Foundation for its financial assistance. We also thank the Associate Editor and the two referees for their

helpful comments.

References

Bartolucci, F., Farcomeni, A., 2010. A note on the mixture transition distribution and hidden Markov models. J. Time Ser. Anal. 31 (2), 132–138.

Basford, K.E., Greenway, D., McLachlan, G.J., Peel, D., 1997. Standard errors of fitted means under normal mixture models. Comput. Statist. 12, 1–17.

Baum, L.E., Petrie, T., 1966. Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Stat. 37 (6), 1554–1563.

Baum, L.E., Petrie, T., Soules, G., Weiss, N., 1970. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains.

Ann. Math. Stat. 41 (1), 164–171.

Berchtold, A., 1999. The double chain Markov model. Comm. Statist. Theory Methods 28 (11), 2569–2589.

Berchtold, A., 2002. High-order extensions of the double chain Markov model. Stoch. Models 18 (2), 193–227.

Berchtold, A., 2003. Mixture transition distribution (MTD) modeling of heteroscedastic time series. Comput. Statist. Data Anal. 41, 399–411.

Berchtold, A., 2004. Optimisation of mixture models: comparison of different strategies. Comput. Statist. 19, 385–406.

Berchtold, A., Raftery, A., 2002. The mixture transition distribution model for high-order Markov chains and non-Gaussian time series. Statist. Sci. 17 (3),

328–359.

Biernacki, C., Celeux, G., Govaert, G., 2000. Stratégies algorithmiques pour maximiser la vraisemblance dans les modèles de mélange. In: Actes des XXXII

Journées de Statistique.

Böhning, D., 2001. The potential of recent developments in nonparametric mixture distributions. In: Proceedings of the 10th International Symposium on

Applied Stochastic Models and Data Analysis.

Boldea, O., Magnus, J.R., 2009. Maximum likelihood estimation of the multivariate normal mixture model. J. Amer. Statist. Assoc. 104 (488), 1539–1549.

Bollerslev, T., Chou, R.Y., Kroner, K.F., 1992. ARCH modeling in finance. A review of the theory and empirical evidence. J. Econometrics 52, 5–59.

Box, G.E., Jenkins, G.M., Reinsel, G.C., 1994. Time Series Analysis, Forecasting and Control. Prentice Hall.

Chariatte, V., Berchtold, A., Akré, C., Michaud, P.-A., Suris, J.-C., 2008. Missed appointments in an outpatient clinic for adolescents, an approach to predict

the risk of missing. J. Adolesc. Health 43 (1), 38–45.

Dannemann, J., Holzmann, H., 2008. Likelihood ratio testing for hidden Markov models under non-standard conditions. Scand. J. Statist. 35 (2), 309–321.

Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39 (1), 1–38.

Dietz, E., Böhning, D., 1996. Statistical inference based on a general model of unobserved heterogeneity. In: Fahrmeir, L., Francis, F., Gilchrist, R., Tutz, G.

(Eds.), Advances in GLIM and Statistical Modeling. In: Lecture Notes in Statistics, Springer, Berlin, Heidelberg, pp. 75–82.

Efron, B., 1979. Bootstrap methods: another look at the jackknife. Ann. Statist. 7 (1), 1–26.

Efron, B., Tibshirani, R.J., 1994. An Introduction to the Bootstrap. CRC Press.

Elliott, R.J., Hunter, W.C., Jamieson, B.M., 1998. Drift and volatility estimation in discrete time. J. Econom. Dynam. Control 22, 209–218.

Frydman, H., Schuermann, T., 2008. Credit rating dynamics and Markov mixture models. J. Bank. Finance 32, 1062–1075.

Gabadinho, A., Ritschard, G., Studer, M., 2011. Analyzing and visualizing state sequences in R with TraMineR. J. Stat. Softw. 40 (4).

Giudici, P., Rydén, T., Vandekerkhove, P., 2000. Likelihood-ratio tests for hidden Markov models. Biometrics 56 (3), 742–747.

Hamilton, J.D., 1989. A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica 57 (2), 357–384.

Hamilton, J.D., 1994. Time Series Analysis. Princeton University Press.

Hassan, M.Y., El-Bassiouni, M.Y., 2013. Modelling Poisson marked point processes using bivariate mixture transition distributions. J. Stat. Comput. Simul.

83 (8), 1440–1452.

Hassan, M.Y., Lii, K.-S., 2006. Modeling marked point processes via bivariate mixture transition distribution models. J. Amer. Statist. Assoc. 101 (475),

1241–1252.

Hayashi, T., 2004. A discrete-time model of high-frequency stock returns. Quant. Finance 4, 140–150.

Helske, J., Eerola, M., Tabus, I., 2010. Minimum description length based hidden Markov model clustering for life sequence analysis. In: Proceedings of the

Third Workshop on Information Theoretic Methods in Science and Engineering.

Hill, M., 1991. The Panel Study of Income Dynamics: A User’s Guide. SAGE Publications.

Hox, J.J., 1995. Applied Multilevel Analysis. TT-Publikaties, Amsterdam.

Kapetanios, G., 2008. A bootstrap procedure for panel data sets with many cross-sectional units. Econom. J. 11 (2), 377–395.

Kass, R.E., Raftery, A.E., 1995. Bayes factors. J. Amer. Statist. Assoc. 90 (430), 773–795.

Kim, D., Kon, S.J., 1994. Alternative models for the conditional heteroscedasticity of stock returns. J. Bus. 67 (4), 563–598.

Kon, S.J., 1984. Models of stock returns: a comparison. J. Finance 39 (1), 147–165.

Kovar, J.G., Rao, J.N.K., Wu, C.F.J., 1988. Bootstrap and other methods to measure errors in survey estimates. Canad. J. Statist. 16, 25.

Le, N.D., Martin, D.R., Raftery, A.E., 1996. Modelling flat stretches, bursts, and outliers in time series using mixture transition distribution models. J. Amer.

Statist. Assoc. 91 (436), 1504–1515.

Leroux, B.G., 1992. Consistent estimation of a mixing distribution. Ann. Statist. 20 (3), 1350–1360.

Le Strat, Y., Carrat, F., 1999. Monitoring epidemiologic surveillance data using hidden Markov models. Stat. Med. 18 (24), 3463–3478.

Lokshin, M., Ravallion, M., 2001. Household income dynamics in two transition economies. World Bank 1–40.

Louis, T.A., 1982. Finding the observed information matrix when using the EM algorithm. J. Roy. Statist. Soc. Ser. B 44 (2), 226–233.

Luo, J., Qiu, H.-B., 2009. Parameter estimation of the WMTD model. Appl. Math. J. Chinese Univ. 24 (4), 379–388.

McLachlan, G.J., Basford, K.E., 1988. Mixture Models: Inference and Applications to Clustering. Marcel Dekker Inc.

McLachlan, G.J., Krishnan, T., 1996. The EM Algorithm and Extensions. John Wiley & Sons, New York.

McLachlan, G., Peel, D., 2000. Finite Mixture Models. In: Wiley Series in Probability and Statistics.

Muthén, B.O., 2001. Second-generation structural equation modeling with combination of categorical and continuous latent variables: new opportunities

for latent class/latent growth modeling. In: Collins, L.M., Sayer, A. (Eds.), New Methods for the Analysis of Change. American Psychological Association,

Washington, DC, pp. 291–322.

Netzer, O., Lattin, J.M., Srinivasan, V., 2008. A hidden Markov model of customer relationship dynamics. Mark. Sci. 27 (2), 185–204.

Newton, M.A., Raftery, A.E., 1994. Approximate Bayesian inference with the weighted likelihood bootstrap. J. Roy. Statist. Soc. Ser. B 56 (1), 3–48.

Rabiner, L.R., 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 257–286.

Raftery, A.E., 1985. A model for high-order Markov chains. J. R. Stat. Soc. Ser. B 47 (3), 528–539.

Raftery, A.E., 1995. Bayesian model selection in social research. Sociol. Methodol. 25, 111–163.

Redner, R.A., Walker, H.F., 1984. Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev. 26, 195–239.

Schwarz, G.E., 1978. Estimating the dimension of a model. Ann. Statist. 6 (2), 461–464.

Schlattmann, P., 2009. Medical Applications of Finite Mixture Models. In: Statistics for Biology and Health. Springer.

Shirley, K.E., Small, D.S., Lynch, K.G., Maisto, S.A., Oslin, D.W., 2010. Hidden Markov models for alcoholism treatment trial data. Ann. Appl. Stat. 4 (1),

366–395.

Stephens, M., 2000. Dealing with label switching in mixture models. J. R. Stat. Soc. Ser. B Stat. Methodol. 62 (4), 795–809.

Viterbi, A.J., 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inform. Theory 13 (2), 260–269.

Weigend, A.S., Mangeas, M., Srivastava, A.N., 1995. Nonlinear gated experts for time series: discovering regimes and avoiding overfitting. Int. J. Neural Syst.

6 (4), 373–399.

Weigend, A.S., Shi, S., 2000. Predicting daily probability distributions of S&P500 returns. J. Forecast. 19, 375–392.

Wellekens, C., 1987. Explicit time correlation in hidden Markov models for speech recognition. In: Proceedings ICASSP. pp. 384–386.

Wong, C.S., Chan, W.S., 2005. Mixture Gaussian time series modelling of long-term market returns. N. Am. Actuar. J.

Wong, C.S., Li, W.K., 2000. On a mixture autoregression model. J. R. Stat. Soc. Ser. B 62, 92–115.

Wong, C.S., Li, W.K., 2001. On a mixture autoregressive conditional heteroscedastic model. J. Amer. Statist. Assoc. 96, 982–995.