arXiv:1908.07963v3 [stat.ME] 13 Jul 2021

This is a preprint. The revised version of this paper is published as

K. Murphy, T. B. Murphy, R. Piccarreta, and I. C. Gormley (2021) “Clustering longitudinal

life-course sequences using mixtures of exponential-distance models”. Journal of the Royal

Statistical Society: Series A (Statistics in Society). [doi: 10.1111/rssa.12712].

Clustering Longitudinal Life-Course Sequences

Using Mixtures of Exponential-Distance Models

Keefe Murphy1, T. Brendan Murphy2,3,

Raffaella Piccarreta4, I. Claire Gormley2,3

1Department of Mathematics and Statistics, Maynooth University, Ireland

2School of Mathematics and Statistics, University College Dublin, Ireland

3Insight Centre for Data Analytics, University College Dublin, Ireland

4Department of Decision Sciences, Università Bocconi, Milano, Italy

E-mail: keefe.murphy@mu.ie

Abstract

Sequence analysis is an increasingly popular approach for analysing life courses rep-

resented by ordered collections of activities experienced by subjects over time. Here,

we analyse a survey data set containing information on the career trajectories of a

cohort of Northern Irish youths tracked between the ages of 16 and 22. We propose

a novel, model-based clustering approach suited to the analysis of such data from a

holistic perspective, with the aims of estimating the number of typical career trajec-

tories, identifying the relevant features of these patterns, and assessing the extent to

which such patterns are shaped by background characteristics.

Several criteria exist for measuring pairwise dissimilarities among categorical se-

quences. Typically, dissimilarity matrices are employed as input to heuristic clustering

algorithms. The family of methods we develop instead clusters sequences directly using

mixtures of exponential-distance models. Basing the models on weighted variants of

the Hamming distance metric permits closed-form expressions for parameter estima-

tion. Simultaneously allowing the component membership probabilities to depend on

ﬁxed covariates and accommodating sampling weights in the clustering process yields

new insights on the Northern Irish data. In particular, we ﬁnd that school examination

performance is the single most important predictor of cluster membership.

Keywords: exponential-distance models, gating covariates, life-course sequences, model-

based clustering, survey sampling weights, weighted Hamming distance.

1 Introduction

Sequence analysis (SA) is an umbrella term for tools deﬁned to explore and describe categor-

ical life-course data. Speciﬁcally, attention is focused on the ordered sequence of states (or

activities) experienced by individuals over a given time-span (usually at T equally spaced

discrete time periods). Here we focus on the transition from school to work for a cohort

of Northern Irish youths, using survey data obtained from the 1999 sweep of the Status

Zero Survey (McVicar, 2000; McVicar and Anyadike-Danes, 2002), henceforth referred to
as the MVAD data, in which, for each individual, a sequence of monthly labour market

activities experienced between the ages of 16 and 22 is recorded.

Typically, the goal of sequence analysis is to identify the most relevant patterns in the

data. To this end, pairwise dissimilarities among sequences in their entirety are ﬁrst assessed.

Dissimilarity matrices are then employed to identify the most typical trajectories using, in

the vast majority of applications, cluster analysis. These problems are receiving increasing

attention in the demographic and social literature, also due to the increasing number of

retrospective as well as prospective longitudinal studies, such as the British Household

Panel Survey (BHPS)¹ and the subsequent larger and more wide-ranging UK Household
Longitudinal Study (Understanding Society)², or the Socio-Economic Panel for Germany
(SOEP)³, the National Longitudinal Surveys for the USA (NLS)⁴, and the Generations &
Gender Programme for selected European countries (GGP)⁵. All of these surveys, much like

the MVAD data considered in this paper, collect information about labour market activities,

as well as other signiﬁcant life events.

Quantifying the distance between such categorical sequences is not a trivial task. Optimal

matching (OM), developed by Abbott and Forrest (1986) and extended to sociology by Ab-

bott and Hrycak (1990), is popular among the SA community. OM is derived from the

edit distance originally proposed in the ﬁeld of information theory and computer science

by Levenshtein (1966). The OM metric assigns costs to the diﬀerent types of edits, namely

insertion, deletion, and substitution. Typically, insertion and deletion are assigned a cost of

1 while substitution costs are allowed to vary. However, specifying these costs often involves

subjective choices, which may lead to violations of the triangle inequality if not done carefully.

Several proposals in the literature introduced criteria to improve or guide the choice of costs

in OM; Muñoz-Bullón and Malo (2003), for instance, estimate the substitution-cost matrix

in a data-driven fashion using the between-states transition rates. Alternative dissimilarity

criteria have also been introduced to allow control over the importance assigned to the char-

acteristics of the sequences (namely, the collection of experienced states, their timing, or

their duration) in the assessment of their diﬀerences: see Studer and Ritschard (2016) for an

excellent discussion. Even so, there are no results proving that one procedure is superior to

others and the choice of dissimilarity measure remains a fundamental question for researchers.

Given a dissimilarity matrix D obtained from a set of sequences S = (s_1, ..., s_n), where n

is the number of subjects, cluster analysis is usually applied to group sequences and identify

the most typical trajectories experienced by the sampled individuals. Heuristic clustering

algorithms, either hierarchical or partitional, are typically used. In many applications, it

is also of interest to relate sequences to a set of baseline covariates. Within the described

framework, this is solely done by relating the uncovered hard clustering partition to covari-

ates using, for example, multinomial logistic regression (MLR). This approach was adopted

in McVicar and Anyadike-Danes (2002), after applying Ward’s agglomerative clustering
algorithm (Ward, 1963) to an OM dissimilarity matrix to obtain G = 5 clusters of the MVAD

sequences, without performing model selection. Such an approach is questionable from a

few points of view. Firstly, the original sequences are substituted by a categorical variable

indicating cluster membership, thus disregarding the heterogeneity within clusters. This

is clearly only sensible when the clusters are sufficiently homogeneous; otherwise, sequences

which are weakly related to clusters would be regarded as similar to those in their cluster.

However, a clear clustering structure can often be obtained only by increasing the number

of clusters (often with some clusters possibly small in size). More importantly, suitable par-

titions do not necessarily lead to suitable response variables as input for the MLR. It thus

seems desirable to cluster sequences and relate the clusters to the covariates simultaneously.

¹ https://www.iser.essex.ac.uk/bhps.
² https://www.understandingsociety.ac.uk.
³ https://www.diw.de/en/soep.
⁴ https://www.bls.gov/nls/.
⁵ https://www.ggp-i.org.

Thus, the aim of our analysis is threefold: to estimate the number of typical trajectories

in the MVAD data, to identify the relevant features of these patterns, and to establish to

what extent such patterns are shaped by the individuals’ background characteristics, as cap-

tured by a set of baseline covariates measured at age 16. To address these issues, we propose

to cluster the MVAD sequences in a model-based fashion, allowing the covariates to aﬀect the

soft cluster membership probabilities, rather than leaving them exogenous to the clustering

model. This permits us to better understand if and to what extent the typical sequence pat-

terns characterising each cluster are aﬀected by speciﬁc covariates. Model-based clustering

methods typically assume that the data arise from a finite mixture of G distributions;

Bouveyron et al. (2019) provide an excellent overview. In principle, any distribution(s)

can be used, though the term ‘model-based clustering’ was popularised by Banﬁeld and

Raftery (1993), in which the component distributions are assumed to be parsimoniously

parameterised multivariate Gaussians with component-speciﬁc parameters. Such models

have been recently extended to the mixture of experts setting (Gormley and Frühwirth-

Schnatter,2019) to facilitate dependence on ﬁxed covariates (Murphy and Murphy,2020).

However, these models can be problematic when applied to dissimilarity matrices, either due

to non-identifiability or because the input data are usually far from Gaussian. This
problem cannot be addressed by applying multidimensional scaling to D because the resulting

low-dimensional conﬁguration is also typically far from Gaussian. Notably, our attempts to

ﬁt non-Gaussian mixtures in these settings did not yield useful results.

Another popular framework for clustering categorical data is latent class analysis (LCA;

Lazarsfeld and Henry 1968). Agresti (2002) shows the connection between model-based

clustering and LCA. Such models are ﬁnite mixtures in which the component distributions

are assumed to be multi-way cross-classiﬁcation tables with all variables mutually indepen-

dent. Latent class regression models (LCR; Dayton and Macready 1988) are particularly

interesting, because their connection to the mixture of experts framework permits the in-

clusion of covariates to predict the latent class memberships. However, ﬁtting such models

is challenging when the sequence length, the number of categories, or the number of latent

classes are even moderately large, due to the explosion in the number of parameters.

Evidently, there is a conﬂict of perspectives between the model-based and the heuristic,

distance-based approaches to clustering in the SA community. For this reason, and the others

mentioned above, we model the sequences directly (in the sense that the sequences them-

selves are treated as inputs, rather than D) with the implicit substitution costs which deﬁne

the distance metric being estimable parameters of a generative probability model rather than

inputs (either estimated or subjectively speciﬁed), via D, to a heuristic clustering algorithm.

This is achieved using parsimonious mixtures of exponential-distance models, which typi-

cally depend on a central sequence and a precision parameter in a way that relates to the

chosen distance metric. Our framework for analysing the MVAD data, as a model-based

approach which nonetheless relies on distances, thus reconciles the aforementioned conﬂict.

Mostly for reasons of computational convenience, we use dissimilarities based on sim-

ple matching, in particular the Hamming distance (Hamming,1950). Although the focus

on substitution operations has the sociological advantage of targeting trajectories with con-

temporaneous similarities — in contrast to the prohibited insertion and deletion operations,

which focus on matching states irrespective of their timing — this distance is liable to suﬀer

from temporal rigidity, since anticipations and/or postponements of the same choices in life

courses are not accounted for. Hence, similar sequences shifted by one time period may be

maximally distant from one another. While misalignment is less of a concern for sequences

exhibiting long durations in the same state, we address the issue using weighted variants of

the Hamming distance, characterised by a range of constraints on the precision parameters

in the mixture setting. This leads to the novel MEDseq model family, which can be seen

as similar to a version of the k-medoids/PAM algorithm (Kaufman and Rousseeuw,1990,

Chapter 2) based on the Hamming distance with some restrictions relaxed. We defer the

comparison to Section 4.2 as the parallels relate to technical issues of model estimation.
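The temporal rigidity of the Hamming distance described above can be illustrated with a minimal sketch. The toy trajectories and the helper name `hamming` are hypothetical and are not taken from the MVAD data or the MEDseq package; the sketch only contrasts rapidly alternating sequences with sequences exhibiting long spells in the same state.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must share a common length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical toy trajectories alternating between school (SC) and employment (EM).
s1 = ["SC", "EM"] * 3          # SC, EM, SC, EM, SC, EM
s2 = ["EM", "SC"] * 3          # the same pattern shifted by one time period

# Hypothetical trajectories with long spells in the same state.
s3 = ["SC"] * 3 + ["EM"] * 3   # SC, SC, SC, EM, EM, EM
s4 = ["SC"] * 4 + ["EM"] * 2   # the same transition postponed by one period

d_alternating = hamming(s1, s2)  # every position mismatches
d_long_spells = hamming(s3, s4)  # only the postponed period mismatches

print(d_alternating, d_long_spells)
```

Under these assumptions, the shifted alternating sequences are maximally distant while the sequences with long durations differ in a single position, which is the sense in which misalignment is less of a concern for the latter.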

Importantly, information is also available with the MVAD data on the survey sampling

weights, which are only incorporated in the MLR stage of the analysis in McVicar and

Anyadike-Danes (2002). While sampling weights can be incorporated into heuristic cluster-

ing algorithms, such as Ward’s hierarchical clustering (by weighting the linkages between

clusters) or k-medoids, and subsequently in the MLR, one of the advantages of our ap-

proach is that both the covariates and the weights are incorporated simultaneously. This is

achieved by leveraging the model-based paradigm; the weights are incorporated by appro-

priately weighting the likelihood function and the covariates are incorporated by assuming

they inﬂuence the soft component membership probabilities.

MEDseq models, like standard SA heuristic clustering algorithms and LCA models, ap-

proach the clustering task from the holistic perspective of treating trajectories as whole units

of analysis, in order to uncover groups of similar sequences. In contrast, a number of mul-

tistate models employing ﬁnite mixtures with Markov components (e.g. Melnykov 2016a;

Pamminger and Frühwirth-Schnatter 2010) or with hidden Markov components (Helske

et al.,2016) have recently attained popularity for the analysis of categorical sequence data.

Such models focus on modelling instantaneous transitions within the life course and on fac-

tors that might explain the probability of experiencing them. As described by Wu (2000),

this amounts to a diﬀerence between considering sequences in their entirety under the MED-

seq framework or as time-to-event processes under the Markovian framework. Indeed, as

our aim is to establish sequence typologies for the MVAD data, a holistic approach is prefer-

able to Markovian approaches. The former concentrates on questions of global similarities

and considers the full richness of the trajectories without discarding the details of episode

ordering, duration, or transition (Muñoz-Bullón and Malo,2003), while the latter frame-

work makes the often unsuitable simplifying assumption that trajectories can be eﬃciently

summarised only by their recent past (Piccarreta and Studer,2019).

The remainder of the article is organised as follows. Section 2 presents some
exploratory analysis of the MVAD data. Section 3 develops the MEDseq family of mixtures of
exponential-distance models. Section 4 describes the model fitting procedure and discusses
factors affecting performance. Section 5 presents results for the MVAD data, including
applications of MEDseq models and comparisons to other methods. The insights gleaned from
the MVAD data under the optimal MEDseq model are summarised in Section 6. The paper
concludes with a discussion on the MEDseq methodology and potential future extensions in
Section 7. A software implementation of the full MEDseq model family is provided by the
associated R package MEDseq (Murphy et al., 2021). The package was developed specifically for
this application and is available from https://www.r-project.org (R Core Team, 2021).

2 Status Zero Survey: MVAD Data

The term ‘MVAD data’ refers throughout to a cohort of n = 712 Northern Irish youths

aged 16 and eligible to leave compulsory education as of July 1993 who were observed at

monthly intervals until June 1999 as part of the Status Zero Survey (Armstrong et al.,

1997; McVicar, 2000; McVicar and Anyadike-Danes, 2002). The subjects were interviewed

about the labour market activities they experienced, distinguishing between employment

(EM), further education (FE), higher education (HE), joblessness (JL), school (SC), or

training (TR). Each observation i is represented by an ordered categorical sequence of length
T = 72, with an alphabet V of size v = 6 possible states, e.g.
s_i = (s_{i,1}, s_{i,2}, ..., s_{i,72})^⊤ = (SC, SC, ..., TR, TR, ..., EM, EM)^⊤.
The sequences share a common length, the time periods

are equally spaced, and there are no missing data. In the context of the Northern Irish

education system at the time, SC refers to secondary school, which may be a grammar school

to which entrance is granted upon completion of an exam. At age 16, students take General

Certiﬁcate of Secondary Education (GCSE) examinations; students who do well are eligible

to continue in school for a further two years (to sit A-level exams) or to leave, e.g. to a

training/apprenticeship programme (TR). Further education (FE) is distinguished from

higher education (HE); FE typically refers to applied post-GCSE courses while HE refers

to third-level/university courses, typically pursued at age 18 after the successful completion

of A-level exams. Notably, the transitions HE → SC and TR → HE are never observed.

It is of interest to relate the MVAD sequences to covariates in order to understand

whether diﬀerent characteristics (related to gender, community, geographic and social con-

ditions, and personal abilities) impact on the school-to-work trajectories. These covariates

are summarised in Table 1. All covariates were measured at the age of 16 (i.e. at the start

of the study period in July 1993), with the exception of ‘Funemp’ and ‘Livboth’, and are

thus static background characteristics. As achieving 5 or more grades at A–C in GCSE
exams is the traditional cut-off point for progression to the additional two years of secondary

school required for a transition to HE, we expect the ‘GCSE5eq’ covariate in particular

to be strongly associated with the clustering. The MVAD data also come with associated

observation-speciﬁc survey sampling weights, which depend on the ‘Grammar’ and ‘Loca-

tion’ covariates. Speciﬁcally, the sample was stratiﬁed in such a way that a predetermined

number of subjects were in each state, for each location and both school types, immediately

after the end of the compulsory education period (Armstrong et al.,1997).

Table 1: Available covariates for the MVAD data set. For binary covariates, the event denoted by 1 is
indicated. Otherwise, the levels of the categorical covariate ‘Location’ are grouped in curly brackets.

Covariate   Description
---------   -----------
Catholic    1 = yes
FMPR        SOC code of father’s current or most recent job as of the beginning of the survey,
            1 = SOC1 (Standard Occupational Classification: professional, managerial, or related)
Funemp      Father’s employment status as of June 1999, 1 = employed
GCSE5eq     Qualifications gained by the end of compulsory education,
            1 = 5+ GCSE grades at A–C, or equivalent
Gender      1 = male
Grammar     Type of secondary education, 1 = grammar school
Livboth     Living arrangements as of June 1995, 1 = living with both parents
Location    Location of school, one of five Education and Library Board areas in Northern Ireland,
            {Belfast, North Eastern, South Eastern, Southern, Western}

The MVAD data are available in the R packages MEDseq and TraMineR (Gabadinho et al.,
2011). As the data have been used to illustrate some of the functionalities of the TraMineR
package in its associated vignette⁶, interesting features of an exploratory analysis of the data
can be found therein. However, we reproduce plots of the transversal state distributions in
Figure 1 and the transversal entropies in Figure 2, i.e. the Shannon entropies of the state
distributions at each time point (Billari, 2001), with the sampling weights accounted for in
both cases. Notably, fewer than v states are observed in certain months.
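As a minimal sketch of how weighted transversal entropies of this kind might be computed, the following uses only the Python standard library. The function name `transversal_entropy` and the toy data are hypothetical, and dividing by log v so that the index lies between 0 and 1 is an assumption suggested by the scale of the entropy plot rather than something stated in the text.

```python
import math
from collections import defaultdict


def transversal_entropy(sequences, weights, alphabet_size):
    """Weighted Shannon entropy of the state distribution at each time point.

    Each entropy is divided by log(alphabet_size) so that it lies in [0, 1];
    this normalisation is an assumption of the sketch.
    """
    n_periods = len(sequences[0])
    total_weight = sum(weights)
    entropies = []
    for t in range(n_periods):
        mass = defaultdict(float)
        for seq, w in zip(sequences, weights):
            mass[seq[t]] += w  # weighted count of each state observed at time t
        h = -sum((m / total_weight) * math.log(m / total_weight)
                 for m in mass.values() if m > 0)
        entropies.append(h / math.log(alphabet_size))
    return entropies


# Hypothetical toy data: three subjects, three time points, three possible states.
seqs = [("SC", "SC", "EM"), ("SC", "TR", "EM"), ("SC", "EM", "EM")]
w = [1.0, 1.0, 1.0]
ent = transversal_entropy(seqs, w, alphabet_size=3)
print(ent)
```

In this toy example the first and last time points have zero entropy because all subjects share a state, while the middle time point attains the maximum because all three states are equally represented.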

Figure 1 shows that the number of subjects who found employment increased over time.
Conversely, fewer students were in training or further education by the end of the
observation period. Most students appear to have entirely left school within two to three years of the
commencement of the survey. Finally, while students only reached the age of 18 and began
to pursue higher education from July 1995 onwards, a number of students had already
pursued further education during the two preceding years. Figure 2 confirms that the level of
heterogeneity in the state distribution varies over time. In particular, the entropy declines
after September 1995, by which point most students had left school.

⁶ https://cran.r-project.org/web/packages/TraMineR/vignettes/TraMineR-state-sequence.pdf.

Figure 1: Overall state distribution for the weighted MVAD data, coloured by state.
[Plot omitted. Horizontal axis: time, with tick labels from Jul.93 to Jul.98. Vertical axis: weighted
proportions, from 0 to 1. States: employment, further education, higher education, joblessness,
school, training.]

Figure 2: Transversal entropy plot for the weighted MVAD data.
[Plot omitted. Horizontal axis: time, with tick labels from Jul.93 to Jul.98. Vertical axis: weighted
entropy index, from 0 to 1.]

Interestingly, many students were jobless during the ﬁrst two months of observation. As

the vast majority of cases notably remained in the same state in this period, which coincided

with the summer break from school, all subsequent analyses are conducted on a version of

the data with the ﬁrst two time points removed. Hence, we work hereafter with sequences of

length T = 70, commencing with the return to school in September 1993. As the sampling

design depends on ‘Grammar’ and ‘Location’, the term ‘all covariates’ henceforth refers

to all other covariates in Table 1. While Murphy and Murphy (2020) show that the same

covariate can aﬀect more than one part of a mixture of experts model, and in diﬀerent ways,

removing the quantities used to deﬁne the weights eases the interpretability of the results.

3 Modelling

In this section, we introduce the novel family of MEDseq models. The exponential-distance

model is described in Section 3.1, extended to account for the sampling weights in Section

3.2, expanded into a family of mixtures in Section 3.3, and ﬁnally embedded within the mix-

ture of experts framework in Section 3.4 in order to accommodate the available covariates.

3.1 Exponential-Distance Models

For an arbitrary distance metric d(·,·), location parameter θ, and precision parameter λ, the

probability mass function (PMF) of an exponential-distance model (EDM) for sequences is

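$$
f(\mathbf{s}_i \mid \boldsymbol{\theta}, \lambda, d)
= \frac{\exp\left(-\lambda\, d(\mathbf{s}_i, \boldsymbol{\theta})\right)}
       {\sum_{\boldsymbol{\sigma} \in \mathcal{S}_v^T} \exp\left(-\lambda\, d(\boldsymbol{\sigma}, \boldsymbol{\theta})\right)}
= \Psi(\lambda, \boldsymbol{\theta} \mid T, v)^{-1} \exp\left(-\lambda\, d(\mathbf{s}_i, \boldsymbol{\theta})\right),
\tag{1}
$$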

with the corresponding log-likelihood function given by

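$$
\ell(\boldsymbol{\theta}, \lambda \mid \mathbf{S}, d)
= \sum_{i=1}^{n} \log f(\mathbf{s}_i \mid \boldsymbol{\theta}, \lambda, d)
= -\lambda \sum_{i=1}^{n} d(\mathbf{s}_i, \boldsymbol{\theta}) - n \log \Psi(\lambda, \boldsymbol{\theta} \mid T, v).
\tag{2}
$$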

Such a model is analogous to the Gaussian distribution (characterised by the squared

Euclidean distance from the mean) and similar to the Mallows model for permutations (Mal-

lows,1957). Indeed, mixtures of Mallows models have been used to cluster rankings (Murphy

and Martin, 2003). We only consider models with λ ≥ 0. When λ = 0, the distribution
of sequences is uniform. For λ > 0, the central sequence θ = (θ_1, ..., θ_T) is the mode, i.e.
the sequence with highest probability, and the probability of any other sequence decays
exponentially as its distance from θ increases. The precision parameter λ controls the speed
of this decay. Larger λ values cause sequences to concentrate around θ, tending toward a
point-mass as λ → ∞. Notably, λ is not identifiable when all sequences are identical.
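The behaviour of the EDM described above can be checked on a toy example by normalising the PMF in (1) by brute force. This is only a sketch under stated assumptions: the function names `hamming` and `edm_pmf`, the three-state alphabet, and the central sequence are hypothetical, the Hamming distance is used purely as a convenient example of d(·,·), and exhaustive enumeration is only feasible for very short sequences.

```python
import math
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def edm_pmf(seq, theta, lam, alphabet, dist=hamming):
    """PMF of an exponential-distance model, normalised by summing over
    all v**T possible sequences (only practical for tiny T and v)."""
    norm = sum(math.exp(-lam * dist(sigma, theta))
               for sigma in product(alphabet, repeat=len(theta)))
    return math.exp(-lam * dist(seq, theta)) / norm


alphabet = ("EM", "JL", "SC")          # hypothetical alphabet, v = 3
theta = ("SC", "SC", "EM", "EM")       # hypothetical central sequence, T = 4
all_seqs = list(product(alphabet, repeat=len(theta)))

# With lambda = 0, every one of the 3**4 = 81 sequences is equally likely.
uniform = edm_pmf(theta, theta, 0.0, alphabet)

# With lambda > 0, the probabilities sum to one and theta is the mode.
probs = {s: edm_pmf(s, theta, 1.5, alphabet) for s in all_seqs}
mode = max(probs, key=probs.get)
print(uniform, sum(probs.values()), mode)
```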

The log-likelihood in (2) is generally intractable, as the normalising constant Ψ(λ, θ | T, v)
depends on the parameter λ (under OM and other, more complicated distances, it can
also depend on θ), as well as the fixed constants T > 1 and v > 1, and requires a sum
over all possible sequences. With reference to the MVAD data, for example, computing
Ψ(λ, θ | T, v) is practically infeasible as there are card(S_v^T) = v^T = 6^70 possible sequences.
Fortunately, however, the normalising constant exists in closed form under the Hamming
distance, d_H(s_i, s_j) = Σ_{t=1}^T 1(s_{i,t} ≠ s_{j,t}), in a manner which facilitates direct enumeration
and crucially does not depend on θ, as a sum with only T + 1 terms. Consider, for example,
the Hamming distances between all ternary (v = 3) sequences of length T = 4. From the
arbitrary reference sequence (0, 0, 0, 0), there is 1 count of a distance of 0, 8 counts of a
distance of 1, 24 counts of a distance of 2, 32 counts of a distance of 3, and 16 counts of a
distance of 4. Thus, Ψ_H(λ | T = 4, v = 3) = e^0 + 8e^{−λ} + 24e^{−2λ} + 32e^{−3λ} + 16e^{−4λ}. Hence,
the normalising constant under the Hamming distance metric depends on the parameter λ,
the sequence length T, and the number of categories v, and simplifies greatly:

$$
\Psi_H(\lambda \mid T, v)
= \sum_{p=0}^{T} \binom{T}{p} (v-1)^{p} \exp(-\lambda p)
= \left( (v-1)\, e^{-\lambda} + 1 \right)^{T}.
\tag{3}
$$
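The ternary example and the closed form in (3) can be verified numerically. The following sketch enumerates all 3^4 sequences, tabulates their Hamming distances from the reference sequence, and compares the enumerated sum, the binomial sum, and the closed form. The function name `psi_closed_form` and the value of λ are arbitrary choices made for illustration.

```python
import math
from collections import Counter
from itertools import product


def psi_closed_form(lam, T, v):
    """Closed-form normalising constant under the Hamming distance, as in (3)."""
    return ((v - 1) * math.exp(-lam) + 1) ** T


T, v, lam = 4, 3, 0.7           # lam is an arbitrary illustrative value
reference = (0,) * T            # arbitrary reference sequence (0, 0, 0, 0)

# Tabulate how many of the v**T sequences lie at each Hamming distance.
counts = Counter(sum(a != b for a, b in zip(sigma, reference))
                 for sigma in product(range(v), repeat=T))

# Normalising constant by direct enumeration of the distance counts.
psi_enumerated = sum(c * math.exp(-lam * p) for p, c in counts.items())

# Normalising constant via the binomial sum with T + 1 terms.
psi_binomial = sum(math.comb(T, p) * (v - 1) ** p * math.exp(-lam * p)
                   for p in range(T + 1))

print(dict(sorted(counts.items())))
print(psi_enumerated, psi_binomial, psi_closed_form(lam, T, v))
```

The count at distance p equals the binomial coefficient of T and p multiplied by (v − 1)^p, since p mismatching positions are chosen and each can take any of the v − 1 other states.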

Inspired by the generalised Mallows model (Irurozki et al.,2019), the EDM in (1) based

on the Hamming distance can be extended to one based on the weighted Hamming distance.

By introducing T precision parameters λ_1, ..., λ_T, one for each time point (i.e. sequence
position), and expressing the exponent in (1) as d_WH(s_i, θ | λ_1, ..., λ_T) = Σ_{t=1}^T λ_t 1(s_{i,t} ≠ θ_t)
rather than λ d_H(s_i, θ) = λ Σ_{t=1}^T 1(s_{i,t} ≠ θ_t), different time periods can contribute differently
to the overall distance, weighted according to the period-specific precision parameters. Thus,
the distance from θ to s_i under d_WH(·, · | ·) becomes a sum of the λ_t values associated with
each time point which differs from the corresponding θ_t, across the whole sequence. This acts

as implicit variable selection and allows modelling situations in which there is high consensus

regarding the state values of some time periods, with large uncertainty about the values of

others. Accounting for the alignment of contemporaneous matchings in this way helps to
prevent sequences with the same (Hamming) distance from θ from having the same probability.
Given that sequences equidistant from θ can nevertheless exhibit element-wise mismatches
between themselves, this may help later, in the mixture setting, to induce stronger
between-cluster separation and within-cluster homogeneity. The non-constant transversal entropies
in Figure 2 suggest that this extension may also be fruitful in terms of capturing different
degrees of dispersion in the state distributions of the MVAD data over time. Crucially, the
various benefits outlined above can be achieved without any tractability sacrifices. The
log-likelihood in (2) is merely rewritten with the weighted Hamming distance decomposed
into its T components and the normalising constant in (3) also modified accordingly:

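$$
\ell(\boldsymbol{\theta}, \lambda_1, \ldots, \lambda_T \mid \mathbf{S}, d_{WH})
= -\sum_{i=1}^{n} \left[ \sum_{t=1}^{T} \left( \lambda_t \mathbb{1}(s_{i,t} \neq \theta_t)
  + \log\left( (v-1)\, e^{-\lambda_t} + 1 \right) \right) \right].
$$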
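A minimal sketch of this log-likelihood follows, assuming that the logarithmic term is summed over time points together with the mismatch term. The function name `loglik_weighted_hamming` and the toy values are hypothetical. The sketch checks two properties implied by the text: the implied PMF sums to one over all sequences, and setting all λ_t equal recovers the log-likelihood in (2) with the normalising constant in (3).

```python
import math
from itertools import product


def loglik_weighted_hamming(sequences, theta, lambdas, v):
    """Log-likelihood of an EDM under the weighted Hamming distance,
    with one precision parameter per time point."""
    total = 0.0
    for seq in sequences:
        for s_t, th_t, lam_t in zip(seq, theta, lambdas):
            # Mismatch penalty plus the per-period log normalising term.
            total -= lam_t * (s_t != th_t) + math.log((v - 1) * math.exp(-lam_t) + 1)
    return total


v = 3
theta = (0, 1, 2)                 # hypothetical central sequence, T = 3
lambdas = (0.5, 1.0, 2.0)         # hypothetical period-specific precisions

# The implied probabilities over all v**T sequences should sum to one.
total_prob = sum(math.exp(loglik_weighted_hamming([s], theta, lambdas, v))
                 for s in product(range(v), repeat=len(theta)))

# With a common lambda, the unweighted Hamming log-likelihood is recovered.
data = [(0, 1, 2), (0, 0, 2), (1, 1, 1)]   # Hamming distances 0, 1, and 2 from theta
lam = 0.8
lhs = loglik_weighted_hamming(data, theta, (lam,) * 3, v)
rhs = -lam * 3 - len(data) * 3 * math.log((v - 1) * math.exp(-lam) + 1)
print(total_prob, lhs, rhs)
```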

Though other dissimilarity measures are available for sequences, we henceforth consider

measures based on the Hamming distance only, chieﬂy for the computational reasons outlined

above. In our setting, λ d_H(·, ·) can be seen as a special case of OM with all substitution costs
set to λ and no insertions or deletions. As it has time-varying substitution costs, d_WH(·, · | ·)
is similar to the dynamic Hamming distance (Lesnard, 2010), a prominent alternative to
OM. However, such costs in our models are always assumed to be common with respect
to each pair of states. Hence, d_WH(·, · | ·) equates to the Gower distance between nominal
variables (Gower, 1971) with equally weighted states and unequally weighted time points.

3.2 Incorporating Sampling Weights

Sampling weights are often associated with life-course data, as the data typically arise from

surveys where the weights are used to correct for representivity bias under stratiﬁed sampling

designs. Following Chambers and Skinner (2003), the sampling weights w = (w_1, ..., w_n)
are incorporated into the EDM by exponentiating the likelihood of each sampled unit by the
attached weight w_i, which is akin to unit i being observed w_i times. The resultant pseudo
likelihood L_w(· | ·) reweights the likelihood contribution for each unit in order to rebalance
the information in the observed sample to approximate the balance of information in the
target finite population. The sampling weights w are thus interpretable as being inversely
proportional to the unit inclusion probabilities, remain fixed, and are confined to those
included in the sample. Notably, f(s_i | θ, λ, d)^{w_i} ∝ f(s_i | θ, w_i λ, d), such that the weights
induce a unit-specific rescaling of the precision parameter; it follows that the observed data
are independent but not identically distributed.

A secondary beneﬁt of incorporating weights is that it facilitates computational gains

in the presence of duplicate cases. Such duplicates are likely when dealing with discrete

life-course data. This non-uniqueness can be exploited using likelihood weights for compu-

tational eﬃciency, by ﬁtting models to the subset of unique sequences only, weighted by the

sum of the sampling weights (if available, otherwise w_i = 1 ∀ i) across each corresponding set
of duplicates. In modifying w in this way, cases with different sampling weights which are

otherwise duplicates are also treated as duplicates, in such a way that the (pseudo) likelihood

is unaltered. The number of duplicates clearly decreases when considering both the sequences

themselves and their associated covariate patterns. In particular, all cases are unique when

there are continuous covariates. Nonetheless, in the MVAD data, and in many applications,

the covariates are all categorical. Hence, exploiting non-uniqueness in this manner can be

extremely computationally convenient. For instance, only 490 of the n = 712 sequences in

the MVAD data set are distinct. However, to avoid notational confusion, all subsequent

expressions are written as though duplicate cases have not been discarded.
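The claim that collapsing duplicates leaves the pseudo likelihood unaltered can be illustrated with a short sketch. It assumes a single-component EDM under the Hamming distance with no covariates. The function names `pseudo_loglik` and `collapse_duplicates`, the toy sequences, and the weights are all hypothetical and do not reflect the MEDseq package's implementation.

```python
import math
from collections import defaultdict


def pseudo_loglik(sequences, weights, theta, lam, v):
    """Weighted (pseudo) log-likelihood of a single EDM under the Hamming distance."""
    log_psi = len(theta) * math.log((v - 1) * math.exp(-lam) + 1)
    return sum(w * (-lam * sum(a != b for a, b in zip(s, theta)) - log_psi)
               for s, w in zip(sequences, weights))


def collapse_duplicates(sequences, weights):
    """Keep each distinct sequence once, weighted by the summed weights of its copies."""
    summed = defaultdict(float)
    for s, w in zip(sequences, weights):
        summed[tuple(s)] += w
    return list(summed.keys()), list(summed.values())


# Hypothetical toy sample with duplicated sequences and unequal sampling weights.
seqs = [("SC", "SC", "EM"), ("SC", "EM", "EM"), ("SC", "SC", "EM"),
        ("SC", "EM", "EM"), ("TR", "TR", "EM")]
wts = [0.8, 1.2, 1.1, 0.9, 1.0]
theta = ("SC", "SC", "EM")

uniq, uniq_w = collapse_duplicates(seqs, wts)
full = pseudo_loglik(seqs, wts, theta, lam=1.3, v=6)
reduced = pseudo_loglik(uniq, uniq_w, theta, lam=1.3, v=6)
print(len(seqs), len(uniq), full, reduced)
```

Because the weighted log-likelihood is linear in the weights, evaluating it on the distinct sequences with summed weights gives the same value as evaluating it on the full sample, up to floating-point error.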

Though the weights for the MVAD data sum to ≈711.52, we henceforth follow Xu et al.

(2013) in always assuming that the weights have been normalised to sum to the sample size

n. In doing so, subsequent expressions are simpliﬁed further and the use of model selection

criteria (see Section 4.3) relying on the pseudo likelihood is facilitated. While the resultant

rescaling of the MVAD weights is negligible, we note that multiplying w by a scalar does

not aﬀect parameter estimation.

3.3 A Family of Mixtures of Exponential-Distance Models

Extending the EDM based on the Hamming distance with sampling weights to the model-

based clustering setting yields a pseudo likelihood function of the form

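$$
L_w(\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_G, \lambda \mid \mathbf{S}, \mathbf{w}, d_H)
= \prod_{i=1}^{n} \left[ \sum_{g=1}^{G} \tau_g\,
  \frac{\exp\left(-\lambda\, d_H(\mathbf{s}_i, \boldsymbol{\theta}_g)\right)}
       {\left( (v-1)\, e^{-\lambda} + 1 \right)^{T}} \right]^{w_i},
$$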

where the mixing proportions τ_1, ..., τ_G are positive and sum to 1. Thus, the clustering

approach is both model-based and distance-based, thereby bridging the gap between these

two ‘cultures’ in the SA community.
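As a sketch of how the logarithm of this pseudo likelihood might be evaluated, the following assumes a single precision parameter λ shared by all components, as in the expression above. The function name `mixture_pseudo_loglik` and all toy inputs are hypothetical. The case λ = 0 serves as a sanity check: every component is then uniform, so the value reduces to −(Σ_i w_i) T log v regardless of the central sequences and mixing proportions.

```python
import math


def mixture_pseudo_loglik(sequences, weights, thetas, taus, lam, v):
    """Log of the weighted pseudo likelihood of a mixture of Hamming-based EDMs
    with a single precision parameter shared across components."""
    T = len(thetas[0])
    psi = ((v - 1) * math.exp(-lam) + 1) ** T      # closed-form normalising constant
    total = 0.0
    for s, w in zip(sequences, weights):
        density = sum(tau * math.exp(-lam * sum(a != b for a, b in zip(s, th))) / psi
                      for tau, th in zip(taus, thetas))
        total += w * math.log(density)             # exponent w_i becomes a multiplier
    return total


# Hypothetical toy inputs: G = 2 components, T = 3, v = 6.
seqs = [("SC", "SC", "EM"), ("TR", "TR", "EM"), ("SC", "EM", "EM")]
wts = [0.9, 1.1, 1.0]
thetas = [("SC", "SC", "EM"), ("TR", "TR", "EM")]
taus = [0.6, 0.4]
v = 6

ll = mixture_pseudo_loglik(seqs, wts, thetas, taus, lam=1.2, v=v)
ll_uniform = mixture_pseudo_loglik(seqs, wts, thetas, taus, lam=0.0, v=v)
expected_uniform = -sum(wts) * 3 * math.log(v)
print(ll, ll_uniform, expected_uniform)
```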

The mixture setting naturally suggests a further extension, whereby the precision
parameter λ can be constrained or unconstrained across clusters, in addition to the aforementioned

allowance for the precision parameters to vary (or not) across time points. Within a family of

models we term ‘MEDseq’, we thus define the CC, UC, CU, and UU models, where the first

letter denotes whether precision parameters are constrained (C) or unconstrained (U) across

clusters and the second denotes the same across time points. Notably, all models deviate

from the simple matching distance on which they are based, as even the most constrained CC

model could be said to employ a weighted variant thereof, by virtue of allowing for λ ≠ 1.

The model family allows moving between more parsimonious models and more heavily pa-

rameterised, ﬂexible models which may provide a better ﬁt to the data. As the precision

parameters relate to the substitution costs characterising variants of the Hamming distance,

quantities used to deﬁne the overall distance measure are allowed to vary in diﬀerent ways,

while still being treated as model parameters rather than inputs. In particular, models with

names beginning with U reflect scenarios in which the implicit substitution costs differ across

clusters. Hence, the UU model is analogous to the hierarchical Ward$_p$ algorithm (de Amorim,

2015), in the sense of having cluster-specific feature weights (albeit with no tuning required).

Given the role played by λ when it takes the value 0, whereby the state distribution is uniform,

it is convenient and natural to include a noise component (denoted by N), whose single

precision parameter is fixed to 0, to robustify inference by capturing deviant cases and minimising

their deleterious effects on parameter estimation for the other, more defined clusters.

Adding this extension to each of the 4 models above, regardless of how precision parameters

are otherwise specified, completes the MEDseq model family with the CCN, UCN, CUN, and

UUN models. When G = 1, the CC, CU, and CCN models can be fitted. When G = 2, the

UCN and UUN models are equivalent to the CCN and CUN models, respectively, as there

is only one non-noise component. As the noise component arises naturally from restricting

the parameter space, we consider the noise component as one of the G components, denoted

hereafter with the subscript 0. All 8 model types are summarised further in Appendix A.
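
One way to picture the eight model types is as constraints on a G × T matrix of precision parameters. The sketch below is illustrative only: the helper names are hypothetical, the noise component is placed in the last row, and the normalising constant for time-varying precisions is assumed to factorise over time points as the product of $(v-1)e^{-\lambda_t}+1$ terms, which reduces to the constant of Section 3.3 when λ does not vary over time.

```python
import numpy as np

def precision_matrix(model_type, G, T, free_values):
    """Expand the free precision parameters of a model type to a full (G, T) matrix.

    model_type: 'CC', 'UC', 'CU' or 'UU', optionally followed by 'N' (noise).
    free_values: values with as many entries as the non-noise block has free parameters.
    """
    noise = model_type.endswith("N")
    G_edm = G - 1 if noise else G  # number of non-noise components
    shape = {"CC": (1, 1), "UC": (G_edm, 1),
             "CU": (1, T), "UU": (G_edm, T)}[model_type[:2]]
    free = np.asarray(free_values, dtype=float).reshape(shape)
    lam = np.broadcast_to(free, (G_edm, T))
    if noise:
        lam = np.vstack([lam, np.zeros((1, T))])  # noise row: precision fixed at 0
    return np.array(lam)

def log_density(s, theta, lam_row, v):
    """Log-density under time-specific precisions (assumed product-form constant)."""
    mismatch = np.asarray(s) != np.asarray(theta)
    return float(-(lam_row * mismatch).sum()
                 - np.log((v - 1) * np.exp(-lam_row) + 1).sum())

lam = precision_matrix("UCN", G=3, T=4, free_values=[0.5, 1.5])
# Under the noise row, every sequence has the same density v^(-T).
noise_logdens = log_density((0, 1, 2, 0), (1, 1, 1, 1), lam[2], v=3)
```

With a precision of 0 the central sequence has no influence on the density, which is why the noise component needs no location parameter.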

3.4 Incorporating Covariates

We now illustrate how to incorporate the available covariate information into the clustering

process, both to guide the construction of the clusters and to ﬁnd the typical trajectories

which can be best predicted by covariates. As is typical for model-based clustering analyses,

the data are augmented in MEDseq models by introducing a latent cluster membership

indicator vector $z_i = (z_{i,1}, \ldots, z_{i,G})^\top$, where $z_{i,g} = 1$ if observation i belongs to cluster g and

$z_{i,g} = 0$ otherwise. The MEDseq approach can be easily extended to incorporate the possible

effects of covariates on the assignments of sequences to clusters by allowing the covariates

to influence the distribution of the latent variable $z_i$. Thus, such covariates are interpreted

diﬀerently from those used to deﬁne the sampling weights, if any.

The inclusion of covariates is achieved under the mixture of experts framework (Jacobs

et al., 1991; Gormley and Frühwirth-Schnatter, 2019), by extending the mixture model to

allow the mixing proportions for observation i to depend on covariates $x_i$. This, rather than

having covariates enter through the component distributions, is particularly attractive, as

the interpretation of the remaining component-speciﬁc parameters is the same as it would

be under a model without covariates. For example, in the case of the CC MEDseq model

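$$
f(s_i \mid \theta_1,\ldots,\theta_G,\lambda,\beta_1,\ldots,\beta_G,x_i,d_H)^{w_i} = \left[\sum_{g=1}^{G}\tau_g(x_i \mid \beta_g)\,\frac{\exp\left(-\lambda\, d_H(s_i,\theta_g)\right)}{\left((v-1)e^{-\lambda}+1\right)^{T}}\right]^{w_i},
$$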

where the mixing proportions $\tau_g(x_i \mid \beta_g)$ (henceforth $\tau_g(x_i)$, for simplicity) are referred to as

the ‘gating network’, with $\tau_g(x_i) > 0$ and $\sum_{g=1}^{G}\tau_g(x_i) = 1$, as usual, and $\beta_1,\ldots,\beta_G$ are the

gating network regression parameters. Such a model can be seen as a conditional mixture

model (Bishop, 2006) because, given the covariates $x_i$, the distribution of the sequences is

a finite mixture model under which $z_i$ has a multinomial distribution with a single trial

and probabilities equal to $\tau_g(x_i)$. The distance-based k-medoids algorithm, though closely

related (see Section 4.2), does not accommodate the inclusion of gating covariates in this way.

Incorporating covariates in ‘hard’ clustering algorithms using MLR, as per McVicar

and Anyadike-Danes (2002), has been criticised because the hard assignment of extraneous

cases can negatively impact internal cluster cohesion and the MLR coeﬃcient estimates

(Piccarreta and Studer, 2019). An advantage of the noise component in MEDseq models

is that it captures uniformly distributed sequences that deviate from those in the other,

more deﬁned clusters. Filtering outliers in this way lessens their impact on the non-noise

gating network coeﬃcients, thereby enabling more accurate inference and improving the

interpretability of the eﬀects of the covariates. Moreover, the ‘soft’ partition obtained under

the model-based paradigm allows the cluster membership probabilities for sequences lying

on the boundary between two neighbouring clusters to be quantiﬁed and the eﬀect of such

sequences on the gating network coeﬃcients to be mitigated.

As per Murphy and Murphy (2020), the CCN, UCN, CUN, and UUN models, which include

an explicit noise component, can either allow covariates to influence all mixing proportions or be

restricted so that covariates influence only the mixing proportions for the non-noise components,

with all observations then assumed to have equal probability of belonging to the uniform noise

component (i.e. by replacing $\tau_0(x_i)$ with $\tau_0$). We refer to the former setting as the gated

noise (GN) setting and to the latter as the non-gated noise (NGN) setting. The NGN setting is

the more parsimonious one, makes clearer the distinction between EDM components and the

uniform component, and is particularly apt when $\tau_0$ is expected to be small and/or the sequences

are expected to overwhelm the gating covariate(s) in determining which cases are noise. Gating

covariates can only be included when G ≥ 2 under the GN setting or when there are 2 or more

non-noise components under the NGN setting.

4 Model Estimation

This section describes our model-ﬁtting approach and some implementation issues that arise

in practice. Speciﬁcally, Section 4.1 outlines the ECM algorithm employed for parameter

estimation, Section 4.2 discusses the initialisation thereof with reference to the similarities

between MEDseq models and the k-medoids and k-modes (Huang, 1997) algorithms, and the

issues of model selection, covariate selection, and model validation are treated in Section 4.3.

4.1 Model Fitting via ECM

Parameter estimation is greatly simpliﬁed by the existence of a closed-form expression for

the normalising constant for MEDseq models based on the Hamming or weighted Hamming

distances. We focus on maximum (pseudo) likelihood estimation using a simple variant

of the EM algorithm (Dempster et al., 1977). For simplicity, model fitting details are

described chieﬂy for the CC MEDseq model with sampling weights and gating covariates.

Additional details for other model types are deferred to Appendix B; so, too, are technical

details pertaining to estimation of the precision parameter(s). The complete data pseudo

likelihood for the CC model is given by

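$$
L_c^w(\theta_1,\ldots,\theta_G,\lambda,\beta_1,\ldots,\beta_G \mid S, X, Z, w, d_H) = \prod_{i=1}^{n}\left[\prod_{g=1}^{G}\left(\tau_g(x_i)\,\frac{\exp\left(-\lambda\, d_H(s_i,\theta_g)\right)}{\left((v-1)e^{-\lambda}+1\right)^{T}}\right)^{z_{i,g}}\right]^{w_i},
$$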

and the complete data pseudo log-likelihood hence has the form:

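$$
\ell_c^w(\theta_1,\ldots,\theta_G,\lambda,\beta_1,\ldots,\beta_G \mid S, X, Z, w, d_H) = \sum_{i=1}^{n}\sum_{g=1}^{G} z_{i,g}\, w_i\left[\log\tau_g(x_i) - \lambda\, d_H(s_i,\theta_g) - T\log\left((v-1)e^{-\lambda}+1\right)\right]. \tag{4}
$$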

Under this model, the distribution of $s_i$ depends on the latent cluster membership variable

$z_i$, which in turn depends on covariates $x_i$, while $s_i$ is independent of $x_i$ conditional on $z_i$.

The iterative algorithm for MEDseq models follows in a similar manner to that for

standard mixture models. It consists of an E-step (expectation), which replaces for each

observation the missing data $z_i$ with their expected values $\hat{z}_i$, which sum to 1, followed by an

M-step (maximisation), which maximises the expected complete data pseudo log-likelihood.

The M-step consists of a series of conditional maximisation (CM) steps in which the objective is

maximised with respect to each parameter individually, conditional on the other parameters remaining fixed.

Hence, model fitting is in fact conducted using an expectation conditional maximisation

(ECM) algorithm (Meng and Rubin, 1993). Aitken’s acceleration criterion is used to assess

convergence of the non-decreasing sequence of weighted pseudo log-likelihood estimates

(Böhning et al., 1994). Parameter estimates produced on convergence achieve at least a

local maximum of the pseudo likelihood function. Upon convergence, cluster memberships

are estimated via the maximum a posteriori (MAP) classification, i.e. cases are assigned to

the cluster g to which they most probably belong via $\mathrm{MAP}(\hat{z}_i) = \arg\max_{g \in \{1,\ldots,G\}} \hat{z}_{i,g}$.
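
To illustrate the convergence check, the following sketch computes the Aitken-accelerated estimate of the limiting pseudo log-likelihood from three successive values. The function names, the tolerance, and the particular stopping rule (comparing the asymptotic estimate with the latest value) are assumptions for illustration; several variants of this rule exist and the one used in practice may differ.

```python
def aitken_estimate(l_prev, l_curr, l_next):
    """Aitken-accelerated estimate of the limit from three successive log-likelihoods."""
    denom = l_curr - l_prev
    if denom == 0:
        return l_next  # sequence has stopped changing
    a = (l_next - l_curr) / denom  # estimated linear rate of convergence
    if a >= 1:
        return float("inf")  # not contracting yet: no finite asymptotic estimate
    return l_curr + (l_next - l_curr) / (1 - a)

def has_converged(loglik_history, tol=1e-5):
    """True once the asymptotic estimate is within tol of the latest log-likelihood."""
    if len(loglik_history) < 3:
        return False
    l_prev, l_curr, l_next = loglik_history[-3:]
    return abs(aitken_estimate(l_prev, l_curr, l_next) - l_next) < tol

# Toy non-decreasing sequence converging geometrically to -100.
history = [-100 - 50 * 0.5 ** m for m in range(30)]
limit_guess = aitken_estimate(history[0], history[1], history[2])
```

For a sequence that converges exactly geometrically, as in the toy example, the accelerated estimate recovers the limit after only three iterations, even though the sequence itself is still far from it.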

The E-step (with similar expressions when λ is unconstrained across clusters and/or time

points) involves computing expression (5), where (m + 1) is the current iteration number:

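$$
\hat{z}_{i,g}^{(m+1)} = \mathbb{E}\left(z_{i,g} \mid s_i, x_i, \hat{\theta}_g^{(m)}, \hat{\lambda}^{(m)}, \hat{\beta}_g^{(m)}, w_i, d_H\right) = \frac{\hat{\tau}_g^{(m)}(x_i)\, f\left(s_i \mid \hat{\theta}_g^{(m)}, \hat{\lambda}^{(m)}, d_H\right)}{\sum_{h=1}^{G}\hat{\tau}_h^{(m)}(x_i)\, f\left(s_i \mid \hat{\theta}_h^{(m)}, \hat{\lambda}^{(m)}, d_H\right)}. \tag{5}
$$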

Note that the weights $w_i$ appear in neither the numerator nor the denominator, leaving the

E-step unchanged regardless of the inclusion or exclusion of weights.
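
The E-step in (5) amounts to a row-wise normalisation of the products of mixing proportions and component densities. A minimal sketch, assuming the log mixing proportions and log component densities have already been arranged in n × G arrays (the names `log_tau` and `log_f` are hypothetical), is:

```python
import numpy as np

def e_step(log_tau, log_f):
    """Posterior membership probabilities from (n, G) arrays of log tau and log f."""
    log_num = log_tau + log_f
    log_den = np.logaddexp.reduce(log_num, axis=1, keepdims=True)  # row-wise log-sum-exp
    return np.exp(log_num - log_den)

def map_classification(z_hat):
    """Index of the most probable cluster for each observation."""
    return np.argmax(z_hat, axis=1)

# Toy example with n = 2 observations and G = 2 components.
log_tau = np.log(np.array([[0.5, 0.5], [0.5, 0.5]]))
log_f = np.log(np.array([[0.2, 0.6], [0.9, 0.1]]))
z_hat = e_step(log_tau, log_f)
labels = map_classification(z_hat)
```

No weights enter this computation, in line with the remark above; the weights only matter once the responsibilities are used in the CM-steps.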

Subsequent subsections describe the CM-steps for estimating the remaining parameters

in the model. These individual CM-steps rely on the current estimates $\hat{Z}^{(m+1)} = (\hat{z}_1^{(m+1)}, \ldots, \hat{z}_n^{(m+1)})$

to provide estimates of the gating network regression coefficients $\hat{\beta}_g^{(m+1)}$,

and hence the mixing proportion parameters $\hat{\tau}_g^{(m+1)}(x_i)$, as well as the central sequence(s)

$\hat{\theta}_g^{(m+1)}$ and component precision parameter(s) $\hat{\lambda}^{(m+1)}$, though technical details for the latter,

as they are the element which distinguishes the various MEDseq model types, are deferred

to Appendix B. It is clear from (4) that the sampling weights can be accounted for by

simply multiplying every $\hat{z}_i^{(m+1)}$ by the corresponding weight $w_i$. Conversely, in the CM-steps

which follow, corresponding formulas for unweighted MEDseq models can be recovered

by replacing $\hat{z}_{i,g}^{(m+1)} w_i$ with $\hat{z}_{i,g}^{(m+1)}$.

4.1.1 Estimating the Gating Network Coeﬃcients

The portion of (4) corresponding to the gating network, given by $\sum_{i=1}^{n}\sum_{g=1}^{G} z_{i,g}\, w_i \log \tau_g(x_i)$,

is of the same form as a MLR model with weights given by $w_i$, here written with component

1 as the baseline reference level for identifiability reasons:

$$
\log\frac{\tau_g(x_i)}{\tau_1(x_i)} = \log\frac{\Pr(z_{i,g}=1)}{\Pr(z_{i,1}=1)} = \tilde{x}_i\beta_g \quad \forall\, g \ge 2, \text{ with } \beta_1 = (0,\ldots,0)^\top,
$$

where $\tilde{x}_i = (1, x_i)$. Thus, methods for fitting such models, with $\hat{Z}^{(m+1)}$ as the response,

can be used to estimate the gating network regression parameters $\hat{\beta}_g^{(m+1)}$. As closed-form

updates are unavailable for MLR coefficients, due to the nonlinear numerical optimisation

involved, this step merely increases (rather than maximises) the expectation of this term.

However, the monotonicity of the sequence of pseudo log-likelihood estimates is preserved

and convergence is still guaranteed. Subsequently, the mixing proportions are given by

$$
\hat{\tau}_g^{(m+1)}(x_i) = \frac{\exp\left(\tilde{x}_i\hat{\beta}_g^{(m+1)}\right)}{\sum_{h=1}^{G}\exp\left(\tilde{x}_i\hat{\beta}_h^{(m+1)}\right)}.
$$

Conversely, τ is estimated exactly via $\hat{\tau}_g^{(m+1)} = n^{-1}\sum_{i=1}^{n}\hat{z}_{i,g}^{(m+1)} w_i$ when there are no gating

covariates. Since $\sum_{i=1}^{n} w_i = n$, this is simply the weighted mean of the g-th column of the

matrix $\hat{Z}^{(m+1)}$. However, τ can also be constrained to be equal (i.e. $\tau_g = 1/G\ \forall\, g$) across

clusters. Thus, situations where $\tau_{i,g} = \tau_g(x_i)$, $\tau_{i,g} = \tau_g$, or $\tau_{i,g} = 1/G$ are accommodated.
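
The two ways of obtaining mixing proportions described above can be sketched as follows. This is an illustration under stated assumptions rather than the package code: the coefficient matrix `beta` (one column per component, first column fixed at zero as the baseline) is taken as given instead of being fitted, and the weights are assumed to have been rescaled to sum to n.

```python
import numpy as np

def gating_proportions(X, beta):
    """Mixing proportions tau_g(x_i) from covariates X (n, p) and coefficients beta (p+1, G)."""
    X_tilde = np.column_stack([np.ones(len(X)), X])  # prepend the intercept
    eta = X_tilde @ beta
    eta = eta - eta.max(axis=1, keepdims=True)       # stabilise the exponentials
    expo = np.exp(eta)
    return expo / expo.sum(axis=1, keepdims=True)

def mixing_no_covariates(z_hat, w):
    """Weighted column means of z_hat; assumes the weights sum to n."""
    return (z_hat * w[:, None]).sum(axis=0) / len(w)

# Toy example: one covariate, G = 2, component 1 as baseline (zero column).
X = np.array([[0.0], [1.0], [2.0]])
beta = np.array([[0.0, 0.5],
                 [0.0, -1.0]])
tau = gating_proportions(X, beta)

z_hat = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
w = np.array([1.5, 0.5, 1.0])  # sums to n = 3
tau_flat = mixing_no_covariates(z_hat, w)
```

Subtracting the row maximum before exponentiating leaves the proportions unchanged but guards against overflow for large linear predictors.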

The standard errors of the gating network’s MLR at convergence are not a valid means of

assessing the uncertainty of the coeﬃcient estimates as the cluster membership probabilities

are estimated rather than ﬁxed and known. Therefore, we adapt the weighted likelihood

bootstrap (WLBS) of O’Hagan et al. (2019) to the MEDseq setting. This is implemented

by multiplying the sampling weights w by draws from an n-dimensional symmetric uniform

Dirichlet distribution and refitting the MEDseq model. To ensure stable estimation of the

standard errors, B = 1000 such samples are used here. To ensure rapid convergence and

to circumvent label-switching problems, the estimated $\hat{Z}$ matrix from the original model is

used to initialise the ECM algorithm for each sample with new likelihood weights. Finally,

the standard errors of the gating network coefficients across the B samples are obtained.

Although this approach does not produce fully valid variance estimates when there are

sampling weights which arise from stratiﬁed designs, we adopt the WLBS in what follows in

order to provide approximate standard errors. This issue is particularly pronounced when

the probability of being included in the sample depends on quantities being modelled. This

concern provides additional justiﬁcation for the aforementioned removal of the Grammar

and Location covariates from our analysis.
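
The weight-generation step of the WLBS can be sketched as below. The function names are hypothetical and the model refitting itself is omitted. Rescaling each set of new weights to sum to n is an assumption made here for convenience; as noted in Section 3, multiplying the weights by a scalar does not affect parameter estimation.

```python
import numpy as np

def wlbs_weights(w, B, rng):
    """B sets of likelihood weights: sampling weights times uniform Dirichlet draws."""
    w = np.asarray(w, dtype=float)
    n = w.shape[0]
    draws = rng.dirichlet(np.ones(n), size=B)  # (B, n); each row sums to 1
    new_w = draws * w
    # Rescale each row to sum to n (an assumption; the scale is immaterial).
    return n * new_w / new_w.sum(axis=1, keepdims=True)

def bootstrap_se(coef_draws):
    """Standard errors as the standard deviation of coefficients across refits."""
    return np.std(np.asarray(coef_draws, dtype=float), axis=0, ddof=1)

rng = np.random.default_rng(1)
w = np.array([0.5, 1.5, 1.0, 1.0])
W = wlbs_weights(w, B=5, rng=rng)
# Each row of W would be passed to a refit initialised at the original z_hat matrix;
# the coefficients from the B refits would then be passed to bootstrap_se.
```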

4.1.2 Estimating the Central Sequences

The location parameter θ is sometimes referred to as the Fréchet mean or the central sequence.

The k-medoids/PAM algorithm, which is closely related to the MEDseq models with

certain restrictions imposed (see Section 4.2), fixes the estimate $\hat{\theta}_g$ to be the medoid of

cluster g (Kaufman and Rousseeuw, 1990), i.e. the observed sequence $s_i \in S$ with minimum

weighted distance from the others currently assigned to the same cluster. This estimation

approach is especially quick as the Hamming distance matrix for the observed sequences is

pre-computed. Notably, this greedy search strategy may fail to find the optimum solution.

However, for a G = 1 unweighted EDM based on the Hamming distance, the maximum

likelihood estimate (MLE) of θ is given simply by the modal sequence, meaning that each

$\hat{\theta}_t$ is independently given by the most frequent state at the t-th time point. This is intuitive

when $d_H(s_i, \theta)$ is expressed as $T - \sum_{t=1}^{T} \mathbb{1}(s_{i,t} = \theta_t)$, as $\hat{\theta}$ maximises the number of element-wise

agreements. Thus, the parameter has a natural interpretation. For more complicated

distance metrics, the first-improvement algorithm (Hoos and Stützle, 2004) or a genetic

algorithm could be used to estimate θ. Notably, the modal sequence need not be an observed

sequence in S. It is also notable that any $\hat{\theta}_t$ may be non-unique under any of the proposed

estimation strategies. Such ties, if any, are broken at random.

For G > 1, under the ECM framework, central sequence position estimates $\hat{\theta}_{g,t}^{(m+1)}$ are

given by $\arg\max_{\vartheta \in V_t} \sum_{i=1}^{n} \hat{z}_{i,g}^{(m+1)} w_i\, \mathbb{1}(s_{i,t} = \vartheta)$, where $V_t$ is the subset of $v_t \le v$ states

observed at time point t across all cases. As this expression is independent of the precision

parameter(s), it holds for all MEDseq model types, including those based on weighted Hamming

distance variants. Thus, $\hat{\theta}_g^{(m+1)}$ is similarly estimated easily and exactly via a weighted

mode (much like k-modes), whereby each $\hat{\theta}_{g,t}^{(m+1)}$ is given by the category corresponding to

the maximum of the sum of the weights $\hat{z}_{i,g}^{(m+1)} w_i$ associated with each of the $v_t$ observed

state values. Similarly, the central sequence under a weighted G = 1 model is also estimated

via a weighted mode, with the weights given only by $w_i$. Notably, to estimate the central

sequences for a MEDseq model of any type without sampling weights, one need only remove

$w_i$ from these terms. Note also that $\theta_0$ does not need to be estimated for models with an

explicit noise component as it does not contribute to the likelihood.
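
The weighted-mode update for the central sequences can be sketched as follows, assuming states are coded as integers 0, ..., v − 1 and stored in an n × T array (both assumptions for illustration, as is the function name). Unlike the random tie-breaking described above, `argmax` here simply returns the first maximiser.

```python
import numpy as np

def central_sequences(S, z_hat, w, v):
    """Weighted modal sequence per cluster.

    S: (n, T) integer-coded sequences; z_hat: (n, G) memberships; w: (n,) weights.
    Returns a (G, T) array of central sequences.
    """
    n, T = S.shape
    G = z_hat.shape[1]
    zw = z_hat * w[:, None]                             # combined weights z_hat[i, g] * w[i]
    theta = np.empty((G, T), dtype=int)
    for t in range(T):
        onehot = S[:, t][:, None] == np.arange(v)       # (n, v) state indicators at time t
        scores = zw.T @ onehot                          # (G, v) total weight per state
        theta[:, t] = scores.argmax(axis=1)             # ties resolved to the first state
    return theta

# Toy example: n = 4 sequences of length T = 2 over v = 3 states, G = 2 clusters.
S = np.array([[0, 1], [0, 2], [1, 2], [1, 1]])
z_hat = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
w = np.array([1.5, 0.5, 0.5, 1.5])
theta_hat = central_sequences(S, z_hat, w, v=3)
```

States never observed at a given time point receive a total weight of zero and so cannot be selected as long as the cluster carries some positive weight, which mirrors the restriction of the search to the observed states.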

4.2 ECM Initialisation and Comparison to k-medoids

MEDseq models share relevant features with the PAM algorithm. Both consider sequences

from a holistic perspective and both rely on distances to a cluster centroid. However,

PAM treats the matrix of pairwise distances between sequences as a pre-computed input,

while under MEDseq models the distances to the centroids (and the costs which deﬁne the

distance metric) are recomputed at each iteration, with the sequences themselves as input.

Otherwise, compared to PAM based on the Hamming distance, MEDseq models diﬀer only

in that i) $\theta_g$ is estimated by the modal sequence rather than the medoid, ii) τ is estimated,

or dependent on covariates via $\tau_g(x_i)$, rather than constrained to be equal, iii) λ is free to

vary across clusters and/or time points, rather than being implicitly set to 1, iv) a noise

component can be included, and v) the ECM algorithm rather than the classification EM

algorithm (CEM; Celeux and Govaert, 1992) is used. The CEM algorithm employed by PAM

uses hard assignments $\tilde{z}_{i,g}$, computed in its C-step, such that $\tilde{z}_{i,g}^{(m+1)} = 1$ if $g = \mathrm{MAP}(\hat{z}_i^{(m+1)})$

and $\tilde{z}_{i,g}^{(m+1)} = 0$ otherwise, for which the denominator in (5) need not be evaluated.

Thus, a CC model fitted by CEM, with λ = 1, equal mixing proportions, and the

central sequences estimated by the medoid rather than the modal sequence, is equivalent

to k-medoids based on the Hamming distance. We leverage these similarities by applying

k-medoids to the Hamming distance matrix in order to initialise the ECM algorithm with

‘hard’ starting values for the allocation matrix Z. In particular, we rely on a weighted

version of PAM available in the R package WeightedCluster (Studer, 2013), itself initialised

using Ward’s hierarchical clustering. The more closely related k-modes algorithm (Huang,

1997) is not used, as case-weighted implementations are currently unavailable. In any case,

our strategy is less computationally onerous than using multiple random starts. Moreover,

our experience suggests that the ECM algorithm converges quickly when our initialisation

strategy is adopted and that a great many random starts are required in order to

achieve comparable performance. For models with an explicit noise component, an initial

guess of the prior probability $\tau_0$ that observations are noise is required. Allocations are then

initialised, assuming the last component is the one associated with $\lambda_g = 0$, by multiplying

the initial (G − 1)-column Z matrix by $1 - \tau_0$ and appending a column in which each entry

is $\tau_0$. We caution that the initial $\tau_0$ should not be too large.
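
The initialisation of the allocation matrix for models with a noise component can be sketched as follows. The hard labels are assumed to come from some initial partition such as k-medoids, coded as integers 0, ..., G − 2; the function name and the default value of `tau0` are illustrative assumptions.

```python
import numpy as np

def init_allocations_with_noise(labels, n_edm, tau0=0.1):
    """Build an initial (n, n_edm + 1) allocation matrix with a trailing noise column."""
    labels = np.asarray(labels)
    Z = np.eye(n_edm)[labels]                      # hard one-hot allocations
    noise_col = np.full(len(labels), tau0)         # equal prior noise probability
    return np.column_stack([Z * (1 - tau0), noise_col])

# Toy example: 4 cases, 2 non-noise clusters, so G = 3 including noise.
Z0 = init_allocations_with_noise([0, 1, 1, 0], n_edm=2, tau0=0.1)
```

Each row still sums to 1 after this adjustment, so the result remains a valid matrix of starting membership probabilities.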

4.3 Model Selection and Validation

In contrast to heuristic clustering approaches like k-medoids and Ward’s hierarchical method,

the model-based paradigm facilitates principled model-selection using likelihood-based in-

formation criteria. In the MEDseq setting, the notion of model selection refers to identifying

the optimal number of components G in the mixture and finding the best MEDseq model

type in terms of constraints on the precision parameters. Variable selection on the subset

of covariates included in the gating network can also improve the ﬁt. For a given set of

covariates, one would typically evaluate all model types over a range of G values and choose

simultaneously both the model type and G value according to some criterion. Thereafter,

diﬀerent ﬁts with diﬀerent covariates can be compared according to the same criterion.

The Bayesian Information Criterion (BIC; Schwarz 1978) includes a penalty term which

depends on the number of free parameters k in the model. The parameter counts can be

deceptive for MEDseq models. In particular, regarding the estimation of $\hat{\theta}_{g,t}$, we note that

identifying the modal state for a given time point implicitly involves estimating occurrence

probabilities for $(v_t - 1)$ states and then selecting the most common. This is accounted for

in Appendix A, wherein the number of free parameters under each MEDseq model type

is summarised. We also note that the penalty $k \log n$ is applied to the maximum pseudo

log-likelihood estimate in the sample-weighted setting (Xu et al., 2013).
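
As a small illustration of how the criterion is used, the sketch below computes the BIC and picks the best of several candidate fits. The sign convention (twice the maximised pseudo log-likelihood minus the penalty, so that larger values are better) is an assumption inferred from the way the BIC is maximised in what follows. The candidate log-likelihoods and parameter counts are made up and do not correspond to any fitted MEDseq model.

```python
import numpy as np

def bic(max_pseudo_loglik, k, n):
    """BIC with the 'larger is better' convention: 2 * loglik - k * log(n)."""
    return 2.0 * max_pseudo_loglik - k * np.log(n)

# Hypothetical candidates: (maximised pseudo log-likelihood, free parameters k).
candidates = {
    ("CC", 2): (-1200.0, 30),
    ("UC", 2): (-1195.0, 31),
    ("UU", 2): (-1150.0, 90),
}
n = 500
scores = {key: bic(ll, k, n) for key, (ll, k) in candidates.items()}
best = max(scores, key=scores.get)
```

In this toy comparison the most flexible candidate attains the highest log-likelihood but is not selected, as its additional parameters are not justified once the penalty is applied.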

Beyond its use in identifying the optimal G and precision parameter settings, the BIC is

also employed in greedy stepwise selection algorithms in order to guide the inclusion/exclusion

of relevant gating covariates. We propose a bi-directional search strategy in which each step

can potentially consist of adding or removing a non-noise component or adding or removing

a covariate. Interaction terms are not considered. Every potential action is evaluated over

all possible model types at each step, rather than considering changing the model type as an

action in itself. Changing the gating covariates or changing the number of components can

aﬀect the model type, as observed by Murphy and Murphy (2020). While this makes the

stepwise search more computationally intensive, it is less likely to miss optimal models as it

explores the model space. For steps involving both gating covariates and a noise component,

models with both the GN and NGN settings can be evaluated and potentially selected.

A backward stepwise search starts from the model, with all covariates included, considered

optimal in terms of the number of components G and the MEDseq model type. On the

other hand, a forward stepwise search uses the optimal model with no covariates included

as its starting point. In both cases, the algorithm accepts the action yielding the highest

increase in the BIC at each step. The computational beneﬁts of upweighting unique cases

and discarding redundant cases are stronger for the forward search, as early steps with fewer

covariates are likely to have fewer unique cases across sequence patterns and covariates.

As a means of validating the model chosen by BIC, we turn to silhouette analysis to assess

the quality of the clustering in terms of internal cohesion, where high cohesion indicates high

between-cluster distances and strong within-cluster homogeneity. Typically, the silhouette

width is defined for clustering methods which produce a ‘hard’ partition (Rousseeuw, 1987),

and the average silhouette width (ASW) or weighted average silhouette width (wASW;

Studer 2013) is used as a model selection criterion. However, Menardi (2011) introduces the

density-based silhouette (DBS) for model-based clustering methods. This allows the ‘soft’

assignment information to be used, which is discarded when using the MAP assignments in

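the computation of the wASW. The empirical DBS for observation i is given by

$$
\widehat{dbs}_i = \frac{\log\left(\hat{z}_i^{0} / \hat{z}_i^{1}\right)}{\max_{h \in \{1,\ldots,n\}} \log\left(\hat{z}_h^{0} / \hat{z}_h^{1}\right)}. \tag{6}
$$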

As observations are assigned to clusters via the MAP classification, $\widehat{dbs}_i$ is proportional to

the log-ratio of the posterior probability associated with the MAP assignment of observation

i (denoted by $\hat{z}_i^{0}$) to the maximum posterior probability that the observation belongs to

another cluster (denoted by $\hat{z}_i^{1}$). Use of the MAP classification implies $0 \le \widehat{dbs}_i \le 1\ \forall\, i$,

with high values indicating a well-clustered data point. Ultimately, the mean or the median

$\widehat{dbs}$ value can be used as a global quality measure, albeit with two modifications. Firstly, we

identify a set of crisply assigned observations having $\hat{z}_i^{1}$ lower than a tolerance parameter ε,

here set equal to $10^{-100}$. These observations are given $\widehat{dbs}_i$ values of 1 and are excluded from

the computation of the maximum in the denominator of (6) for reasons of numerical stability.

Secondly, we account for the sampling weights by computing a weighted mean density-based

silhouette criterion (wDBS). Neither the wDBS nor the wASW is defined for G = 1,

unlike the BIC, and they are not employed here as model selection criteria. These silhouette

summary measures are used only to validate MEDseq clustering solutions and to facilitate

comparisons with other methods in Section 5.2. Higher values are preferred for both criteria.
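
The wDBS computation, including both modifications, can be sketched as follows. The function name and the toy membership matrix are assumptions for illustration, and at least two components are required. Setting all non-crisp values to 0 when every log-ratio is zero is an additional convention adopted here to avoid dividing by zero.

```python
import numpy as np

def density_based_silhouette(z_hat, w=None, eps=1e-100):
    """Per-observation DBS values and their (weighted) mean, for G >= 2."""
    z_hat = np.asarray(z_hat, dtype=float)
    n = z_hat.shape[0]
    ranked = np.sort(z_hat, axis=1)
    z0, z1 = ranked[:, -1], ranked[:, -2]   # MAP probability and best alternative
    crisp = z1 < eps                        # crisply assigned observations
    dbs = np.ones(n)                        # crisp cases are given the value 1
    if np.any(~crisp):
        log_ratio = np.log(z0[~crisp] / z1[~crisp])
        top = log_ratio.max()               # maximum over non-crisp cases only
        dbs[~crisp] = log_ratio / top if top > 0 else 0.0
    weights = np.ones(n) if w is None else np.asarray(w, dtype=float)
    return dbs, float(np.average(dbs, weights=weights))

# Toy example: the third observation is crisply assigned.
z_hat = np.array([[0.9, 0.1], [0.6, 0.4], [1.0, 0.0]])
dbs_values, wdbs = density_based_silhouette(z_hat, w=[1.0, 2.0, 1.0])
```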

5 Analysing the MVAD Data

Results of ﬁtting MEDseq models to the weighted MVAD data are provided in Section

5.1. All results were obtained via our purpose-built R package MEDseq (Murphy et al.,

2021). The impact of discarding the sampling weights is also studied. A comparison against

other approaches, including hierarchical, partitional, and model-based clustering methods,

is included in Section 5.2. A discussion of the insights gleaned from the solution obtained

by the optimal MEDseq model is deferred to Section 6.

5.1 Application of MEDseq

Weighted MEDseq models are fit for G = 1, ..., 25, across all 8 model types (where allowable),

firstly with all covariates included in the gating network (again, where allowable).

The noise components, where applicable, are treated using the NGN setting. Figure 3 shows

the behaviour of the BIC for these models. To better highlight the differences in BIC, lower

values for G < 5 are not shown. Under these conditions, a G = 11 UUN model is identified

as optimal. The same model type and number of components are identiﬁed as optimal

when the noise components are treated with the GN setting, and when the same analysis

is repeated with no covariates at all.

[Figure 3 plot: BIC (vertical axis, approximately −110000 to −95000) against the number of components G (horizontal axis, 5 to 25), with one curve for each model type: CC, UC, CU, UU, CCN, UCN, CUN, UUN.]

Figure 3: BIC values for all MEDseq model types, with weights and all covariates, for a range of G values.

In reﬁning the model further via greedy stepwise selection, both the forward search (see

Table 2) and backward search (see Table 3) thus begin with the same number of components

and the same model type. As previously stated, covariates used to deﬁne the sampling

weights are excluded in both cases. Notably, no step in either search elects to modify G

or the model type. Both searches converge to the same G = 11 UUN model with only the

single covariate ‘GCSE5eq’ in the NGN gating network, though the search in the forward

direction does so in fewer steps. Under this model, the probability of belonging to the noise

component is constant and does not depend on the included covariate.

Table 2: Summary of the steps taken to improve the BIC in the forward direction.

| Optimal Step | G | Model Type | Gating Covariates | Gating Type | BIC |
|---|---|---|---|---|---|
| — | 11 | UUN | — | | −93190.08 |
| Add ‘GCSE5eq’ | 11 | UUN | GCSE5eq | NGN | −92953.85 |
| Stop | 11 | UUN | GCSE5eq | NGN | −92953.85 |

Table 3: Summary of the steps taken to improve the BIC in the backward direction.

| Optimal Step | G | Model Type | Gating Covariates | Gating Type | BIC |
|---|---|---|---|---|---|
| — | 11 | UUN | Catholic, FMPR, Funemp, GCSE5eq, Gender, Livboth | NGN | −93111.30 |
| Remove ‘FMPR’ | 11 | UUN | Catholic, Funemp, GCSE5eq, Gender, Livboth | NGN | −93068.09 |
| Remove ‘Livboth’ | 11 | UUN | Catholic, Funemp, GCSE5eq, Gender | NGN | −93025.73 |
| Remove ‘Catholic’ | 11 | UUN | Funemp, GCSE5eq, Gender | NGN | −92994.32 |
| Remove ‘Funemp’ | 11 | UUN | GCSE5eq, Gender | NGN | −92967.23 |
| Remove ‘Gender’ | 11 | UUN | GCSE5eq | NGN | −92953.85 |
| Stop | 11 | UUN | GCSE5eq | NGN | −92953.85 |

Notably, there is little diﬀerence between the respective clusterings produced by the

various models including no covariates, all covariates, and only GCSE5eq. Indeed, both the

soft $\hat{Z}$ matrices and hard MAP assignments are almost identical between each pair of models;

relative to the optimal model after stepwise selection, there are only 1 and 2 cases assigned to

diﬀerent clusters under equivalent models with no covariates and all covariates, respectively.

Thus, the sequences themselves overwhelm the covariates and there is little confounding

between the simultaneous roles of GCSE5eq under the optimal model in guiding both the

construction of the clusters and their interpretation. Moreover, the parsimony aﬀorded by

discarding the other covariates simpliﬁes the interpretation greatly. Thus, while adapting

the ‘two-step’ approach introduced for LCR (Bakk and Kuha, 2018) to the MEDseq setting

may be of interest for other applications, the results for the MVAD data do not diﬀer greatly

from those presented in Section 6, as shown in Appendix C.

For completeness, the analysis above is repeated with the sampling weights discarded

entirely and consideration given where appropriate to the two covariates used to deﬁne w.

In doing so, identical inference is obtained on the model type; however, the results diﬀer in

terms of the optimal G (now 10), the uncovered partition, and the estimated model parameters.

This is not surprising, as failure to account for w in the clustering produces biased

estimates of the component-speciﬁc parameters and the cluster membership probabilities, as

well as the gating network coeﬃcients. Additionally, an extra gating covariate (Grammar)

is included after stepwise selection in the unweighted analysis. However, the results are

reasonably robust to a coarsening of the sequences; in repeating all analyses with the data

subsetted into six-monthly intervals, similar inferences are again obtained. Notably, the

ECM algorithm’s runtime is not greatly reduced in doing so. Indeed, MEDseq models scale

more poorly with n (or, more specifically, the number of unique cases) than with T or v, as

the number of (pseudo) likelihood evaluations required for large nis more computationally

expensive than the number of simple matching evaluations required for long sequences.

5.2 Other Clustering Methods

To contrast the MEDseq results for the MVAD data with those obtained by other methods,

we present a non-exhaustive comparison against some distance-based and some Markovian

approaches. Regarding the former, we present only some common heuristic methods which

treat the distance matrix as the input using distance metrics which are commonly adopted

in the literature on life-course sequences, namely PAM and Ward’s method based on the

Hamming distance and OM. We note that fuzzy clustering oﬀers an alternative distance-

based perspective which also allows for soft assignments (see D’Urso (2016) for an excellent

overview), with further, separate extensions for incorporating covariates and including a

noise cluster in Studer (2018) and D’Urso and Massari (2013), respectively. However, this

paradigm is not considered further, both for the sake of brevity and because case-weighted

implementations are currently unavailable. LCA and LCR, fit via the R package poLCA

(Linzer and Lewis, 2011), are also excluded, as they encounter computational difficulties due

to the explosion in the number of parameters for G ≥ 3. Among the considered methods,

only MEDseq and the distance-based methods can accommodate the sampling weights.

Firstly, MEDseq models with no covariates and all covariates are compared against

weighted versions of k-medoids, using the R package WeightedCluster (Studer, 2013), and

Ward’s hierarchical clustering. Here, k-medoids is itself initialised using Ward’s method.

Neither method can be compared to MEDseq models in terms of BIC or wDBS values, as they

are not model-based and do not yield ‘soft’ cluster membership probabilities, respectively.

Thus, Figure 4 shows a comparison of wASW values using MAP classifications where necessary.

Only the MEDseq model type (and gating network setting, for models with covariates)

with the highest wASW for each G value is shown, for clarity. Note that the wASW is computed

using the observed Hamming distance matrix, which both comparators in Figure 4

utilise directly, while MEDseq models are only based on the Hamming metric. Nonetheless,

MEDseq models show superior or competitive performance across the majority of G values.

In particular, the optimal model identified after stepwise selection achieves wASW = 0.386.

The superior wASW values achieved by MEDseq models provide evidence that the proposed

methodology, which embeds features of the distance-based approaches into a model-based

setting, yields more compact and well-separated clusters. Notably, similar conclusions are

drawn when OM — with the same cost settings as used in McVicar and Anyadike-Danes

(2002) — is used in place of the Hamming distance for k-medoids and Ward’s method.

[Figure 4 plot: weighted ASW (vertical axis, approximately 0.30 to 0.45) against the number of components G (horizontal axis, 2 to 40), with curves for MEDseq with no covariates, MEDseq with all covariates, k-medoids, and Ward's hierarchical clustering.]

Figure 4: Values of the wASW measure, using Hamming distances, for the best MEDseq model type for each G value with no covariates and all covariates. Corresponding values for weighted versions of k-medoids and Ward’s hierarchical clustering based on the Hamming distance are also shown.

Secondly, finite mixtures with first-order Markov components, fit via the R package

ClickClust (Melnykov, 2016b), are also included in the comparison. This package allows

the initial state probabilities to be either estimated or equal to 1/v for all categories; both

scenarios are considered and other function arguments are set to their default values. The

wASW values for the ClickClust models are not shown in Figure 4; they are much lower

than those of the other methods up to G = 5 and turn negative thereafter. Though this

implies inferior clustering behaviour for ClickClust models, the method also returns a

$\hat{Z}$ matrix of cluster membership probabilities. Hence, these models are also compared to

MEDseq in terms of the wDBS measure in Figure 5. Again, only the best model of each type

is shown for each G value; here, the MEDseq models again exhibit the best performance

over the entire range. Notably, the optimal G = 11 UUN MEDseq model with ‘GCSE5eq’

in the gating network achieves wDBS = 0.455. An advantage of ClickClust is that it allows

sequences of unequal lengths, but this is not a concern for the MVAD data.

[Figure 5: line plot of the Weighted Mean DBS (vertical axis, ticks from 0.1 to 0.6) against the Number of Components G (horizontal axis, 2 to 40) for MEDseq with no covariates, MEDseq with all covariates, and ClickClust.]

Figure 5: Values of the wDBS measure for the best MEDseq model type at each G value with no covariates and all covariates. Corresponding values for the best ClickClust model are also shown.

Thirdly, the R package seqHMM (Helske and Helske, 2019) provides tools for fitting mixtures of hidden Markov models, with gating covariates influencing cluster membership probabilities. Such models allow cluster memberships to evolve over time, similar to mixed membership models (Airoldi et al., 2014). They thus cannot be directly compared to MEDseq models. However, we note that the seqHMM package provides a pre-fitted model for the MVAD data, with the first two months also discarded and no covariates. The model has 2 clusters, with 3 and 4 hidden states, respectively, and achieves wDBS = 0.50 and wASW = 0.23. Otherwise identical seqHMM models, including either all covariates or only the GCSE5eq covariate chosen for the optimal MEDseq model via stepwise selection, both achieve wDBS = 0.47 and wASW = 0.23. Notably, these wDBS and wASW values are much worse than those for MEDseq models with G = 2. Overall, the ClickClust and seqHMM results suggest that holistic approaches, and MEDseq models in particular, yield better clusterings than Markovian ones for the MVAD data.

6 Discussion of the MVAD Results

To better inform a discussion of the results obtained by the optimal G = 11 UUN model for the MVAD data, with the covariate GCSE5eq in the NGN gating network, its estimated central sequences are first shown in Figure 6. Seriation has been applied, using the observed Hamming distance matrix and the travelling salesperson combinatorial optimisation algorithm (Hahsler et al., 2008), in order to give consecutive numbers to clusters with similar estimated (weighted) modal sequences. Each cluster’s label is derived from the representation of $\hat{\theta}_g$ in State-Permanence-Sequence format (SPS; Aassve et al., 2007). The same ordering and labels are used in all subsequent graphical and tabular displays of results. The uncovered clusters are shown in Figure 7, to which additional seriation has been applied in order to also group the observations within clusters, for visual clarity. Finally, the average time spent in each state by cluster, weighted by $w_i$ and the estimated cluster membership probabilities, is shown in Table 4, along with the cluster sizes.
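The SPS labels used throughout this section are run-length encodings of the central sequences. As a minimal sketch (the function name is hypothetical and the output format simply mimics the labels shown in the tables), such a label can be obtained as follows:

```python
from itertools import groupby


def to_sps(sequence):
    """Run-length encode a state sequence as '(state,duration)-...'."""
    return "-".join(
        f"({state},{len(list(run))})" for state, run in groupby(sequence)
    )


# A 70-month central sequence: 25 months of school, then 45 of higher education.
central = ["SC"] * 25 + ["HE"] * 45
label = to_sps(central)
print(label)  # (SC,25)-(HE,45)
```

Each pair gives a distinct successive state and the number of consecutive months spent in it, so the durations in a label always sum to the sequence length.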

[Figure 6: state-sequence plot of the central sequences. Horizontal axis: Time (Sep.93 to Jun.99); vertical axis: Clusters 1 to 10, each labelled on the right with its SPS representation. Legend: EMployment, Further Education, Higher Education, JobLessness, SChool, TRaining.]

Figure 6: Central sequences of the optimal G = 11 UUN model with the GCSE5eq gating covariate. The SPS labels on the right characterise each non-noise cluster by the distinct successive states in $\hat{\theta}_g$, with associated durations (in months).

[Figure 7: state-sequence plot of the observed sequences. Horizontal axis: Time (Sep.93 to Jun.99); vertical axis: Clusters 1 to 10 and Noise, with each non-noise cluster labelled by its SPS representation. Legend: EMployment, Further Education, Higher Education, JobLessness, SChool, TRaining.]

Figure 7: Clusters uncovered under the optimal G = 11 UUN model with the GCSE5eq gating covariate. The rows correspond to the n = 712 observed sequences, including duplicate cases previously discarded during model fitting, grouped according to the MAP classification and ordered according to the observed Hamming distance matrix. Each cluster is named according to the SPS representation of $\hat{\theta}_g$.


Table 4: Average durations (in months) spent in each state by cluster, weighted by $\hat{z}_{i,g} w_i$, for the optimal 11-component UUN model, with the SPS labels derived from $\hat{\theta}_g$. Estimated cluster sizes $\hat{n}_g$ correspond to the MAP partition. States: EM = EMployment, FE = Further Education, HE = Higher Education, JL = JobLessness, SC = SChool, TR = TRaining.

g      SPS label                        n_g     EM     FE     HE     JL     SC     TR
1      (SC,25)-(HE,45)                   87   3.77   0.29  38.45   0.89  26.07   0.54
2      (FE,25)-(HE,45)                   59   4.65  26.51  37.63   0.45   0.76   0.00
3      (SC,24)-(FE,36)-(HE,10)           18   3.40  30.58   8.56   4.07  21.84   1.56
4      (SC,25)-(EM,45)                   32  35.60   1.68   3.63   2.85  25.60   0.63
5      (TR,37)-(EM,33)                   60  28.29   1.24   0.00   3.38   1.35  35.74
6      (TR,22)-(EM,48)                   67  45.84   1.47   0.00   3.15   1.51  18.03
7      (TR,5)-(EM,65)                   165  57.50   2.11   0.00   5.16   1.62   3.62
8      (FE,22)-(EM,48)                   95  41.30  22.65   0.99   3.04   1.30   0.72
9      (SC,10)-(FE,36)-(EM,24)           56  21.82  35.19   0.27   3.99   6.15   2.58
10     (TR,10)-(JL,2)-(TR,3)-(JL,55)     55   8.40   3.38   0.22  42.89   4.20  10.91
Noise  -                                 18  21.50  11.42   1.19  14.51   2.20  19.18
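The entries of such a table are weighted averages of per-sequence state counts. The following sketch shows one way to compute them, assuming each observation contributes to cluster g with weight equal to its membership probability multiplied by its sampling weight; the function and variable names are hypothetical and the toy data are not taken from the MVAD data set.

```python
def weighted_state_durations(sequences, z, weights, states):
    """Average time in each state per cluster, weighted by z[i][g] * weights[i].

    sequences : list of state sequences (lists of state codes)
    z         : list of rows of membership probabilities, one row per sequence
    weights   : sampling weights, one per sequence
    states    : state codes to tabulate
    """
    n_clusters = len(z[0])
    table = []
    for g in range(n_clusters):
        denom = sum(z[i][g] * weights[i] for i in range(len(sequences)))
        row = {}
        for s in states:
            num = sum(
                z[i][g] * weights[i] * seq.count(s)
                for i, seq in enumerate(sequences)
            )
            row[s] = num / denom if denom > 0 else float("nan")
        table.append(row)
    return table


# Toy example with three sequences of length 3 and two clusters.
toy_seqs = [["SC", "SC", "HE"], ["SC", "HE", "HE"], ["EM", "EM", "EM"]]
toy_z = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_w = [1.0, 3.0, 2.0]
durations = weighted_state_durations(toy_seqs, toy_z, toy_w, ["EM", "HE", "SC"])
print(durations)
```

Because every sequence has the same length, each row of the resulting table sums to that length, which provides a quick check on rows such as those reported above (each of which sums to roughly 70 months).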

This solution tends to group individuals who experience trajectories that are similar or

that diﬀer only for relatively short periods. In particular, the dominating combinations of

states experienced over time are clearly identiﬁed, and diﬀerences in durations and/or age at

transition are quite limited in size. Within clusters, substantial reduction of misalignments

and/or diﬀerences in the durations of spells are evident. Ultimately, the partition is char-

acterised not only by the sequencing (i.e. the experienced, ordered combinations of states),

but also by the spell durations and the ages at transitions which appear mostly homoge-

neous within clusters. This can be explained by the fact that cases in the identiﬁed groups

tend to dedicate the same period of time (spells of 1, 2, or 3 years) to further/higher education and/or training. This is interesting because one might expect the chosen dissimilarity metric, as it is based on the Hamming distance, to attach higher importance to the sequencing.

The 11-cluster solution for the MVAD data separates individuals who continued in school

(clusters 1, 3, and 4), or otherwise prolonged their studies after the end of compulsory

education (clusters 2, 8, and 9), from those who entered the labour market (clusters 5, 6,

and 7). The clear division visible for some clusters in Figure 7 around Autumn 1995, when new semesters of further and higher education commenced and the majority of those still remaining in school had eventually left, corresponds to the time point in Figure 2 after

which the entropies declined. Interestingly, individuals who experienced prolonged periods

of unemployment are mostly isolated in cluster 10; this is particularly important because

the Status Zero Survey aimed to identify such ‘at risk’ subjects. From this we conclude

that youth unemployment in Northern Ireland in this period was predominantly a problem

of small numbers experiencing long spells of non-participation in the labour market rather

than large numbers dipping into brief, frictional spells.

Clusters 1, 3, and 4 include subjects who continued school for about two years, presum-

ably to retake previously failed examinations or to pursue academic or vocational qualiﬁ-

cations. These individuals are split into two groups depending on whether they continued

their studies (FE: cluster 3, or HE: cluster 1) or were employed directly (cluster 4). Clus-

ters 2, 8, and 9 group subjects who initially entered further education, for about two years

(clusters 2 and 8) or more (cluster 9). Most subjects in clusters 8 and 9 entered employment

directly after further education, whereas the vast majority of those in cluster 2 transitioned

to higher education, where they remained until the end of the observation period.

As for the clusters of individuals who moved quickly to the labour market after the

end of compulsory education, it is possible to distinguish between individuals who almost

immediately found a job and remained in employment for most of the observation period

(the large cluster 7) and individuals who entered government-supported training schemes

(clusters 5 and 6). A further separation is between subjects who were employed after about


2 years of training (cluster 6) and those who participated in training for a much longer

period (cluster 5). Importantly, most of the individuals in these two clusters were able to

ﬁnd a job even if some respondents experienced some periods of unemployment.

It is interesting to observe that the cluster of careers dominated by persistent unem-

ployment (cluster 10) is characterised by diﬀerent experiences at the end of the compulsory

education period. Indeed, some subjects entered employment directly after the end of com-

pulsory education but left or lost their job after some months, while some prolonged their

education before becoming unemployed. However, the majority entered a training period

that did not evolve into steady employment.

Notably, the optimal model identiﬁed is a UUN model, i.e. one whose precision param-

eters vary across both clusters and time points. Thus, model selection favours a ﬂexible,

heavily-parameterised MEDseq variant which, while based on the simple Hamming dis-

tance, has cluster-speciﬁc and period-speciﬁc costs which allow element-wise mismatches

between sequences and the central sequences in diﬀerent time periods in diﬀerent clusters

to contribute diﬀerently to the overall distance measure. While a display of the estimated

precision parameters is omitted, for brevity, their values can be easily examined via the

MEDseq R package. Nonetheless, it is already clear that the model captures different degrees

of heterogeneity in the cluster-speciﬁc state distributions of each month.
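To illustrate what cluster-specific and period-specific costs mean in practice, the sketch below evaluates a weighted Hamming distance in which each mismatch contributes the precision attached to its time point. This is a minimal illustration under our reading of the UUN setting, not the MEDseq implementation: the function name and the precision values are hypothetical.

```python
def weighted_hamming(seq, central, precisions):
    """Sum of period-specific precisions over positions where seq != central."""
    return sum(lam for s, c, lam in zip(seq, central, precisions) if s != c)


# One observed sequence compared with the same central sequence under two
# hypothetical clusters whose precisions differ by time point.
observed = "AABBA"
central = "AAABB"
precisions_cluster_1 = [2.0, 2.0, 0.5, 2.0, 2.0]  # tolerant of mismatch at t = 3
precisions_cluster_2 = [0.5, 0.5, 3.0, 0.5, 0.5]  # strict at t = 3
d1 = weighted_hamming(observed, central, precisions_cluster_1)
d2 = weighted_hamming(observed, central, precisions_cluster_2)
print(d1, d2)
```

The same two mismatches (at the third and fifth time points) contribute differently in the two clusters, and setting every precision to 1 recovers the ordinary Hamming distance.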

The coeﬃcients of the gating network with associated WLBS standard errors are given

in Table 5, from which a number of interesting eﬀects can be identiﬁed. The interpretation

of the eﬀects of the covariates is made clearer by virtue of there being just one included

after stepwise selection. For completeness, gating network coeﬃcients and associated WLBS

standard errors for the model with all covariates included are provided in Appendix C.

Table 5: Multinomial logistic regression coefficients and associated WLBS standard errors (in parentheses), with SPS labels, for the NGN gating network of the optimal G = 11 UUN model with the GCSE5eq covariate. Recall that GCSE5eq = 1 for subjects who achieved 5 or more grades at A–C (or equivalent) in GCSE exams.

g    SPS label                       (Intercept)    GCSE5eq
1    (SC,25)-(HE,45)                 (reference)    (reference)
2    (FE,25)-(HE,45)                 −0.95 (0.44)   −0.47 (0.49)
3    (SC,24)-(FE,36)-(HE,10)         −0.46 (0.63)   −1.23 (0.73)
4    (SC,25)-(EM,45)                  0.58 (0.44)   −2.18 (0.58)
5    (TR,37)-(EM,33)                  1.03 (0.38)   −3.43 (0.55)
6    (TR,22)-(EM,48)                  1.19 (0.35)   −3.73 (0.50)
7    (TR,5)-(EM,65)                   1.70 (0.32)   −4.09 (0.47)
8    (FE,22)-(EM,48)                  0.60 (0.38)   −2.20 (0.42)
9    (SC,10)-(FE,36)-(EM,24)          0.95 (0.39)   −3.20 (0.55)
10   (TR,10)-(JL,2)-(TR,3)-(JL,55)    0.90 (0.36)   −3.73 (0.72)

Relative to the reference cluster (cluster 1), characterised by those who prolonged their

schooling for two years to sit A-level exams before successfully transitioning to higher education, all slope coefficients are notably negative. Students achieving 5 or more grades at A–C in GCSE exams are therefore less likely to belong to any of the other clusters, relative

to cluster 1. Thus, the reference level for the eﬀect of GCSE5eq is appropriate and the

interpretation is guided only by the magnitude of the slope coeﬃcients and their associated

standard errors, as well as the intercepts. Firstly, the effects for clusters 2 and 3, capturing

other subjects who were in higher education by the end of the observation period, appear

slight (on the basis of the size of the standard errors of their slopes). Coupled with the

negative intercepts for these clusters, this suggests, as expected, that more academically

inclined students tend to prolong their education in order to improve their job prospects.

Conversely, all other intercepts are positive and all other slope coeﬃcients appear to be

signiﬁcantly diﬀerent from 0. We can say, therefore, despite the two-year continuation in

school of subjects in cluster 4, that students who do well in GCSE exams are less likely to


belong to this cluster. Furthermore, we can see the coeﬃcient magnitudes increasing and

the standard errors decreasing as we move from cluster 5 to cluster 7. As these clusters are

distinguished only by the length of the training period prior to securing stable employment,

this again suggests that academically poor students are quick to ﬁnd a job, presumably in

an unskilled capacity. Similar conclusions can be drawn for clusters 8 and 9, i.e. subjects

who secured employment of some kind after some time in further education rather than

third-level education. Finally, those who achieved 5 or more high GCSE grades are less

likely to experience persistent spells of joblessness (cluster 10).
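The coefficients in Table 5 can be turned into prior cluster membership probabilities for subjects with GCSE5eq = 0 and GCSE5eq = 1. The sketch below does so with a standard multinomial logit, taking cluster 1 as the reference. Two points are assumptions of this illustration rather than statements about the MEDseq implementation: the noise component is given the constant probability of approximately 0.025 reported for this model, and the non-noise probabilities are rescaled to share the remainder.

```python
import math

# (intercept, GCSE5eq slope) for clusters 1 to 10, copied from Table 5;
# cluster 1 is the reference, so both of its coefficients are zero.
COEFS = [
    (0.00, 0.00), (-0.95, -0.47), (-0.46, -1.23), (0.58, -2.18), (1.03, -3.43),
    (1.19, -3.73), (1.70, -4.09), (0.60, -2.20), (0.95, -3.20), (0.90, -3.73),
]


def gating_probabilities(gcse5eq, noise_prob=0.025):
    """Return [P(cluster 1), ..., P(cluster 10), P(noise)] for GCSE5eq in {0, 1}.

    The (1 - noise_prob) rescaling of the non-noise clusters is an assumption.
    """
    scores = [math.exp(b0 + b1 * gcse5eq) for b0, b1 in COEFS]
    total = sum(scores)
    probs = [(1.0 - noise_prob) * s / total for s in scores]
    return probs + [noise_prob]


p_low = gating_probabilities(0)   # fewer than 5 high GCSE grades
p_high = gating_probabilities(1)  # 5 or more high GCSE grades
for g, (a, b) in enumerate(zip(p_low[:10], p_high[:10]), start=1):
    print(f"cluster {g:2d}: GCSE5eq=0 -> {a:.3f}, GCSE5eq=1 -> {b:.3f}")
```

Under these assumptions, the probability of cluster 1 rises sharply when GCSE5eq = 1, while cluster 7 is the most probable cluster when GCSE5eq = 0, in line with the interpretation given above.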

The optimal G = 11 UUN model contains a uniform noise component. The BIC chooses such a model over G = 10 models without a noise component and G = 11 models with all non-noise components. Detecting outliers in this way allows the remaining non-noise clusters to be modelled more clearly. Figure 8 focuses on the noise component, which isolates errant,

directionless subjects who don’t neatly ﬁt into any of the deﬁned clusters and transition

quite frequently between states. This includes transitions in and out of further education,

employment, and training. Most subjects here are early school-leavers. Under the model’s

NGN gating network, the probability of belonging to this noise component is constant

(≈0.025) and independent of the included GCSE5eq covariate.

[Figure 8: state-sequence plot of the noise component. Horizontal axis: Time (Sep.93 to Jun.99); vertical axis: Observations 1 to 18. Legend: EMployment, Further Education, Higher Education, JobLessness, SChool, TRaining.]

Figure 8: Observations assigned to the noise component of the optimal G = 11 UUN model with the GCSE5eq covariate in the NGN gating network.

7 Conclusion

The Status Zero Survey followed a sample of Northern Irish youths over a six-year pe-

riod, recording their employment activities at monthly intervals, in order to explore their

unfolding career trajectories and identify those at risk of prolonged unemployment. Here

we present a model-based clustering approach, with the aims of assessing how many typ-

ical trajectories there are, what kinds of typical trajectories there are, and what kinds of

individuals are more likely to experience which kinds of trajectories. Our approach is contrasted with heuristic approaches previously employed in analyses of these data. In McVicar

and Anyadike-Danes (2002), Ward’s hierarchical clustering algorithm is applied to an OM

dissimilarity matrix to identify relevant patterns in the data, with subjective costs. Notably,

reference is not made to the associated covariates until the uncovered clustering structure

is examined. In particular, MLR is used to relate the hard assignments of the sequences to

clusters to a set of baseline covariates. It is also notable that the sampling weights are incor-

porated only in the MLR stage and not in the clustering itself. This is arguably a three-step

approach, comprising the computation of pairwise string distances using OM (or some other

distance metric), the hierarchical or partition-based clustering, and the (weighted) MLR.

MEDseq models, conversely, oﬀer a more coherent ‘one-step’ model-based approach. The

sequences are modelled directly using a ﬁnite mixture of exponential-distance models, with

the Hamming distance and weighted variants thereof employed as the distance metric. A

range of precision parameter settings has been explored to allow different time points to contribute differently to the overall distance. Thus, varying degrees of parsimony are accommodated. Sampling weights are accounted for by weighting each observation’s contribution to

the pseudo likelihood. Dependency on covariates is introduced by relating the cluster mem-

bership probabilities to covariates under the mixture of experts framework. Thus, MEDseq

models treat the weights, the relation of covariates to clusters, and the clustering itself si-

multaneously. Hence, MEDseq provides a coherent framework for estimating the number

of clusters, identifying the relevant features of these patterns, and assessing whether these

patterns are somehow inﬂuenced or shaped by the subjects’ background characteristics.

Model selection in the MEDseq setting identiﬁes a reasonable solution for the MVAD

data and shows that clustering the sequences in a holistic manner allows new insights to

be gleaned from these data. In particular, 11 distinct components are found, of which

10 have interpretable typical trajectories and one is an additional noise component which

captures deviant cases. Thus, supported by the use of an information criterion appropriate

for this model-based analysis, a more granular view of the MVAD cohort than the 5 groups

uncovered in McVicar and Anyadike-Danes (2002) is provided. Furthermore, allowing for

the other covariates with which the sampling weights used here are deﬁned, GCSE exam

performance at the end of the compulsory education period is found to be the single most important predictor of cluster membership.

Opportunities for future research are varied and plentiful. Co-clustering approaches

could be used to simultaneously provide clusters of the observed sequence trajectories and

the time periods (Govaert and Nadif, 2013). Such an approach could be especially useful for

the UUN model type identiﬁed as optimal for the MVAD data, as it would reduce the number

of within-cluster period-speciﬁc precision parameters required. Indeed, parsimony has been

achieved in a similar fashion in the context of ﬁnite mixtures with Markov components

(Melnykov, 2016a). Additionally, grouping trajectories across time as well would enable

more eﬃcient summaries of the durations of the spells in speciﬁc states, which tend to be

long for the MVAD data. In particular, using co-clustering approaches which respect the

ordering of the sequences by restricting the column-wise clusters to form contiguous blocks

would be particularly desirable. Indeed, failure to fully account for the temporal ordering of

events, due to the invariance of the Hamming distance to permutations of the time periods,

is a general limitation of our framework which future work will endeavour to address.
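The permutation invariance mentioned here is easy to verify numerically. In the sketch below (the sequences and names are hypothetical), the same random reordering of the time periods is applied to two sequences, and the Hamming distance between them is unchanged.

```python
import random


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


seq_a = list("SSSSFFFFEEEE")
seq_b = list("SSFFFFEEEEEE")

# Apply one common, arbitrary permutation of the time periods to both sequences.
perm = list(range(len(seq_a)))
random.Random(1).shuffle(perm)
seq_a_perm = [seq_a[t] for t in perm]
seq_b_perm = [seq_b[t] for t in perm]

d_original = hamming(seq_a, seq_b)
d_permuted = hamming(seq_a_perm, seq_b_perm)
print(d_original, d_permuted)
```

The distance counts mismatched positions only, so any information carried by the order in which those positions occur is lost, which is the limitation noted above.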

It may also be of interest for other applications to extend the MEDseq models to ac-

commodate sequences of diﬀerent lengths, for which the Hamming distance is not deﬁned.

These diﬀerent lengths could be attributable to missing data, either by virtue of sequences

not starting on the same date, shorter follow-up time for some subjects, or non-response

for some time points. While the Hamming distance is only deﬁned for equal-length strings,


adapting the MEDseq models to such a setting would be greatly simpliﬁed if aligning the

sequences of diﬀerent lengths is straightforward. Another limitation of MEDseq models is

that time-varying covariates are not accommodated in the gating network. Notably, neither of these concerns is relevant for the MVAD data.

However, our analysis of the MVAD data is limited by two aspects of the gating network

portion of our framework. The ﬁrst substantive limitation relates to the WLBS approach

used for quantifying uncertainty in the MLR coeﬃcients. As the sampling weights arise

from stratiﬁcation, the standard errors obtained in this fashion are approximate. Thus,

examining alternative approaches to produce fully valid variance estimates in the MEDseq

setting in the presence of complex sampling designs is an interesting future research avenue.

The second limitation relates to the stepwise procedure used to identify relevant covari-

ates. As this strategy depends on an information criterion, namely the BIC, whose penalty

term is based on a parameter count, it may be prudent to relax the assumption that gating

covariates must affect all components. As the number of components chosen here (G = 11)

is moderately large, a large number of extra parameters are associated with each extra

covariate (see Appendix A). Thus, only GCSE5eq is identiﬁed as optimal, as it is signiﬁ-

cantly associated with many of the typical trajectories. However, we note, for example, that

Catholics are largely underrepresented in cluster 7 and largely overrepresented in cluster 10

(characterised by persistent employment and persistent joblessness, respectively) despite the

omission of the covariate indicating religious aﬃliation from the optimal model. Incorpo-

rating regularisation penalties into the MLR to shrink certain gating network coeﬃcients to

zero could thus be a fruitful alternative to the present stepwise covariate-selection method.

Another potential extension is to consider MEDseq models with alternative distance

metrics. The distance metric in García-Magariños and Vilar (2015), which accounts for

the temporal correlation in categorical sequences, is of particular interest; so, too, is OM.

In general, heuristic distance-based clustering (including fuzzy methods) can more easily

accommodate more sophisticated distances, while changing the MEDseq distance metric

fundamentally alters the model, which needs the normalising constant and the conditional

maximisation steps for parameter estimation to be tailored to the choice of metric.

MEDseq models, by virtue of being based on the Hamming distance for computational

reasons, implicitly assume substitution-cost matrices with zero along the diagonal and a

single value common to all other entries. The relationship between the exponent of an

EDM based on the Hamming distance and the Hamming distance itself (with a common

cost, typically equal to 1) is apparent from the fact that multiplying the substitution-cost

matrix by any positive scalar, as per normalised variants of the Hamming distance (Elzinga, 2007; Gabadinho et al., 2011), yields the same model, because its value is absorbed into λ.

This is also the case for models employing weighted Hamming distance variants under which

the precision parameters, and hence the otherwise common substitution costs, vary across

clusters and/or time points. However, no model type in the MEDseq family can account for situations in which some states are more different than others (e.g. one where the cost associated with moving from employment to joblessness is assumed to be greater than the cost associated with moving from school to training), as they all assume that substitution costs are the same between each pair of states. Such concerns are most pronounced when there is an explicit ordering to the states, e.g. education levels (Studer and Ritschard, 2016).
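The absorption of a common substitution cost into the precision parameter can be written out in one line. The notation below is assumed for illustration: $s_i$ denotes an observed sequence of length $T$, $\theta$ a central sequence, $\lambda > 0$ a precision parameter, and $c > 0$ a substitution cost common to all pairs of distinct states.

```latex
% Hamming distance with common substitution cost c, and its effect on the
% exponent of the exponential-distance model (notation assumed, see lead-in).
\[
  d_H^{(c)}(s_i, \theta)
  = c \sum_{t=1}^{T} \mathbb{1}(s_{i,t} \neq \theta_t)
  = c \, d_H(s_i, \theta),
\]
\[
  \exp\{-\lambda \, d_H^{(c)}(s_i, \theta)\}
  = \exp\{-(c\lambda) \, d_H(s_i, \theta)\}
  = \exp\{-\lambda^{\star} d_H(s_i, \theta)\},
  \qquad \lambda^{\star} = c\lambda .
\]
```

Rescaling the cost therefore only reparameterises the precision. By contrast, costs that differ between pairs of states cannot be factored out of the sum in this way, which is why they fall outside the model family.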

Basing MEDseq on OM, for instance, would require the subjective specification, or preferably estimation, of the $v(v-1)/2$ off-diagonal entries of symmetric substitution-cost

matrices. Potentially, as per the range of precision settings used for the MVAD application,

the substitution-cost matrices could also be allowed to vary across clusters and/or time

points. However, the normalising constant under an EDM using OM depends both on the heterogeneous substitution costs and on θ and is unavailable in closed form, thereby greatly complicating model fitting. Indeed, dependence on θ renders even offline pre-computation of the normalising constant infeasible for even moderately large T or v. Truncation of the sum over all sequences or importance sampling techniques could be used to address the intractability. Though not a concern for the MVAD data, as one substitution is equivalent to a

deletion and an insertion for equal-length sequences, considering insertions and deletions

also would present further challenges. In any case, some level of approximation would be

required, while the ECM algorithm for MEDseq models based on simple matching is exact.

As well as removing the normalising constant’s dependence on θ, another positive con-

sequence of the homogeneity of substitution costs with respect to pairs of states under the

Hamming distance is that the ECM algorithm used for parameter estimation scales well

with the sequence length T and the size of the alphabet v, especially since such normalising constants need to be computed once, G times, or G−1 times per iteration, depending on the

precision parameter settings. Though potentially restrictive, having only one parameter as-

sociated with each substitution-cost matrix, regardless of its order v, helps address concerns

about overparameterisation (Studer and Ritschard, 2016), especially when the substitution

costs implied by the precision parameter(s) vary across clusters and/or time points.
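The computational point made here can be checked on a small example. With homogeneous substitution costs, the sum of exp(−distance) over all v^T sequences factorises across time points into a product of terms of the form 1 + (v − 1)exp(−λ_t), which does not involve the central sequence. The sketch below compares this closed form with brute-force enumeration; the function names and the toy precisions are hypothetical, and the code illustrates the factorisation rather than the MEDseq implementation.

```python
import math
from itertools import product


def normaliser_closed_form(precisions, v):
    """Product over time points of 1 + (v - 1) * exp(-lambda_t)."""
    return math.prod(1.0 + (v - 1) * math.exp(-lam) for lam in precisions)


def normaliser_brute_force(precisions, central, alphabet):
    """Sum of exp(-weighted Hamming distance) over all len(alphabet)**T sequences."""
    total = 0.0
    for seq in product(alphabet, repeat=len(central)):
        d = sum(lam for s, c, lam in zip(seq, central, precisions) if s != c)
        total += math.exp(-d)
    return total


alphabet = "ABC"                 # v = 3 states
precisions = [0.5, 1.0, 2.0, 0.1]  # one precision per time point, T = 4
closed = normaliser_closed_form(precisions, len(alphabet))
brute_1 = normaliser_brute_force(precisions, "ABCA", alphabet)
brute_2 = normaliser_brute_force(precisions, "CCCC", alphabet)
print(closed, brute_1, brute_2)
```

The enumeration visits v^T sequences, whereas the closed form requires only T terms, and the agreement of the two brute-force values shows that the constant does not depend on the central sequence.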

Furthermore, it is likely that results on the MVAD data would not diﬀer greatly with OM

used in place of the Hamming distance, particularly for models where λ varies across clusters

and/or time points, save for a solution with potentially fewer clusters being found. Indeed,

McVicar and Anyadike-Danes (2002) also consider a setting with common substitution costs

and ﬁnd that their results do not greatly diﬀer from their solution with state-dependent

costs. This implies that the notion that some states in the MVAD data are closer to each

other than others can be questioned. Ultimately, the UUN model adopted here preserves

the timing of events, by prohibiting time-warping insertion and deletion operations, while

accounting (in a cluster-speciﬁc fashion) for the timing, as well as the number, of element-

wise mismatches between sequences, in such a way that all states are assumed to be equally

diﬀerent. Given the correspondence between Hamming distance weights, precision param-

eters, and implicit substitution costs in MEDseq models, it is notable that these are treated

as parameters rather than inputs, and are thus estimated as part of model ﬁtting rather

than pre-speciﬁed along with the matrix of pairwise distances between sequences.

Overall, our analysis of the MVAD data provides a more granular view of the cohort of

Northern Irish youths than previously available, supplemented by interpretable parameter

estimates achieved through a coherent model-based framework. The MEDseq model family

appears promising from the perspective of reconciling the distance-based and model-based

cultures within the SA community. Indeed, the results for the MVAD data are encouraging

in this respect; they seem to suggest that the unconstrained precision parameter settings

adequately address the misalignment issues inherent in the use of the Hamming distance.

It remains to be seen if this holds for more turbulent sequences, e.g. those related to

employment activities tracked over longer periods.

Acknowledgements

This publication has emanated from research conducted with the ﬁnancial support of Science

Foundation Ireland under Grant number SFI/12/RC/2289_P2. Additionally, R. Piccarreta

acknowledges the support from MIUR-PRIN 2017 project 20177BRJXS. For the purpose of Open

Access, the authors have applied a CC BY public copyright licence to any Author Accepted

Manuscript version arising from this submission. The authors also thank Matthias Studer and

members of the Sequence Analysis Association for helpful discussions.


References

Aassve, A., F. Billari, and R. Piccarreta (2007). Strings of adulthood: a sequence analysis of young British women’s weekly work-family trajectories. European Journal of Population 23(3), 369–388.

Abbott, A. and J. Forrest (1986). Optimal matching methods for historical sequences. Journal of Interdisciplinary History 16(3), 471–494.

Abbott, A. and A. Hrycak (1990). Measuring resemblance in sequence data: an optimal matching analysis of musician’s careers. American Journal of Sociology 96(1), 145–185.

Agresti, A. (2002). Categorical Data Analysis. New York: John Wiley & Sons.

Airoldi, E. M., D. M. Blei, E. A. Erosheva, and S. E. Fienberg (2014). Handbook of Mixed Membership Models and Their Applications. New York, USA: Chapman and Hall/CRC Press.

Armstrong, D., D. Istance, R. Loudon, S. McCready, G. Rees, and D. Wilson (1997). ‘Status 0’: a socio-economic study of young people on the margin. Belfast: Training and Employment Agency, Northern Ireland Economic Research Centre.

Bakk, Z. and J. Kuha (2018). Two-step estimation of models between latent classes and external variables. Psychometrika 83(4), 871–892.

Banfield, J. and A. E. Raftery (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics 49(3), 803–821.

Billari, F. C. (2001). The analysis of early life courses: complex description of the transition to adulthood. Journal of Population Research 18(2), 119–142.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. New York: Springer.

Böhning, D., E. Dietz, R. Schaub, P. Schlattmann, and B. G. Lindsay (1994). The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family. Annals of the Institute of Statistical Mathematics 46(2), 373–388.

Bouveyron, C., G. Celeux, T. B. Murphy, and A. E. Raftery (2019). Model-Based Clustering and Classification for Data Science: With Applications in R. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge: Cambridge University Press.

Celeux, G. and G. Govaert (1992). A classification EM algorithm for clustering and two stochastic versions. Computational Statistics and Data Analysis 14(3), 315–332.

Chambers, R. L. and C. J. Skinner (2003). Analysis of Survey Data. Chichester: John Wiley & Sons.

Dayton, C. M. and G. B. Macready (1988). Concomitant-variable latent-class models. Journal of the American Statistical Association 83(401), 173–178.

de Amorim, R. C. (2015). Feature relevance in Ward’s hierarchical clustering using the $L_p$ norm. Journal of Classification 32(1), 46–62.

Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 39(1), 1–38.

D’Urso, P. (2016). Fuzzy clustering. In C. Hennig, M. Meila, F. Murtagh, and R. Rocci (Eds.), Handbook of Cluster Analysis, Chapter 24, pp. 245–575. New York: Chapman and Hall.

D’Urso, P. and R. Massari (2013). Fuzzy clustering of human activity patterns. Fuzzy Sets and Systems 215, 29–54.

Elzinga, C. H. (2007). Sequence analysis: metric representations of categorical time series. Technical report, Department of Social Science Research Methods, Vrije Universiteit, Amsterdam.


Gabadinho, A., G. Ritschard, N. S. Müller, and M. Studer (2011). Analyzing and visualizing state

sequences in Rwith TraMineR. Journal of Statistical Software 40(4), 1–37. 5,24

García-Magariños, M. and J. A. Vilar (2015). A framework for dissimilarity-based partitioning

clustering of categorical time series. Data Mining and Know ledge Discovery 29 (2), 466–502. 24

Gormley, I. C. and S. Frühwirth-Schnatter (2019). Mixtures of experts models. In S. Frühwirth-

Schnatter, G. Celeux, and C. P. Robert (Eds.), Handbook of Mixture Analysis, Chapter 12, pp.

279–316. London: Chapman and Hall/CRC Press. 3,9

Govaert, G. and M. Nadif (2013). Co-Clustering: Models, Algorithms and Applications. London:

ISTE-Wiley. 23

Gower, J. C. (1971). A general coeﬃcient of similarity and some of its properties. Biometrics 27 (4),

857–871. 8

Hahsler, M., K. Hornik, and C. Buchta (2008). Getting things in order: an introduction to the R

package seriation. Journal of Statistical Software 25 (3), 1–34. 18

Hamming, R. W. (1950). Error detecting and error correcting codes. The Bell System Technical

Journal 29 (2), 147–160. 3

Helske, S. and J. Helske (2019). Mixture hidden Markov models for sequence data: the seqHMM

package in R. Journal of Statistical Software 88 (3), 1–32. 18

Helske, S., J. Helske, and M. Eerola (2016). Analysing complex life sequence data with hidden

Markov modeling. In G. Ritschard and M. Studer (Eds.), LaCOSA II: Proceedings of Interna-

tional Conference on Sequence Analysis and Related Methods, pp. 209–240. 4

Hoos, H. and T. Stützle (2004). Stochastic Local Search: Foundations and Applications. San

Francisco, CA, USA: Morgan Kaufmann Publishers Inc. 12

Huang, Z. (1997). A fast clustering algorithm to cluster very large categorical data sets in data

mining. In H. Lu, H. Motoda, and H. Luu (Eds.), KDD: Techniques and Applications, pp. 21–34.

Singapore: World Scientiﬁc. 10,13

Irurozki, E., B. Calvo, and J. A. Lozano (2019). Mallows and generalized Mallows model for

matchings. Bernoulli 25 (2), 1160–1188. 7

Jacobs, R. A., M. I. Jordan, S. J. Nowlan, and G. E. Hinton (1991). Adaptive mixtures of local

experts. Neural Computation 3 (1), 79–87. 9

Kaufman, L. and P. J. Rousseeuw (1990). Partitioning around medoids (program PAM). In

L. Kaufman and P. J. Rousseeuw (Eds.), Finding Groups in Data: An Introduction to Cluster

Analysis, Chapter 2, pp. 68–125. New York: John Wiley & Sons. 4,12

Lazarsfeld, P. F. and N. W. Henry (1968). Latent Structure Analysis. Boston: Houghton Miﬄin. 3

Lesnard, L. (2010). Setting cost in optimal matching to uncover contemporaneous socio-temporal

patterns. Sociological Methods & Research 38 (3), 389–419. 7

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals.

Soviet Physics Doklady 10 (8), 707–710. 2

Linzer, D. A. and J. B. Lewis (2011). poLCA: an R package for polytomous variable latent class

analysis. Journal of Statistical Software 42 (10), 1–29. 17

Mallows, C. L. (1957). Non-null ranking models. Biometrika 44 (1/2), 114–130. 6

McVicar, D. (2000). Status 0 four years on: young people and social exclusion in Northern Ireland.

Labour Market Bulletin 14, 114–119. 1,4

McVicar, D. and M. Anyadike-Danes (2002). Predicting successful and unsuccessful transitions

from school to work by using sequence methods. Journal of the Royal Statistical Society: Series

A (Statistics in Society) 165 (2), 317–334. 1,2,4,10,17,23,25

Melnykov, V. (2016a). Model-based biclustering of clickstream data. Computational Statistics and

Data Analysis 93 (C), 31–45. 4,23

Melnykov, V. (2016b). ClickClust: an R package for model-based clustering of categorical sequences.

Journal of Statistical Software 74 (9), 1–34. 17

Menardi, G. (2011). Density-based silhouette diagnostics for clustering methods. Statistics and

Computing 21 (3), 295–308. 14

Meng, X. L. and D. B. Rubin (1993). Maximum likelihood estimation via the ECM algorithm: a

general framework. Biometrika 80 (2), 267–278. 11

Muñoz-Bullón, F. and M. A. Malo (2003). Employment status mobility from a life-cycle perspective:

a sequence analysis of work-histories in the BHPS. Demographic Research 9 (7), 119–162. 2,4

Murphy, K. and T. B. Murphy (2020). Gaussian parsimonious clustering models with covariates

and a noise component. Advances in Data Analysis and Classiﬁcation 14 (2), 293–325. 3,6,10,

14

Murphy, K., T. B. Murphy, R. Piccarreta, and I. C. Gormley (2021). MEDseq: mixtures of

exponential-distance models with covariates. R package version 1.3.0. 4,15

Murphy, T. B. and D. Martin (2003). Mixtures of distance-based models for ranking data. Com-

putational Statistics and Data Analysis 41 (3–4), 645–655. 6

O’Hagan, A., T. B. Murphy, L. Scrucca, and I. C. Gormley (2019). Investigation of parameter

uncertainty in clustering using a Gaussian mixture model via jackknife, bootstrap and weighted

likelihood bootstrap. Computational Statistics 34 (4), 1779–1813. 12

Pamminger, C. and S. Frühwirth-Schnatter (2010). Model-based clustering of categorical time

series. Bayesian Analysis 5 (2), 345–368. 4

Piccarreta, R. and M. Studer (2019). Holistic analysis of the life course: methodological challenges

and new perspectives. Advances in Life Course Research 41, 100251. 4,10

R Core Team (2021). R: a language and environment for statistical computing. Vienna, Austria: R

Foundation for Statistical Computing. 4

Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster

analysis. Journal of Computational and Applied Mathematics 20, 53–65. 14

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6 (2), 461–464.

14

Studer, M. (2013). WeightedCluster library manual: a practical guide to creating typologies of

trajectories in the social sciences with R. Technical report, LIVES Working Papers 24. 13,14,

17

Studer, M. (2018). Divisive property-based and fuzzy clustering for sequence analysis. In

G. Ritschard and M. Studer (Eds.), Sequence Analysis and Related Approaches: Innovative

Methods and Applications, pp. 223–239. Cham: Springer International Publishing. 17

Studer, M. and G. Ritschard (2016). What matters in diﬀerences between life trajectories: a

comparative review of sequence dissimilarity measures. Journal of the Royal Statistical Society:

Series A (Statistics in Society) 179 (2), 481–511. 2,24,25

Ward, J. (1963). Hierarchical grouping to optimize an objective function. Journal of the American

Statistical Association 58 (301), 236–244. 2

Wu, L. L. (2000). Some comments on sequence analysis and optimal matching methods in sociology:

review and prospect. Sociological Methods & Research 29 (1), 41–64. 4

Xu, C., J. Chen, and H. Mantell (2013). Pseudo-likelihood-based Bayesian information criterion

for variable selection in survey data. Survey Methodology 39 (2), 303–322. 8,14

Appendices

Appendix A The MEDseq Model Family: Parameter Counts

The models in the MEDseq family differ only in their treatment of the precision parameters, which differentiate the Hamming distance and the weighted variants thereof. The BIC is used to choose between the 8 model types, identify the optimal $G$, and guide the inclusion of gating covariates. Table A.1 summarises the number of free parameters $k$ in the BIC penalty term under each MEDseq model type, in order to demonstrate the increasing level of complexity in moving from the most parsimonious CCN model to the most heavily parameterised UU model.

The number of parameters contributing to each $\hat{\theta}_g$ estimate notably depends on the number of states represented across all cases in each time point. Note also that parameters relating to $\hat{\theta}_{g,t}$ corresponding to estimated precision parameters are counted, while those associated with fixed precision parameter values of 0 are not counted. Similarly, precision parameters estimated as 0 are counted, but precision parameters fixed at 0 associated with the noise component are not.

The number of gating network parameters is not accounted for in Table A.1. When covariates are included, there are $(r+1) \times (G-1)$ or $(r+1) \times (G-2) + 1$ extra parameters under the GN and NGN settings, respectively, where $r+1$ is the dimension of the associated design matrix, including the intercept term. When $\tau$ is not covariate-dependent, there are $G-1$ extra parameters when $\tau$ is unconstrained, or only 1 extra parameter if $\tau$ is constrained and the model includes a noise component, in which case $\tau_0$ is allowed to vary.
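The gating-network counts described above can be sketched in a few lines of Python. This is only an illustration: the function name and its arguments are hypothetical and are not part of the MEDseq package, $G$ is taken to include the noise component when one is present, and the value of 0 returned for constrained mixing proportions without a noise component is an assumption that the text implies but does not state.

```python
def gating_param_count(G, r=0, covariates=False, noise=False,
                       noise_gated=True, equal_tau=False):
    """Count gating-network parameters (hypothetical helper).

    G           : total number of components, including any noise component
    r           : number of covariate columns, excluding the intercept
    covariates  : whether the mixing proportions depend on covariates
    noise       : whether the model has a noise component
    noise_gated : True for the GN setting, False for the NGN setting
    equal_tau   : whether tau is constrained when there are no covariates
    """
    if covariates:
        if noise and not noise_gated:
            # NGN: covariates act on the G - 1 non-noise components only,
            # plus one free parameter for the constant tau_0
            return (r + 1) * (G - 2) + 1
        # GN (or no noise component)
        return (r + 1) * (G - 1)
    if equal_tau:
        # Only tau_0 varies when a noise component is present;
        # returning 0 otherwise is an assumption of this sketch
        return 1 if noise else 0
    return G - 1


# Example: G = 3 components and r = 2 covariates
print(gating_param_count(3, r=2, covariates=True))               # GN setting
print(gating_param_count(3, r=2, covariates=True, noise=True,
                         noise_gated=False))                     # NGN setting
```

These counts would be added to those in Table A.1 to obtain the total $k$ used in the BIC penalty.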

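Table A.1: Number of estimated parameters under each MEDseq model type. Models with names ending with the letter N, indicating the presence of a noise component for which the single precision parameter is fixed to 0, behave like the corresponding model without this component for all other components. Thus, $\lambda$ and all subscript variants thereof refer here to the non-noise components only.

| Model | Precision | $\lambda_g$ (Clusters) | $\lambda_t$ (Time Points) | Number of Parameters: Central Sequence(s) | Number of Parameters: Precision |
|---|---|---|---|---|---|
| CC | $\lambda_{g,t} = \lambda$ | Constrained | Constrained | $G \sum_{t=1}^{T} (v_t - 1)$ | $1$ |
| CCN | $\lambda_{g,t} = \lambda$ | Constrained | Constrained | $(G-1) \sum_{t=1}^{T} (v_t - 1)$ | $\mathbb{1}(G > 1)$ |
| UC | $\lambda_{g,t} = \lambda_g$ | Unconstrained | Constrained | $G \sum_{t=1}^{T} (v_t - 1)$ | $G$ |
| UCN | $\lambda_{g,t} = \lambda_g$ | Unconstrained | Constrained | $(G-1) \sum_{t=1}^{T} (v_t - 1)$ | $G - 1$ |
| CU | $\lambda_{g,t} = \lambda_t$ | Constrained | Unconstrained | $G \sum_{t=1}^{T} (v_t - 1)$ | $T$ |
| CUN | $\lambda_{g,t} = \lambda_t$ | Constrained | Unconstrained | $(G-1) \sum_{t=1}^{T} (v_t - 1)$ | $\mathbb{1}(G > 1)\, T$ |
| UU | $\lambda_{g,t} = \lambda_{g,t}$ | Unconstrained | Unconstrained | $G \sum_{t=1}^{T} (v_t - 1)$ | $GT$ |
| UUN | $\lambda_{g,t} = \lambda_{g,t}$ | Unconstrained | Unconstrained | $(G-1) \sum_{t=1}^{T} (v_t - 1)$ | $(G-1)\, T$ |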
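As a rough illustration of how the entries of Table A.1 translate into counts, the following Python sketch evaluates both columns for a given model type. The function name and argument names are hypothetical, `v_t` is assumed to be a list holding the number of states observed at each time point, and $G$ is assumed to include the noise component for models ending in N.

```python
def medseq_param_count(model, G, v_t):
    """Return (central-sequence parameters, precision parameters)
    following Table A.1 (hypothetical helper, not the MEDseq API)."""
    valid = {"CC", "UC", "CU", "UU", "CCN", "UCN", "CUN", "UUN"}
    if model not in valid:
        raise ValueError(f"unknown model type: {model}")
    T = len(v_t)
    # Number of non-noise components, each with its own central sequence
    K = G - 1 if model.endswith("N") else G
    central = K * sum(v - 1 for v in v_t)
    across_clusters, across_time = model[0], model[1]
    # Constrained across clusters: one shared value if any non-noise
    # component exists; unconstrained: one value per non-noise component
    n_cluster = K if across_clusters == "U" else (1 if K > 0 else 0)
    # Unconstrained across time points multiplies the count by T
    n_time = T if across_time == "U" else 1
    return central, n_cluster * n_time


# Example: G = 3 components, T = 3 time points, 4 states observed at each
for name in ("CCN", "CC", "UC", "CU", "UU"):
    print(name, medseq_param_count(name, 3, [4, 4, 4]))
```

Summing the two returned values gives the Table A.1 contribution to $k$; gating network parameters are counted separately.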

Appendix B Estimating MEDseq Precision Parameters

For fixed $\theta$, the PMF in (1) belongs to the exponential family with natural parameter $\lambda$. Thus, under any distance metric, the method of moments estimate of $\lambda$ is equal to the MLE. Hence, with $\hat{\theta}$ already estimated as per Section 4.1.2, $\hat{\lambda}$ ensures that the expected distance of observations from $\hat{\theta}$ is equal to the observed average distance from $\hat{\theta}$, since the solution of

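$$
\frac{\partial \ell(\lambda \mid \mathbf{S}, \hat{\theta}, d)}{n\, \partial \lambda} = \frac{\sum_{\sigma \in \mathcal{S}_v^T} d(\sigma, \hat{\theta}) \exp\left(-\lambda\, d(\sigma, \hat{\theta})\right)}{\sum_{\sigma \in \mathcal{S}_v^T} \exp\left(-\lambda\, d(\sigma, \hat{\theta})\right)} - \frac{1}{n} \sum_{i=1}^{n} d(s_i, \hat{\theta}) = 0
$$

implies

$$
E_{\lambda}\left(d(S, \hat{\theta})\right) = \frac{\sum_{\sigma \in \mathcal{S}_v^T} d(\sigma, \hat{\theta}) \exp\left(-\lambda\, d(\sigma, \hat{\theta})\right)}{\sum_{\sigma \in \mathcal{S}_v^T} \exp\left(-\lambda\, d(\sigma, \hat{\theta})\right)} = d(\mathbf{S}, \hat{\theta}) = \frac{1}{n} \sum_{i=1}^{n} d(s_i, \hat{\theta}). \tag{7}
$$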

Under the Hamming distance, the value of the expectation in (7) is the same for any arbitrary reference sequence in place of $\hat{\theta}$. As the denominator in (7), which under the Hamming distance corresponds to the normalising constant in (3), is a function of $\lambda$, it is crucial that it exists in closed form in order to estimate the precision parameter. Hence, with known $\hat{\theta}$, the MLE for $\lambda$ for an unweighted single-component CC model can be obtained as follows:

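$$
\ell(\lambda \mid \mathbf{S}, \hat{\theta}, d_H) = -\lambda n\, d_H(\mathbf{S}, \hat{\theta}) - nT \log\left((v-1)e^{-\lambda} + 1\right),
$$

$$
\frac{\partial \ell(\cdot)}{\partial \lambda} = \frac{nT(v-1)}{e^{\lambda} + (v-1)} - n\, d_H(\mathbf{S}, \hat{\theta}),
$$

$$
\therefore \hat{\lambda} = \log\left((v-1)\left(\frac{T}{d_H(\mathbf{S}, \hat{\theta})} - 1\right)\right),
$$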

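which notably relies on the inverse of the average Hamming distance normalised by the sequence length $T$. However, this can yield a negative value for $\hat{\lambda}$. Recall that we only consider $\lambda \geq 0$. Since all distances are non-negative and typically not identical, $\partial \ell(\cdot) / \partial \lambda$ is negative $\forall\, \lambda > 0$ in the case where the sufficient statistic $d_H(\mathbf{S}, \hat{\theta}) > v^{-1} T (v-1)$, with $\lim_{\lambda \to \infty} \partial \ell(\cdot) / \partial \lambda = -n\, d_H(\mathbf{S}, \hat{\theta})$. Thus,

$$
\hat{\lambda} = \max\left(0, \log\left((v-1)\left(\frac{T}{d_H(\mathbf{S}, \hat{\theta})} - 1\right)\right)\right).
$$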
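The closed-form estimate above, and the moment-matching property in (7), can be checked numerically on a toy example. The Python sketch below is illustrative only: the function names and the toy data are hypothetical, states are coded as integers $0, \ldots, v-1$, and returning infinity when every sequence coincides with the central sequence is an assumption of this sketch. The brute-force enumeration over all $v^T$ sequences is feasible only for very small $v$ and $T$.

```python
import itertools
import math


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def lambda_mle(seqs, theta, v):
    """Closed-form precision estimate for an unweighted
    single-component CC model under the Hamming distance."""
    T = len(theta)
    d_bar = sum(hamming(s, theta) for s in seqs) / len(seqs)
    if d_bar == 0:
        return math.inf          # assumption: all sequences equal theta
    if d_bar >= T * (v - 1) / v:
        return 0.0               # boundary case, lambda >= 0 is enforced
    return math.log((v - 1) * (T / d_bar - 1))


def expected_distance(lam, theta, v):
    """Brute-force expected Hamming distance from theta and the
    normalising constant, summing over all v^T sequences."""
    num = den = 0.0
    for sigma in itertools.product(range(v), repeat=len(theta)):
        d = hamming(sigma, theta)
        weight = math.exp(-lam * d)
        num += d * weight
        den += weight
    return num / den, den


# Toy data with v = 3 states and T = 4 time points
seqs = [(0, 0, 1, 2), (0, 0, 1, 1), (0, 1, 1, 2), (0, 0, 1, 2)]
theta = (0, 0, 1, 2)
lam_hat = lambda_mle(seqs, theta, v=3)
mean_d, norm_const = expected_distance(lam_hat, theta, v=3)
print(lam_hat, mean_d, norm_const)
```

In this toy example the observed average distance is 0.5, and the enumerated expectation at $\hat{\lambda}$ matches it up to floating-point error. The enumerated normalising constant likewise agrees with $\left((v-1)e^{-\hat{\lambda}} + 1\right)^T$.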

When $d_H(\mathbf{S}, \hat{\theta}) < v^{-1} T (v-1)$, such that $\hat{\lambda} > 0$, the identity $\log(c(a/b - 1)) = \log(c) + \log(a - b) - \log(b)$ is used for numerical stability, otherwise $\hat{\lambda}$ is set to 0. When sampling weights are included, following the same steps as above yields the corresponding estimate

$$
\hat{\lambda} = \max\left(0, \log(v-1) + \log\left(\frac{Tn}{\sum_{i=1}^{n} w_i\, d_H(s_i, \hat{\theta})} - 1\right)\right). \tag{8}
$$
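A minimal Python sketch of the weighted estimate in (8), using the stated logarithmic identity, might look as follows. The function names are hypothetical, and the sketch assumes that the sampling weights are normalised to sum to $n$ (it rescales them accordingly), which is what the term $Tn$ in (8) suggests. Returning infinity when the weighted distance sum is zero is likewise an assumption of this sketch.

```python
import math


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def lambda_weighted(seqs, weights, theta, v):
    """Weighted precision estimate as in (8), single-component CC model."""
    n, T = len(seqs), len(theta)
    scale = n / sum(weights)     # assumption: weights normalised to sum to n
    b = sum(scale * w * hamming(s, theta) for s, w in zip(seqs, weights))
    a = T * n
    if b == 0:
        return math.inf          # assumption: no weighted mismatches at all
    if b / n >= T * (v - 1) / v:
        return 0.0               # estimate would be non-positive
    # log(c (a / b - 1)) = log(c) + log(a - b) - log(b), with c = v - 1
    return math.log(v - 1) + math.log(a - b) - math.log(b)


seqs = [(0, 0, 1, 2), (0, 0, 1, 1), (0, 1, 1, 2), (0, 0, 1, 2)]
theta = (0, 0, 1, 2)
print(lambda_weighted(seqs, [1, 1, 1, 1], theta, v=3))  # unit weights
print(lambda_weighted(seqs, [1, 3, 1, 1], theta, v=3))  # unequal weights
```

With unit weights the function reduces to the unweighted estimate. Up-weighting a sequence that differs from $\hat{\theta}$ increases the weighted average distance and therefore lowers the estimated precision.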

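The ECM algorithm is employed when $G > 1$, in which case the CM-step for $\hat{\lambda}^{(m+1)}$ under a CC MEDseq mixture model with sampling weights is given by

$$
\frac{\partial \ell_c^w(\cdot)}{\partial \lambda} = \frac{T(v-1) \sum_{i=1}^{n} \sum_{g=1}^{G} z_{i,g} w_i}{e^{\lambda} + (v-1)} - \sum_{i=1}^{n} \sum_{g=1}^{G} z_{i,g} w_i\, d_H(s_i, \hat{\theta}_g),
$$

$$
\therefore \hat{\lambda}^{(m+1)} = \max\left(0, \log(v-1) + \log\left(\frac{Tn}{\sum_{i=1}^{n} \sum_{g=1}^{G} \hat{z}_{i,g}^{(m+1)} w_i\, d_H\left(s_i, \hat{\theta}_g^{(m+1)}\right)} - 1\right)\right). \tag{9}
$$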

As per (8), this requires the current estimate of each component’s central sequence. When there are no sampling weights, one need only drop the $w_i$ terms from (8) and (9) to estimate the precision parameters of unweighted MEDseq models. While $\hat{\lambda}$ can potentially be estimated as zero, the inclusion of a noise component in the CCN, UCN, CUN, and UUN models makes this explicit, by restricting one cluster to have $\lambda_{g,t} = 0\ \forall\, t = 1, \ldots, T$.

However, when $\hat{\lambda}_{g,t}$ is estimated as zero rather than fixed to zero, the corresponding $\theta_{g,t}$ parameter must be estimated, as it affects the likelihood indirectly through its role in estimating the precision parameter(s). In particular, taking the UU model as an example, all state values in the $t$-th sequence position with non-zero $\hat{z}_{i,g}^{(m+1)}$ are identical to $\hat{\theta}_{g,t}^{(m+1)}$ when the corresponding denominator in Table B.2 evaluates to zero, such that $\hat{\lambda}_{g,t}^{(m+1)} \to \infty$.
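The CM-step in (9) for the CC model, together with the degenerate case in which the denominator vanishes, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the MEDseq implementation: the function names are hypothetical, `z` is assumed to be an $n \times G$ list of responsibilities, and the total weight $\sum_i \sum_g z_{i,g} w_i$ is computed explicitly. This total equals $n$, as in (9), when the weights sum to $n$ and each row of `z` sums to one.

```python
import math


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def cm_step_lambda_cc(seqs, weights, z, thetas, v):
    """CM-step update of the single precision parameter of a CC mixture.

    z[i][g]   : current responsibility of component g for sequence i
    thetas[g] : current central sequence of component g
    """
    T = len(thetas[0])
    total_weight = 0.0    # sum of z_ig * w_i
    weighted_dist = 0.0   # sum of z_ig * w_i * d_H(s_i, theta_g)
    for s, w, z_i in zip(seqs, weights, z):
        for z_ig, theta_g in zip(z_i, thetas):
            total_weight += z_ig * w
            weighted_dist += z_ig * w * hamming(s, theta_g)
    if weighted_dist == 0:
        # Every sequence with non-zero responsibility matches its centre
        return math.inf
    if weighted_dist / total_weight >= T * (v - 1) / v:
        return 0.0
    return (math.log(v - 1)
            + math.log(T * total_weight - weighted_dist)
            - math.log(weighted_dist))


# Toy example: two components with hard assignments and unit weights
seqs = [(0, 0, 1, 2), (0, 0, 1, 1), (2, 2, 0, 0), (2, 1, 0, 0)]
thetas = [(0, 0, 1, 2), (2, 2, 0, 0)]
z = [[1, 0], [1, 0], [0, 1], [0, 1]]
print(cm_step_lambda_cc(seqs, [1, 1, 1, 1], z, thetas, v=3))
```

The infinite return value mirrors the situation described above, where a zero denominator drives the precision estimate to infinity.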

Expressions for the weighted complete data pseudo likelihood functions for all model types in the MEDseq family are given in Table B.1. All models are written here as though gating network covariates $x_i$ are included. However, the gating networks of models with a noise component are written in the NGN form employed by the optimal model identified in Section 5.1 rather than the GN form, i.e. it is assumed that $\tau_0$ is constant, meaning the covariates do not affect the probability of belonging to the noise component (see Section 3.4).

Table B.2 outlines the corresponding CM-steps for the precision parameter(s). All derivations closely follow the same steps as in (9) for the CC model and the normalised sampling weights are accounted for in all cases. These formulas can be simplified somewhat for unweighted models and/or models without gating covariates. Recall that the first letter of the model name denotes whether the precision parameters are constrained/unconstrained across clusters, the second denotes the same across time points (i.e. sequence positions), and model names ending with the letter N include a noise component.

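Table B.1: Weighted complete data pseudo likelihood functions for all MEDseq model types, which differ according to the constraints imposed on the precision parameters across clusters and/or time points. The expressions for the various weighted Hamming distance metric variants employed, and the associated normalising constants, are given in full.

| Model | Weighted Complete Data Pseudo Likelihood |
|---|---|
| CC | $\prod_{i=1}^{n} \prod_{g=1}^{G} \left[ \tau_g(x_i) \dfrac{\exp\left(-\lambda \sum_{t=1}^{T} \mathbb{1}(s_{i,t} \neq \theta_{g,t})\right)}{\left((v-1)e^{-\lambda} + 1\right)^{T}} \right]^{z_{i,g} w_i}$ |
| UC | $\prod_{i=1}^{n} \prod_{g=1}^{G} \left[ \tau_g(x_i) \dfrac{\exp\left(-\lambda_g \sum_{t=1}^{T} \mathbb{1}(s_{i,t} \neq \theta_{g,t})\right)}{\left((v-1)e^{-\lambda_g} + 1\right)^{T}} \right]^{z_{i,g} w_i}$ |
| CU | $\prod_{i=1}^{n} \prod_{g=1}^{G} \left[ \tau_g(x_i) \dfrac{\exp\left(-\sum_{t=1}^{T} \lambda_t \mathbb{1}(s_{i,t} \neq \theta_{g,t})\right)}{\prod_{t=1}^{T} \left((v-1)e^{-\lambda_t} + 1\right)} \right]^{z_{i,g} w_i}$ |
| UU | $\prod_{i=1}^{n} \prod_{g=1}^{G} \left[ \tau_g(x_i) \dfrac{\exp\left(-\sum_{t=1}^{T} \lambda_{g,t} \mathbb{1}(s_{i,t} \neq \theta_{g,t})\right)}{\prod_{t=1}^{T} \left((v-1)e^{-\lambda_{g,t}} + 1\right)} \right]^{z_{i,g} w_i}$ |
| CCN | $\prod_{i=1}^{n} \left[ \prod_{g=1}^{G-1} \left( \tau_g(x_i) \dfrac{\exp\left(-\lambda \sum_{t=1}^{T} \mathbb{1}(s_{i,t} \neq \theta_{g,t})\right)}{\left((v-1)e^{-\lambda} + 1\right)^{T}} \right)^{z_{i,g}} \left( \dfrac{\tau_0}{v^{T}} \right)^{z_{i,0}} \right]^{w_i}$ |
| UCN | $\prod_{i=1}^{n} \left[ \prod_{g=1}^{G-1} \left( \tau_g(x_i) \dfrac{\exp\left(-\lambda_g \sum_{t=1}^{T} \mathbb{1}(s_{i,t} \neq \theta_{g,t})\right)}{\left((v-1)e^{-\lambda_g} + 1\right)^{T}} \right)^{z_{i,g}} \left( \dfrac{\tau_0}{v^{T}} \right)^{z_{i,0}} \right]^{w_i}$ |
| CUN | $\prod_{i=1}^{n} \left[ \prod_{g=1}^{G-1} \left( \tau_g(x_i) \dfrac{\exp\left(-\sum_{t=1}^{T} \lambda_t \mathbb{1}(s_{i,t} \neq \theta_{g,t})\right)}{\prod_{t=1}^{T} \left((v-1)e^{-\lambda_t} + 1\right)} \right)^{z_{i,g}} \left( \dfrac{\tau_0}{v^{T}} \right)^{z_{i,0}} \right]^{w_i}$ |
| UUN | $\prod_{i=1}^{n} \left[ \prod_{g=1}^{G-1} \left( \tau_g(x_i) \dfrac{\exp\left(-\sum_{t=1}^{T} \lambda_{g,t} \mathbb{1}(s_{i,t} \neq \theta_{g,t})\right)}{\prod_{t=1}^{T} \left((v-1)e^{-\lambda_{g,t}} + 1\right)} \right)^{z_{i,g}} \left( \dfrac{\tau_0}{v^{T}} \right)^{z_{i,0}} \right]^{w_i}$ |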
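For illustration, the logarithm of the expressions in Table B.1 can be evaluated with a single Python function written in the most general (UU/UUN) form, from which the constrained models follow by tying the entries of the precision matrix. Everything named here is hypothetical: `tau[i][g]` is assumed to hold the already-evaluated gating probabilities $\tau_g(x_i)$ of the non-noise components, `lam[g][t]` the precision parameters, and `z0`/`tau0` are only supplied when a noise component is present.

```python
import math


def pseudo_loglik(seqs, weights, z, tau, thetas, lam, v, z0=None, tau0=None):
    """Weighted complete-data pseudo log-likelihood (log of Table B.1).

    z[i][g], tau[i][g] : responsibilities and gating probabilities of the
                         non-noise components
    lam[g][t]          : precision of component g at time point t
    z0[i], tau0        : noise-component responsibility and proportion
    """
    total = 0.0
    for i, s in enumerate(seqs):
        T = len(s)
        contrib = 0.0
        for g, theta in enumerate(thetas):
            if z[i][g] == 0:
                continue  # zero exponent, so the factor equals one
            log_f = 0.0
            for t in range(T):
                # Mismatch penalty minus the per-position log normaliser
                log_f -= lam[g][t] * (s[t] != theta[t])
                log_f -= math.log((v - 1) * math.exp(-lam[g][t]) + 1)
            contrib += z[i][g] * (math.log(tau[i][g]) + log_f)
        if z0 is not None and z0[i] > 0:
            # Noise component: uniform over all v^T sequences
            contrib += z0[i] * (math.log(tau0) - T * math.log(v))
        total += weights[i] * contrib
    return total


# CC model with one component: tie all precisions to a common value
lam_cc = math.log(14)
print(pseudo_loglik([(0, 0, 1, 1)], [1.0], [[1.0]], [[1.0]],
                    [(0, 0, 1, 2)], [[lam_cc] * 4], v=3))
```

Setting `lam[g][t]` to a common value recovers the CC row, to a per-component value the UC row, and to a per-position value the CU row. Working on the log scale avoids underflow from the products over $i$, $g$ and $t$.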
