Exploring potential unknown subgroups in your data:
An introduction to finite mixture models for applied linguistics
Tove Larsson1 & Gregory R. Hancock2
1Northern Arizona University; 2University of Maryland
Abstract
This article provides an introduction to finite mixture models in an applied linguistics context.
Mixture models can be used to address questions relating to whether there are unknown
subgroups in one’s data, and if so, which participants/texts are likely to belong to which
subgroup. Put differently, the technique enables us to assess whether our data might come
from a heterogeneous population that is made up of latent classes. As such, mixture models
offer a model-based framework for answering research questions that the field has previously
either addressed with nonparametric heuristic techniques (e.g., cluster analysis) or left
entirely unanswered. An example of such a research question would be, ‘Does the
treatment work equally well for all the participants, or are there unknown subgroups in the
data that respond differently to the treatment?’ The article starts by introducing univariate
mixture models and then broadens the scope to cover bivariate and multivariate mixture
models. It also discusses some known pitfalls of the technique and how one might ameliorate
these in practice.
Keywords: Mixture modeling, latent classes, data heterogeneity, underlying groupings,
population subgroups.
1 Introduction
In applied linguistics studies, including corpus linguistics, it is not uncommon for researchers
to compare survey responses or frequencies of occurrence of linguistic features across groups
that have been determined a priori (e.g., treatment vs. control, L1 vs. L2 speakers). In making
such comparisons, we typically assume that any differences across groups are specifically
because of the grouping variable we are investigating (e.g., treatment, language background).
However, there are often systematic differences among cases under study that are not known
a priori, having little or nothing to do with known grouping variables. Said differently, our
data may contain sources of heterogeneity indicative of previously unknown subgroups, so-
called latent classes, which could be indicative, for example, of a treatment being more or less
effective for different subgroups, or of subgroups of texts with different profiles of linguistic
characteristics. Note that we use latent to mean unobserved, such that should such classes in
fact exist, we do not know the specific membership of any observation (see Bollen, 2002, for
a general discussion of latent variables, and Larsson, Plonsky, & Hancock, 2022, for a
discussion of latent variables in an applied linguistics context).
In order to assess whether data come from such a heterogeneous population (and if so,
what the characteristics of those underlying groupings may be), we can use a method related
to the broad structural equation modeling framework: finite mixture models (see, e.g.,
Hancock & Schoonen, 2015; Pastor & Gagné, 2013). Finite mixture models (henceforth
mixture models) enable us to assess whether our data appear to come from a heterogeneous
population consisting of a finite number of latent classes, where those classes may differ in
key parameters ranging from the fairly simple (e.g., means, variances) to the more complex
(e.g., regression slopes, factor analytic model structure). The technique is commonly
applied in fields such as psychology and epidemiology, but has so far rarely been used in applied
linguistics research (though see Yu, Lowie, & Peng, 2022, and Lou, Chaffee, & Noels, 2022).
In this paper, we introduce mixture modeling and illustrate its usefulness for applied
linguistics researchers. In Section 2, we offer a non-technical explanation of the method and
discuss key differences from other methods. Sections 3 and 4 subsequently offer step-by-step
guidance on how to apply mixture modeling to univariate and multivariate applied linguistics
data, respectively. Section 5 discusses some practical issues that may emerge in this
framework, and Section 6 concludes the paper.
2 An introduction to mixture models
A common study design in applied linguistics is one that includes a comparison between two
or more predetermined groups. We may ask, for example, whether there is a notable
(statistical and practical) difference between treatment and control groups after an
intervention, or whether language learners of a given language background differ on some
outcome variable from learners of a different language background. In such designs, we
commonly focus on testing hypotheses regarding means, employing such increasingly general
frameworks as t-tests, analysis of variance (ANOVA), and the general linear model (GLM).
However, what if we wish to leave such questions open to other potential groupings,
ones for which we may have no a priori grouping variable? For example, unbeknownst to the
researcher, the data at hand may have come from multiple populations differing in central
tendency (higher and lower means), or in dispersion (homogeneous vs. heterogeneous
variance), or in variable relations (stronger vs. weaker correlations or regression slopes).
Mixture models can help us to identify such unknown groupings in our data (see, e.g.,
McLachlan & Peel, 2000; Titterington, Smith, & Makov, 1985).
Pastor and Gagné (2013: 345) offer the following overview of mixture models and their
use:
When a population lacks information about group membership but there is a research
question to be answered involving the parameter estimates of potentially multiple groups, a
mixture analysis is called upon. A mixture analysis is therefore an analysis that estimates
parameters for a given number of hypothesized groups, known as classes, in a single data
set without the availability of a classification variable or other such a priori information
about group membership with which to sort the data.
The primary question in mixture modeling is thus whether such subgroups appear to exist,
and if so, what their number and nature are. Specific research questions could include,
‘Were there subgroups for which the suprasegmentals treatment worked better vs. worse?’,
‘What are the groupings in the data when it comes to frequency of use of emphatics?’, and
‘Do we have evidence to suggest that some respondents failed to take the task seriously?’. If
our data do indeed provide evidence of multiple subgroups, researchers may subsequently
wish to assess each case’s probability of membership in different classes, or even evaluate
other variables that may have had an influence on those cases’ class membership and/or
outcomes that may be influenced by that class membership.
Consider, to start, a univariate setting, in which a researcher is curious whether the
distribution of scores on a particular variable is the result of the sample coming from one or
more populations. For the sake of illustration, we assume that the populations’ scores are
normally distributed. The goal is then to identify whether the data are consistent with a single
population (with only one set of mean and variance parameters) or with multiple populations
(each with its own mean and variance parameters). Different solutions provided by the
technique can then be compared to determine which one has the best fit to the data. To make
this more concrete, consider Figure 1 below. While the univariate distribution in A seems to
come from a single normal population, the distribution in B seems to come from a mixture of
two normally distributed populations, the parameters of which the model helps us estimate.
Figure 1. Homogeneous (a) and heterogeneous (b) univariate distributions
The above example also highlights some important aspects of the practice of finite
mixture models. Researchers using it must (i) specify the shapes of the component population
distributions, and (ii) decide on the number of classes to model (Pastor & Gagné, 2013: 347).
When it comes to (i), we will focus on normal distributions in this paper, as it is intended to
offer an introductory account of mixture models and as software tends to assume this
distribution (see, e.g., McLachlan & Peel, 2000, for more information on other kinds of
distributions). For (ii), because finite mixture models are commonly practiced in an exploratory
manner (“I wonder how many classes there are”) rather than a confirmatory manner (“Theory
says there should be three classes”), a common approach is to assess which of several
competing models differing in number of classes (i.e., in their associated parameters) has the
most desirable relative fit vis-à-vis the data (see Sections 3 and 4).1 This distinguishes mixture
modeling from nonparametric procedures such as cluster analysis, for which there are
typically no formal model comparison procedures. In addition, mixture models are better
equipped to accommodate missing data (through their use of full information maximum
likelihood estimation), and as they are statistical models, we can obtain information such as
confidence intervals and standard errors of classes’ parameter estimates.
If a multi-class model is ultimately chosen, researchers may wish to evaluate which
cases appear to belong to which class. Unlike nonparametric procedures such as cluster
analysis where each case is assigned as a member of a specific cluster, mixture models do not
make such discrete assignments. Rather, for mixture models, each case has a probability of
membership in each class based on its likelihood within each of the subpopulations (e.g., how
close a score is to each class’s mean in a univariate normal distribution). Based on these
probabilities, a researcher may then elect to make class assignments for the individual cases.
With this conceptual overview in place, we will now turn to a more practical overview
of mixture models and why they might be advantageous in an applied linguistics setting.
3 Applying mixture modeling to univariate applied linguistic data
All mixture models have the following parts: latent classes, mixing probabilities, component
distributions, and an aggregate distribution. Consider again Figure 1b. The latent classes are
the unobserved subgroups in the data, shown as normal underlying distributions, and the
mixing probabilities are the relative proportion of each subgroup (e.g., .25 for the left
subgroup and .75 for the right subgroup). The component distributions specify the presumed
data distribution of each subgroup, such as both being normally distributed. The aggregate
distribution (drawn atop the two underlying distributions in Figure 1b) is the weighted sum of
the component distributions, that is, the observed distribution in the overall population, where
the weights are the classes’ mixing probabilities.
In the following example, we will use a contrived dataset representing a standardized
exam including essays written by 600 first-year adult learners of French. While not used
directly in the mixture model, we also have information about the students’ first-language (L1)
background (Norwegian, Spanish, and Mandarin Chinese) and their level of study (beginner
1.1 or 1.2). The exams have been error coded manually, and we are interested in looking at
the learners’ mastery of the past participle (passé composé) form of the verb to express a
specific action that took place at a precise time (e.g., I left the room). In French, the past
participle is used for temporary actions of this kind, whereas the imperfect (imparfait) is used
1 While mixture models are more commonly practiced in an exploratory manner in this way, their roots are
actually confirmatory. That is, although mixture models are most often used in an exploratory manner to see if
different classes exist in the data, the technique can also be used to assess specific hypotheses. Examples
include a priori hypotheses about a specific number of classes, as well as much more refined hypotheses
suggesting specific degrees of separation between theorized classes as well as differences in their relative
univariate or multivariate heterogeneity/dispersion. This framework is thus versatile enough to accommodate
research questions all along the exploratory-confirmatory continuum.
to describe an action that had a longer duration (e.g., I lived in Brussels). For example, to
express the temporary action ‘they came in’, the target-like form would be the past participle
(1), rather than the imperfect (2).
(1) Ils sont entrés [They came in]
(2) *Ils entraient [They came in]
In more detail, our dataset comprises the percentage of times out of 20 that a learner
correctly used the past participle instead of the imperfect when describing an action that took
place at a precise time. The mean for our overall sample (n=600) is 49.27%, with a standard
deviation of 29.25%. In this example we are interested in answering the following three
research questions:
1. Are there subgroups in the data with regard to the extent to which the learners produce
the past participle in a target-like manner?
2. What is the relative size of each subgroup?
3. Are there any patterns found in the data that could help us understand why learners
might belong to specific subgroups?
Below we will go through the common steps in mixture modeling: model specification, model
estimation, model selection, and possible class assignment.
3.1 Model specification
The outcome variable with which we would like to conduct a mixture model is the percentage
of times a learner correctly used the past participle instead of the imperfect when describing
an action that took place at a precise time. We know from previous research that speakers of
L1s such as English who do not make this distinction may struggle to internalize this rule
more than speakers of languages that have similar rules, although the effect decreases with
increased proficiency (e.g., Heilenman & McDonald, 1993). Now, having designated the key
variable of interest, the first step in conducting the mixture modeling process is model
specification.
Model specification primarily entails three aspects: (i) choosing the shape of the
component distribution(s) within classes, (ii) articulating any potential cross-class constraints,
and (iii) designating the number of latent classes potentially of interest. With regard to the
first aspect, based on prior research we expect the distribution to be reasonably normal within
all subpopulations. Second, as for potential constraints, we have no a priori reason to suspect
that (or question whether) characteristics of different classes ought to be the same (e.g.,
homogeneous variances), and thus will leave them unconstrained. Third, concerning the
number of classes, unless there are strong theory-based reasons to choose a particular number
or small range of classes, a common procedure is to fit several models differing in number of
classes and to make a selection based on proper model convergence, fit, parsimony, and
interpretability. For univariate distributions, as in the current example, we may also visually
inspect the overall sample distribution in an attempt to get an initial data-driven sense of the
number of possible classes (doing so becomes much less feasible in multivariate scenarios).
Figure 2 displays a histogram of our dataset where the x-axis shows the percentage of
times (in 5% bins) a learner correctly used the past participle instead of the imperfect when
describing an action that took place at a precise time, and the frequency of the observations in
each bin is displayed on the y-axis. In our case, we can conclude, based on visual inspection,
that the data do not seem to come from a single normally distributed population, but it is not
clear how many different subgroups we might have in our data.
Figure 2. A histogram of the dataset
3.2 Model estimation
The second step is to fit the model to the data, thereby yielding estimates of the relevant
parameters. The more classes are included in the model, the more parameters need to be
estimated, resulting in a less parsimonious model. That is, for models with k classes,
parameters for each of the k distributions have to be estimated (e.g., mean and variance for
normal distributions). In addition, the mixing proportions must be estimated (reflecting the
classes’ estimated proportional contribution to the overall sample); these proportions are
themselves constrained to sum to 1 (e.g., in a three-class model, the classes’ component
distributions might be estimated to occur with relative frequencies .15 and .35 for the first
two, and thus necessarily .50 for the third; see Pastor & Gagné, 2013: 348ff). Thus, for
example, in a two-class univariate model, five parameters need to be estimated: the mean and
variance of the first class, the mean and variance of the second class, and one class’s mixing
proportion (the second class’s proportion is automatically determined as 1 minus the first;
more generally, for the same reason, with k classes one must estimate k-1 mixing
proportions).
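To make this counting concrete, the parameter totals that appear later in Table 1 can be computed as follows (a simple illustration of our own, in R):

# Unrestricted univariate normal mixture with k classes:
# k means + k variances + (k - 1) mixing proportions
k <- 1:3
n_params <- 2*k + (k - 1)
n_params # 2, 5, 8 for the one-, two-, and three-class models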
Mixture models, like structural equation models more generally, are commonly fit using
Maximum Likelihood (ML) estimation, iterating to estimates of the model parameters for
which the likelihood (probability) of the data in one’s sample is maximized. With a finite
mixture (i.e., k-class) model, ML is still typically employed, but the likelihood being
maximized is a weighted combination of as many likelihoods as there are classes, where the
weights are the mixing proportions reflecting the relative size of the classes. Thus, estimates
of model parameters for all k classes (as well as mixing proportions) are iteratively sought to
define conditions for a mixture of k populations that maximize the likelihood of our observed
data (under specific compound distributional assumptions).
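In notation (ours, added here for concreteness), the quantity maximized for a k-class univariate normal mixture is the log-likelihood

\[
\log L(\pi_1,\ldots,\pi_k,\mu_1,\ldots,\mu_k,\sigma^2_1,\ldots,\sigma^2_k)
= \sum_{i=1}^{n} \log\!\left[\sum_{c=1}^{k} \pi_c\,\phi\!\left(x_i \mid \mu_c, \sigma^2_c\right)\right],
\qquad \sum_{c=1}^{k}\pi_c = 1,
\]

where the \(\pi_c\) are the mixing proportions and \(\phi(\cdot \mid \mu_c, \sigma^2_c)\) is the normal density for class \(c\).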
However, with this increased complexity for the ML process in the context of mixture
models come issues that need to be surmounted, such as local solutions (i.e., the model arrives
at a solution that is not the actual maximum) and singularities (i.e., no solution can be found
as the variance of a component distribution is too small). In order to ameliorate these issues, it
is advisable to run multiple models with different start values seeding the iteration process. In
the software package R (R Core Team, 2024) and other software programs as well, such as
Mplus (Muthén & Muthén, 2023), we can automatically run models with hundreds or even
thousands of random start values. What this means in practice is that the estimator starts
seeking the best solution from different starting places, which lowers the risk of basing one’s
conclusions on a single and erroneous solution.
Note that we need not leave all parameters free to be estimated, as in the so-called
unrestricted model described above. Rather, we could impose restrictions based on theory or
previous research, such as (but not limited to) setting certain parameters to be equal across
classes. For example, as we are typically interested in looking at mean differences across
classes and may have no reason to believe classes differ in their dispersion, we might choose
to restrict the variances to be equal across the classes. In doing so, we decrease the number of
parameters to be estimated, which also can decrease the risk that there will be an estimation
issue with the model. We could then compare the fit of this restricted model to that of the
corresponding unrestricted model in which all parameters are estimated freely.
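As a minimal sketch of how such a restriction might be requested in mclust (using the MIX variable prepared in the Appendix; in mclust’s univariate naming, 'V' leaves class variances free and 'E' constrains them to be equal across classes):

library(mclust)
set.seed(1234)
fit_free  <- Mclust(MIX, G = 2, modelNames = "V") # class-specific variances (unrestricted)
fit_equal <- Mclust(MIX, G = 2, modelNames = "E") # variances constrained equal across classes
# Compare the two specifications; note that mclust reports BIC on a scale
# where larger values indicate better fit
c(free = fit_free$bic, equal = fit_equal$bic)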
3.3 Model selection
Once we have fit our models (with and/or without restrictions), the next step is model
selection. Here, we have two main approaches: statistical and informational. For models that
are nested (that is, models in which one is a constrained version of the other), we may
statistically test the difference between two models’ log-likelihood (LL) ratios. This could be
used, for example, to compare a two-class model having all parameters unconstrained vs. a
two-class model in which the variances are constrained equal across classes. For models that
are not nested, for example models with unconstrained parameters but differing in number of
classes, such as an unrestricted two-class model vs. an unrestricted three-class model, a
bootstrap likelihood ratio test (which approximates a formal test of whether the decreases in
the LL for each model are statistically significant) can be used to assess if the latter is a
statistically significant improvement over the former.
More generally, for comparing any models (nested or non-nested), information criteria
are commonly used, which offer a balance between model fit and model complexity (i.e.,
number of parameters). This could lead us to choose an unrestricted two-class model over an
unrestricted three-class model because the latter, while offering improved fit, might not do so
to a degree justifying its additional parameters. Many information criteria exist, including
variations on the Akaike information criterion (AIC; see also the consistent AIC or CAIC)
and the Bayesian information criterion (BIC; see also the sample-adjusted BIC or saBIC). In
the context of mixture models, large simulation studies (e.g., Nylund et al., 2007) have shown
that there are differences among how well these indices perform: for example, the AIC can
lead to the selection of models with too many classes, whereas the BIC and its derivatives
often perform better.
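For reference, the two most widely used criteria are computed as follows (with LL the log-likelihood, p the number of estimated parameters, and n the sample size; these are the formulas used in the Appendix code, where smaller values are preferred):

\[
\mathrm{AIC} = -2\,\mathrm{LL} + 2p, \qquad
\mathrm{BIC} = -2\,\mathrm{LL} + p\,\ln n .
\]

The CAIC and saBIC reported in Table 1 replace the penalty term \(\ln n\) with \(\ln n + 1\) and \(\ln\!\big((n+2)/24\big)\), respectively.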
To illustrate model selection, we fit three competing unrestricted models to our dataset
(n = 600): a one-class model (with two parameters: one mean and one variance), a two-class
model (with five parameters: two means, two variances, and one unique mixing proportion),
and a three-class model (with eight parameters: three means, three variances, and two unique
mixing proportions). We used the R software (R Core Team, 2024) and the R package mclust
(Scrucca et al., 2023); our code can be found in the Appendix.
As is clear from Table 1, relative to a one-class model, all information criteria are
smaller for a two-class model, indicating a better balance of fit and parsimony. When it comes
to choosing between the two- and three-class models, the information criteria again agree, in
this case that the three-class model offers further improvement over the two-class model. Note
that a lack of agreement among fit indices certainly can happen as the models become less
distinguishable in their ability to represent the data, often leading researchers to make the
final selection based on examining if the additional class is meaningful and driven by more
than just a few aberrant cases.
Table 1. Fit indices for the one-class, two-class, and three-class models
Classes   n     p   LL          AIC        BIC        CAIC       saBIC
1         600   2   -2876.121   5756.242   5765.036   5767.036   5758.687
2         600   5   -2645.529   5301.057   5323.042   5328.042   5307.168
3         600   8   -2633.367   5282.734   5317.910   5325.910   5292.512
In addition, as previously mentioned, we may take a statistical approach called a
bootstrap likelihood ratio test (bLRT), which approximates a formal test of whether the
decreases in the LL for each model are statistically significant. In our case, we used 1000
bootstrap resamples (with replacement, each of size n=600) from the original parent sample.
Results in Table 2 show that, indeed, the incremental improvement with each additional class
is statistically significant. These results are also consistent with the information criteria, and
thus we will retain the model with three classes.
Table 2. Comparison of models using a bootstrap likelihood ratio test (bLRT)
Comparison             bLRT     bootstrap p-value
1-class vs. 2-class    461.19   0.001
2-class vs. 3-class    18.13    0.001
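In mclust, these bootstrap comparisons can be obtained with mclustBootstrapLRT(); a minimal sketch of such a call (on the MIX variable from the Appendix; nboot sets the number of bootstrap resamples) is:

library(mclust)
set.seed(1234)
# Bootstrap LRTs comparing k vs. k + 1 classes, for k = 1, 2
mclustBootstrapLRT(MIX, modelName = "V", nboot = 1000, maxG = 2)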
A graphical overview of the fit of the one-, two-, and three-class models for our data
appears in Figures 3–5, with corresponding parameter estimates for each model displayed in
Table 3. As is shown, the one-class model (Model 1) has a massive variance, which is
consistent with the possible existence of more than one underlying population. The two-class
model (Model 2) indicates that one class uses the past participle correctly approximately
every fifth time on average, whereas the other class uses it in a target-like manner almost
three quarters of the time on average. The three-class model (Model 3), which was the model
ultimately selected, indicates that there is a relatively small proportion of remarkably
homogeneous students (i.e., having a small variance relative to the other classes) falling
between the two larger classes; these students use the past participle correctly just over one third of the
time in their writing.
Figure 3. Fit of a one-class model of the percentage of passé composé
Figure 4. Fit of a two-class model of the percentage of passé composé
Figure 5. Fit of a three-class model of the percentage of passé composé
Table 3. Parameter estimates for the classes in each model
          Class 1                         Class 2                         Class 3
          Proportion  Mean    Variance    Proportion  Mean    Variance    Proportion  Mean    Variance
Model 1   1.0         49.27   853.37
Model 2   0.46        19.59   74.74       0.54        74.49   130.78
Model 3   0.41        17.80   50.49       0.05        35.47   4.90        0.54        74.50   130.01
3.4 Class assignment
As a final step, researchers may wish to look more closely at the individual students to see
their probability of membership within each class, and possibly relate that information to
other variables. For example, Tables 4–6 show the level and L1 of the students who have the
highest probability of belonging to the first, second, and third class, respectively. A closer
look at the classes suggests a pattern: the prototypical members of class 1 are L1 Chinese and
Norwegian students at the beginner 1.1 level. For class 2, it is L1 Chinese and Norwegian
students at the beginner 1.2 level, and for class 3 it is L1 Spanish students at the 1.2 level. As
is shown, class membership for the students with the highest probabilities is somewhat more
easily determined for the first and third classes; the highest probability assigned to anyone in
class 2 is 0.905.
Table 4. Students with the highest probability of belonging to class 1
Class 1   Class 2   Class 3   LEVEL    L1
>.999     <.001     <.001     Beg1.1   Chinese
>.999     <.001     <.001     Beg1.1   Chinese
>.999     <.001     <.001     Beg1.1   Chinese
>.999     <.001     <.001     Beg1.1   Chinese
>.999     <.001     <.001     Beg1.1   Chinese
>.999     <.001     <.001     Beg1.1   Norwegian
>.999     <.001     <.001     Beg1.1   Norwegian
>.999     <.001     <.001     Beg1.1   Norwegian
>.999     <.001     <.001     Beg1.1   Norwegian
>.999     <.001     <.001     Beg1.1   Norwegian
Table 5. Students with the highest probability of belonging to class 2
Class 1   Class 2   Class 3   LEVEL    L1
0.083     0.905     0.011     Beg1.2   Norwegian
0.083     0.905     0.011     Beg1.2   Norwegian
0.083     0.905     0.011     Beg1.2   Norwegian
0.083     0.905     0.011     Beg1.2   Norwegian
0.083     0.905     0.011     Beg1.2   Chinese
0.083     0.905     0.011     Beg1.2   Chinese
0.083     0.905     0.011     Beg1.2   Chinese
0.083     0.905     0.011     Beg1.2   Chinese
0.083     0.905     0.011     Beg1.2   Chinese
0.083     0.905     0.011     Beg1.2   Chinese
Table 6. Students with the highest probability of belonging to class 3
Class 1   Class 2   Class 3   LEVEL    L1
<.001     <.001     >.999     Beg1.2   Spanish
<.001     <.001     >.999     Beg1.2   Spanish
<.001     <.001     >.999     Beg1.2   Spanish
<.001     <.001     >.999     Beg1.2   Spanish
<.001     <.001     >.999     Beg1.2   Spanish
<.001     <.001     >.999     Beg1.2   Spanish
<.001     <.001     >.999     Beg1.2   Spanish
<.001     <.001     >.999     Beg1.2   Spanish
<.001     <.001     >.999     Beg1.2   Spanish
<.001     <.001     >.999     Beg1.2   Spanish
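For readers who wish to reproduce this step, the posterior probabilities and modal class assignments can be extracted from the fitted three-class object (called mix.3 in the Appendix); a minimal sketch, using the L1 column assumed to be present in the loaded data frame:

post <- mix.3$z # n x 3 matrix of posterior class probabilities
head(round(post, 3))
assigned <- apply(post, 1, which.max) # modal class assignment for each learner
table(assigned) # class sizes under modal assignment
table(assigned, MIX_omit$L1) # relate assignments to L1 background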
Given all of the above results, we will now return to our three research questions. First,
there definitely appear to be subgroups in the data with regard to the extent to which the
learners produce the past participle in a target-like manner. Visual inspection along with
formal tests indicate that a one-class model does not have acceptable fit to the data. Second,
with regard to the relative size of each subgroup, it seems as if just over half the students
(54%) use the past participle in a target-like manner most of the time, whereas the other half
is bifurcated with a majority of students (41% of the total sample) only occasionally
producing the correct form and a smaller but quite homogeneous minority (5% of the total
sample) using it in a target-like manner approximately a third of the time. Third, concerning
whether there are patterns in the data that could help us predict which learners can be found in
each subgroup, an examination of the cases would suggest ‘yes.’ Although the exact effects of
the L1 and level of study variables remain to be tested formally, we observe quite distinct clusters
of L1 and level of study within our respective subgroups.
4 Applying mixture modeling to bivariate and multivariate applied linguistic data
Mixture models can scale up to encompass bivariate, and more generally multivariate,
analyses. In this section, we will start by looking at mixtures in a bivariate analysis, and then
illustrate what a multivariate mixture model looks like. As was the case in the previous
section, we will limit the discussion to normally distributed data.
While univariate mixture models are relatively straightforward for graphical
visualization, multivariate mixtures are more difficult to visualize as models with p variables
will have p dimensions. Further, the number of parameters to estimate also scales up with a
higher number of variables. Like unrestricted univariate mixtures, with unrestricted
multivariate mixtures there will be k-1 mixing probabilities that are free to be estimated (i.e.,
one fewer than the number of classes), and p means and p variances for each of the k classes.
In addition, we also have covariances between all possible pairs of variables, specifically p(p-
1)/2 unique covariance parameters within each class.
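Putting these pieces together, an unrestricted k-class model for p variables therefore has

\[
(k-1) \;+\; k\left[\,p + p + \frac{p(p-1)}{2}\,\right]
\]

free parameters; for example, with p = 3 variables, a two-class unrestricted model requires 1 + 2(3 + 3 + 3) = 19 parameters to be estimated.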
We might also have hypotheses that are more parsimonious, containing restrictions that
reduce the number of parameters to be estimated. For example, if we have support from
theory that classes should have the same degree of variance on specific variables, we might
run models that constrain one or more variables’ variances to be equal across classes. In fact,
if we wish to assess whether the classes differ only in multivariate location (means) and not
the multivariate shape of the distribution (variable variances and/or covariances), a model
constraining the latter across classes can be analyzed. Such a model would be the latent
analog of a multivariate analysis of variance (MANOVA), where multivariate
homoscedasticity (i.e., homogeneity of dispersion) is assumed across all observed groups.
Alternatively, if variables are believed to be independent within classes, covariances among
variables may be fixed to zero within classes, yielding so-called latent profile analyses. This
name reflects the fact that the classes can be characterized by a plot of their means for all
measured variables displayed as a profile on a horizontal axis (with standard deviations
denoted around means if desired); different classes are thus evidenced by differences in the
shape of the depicted profiles. Both the homoscedastic and latent profile models contain fewer
parameters than the unrestricted multivariate mixture models, and as such can be easier to
estimate in practice; however, to reiterate, their investigation should be motivated by theoretical
beliefs or by compelling research questions reflecting voids therein.
To illustrate what an analysis of multivariate data can look like, we will use three
variables from the dataset from Larsson, Biber, and Hancock (forthcoming): attributive
adjectives, adverbs, and pre-modifying nouns. These variables are all examples of linguistic
features that have been investigated in relation to grammatical complexity of different
registers (see, e.g., Biber et al., 2022; Biber, Larsson, and Hancock, forthcoming). In Larsson
et al. (forthcoming), these three features were found to occur with the highest frequencies in a
dataset comprising eight spoken and written registers: conversational opinion, conversational
narrative, classroom teaching, formal lectures, opinion blogs, textbooks, research articles, and
fiction. The registers that were included all vary to some degree on situational characteristics
such as interactivity, communicative purpose, mode, and level of expertise of the audience.
We can expect based on this previous research that attributive adjectives and pre-modifying
nouns will be more frequent in informational written registers (e.g., research articles),
whereas adverbs are more likely to be frequent in interactional spoken registers (e.g.,
conversational opinion). The overall means and standard deviations for the complexity
features in the full dataset (n = 1229) can be found in Table 7.
Table 7. Mean and SD of the normalized (per 1,000 words) per-text frequencies of the complexity features in the
full dataset
Feature                   Mean   SD
Attributive adjectives    30.6   21.18
Premodifying nouns        21.4   19.77
Adverbs                   74.4   24.39
For the purposes of the present paper, we will ask the following research questions in
this section:
1. Are there subgroups in these data with regard to the frequency of the complexity
features investigated, and what is the nature of these subgroups in terms of variables’
intra- and inter-subgroup characteristics?
2. What is the relative size of each subgroup?
3. Are there any patterns found in the data that could help us understand which texts can
be found in each subgroup?
We will now go through the same steps as for the univariate mixture models in Section
3: Model specification, model estimation, model selection, and class assignment. Note that, in
the interest of space, some aspects already covered in Section 3 will not be reiterated in the
same detail in the present section.
4.1 Model specification
For a bivariate, or more generally a multivariate, mixture analysis, after selecting the variables
to be included in the model, the steps involved in model specification are as stated previously:
(1) choosing the shape of the component distribution(s) within classes, (2) articulating any
potential cross-class constraints, and (3) designating the number of latent classes potentially
of interest. In the univariate case presented previously, plotting the data was suggested as an
initial way to potentially inform these steps. Although in the current case we can plot the
univariate distributions of each variable or even bivariate plots for pairs of variables, and
although we certainly would never discourage researchers from doing so to become better
acquainted with their data, it might be less helpful in illustrating the latent classes. This is
because when subgroups exist, they technically exist in the multivariate space and as such
might not be as evident in fewer dimensions. Thus, with three variables we might be able to
plot the data and get some visual evidence for the existence of classes; however, a three-
dimensional plot is still generally rendered as a projection into two dimensions (e.g., on the
printed page or computer screen) and hence might not be illuminating. And examples with
four or more variables become even more challenging. We will therefore not provide the
associated three-dimensional plots for the variables used in the present section.
Regarding step 1, for the variables in question we have no theoretical reason to suspect
that the distributions will be other than (multivariate) normal within each potential subgroup,
and hence we will use this as the component distribution for all classes. As for steps 2 and 3, a
common approach is to analyze a set of theoretically-justifiable competing models, with
different permutations of number of classes and distributional restrictions, and then use (for
example) information criteria to help select the best fitting model. For illustration in the
current paper, we will fit Model 1 as a completely unrestricted model (i.e., each class having 3
means, 3 variances, and 3 covariances), Model 2 as a homoscedastic model with the same
shape across classes but potentially different locations (i.e., same as Model 1 but with
variances and covariances constrained to be equal across classes), and Model 3 as a latent
profile model in which variables’ covariances are assumed to be zero within each class (i.e.,
each class having 3 means and 3 variances), with up to four classes attempted to be extracted
for each model. We used the R packages mclust (Scrucca et al., 2023) and mixture (Pocuca
et al., 2023); the R code can be found in the Appendix. Like for univariate mixtures, we need
to fit all our models using multiple sets of start values to avoid spurious local solutions; we
used 1,000 random starts. The three models are summarized in Table 8.
Table 8. Overview of the model restrictions
Model               Covariances                  Variances                    Means
1) Unrestricted     freely estimated             freely estimated             freely estimated
2) Homoscedastic    constrained across classes   constrained across classes   freely estimated
3) Profile          fixed to zero                freely estimated             freely estimated
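A sketch of how the three specifications in Table 8 might be requested in mclust is given below (Y is assumed to be a data frame holding the three complexity-feature columns; in mclust’s multivariate naming, 'VVV' corresponds to the unrestricted model, 'EEE' to the homoscedastic model, and 'VVI' to the latent profile model with class-specific variances):

library(mclust)
set.seed(1234)
# Y: per-text normalized frequencies of attributive adjectives, adverbs,
# and premodifying nouns (assumed object name)
fits <- Mclust(Y, G = 1:4, modelNames = c("VVV", "EEE", "VVI"))
fits$BIC # BIC for every combination of model type and number of classes
# Note: mclust reports BIC on a scale where larger values indicate better fit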
4.2 Model estimation and model selection
Once we have run the 12 competing model permutations (3 model types each with 1–4
classes), we need to decide which model to retain. Like for the univariate distribution in
Section 3, we could look at all the possible fit indices for all the models and all the classes;
however, given the large number of different models and classes we are estimating, we will
here focus our discussion on BIC, given its usefulness for deciding among solutions. The BIC
values for each solution (1–4 classes) can be found in Table 9.
Table 9. Overview of the model fit (BIC) for 1-4 classes
Model               1 class     2 classes   3 classes   4 classes
1) Unrestricted     31587.64    30505.13    30298.53    30272.78
2) Homoscedastic    31587.64    30945.71    30883.59    30638.03
3) Profile          33193.73    30886.87    30413.50    30325.47
As is shown, Model 1 consistently exhibits the same or better fit, meaning that it is
preferable to (a) allow the classes to be different in means and variances and (b) allow for
nonzero and heterogeneous covariances among the variables across classes. When it comes to
class enumeration, the process follows the same general procedure as described previously in
the univariate case. We see a relatively large improvement (i.e., decrease in BIC) when we
move from one to two classes, less improvement from two to three classes, and relatively
small improvement from three to four classes. As in the univariate case, we used a bootstrap
log-likelihood test, in this case to compare the three- and four-class solutions; a statistically
significant difference was found (bLRT=96.9, p < .001), and we will therefore retain the four-
class version of Model 1.2
Taking a closer look at the model selected, the mixing probabilities for the four classes
were .31 (class 1), .23 (class 2), .32 (class 3), and .14 (class 4). To help understand these
classes, we can look at the mean profiles for the different classes to attempt to interpret the
solution and the class membership.3 Of course the classes may differ in variances and/or
covariances as well, but plotting the mean profiles is often very illuminating as differences in
multivariate location are often the strongest driver of the number of classes (relative to their
within-class variances, not unlike univariate standardized effect sizes). Figure 6 provides a
graphical overview of the mean profiles across the three complexity features for each of the
four classes.
2 Note that we also fitted five- and six-class versions of Model 1, but neither had better fit (i.e., smaller BIC)
than the four-class solution.
3 It should be mentioned, however, that the entropy is relatively low (E = 0.71), which suggests that the classes
overlap a good deal and, thus, it is difficult to assign class membership with high accuracy. The entropy measure
captures the degree of separation of the estimated component distributions, with numbers closer to 1 indicating
clearer separation (and thus greater ability to classify cases accurately should one wish to do so).
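Following the computations in the Appendix, this relative entropy can be obtained from the matrix of posterior class probabilities of the retained model (a sketch; post is assumed to be the n × k posterior matrix, e.g., the z element of a fitted mclust object called fit):

post <- fit$z # posterior class probabilities from the retained model (assumed object name)
ent <- sum(-post * log(post), na.rm = TRUE) # total entropy of the classification
E <- 1 - ent / (nrow(post) * log(ncol(post))) # relative entropy; values near 1 = well-separated classes
E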
Figure 6. Mean profiles for the complexity features for the four-class unconstrained model
Whether or not we formally assign class membership to the individual texts, examining the
mean profiles can be interpretationally useful. As is shown in Figure 6, Class 4 seems to differ
from the other classes in that its frequencies are relatively higher for attributive
adjectives, lower for adverbs, and higher for premodifying nouns. Previous research has
shown that this profile is common for written informational registers such as research articles
(e.g., Biber et al., 2022). Indeed, a closer look at the texts that had high posterior probabilities
for this class shows that the top 200 almost exclusively contain written registers, most notably
research articles.
While Classes 1, 2, and 3 are more similar to one another in mean profile than to Class
4, Class 3 stands out in opposition to Class 4. The texts in this class have the relatively lowest
mean frequency of attributive adjectives and premodifying nouns, and the highest frequencies
for adverbs. When we look at the top-200 texts that had the highest posterior probabilities for
this class, we see that this is the most “conversational” class: Conversational narrative and
conversational opinion make up the majority of the texts (although there are also some written
opinion blogs in this class as well).
Classes 1 and 2 have similar mean frequencies for adverbs, but Class 1 has somewhat
higher frequencies for attributive adjectives and premodifying nouns than Class 2. The texts
that have the highest posterior probabilities for Class 1 come from either conversational
writing (opinion blogs) or informational spoken registers (lectures and classroom teaching).
For Class 2, fiction features most prominently, followed by lectures and classroom teaching.
It is also worth noting explicitly that there is no perfect one-to-one mapping between
classes and registers, or between class and mode (written vs. speech), which suggests that
there is intra-register and intra-mode variability and overlap in terms of the frequency of use
of the complexity features investigated. The four classes may be interpreted, loosely, as
follows:
Class 1: Conversational writing and informational speech
Class 2: Narrative production
Class 3: Conversational prose for a general audience
Class 4: Informational writing for an expert audience
Thus, the patterns found in the data appear to be explained more by situational characteristics
such as intended audience and rhetorical purpose than by a priori register and mode
distinctions.
5 Potential issues with mixture models
Just as with any technique, there are potential challenges and limitations with mixture models
as well. We will discuss the three most notable here. First, fitting a mixture model using the
wrong probability distribution could lead to erroneous conclusions. Consider a univariate
example to illustrate this point. If we assume that our variable in question is normally
distributed in the population, we also assume that any deviation from normality in the
distribution of the overall sample (e.g., skewness or bimodality) is the result of underlying
classes. Thus, if we erroneously assume a normal distribution when we should instead assume
a different intrinsic distribution for our variable, the mixture model might suggest the
existence of classes that do not exist in reality. The take-home message here, of course, is that
researchers should know, as much as possible, what to expect from their variables. Although
the examples in this paper justified the use of normal distributions, the mixture modeling
framework allows for other types of distributions as well, including skewed univariate and
multivariate distributions for continuous variables, as well as distributions specifically for
categorical outcomes (e.g., Bernoulli distributions).
The second challenge mentioned here is that we may not always be able to reliably
assign class membership to cases (e.g., texts, participants) whose data are included in the model.
That is, if there is a great deal of overlap between two classes, cases in the heavily
overlapping region can be difficult to classify with certainty. And this can be the case even
though the existence of classes is clearly justified by the model selection process.
In the end, this issue may or may not be troublesome for a researcher, as mixture models
have both a direct application and an indirect application. In the direct application, we aim to
discern latent subgroups; in the indirect application, we use the technique to provide an
approximation for the distribution of complex data (see Pastor & Gagné, 2013: 381-382). In
terms of solutions to the issue if the goal is indeed class assignment, one option is to prioritize
clustering criteria such as entropy (rather than parsimony criteria, such as BIC) at the class
enumeration stage, when choosing among the different solutions (see Celeux & Soromenho,
1996). The entropy criterion measures the extent to which the model provides well-separated
clusters. However, it is important to note that there is not always a solution where reliable
class assignment is possible; in these cases, we must resort to the indirect application of
mixture modeling, namely being able to approximate – and thus capture – the overall
distribution of the data in question. While it may seem disappointing not to have arrived at
clear class membership for all cases, the distribution arrived at can have implications such as
capturing non-normal factors in a confirmatory factor analysis or non-linear relations in
simple or multiple regression (please see Bauer & Shanahan, 2007, for more information on
such applications).
Finally, sample size planning for mixture models can be challenging. For those focusing
on latent class differences in means, variances, and covariances, the models themselves are
relatively simple as compared to mixtures of, say, multiple regression, path analysis, or latent
variable structural equation models. For these models in which classes differ in univariate or
multivariate distribution, class detection is primarily driven by differences in means rather
than differences in (co)variances. That is to say, power to detect class differences is mostly a
function of the distributional separation, which is primarily governed by differences in
location. The challenge then, as with all sample size planning, is to anticipate the potential
class separation ahead of time in order to determine the necessary sample size. This is made
more complex by the fact that one is not sampling explicitly from each class as with known
groups, and thus we have no control over the proportion of cases from each potential class.
Perhaps the best approach then is to simulate a variety of distributional conditions to
determine a sample size with adequate power to detect important differences across a variety
of realistic configurations; the reader is referred to, for example, Muñoz and Acuña (1999),
for more details.
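As a rough illustration of what such a simulation might look like (our own sketch, not taken from Muñoz and Acuña; the population values below are loosely based on the two-class estimates in Table 3), one could generate data from a hypothesized two-class mixture and record how often a two-class solution is preferred at a candidate sample size:

library(mclust)
set.seed(1234)
power_sim <- function(n, pi1 = .5, mu = c(20, 75), sd = c(9, 11), reps = 100) {
  hits <- 0
  for (r in 1:reps) {
    cls <- rbinom(n, 1, 1 - pi1) + 1 # class labels (1 or 2)
    x <- rnorm(n, mean = mu[cls], sd = sd[cls]) # data from the hypothesized mixture
    fit <- Mclust(x, G = 1:2, modelNames = "V", verbose = FALSE)
    if (fit$G == 2) hits <- hits + 1 # two-class model preferred by BIC
  }
  hits / reps # proportion of replications recovering two classes
}
power_sim(n = 100)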
6 Conclusion and looking ahead
This article introduced mixture models to an applied linguistics audience and showed how
this technique can be used to identify underlying groupings in our data that were not known
a priori. That is, this technique helps us answer questions of whether there are previously
unknown groupings in the data, and if so, what the relative size of each group is. In addition, it
can often help us to understand which texts/participants are likely to be found in each group,
information that we can use to help interpret the groupings identified. Given the introductory
nature of this paper, we focused on normally distributed data to illustrate how to work with
univariate and multivariate mixtures. We also discussed some potential issues with this
technique, including erroneous distributional assumptions and pitfalls related to class
membership allocation.
Looking ahead, there are several topics under the mixture modeling umbrella that this
introductory article was not able to cover. For example, researchers interested in data that
contain binary and/or ordinal variables may wish to look into such applications of mixture
models. Furthermore, there is ample literature on how to model non-normal distributions
(e.g., Asparouhov & Muthén, 2016). In addition, other applications of mixture models include
using these as an overlay with commonly applied techniques in the field such as t-tests,
multiple regression, and factor analysis. We hope researchers will explore these techniques
further for their rich potential within applied linguistic research.
References
Asparouhov, T., & Muthén, B. (2016). Structural equation models and mixture models with
continuous nonnormal skewed distributions. Structural Equation Modeling: A
Multidisciplinary Journal, 23, 1–19. http://dx.doi.org/10.1080/10705511.2014.947375
Bauer, D. J., & Shanahan, M. J. (2007). Modeling complex interactions: Person centered and
variable centered approaches. In T. D. Little, J. A., Bovaird, & N. A. Card (Eds.).
Modeling contextual effects in longitudinal studies (pp. 225–284). Lawrence Erlbaum
Associates.
Bauer, D. J., & Steinley, D. (2021). Mixture Modeling Demonstration Notes: R.
Biber, D., Larsson, T., & Hancock, G. R. (Forthcoming). The linguistic organization of
grammatical text complexity: Comparing the empirical adequacy of theory-based
models. Corpus Linguistics and Linguistic Theory. https://doi.org/10.1515/cllt-2023-0016
Biber, D., Gray, B., Staples, S., & Egbert, J. (2022). The Register-Functional approach to
grammatical complexity. Routledge.
Bollen, K. A. (2002). Latent variables in psychology and the social sciences. Annual Review of
Psychology, 53(1), 605–634.
Celeux, G., & Soromenho, G. (1996). An entropy criterion for assessing the number of clusters
in a mixture model. Journal of Classification, 13, 195–212.
https://inria.hal.science/inria-00074799
Hancock, G. R., & Schoonen, R. (2015). Structural Equation Modeling: Possibilities for
Language Learning Researchers. Language Learning, 65(1), 160–184.
https://doi.org/10.1111/lang.12116
Heilenman, L. K., & McDonald, J. L. (1993). Processing strategies in L2 learners of French:
The role of transfer. Language Learning, 43(4), 507–557.
https://doi.org/10.1111/j.1467-1770.1993.tb00626.x
Larsson, T., Biber, D., & Hancock, G. R. (Forthcoming). On the role of cumulative
knowledge building and specific hypotheses: The case of grammatical complexity.
Corpora, 19(3).
Larsson, T., Plonsky, L., & Hancock, G. R. (2022). On learner characteristics and why we
should model them as latent variables. International Journal of Learner Corpus
Research, 8(2), 237–260. https://doi.org/10.1075/ijlcr.21007.lar
Lou, N. M., Chaffee, K. E., & Noels, K. A. (2022). Growth, fixed, and mixed mindsets:
Mindset system profiles in foreign language learners and their role in engagement and
achievement. Studies in Second Language Acquisition, 44(3), 607–632.
https://doi.org/10.1017/S0272263121000401
McLachlan, G., & Peel, D. (2000). Finite mixture models. Wiley.
Muñoz, M. A., & Acuña, J. D. (1999). Sample size requirements of a mixture analysis method
with applications in systematic biology. Journal of Theoretical Biology, 196(20), 263–
265. https://doi.org/10.1006/jtbi.1998.0826
Muthén, L. K., & Muthén, B. O. (1998-2023). Mplus user's guide (8th ed.). Muthén &
Muthén.
Nylund, K. L., Asparouhov, T., & Muthén, B. O. (2007). Deciding on the number of classes
in latent class analysis and growth mixture modeling: A Monte Carlo simulation
study. Structural Equation Modeling: A Multidisciplinary Journal, 14(4), 535–569.
https://doi.org/10.1080/10705510701575396
Pastor, D. A., & Gagné, P. (2013). Mean and covariance structure mixture models. In Gregory
R. Hancock & R. O. Mueller (Eds.), Structural equation modeling: A second course
(2nd ed) (pp. 343–393). Information Age Pub.
Pocuca, N., Browne R., & McNicholas, P. (2023). mixture: Mixture models for clustering and
classification. R package version 2.0.6. https://CRAN.R-project.org/package=mixture
R Core Team (2024). R: A language and environment for statistical computing. R Foundation
for Statistical Computing, Vienna, Austria. URL: https://www.R-project.org/.
Scrucca, L., Fop, M., Murphy, T. B., & Raftery, A. E. (2023). Model-based clustering,
classification, and density estimation using mclust in R. Chapman and Hall/CRC.
https://mclust-org.github.io/book/
Titterington, D. M., Smith, A. F. M., & Makov, U. E. (1985). Statistical analysis of finite
mixture distributions. Wiley.
Yu, H., Lowie, W., & Peng, H. (2022). Understanding salient trajectories and emerging profiles in the development of Chinese learners’ motivation: A growth mixture
modeling approach. International Review of Applied Linguistics in Language
Teaching. https://doi.org/10.1515/iral-2022-0036
Appendix
R code for the univariate and multivariate mixture models fitted in the paper (adapted from
Bauer & Steinley, 2021).
### Univariate mixtures ###
## Install/load packages required
install.packages("mclust") # install package mclust
library(mclust) # once installed, load package mclust
## Load in and prepare the data
MIX_all<-read.delim(stringsAsFactors=T, file.choose()) # load in the data, which has four variables: student ID, L1, GROUP, PC_PROP
summary(MIX_all)
MIX_omit <- na.omit(MIX_all) # omit NAs, if applicable
MIX<-MIX_omit$PC_PROP # identify the variable of interest for the mixture models
summary(MIX) # summary of the dataset
str(MIX) # variable structure of the dataset
hist(MIX, breaks=15, ylim=c(0,100), xlim=c(0,100)) # histogram of the dataset
## Fit models with random starts
# 1-class model fit
mix.1 <-Mclust(MIX, G=1, modelNames='V', control=emControl(tol=1.e-12),
verbose=FALSE)
mix.1$parameters # obtain the mixing probabilities, means, and variances for each class
# 2-class model fit, with 200 random starts
set.seed(1234)
BIC.2 <- NULL
LL.2 <- NULL
for(j in 1:200)
{
rBIC <-mclustBIC(MIX, G=2, modelNames='V', control=emControl(tol=1.e-12),
verbose=FALSE,
initialization=list(hcPairs=hcRandomPairs(MIX)))
rModel<-mclustModel(MIX, BICvalues = rBIC)
LL.2 <- c(LL.2, rModel$loglik)
BIC.2 <- mclustBICupdate(BIC.2, rBIC)
}
mix.2 <-mclustModel(MIX, BICvalues = BIC.2)
mix.2$parameters
post.2<-mix.2$z # posterior class probabilities for the 2-class solution
entropy.2<-post.2*log(post.2)
entropy.2<-sum(-1*entropy.2) # model entropy for the 2-class solution
entropy.2
# 3-class model fit, with 200 random starts
BIC.3 <- NULL
LL.3 <- NULL
for(j in 1:200)
{
rBIC <-mclustBIC(MIX, G=3, modelNames='V', control=emControl(tol=1.e-12),
verbose=FALSE,
initialization=list(hcPairs=hcRandomPairs(MIX)))
rModel<-mclustModel(MIX, BICvalues = rBIC)
LL.3 <- c(LL.3, rModel$loglik)
BIC.3 <- mclustBICupdate(BIC.3, rBIC)
}
mix.3 <-mclustModel(MIX, BICvalues = BIC.3)
mix.3$parameters
post.3<-mix.3$z # posterior class probabilities for the 3-class solution
entropy.3<-post.3*log(post.3)
entropy.3<-sum(-1*entropy.3) # model entropy for the 3-class solution
entropy.3
# Compute and compare fit indices
K<-1:3 # number of classes fit
n<-rep(600,3) # sample size for each model
p<-c(2,5,8) # number of parameters estimated for each model
LL<-c(mix.1$loglik,mix.2$loglik,mix.3$loglik)
entropy<-c(0,entropy.2,entropy.3)
fit<-cbind.data.frame(K,n,p,LL,entropy)
fit$AIC <- -2*fit$LL + 2*fit$p
fit$BIC <- -2*fit$LL + fit$p*log(fit$n)
fit$CAIC <- -2*fit$LL + fit$p*(log(fit$n) + 1)
fit$ssBIC <- -2*fit$LL + fit$p*log((fit$n + 2)/24)
fit$CLC <- -2*fit$LL + 2*fit$entropy
fit$ICL.BIC <- -2*fit$LL + fit$p*log(fit$n) + 2*fit$entropy
fit$NEC <- fit$entropy/(fit$LL-fit[1,]$LL)
fit[1,]$NEC <- 1
fit$E <-1-fit$entropy/(fit$n*log(fit$K))
fit
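# (Optional) one possible quick summary of which number of classes each criterion
# favors; lower values are better for the information criteria and NEC, whereas
# values of E closer to 1 indicate better class separation
sapply(fit[,c("AIC","BIC","CAIC","ssBIC","CLC","ICL.BIC","NEC")],
       function(x) fit$K[which.min(x)])
fit$K[which.max(fit$E)] # number of classes favored by E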
# Obtain bootstrapped LRT
mclustBootstrapLRT(MIX,"V",nboot=600,maxG=2,control=emControl(tol=1.e-12))
# bootstrapped LRTs (600 bootstrap samples) comparing K vs. K+1 classes (here up to 3 classes)
# Examine which observations fall into each class, using the two known variables 'L1' and 'GROUP'
attach(MIX_omit)
postprobs<-as.data.frame(mix.3$z)
postprobs2<-cbind(GROUP,postprobs)
names<-data.frame(GROUP,L1)
pprobs.names<-merge(names,postprobs2,by.x="GROUP",by.y="GROUP")
sort.c1<-pprobs.names[order(-pprobs.names[,3]),] # to get highest-probability students for class 1 in 3-class solution
sort.c1[1:200,]
sort.c2<-pprobs.names[order(-pprobs.names[,4]),] # to get highest-probability students for class 2 in 3-class solution
sort.c2[1:200,]
sort.c3<-pprobs.names[order(-pprobs.names[,5]),] # to get highest-probability students for class 3 in 3-class solution
sort.c3[1:200,]
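# (Optional) one possible compact summary of the 3-class solution: cross-tabulate
# each student's modal (highest-probability) class against the known variables;
# 'modal.class.3' is simply an illustrative object name
modal.class.3<-apply(mix.3$z, 1, which.max)
table(MIX_omit$GROUP, modal.class.3)
table(MIX_omit$L1, modal.class.3)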
### Multivariate mixtures ###
## Install/load packages required
library(mclust)
library(mixture)
library(mix)
MIX_all<-read.delim(stringsAsFactors=T, file.choose())
MIX_omit <- na.omit(MIX_all) # Remove NAs, if applicable
summary(MIX_omit)
str(MIX_omit)
## Fit the models
set.seed(14112)
MIX_MAT<-as.matrix(MIX_omit[,12:14]) # include only the attr adj, adv, and NN variables
str(MIX_omit[,12:14])
# Fit model set 1
mm1<-gpcm(MIX_MAT,1:4,mnames="VVV",start=1000) # fit models with 1 to 4 classes
summary(mm1)
# Fit model set 2
mm2<-gpcm(MIX_MAT,1:4,mnames="EEE",start=1000)
summary(mm2)
# Fit model set 3
mm3<-gpcm(MIX_MAT,1:4,mnames="VVI",start=1000)
summary(mm3)
## Compare the best fitting models
mm1_3<-gpcm(MIX_MAT,3,mnames="VVV",start=1000) # fit just mm1 with 3 classes
mm1_4<-gpcm(MIX_MAT,4,mnames="VVV",start=1000) # fit just mm1 with 4 classes
## Conduct likelihood ratio test for these two
LL.fullprop<-mm1_4$BIC[,,1] # log-likelihood for the 4-class model
p.fullprop<-mm1_4$BIC[,,2] # number of parameters for the 4-class model
LL.LPAprop<-mm1_3$BIC[,,1] # log-likelihood for the 3-class model
p.LPAprop<-mm1_3$BIC[,,2] # number of parameters for the 3-class model
LR<-2*(LL.fullprop - LL.LPAprop) # likelihood ratio statistic
df<-p.fullprop - p.LPAprop # difference in number of estimated parameters
pval <- 1 - pchisq(LR,df)
LR
df
pval
## Look at the parameter estimates of the selected model
mm1_4$gpar
## Print correlation matrix
cov2cor(mm1_4$gpar[[3]]$sigma) # for the three variables
## Compute standardized model entropy
entropy <- -1*(sum((mm1_4$z)*log(mm1_4$z)))
E <-1-entropy/(1229*log(4)) # 1 - entropy/(n*log(K)), with n = 1229 observations and K = 4 classes
E
## Examine which observations fall into each class
postprobs<-as.data.frame(mm1_4$z)
postprobs
postprobs2<-cbind(MIX_omit$REGISTER,postprobs)
postprobs2
sort.c1<-postprobs2[order(-postprobs2[,2]),] # to get highest-probability registers for class 1 in 4-class solution
sort.c1[1:200,]
sort.c2<-postprobs2[order(-postprobs2[,3]),] # to get highest-probability registers for class 2 in 4-class solution
sort.c2[1:200,]
sort.c3<-postprobs2[order(-postprobs2[,4]),] # to get highest-probability registers for class 3 in 4-class solution
sort.c3[1:200,]
sort.c4<-postprobs2[order(-postprobs2[,5]),] # to get highest-probability registers for class 4 in 4-class solution
sort.c4[1:200,]
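# (Optional) one possible compact summary of the 4-class solution: cross-tabulate
# each observation's modal (highest-probability) class against REGISTER;
# 'modal.class.4' is simply an illustrative object name
modal.class.4<-apply(mm1_4$z, 1, which.max)
table(MIX_omit$REGISTER, modal.class.4)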