Joint Maximum Likelihood Estimation for
High-dimensional Exploratory Item Response Analysis
Yunxiao Chen
Department of Psychology, Emory University
Xiaoou Li
School of Statistics, University of Minnesota
Siliang Zhang
Shanghai Center for Mathematical Sciences, Fudan University
Abstract
Multidimensional item response theory is widely used in education and psychology for measuring multiple latent traits. However, exploratory analysis of large-scale item response data with many items, respondents, and latent traits is still a challenge. In this paper, we consider a high-dimensional setting in which both the number of items and the number of respondents grow to infinity. A constrained joint maximum likelihood estimator is proposed for estimating both item and person parameters, which yields good theoretical properties and computational advantages. Specifically, we derive error bounds for parameter estimation and develop an efficient algorithm that can scale to very large data sets. The proposed method is applied to a large-scale personality assessment data set from the Synthetic Aperture Personality Assessment (SAPA) project. Simulation studies are conducted to evaluate the proposed method.
KEY WORDS: Item response theory, high-dimensionality, diverging number of items, joint maximum likelihood estimator (JMLE), alternating minimization, projected gradient descent, matrix completion, SAPA project
arXiv:1712.06748v1 [stat.ME] 19 Dec 2017
1 Introduction
Multidimensional item response theory (MIRT) models have been widely used in psychology
and education for measuring multiple latent traits based on dichotomous or polytomous
items. The concept of MIRT dates back to McDonald (1967), Lord et al. (1968) and Reckase
(1972), and is closely related to linear factor analysis (e.g. Anderson, 2003) that models
continuous multivariate data. We refer the readers to Reckase (2009) for a review of MIRT
models. An important application of MIRT models is exploratory item factor analysis,
which serves to uncover a set of latent traits underlying a battery of psychological items.
Such exploratory analysis facilitates the understanding of large scale response data with
many items, respondents and possibly many latent traits.
A key step in MIRT-based analysis is parameter estimation, which can be numerically challenging under the high-dimensional setting, i.e., when the number of items, the sample size, and the dimension of the latent space are all large. The most popular method for estimating item parameters is the marginal maximum likelihood estimator (MMLE). In this approach, respondent-specific parameters (i.e., their latent trait levels) are treated as random effects, and the marginal likelihood function of the item parameters is obtained by integrating out the random effects. Item parameter estimates are obtained by maximizing the marginal likelihood function. This approach typically involves evaluating a K-dimensional integral, where K is the number of latent traits. When K is moderately large (say, K ≥ 10), evaluating the integral can be numerically challenging. Many approaches have been proposed to approximate the integral, including adaptive Gaussian quadrature methods (e.g. Schilling and Bock, 2005), Monte Carlo integration (e.g. Meng and Schilling, 1996), fully Bayesian estimation methods (e.g. Béguin and Glas, 2001; Bolt and Lall, 2003; Edwards, 2010), and data-augmented stochastic approximation algorithms (e.g. Cai, 2010a,b). However, even with these state-of-the-art algorithms, the computation is time-consuming.
An alternative approach to parameter estimation in MIRT models is the joint maximum
likelihood estimator (JMLE; see, e.g. Embretson and Reise, 2000). This approach treats
both the person and item parameters as fixed effects (i.e., model parameters). Parameter
estimates are obtained by maximizing the joint likelihood function of both person and item
parameters. Traditionally, the JMLE has been less preferred than the MMLE. This is partly because, under the usual asymptotic setting where the number of respondents grows to infinity and the number of items is fixed, the number of parameters in the joint likelihood function also diverges, resulting in inconsistent estimates of the item parameters (Neyman and Scott, 1948; Andersen, 1973; Haberman, 1977; Ghosh, 1995), even for simple IRT models.
In this paper, we propose a variant of the joint maximum likelihood estimator that adds $L_2$ constraints to the parameter space. This constrained joint maximum likelihood estimator (CJMLE) is computationally efficient and statistically sound under the high-dimensional setting. In terms of computation, an alternating minimization (AM) algorithm with projected gradient descent updates is proposed, which can be parallelized efficiently. Specifically, we implement this parallel computing algorithm in R, with core functions written in C++, through Open Multi-Processing (OpenMP; Dagum and Menon, 1998), so that it scales to very large data. For example, according to our simulation study, the algorithm can fit a data set with 25,000 respondents, 1,000 items, and 10 latent traits in 12 minutes on a single machine with an Intel(R) Core(TM) i7-5650U CPU @ 2.2 GHz and eight threads (OpenMP enabled). Theoretically, we provide error bounds on parameter estimation under the setting in which both the numbers of items and respondents grow to infinity. Specifically, we show that the item parameters can be consistently estimated up to a scaling and a rotation (possibly oblique). Consequently, the latent structure of the test items may be investigated by using appropriate rotational methods (see e.g. Browne, 2001). Both our computational algorithm and theoretical framework are new to the psychometrics literature.
Our theoretical framework assumes a diverging number of items, which is suitable when analyzing large-scale data. To the best of our knowledge, such an asymptotic setting has not received enough attention, except in Haberman (1977, 2004) and Chiu et al. (2016). Our theoretical analysis applies to a general MIRT model that includes the multidimensional two-parameter logistic model (Reckase and McKinley, 1983; Reckase, 2009) as a special case, while the analyses in Haberman (1977, 2004) and Chiu et al. (2016) are limited to the unidimensional Rasch model (Rasch, 1960; Lord et al., 1968) and cognitive diagnostic models (Rupp et al., 2010), respectively. Our technical tools for studying the properties of the CJMLE include theoretical developments in matrix completion theory (e.g. Candès and Plan, 2010; Candès and Tao, 2010; Candès and Recht, 2012; Davenport et al., 2014) and matrix perturbation theory (e.g. Stewart and Sun, 1990).
We apply the proposed method to a data set from the Synthetic Aperture Personality Assessment (SAPA) project (Condon and Revelle, 2015) that was collected to evaluate the structure of personality constructs in the temperament domain. Applying traditional exploratory item factor analysis methods to this data set is challenging because (1) the data set is of large scale, containing 23,681 respondents, 696 items, and possibly many latent traits, (2) the responses are not continuous, so that linear factor analysis is inappropriate, and (3) the data contain "massive missingness" by design, with responses missing completely at random. The proposed method turns out to be suitable for analyzing this data set. Specifically, the prediction power of the MIRT model and the CJMLE is studied, and the latent structure is investigated and interpreted.
The rest of the paper is organized as follows. In Section 2, we propose the CJMLE and an algorithm for its computation. Theoretical properties of the CJMLE are developed in Section 3, followed by simulation studies and real data analysis in Sections 4 and 5. Finally, discussions are provided in Section 6. Proofs of our theoretical results are provided in the Appendix.
2 Method
2.1 Model
To simplify the presentation, we focus on binary response data. We point out that our framework can be extended to ordinal and nominal responses without much difficulty. Let $i = 1, \dots, N$ index the respondents and $j = 1, \dots, J$ index the items. Each respondent is represented by a K-dimensional latent vector $\theta_i = (\theta_{i1}, \dots, \theta_{iK})^\top$, and each item is represented by K parameters $a_j = (a_{j1}, \dots, a_{jK})^\top$. Let $Y_{ij}$ be the response from respondent $i$ to item $j$, which is assumed to follow the distribution
$$P(Y_{ij} = 1 \mid \theta_i, a_j) = \frac{\exp(a_j^\top\theta_i)}{1 + \exp(a_j^\top\theta_i)}. \quad (1)$$
This model can be viewed as an extension of the multidimensional two-parameter logistic model (M2PL; Reckase and McKinley, 1983; Reckase, 2009). That is, if $\theta_{i1}$ is set to be 1 for all $i$, then (1) becomes
$$P(Y_{ij} = 1 \mid \theta_i, a_j) = \frac{\exp\big(a_{j1} + \sum_{k=2}^K a_{jk}\theta_{ik}\big)}{1 + \exp\big(a_{j1} + \sum_{k=2}^K a_{jk}\theta_{ik}\big)},$$
which takes the same form as a multidimensional two-parameter logistic model with $K - 1$ latent traits. Similar to many latent factor models, model (1) is also rotational and scaling invariant, as will be discussed in Section 3.
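To make the data-generating process concrete, the following R sketch simulates binary responses from model (1). This is a minimal illustration, not the authors' code; the dimensions and the normal distributions used to fill Θ and A are our own illustrative assumptions.

```r
# A minimal sketch (not the authors' code): simulate responses from model (1).
set.seed(1)
N <- 200; J <- 50; K <- 3                    # illustrative dimensions
Theta <- matrix(rnorm(N * K), N, K)          # row i holds theta_i
A     <- matrix(rnorm(J * K), J, K)          # row j holds a_j
M     <- Theta %*% t(A)                      # logits m_ij = a_j' theta_i
P     <- 1 / (1 + exp(-M))                   # P(Y_ij = 1 | theta_i, a_j)
Y     <- matrix(rbinom(N * J, 1, P), N, J)   # binary responses
```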
2.2 Constrained Joint Maximum Likelihood Estimator
The traditional way of estimating item parameters in an item response theory model is through the marginal maximum likelihood estimator. That is, the latent vectors $\theta_i$ are further assumed to be independent and identically distributed samples from some distribution $F(\theta)$. Let $y_{ij}$ be the observed value of the random variable $Y_{ij}$. The marginal likelihood function is a function of $a_1, \dots, a_J$, defined as
$$L^M(a_1, \dots, a_J) = \prod_{i=1}^N \int \prod_{j=1}^J P(Y_{ij} = 1 \mid \theta_i, a_j)^{y_{ij}}\big(1 - P(Y_{ij} = 1 \mid \theta_i, a_j)\big)^{1 - y_{ij}}\, F(d\theta_i) = \prod_{i=1}^N \int \prod_{j=1}^J \frac{\exp(a_j^\top\theta_i y_{ij})}{1 + \exp(a_j^\top\theta_i)}\, F(d\theta_i). \quad (2)$$
Marginal maximum likelihood estimates (MMLE) of $a_1, \dots, a_J$ are then obtained by maximizing $L^M(a_1, \dots, a_J)$. Appropriate constraints may be imposed on $a_1, \dots, a_J$ to ensure identifiability. In this approach, the latent traits $\theta_i$ are treated as random effects, while the item parameters $a_1, \dots, a_J$ are treated as fixed effects. When K is large, the marginal maximum likelihood estimator suffers from the computational challenge brought by the K-dimensional integration in (2). The computation becomes time-consuming when K is moderately large (say, K ≥ 10), even with state-of-the-art algorithms.
An alternative approach is the joint maximum likelihood estimator (JMLE; see e.g. Embretson and Reise, 2000). In this approach, both the $\theta_i$'s and the $a_j$'s are treated as fixed effects, and their estimates are obtained by maximizing the joint likelihood function
$$L^J(\theta_1, \dots, \theta_N, a_1, \dots, a_J) = \prod_{i=1}^N\prod_{j=1}^J P(Y_{ij} = 1 \mid \theta_i, a_j)^{y_{ij}}\big(1 - P(Y_{ij} = 1 \mid \theta_i, a_j)\big)^{1 - y_{ij}} = \prod_{i=1}^N\prod_{j=1}^J \frac{\exp(a_j^\top\theta_i y_{ij})}{1 + \exp(a_j^\top\theta_i)}. \quad (3)$$
Traditionally, the JMLE has been less preferred than the MMLE. This is partly because, under the usual asymptotic setting where the number of respondents grows to infinity and the number of items is fixed, the number of parameters in $L^J$ also diverges, resulting in inconsistent estimates of the item parameters even for the Rasch model, which takes a simple form (Andersen, 1973; Haberman, 1977; Ghosh, 1995).

Under the high-dimensional setting, N, J, and K can all be large, so that the MMLE is computationally infeasible. Interestingly, the JMLE becomes a good choice. Intuitively, the number of parameters in the joint likelihood is of the order $K(N+J)$, which can be substantially smaller than the number of observed responses (of the order $NJ$) when K is relatively small. Consequently, under reasonable regularity conditions, the JMLE yields satisfactory statistical properties, as will be discussed in Section 3. In fact, Haberman (1977) shows that under the Rasch model, the estimation of both respondent parameters and item parameters is consistent when J and N grow to infinity and $J^{-1}\log(N)$ converges to 0. Unfortunately, the results of Haberman (1977) rely on the simple form of the Rasch model and thus can hardly be generalized to the general model setting in (1).
Computationally, (3) can be optimized efficiently. That is because maximizing (3) is equivalent to maximizing its logarithm, which can be written as
$$\ell^J(\theta_1, \dots, \theta_N, a_1, \dots, a_J) = \log L^J(\theta_1, \dots, \theta_N, a_1, \dots, a_J) = \sum_{i=1}^N\sum_{j=1}^J\Big\{a_j^\top\theta_i y_{ij} - \log\big(1 + \exp(a_j^\top\theta_i)\big)\Big\}. \quad (4)$$
Due to its factorization, $-\ell^J$ is a biconvex function of $(\theta_1, \dots, \theta_N)$ and $(a_1, \dots, a_J)$, meaning that $-\ell^J$ is a convex function of $(\theta_1, \dots, \theta_N)$ when $(a_1, \dots, a_J)$ is given, and vice versa. As a consequence, (4) can be optimized by iteratively maximizing with respect to one set of parameters given the other. This optimization program can be further sped up by parallel computing, thanks to the special structure of $\ell^J$. See Section 2.4 for further discussion.
Following the above discussion, we propose a constrained joint maximum likelihood estimator, defined as
$$(\hat\theta_1, \dots, \hat\theta_N, \hat a_1, \dots, \hat a_J) = \mathop{\arg\min}_{\theta_1, \dots, \theta_N,\, a_1, \dots, a_J} -\ell^J(\theta_1, \dots, \theta_N, a_1, \dots, a_J), \quad \text{s.t. } \|\theta_i\|^2 \le C\ \ \forall i,\quad \|a_j\|^2 \le C\ \ \forall j, \quad (5)$$
where $\|\cdot\|$ denotes the Euclidean norm of a vector and $C$ is a positive tuning parameter that regularizes the magnitude of each $\theta_i$ and $a_j$. When $C = \infty$, (5) becomes exactly a JMLE. The constraints in (5) avoid overfitting when the number of model parameters is large.
This estimator has the property of rotational invariance. To see this, we rewrite the joint likelihood function in matrix form. Let $A_{J\times K} = (a_1, \dots, a_J)^\top$, $\Theta_{N\times K} = (\theta_1, \dots, \theta_N)^\top$, and $M_{N\times J} = (m_{ij}) = \Theta A^\top$. Then the joint log-likelihood function can be rewritten as
$$\ell^J(\Theta, A) = \ell^J(M) = \sum_{i=1}^N\sum_{j=1}^J\Big\{m_{ij}y_{ij} - \log\big(1 + \exp(m_{ij})\big)\Big\}.$$
The rotational invariance property of the CJMLE is described by the following proposition.

Proposition 1. Let $(\hat A, \hat\Theta)$ be a minimizer of (5). Then for any $K\times K$ orthogonal matrix $Q$, $(\hat AQ, \hat\Theta Q)$ is also a minimizer of (5).

The CJMLE is even invariant up to a scaling and an oblique rotation if the estimate $(\hat A, \hat\Theta)$ satisfies $\|\hat a_j\|^2 < C$ for all $j$ or $\|\hat\theta_i\|^2 < C$ for all $i$. We summarize this in the following proposition.

Proposition 2. Let $(\hat A, \hat\Theta)$ be a minimizer of (5). Suppose that $\|\hat a_j\|^2 < C$ for all $j$ or $\|\hat\theta_i\|^2 < C$ for all $i$. Then there exists an invertible $K\times K$ non-identity matrix $R$ such that $(\hat A(R^{-1})^\top, \hat\Theta R)$ is still a minimizer of (5).
Like the tuning parameter in ridge regression, our tuning parameter $C$ controls the bias-variance trade-off of the estimation. Roughly speaking, for a large value of $C$, the true parameters $\theta_i^*$ and $a_j^*$ are likely to satisfy the constraints, so the estimation tends to have a small bias, while the variance may be large. When $C$ is small, the bias tends to be large and the variance tends to be small.
2.3 Practical Consideration: Missing Responses
In practice, each respondent may only respond to a small proportion of items, possibly due
to the data collection design. For example, in the SAPA data that are analyzed in Section 5,
each respondent is only administered a small random subset of 696 items (containing on
average 86 items) to avoid high time costs to respondents.
Our CJMLE can be modified to handle missing responses while maintaining its computational advantage. In addition, as will be shown in Section 3, under a reasonable regularity condition on the data missingness mechanism, the CJMLE still has good statistical properties. Let $\Omega = (\omega_{ij})_{N\times J}$ denote the indicator matrix of non-missing responses, where $\omega_{ij} = 1$ if item $j$ is administered to respondent $i$. Then the joint log-likelihood function becomes
$$\ell^J_\Omega(\Theta, A) = \ell^J_\Omega(\Theta A^\top) = \sum_{(i,j):\,\omega_{ij}=1}\Big\{m_{ij}y_{ij} - \log\big(1 + \exp(m_{ij})\big)\Big\},$$
and our CJMLE becomes
$$(\hat\Theta, \hat A) = \mathop{\arg\min}_{\Theta,\, A} -\ell^J_\Omega(\Theta A^\top), \quad \text{s.t. } \|\theta_i\|^2 \le C\ \ \forall i,\quad \|a_j\|^2 \le C\ \ \forall j. \quad (6)$$
If $\omega_{ij} = 1$ for all $i$ and $j$, no response is missing and (6) is the same as (5). From now on, we focus on the analysis of (6), which includes (5) as the special case where $\omega_{ij} = 1$ for all $i$ and $j$.
2.4 Algorithm
We develop an alternating minimization algorithm for solving (6), of the kind often used to fit low-rank models (see e.g. Udell et al., 2016). This algorithm can be efficiently parallelized. In this algorithm, we assume that the number of latent traits $K$ and the tuning parameter $C$ are known. In practice, when one has no knowledge about $K$ and $C$, they are chosen by T-fold (e.g., T = 5) cross-validation.
To handle the constraints in (6), a projected gradient descent update is used in each iteration. Such an update proceeds as follows. Let $x$ be a K-dimensional vector. We define the projection operator
$$\mathrm{Prox}_C(y) = \mathop{\arg\min}_{x:\,\|x\|^2 \le C}\|y - x\|^2 = \begin{cases} y & \text{if } \|y\|^2 \le C;\\ \dfrac{\sqrt C}{\|y\|}\,y & \text{if } \|y\|^2 > C. \end{cases} \quad (7)$$
Here, $\mathrm{Prox}_C(y)$ returns the feasible point (i.e., a point satisfying $\|\mathrm{Prox}_C(y)\|^2 \le C$) that is closest to $y$. Consider the optimization problem
$$\min_x\ f(x) \quad \text{s.t. } \|x\|^2 \le C, \quad (8)$$
where $f$ is a differentiable convex function. Denote the gradient of $f$ by $g$. A projected gradient descent update at $x^{(0)}$ is then defined as
$$x^{(1)} = \mathrm{Prox}_C\big(x^{(0)} - \eta\, g(x^{(0)})\big),$$
where $\eta > 0$ is a step size decided by line search. Due to the projection, $\|x^{(1)}\|^2 \le C$. Furthermore, it can be shown that for sufficiently small $\eta$, $f(x^{(1)}) < f(x^{(0)})$ when $f$ satisfies mild regularity conditions and $g(x^{(0)}) \ne 0$. We refer the readers to Parikh et al. (2014) for further details about this projected gradient descent update.
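A minimal R sketch of the projection (7) and of one projected gradient descent update follows. It assumes the squared-norm constraint set {x : ||x||^2 <= C} of (5)-(6); the objective f and gradient g are generic placeholders supplied by the caller, not quantities defined in the paper.

```r
# Projection of y onto {x : ||x||^2 <= C}, as in (7).
prox_C <- function(y, C) {
  if (sum(y^2) <= C) y else sqrt(C) * y / sqrt(sum(y^2))
}

# One projected gradient descent update with backtracking line search:
# eta is halved until the objective decreases, as required by Algorithm 1.
projected_gradient_step <- function(x0, f, g, C, eta = 1) {
  repeat {
    x1 <- prox_C(x0 - eta * g(x0), C)
    if (f(x1) < f(x0) || eta < 1e-10) return(x1)
    eta <- eta / 2
  }
}
```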
Algorithm 1 (Alternating minimization algorithm for the CJMLE).

1. (Initialization) Input the responses $y_{ij}$, the non-missing response indicator $\Omega$, the dimension of the latent space $K$, the constraint parameter $C$, the iteration number $m = 1$, and initial values $\Theta^{(0)}$ and $A^{(0)}$.

2. (Alternating minimization) At the $m$th iteration, perform parallel computation for each block:

For each respondent $i$, update
$$\theta^{(m)}_i = \mathrm{Prox}_C\Big(\theta^{(m-1)}_i - \eta\, g_i\big(\theta^{(m-1)}_i, A^{(m-1)}\big)\Big), \quad (9)$$
where $g_i(\theta^{(m-1)}_i, A^{(m-1)})$ is the gradient of
$$l_i\big(\theta, A^{(m-1)}\big) = \sum_{j:\,\omega_{ij}=1}\Big\{-y_{ij}\sum_{k=1}^K\theta_k a^{(m-1)}_{jk} + \log\Big(1 + \exp\Big(\sum_{k=1}^K\theta_k a^{(m-1)}_{jk}\Big)\Big)\Big\}$$
with respect to $\theta = (\theta_1, \dots, \theta_K)^\top$, evaluated at $\theta^{(m-1)}_i$. Here $\eta > 0$ is a step size chosen by line search, which guarantees that $l_i(\theta^{(m)}_i, A^{(m-1)}) < l_i(\theta^{(m-1)}_i, A^{(m-1)})$.

For each item $j$, update
$$a^{(m)}_j = \mathrm{Prox}_C\Big(a^{(m-1)}_j - \eta\, \tilde g_j\big(a^{(m-1)}_j, \Theta^{(m)}\big)\Big), \quad (10)$$
where $\tilde g_j(a^{(m-1)}_j, \Theta^{(m)})$ is the gradient of
$$\tilde l_j\big(a, \Theta^{(m)}\big) = \sum_{i:\,\omega_{ij}=1}\Big\{-y_{ij}\sum_{k=1}^K a_k\theta^{(m)}_{ik} + \log\Big(1 + \exp\Big(\sum_{k=1}^K a_k\theta^{(m)}_{ik}\Big)\Big)\Big\}$$
with respect to $a = (a_1, \dots, a_K)^\top$, evaluated at $a^{(m-1)}_j$. Here $\eta > 0$ is a step size chosen by line search, which guarantees that $\tilde l_j(a^{(m)}_j, \Theta^{(m)}) < \tilde l_j(a^{(m-1)}_j, \Theta^{(m)})$.

3. (Output) Iteratively perform Step 2 until convergence. Output $\hat\Theta = \Theta^{(M)}$ and $\hat A = A^{(M)}$, where $M$ is the last iteration number.
Because each block update strictly decreases the corresponding negative log-likelihood term, the algorithm guarantees that the joint likelihood function increases in each iteration. The parallel computing in Step 2 of the algorithm is implemented through OpenMP, which greatly speeds up the computation even on a single machine with multiple cores. The efficiency of this parallel algorithm is further amplified when running on a computer cluster with many machines.
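The sketch below illustrates Algorithm 1 in plain, serial R under the missing-data likelihood (6). It is a simplified stand-in for the paper's R/C++/OpenMP implementation: a fixed step size replaces the per-block line search, the initialization is an arbitrary assumption, and it reuses prox_C from the sketch above. Y is the N x J response matrix and Omega the 0/1 non-missingness indicator.

```r
# Negative log-likelihood (6), restricted to the observed entries.
neg_loglik <- function(Theta, A, Y, Omega) {
  M <- Theta %*% t(A)
  sum(Omega * (log(1 + exp(M)) - Y * M))
}

# Simplified alternating minimization for the CJMLE (illustrative only;
# a fixed step size 'eta' replaces the line search of Algorithm 1).
cjmle <- function(Y, Omega, K, C, n_iter = 100, eta = 0.01) {
  N <- nrow(Y); J <- ncol(Y)
  Theta <- matrix(rnorm(N * K, sd = 0.1), N, K)   # arbitrary starting values
  A     <- matrix(rnorm(J * K, sd = 0.1), J, K)
  for (m in seq_len(n_iter)) {
    P <- 1 / (1 + exp(-Theta %*% t(A)))
    G_Theta <- (Omega * (P - Y)) %*% A            # gradient w.r.t. Theta
    Theta <- t(apply(Theta - eta * G_Theta, 1, prox_C, C = C))
    P <- 1 / (1 + exp(-Theta %*% t(A)))
    G_A <- t(Omega * (P - Y)) %*% Theta           # gradient w.r.t. A
    A <- t(apply(A - eta * G_A, 1, prox_C, C = C))
  }
  list(Theta = Theta, A = A, loss = neg_loglik(Theta, A, Y, Omega))
}
```

Each row update is independent of the others within a block, which is what makes the per-respondent and per-item updates embarrassingly parallel in the paper's OpenMP implementation.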
3 Theory
The proposed CJMLE has good statistical properties under the high-dimensional setting, where $J$ and $N$ both grow to infinity. We assume that the number of latent traits $K$ is fixed and known and that the true parameters $\Theta^*$ and $A^*$ satisfy

A1. $\|\theta_i^*\|^2 \le C$ and $\|a_j^*\|^2 \le C$ for all $i$, $j$.
We first derive an error bound for $\hat\Theta\hat A^\top - \Theta^*(A^*)^\top$, where $\Theta^*$ and $A^*$ are the true parameters. It requires an assumption on the mechanism of data missingness: missing completely at random.

A2. The $\omega_{ij}$'s in $\Omega$ are independent and identically distributed Bernoulli random variables with $P(\omega_{ij} = 1) = \frac{n}{NJ}$.

Note that under assumption A2, we observe on average $n$ responses; that is,
$$E\Big(\sum_{i=1}^N\sum_{j=1}^J\omega_{ij}\Big) = n.$$
With assumptions A1 and A2, we have the following theorem.
Theorem 1. Suppose that assumptions A1 and A2 are satisfied. Further assume that $n \ge (N+J)\log(JN)$. Then there exist absolute constants $C_1$ and $C_2$ such that
$$\frac{1}{NJ}\big\|\hat\Theta\hat A^\top - \Theta^*(A^*)^\top\big\|_F^2 \le C_2Ce^C\sqrt{\frac{J+N}{n}} \quad (11)$$
is satisfied with probability $1 - C_1/(N+J)$.
We call the left-hand side of (11) the scaled Frobenius loss, and we provide some remarks on Theorem 1. First, under the asymptotic regime in which $N$ and $J$ both grow to infinity, the probability that (11) is satisfied converges to 1. Second, the left-hand side of (11) is a reasonable scaling of the squared Frobenius norm of the error $\Delta = \hat\Theta\hat A^\top - \Theta^*(A^*)^\top$: $\Delta$ is an $N\times J$ matrix, so $\|\Delta\|_F^2$ has $N\times J$ terms, and when both $N$ and $J$ grow to infinity, the unscaled $\|\Delta\|_F^2$ may diverge with positive probability. Third, when $n \ge (N+J)\log(JN)$, as assumed, the right-hand side of (11) converges to 0. In particular, if no response is missing, then $n = NJ$ and the right-hand side of (11) converges to 0 at speed $O(\max\{1/\sqrt J, 1/\sqrt N\})$. Fourth, the constant $C$ plays a role on the right-hand side of (11): for any $C$ satisfying A1, the larger the $C$ used in (6), the larger the upper bound in (11); ideally, we would like to use the smallest $C$ that satisfies A1. Finally, the error bound in Theorem 1 can be further used to quantify the error of predicting respondents' responses to items that have not been administered under the missing-data design.
We then study the recovery of $A^*$, as it is of interest to explore the factor structure of the items via the loading matrix. However, due to the rotational and scaling invariance properties, it is not appropriate to measure the closeness between $A^*$ and $\hat A$ directly by a matrix norm. Following the canonical approach in matrix perturbation theory (e.g. Stewart and Sun, 1990), we measure their closeness by the largest principal angle between the linear spaces $\mathcal V$ and $\hat{\mathcal V}$ spanned by the column vectors of $A^*$ and $\hat A$, respectively. Specifically, for two linear spaces $\mathcal U$ and $\mathcal V$, the sine of the largest principal angle is defined as
$$\sin\angle(\mathcal U, \mathcal V) \triangleq \max_{u\in\mathcal U,\,u\ne 0}\ \min_{v\in\mathcal V,\,v\ne 0}\ \sin\angle(u, v),$$
where $\angle(u, v)$ is the angle between two vectors. In fact, $\sin\angle(\mathcal U, \mathcal V) \ge 0$, and equality is attained when $\mathcal U = \mathcal V$.

If the largest principal angle between $\mathcal V$ and $\hat{\mathcal V}$ is 0, then $\mathcal V = \hat{\mathcal V}$, meaning that there exists a rank-$K$ matrix $R$ such that $A^* = \hat AR$. In other words, $A^*$ and $\hat A$ are equivalent up to a scaling and an oblique rotation. Roughly speaking, if the largest principal angle is close to zero and $A^*$ has a simple loading structure, the simple structure of $A^*$ may be found by applying a rotational method to $\hat A$, such as the promax rotation (Hendrickson and White, 1964).
In what follows, we provide a bound on $\sin\angle(\mathcal V, \hat{\mathcal V})$. We make the following assumption.

A3. There exists a positive constant $C_3 > 0$ such that the $K$th largest singular value of $\Theta^*(A^*)^\top$, denoted by $\sigma_K^*$, satisfies
$$\sigma_K^* \ge C_3\sqrt{NJ}$$
when $N$ and $J$ grow to infinity.
Theorem 2. Suppose that assumptions A1, A2, and A3 are satisfied and $n \ge (N+J)\log(JN)$. Then
$$\sin\angle(\mathcal V, \hat{\mathcal V}) \le \frac{2}{C_3}\big(C_2Ce^C\big)^{\frac12}\Big(\frac{J+N}{n}\Big)^{\frac14}$$
is satisfied with probability $1 - C_1/(N+J)$.

Theorem 2 has a straightforward corollary.

Corollary 1. Under the same assumptions as in Theorem 2, $\sin\angle(\mathcal V, \hat{\mathcal V})$ converges to 0 in probability as $N, J \to \infty$.
We point out that assumption A3 is mild. The following proposition implies that assumption A3 is satisfied with high probability when the $\theta_{ik}^*$'s and $a_{jk}^*$'s are independent samples from certain distributions.

Proposition 3. Suppose that the $\theta_{ik}^*$'s are i.i.d. samples with mean 0 and variance $\delta_1^2$ and the $a_{jk}^*$'s are i.i.d. samples with mean 0 and variance $\delta_2^2$. Then there exists a constant $C_3 > 0$ such that $\sigma_K^*/\sqrt{NJ} \ge C_3$ in probability, when $N$ and $J$ grow to infinity.
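As a quick numerical illustration of Proposition 3 (our own check, not from the paper), the following R snippet shows that the $K$th singular value of $\Theta^*(A^*)^\top$ scales like $\sqrt{NJ}$ when the entries are i.i.d. standard normal ($\delta_1 = \delta_2 = 1$); the specific dimensions are illustrative.

```r
# Numerical check of assumption A3 under Proposition 3's conditions.
set.seed(1)
N <- 2000; J <- 400; K <- 5
Theta_star <- matrix(rnorm(N * K), N, K)   # delta_1 = 1
A_star     <- matrix(rnorm(J * K), J, K)   # delta_2 = 1
sigma_K <- svd(Theta_star %*% t(A_star))$d[K]
sigma_K / sqrt(N * J)                      # stabilizes near a positive constant
```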
4 Simulation
4.1 Study I
We evaluate the performance of the CJMLE and verify the theoretical results by simulation studies. The number of items $J$ takes values 300, 400, ..., 1000, and for each given $J$, the sample size is $N = 25J$. The numbers of latent traits $K = 5$ and $10$ are considered and are assumed to be known. We sample the $\theta_{ik}$'s from a truncated normal distribution,
$$\theta_{ik} \sim TN(\mu, \sigma, l, u),$$
Figure 1 (panels (a) K = 5 and (b) K = 10; x-axis: J; y-axis: scaled Frobenius loss): The scaled Frobenius loss under different simulation settings (dashed line: no missing (q = 1); dotted line: moderate missing (q = 0.95); solid line: massive missing (q = 0.87)).
where the underlying normal distribution has mean $\mu = 0$ and standard deviation $\sigma = 1$, and the truncation interval is $[l, u] = [-1.5, 1.5]$. This implies that $\|\theta_i\|^2 \le 2.25K$. The loading parameters $a_{jk}$ are generated from a mixture distribution. Let $\delta_{jk}$ be a random indicator satisfying $P(\delta_{jk} = 1) = 0.8$. If $\delta_{jk} = 0$, $a_{jk}$ is set to be 0, and if $\delta_{jk} = 1$, $a_{jk}$ is sampled from the truncated normal distribution $TN(0, 1, -1.5, 1.5)$. Under this setting, about 20% of the $a_{jk}$'s are 0, which mimics a simple loading structure in which each item does not measure all the latent traits. Similarly, $\|a_j\|^2 \le 2.25K$. Missing data are considered by setting $n = (NJ)^q$, where $q$ is set to 0.87, 0.95, and 1, corresponding to the situations of "massive missing", "moderate missing", and "no missing", respectively. To provide some intuition, when $J = 1000$ and $N = 25{,}000$, only about 11% and 43% of the entries of $(y_{ij})_{N\times J}$ are non-missing when $q = 0.87$ and $0.95$, respectively. For each combination of $N$, $J$, and $K$, 50 independent replications are generated.
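The following R sketch mirrors the Study I generating process just described; the rejection-based truncated-normal sampler and the variable names are our own illustrative choices, as the paper does not publish its simulation code.

```r
# Generate Theta, A, and the missingness indicator as in Study I.
rtnorm <- function(n, mu = 0, sigma = 1, l = -1.5, u = 1.5) {
  x <- numeric(0)                       # truncated normal via rejection
  while (length(x) < n) {
    y <- rnorm(n, mu, sigma)
    x <- c(x, y[y >= l & y <= u])
  }
  x[1:n]
}
set.seed(1)
J <- 300; N <- 25 * J; K <- 5
Theta <- matrix(rtnorm(N * K), N, K)
delta <- matrix(rbinom(J * K, 1, 0.8), J, K)   # ~20% of loadings are zero
A     <- delta * matrix(rtnorm(J * K), J, K)
q     <- 0.95                                  # moderate missingness
Omega <- matrix(rbinom(N * J, 1, (N * J)^q / (N * J)), N, J)
```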
Figure 2 (panels (a) K = 5 and (b) K = 10; x-axis: J; y-axis: largest principal angle): The largest principal angle between $\mathcal V$ and $\hat{\mathcal V}$ under different simulation settings (dashed line: no missing (q = 1); dotted line: moderate missing (q = 0.95); solid line: massive missing (q = 0.87)).

We now verify the theoretical results. To guarantee that assumption A1 is satisfied, we set $C = 2.25K$. Results are shown in Figures 1 and 2. In the two panels of Figure 1, the x-axis records the value of $J$ and the y-axis represents the scaled Frobenius loss $\|\hat M - M^*\|_F^2/(NJ)$. The lines in Figure 1 are obtained by connecting the medians of the scaled Frobenius loss
among the 50 replications for different values of $J$, given $K$ and $q$. For each setting of $J$, $K$, and $q$, an interval is also plotted that shows the upper and lower quartiles of the scaled Frobenius loss. According to Figure 1, for each missing-data situation, the scaled Frobenius loss tends to converge to 0 as $N$ and $J$ simultaneously grow. In addition, we observe that the less missing data, the smaller the scaled Frobenius loss. The results in Figure 1 are consistent with Theorem 1.

In Figure 2, the x-axis records the number of items $J$ and the y-axis records the largest principal angle between the column spaces of $A^*$ and $\hat A$, $\sin\angle(\mathcal V, \hat{\mathcal V})$. Similarly to the above, the median and the upper and lower quartiles of the largest principal angle among the 50 independent replications are presented. According to Figure 2, the largest principal angle between the two spaces $\mathcal V$ and $\hat{\mathcal V}$ tends to converge to 0 when both $N$ and $J$ diverge, for given $K$ and $q$. In addition, having less missing data yields a smaller principal angle. The results in Figure 2 are consistent with Theorem 2.
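For reference, the largest principal angle reported in Figure 2 can be computed from orthonormal bases of the two column spaces, for example as in the following R sketch (our own evaluation code, under the assumption that both loading matrices have full column rank K).

```r
# sin of the largest principal angle between col(A_star) and col(A_hat).
sin_largest_angle <- function(A_star, A_hat) {
  Q1 <- qr.Q(qr(A_star))                  # orthonormal basis of V
  Q2 <- qr.Q(qr(A_hat))                   # orthonormal basis of V-hat
  cosines <- svd(t(Q1) %*% Q2)$d          # cosines of all principal angles
  sqrt(max(0, 1 - min(cosines)^2))        # largest angle has smallest cosine
}
```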
Figure 3 (panels (a) scaled Frobenius loss and (b) largest principal angle; x-axis: N; lines: J = 100, 200, 400): The scaled Frobenius loss and largest principal angle when J is fixed (J = 100, 200, 400) and N varies from 1,000 to 10,000 under the "no missing" setting.
4.2 Study II
We then investigate the situation where $J$ is fixed and $N$ grows. Specifically, we consider $K = 5$, $J = 100, 200, 400$, and $N$ growing from 1,000 to 10,000. The generation of $\theta_{ik}$ and $a_{jk}$ is the same as in Study I, and $C$ is still set to be $2.25K$. For ease of presentation, we only show the "no missing" case ($q = 1$); the patterns are similar in the presence of missing data. Results are shown in Figure 3. The left panel of Figure 3 plots the sample size $N$ (x-axis) versus the scaled Frobenius loss (y-axis), and the right panel plots the sample size $N$ (x-axis) versus the largest principal angle (y-axis). According to Figure 3, when $J$ is fixed and $N$ grows, both the scaled Frobenius loss and the largest principal angle first decrease and then tend to stabilize around some positive values. This result implies that the CJMLE may not perform well when either $J$ or $N$ is small.
5 Real Data Analysis
We apply the proposed method to selected personality data from the Synthetic Aperture Personality Assessment (SAPA) project (Condon and Revelle, 2015); the data set is available from http://dx.doi.org/10.7910/DVN/SD7SVE. The SAPA project is a collaborative online data collection tool for assessing psychological constructs across multiple domains of personality. This data set was collected to evaluate the structure of personality constructs in the temperament domain. It contains N = 23,681 respondents' responses to J = 696 personality assessment items. The items are from either the International Personality Item Pool (Goldberg, 1999; Goldberg et al., 2006) or the Eysenck Personality Questionnaire - Revised (Eysenck et al., 1985). The data contain "massive missingness" by design, and the missingness mechanism can be classified as missing completely at random. The mean number of items to which a respondent responded is 86.1 (sd = 58.7; median = 71), and the mean number of administrations for each item is 2,931 (sd = 781; median = 2,554). All the items are on a six-category rating scale, and in this analysis we dichotomize them by truncating at the median response of each item. The readers are referred to Condon and Revelle (2015) for more detailed information about this data set.
We first explore the latent dimensionality of the data. Specifically, we consider the dimensions $K = 1, 5, 10, 15, 20$, and $25$ and the constraint tuning parameters $C = 2, 4, 6, 8$, and $10$. The combination of $K$ and $C$ that best predicts the missing responses is investigated by five-fold cross-validation. Let $\Omega_0$ be the indicator matrix of non-missing responses. We randomly split the non-missing responses into five non-overlapping sets of equal size, indicated by $\Omega^{(t)} = (\omega^{(t)}_{ij})_{N\times J}$, $t = 1, 2, \dots, 5$, satisfying $\sum_{t=1}^5\Omega^{(t)} = \Omega_0$. Moreover, we denote $\Omega^{(-t)} = \sum_{s\ne t}\Omega^{(s)}$, indicating the data excluding set $t$. For given $K$ and $C$, we find the CJMLE for each $\Omega^{(-t)}$ ($t = 1, \dots, 5$), denoted by $(\hat A^{(t)}, \hat\Theta^{(t)})$, by solving (6). Then the fitted model with parameters $(\hat A^{(t)}, \hat\Theta^{(t)})$ is used to predict the non-missing responses in set $t$. The prediction accuracy is
measured by the following log-likelihood based cross-validation (CV) error:
$$\sum_{t=1}^5\ \sum_{(i,j):\,\omega^{(t)}_{ij}=1} -\Big\{y_{ij}\log\hat m^{(t)}_{ij} + (1 - y_{ij})\log\big(1 - \hat m^{(t)}_{ij}\big)\Big\},$$
where $\hat m^{(t)}_{ij}$ is the predicted probability of a positive response, obtained by applying the logistic transformation to the $(i, j)$th entry of $\hat\Theta^{(t)}(\hat A^{(t)})^\top$. The combination of $K$ and $C$ that yields a smaller CV error is preferred, as the corresponding model tends to better predict unobserved responses.
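The following R sketch outlines this five-fold cross-validation over the observed entries. It is illustrative code reusing the hypothetical cjmle function sketched in Section 2.4; Y, Omega, K, and C are assumed to be in scope.

```r
# Five-fold CV over the non-missing entries; holds out one fold at a time.
obs  <- which(Omega == 1)                        # indices of observed cells
fold <- sample(rep(1:5, length.out = length(obs)))
cv_error <- 0
for (t in 1:5) {
  Omega_train <- Omega
  Omega_train[obs[fold == t]] <- 0               # exclude fold t when fitting
  fit <- cjmle(Y, Omega_train, K = K, C = C)
  P   <- 1 / (1 + exp(-fit$Theta %*% t(fit$A)))  # predicted probabilities
  ho  <- obs[fold == t]
  cv_error <- cv_error -
    sum(Y[ho] * log(P[ho]) + (1 - Y[ho]) * log(1 - P[ho]))
}
```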
Results are shown in Figures 4 and 5. Except when K = 1, for each value of K the log-likelihood based CV error first decreases and then increases as C grows, achieving its minimum at C = 4. When K = 1, the CV error decreases as C increases from 2 to 10. This may be due to a bias-variance trade-off, in which the bias decreases and the variance increases as C increases; when K = 1, the bias term may dominate the variance, due to the stringent unidimensionality assumption. In addition, the CV errors when K = 1 are much higher than the smallest CV errors under the other choices of K, as shown in Figure 5. This means that a unidimensional model is inadequate to capture the underlying structure of the 696 items. Figure 6 shows the smallest CV error when K = 5, 10, 15, 20, and 25. Specifically, the smallest CV error is achieved when K = 15 and C = 4, meaning that, in terms of prediction, 15 latent traits tend to best characterize this personality item pool in the temperament domain.
We then explore the factor loading matrix $\hat A$, with K = 15 and C = 4 chosen based on the CV error above. The loading matrix after the promax rotation has a good interpretation. Specifically, for each latent trait we present the 10 items with the highest absolute loadings, as shown in Tables 1 and 2, where "(+)" and "(−)" denote positive and negative loadings. The latent traits are ordered by the total sum of squares (TSS) of their loadings. In fact, the items that load heavily on latent traits 2-12 seem to be about "compassion", "extroversion", "depression", "conscientiousness", "intellect", "irritability", "boldness", "intellect", "perfection", "traditionalism", and "honesty", respectively. Except for latent traits 6 and 9, which both seem to be about "intellect", all the others among traits 2-12 seem to be about different
Figure 4 (six panels, one per K in {1, 5, 10, 15, 20, 25}; x-axis: C; y-axis: CV error): Log-likelihood based cross-validation error under different combinations of K and C.
Figure 5 (x-axis: K; one line per C in {2, 4, 6, 8, 10}; y-axis: CV error): Log-likelihood based cross-validation error under different combinations of K and C.
Figure 6 (x-axis: K; y-axis: CV error): Smallest log-likelihood based cross-validation error when K = 5, 10, 15, 20, and 25.
well-known personality facets. The items that load heavily on latent traits 1 and 13-15 seem to be heterogeneous, and these latent traits are worth further investigation.
6 Discussion
In this paper, we propose a constrained joint maximum likelihood estimator for analyzing large-scale item response data, which allows for the presence of a high percentage of missing responses. It differs from the traditional JMLE by adding constraints on the Euclidean norms of both the item and the respondent parameters. An efficient parallel computing algorithm is proposed that scales, with good timing even on a single machine, to large data sets with tens of thousands of respondents, thousands of items, and more than ten latent traits. The CJMLE also has good statistical properties. In particular, we provide error bounds on the parameter estimates and show that the linear space spanned by the column vectors of the factor loading matrix can be consistently recovered, under mild regularity conditions and the high-dimensional setting in which both the numbers of items and respondents grow to infinity. This result implies that the true loading structure can be learned by applying proper rotational methods to the estimated loading matrix, when the true loading matrix has a simple structure (i.e., each latent trait is measured by only a subset of the items). These theoretical
Trait 1 (TSS = 348.5):
(−) Am not interested in other people's problems. (−) Can't be bothered with others' needs. (−) Don't understand people who get emotional. (−) Am not really interested in others. (−) Pay too little attention to details. (−) Don't put my mind on the task at hand. (−) Am seldom bothered by the apparent suffering of strangers. (−) Am not easily amused. (−) Shirk my duties. (−) Rarely notice my emotional reactions.

Trait 2 (TSS = 279.4):
(+) Feel others' emotions. (+) Am sensitive to the needs of others. (+) Feel sympathy for those who are worse off than myself. (+) Sympathize with others' feelings. (+) Have a soft heart. (+) Inquire about others' well-being. (+) Sympathize with the homeless. (+) Am concerned about others. (+) Know how to comfort others. (+) Suffer from others' sorrows.

Trait 3 (TSS = 231.1):
(−) Keep in the background. (−) Am mostly quiet when with other people. (−) Prefer to be alone. (−) Dislike being the center of attention. (−) Want to be left alone. (−) Seek quiet. (−) Keep others at a distance. (−) Tend to keep in the background on social occasions. (−) Hate being the center of attention. (−) Am afraid to draw attention to myself.

Trait 4 (TSS = 217.2):
(+) Have once wished that I were dead. (+) Often feel lonely. (+) Am often down in the dumps. (+) Find life difficult. (+) Feel a sense of worthlessness or hopelessness. (+) Dislike myself. (−) Am happy with my life. (−) Seldom feel blue. (−) Rarely feel depressed. (+) Suffer from sleeplessness.

Trait 5 (TSS = 215.4):
(+) Get to work at once. (+) Complete my duties as soon as possible. (+) Get chores done right away. (−) Find it difficult to get down to work. (−) Have difficulty starting tasks. (+) Keep things tidy. (+) Get started quickly on doing a job. (+) Start tasks right away. (−) Leave a mess in my room. (−) Need a push to get started.

Trait 6 (TSS = 206.1):
(+) Think quickly. (+) Can handle complex problems. (+) Nearly always have a "ready answer" when talking to people. (+) Can handle a lot of information. (+) Am quick to understand things. (+) Catch on to things quickly. (+) Formulate ideas clearly. (+) Know immediately what to do. (+) Come up with good solutions. (+) Am hard to convince.

Trait 7 (TSS = 198.3):
(−) Rarely show my anger. (+) Get irritated easily. (+) Lose my temper. (+) Get angry easily. (+) Can be stirred up easily. (+) When my temper rises, I find it difficult to control. (+) Get easily agitated. (−) It takes a lot to make me feel angry at someone. (−) Rarely get irritated. (−) Seldom get mad.

Trait 8 (TSS = 178.9):
(−) Would never go hang gliding or bungee jumping. (+) Like to do frightening things. (+) Love dangerous situations. (+) Seek adventure. (+) Am willing to try anything once. (+) Do crazy things. (+) Take risks that could cause trouble for me. (+) Take risks. (−) Never go down rapids in a canoe. (+) Act wild and crazy.

Trait 9 (TSS = 175.4):
(+) Need a creative outlet. (−) Don't pride myself on being original. (−) Do not have a good imagination. (+) Enjoy thinking about things. (−) Do not like art. (+) Have a vivid imagination. (−) See myself as an average person. (−) Consider myself an average person. (−) Am just an ordinary person. (+) Believe in the importance of art.

Table 1: Results from analyzing the SAPA data (Part I): Top 10 items with the highest absolute value of loadings on each latent trait.
Trait 10 (TSS = 147.5):
(+) Want everything to add up perfectly. (+) Dislike imperfect work. (+) Want everything to be "just right." (+) Demand quality. (+) Have an eye for detail. (+) Want every detail taken care of. (+) Avoid mistakes. (+) Being in debt is worrisome to me. (+) Am exacting in my work. (+) Dislike people who don't know how to behave themselves.

Trait 11 (TSS = 135.4):
(+) Believe in one true religion. (−) Don't consider myself religious. (−) Believe that there is no absolute right and wrong. (−) Tend to vote for liberal political candidates. (−) Think marriage is old-fashioned and should be done away with. (+) Tend to vote for conservative political candidates. (+) Believe one has special duties to one's family. (+) Like to stand during the national anthem. (+) Believe that we should be tough on crime. (+) Believe laws should be strictly enforced.

Trait 12 (TSS = 134.0):
(−) Use flattery to get ahead. (−) Tell people what they want to hear so they do what I want. (+) Would never take things that aren't mine. (+) Don't pretend to be more than I am. (−) Tell a lot of lies. (−) Play a role in order to impress people. (−) Switch my loyalties when I feel like it. (+) Return extra change when a cashier makes a mistake. (−) Use others for my own ends. (−) Not regret taking advantage of someone impulsively.

Trait 13 (TSS = 120.3):
(+) Like to take my time. (+) Like a leisurely lifestyle. (+) Would never make a high-risk investment. (+) Let things proceed at their own pace. (+) Always know why I do things. (+) Always admit it when I make a mistake. (−) Have read the great literary classics. (+) Am more easy-going about right and wrong than most people. (+) Value cooperation over competition. (+) Don't know much about history.

Trait 14 (TSS = 117.5):
(−) Am interested in science. (−) Trust what people say. (−) Find political discussions interesting. (+) Don't [worry about] political and social problems. (−) Would not enjoy being a famous celebrity. (+) Believe that we coddle criminals too much. (−) Enjoy intellectual games. (−) Believe that people are basically moral. (−) Like to solve complex problems. (−) Trust people to mainly tell the truth.

Trait 15 (TSS = 111.7):
(−) Dislike loud music. (+) Like telling jokes and funny stories to my friends. (−) Seldom joke around. (−) Prefer to eat at expensive restaurants. (+) Laugh aloud. (−) Most things taste the same to me. (−) People spend too much time safeguarding their future with savings and insurance. (+) Amuse my friends. (−) Love my enemies. (−) Am not good at deceiving other people.

Table 2: Results from analyzing the SAPA data (Part II): Top 10 items with the highest absolute value of loadings on each latent trait.
developments are consistent with the results from the simulation studies. Our simulation results also imply that the high-dimensional setting, in which both the numbers of respondents and items grow to infinity, is important: when either the sample size or the number of items is small, the CJMLE may not work well. The proposed method is applied to analyzing large-scale data from the Synthetic Aperture Personality Assessment project. It is found that a model with 15 latent traits has the best prediction power based on the cross-validation results, and the majority of the traits seem to be homogeneous and to correspond to well-known personality facets.

The proposed method may be extended along several directions. First, the current algorithm and theory focus on binary responses. It is worth extending them to multidimensional IRT models for ordinal response data, such as the generalized partial credit model (Yao and Schwarz, 2006), which extends the multidimensional two-parameter logistic model to ordinal responses. Second, even after applying rotational methods, the obtained factor loading matrix may still be dense and difficult to interpret. To better pursue a simple loading structure, it may be helpful to further impose $L_1$ regularization on the factor loading parameters, as introduced in Sun et al. (2016). Third, it is also important to incorporate respondent-specific covariates in the analysis, so that the relationship between baseline covariates and psychological latent traits can be investigated. Finally, the current theoretical framework assumes the number of latent traits to be known. Theoretical results on estimating the number of latent traits based on the CJMLE are also of interest, as they may provide guidance on choosing the number of latent traits.
7 Appendix
7.1 Proof of Theorem 1
Proof. The proof of Theorem 1 is similar to that of Theorem 1 in Davenport et al. (2014). Thus, we only state the main steps and omit the details. Let
$$\bar l^J(M) = \ell^J_\Omega(M) - \ell^J_\Omega(\mathbf 0), \quad (12)$$
where $\mathbf 0$ is the $N\times J$ matrix whose entries are all zero. Then we have the following lemma from Davenport et al. (2014).
Lemma 1 (Lemma A.1 of Davenport et al. (2014)). There exist constants $C_0$ and $C_1$ such that for all $\alpha$, $r$, $N$, $J$, and $n$,
$$P\bigg(\sup_{\|M\|_* \le \alpha\sqrt{rNJ}}\big|\bar l^J(M) - E\bar l^J(M)\big| \ge C_0\alpha L_\gamma\sqrt r\Big(\sqrt{n(N+J)} + \sqrt{NJ\log(N+J)}\Big)\bigg) \le \frac{C_1}{N+J}, \quad (13)$$
where $L_\gamma = 1$ under the settings of the current paper and $\|\cdot\|_*$ denotes the nuclear norm of a matrix.
Letting $\alpha = C$ and $r = K$ in the above lemma, we have
$$P\bigg(\sup_{\|M\|_* \le C\sqrt{KNJ}}\big|\bar l^J(M) - E\bar l^J(M)\big| \ge C_0C\sqrt K\Big(\sqrt{n(N+J)} + \sqrt{NJ\log(N+J)}\Big)\bigg) \le \frac{C_1}{N+J}. \quad (14)$$
Define
$$\mathcal H = \Big\{M = (m_{ij})_{1\le i\le N,\,1\le j\le J}\ :\ m_{ij} = a_j^\top\theta_i,\ \|\theta_i\|^2 \le C \text{ and } \|a_j\|^2 \le C \text{ for all } i, j\Big\}. \quad (15)$$
Note that if $M \in \mathcal H$, then $\|M\|_\infty = \max_{i,j}|m_{ij}| \le C$ by the Cauchy-Schwarz inequality, so
$$\|M\|_* \le \sqrt{NJ\,\mathrm{rank}(M)}\,\|M\|_\infty \le C\sqrt{KNJ}. \quad (16)$$
Thus, from (14), we further have
$$P\bigg(\sup_{M\in\mathcal H}\big|\bar l^J(M) - E\bar l^J(M)\big| \ge C_0C\sqrt K\Big(\sqrt{n(N+J)} + \sqrt{NJ\log(N+J)}\Big)\bigg) \quad (17)$$
$$\le P\bigg(\sup_{\|M\|_* \le C\sqrt{KNJ}}\big|\bar l^J(M) - E\bar l^J(M)\big| \ge C_0C\sqrt K\Big(\sqrt{n(N+J)} + \sqrt{NJ\log(N+J)}\Big)\bigg) \quad (18)$$
$$\le \frac{C_1}{N+J}. \quad (19)$$
We use the following result, which is a slight modification of the last equation on p. 210 of Davenport et al. (2014):
$$n\,D\big(M^*\big\|\hat M\big) \le 2\sup_{M\in\mathcal H}\big|\bar l^J(M) - E\bar l^J(M)\big|, \quad (20)$$
where $\hat M = \hat\Theta\hat A^\top$ and $D(M_1\|M_2)$ denotes the Kullback-Leibler divergence between the joint distributions of $\{Y_{ij} : 1\le i\le N, 1\le j\le J\}$ under the model parameters $M_1$ and $M_2$, averaged over the $NJ$ entries. In addition, we have the following inequality, which is a direct application of Lemma A.2 in Davenport et al. (2014) and the third equation on page 211 of Davenport et al. (2014): for any $M_1, M_2$ such that $\|M_1\|_\infty, \|M_2\|_\infty \le C$,
$$\|M_1 - M_2\|_F^2 \le 8\beta_C NJ\,D(M_1\|M_2), \quad (21)$$
with $\beta_C = (1+e^C)^2/e^C$ under our settings. Combining (19), (20), and (21), we see that with probability $1 - \frac{C_1}{N+J}$,
$$\|\hat M - M^*\|_F^2 \le \frac{16\beta_C NJ}{n}\times C_0C\sqrt K\Big(\sqrt{n(N+J)} + \sqrt{NJ\log(N+J)}\Big). \quad (22)$$
Note that for $C \ge 0$, $\beta_C \le 4e^C$. Thus, we can rearrange terms in the above inequality and simplify it as
$$\frac{1}{NJ}\|\hat M - M^*\|_F^2 \le 64C_0Ce^C\sqrt K\sqrt{\frac{J+N}{n}}\times\bigg(1 + \sqrt{\frac{NJ\log(N+J)}{n(N+J)}}\bigg). \quad (23)$$
For $n \ge (N+J)\log(NJ)$, we further have
$$\frac{NJ\log(N+J)}{n(N+J)} \le \frac{NJ}{(N+J)^2} \le \frac14. \quad (24)$$
Combining the above inequality with (23) and noting that $K$ is assumed to be fixed, we complete the proof.
7.2 Proof of Theorem 2
Proof. Denote $M^* = \Theta^*(A^*)^\top$ and $\hat M = \hat\Theta\hat A^\top$. Let $\sigma_1^* \ge \sigma_2^* \ge \cdots \ge \sigma_{\min\{N,J\}}^* \ge 0$ be the singular values of $M^*$ and $v_1^*, \dots, v_{\min\{N,J\}}^*$ be the corresponding right singular vectors. Similarly, let $\hat\sigma_1 \ge \hat\sigma_2 \ge \cdots \ge \hat\sigma_{\min\{N,J\}} \ge 0$ be the singular values of $\hat M$ and $\hat v_1, \dots, \hat v_{\min\{N,J\}}$ be the corresponding right singular vectors. Due to the form of $M^*$ and $\hat M$, only their first $K$ singular values can be nonzero.

We first show that $\mathcal V = \mathrm{span}\{v_1^*, \dots, v_K^*\}$. Let $M^* = U^*\Sigma^*(V^*)^\top$ be the singular value decomposition of $M^*$, where $V^* = (v_1^*, \dots, v_K^*)$, $\Sigma^*_{K\times K} = \mathrm{diag}(\sigma_1^*, \sigma_2^*, \dots, \sigma_K^*)$, and $U^*_{N\times K}$ is the left singular matrix. Then
$$\Theta^*(A^*)^\top = U^*\Sigma^*(V^*)^\top. \quad (25)$$
According to assumption A3, for large enough $N$ and $J$, $\sigma_K^* > 0$. Thus, $\Theta^*$ is of full column rank, implying that the $K\times K$ matrix $(\Theta^*)^\top\Theta^*$ is of full rank. Thus, $(A^*)^\top = ((\Theta^*)^\top\Theta^*)^{-1}(\Theta^*)^\top U^*\Sigma^*(V^*)^\top$, or equivalently, $A^* = V^*\big(((\Theta^*)^\top\Theta^*)^{-1}(\Theta^*)^\top U^*\Sigma^*\big)^\top$. This implies that $\mathcal V \subset \mathrm{span}\{v_1^*, \dots, v_K^*\}$. Similarly, we have $V^* = A^*\big(\Sigma^{*-1}(U^*)^\top\Theta^*\big)^\top$, implying that $\mathrm{span}\{v_1^*, \dots, v_K^*\} \subset \mathcal V$. Thus, $\mathrm{span}\{v_1^*, \dots, v_K^*\} = \mathcal V$.

We then focus on the event $E_1$:
$$\big\|\hat\Theta\hat A^\top - \Theta^*(A^*)^\top\big\|_F \le \sqrt{NJ}\,\big(C_2Ce^C\big)^{\frac12}\Big(\frac{J+N}{n}\Big)^{\frac14},$$
which holds with probability at least $1 - C_1/(N+J)$ according to Theorem 1. Under this event and by Weyl's perturbation theorem (see, e.g. Stewart and Sun, 1990), we have
$$|\sigma_K^* - \hat\sigma_K| \le \big\|\hat\Theta\hat A^\top - \Theta^*(A^*)^\top\big\|_2 \le \big\|\hat\Theta\hat A^\top - \Theta^*(A^*)^\top\big\|_F,$$
where $\|\cdot\|_2$ denotes the spectral norm of a matrix and the second inequality is due to the relationship between the matrix spectral norm and the matrix Frobenius norm. Thus, when the event $E_1$ happens and for sufficiently large $N$ and $J$,
$$\hat\sigma_K \ge \sigma_K^* - \big\|\hat\Theta\hat A^\top - \Theta^*(A^*)^\top\big\|_F \ge \sqrt{NJ}\bigg(C_3 - \big(C_2Ce^C\big)^{\frac12}\Big(\frac{J+N}{n}\Big)^{\frac14}\bigg) > 0, \quad (26)$$
because $\|\hat\Theta\hat A^\top - \Theta^*(A^*)^\top\|_F$ is of order $o(\sqrt{NJ})$ when $n \ge (N+J)\log(JN)$ and $N$ and $J$ grow to infinity. Then, following the same argument as above, we have
$$\mathrm{span}\{\hat v_1, \dots, \hat v_K\} = \hat{\mathcal V}.$$
Following the modified Davis-Kahan-Wedin sine theorem (Theorem 20) in O'Rourke et al. (2013), we have
$$\sin\angle(\mathcal V, \hat{\mathcal V}) \le \frac{2\big\|\hat\Theta\hat A^\top - \Theta^*(A^*)^\top\big\|_2}{\sigma_K^*}.$$
Because of the relationship between the matrix spectral norm and the Frobenius norm and assumption A3, we have
$$\sin\angle(\mathcal V, \hat{\mathcal V}) \le \frac{2\big\|\hat\Theta\hat A^\top - \Theta^*(A^*)^\top\big\|_2}{\sigma_K^*} \le \frac{2\big\|\hat\Theta\hat A^\top - \Theta^*(A^*)^\top\big\|_F}{\sigma_K^*} \le \frac{2\big\|\hat\Theta\hat A^\top - \Theta^*(A^*)^\top\big\|_F}{C_3\sqrt{NJ}}.$$
Under the event $E_1$, we have
$$\sin\angle(\mathcal V, \hat{\mathcal V}) \le \frac{2}{C_3}\big(C_2Ce^C\big)^{\frac12}\Big(\frac{J+N}{n}\Big)^{\frac14}.$$
7.3 Proofs of the Propositions
Proof of Proposition 1. To show that $(\hat AQ, \hat\Theta Q)$ is also a minimizer of (5), one only needs to show that

(a) $(\hat\Theta Q)(\hat AQ)^\top = \hat\Theta\hat A^\top$;

(b) $\|\hat a_j^\top Q\| = \|\hat a_j\|$ for all $j$;

(c) $\|\hat\theta_i^\top Q\| = \|\hat\theta_i\|$ for all $i$.

(a) is true because $(\hat\Theta Q)(\hat AQ)^\top = \hat\Theta QQ^\top\hat A^\top = \hat\Theta(QQ^\top)\hat A^\top = \hat\Theta\hat A^\top$, where the second equality holds because $Q$ is an orthogonal matrix. (b) is true because multiplying by an orthogonal matrix does not change the Euclidean norm of a vector. (c) is true for the same reason.
Proof of Proposition 2. We consider the situation where $\|\hat\theta_i\|^2 < C$ for all $i$; the case where $\|\hat a_j\|^2 < C$ for all $j$ is handled similarly. Let $R$ be an invertible $K\times K$ non-identity matrix with singular values $d_K \ge d_{K-1} \ge \cdots \ge d_1$, satisfying
$$\min_k d_k \ge 1 \quad\text{and}\quad \max_k d_k^2 \le \frac{C}{\max_{i\in\{1,\dots,N\}}\|\hat\theta_i\|^2};$$
such an $R$ exists because $\|\hat\theta_i\|^2 < C$ for all $i$. Then

(a) $(\hat\Theta R)(\hat A(R^{-1})^\top)^\top = \hat\Theta\hat A^\top$;

(b) $\|\hat\theta_i^\top R\|^2 \le C$;

(c) $\|\hat a_j^\top(R^{-1})^\top\|^2 \le C$.

(a) is trivial. (b) is due to
$$\|\hat\theta_i^\top R\|^2 \le \|R\|_2^2\,\|\hat\theta_i\|^2 \le \|\hat\theta_i\|^2\,\frac{C}{\max_{i\in\{1,\dots,N\}}\|\hat\theta_i\|^2} \le C,$$
where $\|\cdot\|_2$ denotes the matrix spectral norm, so that $\|R\|_2^2 = \max_k d_k^2$. The proof of (c) is similar to that of (b), using $\|R^{-1}\|_2 = 1/\min_k d_k \le 1$.
Proof of Proposition 3. First note that $(\sigma_K^*)^2$ is the $K$th largest eigenvalue of $A^*(\Theta^*)^\top\Theta^*(A^*)^\top$, and that $A^*(\Theta^*)^\top\Theta^*(A^*)^\top$ and $(A^*)^\top A^*(\Theta^*)^\top\Theta^*$ share the same nonzero eigenvalues. Therefore, we only need to show that, with probability tending to 1, the minimum eigenvalue of $(A^*)^\top A^*(\Theta^*)^\top\Theta^*$ satisfies
$$\lambda_{\min}\big((A^*)^\top A^*(\Theta^*)^\top\Theta^*\big) \ge C_3^2\,NJ \quad (27)$$
for some positive constant $C_3$.

Since
$$\lambda_{\min}\big((A^*)^\top A^*(\Theta^*)^\top\Theta^*\big) \ge \lambda_{\min}\big((A^*)^\top A^*\big)\times\lambda_{\min}\big((\Theta^*)^\top\Theta^*\big),$$
we bound the two minimum eigenvalues on the right-hand side separately. By the law of large numbers, $(\Theta^*)^\top\Theta^*/(N\delta_1^2)$ converges in probability to the $K\times K$ identity matrix. Therefore, by applying Weyl's theorem again, with probability tending to 1 as $N$ grows to infinity, we have
$$\lambda_{\min}\Big(\frac{(\Theta^*)^\top\Theta^*}{N\delta_1^2}\Big) \ge \frac12.$$
Similarly, we can show that with probability tending to 1 as $J$ grows to infinity,
$$\lambda_{\min}\Big(\frac{(A^*)^\top A^*}{J\delta_2^2}\Big) \ge \frac12.$$
Combining the above two bounds,
$$\lambda_{\min}\big((A^*)^\top A^*\big)\times\lambda_{\min}\big((\Theta^*)^\top\Theta^*\big) \ge \frac{\delta_1^2\delta_2^2}{4}\,NJ,$$
with probability tending to 1 as $N$ and $J$ grow to infinity. Thus, (27) is satisfied by choosing $C_3 = \delta_1\delta_2/2$.
References
Andersen, E. B. (1973). Conditional inference and models for measuring. Mentalhygiejnisk
Forlag, Copenhagen, Denmark.
Anderson, T. W. (2003). An introduction to multivariate statistical analysis. Wiley, Hoboken,
NJ.
Béguin, A. A. and Glas, C. A. (2001). MCMC estimation and some model-fit analysis of multidimensional IRT models. Psychometrika, 66(4):541–561.
Bolt, D. M. and Lall, V. F. (2003). Estimation of compensatory and noncompensatory multi-
dimensional item response models using Markov chain Monte Carlo. Applied Psychological
Measurement, 27(6):395–414.
Browne, M. W. (2001). An overview of analytic rotation in exploratory factor analysis.
Multivariate Behavioral Research, 36(1):111–150.
Cai, L. (2010a). High-dimensional exploratory item factor analysis by a Metropolis–Hastings
Robbins–Monro algorithm. Psychometrika, 75(1):33–57.
Cai, L. (2010b). Metropolis-Hastings Robbins-Monro algorithm for confirmatory item factor
analysis. Journal of Educational and Behavioral Statistics, 35(3):307–335.
Candès, E. and Recht, B. (2012). Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111–119.

Candès, E. J. and Plan, Y. (2010). Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936.

Candès, E. J. and Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080.
Chiu, C.-Y., Köhn, H.-F., Zheng, Y., and Henson, R. (2016). Joint maximum likelihood estimation for diagnostic classification models. Psychometrika, 81(4):1069–1092.
Condon, D. and Revelle, W. (2015). Selected personality data from the SAPA-Project: On
the structure of phrased self-report items. Journal of Open Psychology Data, 3(1).
Dagum, L. and Menon, R. (1998). OpenMP: an industry standard API for shared-memory
programming. Computational Science & Engineering, IEEE, 5(1):46–55.
Davenport, M. A., Plan, Y., van den Berg, E., and Wootters, M. (2014). 1-bit matrix
completion. Information and Inference, 3(3):189–223.
Edwards, M. C. (2010). A Markov chain Monte Carlo approach to confirmatory item factor
analysis. Psychometrika, 75(3):474–497.
Embretson, S. E. and Reise, S. P. (2000). Item response theory for psychologists. Lawrence
Erlbaum Associates Publishers, Mahwah, NJ.
Eysenck, S. B., Eysenck, H. J., and Barrett, P. (1985). A revised version of the psychoticism
scale. Personality and Individual Differences, 6(1):21–29.
Ghosh, M. (1995). Inconsistent maximum likelihood estimators for the Rasch model. Statis-
tics & Probability Letters, 23(2):165–170.
Goldberg, L. R. (1999). A broad-bandwidth, public domain, personality inventory measuring
the lower-level facets of several five-factor models.
Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R.,
and Gough, H. G. (2006). The international personality item pool and the future of
public-domain personality measures. Journal of Research in Personality, 40(1):84–96.
Haberman, S. J. (1977). Maximum likelihood estimates in exponential response models. The
Annals of Statistics, 5(5):815–841.
Haberman, S. J. (2004). Joint and conditional maximum likelihood estimation for the Rasch
model for binary responses. ETS Research Report Series RR-04-20.
Hendrickson, A. E. and White, P. O. (1964). Promax: A quick method for rotation to oblique
simple structure. British Journal of Mathematical and Statistical Psychology, 17(1):65–70.
Lord, F. M., Novick, M. R., and Birnbaum, A. (1968). Statistical theories of mental test
scores. Addison-Wesley, Oxford, England.
McDonald, R. P. (1967). Nonlinear factor analysis (Psychometric Monographs, No.15).
Psychometric Corporation, Richmond, VA.
Meng, X.-L. and Schilling, S. (1996). Fitting full-information item factor models and an
empirical investigation of bridge sampling. Journal of the American Statistical Association,
91(435):1254–1267.
Neyman, J. and Scott, E. L. (1948). Consistent estimates based on partially consistent
observations. Econometrica, 16(1):1–32.
O’Rourke, S., Vu, V., and Wang, K. (2013). Random perturbation of low rank matrices:
Improving classical bounds. arXiv preprint arXiv:1311.2657.
Parikh, N., Boyd, S., et al. (2014). Proximal algorithms. Foundations and Trends® in Optimization, 1(3):127–239.
Rasch, G. (1960). Probabilistic models for some intelligence and achievement tests. Nielsen
and Lydiche, Copenhagen, Denmark.
Reckase, M. (2009). Multidimensional item response theory. Springer, New York, NY.
Reckase, M. D. (1972). Development and application of a multivariate logistic latent trait
model. PhD thesis, Syracuse University, Syracuse NY.
Reckase, M. D. and McKinley, R. L. (1983). Some latent trait theory in a multidimensional
latent space. In Weiss, D. J., editor, Proceedings of the 1982 Item Response Theory
and Computerized Adaptive Testing Conference, pages 151–188. University of Minnesota,
Department of Psychology, Minneapolis MN.
Rupp, A. A., Templin, J., and Henson, R. A. (2010). Diagnostic measurement: Theory,
methods, and applications. Guilford Press, New York, NY.
Schilling, S. and Bock, R. D. (2005). High-dimensional maximum marginal likelihood item
factor analysis by adaptive quadrature. Psychometrika, 70(3):533–555.
Stewart, G. and Sun, J. (1990). Matrix Perturbation Theory. Academic Press, Cambridge,
MA.
Sun, J., Chen, Y., Liu, J., Ying, Z., and Xin, T. (2016). Latent variable selection for multidimensional item response theory models via L1 regularization. Psychometrika, 81(4):921–939.
Udell, M., Horn, C., Zadeh, R., Boyd, S., et al. (2016). Generalized low rank models. Foundations and Trends® in Machine Learning, 9(1):1–118.
Yao, L. and Schwarz, R. D. (2006). A multidimensional partial credit model with associated
item and test statistics: An application to mixed-format tests. Applied Psychological
Measurement, 30(6):469–492.