IEEE SIGNAL PROCESSING LETTERS, VOL. 18, NO. 9, SEPTEMBER 2011
Low Rank Language Models for Small Training Sets
Brian Hutchinson, Student Member, IEEE, Mari Ostendorf, Fellow, IEEE, and Maryam Fazel, Member, IEEE
Abstract—Several language model smoothing techniques are available that are effective for a variety of tasks; however, training with small data sets is still difficult. This letter introduces the low rank language model, which uses a low rank tensor representation of joint probability distributions for parameter-tying and optimizes likelihood under a rank constraint. It obtains lower perplexity than standard smoothing techniques when the training set is small and also leads to perplexity reduction when used in domain adaptation via interpolation with a general, out-of-domain model.
Index Terms—Language model, low rank tensor.
I. INTRODUCTION

LANGUAGE model smoothing has been well studied, and it is widely known that performance improves substantially when training on large data sets. Empirical studies show that the modified Kneser–Ney method works well over a range of training set sizes on a variety of sources [1], although other methods are more effective when pruning is used in training large language models [2]. While large training sets are valuable, there are situations where they are not available, including system prototyping for a new domain or training language models that specialize for communicative goals or roles.

This letter addresses the problem of language model training from sparse data sources by casting the smoothing problem as low rank tensor estimation. By permitting precise control over model complexity, our low rank language models are able to fit the small in-domain data with better generalization.
II. RANK IN LANGUAGE MODELING
Every n-gram language model implicitly defines an nth-order joint probability tensor 𝒫 with entries

    𝒫_{i_1, ..., i_n} = P(w_1 = v_{i_1}, ..., w_n = v_{i_n})    (1)

An unsmoothed maximum likelihood-estimated language model can be viewed as an entrywise sparse tensor. The obvious problem with parameterizing a language model directly by the entries of 𝒫 is that, under nearly all conditions, there are too many degrees of freedom for reliable estimation. Hence, a substantial amount of research has gone into the smoothing of n-gram language models.

Fig. 1. Singular values of the conditional probability matrix for unsmoothed (solid line), modified-KN smoothed (dashed), and add-λ smoothed models. Trained on 150 K words of broadcast news text, with a 5 K word vocabulary.

Manuscript received April 14, 2011; revised June 09, 2011; accepted June 14, 2011. Date of publication June 27, 2011; date of current version July 07, 2011. This work was supported in part by NSF CAREER Grant ECCS-0847077 and by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA). All statements of fact, opinion or conclusions contained herein are those of the authors and should not be construed as representing the official views or policies of the NSF, IARPA, the ODNI or the U.S. Government. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Constantine L. The authors are with the Department of Electrical Engineering, University of Washington, Seattle, WA 98195 USA. Digital Object Identifier 10.1109/LSP.2011.2160850
One can compare different smoothing methods by their effects on the properties of 𝒫. In particular, even highly distinct approaches to smoothing have the effect of reducing, either exactly or approximately, the rank of the tensor 𝒫. Reducing the rank implies a reduction in model complexity, yielding a model that is easier to estimate from the finite amount of training data. In the matrix (bigram) case, reducing the rank of the joint probability matrix is equivalent to pushing the distributions over a vocabulary into a low-dimensional subspace. More generally, a low rank tensor implies that the set of distributions are largely governed by a common set of k underlying factors. Although the factors need not be interpretable, and certainly not predefined, one might envision that a set of syntactic (e.g., part-of-speech), style and/or semantic factors are at work. Fig. 1 illustrates this rank-reducing phenomenon of smoothing in a conditional probability matrix: both modified Kneser–Ney and add-λ smoothing shrink the mass of the singular values over the unsmoothed estimate. This effect is most pronounced on small training sets, which require the most smoothing.

The number of factors (the rank k) of a tensor thus provides a mechanism to control model complexity. The benefits of controlling model complexity are well known: a model that is too expressive can overfit the training data, while a model that is too simple may not be able to capture the inherent structure. By reducing the mass of singular values, existing smoothing methods effectively reduce the complexity of the model. Although they do so in a meaningful and interpretable way, it is unlikely that any fixed approach to smoothing will be optimal, in the sense that it may return a model whose complexity is somewhat more or somewhat less than ideal for the given training data. In this letter we test the hypothesis that it is the rank-reducing behavior that is important in the generalization of smoothed language models, by optimizing likelihood under a rank constraint directly matched to this objective.

1070-9908/$26.00 © 2011 IEEE
III. LOW RANK TENSORS

Let k denote the tensor rank, V the size of the vocabulary, n the order of the n-gram, and T the number of n-gram tokens in the training set. Let R^V denote the set of vectors of length V, and R^{V×V} the set of V×V matrices. Tensors are written in script font (e.g., 𝒜).
A. Tensor Rank

The rank of a tensor 𝒜 is defined to be the smallest k for which there exist vectors a_z, b_z, c_z such that

    𝒜 = Σ_{z=1}^{k} a_z ∘ b_z ∘ c_z    (2)

where ∘ denotes the tensor product, a generalization of the outer product. Like the singular value decomposition for matrices, (2) decomposes a tensor into the sum of rank-1 components.
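As a sanity check on the definition in (2), the following sketch (toy sizes and random values, not from the letter) builds an order-3 tensor from k rank-1 components and verifies that its mode-1 unfolding has rank at most k:

```python
import numpy as np

rng = np.random.default_rng(0)
V, k = 5, 2  # toy vocabulary size and rank (illustrative values)

# Build an order-3 tensor as a sum of k rank-1 components, as in (2).
a = rng.random((k, V))
b = rng.random((k, V))
c = rng.random((k, V))
T = sum(np.einsum('i,j,l->ijl', a[z], b[z], c[z]) for z in range(k))

# Unfolding the tensor along its first mode gives a matrix whose rank
# is at most k, one easy-to-check consequence of the factored form.
unfolded = T.reshape(V, V * V)
print(np.linalg.matrix_rank(unfolded))
```

Each row of the unfolding lies in the span of the k vectors b_z ⊗ c_z, which is why the matrix rank cannot exceed k.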
B. Tensor Rank Minimization

There are two dominant approaches to estimating low rank tensors. One approach solves an optimization problem that penalizes the tensor rank, which encourages but does not impose a low rank solution:

    min_{𝒳 ∈ C} f(𝒳) + λ · rank(𝒳)    (3)

(Here C denotes the feasible set.) When the tensor is order-3 or higher, not only is this problem NP-hard [3], but there are no tractable convex relaxations of the notion of rank in (2). Recently, researchers [4], [5] have proposed a nuclear norm relaxation of a different concept of rank, built upon the tensor n-rank [6]. Under their relaxation, assuming f is convex, (3) is convex, but the approach requires memory that is prohibitive for reasonable size vocabularies.

Instead, one can impose a hard rank constraint:

    min_{𝒳 ∈ C} f(𝒳)  subject to  rank(𝒳) ≤ k    (4)

In this nonconvex problem, k is a predetermined hard limit; in practice, the problem is solved for several values of k and the best result is used. This approach allows one to reduce the space requirements by explicitly encoding the parameters in the low-rank factored form of (2), which makes scaling to real-world datasets practical.
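The space savings from storing the factored form of (2) rather than the dense tensor can be made concrete. The vocabulary size, order, and rank below are illustrative values in the range the letter discusses (5 K vocabulary, trigrams, rank up to 300), not figures reported by the authors:

```python
# Parameter counts for a dense order-n joint tensor versus its
# rank-k factored form (k component weights plus n factor matrices
# of size k x V). Values are illustrative, not from the paper.
V, n, k = 5000, 3, 300

dense = V ** n             # entries of the full joint tensor
factored = k + n * k * V   # factored representation of (2)

print(dense, factored)
assert factored < dense
```

For these sizes the dense tensor has 1.25e11 entries while the factored form needs about 4.5 million parameters, which is what makes the explicit factored encoding practical.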
IV. LOW RANK LANGUAGE MODELS

Our low rank language models (LRLMs) represent joint probabilities in a factored tensor form:

    𝒫 = Σ_{z=1}^{k} λ_z a_z^{(1)} ∘ a_z^{(2)} ∘ ... ∘ a_z^{(n)}    (5)

The model is parametrized by the non-negative component weights λ_z and the factor matrices A^{(1)}, ..., A^{(n)}, whose zth rows are the vectors a_z^{(j)}. Because 𝒫 denotes a joint probability distribution, we must ensure that it is entry-wise non-negative and sums to one. We will see later that requiring our parameters to be non-negative provides substantial benefits for interpretability and leads to an efficient training algorithm; the constraint is then on the non-negative tensor rank, which is never less than the tensor rank.

Because all of the parameters in (5) are non-negative, we can constrain the rows of each factor matrix A^{(j)} to sum to one. It is then sufficient for the weights λ_z to sum to one for 𝒫 to sum to one. Under these constraints, the rows of the factor matrices can be interpreted as position-dependent unigram models over our vocabulary, and the elements of λ as priors on each component:

    P(w_1 = v_{i_1}, ..., w_n = v_{i_n}) = Σ_{z=1}^{k} λ_z Π_{j=1}^{n} A^{(j)}_{z, i_j}    (6)

Note that when k = 1, the model degenerates to a unigram model. On the other extreme, when 𝒫 is sufficiently high rank, it can represent any possible joint probability distribution. Choosing k between these extremes permits us to carefully control model complexity, so that it can be matched to the amount of training data available.

We construct the probability of a word sequence using the n-gram Markov assumption:

    P(w_1, ..., w_m) = Π_{t=1}^{m} P(w_t | w_{t-n+1}, ..., w_{t-1})    (7)

where the tokens preceding the start of the sentence are a designated sentence start token. Note that in (7), unlike traditional language mixture models, the conditional P(w_t | w_{t-n+1}, ..., w_{t-1}) does not take the form of a sum of conditional distributions. By learning joint probabilities directly, we can capture higher-order multilinear behavior.
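A minimal sketch of (5)–(7) for a bigram model, with all sizes and values illustrative: the joint probabilities come from component priors and position-dependent unigram rows, and the conditional used in the chain rule follows by normalizing a row of the joint:

```python
import numpy as np

rng = np.random.default_rng(1)
V, k = 4, 3  # toy vocabulary size and rank

# Component priors (lambda) and normalized factor rows, as in (6):
# each row of A1, A2 is a position-dependent unigram distribution.
lam = rng.random(k); lam /= lam.sum()
A1 = rng.random((k, V)); A1 /= A1.sum(axis=1, keepdims=True)
A2 = rng.random((k, V)); A2 /= A2.sum(axis=1, keepdims=True)

# Joint bigram probabilities, (6): P[i, j] = sum_z lam_z A1[z, i] A2[z, j]
P = np.einsum('z,zi,zj->ij', lam, A1, A2)
assert np.isclose(P.sum(), 1.0)  # a valid joint distribution

# Conditional used in the Markov chain rule (7):
# P(w2 = v_j | w1 = v_i) = P[i, j] / sum_j P[i, j]
def cond(i):
    return P[i] / P[i].sum()

print(cond(0).sum())  # each conditional row sums to 1
```

Note that each conditional row mixes the components with history-dependent proportions, which is why the conditional is not itself a fixed sum of component conditionals.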
The connection between non-negative tensor factorization and latent variable models has been previously explored in the literature (e.g., in [7], [8]). Non-negative tensor factorization models have also been applied to other language processing applications, including subject-verb-object selectional preference induction [9] and learning semantic word similarity [10]. Without drawing the connection to low rank tensors, Lowd and Domingos [11] propose Naive Bayes models for estimating arbitrary probability distributions that can be seen as a generalization of (6).
Our criterion for language model training is to maximize the log-likelihood of the n-grams appearing in the training data. Formally, we find a local solution to the problem:

    max_{𝒫 ∈ S} Σ_{s=1}^{T} log P(w^{(s)})    (8)

where S denotes the set of element-wise non-negative tensors that are valid joint probability distributions; the w^{(s)} are the n-grams in the training data (obtained by sliding a window over each sentence)¹; and P is the probability distribution given by (6). For n-gram models, the maximum likelihood objective yields models that are highly overfit to the data; in particular, they are plagued with zero probability n-grams. The parameter tying implied by the low rank form reduces this overfitting relative to an unconstrained model; in practice, some additional smoothing is still required.

The low-rank language model can be interpreted as a mixture model, with each rank-1 term of (5) acting as a mixture component. Using this interpretation, we propose an expectation-maximization (EM) approach to training our models, iterating:

1) Given model parameters, assign the responsibilities of each component z to the training samples:

    γ_z(s) = λ_z Π_j A^{(j)}_{z, i_j(s)} / Σ_{z'} λ_{z'} Π_j A^{(j)}_{z', i_j(s)}

2) Given responsibilities γ, re-estimate the parameters:

    λ_z = (1/T) Σ_s γ_z(s),    A^{(j)}_{z,v} = Σ_s γ_z(s) 1(w_j^{(s)} = v) / Σ_s γ_z(s)

where 1(·) is an indicator function. Iterations continue until perplexity on a held-out development set begins to increase.

The above training is only guaranteed to converge to a local optimum, which means that proper initialization can be important. A simple initialization is reasonably effective for bigrams: randomly assign each training sample to one of the k components and estimate the component statistics similar to step 2. To avoid zeroes in the component models, a small count mass weighted by the global unigram probability is added to each component's counts.

¹By using overlapping n-grams, the samples are no longer independent, and the estimation is not strictly maximum likelihood of the original data. While no token is double counted in a distribution for a given position, each will be counted in the distributions for multiple positions. The implication is that the distribution is not consistent with respect to marginalization (a desirable property).
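The EM iteration above can be sketched for a toy bigram LRLM as follows. All sizes and data are illustrative, and a tiny uniform probability floor stands in for the unigram-weighted count mass described above:

```python
import numpy as np

rng = np.random.default_rng(2)
V, k = 4, 2
# Toy bigram training data: (w1, w2) index pairs (illustrative).
data = rng.integers(0, V, size=(200, 2))

# Random initialization of priors and position-dependent unigram rows.
lam = np.full(k, 1.0 / k)
A1 = rng.dirichlet(np.ones(V), size=k)
A2 = rng.dirichlet(np.ones(V), size=k)

for _ in range(20):
    # E-step: responsibility of each component for each bigram sample.
    r = lam[:, None] * A1[:, data[:, 0]] * A2[:, data[:, 1]]  # (k, T)
    r /= r.sum(axis=0, keepdims=True)
    # M-step: re-estimate priors and position-dependent unigram rows
    # from responsibility-weighted counts.
    lam = r.sum(axis=1) / r.sum()
    for j, A in ((0, A1), (1, A2)):
        for z in range(k):
            counts = np.bincount(data[:, j], weights=r[z], minlength=V)
            counts += 1e-6  # tiny floor so no word gets zero probability
            A[z] = counts / counts.sum()

# The re-estimated model is still a valid joint distribution.
P = np.einsum('z,zi,zj->ij', lam, A1, A2)
print(P.sum())
```

In practice one would also track held-out perplexity each iteration and stop when it starts to rise, as described above.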
TABLE I: LANGUAGE MODEL EXPERIMENT DATA

V. EXPERIMENTS
Our expectation is that the LRLM will be good for appli-
cations with small training sets. The experiments here first
evaluate the LRLM by training on a small set of conversational
speech transcripts and then in a domain adaptation context,
which is another common approach when there is data sparsity
in the target domain. The adaptation strategy is the standard
approach of static mixture modeling, specifically linearly inter-
polating a large general model trained on out-of-domain data
with the small domain-specific model.
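The static mixture adaptation strategy is plain linear interpolation of probabilities. A toy sketch with made-up distributions follows; in practice the interpolation weight is tuned on development data:

```python
import math

# Two toy conditional distributions over a 3-word vocabulary
# (probabilities are made up for illustration).
in_domain = {'a': 0.7, 'b': 0.2, 'c': 0.1}
general   = {'a': 0.3, 'b': 0.4, 'c': 0.3}

w = 0.5  # interpolation weight (illustrative; normally tuned on dev data)
mixed = {v: w * in_domain[v] + (1 - w) * general[v] for v in in_domain}
assert abs(sum(mixed.values()) - 1.0) < 1e-12  # still a distribution

# Perplexity of a toy test sequence under the interpolated model.
test = ['a', 'b', 'a', 'c']
ppl = math.exp(-sum(math.log(mixed[v]) for v in test) / len(test))
print(round(ppl, 3))
```

Because the mixture is convex, the interpolated model remains a proper distribution for any weight in [0, 1].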
A. Experimental Setup
Our experiments use LDC English broadcast speech data, with broadcast conversations (BC) or talkshows as the target domain. This in-domain data is divided into three sets: training, development, and test. The general model is trained on a much larger set of broadcast news (BN) speech, which is more formal in style and less conversational. Table I summarizes the data sets.
We train several bigram low rank language models (LR2) on the in-domain (BC) data, tuning the rank (in the range of 25 to 300). Because the initialization is randomized, we train models for each rank ten times with different initializations and pick the one that gives the best performance on the development set. As baselines, we train standard bigram and trigram language models with modified Kneser–Ney (mKN) smoothing. Our general trigram (G3), trained on BN, also uses mKN smoothing. Finally, each of the in-domain models is interpolated with the general model. We use the SRILM toolkit [12] to train the mKN models and to perform model interpolation. The vocabulary consists of the top 5 K words in the in-domain (BC) training set.
B. Results

The experimental results are presented in Table II. As expected, models using only the small in-domain training data have relatively high perplexity; however, among them the LRLM gives the best perplexity, 3.6% lower than the best baseline. Notably, the LR bigram outperforms the mKN trigram. The LR trigram gave no further gain; extensions to address this are described later. The LRLM results are similar to mKN when training on the larger BN set.
Benefiting from a larger training set, the out-of-domain
model alone is much better than the small in-domain models.
Interpolating the general model with any of the in-domain
models yields an approximately 15% reduction in perplexity
over the general model alone, highlighting the importance of
in-domain data. However, the different target-domain models
are contributing complementary information: when the in-domain models are combined, performance further improves. In
particular, combining the baseline trigram and LRLM gives the
largest relative reduction in perplexity.
TABLE II: IN-DOMAIN TEST SET PERPLEXITIES. B DENOTES IN-DOMAIN BASELINE MODEL, G DENOTES GENERAL MODEL, AND LR DENOTES IN-DOMAIN LOW RANK MODEL. EACH MODEL IS SUFFIXED BY ITS n-GRAM ORDER.
Fig. 2. Low rank language model perplexity by rank.
TABLE III: SAMPLES DRAWN RANDOMLY FROM LRLM MIXTURE COMPONENTS
Fig. 2 reports LRLM perplexity for the LR2 model by rank (the number of mixture components). For an in-domain bigram model, using a relatively small number of mixture components is as effective as a full bigram joint probability matrix.
Each component in the model specializes in some particular language behavior; in this light, the LRLM is a type of mixture model. To see what the components capture, we investigated likely n-grams for different mixture components. We find that components tend to specialize in one of four ways: 1) modeling the distribution of words following a common word, 2) modeling the distribution of words preceding a common word, 3) modeling sets of n-grams where the words in both positions are relatively interchangeable with the other words in the same position, and 4) modeling semantic n-grams. Table III illustrates these four types, showing sample n-grams randomly drawn from different components of a trained low rank model.
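The component inspection described above amounts to ranking bigrams by their within-component probability A^{(1)}_{z,i} A^{(2)}_{z,j}. A toy sketch with hypothetical vocabulary and factor rows (not values from the trained model):

```python
import numpy as np

# Hypothetical trained factor rows; each row is a position-dependent
# unigram model, as in the LRLM.
vocab = ['he', 'she', 'said', 'says']
A1 = np.array([[0.7, 0.1, 0.1, 0.1],
               [0.1, 0.7, 0.1, 0.1]])
A2 = np.array([[0.1, 0.1, 0.7, 0.1],
               [0.1, 0.1, 0.1, 0.7]])

# For each component, find the bigram it assigns the highest probability.
for z in range(A1.shape[0]):
    scores = np.outer(A1[z], A2[z])  # within-component bigram probabilities
    i, j = np.unravel_index(np.argmax(scores), scores.shape)
    print(f'component {z}: most likely bigram = "{vocab[i]} {vocab[j]}"')
```

Sorting the full `scores` matrix instead of taking its argmax yields the ranked n-gram lists used to produce a table like Table III.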
VI. SUMMARY AND FUTURE DIRECTIONS

Language model smoothing techniques can be viewed as operations on joint probability tensors over words; in this space, it is observed that one common thread between many smoothing methods is to reduce, either exactly or approximately, the tensor rank. This letter introduces a new approach to language modeling that more directly optimizes the low rank objective, using a factored low-rank tensor representation of the joint probability distribution. Using a novel approach to parameter-tying, the LRLM is better suited to modeling domains where training resources are scarce. On a genre-adaptation task, the LRLM obtains lower perplexity than the baseline (modified Kneser–Ney smoothed) models.
The standard file formats used for interpolating different language models cannot compactly represent the low rank parameter structure of LRLMs. Thus, despite having relatively few free parameters, storing LRLMs in standard formats can result in prohibitively large files when the n-gram order or vocabulary size is large. Interpolating higher n-gram order LRLMs will require either development of a new format or implementation of interpolation within LRLM training itself.

In addition to implementation issues, our initial experiments did not obtain gains for trigrams as they did for bigrams. Possible improvements that may address this include alternative initialization schemes (important given a nonconvex objective), exploration of smoothing in combination with regularization, and other low-rank parameterizations of the model (e.g., the Tucker decomposition [6]). For domain adaptation, there are many other approaches that could be leveraged [13], and the LRLM might be useful as the filtering LM used in selecting data from out-of-domain sources [14]. Finally, it would be possible to incorporate additional criteria into the LRLM training objective, e.g., minimizing distance to a reference model.
REFERENCES

[1] S. F. Chen and J. Goodman, "An empirical study of smoothing techniques for language modeling," Comput. Speech Lang., vol. 13, no. 4, pp. 359–394, Oct. 1999.
[2] C. Chelba, T. Brants, W. Neveitt, and P. Xu, "Study on interaction between entropy pruning and Kneser–Ney smoothing," in Proc. Interspeech, 2010, pp. 2422–2425.
[3] C. J. Hillar and L.-H. Lim, "Most tensor problems are NP hard," CoRR, 2009.
[4] M. Signoretto, L. De Lathauwer, and J. A. K. Suykens, "Nuclear norms for tensors and their use for convex multilinear estimation," ESAT-SISTA, K.U. Leuven, Belgium, Tech. Rep. 10-186, 2010.
[5] R. Tomioka, K. Hayashi, and H. Kashima, "On the extension of trace norm to tensors," in Proc. NIPS Workshop on Tensors, Kernels and Machine Learning, Dec. 2010.
[6] T. G. Kolda and B. W. Bader, "Tensor decompositions and applications," SIAM Rev., vol. 51, no. 3, pp. 455–500, Sep. 2009.
[7] A. Shashua and T. Hazan, "Non-negative tensor factorization with applications to statistics and computer vision," in Proc. ICML, 2005, pp. 792–799.
[8] M. Shashanka, B. Raj, and P. Smaragdis, "Probabilistic latent variable models as nonnegative factorizations," Comput. Intell. Neurosci., 2008.
[9] T. Van de Cruys, "A non-negative tensor factorization model for selectional preference induction," Nat. Lang. Eng., vol. 16, pp. 417–437.
[10] P. D. Turney, "Empirical evaluation of four tensor decomposition algorithms," Institute for Information Technology, NRC Canada, Tech. Rep. ERB-1152, 2007.
[11] D. Lowd and P. Domingos, "Naive Bayes models for probability estimation," in Proc. ICML, 2005, pp. 529–536.
[12] A. Stolcke, "SRILM – an extensible language modeling toolkit," in Proc. ICSLP, 2002, pp. 901–904.
[13] J. R. Bellegarda, "Statistical language model adaptation: Review and perspectives," Speech Commun., vol. 42, pp. 93–108, 2004.
[14] I. Bulyko, M. Ostendorf, M. Siu, T. Ng, A. Stolcke, and O. Çetin, "Web resources for language modeling in conversational speech recognition," ACM Trans. Speech Lang. Process., vol. 5, no. 1, pp. 1–25.