Learning document representations using subspace multinomial model
Santosh Kesiraju1,2, Lukáš Burget1, Igor Szőke1, Jan “Honza” Černocký1
1Brno University of Technology, Speech@FIT and
IT4I Center of Excellence, Brno, Czech Republic
2International Institute of Information Technology - Hyderabad, India
qkesiraju@stud.fit.vutbr.cz, {burget, szoke, cernocky}@fit.vutbr.cz
Abstract
The subspace multinomial model (SMM) is a log-linear model that can be used to learn low-dimensional continuous representations for discrete data. SMM and its variants have been used for speaker verification based on prosodic features and for phonotactic language recognition. In this paper, we propose a new variant of SMM that introduces sparsity and call the resulting model $\ell_1$ SMM. We show that $\ell_1$ SMM can be used to learn document representations that are helpful in topic identification or classification and in clustering tasks. Our experiments on document classification show that SMM achieves results comparable to models such as latent Dirichlet allocation and sparse topical coding, while having the useful property that the resulting document vectors are Gaussian distributed.
Index Terms: Document representation, subspace modelling,
topic identification, latent topic discovery
1. Introduction
Learning document representations that elicit the underlying semantics (or topics) is essential for tasks such as document classification, topic identification, and query-based document retrieval. In this paper, we propose to use the subspace multinomial model (SMM) [1] to obtain compact and continuous representations of documents, which are further used for document classification and clustering tasks. Using evaluation metrics such as classification accuracy and normalized mutual information (NMI), we show that the obtained representations are useful for the aforementioned tasks. SMM was first proposed for speaker verification based on prosodic features [1]. It was later used for phonotactic language recognition (LRE), where the discrete n-gram counts of an utterance are assumed to be generated by a specific multinomial (or n-gram) distribution, and the parameters of these distributions are modelled using subspace techniques [2].
Document clustering, or latent topic discovery, has been studied for more than two decades. In most approaches, the first step is to represent a document as a vector, where every element corresponds to the weighted frequency of a vocabulary word occurring in that document. Word selection can be performed to decide the size of the vocabulary, which essentially eliminates stop words and other irrelevant words that do not contribute significantly to the semantics of the document [3]. This yields a fixed-dimensional representation for every document, which in turn enables clustering or training classifiers. Even with careful word selection, the document vectors are usually sparse, as many words are not shared across all documents. To overcome this, latent variables are introduced whose dimensionality is much lower than that of the document vectors. In the latent space, it is possible to compute similarities between documents even if they share no common words. The latent space also allows us to project both documents and words into the same space. Popular approaches include latent semantic analysis (LSA) [4], non-negative matrix factorization (NMF) [5], probabilistic LSA (PLSA) [6], and latent Dirichlet allocation (LDA) [7].
LSA and NMF use matrix factorization to obtain the latent space while minimizing the reconstruction error. PLSA and LDA belong to the family of probabilistic topic models (PTM) and are seen as generative models for documents [8]: every document is modelled as a distribution over topics (latent variables), and every topic is modelled as a distribution over the vocabulary.
It has been argued that sparsity in the semantic space is desirable in text modelling [9, 10] and NLP tasks [11]. A sparsity-inducing topic model was proposed in [10], where the authors show that it is possible to obtain sparse representations for document codes and word codes. In [11], sparse log-linear models were used to learn a small and useful feature space for dialogue-act classification.
In the proposed SMM, we obtain a low-dimensional continuous representation for every document, while the objective function is based on maximizing the likelihood of the observed data, similar to PTM. In PTM, the latent vectors have a probabilistic interpretation (points on a simplex), whereas in SMM the latent vectors are Gaussian-like distributed [2], which helps in clustering and in training classifiers. A similar idea was proposed in [12], where a subspace technique was used to obtain a compact representation of multiple topic spaces learned from LDA. The technique used in [12] differs from SMM: the former is based on subspace Gaussian mixture models, which were proposed for modelling continuous data, whereas SMM acts directly on the observed word counts (discrete data).
2. Subspace multinomial model
Let $D$ be the number of documents in the collection, with $d$ representing the document index, and let $V$ be the total number of words in the vocabulary, with $i$ representing the word index. Every word in a document can be seen as an independent event (bag-of-words) generated by a document-specific multinomial distribution; let $\mathbf{c}_d$ denote the vector of word occurrences in document $d$. We now describe the basic subspace multinomial model [1].
2.1. Basic SMM
Let $\phi_{di}$ be the parameters (word probabilities) of the document-specific multinomial distribution. The corresponding log-likelihood of the document can be written as
$$\log P(\mathbf{c}_d \mid \boldsymbol{\phi}_d) = \sum_{i=1}^{V} c_{di} \log(\phi_{di}), \qquad (1)$$
where $\sum_i \phi_{di} = 1$, $\phi_{di} \geq 0$, and $c_{di}$ is the count of word $i$ in document $d$. The parameter $\phi_{di}$ of the multinomial distribution, which belongs to the exponential family, can be re-parameterized using the natural parameters $\eta$ [13] as
$$\phi_{di} = \frac{\exp(\eta_{di})}{\sum_i \exp(\eta_{di})}, \qquad (2)$$
which is also known as the softmax function. The subspace model assumes that these natural parameters live in a much smaller space and can be expressed as
$$\boldsymbol{\eta}_d = \mathbf{m} + \mathbf{T}\mathbf{w}_d. \qquad (3)$$
Here $\mathbf{w}_d \in \mathbb{R}^{K}$ is the document-specific latent vector, also known as the i-vector, $\mathbf{T} \in \mathbb{R}^{V \times K}$ is known as the total variability matrix (bases matrix), which spans a linear subspace in the log-probability domain, and $\mathbf{m} \in \mathbb{R}^{V}$ can be seen as a vector of offsets or biases.
The model parameters are initialized as follows: $\mathbf{w}$ to zeros, $\mathbf{T}$ to small random values, and $\mathbf{m}$ to the log probabilities of words as estimated from the training set (which can be seen as an average distribution over the entire training set). The parameters $\mathbf{w}$ and $\mathbf{T}$ are updated iteratively and alternately ($\mathbf{m}$ is not updated in our experiments) using Newton-Raphson-like update steps that maximize the joint log-likelihood of all the documents,
$$\mathcal{L} = \sum_{d=1}^{D} \sum_{i=1}^{V} c_{di} \log(\phi_{di}). \qquad (4)$$
The update equations take the following form [1]:
$$\mathbf{w}_d^{\text{new}} = \mathbf{w}_d + \mathbf{H}_d^{-1} \nabla_{\mathbf{w}_d}, \qquad (5)$$
$$\mathbf{t}_i^{\text{new}} = \mathbf{t}_i + \mathbf{H}_i^{-1} \nabla_{\mathbf{t}_i}. \qquad (6)$$
Here $\mathbf{t}_i$ is the $i$-th row of $\mathbf{T}$, and $\nabla_{\mathbf{w}_d}$ and $\nabla_{\mathbf{t}_i}$ are the gradients of the objective function in Eq. (4). Since the parameters (the rows of $\mathbf{T}$ and every document i-vector) are updated independently, the corresponding $\mathbf{H}$ matrices ($\mathbf{H}_i$ and $\mathbf{H}_d$) are much smaller and faster to compute [14] than the conventional full Hessian matrix in Newton-Raphson optimization [13].
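As a concrete illustration of the basic model, the following minimal NumPy sketch computes the document-specific multinomial parameters from Eqs. (2)-(3) and the log-likelihood of Eq. (1). The variable names are ours and the snippet is only a sketch of the forward computation, not the authors' implementation.

```python
import numpy as np

def smm_log_likelihood(c_d, m, T, w_d):
    """Log-likelihood of one document under the basic SMM (Eqs. 1-3).

    c_d : (V,) word counts of document d
    m   : (V,) offset vector (log of average word probabilities)
    T   : (V, K) total variability (bases) matrix
    w_d : (K,) document i-vector
    """
    eta = m + T @ w_d                        # natural parameters, Eq. (3)
    eta -= eta.max()                         # numerical stability for the softmax
    phi = np.exp(eta) / np.exp(eta).sum()    # softmax, Eq. (2)
    return float(np.sum(c_d * np.log(phi)))  # Eq. (1)

# toy usage with a hypothetical vocabulary of 5 words and K = 2
V, K = 5, 2
rng = np.random.default_rng(0)
c_d = np.array([3, 0, 1, 0, 2])
m = np.log(np.full(V, 1.0 / V))
T = 0.01 * rng.standard_normal((V, K))
w_d = np.zeros(K)
print(smm_log_likelihood(c_d, m, T, w_d))
```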
2.2. Limitations
In a document collection, the most frequently occurring words are the stop words, which do not have the ability to semantically discriminate between documents. So, when the full vocabulary including the stop words is used, the number of parameters in the model increases, which can lead to over-fitting. To overcome this, the model can be regularized. A variant of SMM (the subspace n-gram model) was proposed for language recognition in [2], where the authors used an $\ell_2$-regularized model, which can be interpreted as MAP point estimation of the parameters with a Gaussian prior. It was observed in [2] that the i-vectors ($\mathbf{w}$) exhibit a Gaussian-like distribution across the various dimensions, whereas the rows of $\mathbf{T}$ exhibit a Laplace-like distribution (which does not comply with the Gaussian prior assumption). It was also suggested in [2] that $\ell_1$ regularization could be applied to $\mathbf{T}$, which can be interpreted as MAP point estimation with a Laplace prior. Following this, we propose to regularize $\mathbf{T}$ with $\ell_1$ and $\mathbf{w}$ with $\ell_2$, and call the resulting model $\ell_1$ SMM.
2.3. $\ell_1$ SMM
The objective function from Eq. (4) becomes
$$\mathcal{L} = \sum_{d=1}^{D} \sum_{i=1}^{V} c_{di} \log(\phi_{di}) \;-\; \gamma \sum_{i=1}^{V} \|\mathbf{t}_i\|_1 \;-\; \frac{\lambda}{2} \sum_{d=1}^{D} \|\mathbf{w}_d\|^2 . \qquad (7)$$
Here $\gamma$ and $\lambda$ are the regularization coefficients for $\mathbf{t}$ and $\mathbf{w}$, respectively. It is essential to regularize both $\mathbf{t}$ and $\mathbf{w}$; otherwise, restricting the magnitude of one parameter is simply compensated by an increase in the dynamic range of the other during the iterative update steps (Eqs. (5) and (6)).
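To make the regularized objective concrete, here is a minimal NumPy sketch that evaluates Eq. (7) over a whole collection; it assumes the $\ell_1$ penalty is applied once per row of $\mathbf{T}$ and the $\ell_2$ penalty once per i-vector, and all variable names are ours rather than the authors'.

```python
import numpy as np

def l1_smm_objective(C, m, T, W, gamma, lam):
    """Regularized SMM objective of Eq. (7) for the whole collection.

    C : (D, V) matrix of word counts
    m : (V,) offset vector
    T : (V, K) bases matrix
    W : (D, K) matrix of i-vectors (one row per document)
    """
    eta = m + W @ T.T                                                # (D, V) natural parameters
    log_phi = eta - np.logaddexp.reduce(eta, axis=1, keepdims=True)  # log-softmax per document
    log_lik = np.sum(C * log_phi)                                    # data term, Eq. (4)
    l1_penalty = gamma * np.abs(T).sum()                             # gamma * sum_i ||t_i||_1
    l2_penalty = 0.5 * lam * np.sum(W ** 2)                          # (lambda / 2) * sum_d ||w_d||^2
    return log_lik - l1_penalty - l2_penalty
```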
Estimating the parameters of any $\ell_1$-regularized function is not trivial, as the $\ell_1$ term introduces discontinuities at the points where the function crosses the axes. To address this, several optimization techniques have been proposed [15, 16]. One such technique, called orthant-wise learning, is explored in our work, as it translates in a straightforward way into our optimization scheme (Eq. (6)).
An orthant is a region of $n$-dimensional space in which the signs of the variables do not change; it is the analogue of a quadrant in 2D and an octant in 3D. The important property of any $\ell_1$-regularized function is that it is differentiable over any given orthant. In general, for any $\ell_1$-regularized convex objective function, if the initial point lies in the same orthant as the minimum, then simple Newton-Raphson updates will lead to the minimum. In cases where we need to cross into another orthant to reach the minimum, orthant-wise learning can be adopted [16].
2.4. Parameter estimation using orthant-wise learning
The gradient of the objective function in Eq. (7) with respect to $\mathbf{t}_i$ is given by
$$\nabla_{\mathbf{t}_i} = \sum_{d=1}^{D} \left( c_{di} - \phi_{di}^{\text{old}} \sum_{j=1}^{V} c_{dj} \right) \mathbf{w}_d^{\mathsf{T}} - \gamma\, \mathrm{sign}(\mathbf{t}_i), \qquad (8)$$
where $\mathrm{sign}$ is the element-wise sign operation on the vector $\mathbf{t}_i$. At co-ordinates where the objective function is not differentiable (i.e., when any co-ordinate $k$ of $\mathbf{t}_i \in \mathbb{R}^{K}$ equals 0), we compute the pseudo-gradient $\tilde{\nabla}_{\mathbf{t}_i}$:
$$\tilde{\nabla}_{t_{ik}} \triangleq \begin{cases} \nabla_{t_{ik}} + \gamma, & t_{ik} = 0,\ \nabla_{t_{ik}} < -\gamma, \\ \nabla_{t_{ik}} - \gamma, & t_{ik} = 0,\ \nabla_{t_{ik}} > \gamma, \\ 0, & t_{ik} = 0,\ |\nabla_{t_{ik}}| \leq \gamma, \\ \nabla_{t_{ik}}, & |t_{ik}| > 0. \end{cases} \qquad (9)$$
Otherwise, $\tilde{\nabla}_{\mathbf{t}_i} = \nabla_{\mathbf{t}_i}$. For updates following a Newton-Raphson-like method, we need to ensure two things: (i) we need to find an ascent direction $\mathbf{d}$ that leads us into the correct orthant, and (ii) a step in the ascent direction should not cross a point of non-differentiability. In general, the search direction $\mathbf{d} \in \mathbb{R}^{K}$ is of the form
$$\mathbf{d}_i \triangleq \mathbf{H}_i^{-1} \tilde{\nabla}_{\mathbf{t}_i}. \qquad (10)$$
To ensure that the new update ($\mathbf{t}_i^{\text{new}}$) is along an ascent direction ($\mathbf{d}_i \tilde{\nabla}_{\mathbf{t}_i} > 0$), the co-ordinates of the search direction $\mathbf{d}_i$ are set to zero if their sign does not match that of the corresponding co-ordinates of the steepest ascent direction $\tilde{\nabla}_{\mathbf{t}_i}$. This sign projection is denoted by $P_S$:
$$P_S(\mathbf{d})_{ik} \triangleq \begin{cases} d_{ik}, & \text{if } d_{ik}\, \tilde{\nabla}_{t_{ik}} > 0, \\ 0, & \text{otherwise}. \end{cases} \qquad (11)$$
To ensure that the step does not cross a point of non-differentiability, we apply the following orthant projection:
$$P_O(\mathbf{t} + \mathbf{d})_{ik} \triangleq \begin{cases} 0, & \text{if } t_{ik}(t_{ik} + d_{ik}) < 0, \\ t_{ik} + d_{ik}, & \text{otherwise}. \end{cases} \qquad (12)$$
This orthant projection sets those co-ordinates of $\mathbf{t}_i^{\text{new}}$ to zero that differ in sign from $\mathbf{t}_i$. Finally, the update for $\mathbf{t}_i$ is given as follows:
$$\mathbf{t}_i^{\text{new}} = P_O\!\left[\, \mathbf{t}_i + P_S\!\left[ \mathbf{H}_i^{-1} \tilde{\nabla}_{\mathbf{t}_i} \right] \right]. \qquad (13)$$
Here $\mathbf{H}_i \in \mathbb{R}^{K \times K}$ is computed as follows:
$$\mathbf{H}_i = - \sum_{d=1}^{D} \max\!\left( c_{di},\ \phi_{di}^{\text{old}} \sum_{j=1}^{V} c_{dj} \right) \mathbf{w}_d \mathbf{w}_d^{\mathsf{T}}. \qquad (14)$$
The updates for $\mathbf{w}_d$ follow Eq. (5), with the following gradient:
$$\nabla_{\mathbf{w}_d} = \sum_{i=1}^{V} \mathbf{t}_i^{\mathsf{T}} \left( c_{di} - \phi_{di}^{\text{old}} \sum_{j=1}^{V} c_{dj} \right) - \lambda \mathbf{w}_d. \qquad (15)$$
The matrix $\mathbf{H}_d \in \mathbb{R}^{K \times K}$ for updating $\mathbf{w}_d$ is given as follows:
$$\mathbf{H}_d = - \sum_{i=1}^{V} \mathbf{t}_i^{\mathsf{T}} \mathbf{t}_i \max\!\left( c_{di},\ \phi_{di}^{\text{old}} \sum_{j=1}^{V} c_{dj} \right) - \lambda \mathbf{I}. \qquad (16)$$
More details on estimating the $\mathbf{H}$ matrices are given in [14].
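As an illustration of how Eqs. (8)-(14) fit together, here is a minimal NumPy sketch of one orthant-wise update of a single row $\mathbf{t}_i$. It is our own reading with hypothetical variable names, not the authors' implementation; in particular, we treat $\mathbf{H}_i$ in Eq. (14) as a negative-definite Hessian approximation and therefore take $(-\mathbf{H}_i)^{-1}\tilde{\nabla}_{\mathbf{t}_i}$ as the Newton-like ascent step.

```python
import numpy as np

def orthant_update_row(t_i, c_i, n_d, phi_i, W, gamma):
    """One orthant-wise Newton-like update of row t_i (our reading of Eqs. 8-14).

    t_i   : (K,) current row of T
    c_i   : (D,) count of word i in each document
    n_d   : (D,) total word count of each document (sum_j c_dj)
    phi_i : (D,) current probability phi_di of word i in each document
    W     : (D, K) i-vectors, one row per document
    gamma : l1 regularization coefficient
    """
    resid = c_i - phi_i * n_d
    grad = resid @ W - gamma * np.sign(t_i)                 # Eq. (8)

    # pseudo-gradient at coordinates where t_ik == 0, Eq. (9)
    pgrad = grad.copy()
    zero = (t_i == 0.0)
    pgrad[zero] = np.where(grad[zero] < -gamma, grad[zero] + gamma,
                           np.where(grad[zero] > gamma, grad[zero] - gamma, 0.0))

    # negative of Eq. (14): a positive-definite curvature matrix
    H_pos = (W.T * np.maximum(c_i, phi_i * n_d)) @ W
    d = np.linalg.solve(H_pos, pgrad)                       # Newton-like ascent step

    d = np.where(d * pgrad > 0, d, 0.0)                     # sign projection, Eq. (11)
    t_new = t_i + d
    t_new = np.where(t_i * t_new < 0, 0.0, t_new)           # orthant projection, Eq. (12)
    return t_new
```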
If the updates of $\mathbf{T}$ and $\mathbf{w}$ fail to increase the objective function in Eq. (7), we keep backtracking by halving the update step. Typically, the model converges after 15 to 20 iterations. Once the model is trained, the i-vector $\mathbf{w}_d$ of every document $d$ is extracted by keeping $\mathbf{T}$ fixed and applying the updates in Eq. (5) that maximize the objective function. The i-vectors are extracted for both the training and test datasets, and the extraction takes 3 to 5 iterations to converge.
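The backtracking rule mentioned above can be sketched as follows; `objective` and the proposed `step` are placeholders for the quantities defined by Eqs. (7), (5) and (13), and the bounded number of halvings is our own assumption.

```python
def backtracking_update(param, step, objective, max_halvings=10):
    """Accept param + step only if it improves the objective; otherwise halve the step.

    param     : current parameter (e.g. a row t_i of T, or an i-vector w_d)
    step      : proposed update direction (already sign- and orthant-projected)
    objective : callable returning the value of Eq. (7) for a candidate parameter
    """
    base = objective(param)
    for _ in range(max_halvings):
        candidate = param + step
        if objective(candidate) > base:
            return candidate
        step = 0.5 * step           # halve the update step and try again
    return param                    # keep the old value if no improvement was found
```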
3. Experiments
The experiments were conducted on the 20 Newsgroups dataset, as it is well studied and has several benchmark baseline systems. We used the 20-news-bydate version as in [10], which contains 18775 documents in 20 categories, with a total vocabulary of 61188 words. The training set consists of 11269 documents with 53975 unique words, and the test set consists of 7505 documents.
3.1. Document classification
Since the document vectors (i-vectors) exhibit a Gaussian-like distribution, we used a linear Gaussian classifier, where every class has its own mean and the covariance matrix is shared across classes [13]. The classification accuracy on the test set for $\ell_1$ and $\ell_2$ SMM for various values of $\gamma$ (the regularization coefficient of $\mathbf{T}$) and various i-vector dimensions is shown in Fig. 1. For the purpose of illustration, we fixed the value of $\lambda$ (the regularization coefficient of the i-vectors) at $10^{-4}$. We also give a comparison with LDA, discriminative LDA (DiscLDA) [17], sparse topical coding (STC), and max-margin supervised STC (MedSTC) [10] in Table 1, along with the corresponding latent variable dimension for which the classification accuracy is reported to be maximal [10, 17]. A detailed comparison of STC and its variants with various other models can be found in [10]. It is important to note that DiscLDA and MedSTC, which achieve better classification results, are supervised models, i.e., topic label information is incorporated while obtaining the latent vector representation, whereas their counterparts, LDA and STC, are completely unsupervised models like SMM.
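A linear Gaussian classifier with class-specific means and a shared covariance matrix is equivalent to linear discriminant analysis, so a minimal sketch of the classification step could look as follows (using scikit-learn; the variable names and the choice of LinearDiscriminantAnalysis are ours, not necessarily the authors' implementation).

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def classify_ivectors(W_train, y_train, W_test, y_test):
    """Linear Gaussian classifier: class-specific means, shared covariance.

    W_train, W_test : (num_docs, K) i-vectors extracted by the SMM
    y_train, y_test : newsgroup labels
    """
    clf = LinearDiscriminantAnalysis()   # shared covariance -> linear decision boundaries
    clf.fit(W_train, y_train)
    return clf.score(W_test, y_test)     # classification accuracy on the test set
```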
Figure 1: Classification accuracy (in %) on the test set of $\ell_1$ and $\ell_2$ SMM for various values of $\gamma$ ($10^{-4}$, $10^{0}$, $10^{1}$, $10^{2}$) and i-vector dimensions (100, 200, 300), with $\lambda = 10^{-4}$.
Table 1: Comparison of classification accuracy (in %) across various models for the best latent variable dimension (K).

Model     LDA    DiscLDA   STC    MedSTC   $\ell_2$ SMM   $\ell_1$ SMM   $\ell_1$ SMM
%         75.0   80.0      74.0   81.0     74.95          75.46          78.84
K         110    110       90     100      100            100            1000
3.2. Document clustering
Clustering is not implicit in SMM as it is in NMF [5], and the latent vector (i-vector) dimension does not necessarily correspond to the number of latent topics as in PTM. Since the i-vectors exhibit a Gaussian-like distribution, clustering techniques such as k-means or Gaussian mixture models can be used. In our experiments, we used k-means to obtain hard clusters. The clustering was performed on the entire dataset (training + test), while keeping the subspace trained only on the training set (to maintain consistency with the classification experiments). The resulting clusters are evaluated using NMI [3], and the average scores (over 5 runs) are shown in Table 2. The number of clusters in k-means is fixed to 20 (the same as the actual number of classes).
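A minimal sketch of this clustering and evaluation step, using scikit-learn's KMeans and normalized mutual information (variable names are ours, and the exact NMI normalization of [3] may differ from the library default):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def cluster_and_score(W_all, labels, n_clusters=20, n_runs=5, seed=0):
    """Cluster i-vectors with k-means and report the average NMI over several runs."""
    scores = []
    for run in range(n_runs):
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed + run)
        assignments = km.fit_predict(W_all)        # hard cluster labels
        scores.append(normalized_mutual_info_score(labels, assignments))
    return float(np.mean(scores))
```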
In [18], a model that is a mixture of LDAs (the multi-grain clustering topic model, MGCTM) was proposed, which integrates topic modelling with document clustering. In Table 3, we compare the proposed SMM with other techniques as reported in [18]. Here, the hyper-parameters (including the latent vector dimension) are tuned to achieve the best clustering performance, and the document-specific parameters of MGCTM are initialized using LDA. It can be seen that the proposed SMM performs well at clustering and classification at the same time with the same model (i.e., with the same model parameters, including the i-vector dimension). More experiments on various datasets, with analysis of STC, LDA, MGCTM and SMM, are left to future work.
Table 2: Comparison of average NMI scores of $\ell_1$ and $\ell_2$ SMM for various values of $\gamma$, with $\lambda = 10^{-4}$, i-vector dimension = 100, and number of clusters = 20. The last row gives the sparsity of $\mathbf{T}$ for $\ell_1$ SMM (see Section 4).

$\gamma$        $10^{-4}$   $10^{0}$   $10^{1}$   $10^{2}$   $10^{3}$
$\ell_2$ SMM    0.50        0.49       0.49       0.50       0.52
$\ell_1$ SMM    0.56        0.57       0.58       0.58       0.45
Sparsity (%)    0.2         3.6        22.0       53.5       78.8
Table 3: Comparison of average NMI scores of other systems with $\ell_1$ and $\ell_2$ SMM, for $\gamma = 10^{1}$, $\lambda = 10^{-4}$, i-vector dimension = 100, and number of clusters = 20.

Method   $\ell_2$ SMM   $\ell_1$ SMM   LDA    NMF    MGCTM
NMI      0.52           0.58           0.48   0.36   0.61
4. Discussion
In our experiments, we observed that the classification accuracy of SMM increases with increasing dimensionality of the latent variable (i-vector), which is not the case with STC or PTM [10]. Further, we achieved 78.84% classification accuracy on the test set for $\ell_1$ SMM with an i-vector dimension of 1000.
Figure 2: Histograms showing the distribution of values in rows of the bases matrix $\mathbf{T}$ and in the i-vectors $\mathbf{w}$ (normalized frequency versus value).
Fig. 2 shows example histograms of 5 randomly selected rows of the bases matrix $\mathbf{T}$ with $\gamma = 10.0$, and of the i-vectors with $\lambda = 10^{-4}$ and $K = 100$, after the training phase. The distribution of the bases matrix values in Fig. 2 suggests that a Laplace prior is suitable for $\mathbf{T}$. The Laplace prior enforces sparsity in the bases matrix $\mathbf{T}$, which suggests that feature (word) selection is implicit. Table 2 gives the percentage of sparsity in $\mathbf{T}$ for $\ell_1$ SMM for various values of $\gamma$, along with the NMI scores. It can be observed that $\ell_1$ SMM performs better than $\ell_2$ SMM across various values of $\gamma$, reinforcing the suitability of the Laplace prior over the bases matrix $\mathbf{T}$.
In an unsupervised scenario, it is possible to obtain a set of words for each cluster that represent it or discriminate it from the other clusters. One way is to subtract the global mean of the i-vectors ($\mathbf{w}_d$) from the cluster mean, project the resulting vector onto the bases matrix $\mathbf{T}$, and find the indices of the large positive values (see the sketch after Table 4). These indices correspond to the words whose probabilities increase significantly for the given cluster, compared to the average distribution over words. Table 4 shows an example of words from all 20 clusters obtained using k-means for $\ell_1$ SMM with $\lambda = 10^{-4}$, $\gamma = 10^{1}$ and i-vector dimension $K = 100$.
Table 4: Top 5 significant words from all 20 clusters (one cluster per line).

acceleration, suspension, wagon, tires, chevy
preferably, architecture, databases, publisher, blvd
scotia, sluggo, compuserv, nursery, pruden
autoexec, windows, exe, icons, ini
xlib, widget, parameter, openwindows, xview
sale, packaging, obo, shipping, cod
waco, atf, fbi, convicted, koresh
sacred, worship, christianity, atheist, prophet
physicians, patients, therapy, infection, diagnosed
income, socialism, abortion, welfare, cramer
murders, criminals, firearm, handguns, criminal
privacy, encryption, denning, clipper, crypto
rbi, dodgers, hitters, pitcher, pitching
israeli, lebanon, occupied, palestinians, palestinian
compute, algorithms, polygon, shareware, surfaces
spacecraft, lunar, moon, exploration, orbit
resistor, amplifier, resistors, volt, voltage
hockey, potvin, leafs, nhl, playoff
nubus, quadra, meg, slots, adapter
zx, bikes, motorcycle, riding, bike
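The word-selection procedure described before Table 4 could be sketched as follows (a minimal illustration with our own variable names; `vocab` maps word indices to word strings).

```python
import numpy as np

def top_words_per_cluster(T, W, cluster_ids, vocab, top_n=5):
    """For each cluster, find words whose probability rises most above the average.

    T           : (V, K) bases matrix
    W           : (D, K) i-vectors of all documents
    cluster_ids : (D,) hard cluster assignment of each document
    vocab       : list of V word strings
    """
    global_mean = W.mean(axis=0)
    top = {}
    for c in np.unique(cluster_ids):
        cluster_mean = W[cluster_ids == c].mean(axis=0)
        # project the mean shift back onto the vocabulary (log-probability) space
        scores = T @ (cluster_mean - global_mean)
        idx = np.argsort(scores)[::-1][:top_n]     # indices of the largest positive values
        top[int(c)] = [vocab[i] for i in idx]
    return top
```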
5. Conclusions and future work
In this paper, we have proposed a new variant of the subspace multinomial model, called $\ell_1$ SMM, and its application to topic identification and document clustering. We have shown that it is possible to introduce sparsity in the semantic space (bases matrix) while retaining the useful property that the document vectors are Gaussian distributed. Having such a distribution helped in using simple classifiers and clustering techniques, rather than relying on sophisticated models for each of the tasks.
By applying orthant-wise optimization, we have shown how the $\ell_1$ SMM can be trained. There are many other optimization techniques for $\ell_1$-regularized objective functions that could be explored [16]. Our future work involves exploring discriminative SMM and fully Bayesian modelling of SMM.
6. Acknowledgements
This work was supported by the U.S. DARPA LORELEI contract No. HR0011-15-C-0115. The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government. The work was also supported by the Czech Ministry of Education, Youth and Sports from the National Programme of Sustainability (NPU II) project "IT4Innovations excellence in science - LQ1602". Santosh Kesiraju is partly funded by the TCS Research Fellowship program, India.
7. References
[1] M. Kockmann, L. Burget et al., "Prosodic speaker verification using subspace multinomial models with intersession compensation," in INTERSPEECH, ISCA, September 2010, pp. 1061–1064.
[2] M. Soufifar, L. Burget, O. Plchot et al., "Regularized Subspace n-Gram Model for Phonotactic iVector Extraction," in INTERSPEECH, ISCA, August 2013, pp. 74–78.
[3] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
[4] S. Deerwester, S. T. Dumais et al., "Indexing by Latent Semantic Analysis," Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391–407, 1990.
[5] W. Xu, X. Liu, and Y. Gong, "Document Clustering Based on Non-negative Matrix Factorization," in SIGIR. New York, USA: ACM, 2003, pp. 267–273.
[6] T. Hofmann, "Probabilistic Latent Semantic Indexing," in SIGIR. New York, USA: ACM, 1999, pp. 50–57.
[7] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet Allocation," JMLR, vol. 3, pp. 993–1022, March 2003.
[8] M. Steyvers and T. Griffiths, "Probabilistic Topic Models," Handbook of Latent Semantic Analysis, vol. 427, no. 7, pp. 424–440, 2007.
[9] M. V. S. Shashanka, B. Raj, and P. Smaragdis, "Sparse Overcomplete Latent Variable Decomposition of Counts Data," in NIPS, December 2007, pp. 1313–1320.
[10] J. Zhu and E. P. Xing, "Sparse Topical Coding," in Proceedings of the 27th Conference on UAI, July 2011, pp. 831–838.
[11] Y. Chen, W. Y. Wang, and A. I. Rudnicky, "An empirical investigation of sparse log-linear models for improved dialogue act classification," in IEEE ICASSP, May 2013, pp. 8317–8321.
[12] M. Morchid, M. Bouallegue et al., "An I-vector Based Approach to Compact Multi-Granularity Topic Spaces Representation of Textual Documents," in EMNLP, October 2014, pp. 443–454.
[13] C. M. Bishop, Pattern Recognition and Machine Learning. Springer-Verlag New York, Inc., 2006.
[14] D. Povey, L. Burget et al., "The Subspace Gaussian Mixture Model - A Structured Model for Speech Recognition," Comput. Speech Lang., vol. 25, no. 2, pp. 404–439, Apr. 2011.
[15] G. Andrew and J. Gao, "Scalable Training of L1-Regularized Log-Linear Models," in ICML. New York, USA: ACM, 2007, pp. 33–40.
[16] M. Schmidt, "Graphical Model Structure Learning with $\ell_1$ Regularization," Ph.D. dissertation, The University of British Columbia, August 2010.
[17] S. Lacoste-Julien, F. Sha, and M. I. Jordan, "DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification," in NIPS, 2009, pp. 897–904.
[18] P. Xie and E. P. Xing, "Integrating Document Clustering and Topic Modeling," in UAI, August 2013.