Unsupervised Multi-Level Non-Negative Matrix Factorization Model: Binary Data Case

Rank determination is one of the most significant issues in non-negative matrix factorization (NMF) research. However, the rank determination problem has not received as much emphasis as the sparseness regularization problem. Usually, the rank of the base matrix needs to be assumed. In this paper, we propose an unsupervised multi-level non-negative matrix factorization model to extract the hidden data structure and seek the rank of the base matrix. From a machine learning point of view, the learning result depends on prior knowledge. In our unsupervised multi-level model, we construct a three-level data structure for the non-negative matrix factorization algorithm. Such a construction can apply more prior knowledge to the algorithm and obtain a better approximation of the real data structure. The final bases selection is achieved through L2-norm optimization. We conduct our experiments on binary datasets. The results demonstrate that our approach is able to retrieve the hidden structure of the data and thus determine the correct rank of the base matrix.
Journal of Information Security, 2012, 3, 245-250
Published Online October 2012
Qingquan Sun1, Peng Wu2, Yeqing Wu1, Mengcheng Guo1, Jiang Lu1
1Department of Electrical and Computer Engineering, The University of Alabama, Tuscaloosa, USA
2School of Information Engineering, Wuhan University of Technology, Wuhan, China
Received July 21, 2012; revised August 26, 2012; accepted September 7, 2012
Keywords: Non-Negative Matrix Factorization; Bayesian Model; Rank Determination; Probabilistic Model
1. Introduction
Non-negative matrix factorization (NMF) was proposed by Lee and Seung [1] in 1999. NMF has become a widely used technique over the past decade in the machine learning and data mining fields. The most significant properties of NMF are that it is non-negative, intuitive, and yields a part-based representation. Specific applications of the NMF algorithm include image recognition [2], audio and acoustic signal processing [3], and semantic analysis and content surveillance [4]. In NMF, given a non-negative dataset $V \in \mathbb{R}_{\ge 0}^{M \times N}$, the objective is to find two non-negative factor matrices $W \in \mathbb{R}_{\ge 0}^{M \times K}$ and $H \in \mathbb{R}_{\ge 0}^{K \times N}$. Here W is called the base matrix and H is called the feature matrix. In addition, W and H satisfy
$$V \approx WH \quad \text{s.t.} \quad W \ge 0,\; H \ge 0 \qquad (1)$$
K is the rank of the base matrix and it satisfies the inequality $K \le MN/(M+N)$.
In NMF research, the cost function and the initialization problem have been the main issues for researchers. Recently the rank determination problem has also become popular. The rank of the base matrix is indeed an important parameter for evaluating the accuracy of structure extraction. On the one hand, it reflects the real features and properties of the data; on the other hand, more accurate learning helps us better understand and analyze the data, thus improving the performance in applications such as recognition [5,6], surveillance, and tracking. The main challenge of the rank determination problem is that the rank is pre-defined, so it is hard to know the correct rank of the base matrix before the components are updated. As with the cost function, previous methods add no further priors to the algorithm. That is why the canonical NMF method and traditional probabilistic methods (ML, MAP) cannot handle the rank determination problem. Therefore, in this paper, we propose an unsupervised multi-level model to automatically seek the correct rank of the base matrix. Furthermore, we use the L2-norm to show the contribution of the hyper-prior to the correct bases learning procedure. Experimental results on two binary datasets demonstrate that our method is efficient and robust.
The rest of this paper is organized as follows: Section 2 provides a brief review of related work. In Section 3, we describe our unsupervised multi-level NMF model in detail. The experimental results on two binary datasets are shown in Section 4. Section 5 concludes the paper.
2. Related Work
As mentioned above, the rank determination problem is a newly popular issue in NMF research, and little of the literature discusses it. Although the author of [7] proposed a method based on sampler selection, it needs to pass through all possible values of the rank of the base matrix to choose the best one. Obviously, this method is not well suited to unsupervised learning. In [8], the author proposed a rank determination method based on automatic relevance determination. In this method, a parameter is defined that is relevant to the columns of W, and the EM algorithm is then used to find a subset of bases. However, this subset is not accurate enough to represent the true bases. In fact, the nature of this hyper-parameter is to affect the updating procedure of the base matrix and the feature matrix, and thus the distributions of the components.
The only feasible solution is a fully Bayesian model. Such methods have been proposed in [9], where the author presents an EM-based fully Bayesian algorithm to discover the rank of the base matrix. EM-based methods are approximate solutions. In comparison, Gibbs sampling based methods are somewhat more accurate; such an approach is used to find the correct rank in [10]. Although these methods are flexible, they require successive calculation of the marginal likelihood for each possible value of the rank K. The drawback is the high computation cost involved. When such methods are applied to real-time applications or to applications based on large-scale datasets, the high computation load is impractical. Motivated by this situation, we propose a low-computation, robust multi-level model for NMF to solve the rank determination problem. Our unsupervised model with multi-level priors computes the rank of the base matrix only once and is able to find the correct rank of the base matrix given a large enough initial rank K. Therefore, our method involves less computation. This is discussed in detail in the next section.
3. Unsupervised Multi-Level Non-Negative
Matrix Factorization Model
In our unsupervised multi-level NMF model, we introduce a hyper-prior level. Hence, there are three levels in our model: the data model, the prior model, and the hyper-prior model. The model structure is shown in Figure 1. We seek the solutions by optimizing the maximum a posteriori (MAP) criterion. Our approach can be described by the following equation, where $\overset{c}{=}$ denotes equality up to a constant and $\lambda$ is the prior of both W and H:
$$\mathcal{L}_{MAP}(W, H, \lambda) \overset{c}{=} \log p(V|WH) + \log p(W|\lambda) + \log p(H|\lambda) + \log p(\lambda) \qquad (2)$$
Figure 1. Unsupervised multi-level non-negative matrix factorization model.
The difference between our approach and the traditional MAP criterion is that the traditional one adds no hyper-prior to the model. Moreover, in our model we update the hyper-priors recursively rather than setting them as constants.
3.1. Model Construction
In the NMF algorithm, the updating rules are based on the specific data model. Therefore, the first step is to set a data model for our problem. In our experiments we assume that the data follow a Poisson distribution. Consequently, the cost function of our model is the generalized KL-divergence. Given a variable x that follows a Poisson distribution with parameter $\theta$, we have $p(x|\theta) = e^{-\theta}\theta^{x}/\Gamma(x+1)$. Thus, in the NMF algorithm, given the dataset V, we have the likelihood
$$p(V|WH) = \prod_{m,n} \frac{e^{-[WH]_{mn}}\,[WH]_{mn}^{v_{mn}}}{\Gamma(v_{mn}+1)} \qquad (3)$$
The generalized KL-divergence is given by:
$$D_{KL}(V|WH) = \sum_{m,n}\left( v_{mn}\log\frac{v_{mn}}{[WH]_{mn}} - v_{mn} + [WH]_{mn} \right) \qquad (4)$$
Thus, the log-likelihood of the dataset V can be rewritten as:
$$\log p(V|WH) = -D_{KL}(V|WH) + \sum_{m,n}\left[ v_{mn}(\log v_{mn} - 1) - \log\Gamma(v_{mn}+1) \right] \qquad (5)$$
From (2) and (5) we can conclude that maximizing the posterior is equivalent to maximizing the log-likelihood, and maximizing the log-likelihood is equivalent to minimizing the KL-divergence, because the remaining sum in (5) depends on V only. Thus, maximizing the posterior is equivalent to minimizing the KL-divergence. Therefore, it is possible to find a base matrix W and a feature matrix H that approximate the dataset V via the maximum a posteriori criterion.
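The relationship between the Poisson log-likelihood and the generalized KL-divergence can be checked numerically. The following is a minimal sketch, assuming a Poisson data model with mean WH; the function names are illustrative and do not come from the paper.

```python
import math
import numpy as np

# Element-wise log Gamma, built from the standard library.
_lgamma = np.vectorize(math.lgamma)

def generalized_kl(V, WH):
    """Generalized KL-divergence D(V | WH); 0 * log(0) is treated as 0."""
    V = np.asarray(V, dtype=float)
    WH = np.asarray(WH, dtype=float)
    mask = V > 0  # zero entries of V contribute nothing to the log term
    log_term = np.sum(V[mask] * np.log(V[mask] / WH[mask]))
    return float(log_term + np.sum(WH - V))

def poisson_log_likelihood(V, WH):
    """log p(V | WH) for independent Poisson entries with mean [WH]_mn > 0."""
    V = np.asarray(V, dtype=float)
    WH = np.asarray(WH, dtype=float)
    return float(np.sum(V * np.log(WH) - WH - _lgamma(V + 1.0)))

def v_only_constant(V):
    """Part of the log-likelihood that depends on V only."""
    V = np.asarray(V, dtype=float)
    mask = V > 0
    v_log_v = np.zeros_like(V)
    v_log_v[mask] = V[mask] * np.log(V[mask])
    return float(np.sum(v_log_v - V - _lgamma(V + 1.0)))
```

Under these assumptions, `poisson_log_likelihood(V, WH)` equals `-generalized_kl(V, WH) + v_only_constant(V)` up to rounding. For binary data the V-only term reduces to minus the number of ones in V, since log Γ(1) = log Γ(2) = 0.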
In the data model $p(V|WH)$ we regard WH as the parameter of the data V. With respect to the base matrix W and the feature matrix H, we also introduce a parameter $\lambda$ as a prior for them. Moreover, we define an independent exponential distribution for each column of W and each row of H with prior $\lambda_k$, because the exponential distribution has sharper performance. Other exponential family distributions, such as the Gaussian distribution or the Gamma distribution, could of course also be chosen. Therefore, the columns of W and rows of H yield:
$$p(w_{mk}|\lambda_k) = \lambda_k e^{-\lambda_k w_{mk}} \qquad (6)$$
$$p(h_{kn}|\lambda_k) = \lambda_k e^{-\lambda_k h_{kn}} \qquad (7)$$
Then the log-likelihood of the priors can be rewritten as:
$$\log p(W|\lambda) = \sum_{m}\sum_{k}\left( \log\lambda_k - \lambda_k w_{mk} \right) \qquad (8)$$
$$\log p(H|\lambda) = \sum_{k}\sum_{n}\left( \log\lambda_k - \lambda_k h_{kn} \right) \qquad (9)$$
Compared with setting $\lambda$ as a constant, the diversity of $\lambda_k$ and its recursive updating enable the inference procedure to converge to a stationary point. By calculating the L2-norm of each column of the base matrix W, we find that the columns finally separate into two clusters. One cluster contains the points whose L2-norm is much larger than 0, whereas in the other cluster the L2-norm values are 0 or almost 0.
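In order to find the best value for $\lambda_k$, we introduce a hyper-prior for $\lambda_k$. Since $\lambda_k$ is the parameter of the exponential distribution, we define that $\lambda_k$ follows a Gamma distribution, which is the conjugate prior for the exponential distribution:
$$p(\lambda_k|a_k, b_k) = \frac{b_k^{a_k}}{\Gamma(a_k)}\,\lambda_k^{a_k-1}\,e^{-b_k\lambda_k} \qquad (10)$$
Here $a_k$ and $b_k$ are the hyper-priors of $\lambda_k$. Thus, the log-likelihood of $\lambda$ is given as:
$$\log p(\lambda|a, b) = \sum_{k}\left( a_k\log b_k - \log\Gamma(a_k) + (a_k - 1)\log\lambda_k - b_k\lambda_k \right) \qquad (11)$$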
3.2. Inference
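After the establishment of the data model and the deduction of the log-likelihood of each prior, we obtain the maximum a posteriori equation:
$$\mathcal{L}_{MAP} \overset{c}{=} \sum_{m,n}\left[ v_{mn}(\log v_{mn} - 1) - \log\Gamma(v_{mn}+1) \right] - \left[ D_{KL}(V|WH) + \sum_{k}\lambda_k\Big( \sum_{m} w_{mk} + \sum_{n} h_{kn} + b_k \Big) - \sum_{k}\Big( (M+N+a_k-1)\log\lambda_k + a_k\log b_k - \log\Gamma(a_k) \Big) \right] \qquad (12)$$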
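For monitoring convergence, the quantity to be minimized can be evaluated directly. The sketch below computes the negative log-posterior up to the term that depends on V only, assuming the Poisson data model, the exponential priors with rate `lam[k]` on column k of W and row k of H, and the Gamma hyper-prior with shape `a` and rate `b[k]`. The function name and argument layout are illustrative assumptions.

```python
import math
import numpy as np

_lgamma = np.vectorize(math.lgamma)

def neg_log_posterior(V, W, H, lam, a, b):
    """KL(V | WH) + sum_k lam_k (sum_m w_mk + sum_n h_kn + b_k)
    - sum_k [(M + N + a - 1) log lam_k + a log b_k - log Gamma(a)].

    lam and b are length-K arrays; a is a scalar or a length-K array.
    All entries of W @ H, lam and b are assumed strictly positive.
    """
    V = np.asarray(V, dtype=float)
    M, N = V.shape
    WH = W @ H
    mask = V > 0  # 0 * log(0) is treated as 0
    kl = np.sum(WH - V) + np.sum(V[mask] * np.log(V[mask] / WH[mask]))
    mass = W.sum(axis=0) + H.sum(axis=1)  # per-component sum of w and h
    prior = np.sum(lam * (mass + b))
    prior -= np.sum((M + N + a - 1.0) * np.log(lam))
    prior -= np.sum(a * np.log(b) - _lgamma(a))
    return float(kl + prior)
```

Evaluating this value once per iteration gives a simple way to see whether a given update scheme is actually decreasing the criterion.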
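Since the first term in (12) does not depend on the priors, and we have discussed the relationship between the posterior probability and the KL-divergence, we minimize the second term to seek the solutions for this criterion. In this paper, we choose gradient descent as our updating rule. Although the multiplicative method is simpler, it gives no detailed deduction of why the approach works. In contrast, gradient descent updating gives a clear deduction of the whole updating procedure. We use this method to infer the priors W and H, as well as the hyper-priors $\lambda$ and b. First we find the gradients with respect to the parameters, writing $f$ for the second term in (12):
$$\frac{\partial f}{\partial w_{mk}} = -\sum_{n}\frac{v_{mn}}{[WH]_{mn}}h_{kn} + \sum_{n} h_{kn} + \lambda_k \qquad (13)$$
$$\frac{\partial f}{\partial h_{kn}} = -\sum_{m}\frac{v_{mn}}{[WH]_{mn}}w_{mk} + \sum_{m} w_{mk} + \lambda_k \qquad (14)$$
$$\frac{\partial f}{\partial \lambda_k} = \sum_{m} w_{mk} + \sum_{n} h_{kn} + b_k - \frac{M+N+a_k-1}{\lambda_k} \qquad (15)$$
$$\frac{\partial f}{\partial b_k} = \lambda_k - \frac{a_k}{b_k} \qquad (16)$$
Then we choose the gradient step size so that the subtraction disappears from the updates of W and H, which guarantees the non-negativity constraint. The parameters $\lambda_k$ and $b_k$ are updated by setting their gradients to zero.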
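The updating rules are listed as follows:
$$w_{mk} \leftarrow w_{mk}\,\frac{\sum_{n}\frac{v_{mn}}{[WH]_{mn}}h_{kn}}{\sum_{n} h_{kn} + \lambda_k} \qquad (17)$$
$$h_{kn} \leftarrow h_{kn}\,\frac{\sum_{m}\frac{v_{mn}}{[WH]_{mn}}w_{mk}}{\sum_{m} w_{mk} + \lambda_k} \qquad (18)$$
$$\lambda_k = \frac{M+N+a_k-1}{\sum_{m} w_{mk} + \sum_{n} h_{kn} + b_k} \qquad (19)$$
$$b_k = \frac{a_k}{\lambda_k} \qquad (20)$$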
Then we find the correct bases and determine the order of the data model by:
$$R = |B| \qquad (21)$$
where B is defined as the set of columns of W whose L2-norm is not zero or almost zero, and R is the rank of the base matrix.
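The inference loop and the rank selection step can be sketched together as follows. This is a minimal illustration, not the authors' implementation. It assumes multiplicative-form updates for W and H derived from the Poisson/KL data model with exponential priors, and closed-form updates for `lam` and `b` obtained by zeroing their gradients. The names `multilevel_nmf` and `estimate_rank`, the initial values `lam0` and `b0`, the iteration count, the stabilizer `eps`, and the threshold `tol` are all assumptions made for the example.

```python
import numpy as np

def multilevel_nmf(V, K, a=2.0, b0=0.05, lam0=0.05, n_iter=200,
                   eps=1e-12, seed=0):
    """Sketch of the W, H, lambda, b updates (all defaults are assumptions)."""
    V = np.asarray(V, dtype=float)
    M, N = V.shape
    rng = np.random.default_rng(seed)
    W = rng.random((M, K)) + eps
    H = rng.random((K, N)) + eps
    lam = np.full(K, lam0)
    b = np.full(K, b0)
    for _ in range(n_iter):
        # W update: data-fit term divided by (row sums of H + lambda_k).
        ratio = V / (W @ H + eps)
        W *= (ratio @ H.T) / (H.sum(axis=1) + lam + eps)
        # H update, using the refreshed W.
        ratio = V / (W @ H + eps)
        H *= (W.T @ ratio) / ((W.sum(axis=0) + lam)[:, None] + eps)
        # lambda_k and b_k: closed forms from setting the gradients to zero.
        lam = (M + N + a - 1.0) / (W.sum(axis=0) + H.sum(axis=1) + b + eps)
        b = a / lam
    return W, H, lam, b

def estimate_rank(W, tol=1e-3):
    """Count columns of W whose normalized L2-norm exceeds tol (assumed)."""
    norms = np.linalg.norm(W, axis=0)
    peak = norms.max()
    if peak == 0:
        return 0, np.array([], dtype=int), norms
    normalized = norms / peak
    B = np.flatnonzero(normalized > tol)  # indices of the retained bases
    return int(B.size), B, normalized
```

A typical call would be `W, H, lam, b = multilevel_nmf(V, K=16)` followed by `R, B, norms = estimate_rank(W)`. The threshold only needs to separate the near-zero cluster from the larger-value cluster, so its exact value should matter little when the two clusters are well separated.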
4. Experimental Results and Evaluation
In this section, we apply our unsupervised multi-level NMF algorithm to two binary datasets. One is the fence dataset, and the other is the well-known swimmer dataset. Both experimental results demonstrate the efficacy of our method on the rank determination issue.
4.1. Fence Dataset
We first performed our experiments on the fence dataset. Here we defined the data with four row bars (of size 1 × 32) and four column bars (of size 32 × 1). The size of each image is 32 × 32 with a zero-valued background, and the value of each pixel in the eight bars is one. Each image is separated into five parts in both the horizontal and the vertical direction. Additionally, in each image the number of row bars and the number of column bars must be the same. For instance, if there are two row bars in a sample image, then there must be two column bars in this image. Hence, the total number of images in the fence dataset is N = 69. Samples of the fence dataset are shown in Figure 2.
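The count N = 69 follows from choosing j of the four row bars and j of the four column bars for j = 1, ..., 4, which gives 16 + 36 + 16 + 1 = 69 images. The sketch below generates such a dataset. The bar positions `(6, 12, 18, 24)` are an assumption, since the text only states that the bars split each image into five parts, and the function names are illustrative.

```python
from itertools import combinations
import numpy as np

def make_fence_dataset(size=32, positions=(6, 12, 18, 24)):
    """Binary fence images with equal numbers of row and column bars.

    The bar positions are an illustrative assumption.
    """
    images = []
    for n_bars in range(1, len(positions) + 1):
        for rows in combinations(positions, n_bars):
            for cols in combinations(positions, n_bars):
                img = np.zeros((size, size), dtype=np.uint8)
                img[list(rows), :] = 1  # row bars, each 1 x size
                img[:, list(cols)] = 1  # column bars, each size x 1
                images.append(img)
    return np.stack(images)

def to_data_matrix(images):
    """Stack each image as one column of V, using column-major order."""
    return np.stack([img.flatten(order="F") for img in images], axis=1)
```

With the defaults, `make_fence_dataset()` has shape `(69, 32, 32)` and `to_data_matrix` returns V with shape `(1024, 69)`. Column-major stacking is chosen here so that a column bar maps to contiguous entries of a data vector while a row bar maps to separated entries.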
Here, we set the initial rank K = 16 (the initial value of the rank K needs to be larger than the real rank of the base matrix) and the hyper-parameter a = 2, with the remaining initial values set to 0.05 and 0.05. Figure 3 shows the base matrix and feature matrix learned via our unsupervised multi-level NMF approach. We can see that the data is sparse, especially the base matrix. In both images, the colored parts denote the effective bases or features, and the black parts denote irrelevant bases or features. In addition, from an image processing perspective, we can conclude that, compared with the values of effective bases and features, the values of irrelevant bases and features are very small, since the color of such pixels is very dark. There are clearly eight colored column vectors in the first image. Among these eight vectors, four are composed of several separated colored pixels, whereas the other four are composed of contiguous pixels. The former four vectors are row bars, and the latter four are column bars. We reshape each image into a column vector during the factorization procedure; hence the row bars and column bars have different structures. Furthermore, there are also eight rows in the second image, which are the corresponding coefficients of the bases.
Figure 2. Sample images of fence dataset.
Figure 3. Base matrix W and feature matrix H learned via our algorithm.
In order to show the bases clearly, we draw the bases in Figure 4. Although we set the initial rank of the base matrix to K = 16, only eight images have non-zero values. Moreover, the eight images show 4 row bars and 4 column bars appearing in different positions. The results are perfectly consistent with the design of the fence dataset. Therefore, we conclude that our algorithm is powerful and efficient at finding the real basic components and the correct rank.
Figure 4. The bases obtained by our algorithm on fence dataset.
4.2. Swimmer Dataset
The other dataset we used is the swimmer dataset. The swimmer dataset is a typical dataset for feature extraction. Owing to its clear definition and its composition of 16 dynamic parts, it is well suited to the characteristic of the NMF algorithm, which is to learn part-based data. However, the swimmer dataset is a gray-level image dataset. In our experiment, we focus on binary datasets, so we first convert this gray-level dataset to a binary dataset and then apply our approach to perform inference. The swimmer dataset contains 256 images in total, each of which depicts a swimming gesture using one torso and four dynamic limbs. The size of each image is 32 × 32. Each dynamic part can appear at four different positions. Figure 5 shows some sample images of the swimmer dataset.
Figure 5. Sample images of the swimmer dataset.
In this experiment, the initial rank is set to K = 25, and the initial values of the hyper-parameters are a = 2, with the remaining values set to 0.05 and 0.05. Figure 6 shows the experimental results for the swimmer dataset. It can be observed that for this dataset, too, our algorithm finds the correct bases. In this figure there are 25 base images. The black ones correspond to irrelevant bases, and the other 17 images depict the torso and the limbs at each possible position. We can see that the correct torso and limbs are discovered successfully.
Figure 6. The bases of swimmer dataset learned by our algorithm.
The differences between the black images and the correct base images are shown in Figure 7. Figure 7 depicts the L2-norm of each column of the base matrix. The total number of points in this figure equals the initial rank. Clearly, the points fall into two clusters: a zero-value cluster and a larger-value cluster. Thus the rank of the base matrix for the swimmer dataset is R = |B| = 17. The L2-norms of the base matrix not only tell us how to find the correct bases, but also tell us how to determine the correct rank of the base matrix.
Figure 7. L2-norm of base vectors (x-axis: base vector index, 0 to 25; y-axis: normalized L2 norm).
supervised multi-level non-
