Unsupervised Feature Selection for Multi-View Clustering
on Text-Image Web News Data
Mingjie Qian
Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, IL, USA
mqian2@illinois.edu
Chengxiang Zhai
Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, IL, USA
czhai@illinois.edu
ABSTRACT
Unlabeled high-dimensional text-image web news data are produced every day, presenting new challenges for unsupervised feature selection on multi-view data. State-of-the-art multi-view unsupervised feature selection methods learn pseudo class labels by spectral analysis, which is sensitive to the choice of similarity metric for each view. For text-image data, the raw text itself contains more discriminative information than a similarity graph, which loses information during construction; the text features can therefore be used directly for label learning, avoiding the information loss incurred by spectral analysis. We propose a new multi-view unsupervised feature selection method in which orthogonal nonnegative matrix factorization, regularized by local learning on the image view, is used to learn pseudo labels, while robust joint $\ell_{2,1}$-norm minimization is simultaneously performed to select discriminative features. Cross-view consensus on the pseudo labels is thereby preserved as much as possible. We systematically evaluate the proposed method on multi-view text-image web news datasets. Extensive experiments on web news data crawled from two major US media channels, CNN and FOXNews, demonstrate the efficacy of the new method over state-of-the-art multi-view and single-view unsupervised feature selection methods.
Categories and Subject Descriptors
I.5.2 [Pattern Recognition]: Design Methodology—Fea-
ture Evaluation and Selection
Keywords
Multi-View Unsupervised Feature Selection
1. INTRODUCTION
Reading web news articles is an important part of people's daily life, especially in the current "big data" era, in which we face a large amount of information every day due to
the advancement of information technology. One natural approach is to automatically group web news articles by content into multiple clusters, e.g., technology and health care, so that one can read the latest and most representative news articles in a group of interest. This procedure can be applied recursively, allowing one to explore the news hierarchically at different resolutions. Clustering web news is also an effective way to organize, manage, and search news articles. Unlike traditional document clustering, images play an important role in web news articles, as is evident from the fact that almost every news article has an associated picture. Effectively and efficiently grouping web news articles of multiple modalities is challenging because different data types have different properties and different feature spaces, and because the dimensionality of these feature spaces is usually very high; in the text feature space, for example, the vocabulary size can exceed a million. In addition, there are many unrelated and noisy features, which often lead to low efficiency and poor performance.
Multi-view unsupervised feature selection is desirable for solving the problem above, since it selects the most discriminative features while exploiting the consensus among multiple views in an unsupervised fashion. The feature set can be dramatically reduced and the feature quality greatly enhanced; as a result, computation becomes more efficient and clustering performance can also be greatly improved. However, not much work has been done that solves this problem well, especially for multi-view clustering on web news data. State-of-the-art unsupervised feature selection methods for multi-view data [2, 13] use spectral clustering across different views to learn the most consistent pseudo class labels and simultaneously use the learned labels to guide feature selection. More specifically, Adaptive Unsupervised Multi-view Feature Selection (AUMFS) [2] applies spectral clustering to a combined data similarity graph built from the different views to learn the labels with the most cross-view consensus, and then uses $\ell_{2,1}$-norm regularized robust sparse regression to learn a single weight matrix over the features of all views that best approximates the cluster labels. [13] presents another unsupervised multi-view feature selection method, Multi-View Feature Selection (MVFS). MVFS also applies spectral clustering to the combined similarity graph, but learns one weight matrix per view to fit the learned pseudo class labels via a joint squared Frobenius norm (fitting term) and $\ell_{2,1}$-norm (row-wise sparsity-inducing term). Both [2] and [13] share the disadvantage of being sensitive to the combined data similarity graph, especially when the feature space contains many unrelated and noisy features, and information is lost during graph construction.
We propose to directly use the raw features of the main view (e.g., text for text-image web news data) to learn pseudo cluster labels that also have as much consensus as possible with the other views (e.g., image). In this way, discriminative features win out during feature selection and contribute more to the label learning process, and in return the improved cluster labels help select more discriminative features for each view. Technically, we propose a new method called Multi-View Unsupervised Feature Selection (MVUFS) for unsupervised feature selection in multi-view clustering, with a particular focus on analyzing text-image web news data. We minimize the sum of a regularized data matrix factorization error and a data fitting error in a unified optimization framework: local-learning-regularized orthogonal nonnegative matrix factorization is used to learn pseudo cluster labels, while row-wise sparse weight matrices for each view are simultaneously learned by joint $\ell_{2,1}$-norm minimization guided by the learned pseudo labels. The label learning and feature selection processes are thus mutually enhanced. For label learning, we factorize the data matrix of the main view (e.g., text) and require the learned indicator matrix to be as consistent as possible with local learning predictors on the other views (e.g., image). To objectively evaluate the new method, we build two text-image web news datasets from two major US news media web sites, CNN and FOXNews. Our extensive experiments show that MVUFS significantly outperforms state-of-the-art single-view and multi-view unsupervised feature selection methods.
2. NOTATIONS AND PRELIMINARIES
Throughout this paper, matrices are written as boldface capital letters and vectors as boldface lowercase letters. For a matrix $\mathbf{M} = (m_{ij})$, its $i$-th row and $j$-th column are denoted by $\mathbf{m}^i$ and $\mathbf{m}_j$, respectively. $\|\mathbf{M}\|_F$ is the Frobenius norm of $\mathbf{M}$. For any matrix $\mathbf{M} \in \mathbb{R}^{r \times t}$, its $\ell_{2,1}$-norm is defined as
$$\|\mathbf{M}\|_{2,1} = \sum_{i=1}^{r} \sqrt{\sum_{j=1}^{t} m_{ij}^2} = \sum_{i=1}^{r} \left\|\mathbf{m}^i\right\|_2.$$
Assume that we have $n$ instances $\mathcal{X} = \{x_i\}_{i=1}^{n}$. Let $\mathbf{X}_v \in \mathbb{R}^{n \times d_v}$ denote the data matrix in the $v$-th view, where the $i$-th row $\mathbf{x}_v^i \in \mathbb{R}^{d_v}$ is the feature descriptor of the $i$-th instance in the $v$-th view. For text-image web news data, $\mathbf{X}_1$ is the text-view data matrix and $\mathbf{X}_2$ is the image-view data matrix. Suppose these $n$ instances are sampled from $c$ classes, and denote $\mathbf{Y} = [\mathbf{y}_1, \cdots, \mathbf{y}_n]^T \in \{0,1\}^{n \times c}$, where $\mathbf{y}_i \in \{0,1\}^{c \times 1}$ is the cluster indicator vector for $x_i$. The scaled cluster indicator matrix $\mathbf{G}$ is defined as $\mathbf{G} = [\mathbf{g}_1, \cdots, \mathbf{g}_n]^T = \mathbf{Y}(\mathbf{Y}^T\mathbf{Y})^{-\frac{1}{2}}$, where $\mathbf{g}_i$ is the scaled cluster indicator of $x_i$. It can be seen that $\mathbf{G}^T\mathbf{G} = \mathbf{I}_c$, where $\mathbf{I}_c \in \mathbb{R}^{c \times c}$ is an identity matrix.
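As a concrete illustration (our own sketch, not part of the original paper), the following NumPy snippet computes $\|\mathbf{M}\|_{2,1}$ and builds the scaled indicator $\mathbf{G}$ from hard labels, verifying $\mathbf{G}^T\mathbf{G} = \mathbf{I}_c$:

```python
import numpy as np

def l21_norm(M):
    # ||M||_{2,1}: sum of the l2-norms of the rows of M
    return np.sqrt((M ** 2).sum(axis=1)).sum()

def scaled_indicator(labels, c):
    # G = Y (Y^T Y)^{-1/2}: each one-hot column of Y is rescaled by
    # 1/sqrt(cluster size), so that G^T G = I_c
    n = len(labels)
    Y = np.zeros((n, c))
    Y[np.arange(n), labels] = 1.0
    return Y / np.sqrt(Y.sum(axis=0))

G = scaled_indicator(np.array([0, 0, 1, 2, 1]), 3)
assert np.allclose(G.T @ G, np.eye(3))
```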
2.1 Local learning regularization
It is often easier to produce good predictions on local regions of the input space than to search for a good global predictor $f$, because the function class may not contain a good predictor for the entire input space, and it is usually more effective to minimize the prediction cost within each local region. We adopt the local learning regularization proposed in [3]. Let $\mathcal{N}(x_i)$ denote the neighborhood of $x_i$; the local learning regularization minimizes the sum of squared errors between the local prediction from $\mathcal{N}(x_i)$ and the cluster assignment of $x_i$:
$$\sum_{k=1}^{K}\sum_{i=1}^{n}\left(f_i^k(x_i) - g_{ik}\right)^2 = \sum_{k=1}^{K}\sum_{i=1}^{n}\left(\mathbf{k}_i^T\left(\mathbf{K}_i + n_i\lambda\mathbf{I}\right)^{-1}\mathbf{g}_i^k - g_{ik}\right)^2 = \sum_{k=1}^{K}\sum_{i=1}^{n}\left(\boldsymbol{\alpha}_i^T\mathbf{g}_i^k - g_{ik}\right)^2 = \mathrm{Tr}\left(\mathbf{G}^T\mathbf{L}^{llr}\mathbf{G}\right),$$
where $f_i^k(x_i)$ is the locally predicted label for the $k$-th cluster from $\mathcal{N}(x_i)$, $\lambda$ is a positive parameter, $\mathbf{K}_i$ is the kernel matrix defined on the neighborhood $\mathcal{N}(x_i)$ of $x_i$, whose size is $n_i$, $\mathbf{k}_i$ is the kernel vector defined between $x_i$ and $\mathcal{N}(x_i)$, $\mathbf{g}_i^k$ collects the cluster assignments of $\mathcal{N}(x_i)$, $\mathbf{L}^{llr} = (\mathbf{A}-\mathbf{I})^T(\mathbf{A}-\mathbf{I})$, $\mathbf{I} \in \mathbb{R}^{n \times n}$ is an identity matrix, and $\mathbf{A} \in \mathbb{R}^{n \times n}$ is defined by
$$A_{ij} = \begin{cases} \alpha_{ij}, & \text{if } x_j \in \mathcal{N}(x_i), \\ 0, & \text{otherwise.} \end{cases}$$
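The regularizer can be assembled as follows (a small dense sketch of our own, assuming a Gaussian kernel and $k$-nearest-neighbor neighborhoods; suitable only for modest $n$):

```python
import numpy as np

def local_learning_laplacian(X, k=5, lam=1.0, gamma=1.0):
    """L_llr = (A - I)^T (A - I), with A_ij = alpha_ij for x_j in N(x_i),
    where alpha_i = (K_i + n_i * lam * I)^{-1} k_i (Gaussian kernel)."""
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    K = np.exp(-gamma * sq)                              # Gaussian kernel matrix
    A = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(sq[i])[1:k + 1]                 # N(x_i), excluding x_i itself
        Ki = K[np.ix_(idx, idx)]                         # kernel on the neighborhood
        ki = K[i, idx]                                   # kernel vector to x_i
        A[i, idx] = np.linalg.solve(Ki + k * lam * np.eye(k), ki)
    D = A - np.eye(n)
    return D.T @ D
```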
3. OPTIMIZATION PROBLEM
MVUFS solves the following optimization problem:
$$\begin{aligned}
\min\quad & \|\mathbf{X}_1 - \mathbf{G}\mathbf{F}\|_F^2 + \mathrm{Tr}\left[\mathbf{G}^T\mathbf{L}_2^{llr}\mathbf{G}\right] + \alpha\sum_{v=1}^{2}\|\mathbf{G} - \mathbf{X}_v\mathbf{W}_v\|_{2,1} + \beta\sum_{v=1}^{2}\|\mathbf{W}_v\|_{2,1}\\
\mathrm{s.t.}\quad & \mathbf{G}^T\mathbf{G} = \mathbf{I}_c,\ \mathbf{G} \geq 0,\ \mathbf{F} \geq 0,\ \mathbf{W}_v \in \mathbb{R}^{d_v \times c} \qquad (1)
\end{aligned}$$
where $\alpha, \beta$ are nonnegative parameters. To learn the most consistent pseudo labels across views, we apply orthogonal nonnegative matrix factorization to the text view, regularized by the local learning prediction error $\mathbf{L}_2^{llr}$ on the image view. $\mathbf{F}$ is the basis matrix, with each row being a cluster center. The fitting term $\sum_{v=1}^{2}\|\mathbf{G} - \mathbf{X}_v\mathbf{W}_v\|_{2,1}$ also pushes the pseudo labels to be close to the linear predictions of the feature weight matrices for each view, which yields the desired mutual reinforcement between label learning and feature selection. The nonnegativity and orthogonality constraints imposed on the cluster indicator matrix encourage a single nonzero positive entry in each row of the label matrix. For feature selection, we adopt joint $\ell_{2,1}$-norm minimization [6] to learn row-wise sparse weight matrices for each view. The sparsity-inducing property of the $\ell_{2,1}$-norm pushes the feature selection matrix $\mathbf{W}_v$ to be sparse in rows; more specifically, $\mathbf{w}_v^j$ shrinks to zero if the $j$-th feature is weakly correlated with the pseudo labels. We can thus filter out the features corresponding to zero rows of $\mathbf{W}_v$.
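For clarity, the full objective of problem (1) can be evaluated as follows (our own NumPy sketch; matrix shapes follow Section 2):

```python
import numpy as np

def l21(M):
    return np.sqrt((M ** 2).sum(axis=1)).sum()

def mvufs_objective(X1, Xs, G, F, Ws, L2, alpha, beta):
    """Objective of problem (1): text-view factorization error, image-view
    local learning regularization, and the two l2,1 coupling terms.
    Xs = [X1, X2]; Ws = [W1, W2]; L2 is L^llr on the image view."""
    obj = np.linalg.norm(X1 - G @ F) ** 2                       # ||X1 - GF||_F^2
    obj += np.trace(G.T @ L2 @ G)                               # Tr(G^T L2 G)
    obj += alpha * sum(l21(G - X @ W) for X, W in zip(Xs, Ws))  # fitting term
    obj += beta * sum(l21(W) for W in Ws)                       # sparsity term
    return obj
```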
We apply alternating optimization to solve problem (1).
To optimize $\mathbf{G}$ given $\mathbf{F}$, $\mathbf{W}_v$ ($v = 1, 2$), and the previous iterate $\mathbf{G}_t$, we solve the following subproblem:
$$\begin{aligned}
\min\quad & \|\mathbf{X}_1 - \mathbf{G}\mathbf{F}\|_F^2 + \mathrm{Tr}\left[\mathbf{G}^T\mathbf{L}_2^{llr}\mathbf{G}\right] + \alpha\sum_{v=1}^{2}\|\mathbf{D}_v\mathbf{G} - \mathbf{D}_v\mathbf{X}_v\mathbf{W}_v\|_F^2\\
\mathrm{s.t.}\quad & \mathbf{G}^T\mathbf{G} = \mathbf{I}_c,\ \mathbf{G} \geq 0, \qquad (2)
\end{aligned}$$
where $\mathbf{D}_v$ is a diagonal matrix with $D_{ii}^v = \frac{1}{2^{0.5}}\left\|\mathbf{g}_t^i - \mathbf{x}_v^i\mathbf{W}_v\right\|_2^{-0.5}$.
It can be proved (we omit the proof due to the space limit) that if $\mathbf{G}_{t+1}$ is the solution of problem (2), then $\mathbf{G}_{t+1}$ monotonically decreases the objective function of problem (1). Denoting the objective function in problem (2) by $J(\mathbf{G})$, the Lagrange function is $\mathcal{L}(\mathbf{G}, \boldsymbol{\Lambda}, \boldsymbol{\Sigma}) = J(\mathbf{G}) - \mathrm{Tr}\left[\boldsymbol{\Lambda}\left(\mathbf{G}^T\mathbf{G} - \mathbf{I}\right)\right] - \mathrm{Tr}\left[\boldsymbol{\Sigma}^T\mathbf{G}\right]$. The optimal $\mathbf{G}$ must satisfy the KKT conditions:
$$\nabla J(\mathbf{G}) - 2\mathbf{G}\boldsymbol{\Lambda} - \boldsymbol{\Sigma} = \mathbf{0}, \quad \mathbf{G}^T\mathbf{G} = \mathbf{I}, \quad \boldsymbol{\Sigma} \odot \mathbf{G} = \mathbf{0},\ \boldsymbol{\Sigma} \geq 0,\ \mathbf{G} \geq 0.$$
Since the updated $\mathbf{G}$ is guaranteed to be nonnegative, we can ignore $\boldsymbol{\Sigma}$; we thus have $\frac{\partial J}{\partial \mathbf{G}} - 2\mathbf{G}\boldsymbol{\Lambda} = \mathbf{0}$, giving $\boldsymbol{\Lambda} = \frac{1}{2}\mathbf{G}^T\frac{\partial J}{\partial \mathbf{G}}$. We first decompose $\mathbf{W}_v = \mathbf{W}_v^+ - \mathbf{W}_v^-$ and $\boldsymbol{\Lambda} = \boldsymbol{\Lambda}^+ - \boldsymbol{\Lambda}^-$, where
$$\boldsymbol{\Lambda}^+ = \mathbf{G}^T\mathbf{G}\mathbf{F}\mathbf{F}^T + \mathbf{G}^T\mathbf{L}_2^{llr+}\mathbf{G} + \alpha\mathbf{G}^T\sum_{v=1}^{2}\mathbf{D}_v^2\mathbf{G} + \alpha\mathbf{G}^T\sum_{v=1}^{2}\mathbf{D}_v^2\mathbf{X}_v\mathbf{W}_v^-,$$
$$\boldsymbol{\Lambda}^- = \mathbf{G}^T\mathbf{X}_1\mathbf{F}^T + \mathbf{G}^T\mathbf{L}_2^{llr-}\mathbf{G} + \alpha\mathbf{G}^T\sum_{v=1}^{2}\mathbf{D}_v^2\mathbf{X}_v\mathbf{W}_v^+.$$
We then obtain the following update formula for $\mathbf{G}$ by applying the auxiliary function approach in [11]:
$$G_{ik} \leftarrow G_{ik}\,\frac{\left[\mathbf{X}_1\mathbf{F}^T + \mathbf{L}_2^{llr-}\mathbf{G} + \alpha\sum_{v=1}^{2}\mathbf{D}_v^2\mathbf{X}_v\mathbf{W}_v^+ + \mathbf{G}\boldsymbol{\Lambda}^+\right]_{ik}}{\left[\mathbf{G}\mathbf{F}\mathbf{F}^T + \mathbf{L}_2^{llr+}\mathbf{G} + \alpha\sum_{v=1}^{2}\mathbf{D}_v^2\mathbf{G} + \alpha\sum_{v=1}^{2}\mathbf{D}_v^2\mathbf{X}_v\mathbf{W}_v^- + \mathbf{G}\boldsymbol{\Lambda}^-\right]_{ik}}, \qquad (3)$$
followed by column-wise normalization. Upon convergence, we have $\left(\nabla J(\mathbf{G}) - 2\mathbf{G}\boldsymbol{\Lambda}\right) \odot \mathbf{G} = \mathbf{0}$, which is exactly the KKT complementary slackness condition.
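To make the update concrete, here is a NumPy sketch of one iteration of Eq. (3) (our own illustration under the derivation above, not the authors' released code). It exploits the fact that $\boldsymbol{\Lambda}^+$ equals $\mathbf{G}^T$ times the denominator terms and $\boldsymbol{\Lambda}^-$ equals $\mathbf{G}^T$ times the numerator terms:

```python
import numpy as np

def update_G(G, F, X1, Xs, Ws, L2, alpha, eps=1e-12):
    """One multiplicative update of G per Eq. (3), then column-wise
    normalization. Xs = [X1, X2], Ws = [W1, W2]; L2 is L^llr on view 2."""
    Lp, Lm = np.maximum(L2, 0), np.maximum(-L2, 0)   # L2 = L+ - L-
    Wp = [np.maximum(W, 0) for W in Ws]              # W_v = W+ - W-
    Wm = [np.maximum(-W, 0) for W in Ws]
    # diag(D_v^2) = 1 / (2 ||g_t^i - x_v^i W_v||_2), stored as vectors
    d2 = [0.5 / (np.sqrt(((G - X @ W) ** 2).sum(1)) + eps) for X, W in zip(Xs, Ws)]
    num = X1 @ F.T + Lm @ G + alpha * sum(d[:, None] * (X @ W)
                                          for d, X, W in zip(d2, Xs, Wp))
    den = (G @ F @ F.T + Lp @ G + alpha * sum(d[:, None] * G for d in d2)
           + alpha * sum(d[:, None] * (X @ W) for d, X, W in zip(d2, Xs, Wm)))
    LamP, LamM = G.T @ den, G.T @ num                # Lambda^+ and Lambda^-
    G = G * (num + G @ LamP) / (den + G @ LamM + eps)
    return G / (np.linalg.norm(G, axis=0) + eps)     # column-wise normalization
```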
To optimize $\mathbf{F}$, we solve the subproblem $\min_{\mathbf{F} \geq 0} \|\mathbf{X}_1 - \mathbf{G}\mathbf{F}\|_F^2$. Since the objective function is quadratic and the columns of $\mathbf{F}$ are mutually independent, we can use blockwise coordinate descent to update one row at a time in cyclic order, and the objective function value is guaranteed to decrease. The update formula for $\mathbf{F}$ is
$$\mathbf{F}_{i:} \leftarrow \max\left(0,\ \mathbf{F}_{i:} - \frac{\left[\mathbf{G}^T\mathbf{G}\right]_{i:}\mathbf{F} - \left[\mathbf{G}^T\mathbf{X}_1\right]_{i:}}{\left[\mathbf{G}^T\mathbf{G}\right]_{ii}}\right). \qquad (4)$$
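A direct transcription of Eq. (4) (a sketch assuming NumPy; the cyclic order follows the text):

```python
import numpy as np

def update_F(F, G, X1):
    """Cyclic row-wise coordinate descent for min_{F >= 0} ||X1 - G F||_F^2,
    Eq. (4). Each row update is an exact projected minimization, so the
    objective value never increases."""
    GtG, GtX = G.T @ G, G.T @ X1
    for i in range(F.shape[0]):
        F[i, :] = np.maximum(0.0, F[i, :] - (GtG[i, :] @ F - GtX[i, :]) / GtG[i, i])
    return F
```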
To optimize $\mathbf{W}_v$, we need to solve the unconstrained problem $\min_{\mathbf{W}_v \in \mathbb{R}^{d_v \times c}} \alpha\|\mathbf{G} - \mathbf{X}_v\mathbf{W}_v\|_{2,1} + \beta\|\mathbf{W}_v\|_{2,1}$ for each view. Several optimization strategies can solve this problem; here we adopt the simple algorithm given in [6], sketched below.
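For reference, a compact iteratively reweighted least-squares rendering of this subproblem (our own simplification in the spirit of [6], not the exact algorithm of that paper):

```python
import numpy as np

def solve_W(X, G, alpha, beta, n_iter=30, eps=1e-12):
    """min_W alpha * ||G - X W||_{2,1} + beta * ||W||_{2,1} via iterative
    reweighting: each iteration solves a weighted ridge system."""
    d = X.shape[1]
    W = np.linalg.solve(X.T @ X + np.eye(d), X.T @ G)          # ridge warm start
    for _ in range(n_iter):
        dr = 0.5 / (np.sqrt(((G - X @ W) ** 2).sum(1)) + eps)  # residual row weights
        dw = 0.5 / (np.sqrt((W ** 2).sum(1)) + eps)            # coefficient row weights
        A = alpha * (X.T * dr) @ X + beta * np.diag(dw)
        W = np.linalg.solve(A, alpha * (X.T * dr) @ G)
    return W
```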
Algorithm 1 MVUFS
Input: $\{\mathbf{X}_v, p_v\}_{v=1}^{2}$, $\mathbf{L}_2^{llr}$, $\alpha$, $\beta$
Output: $p_v$ features for the $v$-th view, $v = 1, 2$
1: Initialize $\mathbf{G}_0$ s.t. $\mathbf{G}_0^T\mathbf{G}_0 = \mathbf{I}$ (e.g., by K-means) and $\mathbf{F}_0 = \mathbf{G}_0^T\mathbf{X}_1$, $t \leftarrow 0$
2: while not convergent do
3:   Given $\mathbf{G}_t$ and $\mathbf{F}_t$, compute $\mathbf{W}_v^{t+1}$ as in [6]
4:   Given $\mathbf{W}_v^{t+1}$ and $\mathbf{F}_t$, compute $\mathbf{G}_{t+1}$ by Eq. (3)
5:   Given $\mathbf{W}_v^{t+1}$ and $\mathbf{G}_{t+1}$, compute $\mathbf{F}_{t+1}$ by Eq. (4)
6:   $t \leftarrow t + 1$
7: end while
8: for $v = 1$ to $2$ do
9:   Sort all $d_v$ features by $\|\mathbf{w}_v^i\|_2$ in descending order and select the top $p_v$ ranked features for the $v$-th view
10: end for
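Putting the pieces together, a minimal driver for Algorithm 1 could look as follows (our own sketch, reusing the helper sketches above; a balanced random initialization stands in for the K-means initialization of step 1):

```python
import numpy as np

def mvufs(X1, X2, L2, c, alpha, beta, p=(300, 150), n_iter=50, seed=0):
    """End-to-end MVUFS sketch: returns the indices of the top p_v
    features per view, ranked by the row norms of W_v."""
    rng = np.random.default_rng(seed)
    labels = rng.permutation(np.arange(X1.shape[0]) % c)  # every cluster non-empty
    G = scaled_indicator(labels, c)                       # step 1 (K-means preferred)
    F = G.T @ X1                                          # F_0 = G_0^T X_1
    Xs = [X1, X2]
    for _ in range(n_iter):                               # steps 2-7
        Ws = [solve_W(Xv, G, alpha, beta) for Xv in Xs]   # step 3
        G = update_G(G, F, X1, Xs, Ws, L2, alpha)         # step 4, Eq. (3)
        F = update_F(F, G, X1)                            # step 5, Eq. (4)
    # steps 8-10: rank features by ||w_v^i||_2 and keep the top p_v
    return [np.argsort(-np.sqrt((W ** 2).sum(1)))[:pv] for W, pv in zip(Ws, p)]
```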
4. EXPERIMENTS
4.1 Datasets
We crawled CNN and FOXNews web news from Jan. 1, 2014 to Apr. 4, 2014. The category information contained in the RSS feed of each news article can be viewed as reliable ground truth. Titles, abstracts, and body text are extracted as the text-view data, and the image associated with each article is stored as the image-view data. Since the vocabulary has a very long-tailed word distribution, we filtered out words that occur five times or fewer. All text content is stemmed with the Porter stemmer [8], and we use $\ell_2$-normalized TF-IDF as the text features. For image features, we use 7 groups of color features: RGB dominant color, HSV dominant color, RGB color moment, HSV color moment, RGB color histogram, HSV color histogram, and color coherence vector [7]; and 5 textural features: four Tamura textural features [12] (coarseness, contrast, directionality, line-likeness) and the Gabor transform [4, 10]. The dataset statistics are summarized in Table 1.

Table 1: Dataset description.
Dataset | # Instances | # Words | # IMG-features | # Classes
CNN     | 2107        | 7989    | 996            | 7
FOX     | 1523        | 5477    | 996            | 4
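A rough reconstruction of the text-view pipeline (tooling is our assumption: scikit-learn's TfidfVectorizer and NLTK's Porter stemmer; the paper's cutoff of five total occurrences is approximated here by a document-frequency threshold):

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def stem_tokens(doc):
    # whitespace tokenization + Porter stemming, as in [8]
    return [stemmer.stem(tok) for tok in doc.split()]

# l2-normalized TF-IDF; min_df=6 approximates dropping words that occur
# <= 5 times in the crawl (the paper filters by total count, not df).
vectorizer = TfidfVectorizer(tokenizer=stem_tokens, min_df=6, norm='l2')
# X1 = vectorizer.fit_transform(corpus)  # corpus: list of article texts -> n x d1
```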
4.2 Settings
Two widely used metrics for measuring clustering performance are adopted: accuracy (ACC) and normalized mutual information (NMI). We compare MVUFS with: K-means on text with all features (KM-TXT); K-means on images with all features (KM-IMG); state-of-the-art single-view unsupervised feature selection methods, namely NDFS [5] (joint nonnegative spectral analysis and $\ell_{2,1}$-norm regularized regression) and RUFS [9] (joint local-learning-regularized robust NMF and robust $\ell_{2,1}$-norm regression); multi-view spherical K-means with all features (MVSKM) [1]; and state-of-the-art multi-view unsupervised feature selection methods, namely AUMFS [2] (spectral clustering and $\ell_{2,1}$-norm regularized robust sparse regression) and MVFS [13] (spectral clustering and $\ell_{2,1}$-norm regression). For the single-view feature selection methods, K-means is used to compute clustering performance; for the multi-view feature selection methods, multi-view spherical K-means [1] is used for multi-view clustering. We set the neighborhood size to 5. We use cosine similarity to build the text graph and a Gaussian kernel for the image graph. All feature selection methods have two parameters: $\alpha$ for regression and $\beta$ for sparsity control. We grid-search $\alpha$ over $\{10^{-2}, 10^{-1}, \ldots, 10^{2}\}$ and $\beta$ over $\alpha \times \{10^{-2}, 10^{-1}, \ldots, 10^{2}\}$. We vary the number of selected text features over $\{100, 300, 500, 700, 900\}$; the number of selected image features is half the number of selected text features. Since K-means depends on initialization, we repeat clustering 10 times with random initializations.
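For completeness, the two metrics can be computed as follows (a sketch assuming SciPy and scikit-learn; ACC uses the standard Hungarian matching protocol):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one mapping between predicted clusters and true
    classes, found by the Hungarian algorithm on the contingency table."""
    c = int(max(y_true.max(), y_pred.max())) + 1
    cont = np.zeros((c, c), dtype=int)
    for t, p in zip(y_true, y_pred):
        cont[t, p] += 1
    rows, cols = linear_sum_assignment(-cont)   # maximize matched counts
    return cont[rows, cols].sum() / len(y_true)

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])
print(clustering_accuracy(y_true, y_pred))            # 1.0 (perfect up to relabeling)
print(normalized_mutual_info_score(y_true, y_pred))   # 1.0
```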
4.3 Results
We need to answer several questions. First, is multi-view clustering always better than single-view clustering? From Table 2, Table 3, and Figure 1, we can see that the answer is no; it depends on the feature quality of the different views. Here, the color and texture features used for the image view are not tightly tied to the underlying cluster structure, which severely hurts the performance of multi-view clustering (MVSKM behaves much worse than KM-TXT). Fortunately, if discriminative features are selected by multi-view feature selection methods, multi-view clustering performance can be significantly improved and can surpass single-view performance; for example, MVUFS significantly outperforms all single-view methods. Second, is multi-view feature selection better than single-view feature selection? We see that AUMFS, MVFS, and MVUFS outperform standard single-view feature selection methods such as NDFS and RUFS, which indicates that different views can mutually bootstrap each other. It is interesting that both NDFS and RUFS perform even worse than using all features without selection.
Table 2: Clustering results (ACC% ± std); * indicates statistical significance at the 5% level.
Dataset | KM-TXT   | KM-IMG   | NDFS     | RUFS      | MVSKM    | AUMFS    | MVFS     | MVUFS
CNN     | 50.1±7.2 | 23.2±1.0 | 31.6±6.1 | 31.3±5.3  | 32.0±2.8 | 54.2±4.6 | 50.2±4.8 | 57.9±4.9*
FOX     | 76.2±7.7 | 43.0±0.3 | 56.6±9.3 | 61.2±8.3  | 73.3±2.1 | 83.7±1.3 | 84.7±0.6 | 87.9±1.0*

Table 3: Clustering results (NMI% ± std); * indicates statistical significance at the 5% level.
Dataset | KM-TXT   | KM-IMG   | NDFS     | RUFS      | MVSKM    | AUMFS    | MVFS     | MVUFS
CNN     | 42.0±4.3 | 3.7±0.1  | 21.1±5.5 | 22.8±4.9  | 16.6±1.1 | 36.4±3.2 | 30.8±2.5 | 44.1±2.4*
FOX     | 67.3±6.1 | 7.6±0.3  | 37.3±8.5 | 42.6±12.5 | 50.0±1.8 | 64.4±0.9 | 66.5±0.6 | 72.1±0.5*

[Figure 1: ACC and NMI with varying numbers of selected features, with panels for CNN and FOX; the compared methods are KM-TXT, MVSKM, NDFS, RUFS, AUMFS, MVFS, and MVUFS.]

[Figure 2: ACC on FOX vs. the number of selected features and $\alpha$ (with $\beta = 1000$), and vs. the number of selected features and $\beta$ (with $\alpha = 10$), for MVUFS.]
Finally, MVUFS outperforms both the single-view and the multi-view clustering and feature selection methods. Since the major difference between MVUFS and AUMFS/MVFS lies in label learning, we conclude that directly learning labels from the raw features of one view, while enforcing as much consensus as possible with the other views, selects a more discriminative feature set for all views, whereas spectral clustering relies on the combined similarity graphs of all views, which may lose discriminative information and undermine performance.
4.4 Parameter Analysis
Figure 2 plots ACC versus different values of $\alpha$, $\beta$, and the number of selected features on FOXNews for MVUFS (similar figures for NMI and for the CNN dataset are omitted due to the space limit). We see that an appropriate combination of these parameters is crucial. However, it is theoretically unknown how to choose the best parameter setting, which may depend on the dataset and the measure. In practice, as with many other methods, one can build a modest-size validation set to tune the parameters, e.g., by grid search.
5. CONCLUSION
We propose a new unsupervised feature selection method for multi-view clustering, MVUFS, in which local-learning-regularized orthogonal nonnegative matrix factorization is performed to learn pseudo class labels on raw features. We built two text-image web news datasets from CNN and FOXNews, and systematically evaluated MVUFS against state-of-the-art single-view and multi-view unsupervised feature selection methods. Experimental results validate the effectiveness of the proposed method.
Acknowledgments
This material is based upon work supported by the National
Science Foundation under Grant Number CNS-1027965.
6. REFERENCES
[1] S. Bickel and T. Scheffer. Multi-view clustering. In Proceedings
of the Fourth IEEE International Conference on Data
Mining, pages 19–26. IEEE Computer Society, 2004.
[2] Y. Feng, J. Xiao, Y. Zhuang, and X. Liu. Adaptive
unsupervised multi-view feature selection for visual concept
recognition. In Proceedings of the 11th Asian conference on
Computer Vision-Volume Part I, pages 343–357.
Springer-Verlag, 2012.
[3] Q. Gu and J. Zhou. Local learning regularized nonnegative
matrix factorization. In Twenty-First International Joint
Conference on Artificial Intelligence, 2009.
[4] T. Lee. Image representation using 2d gabor wavelets. Pattern
Analysis and Machine Intelligence, IEEE Transactions on,
18(10):959–971, 1996.
[5] Z. Li, Y. Yang, J. Liu, X. Zhou, and H. Lu. Unsupervised
feature selection using nonnegative spectral analysis. In 26th
AAAI Conference on Artificial Intelligence, 2012.
[6] F. Nie, H. Huang, X. Cai, and C. Ding. Efficient and robust
feature selection via joint l2, 1-norms minimization. Advances
in Neural Information Processing Systems, 23:1813–1821,
2010.
[7] G. Pass, R. Zabih, and J. Miller. Comparing images using color
coherence vectors. In Proceedings of the fourth ACM
international conference on Multimedia. ACM, 1997.
[8] M. Porter. An algorithm for suffix stripping. Program: electronic library and information systems, 14(3):130–137, 1980.
[9] M. Qian and C. Zhai. Robust unsupervised feature selection. In
Proceedings of the Twenty-Third international joint
conference on Artificial Intelligence, pages 1621–1627. AAAI
Press, 2013.
[10] Y. Ro, M. Kim, H. Kang, B. Manjunath, and J. Kim. Mpeg-7
homogeneous texture descriptor. ETRI journal, 23(2):41–51,
2001.
[11] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, 13:556–562, 2001.
[12] H. Tamura, S. Mori, and T. Yamawaki. Textural features
corresponding to visual perception. Systems, Man and
Cybernetics, IEEE Transactions on, 8(6), 1978.
[13] J. Tang, X. Hu, H. Gao, and H. Liu. Unsupervised feature selection for multi-view data in social media. In Proceedings of the 13th SIAM International Conference on Data Mining. SIAM, 2013.