Unsupervised Feature Selection for Multi-View Clustering
on Text-Image Web News Data
Mingjie Qian
Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, IL, USA
mqian2@illinois.edu
Chengxiang Zhai
Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, IL, USA
czhai@illinois.edu
ABSTRACT
Unlabeled high-dimensional text-image web news data are produced every day, presenting new challenges to unsupervised feature selection on multi-view data. State-of-the-art multi-view unsupervised feature selection methods learn pseudo class labels by spectral analysis, which is sensitive to the choice of similarity metric for each view. For text-image data, the raw text itself contains more discriminative information than a similarity graph, which loses information during construction; the text features can therefore be used directly for label learning, avoiding the information loss incurred by spectral analysis. We propose a new multi-view unsupervised feature selection method in which local learning regularized orthogonal nonnegative matrix factorization (with the local learning regularizer computed on the image view) is used to learn pseudo labels, and robust joint $\ell_{2,1}$-norm minimization is simultaneously performed to select discriminative features. Cross-view consensus on the pseudo labels is thereby enforced as much as possible. We systematically evaluate the proposed method on multi-view text-image web news datasets. Our extensive experiments on web news datasets crawled from two major US media channels, CNN and FOXNews, demonstrate the efficacy of the new method over state-of-the-art multi-view and single-view unsupervised feature selection methods.
Categories and Subject Descriptors
I.5.2 [Pattern Recognition]: Design Methodology—Feature Evaluation and Selection
Keywords
Multi-View Unsupervised Feature Selection
1. INTRODUCTION
Reading web news articles is an important part of people's daily life, especially in the current "big data" era, in which we face a large amount of information every day due to
the advancement of information technology. One ideal approach is to automatically group web news by content into multiple clusters, e.g., technology and health care, so that one can read the latest and most representative news articles in a group of interest. This procedure can be applied recursively, so that one can explore the news hierarchically at different resolutions. Clustering web news is also an effective way to organize, manage, and search news articles. Unlike traditional document clustering, images play an important role in web news articles, as is evident from the fact that almost all news articles have an associated picture. Effectively and efficiently grouping web news articles of multiple modalities is challenging, because different data types have different properties and different feature spaces, and because the dimensionality of these feature spaces is usually very high. For example, in the text feature space, the vocabulary size can be over a million. Besides, there are many unrelated and noisy features, which often lead to low efficiency and poor performance.
Multi-view unsupervised feature selection is desirable for solving the problem mentioned above, since it can select the most discriminative features while exploiting the consensus among data of multiple views in an unsupervised fashion. The feature set can be drastically reduced and feature quality greatly enhanced; as a result, not only does computation become more efficient, but clustering performance can also be greatly improved. However, not much work has been done to solve this problem well, especially for multi-view clustering on web news data. State-of-the-art unsupervised feature selection methods [2, 13] for multi-view data use spectral clustering across different views to learn the most consistent pseudo class labels and simultaneously use the learned labels for feature selection. More specifically, Adaptive Unsupervised Multi-view Feature Selection (AUMFS) [2] applies spectral clustering to a data similarity graph combined from different views to learn the labels that have the most consensus across views, and then uses $\ell_{2,1}$-norm regularized robust sparse regression to learn one weight matrix for all the features of different views to best approximate the cluster labels. [13] presents another unsupervised multi-view feature selection method called Multi-View Feature Selection (MVFS). MVFS also applies spectral clustering to the combined data similarity graph, but learns one weight matrix per view to best fit the learned pseudo class labels by jointly minimizing a squared Frobenius norm (the fitting term) and an $\ell_{2,1}$-norm (the row-wise sparsity-inducing regularizer). Both [2] and [13] share the disadvantage that they are sensitive to the combined data similarity graph, especially when there are many unrelated and noisy features in the feature space, and information is lost during graph construction.
We propose to directly utilize raw features in the main view (e.g., text for text-image web news data) to learn pseudo cluster labels that also have the most consensus with the other views (e.g., image). Meanwhile, the discriminative features favored by the feature selection process contribute more to the label learning process, and in return the improved cluster labels help select more discriminative features for each view. Technically, we propose a new method called Multi-View Unsupervised Feature Selection (MVUFS) to perform unsupervised feature selection for multi-view clustering, with a particular focus on analyzing text-image web news data. We propose to minimize the sum of a regularized data matrix factorization error and a data fitting error in a unified optimization setting. We use local learning regularized orthogonal nonnegative matrix factorization to learn pseudo cluster labels, and simultaneously learn row-wise sparse weight matrices for each view by joint $\ell_{2,1}$-norm minimization guided by the learned pseudo cluster labels. The label learning process and the feature selection process are thus mutually enhanced. For label learning, we factorize the data matrix in the main view (e.g., text) and require the learned indicator matrix to be as consistent as possible with local learning predictions on the other views (e.g., image). To objectively evaluate the new method, we build two text-image web news datasets from two major US news media web sites: CNN and FOXNews. Our extensive experiments show that MVUFS significantly outperforms state-of-the-art single-view and multi-view unsupervised feature selection methods.
2. NOTATIONS AND PRELIMINARIES
Throughout this paper, matrices are written as boldface capital letters and vectors as boldface lowercase letters. For a matrix $\mathbf{M} = (m_{ij})$, its $i$-th row and $j$-th column are denoted by $\mathbf{m}^i$ and $\mathbf{m}_j$, respectively. $\|\mathbf{M}\|_F$ is the Frobenius norm of $\mathbf{M}$. For any matrix $\mathbf{M} \in \mathbb{R}^{r \times t}$, its $\ell_{2,1}$-norm is defined as
$$\|\mathbf{M}\|_{2,1} = \sum_{i=1}^{r} \sqrt{\sum_{j=1}^{t} m_{ij}^2} = \sum_{i=1}^{r} \left\|\mathbf{m}^i\right\|_2.$$
Assume that we have $n$ instances $\mathcal{X} = \{x_i\}_{i=1}^{n}$. Let $\mathbf{X}_v \in \mathbb{R}^{n \times d_v}$ denote the data matrix in the $v$-th view, where the $i$-th row $\mathbf{x}_v^i \in \mathbb{R}^{d_v}$ is the feature descriptor of the $i$-th instance in the $v$-th view. For text-image web news data, $\mathbf{X}_1$ is the text view data matrix and $\mathbf{X}_2$ is the image view data matrix. Suppose these $n$ instances are sampled from $c$ classes, and denote $\mathbf{Y} = [\mathbf{y}_1, \cdots, \mathbf{y}_n]^T \in \{0,1\}^{n \times c}$, where $\mathbf{y}_i \in \{0,1\}^{c \times 1}$ is the cluster indicator vector for $x_i$. The scaled cluster indicator matrix $\mathbf{G}$ is defined as $\mathbf{G} = [\mathbf{g}_1, \cdots, \mathbf{g}_n]^T = \mathbf{Y}\left(\mathbf{Y}^T\mathbf{Y}\right)^{-\frac{1}{2}}$, where $\mathbf{g}_i$ is the scaled cluster indicator of $x_i$. It can be seen that $\mathbf{G}^T\mathbf{G} = \mathbf{I}_c$, where $\mathbf{I}_c \in \mathbb{R}^{c \times c}$ is an identity matrix.
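To make the notation concrete, here is a minimal NumPy sketch (ours, not the authors' code; the function names are illustrative) of the $\ell_{2,1}$-norm and the scaled cluster indicator matrix:

```python
import numpy as np

def l21_norm(M):
    """l2,1-norm: sum of the l2-norms of the rows of M."""
    return np.linalg.norm(M, axis=1).sum()

def scaled_indicator(labels, c):
    """G = Y (Y^T Y)^(-1/2) built from integer cluster labels."""
    n = len(labels)
    Y = np.zeros((n, c))
    Y[np.arange(n), labels] = 1.0
    # Y^T Y is diagonal (the cluster sizes), so its inverse square root is
    # cheap; assumes every cluster is non-empty.
    return Y / np.sqrt(Y.sum(axis=0))

# G^T G = I_c holds by construction:
G = scaled_indicator(np.array([0, 0, 1, 2, 2, 2]), 3)
assert np.allclose(G.T @ G, np.eye(3))
```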
2.1 Local learning regularization
It is often easier to produce good predictions on local regions of the input space than to search for a good global predictor $f$, because the function set $\{f(x)\}$ may not contain a good predictor for the entire input space, and it is usually more effective to minimize the prediction cost for each local region. We adopt the local learning regularization proposed in [3]. Let $\mathcal{N}(x_i)$ denote the neighborhood of $x_i$. The local learning regularization aims to minimize the sum of prediction errors between the local prediction from $\mathcal{N}(x_i)$ and the cluster assignment of $x_i$:
$$\sum_{k=1}^{c}\sum_{i=1}^{n}\left(f_i^k(\mathbf{x}_i) - g_{ik}\right)^2 = \sum_{k=1}^{c}\sum_{i=1}^{n}\left(\mathbf{k}_i^T\left(\mathbf{K}_i + n_i\lambda\mathbf{I}\right)^{-1}\mathbf{g}_i^k - g_{ik}\right)^2 = \sum_{k=1}^{c}\sum_{i=1}^{n}\left(\boldsymbol{\alpha}_i^T\mathbf{g}_i^k - g_{ik}\right)^2 = \operatorname{Tr}\left(\mathbf{G}^T\mathbf{L}^{llr}\mathbf{G}\right),$$
where $f_i^k(\mathbf{x}_i)$ is the locally predicted label for the $k$-th cluster from $\mathcal{N}(x_i)$, $\lambda$ is a positive parameter, $\mathbf{K}_i$ is the kernel matrix defined on the neighborhood $\mathcal{N}(x_i)$ of size $n_i$, $\mathbf{k}_i$ is the kernel vector defined between $x_i$ and $\mathcal{N}(x_i)$, $\mathbf{g}_i^k$ collects the cluster assignments of $\mathcal{N}(x_i)$, $\mathbf{L}^{llr} = (\mathbf{A} - \mathbf{I})^T(\mathbf{A} - \mathbf{I})$, $\mathbf{I} \in \mathbb{R}^{n \times n}$ is an identity matrix, and $\mathbf{A} \in \mathbb{R}^{n \times n}$ is defined by
$$A_{ij} = \begin{cases} \alpha_{ij}, & \text{if } x_j \in \mathcal{N}(x_i), \\ 0, & \text{otherwise.} \end{cases}$$
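As a concrete illustration, the following sketch (our reading of [3], with illustrative function names; it assumes an RBF kernel and scikit-learn) builds $\mathbf{A}$ row by row via $\boldsymbol{\alpha}_i^T = \mathbf{k}_i^T(\mathbf{K}_i + n_i\lambda\mathbf{I})^{-1}$ and then forms $\mathbf{L}^{llr}$:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.neighbors import NearestNeighbors

def local_learning_laplacian(X, n_neighbors=5, lam=1.0, gamma=1.0):
    """L_llr = (A - I)^T (A - I), with row i of A holding alpha_i^T at the
    columns of x_i's neighbors (a sketch, not the authors' implementation)."""
    n = X.shape[0]
    # +1 because each point is returned as its own nearest neighbor.
    _, idx = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X).kneighbors(X)
    A = np.zeros((n, n))
    for i in range(n):
        nb = idx[i, 1:]                                     # N(x_i), size n_i
        Ki = rbf_kernel(X[nb], X[nb], gamma=gamma)          # kernel on N(x_i)
        ki = rbf_kernel(X[i:i + 1], X[nb], gamma=gamma)[0]  # kernel to N(x_i)
        # alpha_i = (K_i + n_i * lam * I)^{-1} k_i  (K_i is symmetric PSD)
        A[i, nb] = np.linalg.solve(Ki + len(nb) * lam * np.eye(len(nb)), ki)
    M = A - np.eye(n)
    return M.T @ M
```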
3. OPTIMIZATION PROBLEM
MVUFS solves the following optimization problem:
$$\begin{aligned}\min \quad & \|\mathbf{X}_1 - \mathbf{G}\mathbf{F}\|_F^2 + \operatorname{Tr}\left(\mathbf{G}^T\mathbf{L}_2^{llr}\mathbf{G}\right) + \alpha\sum_{v=1}^{2}\|\mathbf{G} - \mathbf{X}_v\mathbf{W}_v\|_{2,1} + \beta\sum_{v=1}^{2}\|\mathbf{W}_v\|_{2,1}\\ \text{s.t.} \quad & \mathbf{G}^T\mathbf{G} = \mathbf{I}_c,\ \mathbf{G} \geq 0,\ \mathbf{F} \geq 0,\ \mathbf{W}_v \in \mathbb{R}^{d_v \times c}, \qquad (1)\end{aligned}$$
where $\alpha, \beta$ are nonnegative parameters and $\mathbf{L}_2^{llr}$ is the local learning regularization matrix computed on the image view. To learn the most consistent pseudo labels across different views, we use orthogonal nonnegative matrix factorization on the text view, regularized by the local learning prediction error on the image view. $\mathbf{F}$ is the basis matrix, with each row being a cluster center. The fitting term $\sum_{v=1}^{2}\|\mathbf{G} - \mathbf{X}_v\mathbf{W}_v\|_{2,1}$ also pushes the pseudo labels to be close to the linear predictions given by the feature weight matrices of each view, which yields the desired mutual reinforcement between label learning and feature selection. The nonnegativity and orthogonality constraints imposed on the cluster indicator matrix encourage a single nonzero positive entry in each row of the label matrix. For feature selection, we adopt joint $\ell_{2,1}$-norm minimization [6] to learn row-wise sparse weight matrices for each view. The sparsity-inducing property of the $\ell_{2,1}$-norm pushes the feature selection matrix $\mathbf{W}_v$ to be sparse in rows. More specifically, $\mathbf{w}_v^j$ shrinks to zero if the $j$-th feature is weakly correlated with the pseudo labels. We can thus filter out the features corresponding to zero rows of $\mathbf{W}_v$.
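For reference, a small NumPy sketch (ours; the names are illustrative) that evaluates objective (1), which is handy for monitoring the monotone decrease of the alternating optimization described next:

```python
import numpy as np

def l21(M):
    """Sum of row l2-norms."""
    return np.linalg.norm(M, axis=1).sum()

def mvufs_objective(Xs, G, F, Ws, L2, alpha, beta):
    """Objective (1); Xs = [X1, X2], Ws = [W1, W2], L2 = image-view L_llr."""
    obj = np.linalg.norm(Xs[0] - G @ F) ** 2                        # factorization error
    obj += np.trace(G.T @ L2 @ G)                                   # local learning term
    obj += alpha * sum(l21(G - Xv @ Wv) for Xv, Wv in zip(Xs, Ws))  # fitting term
    obj += beta * sum(l21(Wv) for Wv in Ws)                         # row-sparsity term
    return obj
```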
We apply alternating optimization to solve problem (1). To optimize $\mathbf{G}$ given $\mathbf{F}$, $\mathbf{W}_v$ ($v = 1, 2$), and the $\mathbf{G}_t$ from the last iteration, we solve the following subproblem:
$$\begin{aligned}\min \quad & \|\mathbf{X}_1 - \mathbf{G}\mathbf{F}\|_F^2 + \operatorname{Tr}\left(\mathbf{G}^T\mathbf{L}_2^{llr}\mathbf{G}\right) + \alpha\sum_{v=1}^{2}\|\mathbf{D}_v\mathbf{G} - \mathbf{D}_v\mathbf{X}_v\mathbf{W}_v\|_F^2\\ \text{s.t.} \quad & \mathbf{G}^T\mathbf{G} = \mathbf{I}_c,\ \mathbf{G} \geq 0, \qquad (2)\end{aligned}$$
where $\mathbf{D}_v$ is a diagonal matrix with $D_{ii}^v = \frac{1}{2^{0.5}}\left\|\mathbf{g}_t^i - \mathbf{x}_v^i\mathbf{W}_v\right\|_2^{-0.5}$.
It can be proved (we omit the proof due to the space limit) that if $\mathbf{G}_{t+1}$ is the solution of problem (2), then $\mathbf{G}_{t+1}$ monotonically decreases the objective function of problem (1). Denote the objective function of problem (2) by $J(\mathbf{G})$; the Lagrangian is
$$L(\mathbf{G}, \boldsymbol{\Lambda}, \boldsymbol{\Sigma}) = J(\mathbf{G}) - \operatorname{Tr}\left(\boldsymbol{\Lambda}\left(\mathbf{G}^T\mathbf{G} - \mathbf{I}\right)\right) - \operatorname{Tr}\left(\boldsymbol{\Sigma}^T\mathbf{G}\right).$$
The optimal $\mathbf{G}$ must satisfy the KKT conditions:
$$\frac{\partial J}{\partial \mathbf{G}} - 2\mathbf{G}\boldsymbol{\Lambda} - \boldsymbol{\Sigma} = \mathbf{0}, \qquad \mathbf{G}^T\mathbf{G} = \mathbf{I}, \qquad \boldsymbol{\Sigma} \odot \mathbf{G} = \mathbf{0},\ \boldsymbol{\Sigma} \geq 0,\ \mathbf{G} \geq 0.$$
Since the updated $\mathbf{G}$ is guaranteed to be nonnegative, we can ignore $\boldsymbol{\Sigma}$; we thus have $\frac{\partial J}{\partial \mathbf{G}} - 2\mathbf{G}\boldsymbol{\Lambda} = \mathbf{0}$, giving $\boldsymbol{\Lambda} = \frac{1}{2}\mathbf{G}^T\frac{\partial J}{\partial \mathbf{G}}$. We first decompose $\mathbf{W}_v = \mathbf{W}_v^+ - \mathbf{W}_v^-$ and $\boldsymbol{\Lambda} = \boldsymbol{\Lambda}^+ - \boldsymbol{\Lambda}^-$, where
$$\boldsymbol{\Lambda}^+ = \mathbf{G}^T\mathbf{G}\mathbf{F}\mathbf{F}^T + \mathbf{G}^T\mathbf{L}_2^{llr+}\mathbf{G} + \alpha\mathbf{G}^T\sum_{v=1}^{2}\mathbf{D}_v^2\mathbf{G} + \alpha\mathbf{G}^T\sum_{v=1}^{2}\mathbf{D}_v^2\mathbf{X}_v\mathbf{W}_v^-,$$
$$\boldsymbol{\Lambda}^- = \mathbf{G}^T\mathbf{X}_1\mathbf{F}^T + \mathbf{G}^T\mathbf{L}_2^{llr-}\mathbf{G} + \alpha\mathbf{G}^T\sum_{v=1}^{2}\mathbf{D}_v^2\mathbf{X}_v\mathbf{W}_v^+,$$
and $\mathbf{L}_2^{llr} = \mathbf{L}_2^{llr+} - \mathbf{L}_2^{llr-}$ splits the matrix into its elementwise positive and negative parts. We then obtain the following update formula for $\mathbf{G}$ by applying the auxiliary function approach in [11]:
$$G_{ik} \leftarrow G_{ik}\,\frac{\left[\mathbf{X}_1\mathbf{F}^T + \mathbf{L}_2^{llr-}\mathbf{G} + \alpha\sum_{v=1}^{2}\mathbf{D}_v^2\mathbf{X}_v\mathbf{W}_v^+ + \mathbf{G}\boldsymbol{\Lambda}^+\right]_{ik}}{\left[\mathbf{G}\mathbf{F}\mathbf{F}^T + \mathbf{L}_2^{llr+}\mathbf{G} + \alpha\sum_{v=1}^{2}\mathbf{D}_v^2\mathbf{G} + \alpha\sum_{v=1}^{2}\mathbf{D}_v^2\mathbf{X}_v\mathbf{W}_v^- + \mathbf{G}\boldsymbol{\Lambda}^-\right]_{ik}}, \qquad (3)$$
followed by column-wise normalization. At convergence, we have $\left(\frac{\partial J}{\partial \mathbf{G}} - 2\mathbf{G}\boldsymbol{\Lambda}\right)_{ik} G_{ik} = 0$, which is exactly the KKT complementary slackness condition.
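In code, one sweep of this multiplicative update might look as follows (a sketch under our reconstruction of Eq. (3); the elementwise positive/negative splits, the small EPS guard, and the helper names are ours):

```python
import numpy as np

EPS = 1e-12

def d_squared(G, X, W):
    """Squared diagonal of D_v, i.e. (2^-0.5 * ||g_t^i - x_v^i W_v||^-0.5)^2."""
    r = np.linalg.norm(G - X @ W, axis=1)
    return 0.5 / (r + EPS)

def update_G(G, F, Xs, Ws, L2, alpha):
    """One multiplicative step of Eq. (3) plus column-wise normalization."""
    pos = lambda M: np.maximum(M, 0.0)
    num = Xs[0] @ F.T + pos(-L2) @ G            # uses L2 = L2^+ - L2^-
    den = G @ F @ F.T + pos(L2) @ G
    for Xv, Wv in zip(Xs, Ws):
        D2 = d_squared(G, Xv, Wv)[:, None]      # row scaling by diag(D_v^2)
        num += alpha * D2 * (Xv @ pos(Wv))      # W_v^+ part
        den += alpha * (D2 * G + D2 * (Xv @ pos(-Wv)))  # W_v^- part
    Lam_p, Lam_n = G.T @ den, G.T @ num         # Lambda^+ and Lambda^-
    G = G * (num + G @ Lam_p) / (den + G @ Lam_n + EPS)
    return G / (np.linalg.norm(G, axis=0, keepdims=True) + EPS)
```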
To optimize $\mathbf{F}$, we solve the subproblem $\min_{\mathbf{F} \geq 0} \|\mathbf{X}_1 - \mathbf{G}\mathbf{F}\|_F^2$. Since the objective function is quadratic and the columns of $\mathbf{F}$ are mutually independent, we can use block-wise coordinate descent, updating one row at a time in cyclic order; the objective function value is guaranteed to decrease. The update formula for $\mathbf{F}$ is
$$\mathbf{F}_{i:} \leftarrow \max\left(0,\ \mathbf{F}_{i:} - \frac{\left[\mathbf{G}^T\mathbf{G}\right]_{i:}\mathbf{F} - \left[\mathbf{G}^T\mathbf{X}_1\right]_{i:}}{\left[\mathbf{G}^T\mathbf{G}\right]_{ii}}\right). \qquad (4)$$
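A direct NumPy transcription of Eq. (4) (a sketch; one full cyclic sweep over the rows of $\mathbf{F}$):

```python
import numpy as np

def update_F(G, X1, F):
    """Cyclic row updates for min_{F >= 0} ||X1 - G F||_F^2 (Eq. (4))."""
    GtG, GtX = G.T @ G, G.T @ X1
    for i in range(F.shape[0]):
        # Projected coordinate step on row i; [G^T G]_ii > 0 since G^T G = I_c.
        F[i] = np.maximum(0.0, F[i] - (GtG[i] @ F - GtX[i]) / GtG[i, i])
    return F
```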
To optimize $\mathbf{W}_v$, we need to solve the unconstrained problem $\min_{\mathbf{W}_v \in \mathbb{R}^{d_v \times c}} \alpha\|\mathbf{G} - \mathbf{X}_v\mathbf{W}_v\|_{2,1} + \beta\|\mathbf{W}_v\|_{2,1}$ for each view. Several optimization strategies can solve it; here we adopt the simple algorithm given in [6].
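The following is a minimal iteratively-reweighted-least-squares sketch in the spirit of [6] (our adaptation, not necessarily the exact algorithm the authors used): each iteration reweights the residual rows and the rows of $\mathbf{W}$, then solves a weighted ridge-type linear system.

```python
import numpy as np

EPS = 1e-12

def solve_W(X, G, alpha, beta, n_iter=30):
    """Sketch for min_W alpha * ||G - X W||_{2,1} + beta * ||W||_{2,1}."""
    d = X.shape[1]
    # Ridge initialization (W = 0 would freeze the reweighting).
    W = np.linalg.solve(X.T @ X + beta * np.eye(d), X.T @ G)
    for _ in range(n_iter):
        u = 0.5 / (np.linalg.norm(G - X @ W, axis=1) + EPS)  # residual row weights
        w = 0.5 / (np.linalg.norm(W, axis=1) + EPS)          # W row weights
        XtU = X.T * u                                        # X^T diag(u)
        # Stationarity of the reweighted quadratic surrogate:
        # (alpha X^T U X + beta D_w) W = alpha X^T U G
        W = np.linalg.solve(alpha * XtU @ X + beta * np.diag(w), alpha * XtU @ G)
    return W
```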
Algorithm 1 MVUFS
Input: $\{\mathbf{X}_v, p_v\}_{v=1}^{2}$, $\mathbf{L}_2^{llr}$, $\alpha$, $\beta$
Output: $p_v$ selected features for the $v$-th view, $v = 1, 2$
1: Initialize $\mathbf{G}_0$ s.t. $\mathbf{G}_0^T\mathbf{G}_0 = \mathbf{I}$ (e.g., by K-means), set $\mathbf{F}_0 = \mathbf{G}_0^T\mathbf{X}_1$, $t \leftarrow 0$
2: while not convergent do
3:   Given $\mathbf{G}_t$ and $\mathbf{F}_t$, compute $\mathbf{W}_v^{t+1}$ as in [6]
4:   Given $\mathbf{W}_v^{t+1}$ and $\mathbf{F}_t$, compute $\mathbf{G}_{t+1}$ by Eq. (3)
5:   Given $\mathbf{W}_v^{t+1}$ and $\mathbf{G}_{t+1}$, compute $\mathbf{F}_{t+1}$ by Eq. (4)
6:   $t \leftarrow t + 1$
7: end while
8: for $v = 1$ to $2$ do
9:   Sort all $d_v$ features by $\|\mathbf{w}_v^i\|_2$ in descending order and select the top $p_v$ ranked features for the $v$-th view
10: end for
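Putting the pieces together, a compact driver for Algorithm 1 could look like this (a sketch wiring up the hypothetical helpers `scaled_indicator`, `solve_W`, `update_G`, and `update_F` sketched above; a fixed iteration count stands in for a convergence test):

```python
import numpy as np
from sklearn.cluster import KMeans

def mvufs(Xs, L2, alpha, beta, p, c, n_iter=50, seed=0):
    """Returns the indices of the top p[v] features for each view v."""
    labels = KMeans(n_clusters=c, random_state=seed).fit_predict(Xs[0])
    G = scaled_indicator(labels, c)        # G_0 with G_0^T G_0 = I (step 1)
    F = G.T @ Xs[0]                        # F_0 = G_0^T X_1
    for _ in range(n_iter):                # steps 2-7
        Ws = [solve_W(Xv, G, alpha, beta) for Xv in Xs]  # step 3
        G = update_G(G, F, Xs, Ws, L2, alpha)            # step 4, Eq. (3)
        F = update_F(G, Xs[0], F)                        # step 5, Eq. (4)
    # Steps 8-10: rank each view's features by the l2-norm of W_v's rows.
    return [np.argsort(-np.linalg.norm(Wv, axis=1))[:pv]
            for Wv, pv in zip(Ws, p)]
```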
4. EXPERIMENTS
4.1 Datasets
We crawled CNN and FOXNews web news from Jan. 1, 2014 to Apr. 4, 2014. The category information contained in the RSS feeds for each news article can be viewed as reliable ground truth. Titles, abstracts, and body text are extracted as the text view data, and the image associated with each article is stored as the image view data. Since the vocabulary has a very long-tailed word distribution, we filtered out words that occur five times or fewer. All text content is stemmed by the Porter stemmer [8], and we use $\ell_2$-normalized TF-IDF as the text features. For image features, we use 7 groups of color features: RGB dominant color, HSV dominant color, RGB color moment, HSV color moment, RGB color histogram, HSV color histogram, and color coherence vector [7]; and 5 textural features: four Tamura textural features [12] (coarseness, contrast, directionality, line-likeness) and the Gabor transform [4, 10]. Table 1 summarizes the two datasets.

Table 1: Dataset Description.
Dataset  # Instances  # Words  # IMG-features  # Classes
CNN      2107         7989     996             7
FOX      1523         5477     996             4
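The text-view pipeline can be approximated as follows (a hedged sketch: `TfidfVectorizer`'s `min_df` filters by document frequency whereas the paper filters by collection frequency, NLTK's `PorterStemmer` stands in for the stemmer used, and `news_texts` is a placeholder for the crawled articles):

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def stem_doc(doc):
    """Porter-stem every whitespace-separated token."""
    return " ".join(stemmer.stem(tok) for tok in doc.split())

# l2-normalized TF-IDF over stemmed text, dropping rare words.
vectorizer = TfidfVectorizer(preprocessor=stem_doc, min_df=6, norm="l2")
# X1 = vectorizer.fit_transform(news_texts)  # news_texts: crawled articles
```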
4.2 Settings
We use two widely adopted metrics to measure clustering performance: accuracy (ACC) and Normalized Mutual Information (NMI). We compare MVUFS with K-means on text with all features (KM-TXT); K-means on image with all features (KM-IMG); state-of-the-art single-view unsupervised feature selection methods, namely NDFS [5] (joint nonnegative spectral analysis and $\ell_{2,1}$-norm regularized regression) and RUFS [9] (joint local learning regularized robust NMF and robust $\ell_{2,1}$-norm regression); multi-view spherical K-means with all features (MVSKM) [1]; and state-of-the-art multi-view unsupervised feature selection methods, namely AUMFS [2] (spectral clustering and $\ell_{2,1}$-norm regularized robust sparse regression) and MVFS [13] (spectral clustering and $\ell_{2,1}$-norm regression). For single-view unsupervised feature selection methods, K-means is used to compute the clustering performance; for multi-view unsupervised feature selection methods, multi-view spherical K-means [1] is used for multi-view clustering. We set the neighborhood size to 5. We use cosine similarity to build the text graph and a Gaussian kernel for the image graph. All feature selection methods have two parameters: $\alpha$ for regression and $\beta$ for sparsity control. We grid-search $\alpha$ in $\{10^{-2}, 10^{-1}, \ldots, 10^{2}\}$ and $\beta$ in $\alpha \times \{10^{-2}, 10^{-1}, \ldots, 10^{2}\}$. We vary the number of selected text features over $\{100, 300, 500, 700, 900\}$; the number of selected image features is half the number of selected text features. Since K-means depends on initialization, we repeat clustering 10 times with random initializations.
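For reproducibility, ACC and NMI can be computed as below (a sketch: ACC uses the standard Hungarian matching between cluster ids and class ids, and K-means is repeated over random seeds as described above; the function names are ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def clustering_acc(y_true, y_pred):
    """ACC under the best one-to-one mapping of cluster ids to class ids."""
    c = max(y_true.max(), y_pred.max()) + 1
    count = np.zeros((c, c), dtype=int)
    for t, p in zip(y_true, y_pred):
        count[t, p] += 1
    rows, cols = linear_sum_assignment(-count)   # maximize matched pairs
    return count[rows, cols].sum() / len(y_true)

def evaluate(X_sel, y_true, c, n_runs=10):
    """Repeat K-means with random initializations; average ACC / NMI."""
    accs, nmis = [], []
    for seed in range(n_runs):
        y_pred = KMeans(n_clusters=c, n_init=1, random_state=seed).fit_predict(X_sel)
        accs.append(clustering_acc(y_true, y_pred))
        nmis.append(normalized_mutual_info_score(y_true, y_pred))
    return np.mean(accs), np.std(accs), np.mean(nmis), np.std(nmis)
```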
4.3 Results
We need to answer several questions. First, is multi-view clustering always better than single-view clustering? From Table 2, Table 3, and Figure 1, we can see that the answer is no; it depends on the feature quality of the different views. Here, the color and texture features we used for the image view are not strongly tied to the cluster structure, which severely hurts the performance of multi-view clustering (MVSKM behaves much worse than KM-TXT). Fortunately, if discriminative features are selected by multi-view feature selection methods, multi-view clustering performance can be significantly improved and can surpass single-view performance; for example, MVUFS significantly outperforms all single-view methods. Second, is multi-view feature selection better than single-view feature selection? We see that AUMFS, MVFS, and MVUFS outperform standard single-view feature selection methods such as NDFS and RUFS, which indicates that different views can mutually bootstrap each other. It is interesting to see that both NDFS and RUFS behave even worse than using all features without feature selection.
Table 2: Clustering Results (ACC% ± std); * indicates statistical significance at the 5% level.
Dataset  KM-TXT    KM-IMG    NDFS      RUFS      MVSKM     AUMFS     MVFS      MVUFS
CNN      50.1±7.2  23.2±1.0  31.6±6.1  31.3±5.3  32.0±2.8  54.2±4.6  50.2±4.8  57.9±4.9*
FOX      76.2±7.7  43.0±0.3  56.6±9.3  61.2±8.3  73.3±2.1  83.7±1.3  84.7±0.6  87.9±1.0*

Table 3: Clustering Results (NMI% ± std); * indicates statistical significance at the 5% level.
Dataset  KM-TXT    KM-IMG    NDFS      RUFS      MVSKM     AUMFS     MVFS      MVUFS
CNN      42.0±4.3  3.7±0.1   21.1±5.5  22.8±4.9  16.6±1.1  36.4±3.2  30.8±2.5  44.1±2.4*
FOX      67.3±6.1  7.6±0.3   37.3±8.5  42.6±12.5 50.0±1.8  64.4±0.9  66.5±0.6  72.1±0.5*
[Figure 1: ACC and NMI with a varying number of selected features. Four panels (CNN ACC, FOX ACC, CNN NMI, FOX NMI); x-axis: number of selected features (100-900); curves: KM-TXT, MVSKM, NDFS, RUFS, AUMFS, MVFS, MVUFS.]
[Figure 2: ACC vs. different α, β, and number of selected features on the FOX dataset for MVUFS. Left panel: β = 1000, varying α; right panel: α = 10, varying β.]
Finally, it turns out that MVUFS outperforms both the single-view and the multi-view clustering and feature selection methods. Since the major difference between MVUFS and AUMFS/MVFS lies in label learning, we conclude that directly learning labels from the raw features of one view, while ensuring the most consensus with the other views, can select a more discriminative feature set for all views, whereas spectral clustering relies on the combined similarity graphs of all views, which may lose discriminative information and undermine performance.
4.4 Parameter Analysis
We plot ACC versus different $\alpha$, $\beta$, and numbers of selected features on FOXNews for MVUFS in Figure 2 (similar figures for NMI and for the CNN dataset are omitted due to the space limit). We see that an appropriate combination of these parameters is crucial. However, it is theoretically unknown to us how to choose the best parameter setting; it may depend on the dataset and the measure. In practice, as with many other methods, one can build a modest-sized validation set to tune the parameters, e.g., by grid search.
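Such a grid search might be sketched as follows (illustrative only: the data here are synthetic placeholders, and `local_learning_laplacian`, `mvufs`, and `evaluate` are the hypothetical helpers sketched earlier):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for a small validation split.
X1_val, X2_val = rng.random((200, 500)), rng.random((200, 100))
y_val, c = rng.integers(0, 4, 200), 4
L2_val = local_learning_laplacian(X2_val, n_neighbors=5)

best, best_acc = None, -np.inf
for a in [1e-2, 1e-1, 1e0, 1e1, 1e2]:
    for m in [1e-2, 1e-1, 1e0, 1e1, 1e2]:        # beta = alpha * m
        sel = mvufs([X1_val, X2_val], L2_val, a, a * m, p=[100, 50], c=c)
        acc, _, _, _ = evaluate(X1_val[:, sel[0]], y_val, c)
        if acc > best_acc:
            best, best_acc = (a, a * m), acc
print("best (alpha, beta):", best, "val ACC:", best_acc)
```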
5. CONCLUSION
We propose a new unsupervised feature selection method for multi-view clustering, MVUFS, in which local learning regularized orthogonal nonnegative matrix factorization is performed to learn pseudo class labels on raw features. We built two text-image web news datasets from CNN and FOXNews, and systematically compared MVUFS against state-of-the-art single-view and multi-view unsupervised feature selection methods. The experimental results validate the effectiveness of the proposed method.
Acknowledgments
This material is based upon work supported by the National
Science Foundation under Grant Number CNS-1027965.
6. REFERENCES
[1] S. Bickel and T. Scheffer. Multi-view clustering. In Proceedings
of the Fourth IEEE International Conference on Data
Mining, pages 19–26. IEEE Computer Society, 2004.
[2] Y. Feng, J. Xiao, Y. Zhuang, and X. Liu. Adaptive
unsupervised multi-view feature selection for visual concept
recognition. In Proceedings of the 11th Asian conference on
Computer Vision-Volume Part I, pages 343–357.
Springer-Verlag, 2012.
[3] Q. Gu and J. Zhou. Local learning regularized nonnegative
matrix factorization. In Twenty-First International Joint
Conference on Artificial Intelligence, 2009.
[4] T. Lee. Image representation using 2d gabor wavelets. Pattern
Analysis and Machine Intelligence, IEEE Transactions on,
18(10):959–971, 1996.
[5] Z. Li, Y. Yang, J. Liu, X. Zhou, and H. Lu. Unsupervised
feature selection using nonnegative spectral analysis. In 26th
AAAI Conference on Artificial Intelligence, 2012.
[6] F. Nie, H. Huang, X. Cai, and C. Ding. Efficient and robust feature selection via joint ℓ2,1-norms minimization. Advances in Neural Information Processing Systems, 23:1813–1821, 2010.
[7] G. Pass, R. Zabih, and J. Miller. Comparing images using color
coherence vectors. In Proceedings of the fourth ACM
international conference on Multimedia. ACM, 1997.
[8] M. Porter. An algorithm for suffix stripping. Program: electronic library and information systems, 14(3):130–137, 1980.
[9] M. Qian and C. Zhai. Robust unsupervised feature selection. In
Proceedings of the Twenty-Third international joint
conference on Artificial Intelligence, pages 1621–1627. AAAI
Press, 2013.
[10] Y. Ro, M. Kim, H. Kang, B. Manjunath, and J. Kim. Mpeg-7
homogeneous texture descriptor. ETRI journal, 23(2):41–51,
2001.
[11] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, 13:556–562, 2001.
[12] H. Tamura, S. Mori, and T. Yamawaki. Textural features corresponding to visual perception. Systems, Man and Cybernetics, IEEE Transactions on, 8(6):460–473, 1978.
[13] J. Tang, X. Hu, H. Gao, and H. Liu. Unsupervised feature selection for multi-view data in social media. In Proceedings of the 13th SIAM International Conference on Data Mining. SIAM, 2013.

Supplementary resource (1)

... . MVUFS learns pseudo-labels with local manifold regularization and simultaneously selects discriminative features from each view via sparse projection to the pseudo-label-indicating matrix [22]. RMVFS realizes view-specific sparse feature selection based on the multi-view K-means model [23]. ...
... Considering that the independence between the columns of Y, problem (22) can be achieved by solving ...
Article
Full-text available
Feature selection is a basic and important step in real applications, such as face recognition and image segmentation. In this paper, we propose a new weakly supervised multi-view feature selection method by utilizing pairwise constraints, i.e., the pairwise constraint-guided multi-view feature selection (PCFS for short) method. In this method, linear projections of all views and a consistent similarity graph with pairwise constraints are jointly optimized to learning discriminative projections. Meanwhile, the l2,0-norm-based row sparsity constraint is imposed on the concatenation of projections for discriminative feature selection. Then, an iterative algorithm with theoretically guaranteed convergence is developed for the optimization of PCFS. The performance of the proposed PCFS method was evaluated by comprehensive experiments on six benchmark datasets and applications on cancer clustering. The experimental results demonstrate that PCFS exhibited competitive performance in feature selection in comparison with related models.
... We treat each modality as a view. Fox-News [31] and CNN-News [31] (Fox-N and CNN-N in short) are CNN and FOX web news data, respectively. Each news has an image view and a text view. ...
... We treat each modality as a view. Fox-News [31] and CNN-News [31] (Fox-N and CNN-N in short) are CNN and FOX web news data, respectively. Each news has an image view and a text view. ...
... To achieve feature selection, 2,1 -norm is usually used to impose row sparsity on the v-th feature selection matrix [23]. Based on 2,1 -norm, we can select the top ranked features according to the descending order of 2 -norm of the rows in U (v) . ...
... Since the objective function in Eq. (8) is not convex for the variables U (v) , V, S (v) , R and α (v) simultaneously, we divide (8) into five sub-objective functions, namely, (9), (16), (20), (23) and (26). Hence, we can prove the convergence of Algorithm 1 by proving the monotonic convergence of each sub-objective functions. ...
Preprint
Multi-view unsupervised feature selection (MUFS) has been demonstrated as an effective technique to reduce the dimensionality of multi-view unlabeled data. The existing methods assume that all of views are complete. However, multi-view data are usually incomplete, i.e., a part of instances are presented on some views but not all views. Besides, learning the complete similarity graph, as an important promising technology in existing MUFS methods, cannot achieve due to the missing views. In this paper, we propose a complementary and consensus learning-based incomplete multi-view unsupervised feature selection method (C2^{2}IMUFS) to address the aforementioned issues. Concretely, C2^{2}IMUFS integrates feature selection into an extended weighted non-negative matrix factorization model equipped with adaptive learning of view-weights and a sparse 2,p\ell_{2,p}-norm, which can offer better adaptability and flexibility. By the sparse linear combinations of multiple similarity matrices derived from different views, a complementary learning-guided similarity matrix reconstruction model is presented to obtain the complete similarity graph in each view. Furthermore, C2^{2}IMUFS learns a consensus clustering indicator matrix across different views and embeds it into a spectral graph term to preserve the local geometric structure. Comprehensive experimental results on real-world datasets demonstrate the effectiveness of C2^{2}IMUFS compared with state-of-the-art methods.
... Reis et al. analyze sentiment in 69K headlines collected from The New York Times, BBC, Reuters, and Dailymail (Reis et al. 2015). Qian and Zhai collect news from CNN and Fox News to study unsupervised feature selection on text and image data from news (Qian and Zhai 2014). Saez-Trumper at al. explore different types of bias in news articles from the top 80 news websites during a two-week period (Saez-Trumper, Castillo, and Lalmas 2013). ...
Preprint
Full-text available
The complexity and diversity of today's media landscape provides many challenges for researchers studying news producers. These producers use many different strategies to get their message believed by readers through the writing styles they employ, by repetition across different media sources with or without attribution, as well as other mechanisms that are yet to be studied deeply. To better facilitate systematic studies in this area, we present a large political news data set, containing over 136K news articles, from 92 news sources, collected over 7 months of 2017. These news sources are carefully chosen to include well-established and mainstream sources, maliciously fake sources, satire sources, and hyper-partisan political blogs. In addition to each article we compute 130 content-based and social media engagement features drawn from a wide range of literature on political bias, persuasion, and misinformation. With the release of the data set, we also provide the source code for feature computation. In this paper, we discuss the first release of the data set and demonstrate 4 use cases of the data and features: news characterization, engagement characterization, news attribution and content copying, and discovering news narratives.
... The pseudo-label as a self-supervised signal can lead all views to learn more discriminative features, producing clearer clustering structures as shown in Figure 1. Therefore, self-supervised MVC pretrains sample data to obtain pseudo-labels and achieves good supervision and guidance for downstream clustering tasks through migration and fine-tuning [3,11,[28][29][30][31]. Additionally, the diverse learning methods and self-supervised signal representations bring opportunities and difficulties for future research. ...
Article
Full-text available
In recent years, multi‐view clustering (MVC) has had significant implications in the fields of cross‐modal representation learning and data‐driven decision‐making. Its main objective is to cluster samples into distinct groups by leveraging consistency and complementary information among multiple views. However, the field of computer vision has witnessed the evolution of contrastive learning, and self‐supervised learning has made substantial research progress. Consequently, self‐supervised learning is progressively becoming dominant in MVC methods. It involves designing proxy tasks to extract supervisory information from image and video data, thereby guiding the clustering process. Despite the rapid development of self‐supervised MVC, there is currently no comprehensive survey analysing and summarising the current state of research progress. Hence, the authors aim to explore the emergence of self‐supervised MVC by discussing the reasons and advantages behind it. Additionally, the internal connections and classifications of common datasets, data issues, representation learning methods, and self‐supervised learning methods are investigated. The authors not only introduce the mechanisms for each category of methods, but also provide illustrative examples of their applications. Finally, some open problems are identified for further investigation and development.
... In addition, there are some other methods based on feature selection or certain metric learning for MVC. A novel unsupervised feature selection method for MVC was developed in [133], where local learning was employed to learn pseudo-class labels on the raw features. Xu et al. performed multi-view data clustering and feature selection simultaneously for high-dimensional data [134]. ...
Article
Full-text available
Multi-view clustering (MVC) has attracted more and more attention in the recent few years by making full use of complementary and consensus information between multiple views to cluster objects into different partitions. Although there have been two existing works for MVC survey, neither of them jointly takes the recent popular deep learning-based methods into consideration. Therefore, in this paper, we conduct a comprehensive survey of MVC from the perspective of representation learning. It covers a quantity of multi-view clustering methods including the deep learning-based models, providing a novel taxonomy of the MVC algorithms. Furthermore, the representation learning-based MVC methods can be mainly divided into two categories, i.e., shallow representation learning-based MVC and deep representation learning-based MVC, where the deep learning-based models are capable of handling more complex data structure as well as showing better expression. In the shallow category, according to the means of representation learning, we further split it into two groups, i.e., multi-view graph clustering and multi-view subspace clustering. To be more comprehensive, basic research materials of MVC are provided for readers, containing introductions of the commonly used multi-view datasets with the download link and the open source code library. In the end, some open problems are pointed out for further investigation and development.
... According to the availability of the label information, feature selection can be categorized into unsupervised (He, Cai, and Niyogi 2005), semi-supervised (Benabdeslem and Hindawi 2014), and supervised (Nie et al. 2010) ones. Because of the diverse data structure, algorithms are also developed for multitask (Hernández-Lobato, Hernández-Lobato, and Ghahramani 2015), multi-label (Chang et al. 2014) and multi-view (Qian and Zhai 2014) feature selection. ...
Article
Unsupervised feature selection (UFS) aims to reduce the time complexity and storage burden, as well as improve the generalization performance. Most existing methods convert UFS to supervised learning problem by generating labels with specific techniques (e.g., spectral analysis, matrix factorization and linear predictor). Instead, we proposed a novel coupled analysis-synthesis dictionary learning method, which is free of generating labels. The representation coefficients are used to model the cluster structure and data distribution. Specifically, the synthesis dictionary is used to reconstruct samples, while the analysis dictionary analytically codes the samples and assigns probabilities to the samples. Afterwards, the analysis dictionary is used to select features that can well preserve the data distribution. The effective L2p-norm (0 < p <1) regularization is imposed on the analysis dictionary to get much sparse solution and is more effective in feature selection.We proposed an iterative reweighted least squares algorithm to solve the L2p-norm optimization problem and proved it can converge to a fixed point. Experiments on benchmark datasets validated the effectiveness of the proposed method
Article
Unsupervised feature selection has emerged as a significant and challenging issue in multi-view learning due to the need to process a large amount of unlabeled and high-dimensional multi-view data. Although numerous methods have been developed to this point, the majority of them are offline and have substantial computational and memory costs. Unlike previous work, we propose an Online Unsupervised Multi-View Feature Selection with Adaptive Neighbors (OUMVFSAN) method by exploring view-specific feature selection matrices and weights to handle the multi-view streaming data. Specifically, the proposed OUMVFSAN method naturally performs information fusion of different views guided through a consensus clustering indicator matrix to make feature selection matrices select more valuable features. Moreover, through processing multi-view streaming data by a buffering technique, OUMVFSAN reduces the computational and storage cost without sacrificing performance. We demonstrate the effectiveness and efficiency of the proposed OUMVFSAN with extensive experiments on eight multi-view datasets. It is worth noting that OUMVFSAN performs better than many state-of-the-art online and offline unsupervised multi-view feature selection methods.
Article
Multi-view unsupervised feature selection (MUFS) has been demonstrated as an effective technique to reduce the dimensionality of multi-view unlabeled data. The existing methods assume that all of views are complete. However, multi-view data are usually incomplete, i.e., a part of instances are presented on some views but not all views. Besides, learning the complete similarity graph, as an important promising technology in existing MUFS methods, cannot achieve due to the missing views. In this paper, we propose a complementary and consensus learning-based incomplete multi-view unsupervised feature selection method (C 2^{2} IMUFS) to address the aforementioned issues. Concretely, C 2^{2} IMUFS integrates feature selection into an extended weighted non-negative matrix factorization model equipped with adaptive learning of view-weights and a sparse 2,p\ell _{2,p} -norm, which can offer better adaptability and flexibility. By the sparse linear combinations of multiple similarity matrices derived from different views, a complementary learning-guided similarity matrix reconstruction model is presented to obtain the complete similarity graph in each view. Furthermore, C 2^{2} IMUFS learns a consensus clustering indicator matrix across different views and embeds it into a spectral graph term to preserve the local geometric structure. Comprehensive experimental results on real-world datasets demonstrate the effectiveness of C 2^{2} IMUFS compared with state-of-the-art methods.
Conference Paper
Full-text available
A new unsupervised feature selection method, i.e., Robust Unsupervised Feature Selection (RUFS), is proposed. Unlike traditional unsupervised feature selection methods, pseudo cluster labels are learned via local learning regularized robust nonnegative matrix factorization. During the label learning process, feature selection is performed simultaneously by robust joint l2,1 norms minimization. Since RUFS utilizes l2,1 norm minimization on processes of both label learning and feature learning, outliers and noise could be effectively handled and redundant or noisy features could be effectively reduced. Our method adopts the advantages of robust non-negative matrix factorization, local learning, and robust feature learning. In order to make RUFS be scalable, we design a (projected) limited-memory BFGS based iterative algorithm to efficiently solve the optimization problem of RUFS in terms of both memory consumption and computation complexity. Experimental results on different benchmark real world datasets show the promising performance of RUFS over the state-of-the-arts.
Article
Full-text available
The prevalent use of social media produces mountains of unlabeled, high-dimensional data. Feature selection has been shown effective in dealing with high-dimensional data for efficient data mining. Feature selection for unlabeled data remains a challenging task due to the absence of label information by which the feature relevance can be assessed. The unique characteristics of social media data further complicate the already challenging problem of unsupervised feature selection, (e.g., part of social media data is linked, which makes invalid the independent and identically distributed assumption), bringing about new challenges to traditional unsupervised feature selection algorithms. In this paper, we study the differences between social media data and traditional attribute-value data, investigate if the relations revealed in linked data can be used to help select relevant features, and propose a novel unsupervised feature selection framework, LUFS, for linked social media data. We perform experiments with real-world social media datasets to evaluate the effectiveness of the proposed framework and probe the working of its key components.
Conference Paper
Full-text available
To reveal and leverage the correlated and complemental information between different views, a great amount of multi-view learning algorithms have been proposed in recent years. However, unsupervised feature selection in multi-view learning is still a challenge due to lack of data labels that could be utilized to select the discriminative features. Moreover, most of the traditional feature selection methods are developed for the single-view data, and are not directly applicable to the multi-view data. Therefore, we propose an unsupervised learning method called Adaptive Unsupervised Multi-view Feature Selection (AUMFS) in this paper. AUMFS attempts to jointly utilize three kinds of vital information, i.e., data cluster structure, data similarity and the correlations between different views, contained in the original data together for feature selection. To achieve this goal, a robust sparse regression model with the l 2,1-norm penalty is introduced to predict data cluster labels, and at the same time, multiple view-dependent visual similar graphs are constructed to flexibly model the visual similarity in each view. Then, AUMFS integrates data cluster labels prediction and adaptive multi-view visual similar graph learning into a unified framework. To solve the objective function of AUMFS, a simple yet efficient iterative method is proposed. We apply AUMFS to three visual concept recognition applications (i.e., social image concept recognition, object recognition and video-based human action recognition) on four benchmark datasets. Experimental results show the proposed method significantly outperforms several state-of-the-art feature selection methods. More importantly, our method is not very sensitive to the parameters and the optimization method converges very fast.
Conference Paper
Full-text available
Nonnegative Matrix Factorization (NMF) has been widely used in machine learning and data mining. It aims to find two nonnegative matrices whose product can well approximate the nonnegative data matrix, which naturally lead to parts-based repre- sentation. In this paper, we present a local learning regularized nonnegative matrix factorization (LL- NMF) for clustering. It imposes an additional con- straint on NMF that the cluster label of each point can be predicted by the points in its neighbor- hood. This constraint encodes both the discrimi- native information and the geometric structure, and is good at clustering data on manifold. An itera- tive multiplicative updating algorithm is proposed to optimize the objective, and its convergence is guaranteed theoretically. Experiments on many benchmark data sets demonstrate that the proposed method outperforms NMF as well as many state of the art clustering methods.
Conference Paper
Full-text available
We consider clustering problems in which the available attributes can be split into two independent subsets, such that either subset suffices for learning. Example applications of this multi-view setting include clustering of Web pages which have an intrinsic view (the pages themselves) and an extrinsic view (e.g., anchor texts of inbound hyperlinks); multi-view learning has so far been studied in the context of classification. We develop and study partitioning and agglomerative, hierarchical multi-view clustering algorithms for text data. We find empirically that the multi-view versions of k-means and EM greatly improve on their single-view counterparts. By contrast, we obtain negative results for agglomerative hierarchical multi-view clustering. Our analysis explains this surprising phenomenon.
Article
In this paper, a new unsupervised learning algorithm, namely Nonnegative Discriminative Feature Selection (NDFS), is proposed. To exploit the discriminative information in unsupervised scenarios, we perform spectral clustering to learn the cluster labels of the input samples, during which the feature selection is performed simultaneously. The joint learning of the cluster labels and feature selection matrix enables NDFS to select the most discriminative features. To learn more accurate cluster labels, a nonnegative constraint is explicitly imposed to the class indicators. To reduce the redundant or even noisy features, ℓ 2,1-norm minimization constraint is added into the objective function, which guarantees the feature selection matrix sparse in rows. Our algorithm exploits the discriminative information and feature correlation simultaneously to select a better feature subset. A simple yet efficient iterative algorithm is designed to optimize the proposed objective function. Experimental results on different real world datasets demonstrate the encouraging performance of our algorithm over the state-of-the-arts. Copyright © 2012, Association for the Advancement of Artificial Intelligence. All rights reserved.
Article
The automatic removal of suffixes from words in English is of particular interest in the field of information retrieval. An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL. Although simple, it performs slightly better than a much more elaborate system with which it has been compared. It effectively works by treating complex suffixes as compounds made up of simple suffixes, and removing the simple suffixes in a number of steps. In each step the removal of the suffix is made to depend upon the form of the remaining stem, which usually involves a measure of its syllable length.
Article
This paper extends to two dimensions the frame criterion developed by Daubechies for one-dimensional wavelets, and it computes the frame bounds for the particular case of 2D Gabor wavelets. Completeness criteria for 2D Gabor image representations are important because of their increasing role in many computer vision applications and also in modeling biological vision, since recent neurophysiological evidence from the visual cortex of mammalian brains suggests that the filter response profiles of the main class of linearly-responding cortical neurons (called simple cells) are best modeled as a family of self-similar 2D Gabor wavelets. We therefore derive the conditions under which a set of continuous 2D Gabor wavelets will provide a complete representation of any image, and we also find self-similar wavelet parametrization which allow stable reconstruction by summation as though the wavelets formed an orthonormal basis. Approximating a “tight frame” generates redundancy which allows low-resolution neural responses to represent high-resolution images