Multi-View Representation Learning: A Survey
from Shallow Methods to Deep Methods
Yingming Li, Ming Yang, Zhongfei (Mark) Zhang, Senior Member, IEEE
Abstract—Recently, multi-view representation learning has become a rapidly growing direction in machine learning and data mining areas. This paper first reviews the root methods and theories on multi-view representation learning, especially on canonical correlation analysis (CCA) and its several extensions. We then investigate the advances in multi-view representation learning that range from shallow methods, including multi-modal topic learning, multi-view sparse coding, and multi-view latent space Markov networks, to deep methods, including multi-modal restricted Boltzmann machines, multi-modal autoencoders, and multi-modal recurrent neural networks. Further, we also provide an important perspective from manifold alignment for multi-view representation learning. Overall, this survey aims to provide an insightful overview of the theoretical basis and current developments in the field of multi-view representation learning and to help researchers find the most appropriate tools for particular applications.
Index Terms—Multi-view representation learning, canonical correlation analysis, multi-modal deep learning.
1 INTRODUCTION
Multi-view data have become increasingly available in real-world
applications where examples are described by different feature
sets or different “views”, such as image+text, audio+video, and
webpage+click-through data. The different views usually contain
complementary information, and multi-view learning can exploit
this information to learn representations that are more expressive
than those of single-view learning methods. Therefore, multi-view
representation learning has become a very promising topic with
wide applicability.
Canonical correlation analysis (CCA) [1] and its kernel ex-
tensions [2–4] are representative techniques in early studies of
multi-view representation learning. A variety of theories and
approaches are later introduced to investigate their theoretical
properties, explain their success, and extend them to improve the
generalization performance in particular tasks. While CCA and its
kernel versions have shown their ability to effectively model the
relationship between two or more sets of variables, they have
limitations in capturing high-level associations between multi-view
data. Inspired by the success of deep neural networks [5–7],
deep CCAs [8] have been proposed to address this problem, with a
common strategy of learning a joint representation that couples
multiple views at a higher level after learning several layers
of view-specific features in the lower layers.
However, how to learn a good association between multi-view
data still remains an open problem. In 2016, a workshop on multi-view
representation learning was held in conjunction with the 33rd
International Conference on Machine Learning to help promote a
better understanding of various approaches and the challenges in
specific applications. So far, there have been increasing research
activities in this direction and a large number of multi-view
representation learning algorithms have been presented based on
Y. Li, M. Yang, Z. Zhang are with College of Information Science &
Electronic Engineering, Zhejiang University, China.
E-mail: {yingming, cauchym, zhongfei}@zju.edu.cn
the fundamental theories of CCAs and progress of deep neural net-
works. For example, the advancement of multi-view representation
learning ranges from shallow methods including multi-modal topic
learning [9–11], multi-view sparse coding [12–14], and multi-view
latent space Markov networks [15, 16], to deep methods including
multi-modal deep Boltzmann machines [17], multi-modal deep
autoencoders [18–20], and multi-modal recurrent neural networks
[21–23]. Further, manifold alignment also provides an important
perspective for multi-view representation learning [24].
Therefore, we review the literature of multi-view representa-
tion learning from shallow methods to deep methods in accordance
with its progress. In particular, each part is surveyed by following
two broad, parallel lines of this learning mechanism: one rooted in
probabilistic graphical models and the other rooted in non-linear
embedding models including the kernel trick and neural networks.
In fact, the fundamental difference between the two paradigms
is whether the layered architecture of a learning model is to be
interpreted as a probabilistic graphical model or as a computation
network. The connection between these two paradigms becomes
more obvious when we consider deeper multi-view representation
learning models. The computational networks have become in-
creasingly important for big data learning since the exact inference
of probabilistic models usually becomes intractable with deep
structures.
The goal of this survey is to review the theoretical basis and
key advances in the area of multi-view representation learning and
to provide a global picture of this active direction. We expect this
survey to help researchers find the most appropriate approaches for
their particular applications and deliver perspectives of what can
be done in the future to promote the development of multi-view
representation learning.
The remainder of this paper is organized as follows. In Section
2, we introduce theories and advances on CCA and its exten-
sions including probabilistic CCA, sparse CCA, kernel CCA and
deep CCA. Section 3 surveys representative shallow multi-view
representation learning approaches according to the modeling
mechanisms involved. Then, in Section 4, we investigate the pop-
ular deep multi-view representation learning methods which have
attracted much attention and have been widely applied in cross-media retrieval. Further, Section 5 surveys manifold alignment, which provides an important perspective for multi-view representation learning. Finally, we provide a conclusion in Section 6.

Fig. 1. An illustrative example application of CCA in cross-modal retrieval (adapted from [25]). Left: embedding of text and images from their source spaces into a common space learned with CCA and a semantic space built on top of the CCA representation. Right: examples of cross-modal retrieval where both text and images are mapped to a common space; the top shows text retrieved in response to an image query using a common semantic space, and the bottom shows images retrieved in response to a text query using a common subspace learned with CCA.
2 CANONICAL CORRELATION ANALYSIS AND ITS EXTENSIONS
In this section we review the root of multi-view representation learning techniques, canonical correlation analysis (CCA), and its extensions that range from probabilistic graphical modeling to
non-linear deep embedding. In particular, we will investigate the
related work of probabilistic CCA, sparse CCA, kernel CCA and
deep CCA to illustrate the probabilistic, sparse, and non-linear
views of multi-view representation learning theories.
2.1 Canonical Correlation Analysis
Canonical Correlation Analysis [1] has become increasingly pop-
ular for its capability of effectively modeling the relationship
between two or more sets of variables. From the perspective
of multi-view representation learning, CCA computes a shared
embedding of both sets of variables through maximizing the
correlations among the variables between the two sets. More
specifically, CCA has been widely used in multi-view learning
tasks to generate low dimensional feature representations [25–
27]. Improved generalization performance has been witnessed in
areas including dimensionality reduction [28–30], clustering [31–
33], regression [34, 35], word embeddings [36–38], discriminant
learning [39–41], etc. For example, Figure 1 shows a fascinating
cross-modality application of CCA in multi-media retrieval.
Given a pair of datasets $X = [x_1, \ldots, x_n]$ and $Y = [y_1, \ldots, y_n]$, CCA tends to find linear projections $w_x$ and $w_y$ which make the corresponding examples in the two datasets maximally correlated in the projected space. The correlation coefficient between the two datasets in the projected space is given by
$$\rho = \mathrm{corr}\left(w_x^\top X,\, w_y^\top Y\right) = \frac{w_x^\top C_{xy} w_y}{\sqrt{\left(w_x^\top C_{xx} w_x\right)\left(w_y^\top C_{yy} w_y\right)}} \qquad (1)$$
where the cross-covariance matrix $C_{xy}$ is defined as
$$C_{xy} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)^\top \qquad (2)$$
where $\mu_x = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\mu_y = \frac{1}{n}\sum_{i=1}^{n} y_i$ are the means of the two views, respectively. The definitions of $C_{xx}$ and $C_{yy}$ can be obtained similarly.
Since the correlation $\rho$ is invariant to the scaling of $w_x$ and $w_y$, CCA can be posed equivalently as a constrained optimization problem:
$$\max_{w_x, w_y} \; w_x^\top C_{xy} w_y \quad \text{s.t.} \quad w_x^\top C_{xx} w_x = 1, \; w_y^\top C_{yy} w_y = 1 \qquad (3)$$
By formulating the Lagrangian dual of Eq.(3), it can be shown that the solution to Eq.(3) is equivalent to solving a pair of generalized eigenvalue problems [3]:
$$C_{xy} C_{yy}^{-1} C_{yx} w_x = \lambda^2 C_{xx} w_x, \qquad C_{yx} C_{xx}^{-1} C_{xy} w_y = \lambda^2 C_{yy} w_y \qquad (4)$$
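To make the above computation concrete, the following NumPy sketch (our illustration, not an algorithm from the surveyed papers) computes the top-$k$ canonical directions by whitening the covariance matrices and taking an SVD, which is equivalent to the generalized eigenvalue problems in Eq.(4); the small ridge term `reg` added to the covariance diagonals is an assumption made for numerical stability.

```python
import numpy as np

def linear_cca(X, Y, k, reg=1e-8):
    """Top-k CCA directions for views X (d_x x n) and Y (d_y x n)."""
    n = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)          # center each view
    Yc = Y - Y.mean(axis=1, keepdims=True)
    Cxx = Xc @ Xc.T / n + reg * np.eye(X.shape[0])  # within-view covariances
    Cyy = Yc @ Yc.T / n + reg * np.eye(Y.shape[0])
    Cxy = Xc @ Yc.T / n                             # cross-covariance, Eq. (2)
    Lx = np.linalg.cholesky(Cxx)                    # Cxx = Lx Lx^T
    Ly = np.linalg.cholesky(Cyy)
    # Whitened cross-covariance; its singular values are the canonical correlations.
    T = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, S, Vt = np.linalg.svd(T)
    Wx = np.linalg.solve(Lx.T, U[:, :k])            # canonical weights w_x
    Wy = np.linalg.solve(Ly.T, Vt.T[:, :k])         # canonical weights w_y
    return Wx, Wy, S[:k]
```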
Besides the above definition of CCA, there are also other
different ways to define the canonical correlations of a pair of
matrices, and all these ways are shown to be equivalent [42].
In particular, Kettenring [43] shows that CCA is equivalent to a constrained least-squares optimization problem. Further, Golub and Zha [42] also provide a classical algorithm for computing CCA that first whitens the data matrices via QR decompositions and then applies an SVD to the whitened covariance matrix. However, with typically huge data matrices this procedure becomes extremely slow. Avron et al. [30, 44] propose a fast
becomes extremely slow. Avron et al. [30, 44] propose a fast
algorithm for CCA with a pair of tall-and-thin matrices using
subsampled randomized Walsh-Hadamard transform [45], which
only subsamples a small proportion of the training data points
to approximate the matrix product. Further, Lu and Foster [46]
consider sparse design matrices and introduce an efficient iterative
regression algorithm for large scale CCA.
Another line of research for large-scale CCA considers stochastic optimization algorithms for CCA [47]. Ma et al. [48]
introduce an augmented approximate gradient scheme and further
extend it to a stochastic optimization regime. Recent work [49, 50]
attempts to transform the original problem of CCA into sequences
of least squares problems and solve these problems with acceler-
ated gradient descent (AGD). Further, Sun et al. [27] formulate
CCA as a least squares problem in multi-label classification.
While CCA has the capability of conducting multi-view fea-
ture learning and has been widely applied in different fields, it
still has some limitations in different applications. For example,
it ignores the nonlinearities of multi-view data. Consequently,
many algorithms based on CCA have been proposed to extend
the original CCA in real-world applications. In the following
sections, we review its several widely-used extensions including
probabilistic CCA, sparse CCA, kernel CCA and Deep CCA.
2.2 Probabilistic CCA
CCA can be interpreted as the maximum likelihood solution to
a probabilistic latent variable model for two Gaussian random
vectors. This formulation of CCA as a probabilistic model is
proposed by Bach and Jordan [51].
Probabilistic CCA can be formulated by first introducing an explicit latent variable $z$ corresponding to the principal-component subspace, and then placing a Gaussian prior distribution $p(z)$ over the latent variable, together with two Gaussian conditional distributions $p(x|z)$ and $p(y|z)$ for the observed variables $x$ and $y$ conditioned on the value of the latent variable. In particular, the prior distribution over $z$ is given by a zero-mean unit-covariance Gaussian
$$p(z) = \mathcal{N}(z \,|\, 0, I_M) \qquad (5)$$
Similarly, the conditional distributions of the observed variables $x$ and $y$, conditioned on the value of the latent variable $z$, are also Gaussian and given as follows:
$$p(x|z) = \mathcal{N}(x \,|\, W_x z + \mu_x, \Psi_x), \qquad p(y|z) = \mathcal{N}(y \,|\, W_y z + \mu_y, \Psi_y) \qquad (6)$$
where the means of $x$ and $y$ are in general linear functions of $z$ governed by the $d_x \times d_z$ matrix $W_x$ and the $d_x$-dimensional vector $\mu_x$, and by the $d_y \times d_z$ matrix $W_y$ and the $d_y$-dimensional vector $\mu_y$, respectively. The crucial idea lies in that the latent variable $z$ is shared by the two data sets, while the other variables and parameters are independent.
Consequently, Bach and Jordan [51] provide a detailed connection between CCA and probabilistic CCA based on the result of the maximum likelihood estimation of Eq.(6): $W_x = \Sigma_{xx} U_x P^{1/2} R$ and $W_y = \Sigma_{yy} U_y P^{1/2} R$, where $U_x$ and $U_y$ are composed of the canonical directions, $P$ is a diagonal matrix containing the canonical correlations, and $R$ is an arbitrary rotation matrix. Further, the posterior expectations $E(z|x)$ and $E(z|y)$ lie in the same subspace that traditional CCA searches, since the obtained canonical weights are the same as the ones for CCA. The EM algorithm is applied to find the maximum likelihood solution and is shown to converge to the global optimum. In addition, Browne [52] also proves that the maximum likelihood solution of Eq.(6) is identical to the solution of classical CCA.
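To illustrate the generative view of Eqs.(5)-(6), the following sketch (ours; all function and variable names are hypothetical) draws paired observations by ancestral sampling and evaluates the standard Gaussian posterior mean $E[z|x]$ mentioned above.

```python
import numpy as np

def sample_probabilistic_cca(Wx, Wy, mu_x, mu_y, Psi_x, Psi_y, n):
    """Ancestral sampling from the probabilistic CCA model of Eqs. (5)-(6)."""
    dz = Wx.shape[1]
    Z = np.random.randn(n, dz)                          # z ~ N(0, I_M)
    X = Z @ Wx.T + mu_x + np.random.multivariate_normal(
        np.zeros(len(mu_x)), Psi_x, size=n)             # x | z ~ N(Wx z + mu_x, Psi_x)
    Y = Z @ Wy.T + mu_y + np.random.multivariate_normal(
        np.zeros(len(mu_y)), Psi_y, size=n)             # y | z ~ N(Wy z + mu_y, Psi_y)
    return X, Y, Z

def posterior_mean_z_given_x(x, Wx, mu_x, Psi_x):
    """E[z | x] for this Gaussian latent variable model (standard result)."""
    M = np.eye(Wx.shape[1]) + Wx.T @ np.linalg.solve(Psi_x, Wx)
    return np.linalg.solve(M, Wx.T @ np.linalg.solve(Psi_x, x - mu_x))
```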
However, since probabilistic CCA is based on a Gaussian
density model, outliers may lead to serious biases in the parameter
estimates when using a maximum likelihood estimation method.
To handle the influence of outliers, Archambeau et al. [53]
introduce robust canonical correlation analysis by replacing Gaus-
sian distributions with Student-t distributions, which constructs
mixtures of robust CCA and can deal with missing data quite
easily. Fyfe and Leen [54] consider a Dirichlet process method for
performing a non-Bayesian probabilistic mixture CCA.
By treating the mixture case at length, Klami and Kaski [55] present a full Bayesian treatment of the probabilistic canonical correlation analyzer. Further, Wang [56] applies a hierarchical Bayesian model to probabilistic CCA and learns this model by variational approximation. With this Bayesian treatment of probabilistic CCA, the number of canonical correlations can be determined automatically, so one does not encounter the issue of selecting the number of canonical correlations, which has been
a well known downside for using CCA. Both [55] and [56] exploit
the inverse Wishart matrix distribution as a prior for the covariance matrices $\Psi_{x(y)}$ in Eq.(6) and use the automatic relevance determination prior for the linear transformation matrices $W_{x(y)}$.
Later Viinikanoja et al. [57] introduce a mixture of robust
canonical correlation analyzers and provide a variational Bayesian
inference for learning from noisy data. Such kinds of mixture
models can be considered as locally linear models that find several
correlation clusters and fit a separate CCA model for each cluster.
The clustering process is incorporated into the solution procedure
and is also closely related to the CCA models themselves.
Based on exponential family extensions of principal compo-
nent analysis [58, 59], Klami et al. [60] extend Bayesian CCA
to the exponential family by generalizing the Gaussian noise
assumption to noise with any distribution in the exponential
family. Using the natural parameter formulation of exponential
family distributions, certain common choices can be incorporated as special cases, leading to an efficient way of exploiting the robustness of exponential family distributions in practical models [61, 62]. Recently, Mukuta and Harada [63] follow Wang's
approach [56] and introduce a Bayesian extension of partial CCA
[64]. Kamada et al. [65] propose a probabilistic semi-CCA by incorporating a mechanism for handling missing data.
Despite the apparent superiority of Bayesian CCA for model
selection, it is worth pointing out that the inference of the Bayesian
CCA is difficult for high-dimensional data. Most of the practical
applications of Bayesian CCA in the earlier efforts focus on
relatively low-dimensional data [55, 56]. To make Bayesian CCA
feasible for high-dimensional data, Huopaniemi et al. [66, 67] pro-
pose to use multi-way analysis setup by exploiting dimensionality
reduction techniques.
2.3 Sparse CCA
Recent years have witnessed a growing interest in learning sparse
representations of data. Correspondingly, the problem of sparse
CCA also has received much attention in the multi-view repre-
sentation learning. The quest for sparsity can be motivated from
several aspects. First is the ability to account for the predicted
results. The big picture usually relies on a small number of crucial
variables, with details to be allowed for variation. The second
motivation for sparsity is regularization and stability. Reasonable
regularization plays an important role in eliminating the influence
of noisy data and reducing the sensitivity of CCA to a small
number of observations. Further, sparse CCA can be formulated
as a subset selection scheme which reduces the dimensionality of
the vectors and makes possible a stable solution.
JOURNAL OF L
A
T
E
X CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 4
The problem of sparse CCA can be considered as finding a pair of linear combinations $w_x$ and $w_y$ with prescribed cardinality which maximize the correlation. In particular, sparse CCA can be defined as the solution to
$$\rho = \max_{w_x, w_y} \frac{w_x^\top C_{xy} w_y}{\sqrt{\left(w_x^\top C_{xx} w_x\right)\left(w_y^\top C_{yy} w_y\right)}} \quad \text{s.t.} \quad \|w_x\|_0 \le s_x, \; \|w_y\|_0 \le s_y. \qquad (7)$$
Most of the approaches to sparse CCA are based on the well
known LASSO trick [68] which is a shrinkage and selection
method for linear regression. By formulating CCA as two con-
strained simultaneous regression problems, Hardoon and Shawe-
Taylor [69] propose to approximate the non-convex cardinality constraints with the $\ell_1$-norm. This is achieved by fixing each index of the optimized vector to 1 in turn and constraining the $\ell_1$-norm of the remaining coefficients. Similarly, Waaijenborg et al. [70] propose
to use the elastic net type regression.
In addition, Sun et al. [27] introduce a sparse CCA by
formulating CCA as a least squares problem in multi-label classi-
fication and directly computing it with the Least Angle Regression
algorithm (LARS) [71]. Further, this least squares formulation
facilitates the incorporation of the unlabeled data into the CCA
framework to capture the local geometry of the data. For example,
the graph Laplacian [72] can be used in this framework to handle the unlabeled data.
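A common strategy in this LASSO-based line of work is to alternate $\ell_1$-penalized regressions between the two views. The sketch below is our illustration of that idea with scikit-learn's Lasso, not the exact algorithm of [69], [27], or [70]; the penalty weight `alpha` and the normalization of the canonical variates are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_cca_alternating(X, Y, alpha=0.1, n_iter=20, seed=0):
    """Alternating l1-penalized regressions for one pair of sparse canonical
    directions. X is (n, d_x), Y is (n, d_y); rows are paired samples."""
    rng = np.random.RandomState(seed)
    wy = rng.randn(Y.shape[1])
    wy /= np.linalg.norm(Y @ wy)                 # normalize the projected variate
    wx = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # Fix w_y and regress the canonical variate Y w_y on X with an l1 penalty.
        wx = Lasso(alpha=alpha, fit_intercept=False).fit(X, Y @ wy).coef_
        if np.linalg.norm(X @ wx) > 0:
            wx /= np.linalg.norm(X @ wx)
        # Fix w_x and regress X w_x on Y with an l1 penalty.
        wy = Lasso(alpha=alpha, fit_intercept=False).fit(Y, X @ wx).coef_
        if np.linalg.norm(Y @ wy) > 0:
            wy /= np.linalg.norm(Y @ wy)
    return wx, wy
```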
In fact, the development of sparse CCA is intimately related
to the advances in sparse PCA [73–75]. The classical solutions to the generalized eigenvalue problem in sparse PCA can be easily extended to sparse CCA [76, 77]. Torres et al. [76] derive
a sparse CCA algorithm by extending an approach for solving
sparse eigenvalue problems using D.C. programming. Based on
the sparse PCA algorithm in [74], Wiesel et al. [77] propose a
backward greedy approach to sparse CCA by bounding the corre-
lation at each stage. Witten et al. [78] propose to apply a penalized
matrix decomposition to the covariance matrix Cxy, which results
in a method for penalized sparse CCA. Consequently, structured
sparse CCA has been proposed by extending the penalized CCA
with structured sparsity inducing penalty [79].
Another line of research for sparse CCA is based on sparse
Bayesian learning [80–82]. In particular, they are built on the
probabilistic interpretation of CCA outlined by [51]. Fyfe and
Leen [80] investigate two methods for sparsifying probabilistic
CCA. Fujiwara et al. [83] present a variant of sparse Bayesian
CCA, using an element-wise automatic relevance determination
(ARD) prior configuration, where non-effective parameters are
automatically driven to zero. Rai and Daumé III [82] propose a
nonparametric, fully Bayesian framework that can automatically
select the number of correlation components and capture the
sparsity underlying the projections. This framework exploits the
Indian Buffet Process [84] to discover latent feature representation
of a set of observations and to control the overall complexity of
the model.
Recently, some authors have used probabilistic inter-battery factor analysis (IBFA) [52, 85], an extended CCA model that complements CCA by capturing variation not explained by the correlation components, to build more complex hierarchical sparse Bayesian CCA models.
By extending IBFA with group-wise sparsity, Klami et al. [86] in-
troduce a Bayesian treatment of the IBFA model, which describes
not only the correlations between data sets but also provides
components explaining the linear structure within each of the data
sets.
Built on multi-battery factor analysis (MBFA) by McDonald
[87] and Browne [88] as a generalization of IBFA, Klami et al.
[89] introduce a problem formulation of group factor analysis, which extends CCA to more than two sets with structural sparsity, making it more flexible than the previous extensions
[86]. The solution to this framework can be formulated as varia-
tional inference of a latent variable model with a structural sparsity
prior.
2.4 Kernel CCA
Canonical correlation analysis is a linear multi-view representation learning algorithm, but for many real-world multi-view datasets that exhibit nonlinearities, it is impossible for a linear embedding to capture all the properties of the multi-view data [90]. Since kernelization is a principled trick for introducing non-linearity into linear methods, kernel CCA [91–93] provides
an alternative solution. As a non-linear extension of CCA, kernel
CCA has been successfully applied in many situations, including
independent component analysis [2], cross-media information
retrieval [3, 94, 95], computational biology [3, 96, 97], multi-
view clustering [32, 98], acoustic feature learning [99, 100], and
statistics [2, 101, 102].
The key idea of kernel CCA lies in embedding the data into a higher dimensional feature space $\phi_x : \mathcal{X} \to \mathcal{H}_x$, where $\mathcal{H}_x$ is the reproducing kernel Hilbert space (RKHS) associated with the kernel $k_x : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, $k_x(x_i, x_j) = \langle \phi_x(x_i), \phi_x(x_j) \rangle$. $k_y$, $\mathcal{H}_y$, and $\phi_y$ can be defined analogously. By adapting the representer theorem [103] to the case of multi-view data, the following minimization problem
$$\min_{f_x, f_y} \; L\big((x_1, y_1, f_x(x_1), f_y(y_1)), \ldots, (x_n, y_n, f_x(x_n), f_y(y_n))\big) + \Omega\big(\|f_x\|^2_{K}, \|f_y\|^2_{K}\big) \qquad (8)$$
where $L$ is an arbitrary loss function and $\Omega$ is a strictly monotonically increasing function, admits representations of the form
$$f_x(x) = \sum_i \alpha_i k_x(x_i, x), \qquad f_y(y) = \sum_i \beta_i k_y(y_i, y) \qquad (9)$$
Correspondingly, we replace the vectors $w_x$ and $w_y$ in the previous CCA formulation Eq.(1) with $f_x = \sum_i \alpha_i \phi_x(x_i)$ and $f_y = \sum_i \beta_i \phi_y(y_i)$, respectively, and replace the covariance matrices accordingly. The kernel CCA objective can be written as follows:
$$\rho = \frac{f_x^\top \hat{C}_{xy} f_y}{\sqrt{\left(f_x^\top \hat{C}_{xx} f_x\right)\left(f_y^\top \hat{C}_{yy} f_y\right)}} \qquad (10)$$
In particular, the kernel covariance matrix $\hat{C}_{xy}$ is defined as
$$\hat{C}_{xy} = \frac{1}{n}\sum_{i=1}^{n}\left(\phi_x(x_i) - \mu_{\phi_x(x)}\right)\left(\phi_y(y_i) - \mu_{\phi_y(y)}\right)^\top \qquad (11)$$
where $\mu_{\phi_x(x)} = \frac{1}{n}\sum_{i=1}^{n}\phi_x(x_i)$ and $\mu_{\phi_y(y)} = \frac{1}{n}\sum_{i=1}^{n}\phi_y(y_i)$ are the means of the two views' kernel mappings, respectively. The forms of $\hat{C}_{xx}$ and $\hat{C}_{yy}$ can be obtained analogously.
Let $K_x$ denote the kernel matrix such that $K_x = H\tilde{K}_x H$, where $[\tilde{K}_x]_{ij} = k_x(x_i, x_j)$ and $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ is a centering
matrix, with $\mathbf{1} \in \mathbb{R}^n$ being a vector of all ones. $K_y$ is defined similarly. Further, we substitute them into Eq.(10) and formulate the objective of kernel CCA as the following optimization problem:
$$\max_{\alpha, \beta} \frac{\alpha^\top K_x K_y \beta}{\sqrt{\left(\alpha^\top K_x^2 \alpha\right)\left(\beta^\top K_y^2 \beta\right)}} \qquad (12)$$
As discussed in [3], the above optimization leads to degenerate solutions when either $K_x$ or $K_y$ is invertible. Thus, we introduce regularization terms and maximize the following regularized expression:
$$\max_{\alpha, \beta} \frac{\alpha^\top K_x K_y \beta}{\sqrt{\left(\alpha^\top (K_x^2 + \epsilon_x K_x) \alpha\right)\left(\beta^\top (K_y^2 + \epsilon_y K_y) \beta\right)}} \qquad (13)$$
Since this new regularized objective function is not affected by re-scaling of $\alpha$ or $\beta$, we assume that the optimization problem is subject to
$$\alpha^\top K_x^2 \alpha + \epsilon_x\, \alpha^\top K_x \alpha = 1, \qquad \beta^\top K_y^2 \beta + \epsilon_y\, \beta^\top K_y \beta = 1 \qquad (14)$$
Similar to the case of CCA, by formulating the Lagrangian dual of Eq.(13) with the constraints in Eq.(14), it can be shown that the solution to Eq.(13) is also equivalent to solving a pair of generalized eigenproblems [3]:
$$(K_x + \epsilon_x I)^{-1} K_y (K_y + \epsilon_y I)^{-1} K_x \alpha = \lambda^2 \alpha, \qquad (K_y + \epsilon_y I)^{-1} K_x (K_x + \epsilon_x I)^{-1} K_y \beta = \lambda^2 \beta \qquad (15)$$
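For concreteness, the following NumPy sketch (ours) solves the regularized eigenproblem of Eq.(15) directly, assuming centered Gram matrices and regularization parameters $\epsilon_x, \epsilon_y$; for large $n$ the low-rank approximations discussed below are preferable.

```python
import numpy as np

def center_gram(K):
    """Center a Gram matrix: K_c = H K H with H = I - (1/n) 1 1^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kernel_cca(Kx, Ky, eps_x=1e-3, eps_y=1e-3, k=2):
    """Regularized kernel CCA via the generalized eigenproblem of Eq. (15).
    Kx, Ky are centered n x n Gram matrices of the two views."""
    n = Kx.shape[0]
    A = np.linalg.solve(Kx + eps_x * np.eye(n),
                        Ky @ np.linalg.solve(Ky + eps_y * np.eye(n), Kx))
    vals, vecs = np.linalg.eig(A)
    order = np.argsort(-vals.real)[:k]
    corrs = np.sqrt(np.clip(vals.real[order], 0.0, 1.0))   # lambda^2 -> lambda
    alpha = vecs.real[:, order]
    # Recover beta from alpha (up to scale): beta ~ (Ky + eps_y I)^{-1} Kx alpha.
    beta = np.linalg.solve(Ky + eps_y * np.eye(n), Kx @ alpha)
    return alpha, beta, corrs
```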
Consequently, the statistical properties of KCCA have been
investigated from several aspects [2, 104]. Fukumizu et al. [105]
introduce a mathematical proof of the statistical convergence of
kernel CCA by providing rates for the regularization parameters.
Later Hardoon and Shawe-Taylor [106] provide a detailed theo-
retical analysis of KCCA and propose a finite sample statistical
analysis of KCCA by using a regression formulation. Cai and
Sun [107] also provide a convergence rate analysis of kernel CCA
under an approximation assumption. However, the problems of
choosing the regularization parameter in practice remain largely
unsolved.
In addition, KCCA has a closed-form solution via the eigen-
value system in Eq.(15), but this solution does not scale to large training sets due to its time complexity and memory cost. Thus, various approximation methods have been
proposed by constructing low-rank approximations of the kernel
matrices, including incomplete Cholesky decomposition [2], par-
tial Gram-Schmidt orthogonolisation [3], and block incremental
SVD [99, 108]. In addition, the Nyström method [109] is widely
used to speed up the kernel machines [110–113]. This approach
is achieved by carrying out an eigen-decomposition on a lower-
dimensional system, and then expanding the results back up to the
original dimensions.
With the development of image understanding, kernel CCA
has already been successfully used to associate images or image
regions with different kinds of captions, including individual
words, sets of tags, and even sentences [3, 94, 95, 114, 115].
Hardoon et al. [3] first apply KCCA to the cross-modality retrieval task, in which images are retrieved for a given text query without using any label information around the retrieved images. Subsequently, KCCA is exploited by Socher and Li [94]
to learn a mapping between textual words and visual words so
that both modalities are connected by a shared, low dimensional
feature space. Further, Hodosh et al. [115] make use of KCCA
in a stringent task of associating images with natural language
sentences that describe what is depicted.
Moreover, kernel CCA also has attracted much attention in
multi-view clustering [32, 98]. Blaschko and Lampert [32] pro-
pose a correlational spectral clustering method, which exploits
KCCA to do unsupervised clustering of images and text in latent
meaning space. For the webpage clustering task, Trivedi et al. [98]
use a regularized variant of the KCCA algorithm to learn a
lower dimensional subspace from heterogeneous data sources.
This approach leverages tag information as a complementary part
of webpage contents to extract highly discriminative features.
While KCCA can learn representation that is closely related
to the underlying generative process of the multi-view data, it
may suffer from the small sample effect when the data acquisition
in one or more modalities is expensive or otherwise limited. In
order to robustly learn the relevant directions maximizing the
correlation, Blaschko et al. [97] propose to modify the objective
of KCCA with semi-supervised Laplacian regularization [116] to
favor directions that lie along the data manifold [116].
2.5 Deep CCA
The CCA-like objectives can be naturally applied to neural
networks to capture high-level associations between data from
multiple views. In the early work, by assuming that different
parts of perceptual input have common causes in the external
world, Becker and Hinton [117] present a multilayer nonlinear
extension of canonical correlation by maximizing the normalized
covariation between the outputs from two neural network modules.
Further, Becker [118] explores the idea of maximizing the mutual
information between the outputs of different network modules to extract higher-order features from coherent inputs.
Later Lai and Fyfe [119, 120] investigate a neural network
implementation of CCA and maximize the correlation (rather than
canonical correlation) between the outputs of the networks for
each view. Hsieh [121] formulates a nonlinear canonical corre-
lation analysis (NLCCA) method using three feedforward neural
networks. The first network maximizes the correlation between the
canonical variates (the two output neurons), while the remaining
two networks map the canonical variates back to the original two
sets of variables.
Although multiple CCA-based neural network models have
been proposed for decades, the full deep neural network extension
of CCA, referred to as deep CCA, has recently been developed
by Andrew et al. [8]. Inspired by the recent success of deep
neural networks [6, 122], Andrew et al. [8] introduce deep CCA
to learn deep nonlinear mappings between two views {X, Y }
which are maximally correlated. The deep CCA learns repre-
sentations of the two views by using multiple stacked layers
of nonlinear mappings. In particular, assuming for simplicity that the network has $d$ intermediate layers, deep CCA first learns a deep representation $f_x(x) = h_{W_x, b_x}(x)$ with parameters $(W_x, b_x) = (W_x^1, b_x^1, W_x^2, b_x^2, \ldots, W_x^d, b_x^d)$, where $W_x^l(ij)$ denotes the parameter associated with the connection between unit $i$ in layer $l$ and unit $j$ in layer $l+1$, and $b_x^l(j)$ denotes the bias associated with unit $j$ in layer $l+1$. Given a sample of the second view, the representation $f_y(y)$ is computed in the same way, with different parameters $(W_y, b_y)$. The goal of deep CCA is to jointly learn parameters for both views such that $\mathrm{corr}(f_x(X), f_y(Y))$ is as high as possible.
Fig. 2. The framework of deep CCA (adapted from [8]), in which the
output layers of two deep networks are maximally correlated.
Let $\theta_x$ be the vector of all the parameters $(W_x, b_x)$ of the first view, and similarly for $\theta_y$; then
$$(\theta_x^*, \theta_y^*) = \arg\max_{(\theta_x, \theta_y)} \mathrm{corr}\big(f_x(X; \theta_x), f_y(Y; \theta_y)\big). \qquad (16)$$
For training deep neural network models, parameters are typically estimated with gradient-based optimization methods. Thus, the parameters $(\theta_x^*, \theta_y^*)$ are also estimated on the training data by following the gradient of the correlation objective, with batch-based algorithms like L-BFGS as in [8] or stochastic optimization with mini-batches [123–125].
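The correlation objective in Eq.(16) is evaluated on a batch of network outputs. The following NumPy sketch (ours; the small regularizers `rx`, `ry` are assumptions) computes the sum of canonical correlations between the two views' output matrices, i.e., the quantity whose gradient is followed during training; an actual implementation would express this in an automatic differentiation framework so it can be backpropagated through the networks.

```python
import numpy as np

def dcca_total_correlation(Hx, Hy, rx=1e-4, ry=1e-4):
    """Sum of canonical correlations between network outputs Hx, Hy (n x o each)."""
    n = Hx.shape[0]
    Hx_c = Hx - Hx.mean(axis=0, keepdims=True)
    Hy_c = Hy - Hy.mean(axis=0, keepdims=True)
    S11 = Hx_c.T @ Hx_c / (n - 1) + rx * np.eye(Hx.shape[1])
    S22 = Hy_c.T @ Hy_c / (n - 1) + ry * np.eye(Hy.shape[1])
    S12 = Hx_c.T @ Hy_c / (n - 1)
    # T = S11^{-1/2} S12 S22^{-1/2}; its singular values are the correlations.
    L1 = np.linalg.cholesky(S11)
    L2 = np.linalg.cholesky(S22)
    T = np.linalg.solve(L1, S12) @ np.linalg.inv(L2).T
    return np.linalg.svd(T, compute_uv=False).sum()
```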
Deep CCA and its extensions have been widely applied in
learning representation tasks in which multiple views of data are
provided. Yan and Mikolajczyk [126] learn a joint latent space for
matching images and captions with a deep CCA framework, which
adopts a GPU implementation and could deal with overfitting. To
exploit multilingual context when learning word embeddings, Lu
et al. [124] learn deep non-linear embeddings of two languages
using the deep CCA.
Recently, a deep canonically correlated autoencoder (DCCAE)
[20] is proposed by combining the advantages of the deep CCA
and autoencoder-based approaches. In particular, DCCAE consists
of two autoencoders and optimizes the combination of canonical
correlation between the learned bottleneck representations and the
reconstruction errors of the autoencoders. This optimization offers
a trade-off between information captured in the embedding within
each view on one aspect, and the information in the relationship
across views on the other.
3 SHALLOW METHODS ON MULTI-VIEW REPRESENTATION LEARNING
In this section, we first review the shallow multi-view representation learning methods from the probabilistic modeling perspective, and then survey the related methods from the directly parameterized representation learning perspective.
3.1 Probabilistic Models
Fig. 3. The graphical model representation of the Corr-LDA model (adapted from [11]).

From the probabilistic modeling perspective, the problem of multi-view feature learning can be interpreted as an attempt to recover a compact set of latent random variables that describe a distribution over the observed multi-view data. We can express $p(x, y, z)$ as a probabilistic model over the joint space of the latent variables $z$ and the observed two-view data $x, y$. Feature values are determined by the posterior probability $p(z|x, y)$. Parameters are usually
estimated by maximizing the regularized likelihood of the training
data.
3.1.1 Multi-Modal Latent Dirichlet Allocation
Latent Dirichlet allocation (LDA) [127] is a generative probabilistic model for collections of discrete data such as text corpora. It goes beyond PLSA by providing a generative model at both the word and document levels simultaneously. In particular, LDA is a three-level hierarchical
Bayesian network that models each document as a finite mixture
over an underlying set of topics.
As a generative model, LDA is extendable to multi-view data.
Barnard et al. [10] present a mixture of multi-modal LDA model
(MoM-LDA), which describes the following generative process
for the multi-modal data: each document (consisting of visual
and textual information) has a distribution for a fixed number of
topics (mixture components), and given a specific topic the visual
features and textual words are generated. Further, the probability
distribution of the topics is different for each image-caption pair,
which is achieved by imposing a Dirichlet prior for the distribution
of topics.
Consequently, Blei and Jordan [11] propose a correspondence
LDA (Corr-LDA) model, which not only allows simultaneous
dimensionality reduction in the representation of region descrip-
tions and words, but also models the conditional correspondence
between their respectively reduced representations. The graphical
model of Corr-LDA is depicted in Figure 3. This model can be
viewed in terms of a generative process that first generates the
region descriptions and subsequently generates the caption words.
In particular, let $z = \{z_1, z_2, \ldots, z_N\}$ be the latent variables that generate the image, and let $y = \{y_1, y_2, \ldots, y_M\}$ be discrete indexing variables that take values from $1$ to $N$ with equal probability. Each image and its corresponding caption are represented as a pair $(r, w)$. The first element $r = \{r_1, \ldots, r_N\}$ is a collection of $N$ feature vectors associated with the regions of the image. The second element $w = \{w_1, \ldots, w_M\}$ is the collection of $M$ words of the caption. Given $N$ and $M$, a $K$-factor Corr-LDA model assumes the following generative process for an image/caption pair $(r, w)$:
1) Sample $\theta \sim \mathrm{Dir}(\theta \,|\, \alpha)$.
2) For each image region $r_n$, $n \in \{1, \ldots, N\}$:
a) Sample $z_n \sim \mathrm{Mult}(\theta)$.
b) Sample $r_n \sim p(r \,|\, z_n, \mu, \sigma)$ from a multivariate Gaussian distribution conditioned on $z_n$.
3) For each caption word $w_m$, $m \in \{1, \ldots, M\}$:
a) Sample $y_m \sim \mathrm{Unif}(1, \ldots, N)$.
b) Sample $w_m \sim p(w \,|\, y_m, z, \beta)$ from a multinomial distribution conditioned on the $z_{y_m}$ factor.
Consequently, Corr-LDA specifies the following joint distribution on image regions, caption words, and latent variables:
$$p(r, w, \theta, z, y) = p(\theta|\alpha) \left(\prod_{n=1}^{N} p(z_n|\theta)\, p(r_n|z_n, \mu, \sigma)\right) \cdot \left(\prod_{m=1}^{M} p(y_m|N)\, p(w_m|y_m, z, \beta)\right) \qquad (17)$$
Note that exact probabilistic inference for Corr-LDA is intractable
and we employ variational inference methods to approximate the
posterior distribution over the latent variables given a particular
image/caption.
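The generative process above can be simulated directly by ancestral sampling, as in the sketch below (our illustration, not code from [11]; the per-topic Gaussian parameters `means`, `covs` for region features and the topic-word matrix `beta` are assumed inputs).

```python
import numpy as np

def sample_corr_lda(alpha, means, covs, beta, N, M, seed=0):
    """Ancestral sampling from the Corr-LDA generative process (steps 1-3).
    alpha: (K,) Dirichlet parameter; means/covs: per-topic Gaussians over regions;
    beta: (K, V) topic-word distributions. Returns regions r, words w, and z, y."""
    rng = np.random.RandomState(seed)
    theta = rng.dirichlet(alpha)                      # 1) theta ~ Dir(alpha)
    z = rng.choice(len(alpha), size=N, p=theta)       # 2a) z_n ~ Mult(theta)
    r = np.stack([rng.multivariate_normal(means[zn], covs[zn]) for zn in z])  # 2b)
    y = rng.randint(0, N, size=M)                     # 3a) y_m ~ Unif over regions
    w = np.array([rng.choice(beta.shape[1], p=beta[z[ym]]) for ym in y])      # 3b)
    return r, w, z, y
```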
Both MoM-LDA and Corr-LDA achieve great success in
image annotation. However, there are still limitations with these
models. First, selecting the number of mixture components to be
used in a MoM-LDA or Corr-LDA model is difficult. In particular,
different choices of the number of mixture components correspond
to different MoM-LDA or Corr-LDA models. Second, MoM-
LDA and Corr-LDA models neglect the discriminative information
which may be helpful for multi-view learning. For example, image
captions provide evidence for the class label, and the class label
provides evidence for image captions. Third, both models discard substantial auxiliary visual information in the images, e.g., the position information of the visual features. For multi-
modal topic modeling, we are ultimately interested in defining a
latent space that is consistent in semantic terms, while discarding
as little useful information about the multi-modal data as possible.
A substantial amount of work has been presented to address the above limitations. Based on the hierarchical Dirichlet process (HDP) [128],
Yakhnenko and Honavar [129] introduce a multi-modal hierarchi-
cal Dirichlet process model (MoM-HDP), which is an extension of
MoM-LDA with an infinite number of mixture components. Thus,
MoM-HDP is capable of removing the need for a prior choice of
the number of mixture components or the computational expense
of model selection. Note that in practice, the Dirichlet process is
approximated by truncating it [130]. Qi et al. [131] also present a
correspondence hierarchical Dirichlet process model (Corr-HDP),
which is an extension of the Corr-LDA model using a hierarchical
Dirichlet process instead.
Following the supervised LDA algorithms [132], supervised
multi-modal LDA models are subsequently proposed to make
effective use of the discriminative information. Since supervised
topic models can be classified into two categories: downstream
models and upstream models [132], supervised multi-modal LDA models can also be categorized accordingly. For a downstream
supervised multimodal model, the supervised response variables
are generated from topic assignment variables. For instance,
Wang et al. [133] develop a multi-modal probabilistic model for
jointly modeling the image, its class label, and its annotations,
called multi-class supervised LDA with annotations, which treats
the class label as a global description of the image, and treats
annotation terms as local descriptions of parts of the image. Its
underlying assumptions naturally integrate the multi-modal data
with their discriminative information so that it takes advantage
of the merits of Corr-LDA and supervised LDA [134] simultane-
ously.
For an upstream supervised multi-modal model, the response
variables directly or indirectly generate latent topic variables
[135–137]. For instance, Cao et al. [136] propose a spatially
coherent latent topic model (Spatial-LTM), which represents an
image containing objects in two different modalities: appearance
features and salient image patches. Further, a supervised spatial-
LTM model is presented by incorporating label information into
distribution of topics. Nguyen et al. [138] propose a Multi-modal
Multi-instance Multi-Label LDA model (M3LDA), in which the
model consists of a visual-label part, a textual-label part, and a
label-topic part. The underlying idea is to make the topic decided
by the visual information to be consistent with the topic decided
by the textual information, leading to the correct label assignment.
3.1.2 Multi-View Sparse Coding
Multi-view sparse coding [12–14] relates a latent representation
(either a vector of random variables or a feature vector, depending
on the interpretation) to the multi-view data through a set of
linear mappings, which we refer to as the dictionaries. It has the property of finding a shared representation $h$ which picks out the most appropriate bases and zeros out the others, given a high degree of correlation with the input. This property is owing to the explaining-away effect which arises naturally in directed graphical models [139].
Given a pair of datasets $\{X, Y\}$, a non-probabilistic multi-view sparse coding scheme can be seen as recovering the code or feature vector associated with a new multi-view input via:
$$h^* = \arg\min_{h} \|x - W_x h\|_2^2 + \|y - W_y h\|_2^2 + \lambda \|h\|_1 \qquad (18)$$
Learning the pair of dictionaries $\{W_x, W_y\}$ can be accomplished by optimizing the following training criterion with respect to $W_x$ and $W_y$:
$$\mathcal{J}_{W_x, W_y} = \sum_i \|x_i - W_x h_i^*\|_2^2 + \|y_i - W_y h_i^*\|_2^2 \qquad (19)$$
where $x_i$ and $y_i$ are the two modal inputs and $h_i^*$ is the corresponding sparse code determined by Eq.(18). In particular, $W_x$ and $W_y$ are usually constrained to have unit-norm columns.
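Because Eq.(18) stacks two squared-error terms that share the same code $h$, inference reduces to a single Lasso problem over the concatenated views. The following sketch (ours, using scikit-learn's Lasso; the rescaling of $\lambda$ to scikit-learn's convention is noted in the comments) illustrates this reduction.

```python
import numpy as np
from sklearn.linear_model import Lasso

def multiview_sparse_code(x, y, Wx, Wy, lam=0.1):
    """Solve Eq. (18) for the shared code h by stacking the two views into one
    Lasso problem (Wx, Wy are dictionaries with unit-norm columns)."""
    D = np.vstack([Wx, Wy])              # stacked dictionary [Wx; Wy]
    t = np.concatenate([x, y])           # stacked observation [x; y]
    # sklearn's Lasso minimizes (1/(2n)) ||t - D h||^2 + alpha ||h||_1,
    # so alpha = lam / (2n) matches Eq. (18) up to a constant factor.
    n = len(t)
    model = Lasso(alpha=lam / (2 * n), fit_intercept=False, max_iter=5000)
    return model.fit(D, t).coef_
```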
The above regularized form of multi-view sparse coding can be generalized as a probabilistic model. In probabilistic multi-view sparse coding, we assume the following generative distributions:
$$p(h) = \prod_{j}^{d_h} \frac{\lambda}{2}\exp(-\lambda |h_j|)$$
$$\forall\, i = 1, \ldots, n: \quad p(x_i|h) = \mathcal{N}(x_i;\, W_x h + \mu_{x_i}, \sigma_{x_i}^2 I), \qquad p(y_i|h) = \mathcal{N}(y_i;\, W_y h + \mu_{y_i}, \sigma_{y_i}^2 I) \qquad (20)$$
In the case of two-view sparse coding, because we seek a sparse multi-view representation (i.e., one with many features set to zero), we are interested in recovering the MAP (maximum a posteriori) value of $h$, i.e., $h^* = \arg\max_h p(h|x, y)$, rather than its expected value $E[h|x, y]$. Under this interpretation, learning the dictionaries $W_x$ and $W_y$ proceeds as maximizing the likelihood of the data given these MAP values $h^*$:
$\arg\max_{W_x, W_y} \prod_i p(x_i|h^*)\, p(y_i|h^*)$ subject to the norm constraints on $W_x$ and $W_y$. Note that this parameter learning scheme, subject to the MAP values of the latent $h$, is not a standard practice in the probabilistic graphical model literature.
Typically, the likelihood of the multi-view data is maximized
directly. In the presence of latent variables, expectation maximiza-
tion is employed where the parameters are optimized with respect
to the marginal likelihood, i.e., summing or integrating the joint log-likelihood over all values of the latent variables under their posterior, rather than considering only the single MAP value of
h. The theoretical properties of this form of parameter learning
are not yet well understood but seem to work in practice (e.g.,
Gaussian mixture models).
One might expect multi-view sparse representations to significantly improve performance, especially when the features of different views are complementary to one another, and this indeed seems to be the case. There are numerous examples of its
successful applications as a multi-view feature learning scheme,
including human pose estimation [12], image classification [140],
web data mining [141], as well as cross-media retrieval [142, 143].
3.1.3 Multi-View Latent Space Markov Networks
Undirected graphical models, also called Markov random fields, have many special cases, including the exponential family harmonium [144] and the restricted Boltzmann machine [145].
Within the context of unsupervised multi-view feature learning,
Xing et al. [15] first introduce a particular form of multi-view
latent space Markov network model called multi-wing harmonium
model. This model can be viewed as an undirected counterpart of
the aforementioned directed aspect models such as multi-modal
LDA [11], with the advantages that inference is fast due to the
conditional independence of the hidden units and that topic mixing
can be achieved by document- and feature-specific combination of
aspects.
For simplicity, we begin with the dual-wing harmonium model, which consists of two modalities of input units $X = \{x_i\}_{i=1}^{n}$, $Y = \{y_j\}_{j=1}^{n}$, and a set of hidden units $H = \{h_k\}_{k=1}^{n}$. In this dual-wing harmonium, each modality of input units and the hidden units constructs a complete bipartite graph where units in the same set have no connections but are fully connected to units in the other set. In addition, there are no connections between the two input modalities. In particular, consider the case where all observed and hidden variables are from the exponential family; we have
$$p(x_i) = \exp\{\theta_i^\top \phi(x_i) - A(\theta_i)\}, \quad p(y_j) = \exp\{\eta_j^\top \psi(y_j) - B(\eta_j)\}, \quad p(h_k) = \exp\{\lambda_k^\top \varphi(h_k) - C(\lambda_k)\} \qquad (21)$$
where $\phi(\cdot)$, $\psi(\cdot)$, and $\varphi(\cdot)$ are potentials over cliques formed by individual nodes, $\theta_i$, $\eta_j$, and $\lambda_k$ are the associated weights of the potential functions, and $A(\cdot)$, $B(\cdot)$, and $C(\cdot)$ are log partition functions.
Through coupling the random variables in the log-domain and introducing additional terms, we obtain the joint distribution $p(X, Y, H)$ as follows:
$$p(X, Y, H) \propto \exp\Big\{\sum_i \theta_i^\top \phi(x_i) + \sum_j \eta_j^\top \psi(y_j) + \sum_k \lambda_k^\top \varphi(h_k) + \sum_{ik} \phi(x_i)^\top W_{ik}\, \varphi(h_k) + \sum_{jk} \psi(y_j)^\top U_{jk}\, \varphi(h_k)\Big\} \qquad (22)$$
where $\phi(x_i)\varphi(h_k)$ and $\psi(y_j)\varphi(h_k)$ are potentials over cliques consisting of pairwise linked nodes, and $W_{ik}$, $U_{jk}$ are the associated weights of the potential functions. From the joint distribution, we can derive the conditional distributions
$$p(x_i|H) \propto \exp\{\tilde{\theta}_i^\top \phi(x_i) - A(\tilde{\theta}_i)\}, \quad p(y_j|H) \propto \exp\{\tilde{\eta}_j^\top \psi(y_j) - B(\tilde{\eta}_j)\}, \quad p(h_k|X, Y) \propto \exp\{\tilde{\lambda}_k^\top \varphi(h_k) - C(\tilde{\lambda}_k)\} \qquad (23)$$
where the shifted parameters are $\tilde{\theta}_i = \theta_i + \sum_k W_{ik}\varphi(h_k)$, $\tilde{\eta}_j = \eta_j + \sum_k U_{jk}\varphi(h_k)$, and $\tilde{\lambda}_k = \lambda_k + \sum_i W_{ik}\phi(x_i) + \sum_j U_{jk}\psi(y_j)$.
In training probabilistic models, parameters are typically updated in order to maximize the likelihood of the training data. The update rules can be obtained by taking the derivative of the log-likelihood of the sample defined in Eq.(22) with respect to the
model parameters. The multi-wing model can be directly obtained
by extending the dual-wing model when the multi-modal input
data are observed.
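In the special case where all units are binary and the potentials $\phi, \psi, \varphi$ are identities, the conditionals in Eq.(23) become factorized Bernoulli distributions with sigmoid means, so inference can alternate between the hidden layer and the two input wings by block Gibbs sampling. The sketch below is our illustration under that assumption, not code from [15].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dual_wing_gibbs_step(x, y, theta, eta, lam, W, U, rng):
    """One block-Gibbs step for an all-binary dual-wing harmonium, using the
    shifted parameters of Eq. (23)."""
    lam_tilde = lam + W.T @ x + U.T @ y            # shifted hidden parameters
    h = (rng.rand(len(lam)) < sigmoid(lam_tilde)).astype(float)
    theta_tilde = theta + W @ h                    # shifted image-wing parameters
    eta_tilde = eta + U @ h                        # shifted text-wing parameters
    x_new = (rng.rand(len(theta)) < sigmoid(theta_tilde)).astype(float)
    y_new = (rng.rand(len(eta)) < sigmoid(eta_tilde)).astype(float)
    return x_new, y_new, h
```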
Further, Chen et al. [16] present a multi-view latent space
Markov network and its large-margin extension that satisfies a
weak conditional independence assumption that data from differ-
ent views and the response variables are conditionally indepen-
dent given a set of latent variables. In addition, Xie and Xing
[146] propose a multi-modal distance metric learning (MMDML)
framework based on the multi-wing harmonium model and metric
learning method by [147]. This MMDML provides a principled
way to embed data of arbitrary modalities into a single latent space
where distance supervision is leveraged.
3.2 Directly Learning A Parametric Embedding from
Multi-View Input to Representation
From the framework of multi-view probabilistic models discussed
in the previous section, we can see that the learned representation
is usually associated with shared latent variables, specifically with
their posterior distribution given the multi-view observed input.
Further, this posterior distribution often becomes complicated and
intractable if the designed models have hierarchical structures. It
has to resort to sampling or approximate inference techniques, which incur the associated computational cost and approximation error. In addition, a posterior distribution over shared
latent variables is not yet a reasonable feature vector that can be
directly fed to the classifier. Therefore, if we intend to obtain stable deterministic feature values, an alternative non-probabilistic multi-view embedding learning paradigm is to directly parameterize the feature or representation functions. The common perspective among these methods is that they learn a direct encoding for multi-view input. Consequently, in this section we review the related work from this perspective.
3.2.1 Partial Least Squares
Partial Least Squares (PLS) [148–150] is a wide class of methods
for modeling relations between sets of observed variables. It has
been a popular tool for regression and classification as well as
dimensionality reduction, especially in the field of chemometrics
[151, 152]. The underlying assumption of all PLS methods is
that the observed data are generated by a process which is driven
by a small number of latent variables. In particular, PLS creates
orthogonal latent vectors by maximizing the covariance between
different sets of variables.
Given a pair of datasets $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d_x \times n}$ and $Y = [y_1, \ldots, y_n] \in \mathbb{R}^{d_y \times n}$, a $k$-dimensional PLS solution can
be parameterized by a pair of matrices $W_x \in \mathbb{R}^{d_x \times k}$ and $W_y \in \mathbb{R}^{d_y \times k}$ [47]. The PLS problem can now be expressed as:
$$\max_{W_x, W_y} \; \mathrm{tr}\left(W_x^\top C_{xy} W_y\right) \quad \text{s.t.} \quad W_x^\top W_x = I, \; W_y^\top W_y = I. \qquad (24)$$
It can be shown that the columns of the optimal $W_x$ and $W_y$ correspond to the singular vectors of the cross-covariance matrix $C_{xy} = E[xy^\top]$. Like the CCA objective, PLS is also an optimization of an expectation subject to fixed constraints.
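Since the optimal $W_x$ and $W_y$ in Eq.(24) are given by the leading singular vectors of $C_{xy}$, a $k$-dimensional PLS solution can be computed directly with an SVD, as in the following sketch (ours; the data matrices are assumed to store examples column-wise, as above).

```python
import numpy as np

def pls_svd(X, Y, k):
    """k-dimensional PLS solution of Eq. (24) via an SVD of the cross-covariance.
    X is (d_x, n), Y is (d_y, n)."""
    n = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)
    Yc = Y - Y.mean(axis=1, keepdims=True)
    Cxy = Xc @ Yc.T / n
    U, S, Vt = np.linalg.svd(Cxy)
    return U[:, :k], Vt.T[:, :k], S[:k]   # Wx, Wy, singular values (covariances)
```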
In essence, CCA finds the directions of maximum correlation
while PLS finds the directions of maximum covariance. Covari-
ance and correlation are two different statistical measures for de-
scribing how variables covary. It has been shown that there are close connections between PLS and CCA in several aspects [149, 152].
Guo and Mu [153] investigate the CCA based methods, including
linear CCA, regularized CCA, and kernel CCA, and compare
to the PLS models in solving the joint estimation problem. In
particular, they provide a consistent ranking of the above methods
in estimating age, gender, and ethnicity.
Further, Li et al. [154] introduce a least-squares form of PLS, called cross-modal factor analysis (CFA). CFA aims at finding orthogonal transformation matrices $W_x$ and $W_y$ that minimize the following expression:
$$\|X^\top W_x - Y^\top W_y\|_F^2 \quad \text{subject to:} \quad W_x^\top W_x = I, \; W_y^\top W_y = I \qquad (25)$$
where $\|\cdot\|_F$ denotes the Frobenius norm. It can be easily verified
that the above optimization problem in Eq.(25) has the same
solution as that of PLS. Several extensions of CFA are presented
by incorporating non-linearity and supervised information [155–
157].
3.2.2 Multi-View Discriminant Analysis
CCA and its kernel extensions are used to match two sets of features by maximizing the correlation between them, which is an unsupervised scenario. A number of multi-
view analysis approaches [158–160] have also been proposed by
considering the scenarios in which the data have multiple different
views, along with supervised information.
Given a pair of datasets $X = [x_1, \ldots, x_n]$ and $Y = [y_1, \ldots, y_n]$ with the corresponding labels $l = (l_1, \ldots, l_n)$, the two-view regularized Fisher discriminant analysis (FDA) [158] chooses two sets of weights $w_x$ and $w_y$ to solve the following optimization problem:
$$\rho = \frac{w_x^\top X l l^\top Y^\top w_y}{\sqrt{\left(w_x^\top X B X^\top w_x + \mu\|w_x\|^2\right)\left(w_y^\top Y B Y^\top w_y + \mu\|w_y\|^2\right)}} \qquad (26)$$
where $w_x$ and $w_y$ are the weight vectors for each view, and $B$ is a matrix incorporating the label information and the balance of the datasets. Since the equation is not affected by rescaling of $w_x$ or $w_y$, the optimization is subject to the following constraints:
$$w_x^\top X B X^\top w_x + \mu\|w_x\|^2 = 1, \qquad w_y^\top Y B Y^\top w_y + \mu\|w_y\|^2 = 1 \qquad (27)$$
By introducing the Lagrange multipliers $\lambda_x$ and $\lambda_y$, the corresponding Lagrangian for this optimization is:
$$L = w_x^\top X l l^\top Y^\top w_y - \frac{\lambda_x}{2}\left(w_x^\top X B X^\top w_x + \mu\|w_x\|^2 - 1\right) - \frac{\lambda_y}{2}\left(w_y^\top Y B Y^\top w_y + \mu\|w_y\|^2 - 1\right), \qquad (28)$$
which can be solved by differentiation with respect to the weight
vectors wxand wy.
By substituting the two weight vectors with $w_x = X\alpha$ and $w_y = Y\beta$, and using a shared regularization parameter $\kappa$ for both sets of weights, we obtain
$$\rho = \frac{\alpha^\top X^\top X l l^\top Y^\top Y \beta}{\sqrt{\left(\alpha^\top X^\top X B X^\top X \alpha + \kappa\|w_x\|^2\right)\left(\beta^\top Y^\top Y B Y^\top Y \beta + \kappa\|w_y\|^2\right)}} \qquad (29)$$
Then the kernel form is:
$$\rho = \frac{\alpha^\top K_x l l^\top K_y \beta}{\sqrt{\left(\alpha^\top K_x B K_x \alpha + \kappa\, \alpha^\top K_x \alpha\right)\left(\beta^\top K_y B K_y \beta + \kappa\, \beta^\top K_y \beta\right)}} \qquad (30)$$
As in the primal form, the equation is not affected by rescaling of $w_x$ or $w_y$; the optimization is subject to the following constraints:
$$\alpha^\top K_x B K_x \alpha + \kappa\, \alpha^\top K_x \alpha = 1, \qquad \beta^\top K_y B K_y \beta + \kappa\, \beta^\top K_y \beta = 1 \qquad (31)$$
where $K_x$ and $K_y$ are the corresponding kernels for the two views. The corresponding Lagrangian for this optimization can be written as
$$L = \alpha^\top K_x l l^\top K_y \beta - \frac{\lambda_x}{2}\left(\alpha^\top K_x B K_x \alpha + \kappa\, \alpha^\top K_x \alpha - 1\right) - \frac{\lambda_y}{2}\left(\beta^\top K_y B K_y \beta + \kappa\, \beta^\top K_y \beta - 1\right) \qquad (32)$$
By differentiating with respect to the weight vectors $\alpha$ and $\beta$, we
can solve the above problem. In addition, based on the approach
outlined in [161], both regularized two-view FDA and its kernel extension can be cast as equivalent disciplined convex optimization problems. Consequently, Diethe et al. [159] introduce Multi-
view Fisher Discriminant Analysis (MFDA) that learns classifiers
in multiple views, by minimizing the variance of the data along
the projection while maximizing the distance between the average
outputs for classes over all of the views.
However, MFDA can only be used for binary classification
problems. In [160], Kan et al. propose a Multi-view Discriminant Analysis (MvDA) method, which seeks a discriminant common space by maximizing the between-class variations and minimizing the within-class variations across all the views. Later, based on
bilinear models [162] and general graph embedding framework
[163], Sharma et al. [164] introduce Generalized Multi-view
Analysis (GMA). As an instance of GMA, Generalized Multi-view
Linear Discriminant Analysis (GMLDA) finds a set of projection
directions in each view that tries to separate different contents’
class means and unify different views of the same class in the
common subspace.
3.2.3 Cross-Modal Hashing
CCA and its extensions have been widely applied to conduct cross-view similarity search [25, 38, 95]. A promising way to
speed up the cross-view similarity search is the hashing technique
which makes a tradeoff between accuracy and efficiency. Hence,
several multi-modal hashing methods have been proposed for fast
similarity search in multi-modal data [165–174]. The principle of
the multi-modal hashing methods is to map the high-dimensional multi-modal data into common hash codes so that similar cross-
modal data objects have the same or similar hash codes.
Bronstein et al. [165] propose a hashing-based model, called
cross-modal similarity sensitive hashing (CMSSH), which ap-
proaches the cross-modality similarity learning problem by em-
bedding the multi-modal data into a common metric space. The
similarity is parameterized by the embedding itself. The goal of
cross-modality similarity learning is to construct a similarity function between points from different spaces, $X \subseteq \mathbb{R}^{d_1}$ and $Y \subseteq \mathbb{R}^{d_2}$. Assume that the unknown binary similarity function is $s : X \times Y \to \{\pm 1\}$; classical cross-modality similarity learning aims at finding a binary similarity function $\hat{s}$ on $X \times Y$ approximating $s$. Recent work attempts to solve the problem of cross-modality similarity learning as a multi-view representation learning problem.
In particular, CMSSH proposes to construct two maps $\xi : X \to \mathbb{H}^n$ and $\eta : Y \to \mathbb{H}^n$, where $\mathbb{H}^n$ denotes the $n$-dimensional Hamming space. Such mappings encode the multi-modal data into two $n$-bit binary strings so that $d_{\mathbb{H}^n}(\xi(x), \eta(y))$ is small for $s(x, y) = +1$ and large for $s(x, y) = -1$ with high probability. Consequently, this Hamming embedding can be interpreted as cross-modal similarity-sensitive hashing, under which positive pairs have a high collision probability, while negative pairs are unlikely to collide. Such hashing also acts as a way of multi-modal dimensionality reduction when $d_1, d_2 \gg n$.
The $n$-dimensional Hamming embedding for $X$ can be thought of as a vector $\xi(x) = (\xi_1(x), \ldots, \xi_n(x))$ of binary embeddings of the form
$$\xi_i(x) = \begin{cases} 0 & \text{if } f_i(x) \le 0, \\ 1 & \text{if } f_i(x) > 0, \end{cases} \qquad (33)$$
parameterized by a projection $f_i : X \to \mathbb{R}$. Similarly, $\eta_i$ is a binary map parameterized by a projection $g_i : Y \to \mathbb{R}$. Following the greedy approach [175], the Hamming metric can be constructed sequentially as a superposition of weak binary classifiers on pairs of data points,
$$h_i(x, y) = \begin{cases} +1 & \text{if } \xi_i(x) = \eta_i(y), \\ -1 & \text{otherwise}, \end{cases} \;=\; (2\xi_i(x) - 1)(2\eta_i(y) - 1). \qquad (34)$$
Here, a simple strategy for the maps is affine projection, such as $f_i(x) = p_i^\top x + a_i$ and $g_i(y) = q_i^\top y + b_i$. It can be extended to more complex projections easily.
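The sketch below (ours; the projection parameters `P`, `a`, `Q`, `b` are hypothetical placeholders that CMSSH would learn through its boosting procedure) shows how Eq.(33) turns affine projections into $n$-bit codes and how cross-modal similarity is then scored by Hamming distance.

```python
import numpy as np

def hash_codes(X, P, a):
    """n-bit codes xi(x) from affine projections f_i(x) = p_i^T x + a_i, Eq. (33).
    X is (num_samples, d), P is (d, n_bits), a is (n_bits,)."""
    return (X @ P + a > 0).astype(np.uint8)

def hamming_distance(cx, cy):
    """Per-pair Hamming distance between code matrices of the two modalities."""
    return np.count_nonzero(cx != cy, axis=1)

# Usage sketch:
# codes_x = hash_codes(X, P, a)      # image modality
# codes_y = hash_codes(Y, Q, b)      # text modality
# dists = hamming_distance(codes_x, codes_y)
```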
Observing the resemblance to sequential binary classifiers, boosted cross-modality similarity learning algorithms are introduced based on the standard AdaBoost procedure [176]. CMSSH
has shown its utility and efficiency in several multi-view learning
applications including cross-representation shape retrieval and
alignment of multi-modal medical images.
However, CMSSH only considers the inter-view correlation
but ignores the intra-view similarity [177]. Kumar and Udupa
[166] extend Spectral Hashing [178] from the single view setting
to the multi-view scenario and present cross view hashing (CVH),
which attempts to find hash functions that map similar objects to
similar codewords over all the views so that inter-view and intra-
view similarities are both preserved. Gong and Lazebnik [167] combine iterative quantization with CCA to exploit cross-modal embeddings for learning similarity-preserving binary codes. Subsequently, Zhen and Yeung [168] present co-regularized hashing (CRH) for multi-modal data based on a boosted co-regularization framework. The hash functions for each bit of the hash codes are learned by solving DC (difference of convex functions) programs, while the learning for multiple bits is performed via a boosting procedure. Later, Zhu et al. [169] introduce linear cross-modal hashing (LCMH), which adopts a two-stage strategy to learn cross-view hashing functions. The data within each modality are first encoded into a low-rank representation using the idea of anchor graphs [179], and then hash functions for each modality are learned to map each modality's low-rank space into a shared
Hamming space. Song et al. [180] introduce an inter-media hash-
ing (IMH) model by jointly capturing inter-media and intra-media
consistency.
4 DEEP METHODS ON MULTI-VIEW REPRESENTATION LEARNING
Inspired by the success of deep neural networks [5, 6, 122], a
variety of deep multi-view feature learning methods have been
proposed to capture the high-level correlation between multi-view
data. In this section, we continue to review the deep multi-view representation models from the probabilistic and direct parametric embedding perspectives.
4.1 Probabilistic Models
A restricted Boltzmann machine (RBM) [181] is an undirected graphical model that can learn the distribution of training data. The model consists of stochastic visible units $\mathbf{v} \in \{0,1\}^{d_v}$ and stochastic hidden units $\mathbf{h} \in \{0,1\}^{d_h}$, with an energy function $E: \{0,1\}^{d_v + d_h} \rightarrow \mathbb{R}$ given by
$$E(\mathbf{v}, \mathbf{h}; \theta) = -\sum_{i=1}^{d_v}\sum_{j=1}^{d_h} v_i W_{ij} h_j - \sum_{i=1}^{d_v} b_i v_i - \sum_{j=1}^{d_h} a_j h_j \qquad (35)$$
where $\theta = \{\mathbf{a}, \mathbf{b}, W\}$ are the model parameters. Consequently, the joint distribution over the visible and hidden units is defined by:
$$P(\mathbf{v}, \mathbf{h}; \theta) = \frac{1}{Z(\theta)} \exp\left(-E(\mathbf{v}, \mathbf{h}; \theta)\right). \qquad (36)$$
When modeling real-valued or sparse count visible data, this RBM can be easily extended to corresponding variants, e.g., the Gaussian RBM [144] and the replicated softmax RBM [182].
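For concreteness, the short sketch below evaluates the energy of Eq. (35) and the corresponding unnormalized probability $\exp(-E)$ for a random binary configuration; the layer sizes and parameter values are arbitrary, and the partition function $Z(\theta)$ of Eq. (36) is left out since it sums over all $2^{d_v + d_h}$ joint configurations.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbm_energy(v, h, W, b, a):
    """Binary RBM energy of Eq. (35): E = -v^T W h - b^T v - a^T h."""
    return -(v @ W @ h) - b @ v - a @ h

def unnormalized_prob(v, h, W, b, a):
    """exp(-E); dividing by the partition function Z gives the joint distribution of Eq. (36)."""
    return np.exp(-rbm_energy(v, h, W, b, a))

d_v, d_h = 6, 4
W = 0.1 * rng.normal(size=(d_v, d_h))
b, a = np.zeros(d_v), np.zeros(d_h)
v = rng.integers(0, 2, size=d_v)
h = rng.integers(0, 2, size=d_h)
print(rbm_energy(v, h, W, b, a), unnormalized_prob(v, h, W, b, a))
```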
A deep Boltzmann machine (DBM) is a generative network of stochastic binary units. It consists of a set of visible units $\mathbf{v} \in \{0,1\}^{d_v}$ and a sequence of layers of hidden units $\mathbf{h}^{(1)} \in \{0,1\}^{d_{h_1}}, \mathbf{h}^{(2)} \in \{0,1\}^{d_{h_2}}, \ldots, \mathbf{h}^{(L)} \in \{0,1\}^{d_{h_L}}$, where connections are only allowed between hidden units in adjacent layers. Let us take a DBM with two hidden layers as an example. Ignoring bias terms, the energy of the joint configuration $\{\mathbf{v}, \mathbf{h}\}$ is defined as
$$E(\mathbf{v}, \mathbf{h}; \theta) = -\mathbf{v}^{\top} W^{(1)} \mathbf{h}^{(1)} - \mathbf{h}^{(1)\top} W^{(2)} \mathbf{h}^{(2)} \qquad (37)$$
where $\mathbf{h} = \{\mathbf{h}^{(1)}, \mathbf{h}^{(2)}\}$ represents the set of hidden units, and $\theta = \{W^{(1)}, W^{(2)}\}$ are the model parameters that denote the visible-to-hidden and hidden-to-hidden symmetric interaction terms. Further, this binary-to-binary DBM can also easily be extended to model dense real-valued or sparse count data.
Fig. 4. The graphical model of deep multi-modal RBM (adapted from
[17]), which models the joint distribution over image and text inputs.
By extending the setup of the DBM, Srivastava and Salakhut-
dinov [17] propose a deep multi-modal RBM to model the rela-
tionship between image and text. In particular, each data modality
is modeled using a separate two-layer DBM and then an additional
layer of binary hidden units on top of them is added to learn the
shared representation.
Let $\mathbf{v}_m \in \mathbb{R}^{d_{v_m}}$ denote an image input and $\mathbf{v}_t \in \mathbb{R}^{d_{v_t}}$ denote a text input. Ignoring bias terms on the hidden units for clarity, the distribution of $\mathbf{v}_m$ in the image-specific two-layer DBM is given as follows:
$$P(\mathbf{v}_m; \theta) = \sum_{\mathbf{h}^{(1)}, \mathbf{h}^{(2)}} P(\mathbf{v}_m, \mathbf{h}^{(1)}, \mathbf{h}^{(2)}; \theta) = \frac{1}{Z(\theta)} \sum_{\mathbf{h}^{(1)}, \mathbf{h}^{(2)}} \exp\Bigg( -\sum_{i=1}^{d_{v_m}} \frac{(v_{mi} - b_i)^2}{2\sigma_i^2} + \sum_{i=1}^{d_{v_m}} \sum_{j=1}^{d_{h_1}} \frac{v_{mi}}{\sigma_i} W_{ij}^{(1)} h_j^{(1)} + \sum_{j=1}^{d_{h_1}} \sum_{l=1}^{d_{h_2}} h_j^{(1)} W_{jl}^{(2)} h_l^{(2)} \Bigg). \qquad (38)$$
Similarly, the text-specific two-layer DBM can be defined by combining a replicated softmax model with a binary RBM. The deep multi-modal DBM is then formed by combining the image-specific and text-specific two-layer DBMs with an additional layer of binary hidden units on top of them.
The particular graphical model is shown in Figure 4. The joint
distribution over the multi-modal input can be written as:
$$P(\mathbf{v}_m, \mathbf{v}_t; \theta) = \sum_{\mathbf{h}_m^{(2)}, \mathbf{h}_t^{(2)}, \mathbf{h}^{(3)}} P\big(\mathbf{h}_m^{(2)}, \mathbf{h}_t^{(2)}, \mathbf{h}^{(3)}\big) \sum_{\mathbf{h}_m^{(1)}} P\big(\mathbf{v}_m, \mathbf{h}_m^{(1)}, \mathbf{h}_m^{(2)}\big) \sum_{\mathbf{h}_t^{(1)}} P\big(\mathbf{v}_t, \mathbf{h}_t^{(1)}, \mathbf{h}_t^{(2)}\big) \qquad (39)$$
Like RBM, exact maximum likelihood learning in this model
is also intractable, while efficient approximate learning can be
implemented by using mean-field inference to estimate data-
dependent expectations, and an MCMC based stochastic approxi-
mation procedure to approximate the model’s expected sufficient
statistics [6].
Multi-modal DBM has been widely used for multi-view rep-
resentation learning [183–185]. Hu et al. [184] employ the multi-
modal DBM to learn joint representation for predicting answers
in cQA portal. Ge et al. [185] apply the multi-modal RBM to de-
termining information trustworthiness, in which the learned joint
representation denotes the consistent latent reasons that underlie users' ratings from multiple sources.
Fig. 5. The bimodal deep autoencoder (adapted from [18]).
Pang and Ngo [186] propose
to learn a joint density model for emotion prediction in user-
generated videos with a deep multi-modal Boltzmann machine.
This multi-modal DBM is exploited to model the joint distribution
over visual, auditory, and textual features. Here a Gaussian RBM is used to model the distributions over the visual and auditory features, and a replicated softmax topic model is applied to mine the textual features.
Further, Sohn et al. [187] investigate an improved multi-modal
RBM framework via minimizing the variation of information
between data modalities through the shared latent representation.
Recently, Ehrlich et al. [188] present a multi-task multi-modal RBM (MTM-RBM) approach for facial attribute classification. In particular, a multi-task RBM is proposed by extending the formulation of the discriminative RBM [189] to account for multiple tasks. The multi-task RBM is then naturally extended to MTM-RBM by combining a collection of unimodal MT-RBMs, one for each visible modality. Unlike the typical multi-modal RBM,
the learned joint feature representation of MTM-RBM enables
interaction between different tasks so that it is suited for multiple
attribute classification.
4.2 Directly Deep Parametric Embedding from Multi-View Inputs to Representation
With the development of deep neural networks, many shallow multi-view embedding methods have been extended to their deep counterparts. In this section, we review the deep multi-view feature learning methods from the direct parametric embedding perspective.
4.2.1 Multi-Modal Deep Autoencoders
Although multi-modal RBMs achieve great success in learning a shared representation, they still have limitations, e.g., there is no explicit objective for the models to discover correlations across the modalities, such that some hidden units are tuned only for one modality while others are tuned only for the other.
Multi-modal deep autoencoders [18, 190–192] gradually become
good alternatives for learning a shared representation between
modalities due to the sufficient flexibility of their objectives.
Inspired by denoising autoencoders [5], Ngiam et al. [18]
propose to extract shared representations via training a bimodal
deep autoencoder (Figure 5) using an augmented but noisy dataset
with additional examples that have only a single modality as input.
Fig. 6. The correspondence deep autoencoder (adapted from [19]).
The key idea is to use greedy layer-wise training with an extension
to RBMs with sparsity [193] followed by fine-tuning.
Further, Feng et al. [19] propose a correspondence autoencoder
(Corr-AE) via constructing correlations between hidden repre-
sentations of two uni-modal deep autoencoders. The details of
the architecture of the basic Corr-AE is shown in Figure 6. As
illustrated in Figure 6, this Corr-AE architecture consists of two
subnetworks (each a basic deep autoencoder) that are connected
by a predefined similarity measure on a specific coder layer. Each
subnetwork in the Corr-AE serves for each modality.
Let $f(x; W_f)$ and $g(y; W_g)$ denote the mappings from the inputs $\{X, Y\}$ to the code layers, respectively, and let $\theta = \{W_f, W_g\}$ denote the weight parameters of these two networks. Here the similarity measure between the $i$-th pair of image feature $x_i$ and the given text feature $y_i$ is defined as follows:
$$C(x_i, y_i; \theta) = \| f(x_i; W_f) - g(y_i; W_g) \|_2^2 \qquad (40)$$
where $f$ and $g$ are logistic activation functions.
Consequently, the loss function on any pair of inputs is then defined as follows:
$$L(x_i, y_i; \theta) = (1 - \alpha)\left(L_I(x_i, y_i; \theta) + L_T(x_i, y_i; \theta)\right) + \alpha L_C(x_i, y_i; \theta) \qquad (41)$$
where
$$L_I(x_i, y_i; \theta) = \| x_i - \hat{x}_i \|_2^2, \quad L_T(x_i, y_i; \theta) = \| y_i - \hat{y}_i \|_2^2, \quad L_C(x_i, y_i; \theta) = C(x_i, y_i; \theta).$$
$L_I$ and $L_T$ are the losses caused by data reconstruction errors for the given inputs of the two subnetworks, specifically the image and text modalities; $L_C$ is the correlation loss, and $\alpha$ is a trade-off parameter between the two groups of objectives. $\hat{x}_i$ and $\hat{y}_i$ are the reconstructions of $x_i$ and $y_i$, respectively.
Overall, optimizing the objective in Eq.(41) enables Corr-AE
to learn similar representations from bimodal features. Besides,
based on two other multi-modal autoencoders [18], Corr-AE is
extended to two other correspondence deep models, called Corr-
Cross-AE and Corr-Full-AE.
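To make the objective concrete, the following minimal sketch evaluates the Corr-AE loss of Eqs. (40)–(41) for a single image-text pair, assuming one-layer logistic encoders and linear decoders with random weights; the shapes, the trade-off value, and helper names such as corr_ae_loss are hypothetical and only show how the reconstruction and correlation terms are combined.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def corr_ae_loss(x, y, Wf, Wg, Uf, Ug, alpha=0.2):
    """Corr-AE objective of Eq. (41) for one (image, text) pair."""
    fx, gy = sigmoid(Wf @ x), sigmoid(Wg @ y)      # code-layer representations (Eq. 40)
    x_hat, y_hat = Uf @ fx, Ug @ gy                # reconstructions from the code layers
    L_I = np.sum((x - x_hat) ** 2)                 # image reconstruction loss
    L_T = np.sum((y - y_hat) ** 2)                 # text reconstruction loss
    L_C = np.sum((fx - gy) ** 2)                   # correlation (similarity) loss
    return (1 - alpha) * (L_I + L_T) + alpha * L_C

d_x, d_y, d_code = 20, 10, 8
Wf, Wg = 0.1 * rng.normal(size=(d_code, d_x)), 0.1 * rng.normal(size=(d_code, d_y))
Uf, Ug = 0.1 * rng.normal(size=(d_x, d_code)), 0.1 * rng.normal(size=(d_y, d_code))
x, y = rng.normal(size=d_x), rng.normal(size=d_y)
print(corr_ae_loss(x, y, Wf, Wg, Uf, Ug))
```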
Subsequently, Silberer and Lapata train a stacked multi-modal autoencoder with a semi-supervised objective to learn grounded meaning representations. Wang et al. [194] propose an effective
mapping mechanism based on stacked autoencoder for multi-
modal retrieval. In particular, the cross-modal mapping functions
are learned by optimizing an objective function which captures
both intra-modal and inter-modal semantic relationships of data
from heterogeneous sources.
Further, based on CCA and deep autoencoders, Wang et al.
[20] propose deep canonically correlated autoencoders that also
consist of two autoencoders and optimize the combination of
canonical correlations between the learned bottleneck representa-
tions and the reconstruction errors of the autoencoders. Intuitively,
this is the same principle as that of Corr-AE. The difference between them lies in the different similarity measures.
Recently, Rastegar et al. [195] suggest exploiting the cross weights between the representations of the modalities to gradually learn the interactions of the modalities in a multi-modal deep autoencoder network. Theoretical analysis shows that considering these interactions in a deep network manner (from low to high level) provides more intra-modality information. As opposed to the existing deep multi-modal autoencoders, this approach attempts to reconstruct the representation of each modality at a given layer from the representations of the other modalities in the previous layer.
4.2.2 Deep Cross-View Embedding Models
Deep cross-view embedding models have become increasingly
popular in the applications including cross-media retrieval [196,
197] and multi-modal distributional semantic learning [198, 199].
Frome et al. [196] propose a deep visual-semantic embedding
model (DeViSE), which connects two deep neural networks by
a cross-modal mapping. As shown in Figure 7, DeViSE is first initialized with a pre-trained neural network language model [200] and a pre-trained deep visual object categorization network [122]. Then a linear transformation is exploited to map the representation at the top of the core visual model into the dense vector representations learned by the neural language model.
Following the setup of loss function in [201], DeViSE employs
a combination of dot-product similarity and hinge rank loss so
that the model has the ability of producing a higher dot-product
similarity between the visual model output and the vector repre-
sentation of the correct label than between the visual output and
the other randomly chosen text terms. The per training example
hinge rank loss is defined as follows:
$$\sum_{j \ne \text{label}} \max\left[0, \text{margin} - \vec{t}_{\text{label}} M \vec{v}(\text{image}) + \vec{t}_j M \vec{v}(\text{image})\right] \qquad (42)$$
where $\vec{v}(\text{image})$ is a column vector denoting the output of the top layer of the core visual network for the given image, $M$ is the mapping matrix of the linear transformation layer, $\vec{t}_{\text{label}}$ is a row vector denoting the learned embedding vector for the provided text label, and $\vec{t}_j$ are the embeddings of the other text terms. The DeViSE model is trained by asynchronous stochastic gradient descent on a distributed computing platform [202].
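A minimal sketch of the per-example hinge rank loss in Eq. (42) is given below; the dimensionalities, the margin, and the random vectors standing in for the visual-network output and the word embeddings are illustrative assumptions rather than the trained components of DeViSE.

```python
import numpy as np

rng = np.random.default_rng(0)

def devise_hinge_rank_loss(v_image, t_label, T_other, M, margin=0.1):
    """Per-example hinge rank loss of Eq. (42)."""
    mapped = M @ v_image          # project the visual output into the word-embedding space
    pos = t_label @ mapped        # dot-product similarity with the correct label
    negs = T_other @ mapped       # similarities with the other (contrastive) text terms
    return np.sum(np.maximum(0.0, margin - pos + negs))

d_visual, d_text, n_neg = 128, 50, 9
M = 0.01 * rng.normal(size=(d_text, d_visual))   # linear transformation layer
v_image = rng.normal(size=d_visual)              # top-layer output of the core visual network
t_label = rng.normal(size=d_text)                # embedding of the correct label
T_other = rng.normal(size=(n_neg, d_text))       # embeddings of randomly chosen other terms
print(devise_hinge_rank_loss(v_image, t_label, T_other, M))
```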
Inspired by the success of DeViSE, Norouzi et al. [203]
propose the convex combination of semantic embeddings (ConSE) model for mapping images into continuous semantic embedding spaces. Unlike DeViSE, the ConSE model keeps the softmax
layer of the convolutional net intact. Given a test image, ConSE
simply runs the convolutional classifier and considers the convex
combination of the semantic embedding vectors from the top $T$ predictions as its corresponding semantic embedding vector.
Further, Fang et al. [204] develop a deep multi-modal similarity
model that learns two neural networks to map images and text
fragments to a common vector representation.
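The ConSE construction itself is simple enough to summarize in a few lines: form the convex combination of the label embeddings associated with the top-$T$ classifier outputs. In the sketch below, the classifier probabilities and the embedding table are random placeholders for a trained convolutional classifier and pre-trained word embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

def conse_embedding(class_probs, class_embeddings, top_T=5):
    """Convex combination of the label embeddings of the top-T classifier predictions."""
    top = np.argsort(class_probs)[::-1][:top_T]
    weights = class_probs[top] / class_probs[top].sum()   # renormalize over the top T classes
    return weights @ class_embeddings[top]                # weighted average embedding

n_classes, d_embed = 1000, 300
class_embeddings = rng.normal(size=(n_classes, d_embed))  # stand-in for word vectors of class names
logits = rng.normal(size=n_classes)
class_probs = np.exp(logits) / np.exp(logits).sum()       # stand-in for the convnet softmax output
print(conse_embedding(class_probs, class_embeddings).shape)
```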
With the development of multi-modal distributional semantic
models [205–207], deep cross-modal mapping is naturally ex-
Fig. 7. The DeViSE model (adapted from [196]), which is initialized
with parameters pre-trained at the lower layers of the visual object
categorization network and the skip-gram language model.
ploited to learn the improved multimodal distributed semantic
representation. Lazaridou et al. [199] introduce multimodal skip-
gram models to extend the skip-gram model of [200] by taking
visual information into account. In this extension, for a subset of
the target words, relevant visual evidence from natural images
is presented together with the corpus contexts. In particular,
the model is designed to encourage the propagation of visual
information to all words via a deep cross-modal ranking so that
it can improve image labeling and retrieval in the zero-shot setup,
where the test concepts are never seen during model training.
Further, Dong et al. [197] present Word2VisualVec, a deep neural
network architecture that learns to predict a deep visual encoding
of textual input, and thus enables cross-media retrieval in a visual
space.
In addition, Xu et al. [208] propose a unified framework that
jointly models video and the corresponding text sentence. In this
joint architecture, the goal is to learn a function $f(V): V \rightarrow T$, where $V$ represents the low-level features extracted from video and $T$ is the high-level text description of the video. A joint model $P$ is designed to connect these two levels of information. It consists of three parts: a compositional language model $M_L: T \rightarrow T_f$, a deep video model $M_V: V \rightarrow V_f$, and a joint embedding model $E(V_f, T_f)$, such that
$$P: M_V(V) \rightarrow V_f \Longleftrightarrow E(V_f, T_f) \Longleftrightarrow T_f \leftarrow M_L(T) \qquad (43)$$
where $V_f$ and $T_f$ are the outputs of the deep video model and the compositional language model, respectively. In this joint embedding model, the distance between the outputs of the deep video model and the compositional language model in the joint space is minimized to align them.
4.2.3 Deep Multi-Modal Hashing
Motivated by the recent advance of deep learning for multi-modal
data [17, 18, 196], deep neural networks have gradually been
exploited to learn hash functions. Kang et al. [209] introduce a
deep multi-view hashing method in which each layer of hidden
nodes consists of view-specific and shared hidden nodes to learn
individual and shared hidden spaces from multiple views of data.
Subsequently, Zhuang et al. [210] propose a cross-media hashing approach based on a correspondence multi-modal neural network, referred to as Cross-Media Neural Network Hashing (CMNNH). The network structure of CMNNH can be considered as a combination of two modality-specific neural networks with an additional correspondence layer, as shown in Figure 8.
Fig. 8. The illustration of the cross-media neural network hashing architecture (adapted from [210]).
Denote the two neural networks corresponding to the multi-modal input $\{X, Y\}$ as $NN_x$ and $NN_y$, respectively. Each $x \in X$ (and analogously each $y \in Y$) is forwarded layer-by-layer through $NN_x$ to generate the representation of each layer. The $l$-th layer takes $h_l$ as input and uses a projection function to transform it to $h_{l+1}$ in the next layer:
$$h_{l+1} = f_l(W_l h_l) \qquad (44)$$
where $h_l$ and $h_{l+1}$ are the feature representations in the $l$-th and $(l+1)$-th hidden layers, respectively, $W_l$ is the projection matrix, and $f_l$ is the activation function.
From the perspective of the hash function $H_x$, it takes $x$ as input, forwards $x$ to the hash code layer, and outputs the $k$-dimensional binary hash codes:
$$H_x(x) = \mathrm{sign}(\tilde{x}) \qquad (45)$$
where $\tilde{x} \in \mathbb{R}^k$ is the $k$-dimensional, real-valued vector of the hash code layer, which is converted to a binary hash code by the sign function. The hash function $H_y$ is formulated by analogy. However, the sign function is not differentiable and is thus hard to optimize directly. Following the setup of most existing hashing methods [165, 178], the sign function can be removed at the hash function learning stage and added at the testing stage.
Further, to preserve the inter-modal pairwise correspondence between $NN_x$ and $NN_y$, a loss function is defined based on the least square error of the pairwise inter-modal correspondence:
$$L_1(x, y) = \frac{1}{2} \| \tilde{x} - \tilde{y} \|_F^2 \qquad (46)$$
Besides, to preserve the intra-modal discriminative capability, a softmax regression function is employed as the loss function on the output layer as follows:
$$L_2(x, y, t) = \mathrm{KL}(\hat{t}_x, t) + \mathrm{KL}(\hat{t}_y, t) \qquad (47)$$
where $t$ is the class label for $x$ and $y$, $\hat{t}_x$ and $\hat{t}_y$ are the values estimated by $NN_x$ and $NN_y$, respectively, and $\mathrm{KL}(\cdot)$ is the KL-divergence function. Further, the two loss functions for all the data points in $X$ and $Y$ are integrated, and the overall loss function of CMNNH is minimized as follows:
$$J = \sum_{i=1}^{n} L_1(x_i, y_i) + \lambda \sum_{i=1}^{n} L_2(x_i, y_i, t_i) \qquad (48)$$
where $\lambda$ is a hyper-parameter balancing the two losses. The training of CMNNH is conducted by the classical back-propagation method.
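The sketch below assembles the CMNNH losses of Eqs. (45)–(48) for a single cross-modal pair, assuming the real-valued hash-layer outputs and the class logits of the two networks are already available; treating the one-hot label as the target distribution in the KL terms is one reasonable reading of Eq. (47), and all values are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q, eps=1e-12):
    """KL divergence between a target distribution p and a prediction q."""
    return np.sum(p * np.log((p + eps) / (q + eps)))

def cmnnh_loss(code_x, code_y, logits_x, logits_y, label_onehot, lam=1.0):
    """Pairwise correspondence loss (Eq. 46) plus intra-modal softmax/KL loss (Eq. 47),
    combined as in Eq. (48) for one training pair."""
    L1 = 0.5 * np.sum((code_x - code_y) ** 2)
    L2 = kl(label_onehot, softmax(logits_x)) + kl(label_onehot, softmax(logits_y))
    return L1 + lam * L2

k_bits, n_classes = 16, 5
code_x, code_y = rng.normal(size=k_bits), rng.normal(size=k_bits)    # real-valued hash-layer outputs
logits_x, logits_y = rng.normal(size=n_classes), rng.normal(size=n_classes)
t = np.zeros(n_classes); t[2] = 1.0                                  # class label of the pair
print(cmnnh_loss(code_x, code_y, logits_x, logits_y, t))

# At test time the binary codes of Eq. (45) are obtained by thresholding with the sign function.
print(np.sign(code_x).astype(int))
```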
To address the redundancy problem of multi-modal hashing representation learning with deep models, Wang et al. [211] introduce a deep multi-modal hashing model with orthogonal regularization for mapping multi-modal data into a common Hamming space. On
one hand, this model captures intra-modality and inter-modality
correlations to extract useful information from multi-modal data;
on the other hand, to address the redundancy problem, orthogonal
regularizers are imposed on the weighting matrices.
Recently, Jiang and Li [212] propose a deep cross-modal
hashing method (DCMH) by integrating feature learning and hash-
code learning into the same framework. Unlike most existing
methods that typically solve the discrete optimization problem
of hash-code learning via relaxing the original discrete learning
problem into a continuous learning problem, DCMH directly
learns the discrete hash codes without relaxation. In addition, Cao et al. present a correlation autoencoder hashing method that extends the correspondence autoencoder of [19].
4.2.4 Multi-Modal Recurrent Neural Network
A recurrent neural network (RNN) [213] is a neural network which processes a variable-length sequence $\mathbf{x} = (x_1, \ldots, x_T)$ through a hidden state representation $\mathbf{h}$. At each time step $t$, the hidden state $h_t$ of the RNN is estimated by
$$h_t = f(h_{t-1}, x_t) \qquad (49)$$
where $f$ is a non-linear activation function selected based on the requirements of data modeling. For example, a simple case may be a common element-wise logistic sigmoid function, while a complex case may be a long short-term memory (LSTM) unit [214].
An RNN is known for its ability to learn a probability distribution over a sequence by being trained to predict the next symbol in the sequence. In this training, the prediction at each time step $t$ is decided by the conditional distribution $p(x_t \mid x_{t-1}, \ldots, x_1)$. For example, a multinomial distribution can be learned as the output with a softmax activation function
$$p(x_{t,j} = 1 \mid x_{t-1}, \ldots, x_1) = \frac{\exp(w_j h_t)}{\sum_{j'=1}^{K} \exp(w_{j'} h_t)} \qquad (50)$$
where $j = 1, \ldots, K$ indexes the possible symbol components and $w_j$ are the corresponding rows of a weight matrix $W$. Further, based on the above probabilities, the probability of the sequence $\mathbf{x}$ can be computed as
$$p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_{t-1}, \ldots, x_1). \qquad (51)$$
With this learned distribution, it is straightforward to generate a
new sequence by iteratively generating a symbol at each time step.
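A small numpy sketch of Eqs. (49)–(51) with a plain tanh recurrence, one-hot symbols, and random weights is shown below; it accumulates the log-probability of a toy sequence under the model, and all sizes and names are chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_step(h_prev, x_t, Wh, Wx):
    """Eq. (49): h_t = f(h_{t-1}, x_t), here with a tanh non-linearity."""
    return np.tanh(Wh @ h_prev + Wx @ x_t)

def next_symbol_probs(h, W_out):
    """Eq. (50): softmax distribution over the K possible symbols given the current state."""
    scores = W_out @ h
    e = np.exp(scores - scores.max())
    return e / e.sum()

K, d_h = 8, 16                                 # vocabulary size and hidden size
Wh, Wx = 0.1 * rng.normal(size=(d_h, d_h)), 0.1 * rng.normal(size=(d_h, K))
W_out = 0.1 * rng.normal(size=(K, d_h))

# Log-probability of a toy sequence under Eq. (51): the state summarizes the symbols read so far.
seq = [int(rng.integers(K)) for _ in range(5)]
h, log_p = np.zeros(d_h), 0.0
for sym in seq:
    p = next_symbol_probs(h, W_out)
    log_p += np.log(p[sym])                    # accumulate log p(x_t | x_{<t})
    h = rnn_step(h, np.eye(K)[sym], Wh, Wx)    # consume the symbol
print(log_p)
```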
Cho et al. [215] propose an RNN encoder-decoder model that exploits RNNs to connect multi-modal sequences. As shown in Figure 9, this neural network first encodes a variable-length source sequence into a fixed-length vector representation and then decodes this fixed-length vector representation back into a variable-length target sequence. In fact, it is a general method to learn the conditional distribution over an output sequence conditioned on another input sequence, e.g., $p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T)$, where the input and output sequence lengths $T$ and $T'$ can be different.
Fig. 9. The illustration of the RNN encoder-decoder (adapted from [215]).
In particular, the encoder of the proposed model is an RNN which sequentially encodes each symbol of an input sequence $\mathbf{x}$ into the corresponding hidden state according to Eq. (49). After reading the end of the input sequence, a summary hidden state $\mathbf{c}$ of the whole source sequence is acquired. The decoder of the proposed model
is another RNN which is exploited to generate the target sequence by predicting the next symbol $y_t$ given the hidden state $h_t$. Based on the recurrent property, both $y_t$ and $h_t$ are also conditioned on $y_{t-1}$ and on the summary $\mathbf{c}$ of the input sequence. Thus, the hidden state of the decoder at time $t$ is computed by
$$h_t = f(h_{t-1}, y_{t-1}, \mathbf{c}) \qquad (52)$$
and the conditional distribution of the next symbol is
$$p(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1, \mathbf{c}) = g(h_t, y_{t-1}, \mathbf{c}) \qquad (53)$$
where $g$ is an activation function that produces valid probabilities with a softmax. The main idea of the RNN-based encoder-decoder framework is to jointly train the two RNNs to maximize the conditional log-likelihood
$$\max_{\theta} \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}(\mathbf{y}_n \mid \mathbf{x}_n) \qquad (54)$$
where $\theta$ is the set of model parameters and each pair $(\mathbf{x}_n, \mathbf{y}_n)$ consists of an input sequence and an output sequence from the training set. The model parameters can be estimated by a gradient-based algorithm.
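The sketch below strings Eqs. (49) and (52)–(54) together for one source-target pair: a plain tanh RNN encoder folds the source sequence into the summary vector $\mathbf{c}$, and a decoder conditioned on $\mathbf{c}$ accumulates one term of the conditional log-likelihood in Eq. (54). The vanilla recurrences stand in for the gated units used in practice, and all weights and sizes are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d_h, K_src, K_tgt = 16, 7, 9
We_h, We_x = 0.1 * rng.normal(size=(d_h, d_h)), 0.1 * rng.normal(size=(d_h, K_src))
Wd_h, Wd_y = 0.1 * rng.normal(size=(d_h, d_h)), 0.1 * rng.normal(size=(d_h, K_tgt))
Wd_c, W_out = 0.1 * rng.normal(size=(d_h, d_h)), 0.1 * rng.normal(size=(K_tgt, d_h))

# Encoder: fold the source sequence into a summary vector c (Eq. 49).
src = [int(rng.integers(K_src)) for _ in range(6)]
h = np.zeros(d_h)
for sym in src:
    h = np.tanh(We_h @ h + We_x @ np.eye(K_src)[sym])
c = h

# Decoder: h_t = f(h_{t-1}, y_{t-1}, c) (Eq. 52) and p(y_t | y_{<t}, c) via a softmax (Eq. 53).
tgt = [int(rng.integers(K_tgt)) for _ in range(4)]
h, y_prev, log_p = np.zeros(d_h), np.zeros(K_tgt), 0.0
for sym in tgt:
    h = np.tanh(Wd_h @ h + Wd_y @ y_prev + Wd_c @ c)
    p = softmax(W_out @ h)
    log_p += np.log(p[sym])        # one term of the conditional log-likelihood in Eq. (54)
    y_prev = np.eye(K_tgt)[sym]
print(log_p)
```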
Further, Sutskever et al. [216] also present a general end-
to-end approach for multi-modal sequence to sequence learning
based on deep LSTM networks, which is very useful for learning
problems with long range temporal dependencies [214, 217]. The
goal of this method is also to estimate the conditional probability $p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T)$. Similar to [215], the conditional probability is computed by first obtaining the fixed-dimensional representation $v$ of the input sequence $(x_1, \ldots, x_T)$ with the encoding LSTM-based network, and then computing the probability of $y_1, \ldots, y_{T'}$ with the decoding LSTM-based network whose initial hidden state is set to the representation $v$ of $x_1, \ldots, x_T$:
$$p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T) = \prod_{t=1}^{T'} p(y_t \mid v, y_1, \ldots, y_{t-1}) \qquad (55)$$
where each $p(y_t \mid v, y_1, \ldots, y_{t-1})$ distribution is represented with a softmax over all the words in the vocabulary.
Besides, multi-modal RNNs have been widely applied in
image captioning [21, 22, 218], videos captioning [23, 219, 220],
and visual question answering [221]. Karpathy and Li [21] propose
a multi-modal recurrent neural network architecture to generate
new descriptions of image regions. Chen and Zitnick [222] explore
the bi-directional mapping between images and their sentence-
based descriptions with RNNs. Venugopalan et al. [220] introduce
an end-to-end sequence model to generate captions for videos.
By applying attention mechanism [223] to visual recognition
[224, 225], Xu et al. [226] introduce an attention based multi-
modal RNN model, which trains the multi-modal RNN in a
deterministic manner using standard back-propagation. In particular, it incorporates a form of attention with two variants: a "hard" attention mechanism and a "soft" attention mechanism. The advantage of the proposed model lies in attending to the salient parts of an image while generating its caption.
5 MULTI-VIEW REPRESENTATION LEARNING AS MANIFOLD ALIGNMENT
Another important perspective on multi-view representation learning is based on manifold alignment. The key idea underlying this approach is to embed inputs from different domains into a new latent common space while simultaneously preserving the topology of each input domain. Its premise is the manifold hypothesis, according to which high-dimensional real-world data are expected to concentrate in the vicinity of a low-dimensional manifold embedded in the high-dimensional input space. In particular, this prior is well suited for datasets such as images, sounds, and texts, as most configurations of the high-dimensional input space do not correspond to such natural signals. Since most data sources
can be modeled by manifolds, manifold alignment can be used to
align the underlying structures across different data views. With
this perspective, the multi-view representation learning task can be
seen as finding the relationship among the structures of manifolds
from the different views of data.
One of the pioneering efforts in manifold alignment with
given coordinates is the semi-supervised alignment by Ham et al.
[24]. Given certain labeled samples indexed by the ordinal index $l$, semi-supervised alignment aims to find a map defined on the vertices of the graph, $f: V \mapsto \mathbb{R}$, that matches known target values for the labeled vertices. This can be solved by directly finding $\arg\min_f |f_i - s_i|\ (i \in l)$, where $s$ is the vector of target values.
Since only a small number of labeled examples are provided,
it is important to exploit manifold structure in the data when
constructing the mapping function. In particular, graph Laplacian
matrix Lprovides this structural information. The semi-supervised
alignment problem can be solved by minimizing the following
objective function:
$$C(f) = \sum_{i} \mu |f_i - s_i|^2 + f^{T} L f \qquad (56)$$
where the first term is the loss function and the second term enforces smoothness along the manifold. The optimum mapping function $f$ is easily obtained by a linear transform.
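Since Eq. (56) is quadratic in $f$, setting its gradient to zero yields the linear system $(\mu J + L) f = \mu J s$, where $J$ is the diagonal indicator of the labeled vertices. The sketch below solves this system on a small random graph; the graph, the labeled indices, and the value of $\mu$ are arbitrary, and a tiny ridge term is added only to keep the system well conditioned.

```python
import numpy as np

rng = np.random.default_rng(0)

def graph_laplacian(W):
    """Unnormalized graph Laplacian L = D - W."""
    return np.diag(W.sum(axis=1)) - W

# Toy graph over n vertices with a random symmetric binary adjacency matrix.
n = 8
A = rng.random((n, n))
W = ((A + A.T) > 1.0).astype(float)
np.fill_diagonal(W, 0)
L = graph_laplacian(W)

# Known target values s_i on a few labeled vertices.
labeled = np.array([0, 3, 5])
s = np.zeros(n); s[labeled] = [1.0, -1.0, 0.5]
J = np.zeros((n, n)); J[labeled, labeled] = 1.0      # indicator of the labeled vertices

# Closed-form minimizer of Eq. (56): (mu * J + L) f = mu * J s.
mu = 10.0
f = np.linalg.solve(mu * J + L + 1e-9 * np.eye(n), mu * (J @ s))
print(np.round(f, 3))
```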
This semi-supervised alignment is similar to manifold ranking
[227], which learns to rank on data manifolds. They both take
advantage of manifold learning to perform alignment between different types of data. To exploit the high-level geometric structure embedded in the data samples, Mao et al. [228] extend the original semi-supervised alignment and propose parallel field alignment retrieval (PFAR), which investigates the alignment framework from the perspective of parallel vector fields.
In addition, Ham et al. [24] also introduce manifold alignment
with pairwise correspondence, which builds connections between
multiple data sets by aligning their underlying manifolds. In
particular, the multi-view input data sets $\{X, Y\}$ have subsets $X_l$ and $Y_l$ which are in pairwise alignment by the indices $x_i \leftrightarrow y_i$, $(i \in l)$. Then we intend to determine how to match the remaining examples using aligned manifold embeddings. Considering that the embedding coordinates for each data set should take similar values for corresponding pairs, the multi-view manifold embedding can be defined by generalizing the single-graph embedding algorithm as follows:
$$C(f, g) = \mu \sum_{i \in l} |f_i - g_i| + f^{T} L_x f + g^{T} L_y g \qquad (57)$$
where $f$ and $g$ denote real-valued functions defined on the respective graphs of $X$ and $Y$, and $L_x$ and $L_y$ are the graph Laplacian matrices of $X$ and $Y$. The first term penalizes the difference between $f$ and $g$ on the corresponding vertices, and the second and third terms impose smoothness constraints on $f$ and $g$ based on the respective graphs.
This graph algorithm is able to robustly align the underlying
manifold structure across multi-view data sets. Even with a small
number of paired samples provided, the algorithm is capable of
estimating a common low-dimensional embedding space which
can be used in cross-view retrieval task. The main concern about
this algorithm is the computational cost, which lies in finding the
spectral decomposition of a large matrix. Methods for calculating
eigenvectors of large sparse matrices can be employed to speed up
the computation of the embeddings.
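One common way to minimize an objective of the form of Eq. (57) is to stack $f$ and $g$, assemble a joint Laplacian that couples the corresponding vertices, and take its smallest non-trivial eigenvectors as the shared embedding. The sketch below follows this route with a squared correspondence penalty and $k$-nearest-neighbour graphs on toy data; the data, graph construction, and penalty weight are illustrative choices rather than the exact procedure of [24].

```python
import numpy as np

rng = np.random.default_rng(0)

def graph_laplacian(W):
    return np.diag(W.sum(axis=1)) - W

def knn_graph(X, k=3):
    """Symmetric k-nearest-neighbour adjacency matrix with binary weights."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    W = np.zeros_like(D)
    for i in range(len(X)):
        for j in np.argsort(D[i])[1:k + 1]:
            W[i, j] = W[j, i] = 1.0
    return W

# Two toy views whose first n_pair points are known to correspond.
n_x, n_y, n_pair, mu = 30, 30, 10, 5.0
X = rng.normal(size=(n_x, 4))
Y = np.hstack([X[:, :2], rng.normal(size=(n_y, 3))])   # views share part of their structure

Lx, Ly = graph_laplacian(knn_graph(X)), graph_laplacian(knn_graph(Y))

# Quadratic form of Eq. (57) (with a squared penalty) on the stacked vector z = [f; g].
L_joint = np.zeros((n_x + n_y, n_x + n_y))
L_joint[:n_x, :n_x], L_joint[n_x:, n_x:] = Lx, Ly
for i in range(n_pair):                                 # mu * (f_i - g_i)^2 coupling terms
    L_joint[i, i] += mu; L_joint[n_x + i, n_x + i] += mu
    L_joint[i, n_x + i] -= mu; L_joint[n_x + i, i] -= mu

# The smallest non-trivial eigenvectors give a common low-dimensional embedding.
vals, vecs = np.linalg.eigh(L_joint)
embedding = vecs[:, 1:3]                                # skip the constant (zero-eigenvalue) vector
f_embed, g_embed = embedding[:n_x], embedding[n_x:]
print(f_embed.shape, g_embed.shape)
```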
6 CONCLUSION
Multi-view representation learning has attracted much attention
in machine learning and data mining areas. This paper first re-
views the root methods and theories on multi-view representation
learning, especially on canonical correlation analysis (CCA) and
its extensions. And then we investigate the advances of multi-
view representation learning that ranges from shallow methods
including multi-modal topic learning, multi-view sparse coding,
and multi-view latent space Markov networks, to deep methods in-
cluding multi-modal restricted Boltzmann machines, multi-modal
autoencoders, and multi-modal recurrent neural networks. Further,
we also provide an important perspective for multi-view repre-
sentation learning from manifold alignment. This survey aims to
provide an insightful picture of the theoretical basis and the current
development in the field of multi-view representation learning and
to help researchers find the most appropriate methodologies for
particular applications.
REFERENCES
[1] H. Hotelling, “Relations between two sets of variates,
Biometrika, vol. 28, no. 3/4, pp. 321–377, 1936.
[2] F. R. Bach and M. I. Jordan, “Kernel independent component
analysis,” Journal of Machine Learning Research, vol. 3, pp.
1–48, 2002.
[3] D. R. Hardoon, S. R. Szedmak, and J. R. Shawe-taylor,
“Canonical correlation analysis: An overview with application
to learning methods,” Neural Comput., vol. 16, no. 12, pp.
2639–2664, Dec. 2004.
[4] S. Sun, “A survey of multi-view machine learning,Neural
Computing and Applications, vol. 23, no. 7-8, pp. 2031–2038,
2013.
[5] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol,
“Extracting and composing robust features with denoising
autoencoders,” in ICML, 2008, pp. 1096–1103.
[6] R. Salakhutdinov and G. E. Hinton, “Deep boltzmann ma-
chines.” in AISTATS, vol. 1, 2009, p. 3.
[7] Y. Bengio, A. Courville, and P. Vincent, “Representation learn-
ing: A review and new perspectives,” IEEE transactions on
pattern analysis and machine intelligence, vol. 35, no. 8, pp.
1798–1828, 2013.
[8] G. Andrew, R. Arora, J. A. Bilmes, and K. Livescu, “Deep
canonical correlation analysis,” in ICML, 2013, pp. 1247–1255.
[9] D. A. Cohn and T. Hofmann, “The missing link - A probabilis-
tic model of document content and hypertext connectivity,” in
NIPS, 2000, pp. 430–436.
[10] K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D. M. Blei,
and M. I. Jordan, “Matching words and pictures,” J. Mach.
Learn. Res., vol. 3, pp. 1107–1135, Mar. 2003.
[11] D. M. Blei and M. I. Jordan, “Modeling annotated data,” in
SIGIR, 2003, pp. 127–134.
[12] Y. Jia, M. Salzmann, and T. Darrell, “Factorized latent spaces
with structured sparsity,” in NIPS, 2010, pp. 982–990.
[13] T. Cao, V. Jojic, S. Modla, D. Powell, K. Czymmek, and
M. Niethammer, “Robust multimodal dictionary learning,” in
MICCAI, 2013, pp. 259–266.
[14] W. Liu, D. Tao, J. Cheng, and Y. Tang, “Multiview hessian
discriminative sparse coding for image annotation,Computer
Vision and Image Understanding, vol. 118, pp. 50–60, 2014.
[15] E. P. Xing, R. Yan, and A. G. Hauptmann, “Mining associated
text and images with dual-wing harmoniums,” in UAI, 2005,
pp. 633–641.
[16] N. Chen, J. Zhu, and E. P. Xing, “Predictive subspace learning
for multi-view data: a large margin approach,” in NIPS, 2010,
pp. 361–369.
[17] N. Srivastava and R. Salakhutdinov, “Multimodal learning with
deep boltzmann machines,” in NIPS, 2012, pp. 2231–2239.
[18] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng,
“Multimodal deep learning,” in ICML, 2011, pp. 689–696.
[19] F. Feng, X. Wang, and R. Li, “Cross-modal retrieval with
correspondence autoencoder,” in ACM Multimedia, 2014, pp.
7–16.
[20] W. Wang, R. Arora, K. Livescu, and J. A. Bilmes, “On deep
multi-view representation learning,” in ICML, 2015, pp. 1083–
1092.
[21] A. Karpathy and F. Li, “Deep visual-semantic alignments
for generating image descriptions,” CoRR, vol. abs/1412.2306,
2014.
[22] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille,
“Deep captioning with multimodal recurrent neural networks
(m-rnn),” arXiv preprint arXiv:1412.6632, 2014.
[23] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach,
S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recur-
rent convolutional networks for visual recognition and descrip-
tion,” in CVPR, 2015, pp. 2625–2634.
[24] J. Ham, D. Lee, and L. Saul, “Semisupervised alignment
of manifolds,” in 10th International Workshop on Artificial
Intelligence and Statistics, 2005, pp. 120–127.
[25] N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. R.
Lanckriet, R. Levy, and N. Vasconcelos, “A new approach to
cross-modal multimedia retrieval,” in ACM Multimedia, 2010,
pp. 251–260.
[26] E. Kidron, Y. Y. Schechner, and M. Elad, “Pixels that sound.”
in CVPR (1). IEEE Computer Society, 2005, pp. 88–95.
[27] L. Sun, S. Ji, and J. Ye, “A least squares formulation for
canonical correlation analysis,” in ICML, 2008, pp. 1024–1031.
[28] D. P. Foster, R. Johnson, and T. Zhang, “Multi-view dimension-
ality reduction via canonical correlation analysis,” Tech. Rep.,
2008.
[29] L. Sun, B. Ceran, and J. Ye, “A scalable two-stage approach for
a class of dimensionality reduction techniques,” in KDD, 2010,
pp. 313–322.
[30] H. Avron, C. Boutsidis, S. Toledo, and A. Zouzias, “Efficient
dimensionality reduction for canonical correlation analysis,” in
ICML, 2013, pp. 347–355.
[31] X. Z. Fern, C. E. Brodley, and M. A. Friedl, “Correlation clus-
tering for learning mixtures of canonical correlation models,”
in SDM, 2005, pp. 439–448.
[32] M. B. Blaschko and C. H. Lampert, “Correlational spectral
clustering,” in CVPR, 2008.
[33] K. Chaudhuri, S. M. Kakade, K. Livescu, and K. Sridharan,
“Multi-view clustering via canonical correlation analysis,” in
ICML, 2009, pp. 129–136.
[34] S. M. Kakade and D. P. Foster, “Multi-view regression via
canonical correlation analysis,” in COLT, 2007, pp. 82–96.
[35] B. McWilliams, D. Balduzzi, and J. M. Buhmann, “Correlated
random features for fast semi-supervised learning,” in NIPS,
2013, pp. 440–448.
[36] P. S. Dhillon, D. P. Foster, and L. H. Ungar, “Multi-view
learning of word embeddings via CCA,” in NIPS, 2011, pp.
199–207.
[37] P. S. Dhillon, J. Rodu, D. P. Foster, and L. H. Ungar, “Using
CCA to improve CCA: A new spectral method for estimating
vector models of words,” in ICML, 2012.
[38] Y. Gong, Q. Ke, M. Isard, and S. Lazebnik, “A multi-view
embedding space for modeling internet images, tags, and their
semantics,” International Journal of Computer Vision, vol. 106,
no. 2, pp. 210–233, 2014.
[39] T. Kim, J. Kittler, and R. Cipolla, “Discriminative learning and
recognition of image set classes using canonical correlations,”
IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 6, pp.
1005–1018, 2007.
[40] Y. Zhang and J. G. Schneider, “Multi-label output codes using
canonical correlation analysis.” in AISTATS, 2011, pp. 873–
882.
[41] Y. Su, Y. Fu, X. Gao, and Q. Tian, “Discriminant learning
through multiple principal angles for visual recognition,” IEEE
Trans. Image Processing, vol. 21, no. 3, pp. 1381–1390, 2012.
[42] G. H. Golub and H. Zha, “The canonical correlations of matrix
pairs and their numerical computation,” Stanford, CA, USA,
Tech. Rep., 1992.
[43] J. R. Kettenring, “Canonical analysis of several sets of vari-
ables,” Biometrika, vol. 58, no. 3, pp. 433–451, 1971.
[44] H. Avron, C. Boutsidis, S. Toledo, and A. Zouzias, “Efficient
dimensionality reduction for canonical correlation analysis,”
SIAM J. Scientific Computing, vol. 36, no. 5, 2014.
[45] J. A. Tropp, “Improved analysis of the subsampled randomized
hadamard transform,” CoRR, vol. abs/1011.1595, 2010.
[46] Y. Lu and D. P. Foster, “Large scale canonical correlation analysis with iterative least squares,” in NIPS, 2014, pp. 91–99.
[47] R. Arora, A. Cotter, K. Livescu, and N. Srebro, “Stochastic
optimization for PCA and PLS,” in 50th Annual Allerton Con-
ference on Communication, Control, and Computing, Allerton
2012, Allerton Park & Retreat Center, Monticello, IL, USA,
October 1-5, 2012, 2012, pp. 861–868.
[48] Z. Ma, Y. Lu, and D. P. Foster, “Finding linear structure in large
datasets with scalable canonical correlation analysis,” in ICML,
2015, pp. 169–178.
[49] W. Wang, J. Wang, and N. Srebro, “Globally convergent
stochastic optimization for canonical correlation analysis,”
CoRR, vol. abs/1604.01870, 2016.
[50] R. Ge, C. Jin, S. M. Kakade, P. Netrapalli, and A. Sidford,
“Efficient algorithms for large-scale generalized eigenvector
computation and canonical correlation analysis,” CoRR, vol.
abs/1604.03930, 2016.
[51] F. R. Bach and M. I. Jordan, “A probabilistic interpretation
of canonical correlation analysis,” Department of Statistics,
University of California, Berkeley, Tech. Rep., 2005.
[52] M. W. Browne, “The maximum-likelihood solution in inter-
battery factor analysis,” British Journal of Mathematical and
Statistical Psychology, vol. 32, no. 1, pp. 75–86, 1979.
[53] C. Archambeau, N. Delannay, and M. Verleysen, “Robust probabilistic projections,” in ICML, 2006, pp. 33–40.
[54] C. Fyfe and G. Leen, “Stochastic processes for canonical
correlation analysis,” in ESANN, 2006, pp. 245–250.
[55] A. Klami and S. Kaski, “Local dependent components,” in
ICML, 2007, pp. 425–432.
[56] C. Wang, “Variational bayesian approach to canonical correla-
tion analysis,” IEEE Trans. Neural Networks, vol. 18, no. 3, pp.
905–910, 2007.
[57] J. Viinikanoja, A. Klami, and S. Kaski, “Variational bayesian
mixture of robust CCA models,” in ECML PKDD, 2010, pp.
370–385.
[58] M. Collins, S. Dasgupta, and R. E. Schapire, “A generalization
of principal components analysis to the exponential family,” in
NIPS, 2001, pp. 617–624.
[59] S. Mohamed, K. A. Heller, and Z. Ghahramani, “Bayesian
exponential family PCA,” in NIPS, 2008, pp. 1089–1096.
[60] A. Klami, S. Virtanen, and S. Kaski, “Bayesian exponential
family projections for coupled data sources,” in UAI, 2010, pp.
286–293.
[61] C. Archambeau, S. Guo, and O. Zoeter, “Sparse bayesian multi-
task learning,” in NIPS, 2011, pp. 1755–1763.
[62] B. Lakshminarayanan, G. Bouchard, and C. Archambeau, “Ro-
bust bayesian matrix factorisation,” in AISTATS, 2011, pp. 425–
433.
[63] Y. Mukuta and T. Harada, “Probabilistic partial canonical
correlation analysis,” in ICML, 2014, pp. 1449–1457.
[64] B. R. Rao, “Partial canonical correlations,” vol. 20, no. 2, 1969, pp. 211–219.
[65] C. Kamada, A. Kanezaki, and T. Harada, “Probabilistic semi-
canonical correlation analysis,” in ACM Multimedia, 2015, pp.
1131–1134.
[66] I. Huopaniemi, T. Suvitaival, J. Nikkilä, M. Oresic, and S. Kaski, “Two-way analysis of high-dimensional collinear data,” Data Min. Knowl. Discov., vol. 19, no. 2, pp. 261–276, 2009.
[67] ——, “Multivariate multi-way analysis of multi-source data,”
Bioinformatics, vol. 26, no. 12, pp. 391–398, 2010.
[68] R. Tibshirani, “Regression shrinkage and selection via the
lasso,” Journal of the Royal Statistical Society (Series B),
vol. 58, pp. 267–288, 1996.
[69] D. R. Hardoon and J. Shawe-Taylor, “Sparse canonical corre-
lation analysis,” Department of Computer Science, University
College London, Tech. Rep., 2007.
[70] S. Waaijenborg, P. C. Verselewel de Witt Hamer, and A. H.
Zwinderman, “Quantifying the association between gene ex-
pressions and DNA-markers by penalized canonical correlation
analysis.” Statistical applications in genetics and molecular
biology, vol. 7, no. 1, 2008.
[71] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, “Least
angle regression,Annals of Statistics, vol. 32, pp. 407–499,
2004.
[72] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regular-
ization: A geometric framework for learning from labeled and
unlabeled examples,J. Mach. Learn. Res., vol. 7, pp. 2399–
2434, Dec. 2006.
[73] B. K. Sriperumbudur, D. A. Torres, and G. R. G. Lanckriet,
“Sparse eigen methods by D.C. programming,” in ICML, 2007,
pp. 831–838.
[74] A. d’Aspremont, F. R. Bach, and L. E. Ghaoui, “Full regular-
ization path for sparse principal component analysis,” in ICML,
2007, pp. 177–184.
[75] A. d’Aspremont, L. E. Ghaoui, M. I. Jordan, and G. R. G.
Lanckriet, “A direct formulation for sparse PCA using semidef-
inite programming,” SIAM Review, vol. 49, no. 3, pp. 434–448,
2007.
[76] D. Torres, D. Turnbull, L. Barrington, B. Sriperumbudur, and
G. Lanckriet, “Finding musically meaningful words by sparse
cca,” NIPS WMBC ’07, December 2007.
[77] A. Wiesel, M. Kliger, and A. O. Hero, III, “A greedy approach
to sparse canonical correlation analysis,” ArXiv e-prints, 2008.
[78] D. M. Witten, T. Hastie, and R. Tibshirani, “A penalized
matrix decomposition, with applications to sparse principal
components and canonical correlation analysis,” Biostatistics,
2009.
[79] X. Chen, H. Liu, and J. G. Carbonell, “Structured sparse
canonical correlation analysis,” in AISTATS, 2012, pp. 199–
207.
[80] C. Fyfe and G. Leen, “Two methods for sparsifying proba-
bilistic canonical correlation analysis,” in ICONIP, 2006, pp.
361–370.
[81] C. Archambeau and F. R. Bach, “Sparse probabilistic projec-
tions,” in NIPS, 2008, pp. 73–80.
[82] P. Rai and H. D. III, “Multi-label prediction via sparse infinite
CCA,” in NIPS, 2009, pp. 1518–1526.
[83] Y. Fujiwara, Y. Miyawaki, and Y. Kamitani, “Estimating im-
age bases for visual image reconstruction from human brain
activity,” in NIPS, 2009, pp. 576–584.
[84] K. A. Heller and Z. Ghahramani, “A nonparametric bayesian
approach to modeling overlapping clusters,” in AISTATS, 2007,
pp. 187–194.
[85] L. R. Tucker, “An inter-battery method of factor analysis,,”
Psychometrika, vol. 23, no. 2, pp. 111–136, 1958.
[86] A. Klami, S. Virtanen, and S. Kaski, “Bayesian canonical
correlation analysis,” Journal of Machine Learning Research,
vol. 14, no. 1, pp. 965–1003, 2013.
[87] R. McDonald, “Three common factor models for groups of
variables,Psychometrika, vol. 37, no. 1, pp. 173–178, 1970.
[88] M. W. Browne, “Factor analysis of multiple batteries by
maximum likelihood,” British Journal of Mathematical and
Statistical Psychology, vol. 33, pp. 184–199, 1979.
[89] A. Klami, S. Virtanen, E. Leppäaho, and S. Kaski, “Group factor analysis,” IEEE Trans. Neural Netw. Learning Syst., vol. 26, no. 9, pp. 2136–2147, 2015.
[90] C. Xu, D. Tao, and C. Xu, “A survey on multi-view learning,”
arXiv preprint arXiv:1304.5634, 2013.
[91] P. L. Lai and C. Fyfe, “Kernel and nonlinear canonical correla-
tion analysis,” in IJCNN (4), 2000, p. 614.
[92] S. Akaho, “A kernel method for canonical correlation analysis,”
in International Meeting of the Psychometric Society, 2001.
[93] T. Melzer, M. Reiter, and H. Bischof, “Nonlinear feature
extraction using generalized canonical correlation analysis,” in
ICANN, 2001, pp. 353–360.
[94] R. Socher and F. Li, “Connecting modalities: Semi-supervised
segmentation and annotation of images using unaligned text
corpora,” in CVPR, 2010, pp. 966–973.
[95] S. J. Hwang and K. Grauman, “Learning the relative impor-
tance of objects from tagged images for retrieval and cross-
modal search.” International Journal of Computer Vision, vol.
100, no. 2, pp. 134–153, 2012.
[96] Y. Yamanishi, J. Vert, A. Nakaya, and M. Kanehisa, “Extraction
of correlated gene clusters from multiple genomic data by
generalized kernel canonical correlation analysis,” in ICISMB,
2003, pp. 323–330.
[97] M. B. Blaschko, J. A. Shelton, A. Bartels, C. H. Lampert,
and A. Gretton, “Semi-supervised kernel canonical correlation
analysis with application to human fmri,” Pattern Recognition
Letters, vol. 32, no. 11, pp. 1572–1583, 2011.
[98] A. Trivedi, P. Rai, S. L. DuVall, and H. Daumé, III, “Exploiting tag and word correlations for improved webpage clustering,” in Proceedings of the 2nd International Workshop on Search and Mining User-generated Contents, ser. SMUC ’10, 2010, pp. 3–12.
[99] R. Arora and K. Livescu, “Kernel CCA for multi-view learn-
ing of acoustic features using articulatory measurements,” in
MLSLP, 2012, pp. 34–37.
[100] ——, “Multi-view cca-based acoustic features for phonetic
recognition across speakers and domains,” in ICASSP, 2013,
pp. 7135–7139.
[101] A. Gretton, R. Herbrich, and A. J. Smola, “The kernel mutual
information,” in ICASSP, 2003, pp. 880–884.
[102] A. Gretton, R. Herbrich, A. J. Smola, O. Bousquet, and B. Schölkopf, “Kernel methods for measuring independence,” Journal of Machine Learning Research, vol. 6, pp. 2075–2129, 2005.
[103] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, 2001.
[104] M. Kuss and T. Graepel, “The geometry of kernel canonical correlation analysis,” Max Planck Institute for Biological Cybernetics, Tübingen, Germany, Tech. Rep. 108, May 2003.
[105] K. Fukumizu, F. R. Bach, and A. Gretton, “Statistical con-
sistency of kernel canonical correlation analysis,Journal of
Machine Learning Research, vol. 8, pp. 361–383, 2007.
[106] D. R. Hardoon and J. Shawe-Taylor, “Convergence analysis
of kernel canonical correlation analysis: theory and practice,”
Machine Learning, vol. 74, no. 1, pp. 23–38, 2009.
[107] J. Cai and H. Sun, “Convergence rate of kernel canonical
correlation analysis,” Science China Mathematics, vol. 54,
no. 10, pp. 2161–2170, 2011.
[108] M. Brand, “Incremental singular value decomposition of uncer-
tain data with missing values,” in ECCV, 2002, pp. 707–720.
[109] C. K. I. Williams and M. W. Seeger, “Using the Nyström method to speed up kernel machines,” in NIPS, 2000, pp. 682–688.
[110] T. Yang, Y. Li, M. Mahdavi, R. Jin, and Z. Zhou, “Nyström method vs random Fourier features: A theoretical and empirical comparison,” in NIPS, 2012, pp. 485–493.
[111] Q. V. Le, T. Sarlós, and A. J. Smola, “Fastfood - computing Hilbert space expansions in loglinear time,” in ICML, 2013, pp. 244–252.
[112] D. Lopez-Paz, S. Sra, A. J. Smola, Z. Ghahramani, and B. Schölkopf, “Randomized nonlinear component analysis,” in ICML, 2014, pp. 1359–1367.
[113] W. Wang and K. Livescu, “Large-scale approximate kernel
canonical correlation analysis,” CoRR, vol. abs/1511.04773,
2015.
[114] D. R. Hardoon, C. Saunders, S. Szedmák, and J. Shawe-Taylor, “A correlation approach for automatic image annotation,” in ADMA, 2006, pp. 681–692.
[115] M. Hodosh, P. Young, and J. Hockenmaier, “Framing image
description as a ranking task: Data, models and evaluation
metrics,” J. Artif. Intell. Res. (JAIR), vol. 47, pp. 853–899,
2013.
[116] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regular-
ization: A geometric framework for learning from labeled and
unlabeled examples,Journal of Machine Learning Research,
vol. 7, pp. 2399–2434, 2006.
[117] S. Becker and G. E. Hinton, “Self-organizing neural network
that discovers surfaces in random-dot stereograms,Nature,
vol. 355, no. 6356, pp. 161–163, 1992.
[118] S. Becker, “Mutual information maximization: Models of
cortical self-organization.Network : Computation in Neural
Systems, vol. 7, pp. 7–31, 1996.
[119] P. L. Lai and C. Fyfe, “Canonical correlation analysis using
artificial neural networks,” in ESANN, 1998, pp. 363–368.
[120] ——, “A neural implementation of canonical correlation anal-
ysis,” Neural Networks, vol. 12, no. 10, pp. 1391–1397, 1999.
[121] W. W. Hsieh, “Nonlinear canonical correlation analysis by
neural networks,” Neural Networks, vol. 13, no. 10, pp. 1095–
1105, 2000.
[122] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet
classification with deep convolutional neural networks,” in
NIPS, 2012, pp. 1097–1105.
[123] W. Wang, R. Arora, K. Livescu, and J. A. Bilmes, “Un-
supervised learning of acoustic features via deep canonical
correlation analysis,” in ICASSP, 2015, pp. 4590–4594.
[124] A. Lu, W. Wang, M. Bansal, K. Gimpel, and K. Livescu, “Deep
multilingual correlation for improved word embeddings,” in
HLT-NAACL, 2015, pp. 250–256.
[125] W. Wang, R. Arora, K. Livescu, and N. Srebro, “Stochastic op-
timization for deep CCA via nonlinear orthogonal iterations,”
in Allerton, 2015, pp. 688–695.
[126] F. Yan and K. Mikolajczyk, “Deep correlation for matching
images and text,” in CVPR, 2015, pp. 3441–3450.
[127] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet
allocation,” Journal of Machine Learning Research, vol. 3, pp.
993–1022, 2003.
[128] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, “Hierar-
chical dirichlet processes,” Journal of the American Statistical
Association, vol. 101, 2004.
[129] O. Yakhnenko and V. Honavar, “Multi-modal hierarchical
dirichlet process model for predicting image annotation and
image-object label correspondence,” in SDM, 2009, pp. 283–
293.
[130] H. Ishwaran and L. F. James, “Gibbs sampling methods for
stick-breaking priors,” Journal of the American Statistical As-
sociation, vol. 96, no. 453, pp. 161–173, 2001.
[131] Z. Qi, M. Yang, Z. M. Zhang, and Z. Zhang, “Mining partially
annotated images,” in KDD, 2011, pp. 1199–1207.
[132] J. Zhu, L. Li, F. Li, and E. P. Xing, “Large margin learning
of upstream scene understanding models,” in NIPS, 2010, pp.
2586–2594.
[133] C. Wang, D. M. Blei, and F. Li, “Simultaneous image classifi-
cation and annotation,” in CVPR, 2009, pp. 1903–1910.
[134] D. M. Blei and J. D. McAuliffe, “Supervised topic models,” in
NIPS, 2007, pp. 121–128.
[135] L. Li and F. Li, “What, where and who? classifying events by
scene and object recognition,” in ICCV, 2007, pp. 1–8.
[136] L. Cao and F. Li, “Spatially coherent latent topic model
for concurrent segmentation and classification of objects and
scenes,” in ICCV, 2007, pp. 1–8.
[137] L. Li, R. Socher, and F. Li, “Towards total scene understanding:
Classification, annotation and segmentation in an automatic
framework,” in CVPR, 2009, pp. 2036–2043.
[138] C. Nguyen, D. Zhan, and Z. Zhou, “Multi-modal image anno-
tation with multi-instance multi-label LDA,” in IJCAI, 2013.
[139] C. M. Bishop, Pattern recognition and machine learning.
Springer, 2006.
[140] Y. Han, F. Wu, D. Tao, J. Shao, Y. Zhuang, and J. Jiang, “Sparse
unsupervised dimensionality reduction for multiple view data,
IEEE Trans. Circuits Syst. Video Techn., vol. 22, no. 10, pp.
1485–1496, 2012.
[141] J. Yu, Y. Rui, and D. Tao, “Click prediction for web image
reranking using multimodal sparse coding,” IEEE Transactions
on Image Processing, vol. 23, no. 5, pp. 2019–2032, 2014.
[142] Y. Zhuang, Y. Wang, F. Wu, Y. Zhang, and W. Lu, “Supervised
coupled dictionary learning with group structures for multi-
modal retrieval,” in AAAI, 2013.
[143] F. Wu, Z. Yu, Y. Yang, S. Tang, Y. Zhang, and Y. Zhuang,
“Sparse multi-modal hashing,” IEEE Trans. Multimedia,
vol. 16, no. 2, pp. 427–439, 2014.
[144] M. Welling, M. Rosen-Zvi, and G. E. Hinton, “Exponential
family harmoniums with an application to information re-
trieval,” in NIPS, 2004, pp. 1481–1488.
[145] G. E. Hinton, “Training products of experts by minimizing
contrastive divergence,” Neural Computation, vol. 14, no. 8,
pp. 1771–1800, 2002.
[146] P. Xie and E. P. Xing, “Multi-modal distance metric learning,
in IJCAI, 2013, pp. 1806–1812.
[147] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. J. Russell, “Dis-
tance metric learning with application to clustering with side-
information,” in NIPS, 2002, pp. 505–512.
[148] H. Wold, “Soft modeling: the basic design and some exten-
sions,” Systems under indirect observation, vol. 2, pp. 589–591,
1982.
[149] R. Rosipal and N. Krämer, Overview and Recent Advances in
Partial Least Squares, 2006, pp. 34–51.
[150] W. R. Schwartz, A. Kembhavi, D. Harwood, and L. S. Davis,
“Human detection using partial least squares analysis,” in
ICCV, 2009, pp. 24–31.
[151] S. Wold, M. Sjstrm, and L. Eriksson, “Pls-regression: a basic
tool of chemometrics,” Chemometrics and Intelligent Labora-
tory Systems, vol. 58, pp. 109–130, 2001.
[152] M. Barker and W. Rayens, “Partial least squares for discrimina-
tion,” Journal of Chemometrics, vol. 17, no. 3, pp. 166 – 173,
2003.
[153] G. Guo and G. Mu, “Joint estimation of age, gender and
ethnicity: CCA vs. PLS,” in FG, 2013, pp. 1–6.
[154] D. Li, N. Dimitrova, M. Li, and I. K. Sethi, “Multimedia
content processing through cross-modal association,” in ACM
Multimedia, 2003, pp. 604–611.
[155] Y. Wang, L. Guan, and A. N. Venetsanopoulos, “Kernel cross-
modal factor analysis for information fusion with application
to bimodal emotion recognition,” IEEE Transactions on Multi-
media, vol. 14, no. 3-1, pp. 597–607, 2012.
[156] J. C. Pereira, E. Coviello, G. Doyle, N. Rasiwasia, G. R. G.
Lanckriet, R. Levy, and N. Vasconcelos, “On the role of cor-
relation and abstraction in cross-modal multimedia retrieval,
IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 3, pp.
521–535, 2014.
[157] J. Wang, H. Wang, Y. Tu, K. Duan, Z. Zhan, and
S. Chekuri, “Supervised cross-modal factor analysis,” CoRR,
vol. abs/1502.05134, 2015.
[158] T. Diethe, D. R. Hardoon, and J. Shawe-taylor, “Multiview
fisher discriminant analysis,” in In NIPS Workshop on Learning
from Multiple Sources, 2008.
[159] T. Diethe, D. R. Hardoon, and J. Shawe-Taylor, “Constructing
nonlinear discriminants from multiple data views,” in ECML
PKDD, 2010, pp. 328–343.
[160] M. Kan, S. Shan, H. Zhang, S. Lao, and X. Chen, “Multi-view
discriminant analysis,” in ECCV, 2012, pp. 808–821.
[161] S. Mika, A. J. Smola, and B. Schölkopf, “An improved training algorithm for kernel Fisher discriminants,” in AISTATS, 2001.
[162] J. B. Tenenbaum and W. T. Freeman, “Separating style and
content with bilinear models,” Neural Computation, vol. 12,
no. 6, pp. 1247–1283, 2000.
[163] S. Yan, D. Xu, B. Zhang, H. Zhang, Q. Yang, and S. Lin,
“Graph embedding and extensions: A general framework for
dimensionality reduction,” IEEE Trans. Pattern Anal. Mach.
Intell., vol. 29, no. 1, pp. 40–51, 2007.
[164] A. Sharma, A. Kumar, H. D. III, and D. W. Jacobs, “Gener-
alized multiview analysis: A discriminative latent space,” in
CVPR, 2012, pp. 2160–2167.
[165] M. M. Bronstein, A. M. Bronstein, F. Michel, and N. Paragios,
“Data fusion through cross-modality metric learning using
similarity-sensitive hashing,” in CVPR, 2010, pp. 3594–3601.
[166] S. Kumar and R. Udupa, “Learning hash functions for cross-
view similarity search,” in IJCAI, 2011, pp. 1360–1365.
[167] Y. Gong and S. Lazebnik, “Iterative quantization: A pro-
crustean approach to learning binary codes,” in CVPR, 2011,
pp. 817–824.
[168] Y. Zhen and D. Yeung, “Co-regularized hashing for multimodal
data,” in NIPS, 2012, pp. 1385–1393.
[169] X. Zhu, Z. Huang, H. T. Shen, and X. Zhao, “Linear cross-
modal hashing for efficient multimedia search,” in ACM Multi-
media Conference, MM ’13, Barcelona, Spain, October 21-25,
2013, 2013, pp. 143–152.
[170] D. Zhai, H. Chang, Y. Zhen, X. Liu, X. Chen, and W. Gao,
“Parametric local multimodal hashing for cross-view similarity
search,” in IJCAI, 2013, pp. 2754–2760.
[171] X. Liu, J. He, C. Deng, and B. Lang, “Collaborative hashing,”
in CVPR, 2014, pp. 2147–2154.
[172] D. Zhang and W. Li, “Large-scale supervised multimodal hash-
ing with semantic correlation maximization,” in AAAI, 2014,
pp. 2177–2183.
[173] Z. Lin, G. Ding, M. Hu, and J. Wang, “Semantics-preserving
hashing for cross-view retrieval,” in CVPR, 2015, pp. 3864–
3872.
[174] Y. Zhen, Y. Gao, D. Yeung, H. Zha, and X. Li, “Spectral
multimodal hashing and its application to multimedia retrieval,
IEEE Trans. Cybernetics, vol. 46, no. 1, pp. 27–38, 2016.
[175] G. Shakhnarovich, “Learning task-specific similarity,” Ph.D.
dissertation, Cambridge, MA, USA, 2005.
[176] Y. Freund and R. E. Schapire, “A decision-theoretic general-
ization of on-line learning and an application to boosting,” J.
Comput. Syst. Sci., vol. 55, no. 1, pp. 119–139, 1997.
[177] Z. Yu, F. Wu, Y. Yang, Q. Tian, J. Luo, and Y. Zhuang, “Dis-
criminative coupled dictionary hashing for fast cross-media
retrieval,” in SIGIR, 2014, pp. 395–404.
[178] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in
NIPS, 2008, pp. 1753–1760.
[179] W. Liu, J. Wang, S. Kumar, and S. Chang, “Hashing with
graphs,” in ICML, 2011, pp. 1–8.
[180] J. Song, Y. Yang, Y. Yang, Z. Huang, and H. T. Shen, “Inter-
media hashing for large-scale retrieval from heterogeneous data
sources,” in SIGMOD, 2013, pp. 785–796.
[181] D. E. Rumelhart, J. L. McClelland, and C. PDP Re-
search Group, Eds., Parallel Distributed Processing: Explo-
rations in the Microstructure of Cognition, Vol. 1: Foundations.
Cambridge, MA, USA: MIT Press, 1986.
[182] G. E. Hinton and R. R. Salakhutdinov, “Replicated softmax: an
undirected topic model,” in NIPS, 2009, pp. 1607–1614.
[183] J. Huang and B. Kingsbury, “Audio-visual deep learning for
noise robust speech recognition,” in ICASSP, 2013, pp. 7596–
7599.
[184] H. Hu, B. Liu, B. Wang, M. Liu, and X. Wang, “Multimodal DBN for predicting high-quality answers in cQA portals,” in ACL (2), 2013, pp. 843–847.
[185] L. Ge, J. Gao, X. Li, and A. Zhang, “Multi-source deep learning
for information trustworthiness estimation,” in KDD, 2013, pp.
766–774.
[186] L. Pang and C.-W. Ngo, “Mutlimodal learning with deep
boltzmann machine for emotion prediction in user generated
videos,” in ACM ICMR, 2015, pp. 619–622.
[187] K. Sohn, W. Shang, and H. Lee, “Improved multimodal deep
learning with variation of information,” in NIPS, 2014, pp.
2141–2149.
[188] M. Ehrlich, T. J. Shields, T. Almaev, and M. R. Amer, “Facial
attributes classification using multi-task representation learn-
ing,” in CVPR Workshops, 2016, pp. 47–55.
[189] H. Larochelle and Y. Bengio, “Classification using discrimina-
tive restricted boltzmann machines,” in ICML, 2008, pp. 536–
543.
[190] C. Hong, J. Yu, J. Wan, D. Tao, and M. Wang, “Multimodal
deep autoencoder for human pose recovery,” IEEE Transac-
tions on Image Processing, vol. 24, no. 12, pp. 5659–5670,
2015.
[191] B. S. Riggan, C. Reale, and N. M. Nasrabadi, “Coupled auto-
associative neural networks for heterogeneous face recogni-
tion,” IEEE Access, vol. 3, pp. 1620–1632, 2015.
[192] H. Amiri, P. Resnik, J. Boyd-Graber, and H. D. III, “Learning
text pair similarity with context-sensitive autoencoders,” 2016.
[193] H. Lee, C. Ekanadham, and A. Y. Ng, “Sparse deep belief net
model for visual area v2,” in NIPS, 2008, pp. 873–880.
[194] W. Wang, B. C. Ooi, X. Yang, D. Zhang, and Y. Zhuang, “Ef-
fective multi-modal retrieval based on stacked auto-encoders,”
Proceedings of the VLDB Endowment, vol. 7, no. 8, pp. 649–
660, 2014.
[195] S. Rastegar, M. Soleymani, H. R. Rabiee, and S. Mohsen Shojaee, “MDL-CW: A multimodal deep learning framework with cross weights,” in CVPR, 2016, pp. 2601–2609.
[196] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean,
and T. Mikolov, “DeViSE: A deep visual-semantic embedding
model,” in NIPS, 2013, pp. 2121–2129.
[197] J. Dong, X. Li, and C. G. Snoek, “Word2VisualVec: Cross-media retrieval by visual feature prediction,” arXiv preprint arXiv:1604.06838, 2016.
[198] R. Kiros, R. Salakhutdinov, and R. S. Zemel, “Multimodal
neural language models.” in ICML, vol. 14, 2014, pp. 595–603.
[199] A. Lazaridou, N. T. Pham, and M. Baroni, “Combining lan-
guage and vision with a multimodal skip-gram model,” arXiv
preprint arXiv:1501.02598, 2015.
[200] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
[201] B. Bai, J. Weston, D. Grangier, R. Collobert, K. Sadamasa,
Y. Qi, O. Chapelle, and K. Q. Weinberger, “Learning to rank
with (a lot of) word features,” Inf. Retr., vol. 13, no. 3, pp.
291–314, 2010.
[202] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao,
A. Senior, P. Tucker, K. Yang, Q. V. Le et al., “Large scale
distributed deep networks,” in NIPS, 2012, pp. 1223–1231.
[203] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean, “Zero-shot learning by convex combination of semantic embeddings,” arXiv preprint arXiv:1312.5650, 2013.
[204] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng,
P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt et al., “From
captions to visual concepts and back,” in CVPR, 2015, pp.
1473–1482.
[205] Y. Feng and M. Lapata, “Visual information in semantic repre-
sentation,” in HLT-NAACL, 2010, pp. 91–99.
[206] E. Bruni, N.-K. Tran, and M. Baroni, “Multimodal distributional semantics,” J. Artif. Intell. Res. (JAIR), vol. 49, pp. 1–47, 2014.
[207] D. Kiela and L. Bottou, “Learning image embeddings using
convolutional neural networks for improved multi-modal se-
mantics.” in EMNLP, 2014, pp. 36–45.
[208] R. Xu, C. Xiong, W. Chen, and J. J. Corso, “Jointly modeling
deep video and compositional text to bridge vision and lan-
guage in a unified framework.” in AAAI, 2015, pp. 2346–2352.
[209] Y. Kang, S. Kim, and S. Choi, “Deep learning to hash with
multiple representations,” in ICDM, 2012, pp. 930–935.
[210] Y. Zhuang, Z. Yu, W. Wang, F. Wu, S. Tang, and J. Shao,
“Cross-media hashing with neural networks,” in ACM Multi-
media, 2014, pp. 901–904.
[211] D. Wang, P. Cui, M. Ou, and W. Zhu, “Deep multimodal
hashing with orthogonal regularization,” in IJCAI, 2015.
[212] Q.-Y. Jiang and W.-J. Li, “Deep cross-modal hashing,” arXiv preprint arXiv:1602.02255, 2016.
[213] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, “Recurrent neural network based language model,” in Interspeech, vol. 2, 2010, p. 3.
[214] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[215] K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” in EMNLP, 2014, pp. 1724–1734.
[216] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence
learning with neural networks,” in NIPS, 2014, pp. 3104–3112.
[217] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[218] R. Kiros, R. Salakhutdinov, and R. S. Zemel, “Unifying visual-
semantic embeddings with multimodal neural language mod-
els,” arXiv preprint arXiv:1411.2539, 2014.
[219] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney,
and K. Saenko, “Translating videos to natural language
using deep recurrent neural networks,” arXiv preprint
arXiv:1412.4729, 2014.
[220] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Dar-
rell, and K. Saenko, “Sequence to sequence - video to text,” in
ICCV, 2015, pp. 4534–4542.
[221] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra,
C. Lawrence Zitnick, and D. Parikh, “VQA: Visual question
answering,” in ICCV, 2015, pp. 2425–2433.
[222] X. Chen and C. L. Zitnick, “Learning a recurrent visual
representation for image caption generation,” arXiv preprint
arXiv:1411.5654, 2014.
[223] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine trans-
lation by jointly learning to align and translate,” CoRR, vol.
abs/1409.0473, 2014.
[224] J. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple object recogni-
tion with visual attention,” CoRR, vol. abs/1412.7755, 2014.
[225] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, “Recurrent
models of visual attention,” CoRR, vol. abs/1406.6247, 2014.
[226] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhut-
dinov, R. S. Zemel, and Y. Bengio, “Show, attend and tell:
Neural image caption generation with visual attention,” CoRR,
vol. abs/1502.03044, 2015.
[227] D. Zhou, J. Weston, A. Gretton, O. Bousquet, and B. Schölkopf, “Ranking on data manifolds,” in NIPS, 2003, pp. 169–176.
[228] X. Mao, B. Lin, D. Cai, X. He, and J. Pei, “Parallel field
alignment for cross media retrieval,” in ACM Multimedia, 2013,
pp. 897–906.
Yingming Li received the BS and MS degrees in automation from the University of Science and Technology of China, Hefei, China, and the PhD degree in information science and electronic engineering from Zhejiang University, Hangzhou, China. He is currently an assistant professor with the College of Information Science and Electronic Engineering at Zhejiang University, China.
Ming Yang received the BS, MS, and PhD degrees in information science and electronic engineering from Zhejiang University, Hangzhou, China. He was a visiting scholar in computer science at the State University of New York (SUNY) at Binghamton.
Zhongfei (Mark) Zhang received the BS degree in electronics engineering and the MS degree in information science, both from Zhejiang University, Hangzhou, China, and the PhD degree in computer science from the University of Massachusetts at Amherst. He is a QiuShi Chaired Professor at Zhejiang University, China, where he directs the Data Science and Engineering Research Center, while on leave from the State University of New York (SUNY) at Binghamton, USA, where he is a professor in the Computer Science Department and directs the Multimedia Research Laboratory. He has published more than 100 peer-reviewed papers in leading international journals and conferences, as well as several invited papers and book chapters, and has authored or co-authored two monographs, on multimedia data mining and relational data clustering, respectively.