All content in this area was uploaded by Hsin-min Wang on Sep 23, 2016
Minimization of Regression and Ranking Losses with Shallow Neural
Networks on Automatic Sincerity Evaluation
Hung-Shin Lee1,2, Yu Tsao3, Chi-Chun Lee4, Hsin-Min Wang2
Wei-Cheng Lin4, Wei-Chen Chen4, Shan-Wen Hsiao4, Shyh-Kang Jeng1
1Department of Electrical Engineering, National Taiwan University, Taiwan
2Institute of Information Science, Academia Sinica, Taiwan
3Research Center for Information Technology Innovation, Academia Sinica, Taiwan
4Department of Electrical Engineering, National Tsing Hua University, Taiwan
hslee@iis.sinica.edu.tw, yu.tsao@citi.sinica.edu.tw, cclee@ee.nthu.edu.tw
Abstract
To estimate the degree of sincerity conveyed by a speech utterance and perceived by listeners, we propose an instance-based learning framework with shallow neural networks. The framework acts not only as a regressor that fits the predicted value to the actual value but also as a ranker that preserves the relative target magnitude between each pair of utterances, in an attempt to achieve a higher Spearman's rank correlation coefficient. In addition to describing how to minimize regression and ranking losses simultaneously, we address how utterance pairs are used in the training and evaluation phases through two realizations. The intuitive one relies on random sampling, while the other seeks representative utterances, named anchors, to form non-stochastic pairs. Our system outperforms the baseline by more than 25% relative improvement on the development set.
Index Terms: regression, ranking, degree of sincerity, shallow
neural networks, computational paralinguistics
1. Introduction
Given a speech utterance, the goal of automatic sincerity evaluation is to use a learning machine to tell how sincere the utterance sounds to human receivers. Sincerity is undoubtedly a highly subjective affect whose range seems unmeasurable because of individual differences. As a result, it is usually much easier to acquire the quantitative degree of sincerity that a listener feels than the mental state of sincerity that a speaker has while speaking. This compromise leaves broad room for applications such as personal image consulting and the performing arts. For example, with a well-trained machine, a politician running a campaign on a tight budget can learn how sincere voters will find a speech; an actor or actress can revise his or her speaking style according to the machine's feedback before an audition. Although only a relatively limited amount of research and experimentation has been done in recent years on the latent factors of speaking sincerity [1, 2], we can still leverage machine learning strategies to obtain useful cues for sincerity evaluation, provided that the collected data cover linguistic content and prosodic types satisfactorily and the respective labels are well annotated [3].
Evaluation metrics. In contrast to the similar task of sarcasm recognition, which deals with binary classification [4, 5], automatic sincerity evaluation is treated as a regression problem in this paper. That is, given a speech utterance, a predicted sincerity value has to be generated that is as close as possible to the actual value rated by annotators. However, instead of the well-known mean squared error (MSE), Spearman's rank correlation coefficient (ρ) [6], used in research on human perception modeling and psycholinguistics [7, 8], has become a standard evaluation metric in computational paralinguistics, as in the Degree of Nativeness and Parkinson's Condition sub-challenges of ComParE 2015 [9]. Spearman's ρ assesses how well the relationship between predicted and actual values can be described by a monotonic function, based on their relative ranks. To our knowledge, there are two reasons to adopt it. First, it is less sensitive to extreme values, so the evaluation of a system is not unduly affected by occasional outlying predictions [10]. Second, a sincerity evaluation system may be good in a pragmatic sense not because it can tell us precisely how annotators would rate, since to ordinary people these ratings are not as meaningful as units such as meters and grams in the metric system. Rather, a good system can give us an ordinal ranking of a set of utterances that represents their relative degree of sincerity. Decision making by comparison is always much easier for us humans.
Possible problems. Many popular regression methods, including artificial neural networks (ANN) [11] and support vector regression (SVR) [12], are not designed specifically to achieve a higher Spearman's ρ, but to minimize the residual sum of squares between the ground truth and the predicted responses, either directly or through margin maximization [13]. Although a model that gives perfect regression also gives perfect ranking, a model with near-perfect regression performance may yield poor ranking performance. For example, a regressor that predicts [1.1, 1.25, 1.24] for the true values [1.1, 1.2, 1.3] has an MSE of only 0.002 but a ρ of 0.5. A well-trained ranker, by contrast, achieves a perfect ρ with predictions of [4.1, 5.2, 6.3], even though its MSE is much worse. Therefore, even in less extreme cases, small regression errors can cause large ranking errors. Note that this does not mean we should abandon criteria that minimize regression errors and pursue only the maximization of Spearman's ρ, although ideally we would maximize ρ directly. Unfortunately, there seems to have been little research toward this goal so far. Although some listwise approaches have been proposed in information retrieval to optimize other ranking measures of the test samples, such as the normalized discounted cumulative gain (NDCG) [14, 15, 16], they are not readily suitable for the evaluation scenario in this paper, since each test sample has to be evaluated identically and independently.
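The numerical example in this paragraph can be checked with a few lines of standard-library Python. This is only an illustrative sketch: the helper names `mse`, `ranks`, and `spearman_rho` are ours, and the rank formula used assumes there are no ties.

```python
def mse(pred, true):
    """Mean squared error between two equally long lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def ranks(values):
    """Rank 1 for the smallest value; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(pred, true):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n (n^2 - 1)), no ties."""
    n = len(true)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(pred), ranks(true)))
    return 1 - 6 * d2 / (n * (n * n - 1))


true = [1.1, 1.2, 1.3]
regressor = [1.1, 1.25, 1.24]   # small errors, but two items swap ranks
ranker = [4.1, 5.2, 6.3]        # large errors, but the order is preserved

print(mse(regressor, true), spearman_rho(regressor, true))
print(mse(ranker, true), spearman_rho(ranker, true))
```

The regressor's MSE is about 0.002 with ρ = 0.5, while the ranker reaches ρ = 1 despite an MSE above 16.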
Contributions. To solve the aforementioned problems, we propose a framework with shallow neural networks that uses an objective function for minimizing both regression and pairwise ranking losses. A similar idea is presented in [17], where the two losses are minimized alternately when training a linear model: a tradeoff coefficient controls whether each input is randomly drawn as a single sample or as a candidate pair. In contrast, the proposed training algorithm minimizes the two losses simultaneously. In addition, this paper makes two more contributions. First, two realizations are put forward to show how utterance pairs are generated and used in the overall learning and evaluation mechanism. Second, the use of artificial neural networks with a proper amount of regularization circumvents the small sample size problem in our task while maintaining generalization ability [18, 19, 20].
The remainder of this paper is organized as follows. Sec-
tion 2 gives our objectives in regression and ranking. Section 3
presents two kinds of architectures based on shallow neural net-
works and shows how to derive the optimal model by minimiz-
ing the regression and ranking losses simultaneously. Section 4
reports the experiment results on the Sincerity sub-challenge of
ComParE 2016. Finally, Section 5 gives the conclusions.
2. Objectives
2.1. Regression and Ranking Losses
The goal of supervised regression is to learn a model $M_{rg}$ that can predict a real-valued target $y \in \mathbb{R}$ for a feature vector $\mathbf{x}$ using a prediction function $f(\mathbf{x})$ with little loss with respect to a specified loss function. Given a set of labeled training data $D$, the aggregate regression error based on the residual sum of squares between the target $y$ and the prediction $f(\mathbf{x})$ is given by

$$L_{rg}(D) = \frac{1}{|D|} \sum_{(\mathbf{x}, y) \in D} \left(f(\mathbf{x}) - y\right)^2. \tag{1}$$

Therefore, a well-trained regression model $M_{rg}$ that minimizes $L_{rg}(D)$ can be expected to predict a degree of sincerity that is desirably close to the human-annotated one.
In the same vein as RankSVM [21], given the difference $\Delta\mathbf{x}_{ab}$ of two feature vectors $\mathbf{x}_a$ and $\mathbf{x}_b$, the goal of a supervised pairwise ranking method is to learn a model $M_{rk}$ that can predict the difference $\Delta y_{ab}$ of the target values $y_a$ and $y_b$ using a prediction function $f(\Delta\mathbf{x}_{ab})$, where $\Delta y_{ab} = y_a - y_b$ and $\Delta\mathbf{x}_{ab} = \mathbf{x}_a - \mathbf{x}_b$. Suppose a set of training pairs $P$ is selected from $D$; the incurred rank-based loss is then given by

$$L_{rk}(P) = \frac{1}{|P|} \sum_{((\mathbf{x}_a, y_a), (\mathbf{x}_b, y_b)) \in P} \left(f(\Delta\mathbf{x}_{ab}) - \Delta y_{ab}\right)^2. \tag{2}$$

Therefore, by minimizing $L_{rk}(P)$, the prediction function attempts, to some extent, to guarantee that if utterance $a$ sounds more sincere than $b$, i.e., $y_a > y_b$, then $a$ will have a larger predicted value than $b$. In fact, (1) can be regarded as a special case of (2) if we set utterance $b$ to a trivial reference $(\mathbf{x}_b, y_b) = (\mathbf{0}, 0)$ for all utterances $a$.
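As a minimal sketch of the two losses, the code below evaluates (1) and (2) for an arbitrary prediction function and checks the special-case remark numerically. The linear predictor, its weights, and the toy data are hypothetical and chosen only for illustration.

```python
def regression_loss(f, data):
    """Eq. (1): mean of (f(x) - y)^2 over the labeled set D."""
    return sum((f(x) - y) ** 2 for x, y in data) / len(data)


def ranking_loss(f, pairs):
    """Eq. (2): mean of (f(xa - xb) - (ya - yb))^2 over the pair set P."""
    total = 0.0
    for (xa, ya), (xb, yb) in pairs:
        dx = [a - b for a, b in zip(xa, xb)]
        total += (f(dx) - (ya - yb)) ** 2
    return total / len(pairs)


# Hypothetical linear predictor and toy data (not from the paper).
w = [0.5, -0.2]


def f(x):
    return sum(wi * xi for wi, xi in zip(w, x))


data = [([1.0, 2.0], 0.3), ([2.0, 0.5], 1.0), ([0.0, 1.0], -0.4)]

# Pairing every sample with the trivial reference (0, 0) turns (2) into (1).
zero_ref = ([0.0, 0.0], 0.0)
pairs_to_zero = [(sample, zero_ref) for sample in data]

print(regression_loss(f, data))
print(ranking_loss(f, pairs_to_zero))
```

Both printed values coincide, because subtracting the zero reference leaves each feature vector and each label unchanged.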
2.2. The Combination of Regression and Ranking Losses
As mentioned in Section 1, although the final evaluation metric in our task is entirely rank-based, we still need to minimize the regression loss to some degree in order to 1) reduce the number of uncontrollable outliers and 2) complement the pairwise ranking loss, which is only a makeshift, suboptimal substitute for minimizing the overall ranking error. Finally, by assuming that $M_{rg}$ and $M_{rk}$ share the same model $M$ and combining (1) and (2), the new goal of the training process is to find an optimal $M$ by minimizing

$$J = \alpha L_{rg}(D) + (1 - \alpha) L_{rk}(P) + \lambda \|M\|^2, \tag{3}$$

where $\alpha \in [0, 1]$ is a weight that balances the importance of the regression loss and the pairwise ranking loss, while $\lambda$ is a regularization parameter that controls the complexity of the model.

Figure 1: The architecture of spR2NNs.

Figure 2: The architecture of acR2NNs.
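To make the combined objective in (3) concrete, the following sketch evaluates J for a linear model. The linear form of the model and the choice of the squared L2 norm of the weight vector for $\|M\|^2$ are assumptions made for illustration; the paper applies (3) to neural networks.

```python
def combined_objective(w, data, pairs, alpha, lam):
    """J = alpha * L_rg(D) + (1 - alpha) * L_rk(P) + lam * ||w||^2."""

    def f(x):
        # Hypothetical linear predictor standing in for the shared model M.
        return sum(wi * xi for wi, xi in zip(w, x))

    l_rg = sum((f(x) - y) ** 2 for x, y in data) / len(data)

    l_rk = 0.0
    for (xa, ya), (xb, yb) in pairs:
        dx = [a - b for a, b in zip(xa, xb)]
        l_rk += (f(dx) - (ya - yb)) ** 2
    l_rk /= len(pairs)

    penalty = sum(wi * wi for wi in w)  # assumed form of ||M||^2
    return alpha * l_rg + (1 - alpha) * l_rk + lam * penalty


# Toy data: two samples and the single pair formed from them.
data = [([1.0], 2.0), ([3.0], 3.0)]
pairs = [(data[0], data[1])]

print(combined_objective([1.0], data, pairs, alpha=0.5, lam=0.1))
```

For this toy input the regression loss is 0.5, the ranking loss is 1.0, and the penalty is 1.0, so J = 0.25 + 0.5 + 0.1 = 0.85. Setting `alpha=1.0` recovers pure regularized regression, and `alpha=0.0` pure regularized pairwise ranking.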
3. Realizations
To realize the objective function (3) by means of artificial neu-
ral networks, we propose two structures, in light of multi-modal
and multi-task neural networks [22, 23, 24, 25, 26, 27, 28].
They mainly differ in 1) the way that the utterance pairs are
generated and used and 2) the way that a test sample is fed into
the machine while predicting.
3.1. The Sampling-based Neural Network
The architecture of the proposed sampling-based regression and ranking neural networks (spR2NNs) is depicted in Figure 1. Each input stream, i.e., the feature vectors in $D$ or the difference vectors in $P$, is first modeled independently by a separate neural network, built up of hidden layers $\{h_g\}$ and $\{h_k\}$, respectively, with their corresponding parameters $\{\mathbf{W}, \mathbf{b}\}$. After a succession of encoder functions $h_1 = \phi(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1)$ and $h_i = \phi(\mathbf{W}_i h_{i-1} + \mathbf{b}_i)$, where $i > 1$ and $\phi(\cdot)$ is the activation function, the last separate layers are jointly connected to the strongly parameter-shared layers, denoted by $\{h_s\}$, which mix the features learned from the two kinds of data streams, namely $\mathbf{x}$ and $\Delta\mathbf{x}$. The final output layer, obtained from the last layer of $\{h_s\}$ through a linear activation function, consists of only two nodes, which give the predictions of the ground truth $y$ and its difference $\Delta y$.
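The forward pass just described might look roughly as follows. This is a sketch under several assumptions that the text does not fix: one separate layer per stream, concatenation as the way the two streams are joined before the shared layer, ReLU as $\phi$, and small random placeholder weights. The names `forward`, `predict`, and `params` are ours.

```python
import random


def relu(v):
    return [max(0.0, a) for a in v]


def dense(W, b, v):
    """Affine map W v + b, with W given as a list of rows."""
    return [sum(wij * vj for wij, vj in zip(row, v)) + bi
            for row, bi in zip(W, b)]


def init_layer(n_out, n_in, rng):
    """Placeholder initialization: small uniform weights, zero biases."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def forward(params, x, dx):
    hg = relu(dense(*params["g"], x))        # separate layer for x
    hk = relu(dense(*params["k"], dx))       # separate layer for delta-x
    hs = relu(dense(*params["s"], hg + hk))  # shared layer (concatenation assumed)
    y_hat, dy_hat = dense(*params["out"], hs)  # linear two-node output
    return y_hat, dy_hat


def predict(params, x):
    """Prediction as in Algorithm 2: feed a zero difference, keep only y."""
    return forward(params, x, [0.0] * len(x))[0]


rng = random.Random(0)
n_features, n_sep, n_shared = 4, 3, 5
params = {
    "g": init_layer(n_sep, n_features, rng),
    "k": init_layer(n_sep, n_features, rng),
    "s": init_layer(n_shared, 2 * n_sep, rng),
    "out": init_layer(2, n_shared, rng),
}
x = [0.2, -1.0, 0.5, 0.3]
print(forward(params, x, [0.1, 0.0, -0.2, 0.4]))
print(predict(params, x))
```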
Algorithm 1 The Training Procedure for spR2NNs
Input: The training data X = [x_1, ..., x_N], the development data U = [u_1, ..., u_K], their respective label vectors y and v, the initial model M^(0), the maximum number of epochs T, and the tolerance τ.
Output: The model estimate M̂.
1: for i = 1 to T do
2:   Randomly permute the columns of X to obtain a new X′ and the corresponding y′.
3:   Calculate ∆X = X − X′ and ∆y = y − y′.
4:   Update the parameters of each layer with (4).
5:   if the development set U does not exist then
6:     Use Algorithm 2 to calculate ρ_i, the Spearman's ρ of y and the predicted vector from X.
7:     if i mod 100 = 0 then
8:       Calculate ρ̄_i by averaging ρ_{i−100}, ..., ρ_i.
9:       if |ρ̄_i − ρ̄_{i−100}| / ρ̄_{i−100} < τ then
10:        Store the model M^(i) as M̂ and break the loop.
11:      end if
12:    end if
13:  else
14:    Use Algorithm 2 to calculate the Spearman's ρ on the development set (U, v) and store the model that results in the best ρ so far as M̂.
15:  end if
16: end for
Algorithm 2 The Prediction Procedure for spR2NNs
Input: The feature matrix X and the model M̂.
Output: The predicted vector y.
1: Set ∆X to be a zero matrix with the same shape as X.
2: Feed X and ∆X into the spR2NNs with M̂ and return y and ∆y. Note that ∆y is discarded.
To train the spR2NNs, the set of candidate pairs $P$ has to be prepared in advance. Algorithm 1 shows the training procedure, where $P$ is formed by randomly permuting the training set in each training epoch. Note that the tradeoff coefficient $\alpha$ in (3) does not function as a ratio for randomly picking either of the input streams to optimize its corresponding loss function individually, as done in [17]. Instead, it expresses the relative importance of the regression and ranking losses and enters directly into the derivation of the model. By back-propagating from the output layer down through the whole spR2NNs to adjust all parameters, each parameter in $M$ is updated by gradient descent during the $t$-th epoch as

$$\mathbf{W}^{(t)} \leftarrow \mathbf{W}^{(t-1)} - \eta \frac{\partial J}{\partial \mathbf{W}^{(t-1)}}, \tag{4a}$$

$$\mathbf{b}^{(t)} \leftarrow \mathbf{b}^{(t-1)} - \eta \frac{\partial J}{\partial \mathbf{b}^{(t-1)}}, \tag{4b}$$

where $\eta$ is the learning rate, and $\mathbf{W} \in \{\mathbf{W}_s, \mathbf{W}_g, \mathbf{W}_k\}$ and $\mathbf{b} \in \{\mathbf{b}_s, \mathbf{b}_g, \mathbf{b}_k\}$ in Figure 1. Since the combined training data are not deterministic, $J$ and the Spearman's ρ derived from the training set and the development set both fluctuate, but they tend to move steadily over training epochs in the long term. Therefore, steps 5 to 12 in Algorithm 1 define a stopping criterion, based on a simple moving average of Spearman's ρ, to deal with this behavior.
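Two pieces of Algorithm 1 lend themselves to a short sketch: forming the pair stream by permutation (steps 2 and 3) and the moving-average stopping rule (steps 7 to 10). The function names are ours, samples are stored as rows rather than columns, and the window averages exactly `window` values rather than the 101 values ρ_{i−100}, ..., ρ_i, which is a simplification.

```python
import random


def permuted_pairs(X, y, rng):
    """Steps 2-3: pair each sample with a randomly permuted partner."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    dX = [[a - b for a, b in zip(X[i], X[j])] for i, j in enumerate(idx)]
    dy = [y[i] - y[j] for i, j in enumerate(idx)]
    return dX, dy


def should_stop(rho_history, window=100, tol=1e-5):
    """Steps 7-10: compare the latest moving average with the previous one.

    rho_history[i - 1] holds rho_i. Assumes the previous average is positive.
    """
    i = len(rho_history)
    if i % window != 0 or i < 2 * window:
        return False
    current = sum(rho_history[-window:]) / window
    previous = sum(rho_history[-2 * window:-window]) / window
    return abs(current - previous) / previous < tol


rng = random.Random(0)
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y = [0.1, 0.2, 0.3]
dX, dy = permuted_pairs(X, y, rng)
print(dX, dy)
print(should_stop([0.5] * 200))
```

Because the partners are a permutation of the same samples, the label differences in each epoch sum to zero, and a different set of pairs is drawn every epoch.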
The simplest reasonable way to feed test samples $\mathbf{X}$ into the trained machine is to set $\Delta\mathbf{X}$ to a matrix filled with zeros, as shown in Algorithm 2.
Algorithm 3 The Training Procedure for acR2NNs
Input: The same as in Algorithm 1.
Output: The model estimate M̂.
1: Run the K-means algorithm on X to yield n clusters and their respective centers A = [a_1, ..., a_n].
2: Pick n samples, [x_{a_1}, ..., x_{a_n}], from X that are respectively the closest to the cluster centers.
3: Form the input and truth streams in Figure 2 by calculating ∆X_{a_i} = X − X_{a_i} and ∆y_{a_i} = y − y_{a_i}, i = 1, ..., n.
4: Run steps 4 to 16 of Algorithm 1, except that the prediction procedure has to be modified for acR2NNs.
3.2. The Anchor-based Neural Network
Without any sampling process in the training phase, we propose another architecture called the anchor-based regression and ranking neural networks (acR2NNs), depicted in Figure 2. The philosophy behind acR2NNs is to reinterpret the pairwise ranking problem as a reference-comparison problem. That is, it presumes that, given some fixed references, namely anchors, if we can accurately predict the relative distances between the label of each sample and the labels of the anchors, then the predicted labels of the samples will be ranked nearly the same as the ground truth. The training procedure of acR2NNs differs from that of spR2NNs only in the use of sample pairs. As shown in Algorithm 3, a clustering method, such as K-means, is first performed to derive the most representative samples in the training set, and the feature differences are then calculated to form the other input and output streams. Note that, as in other instance-based learning machines, these anchors have to be stored together with the model parameters for use in both the training and test phases.
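Steps 1 to 3 of Algorithm 3 can be sketched as below. The plain Lloyd-style K-means, its random initialization, and the fixed iteration count are our assumptions, since the paper does not specify a particular K-means variant; all function names are hypothetical.

```python
import random


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(X, n, rng, iters=20):
    """Basic Lloyd iterations; returns n cluster centers."""
    centers = [list(x) for x in rng.sample(X, n)]
    for _ in range(iters):
        groups = [[] for _ in range(n)]
        for x in X:
            nearest = min(range(n), key=lambda c: sq_dist(x, centers[c]))
            groups[nearest].append(x)
        for c, members in enumerate(groups):
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def select_anchors(X, centers):
    """Step 2: index of the training sample closest to each center."""
    return [min(range(len(X)), key=lambda i: sq_dist(X[i], c)) for c in centers]


def anchor_streams(X, y, anchor_idx):
    """Step 3: one (delta-X, delta-y) stream per anchor."""
    streams = []
    for a in anchor_idx:
        dX = [[xi - ai for xi, ai in zip(x, X[a])] for x in X]
        dy = [yi - y[a] for yi in y]
        streams.append((dX, dy))
    return streams


rng = random.Random(0)
X = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
y = [1.0, 1.2, 3.0, 3.1]
anchor_idx = select_anchors(X, kmeans(X, 2, rng))
streams = anchor_streams(X, y, anchor_idx)
print(anchor_idx)
```

Unlike the permutation-based pairs of spR2NNs, these streams are computed once and stay fixed across epochs, and the anchor samples must be kept so that the same differences can be formed for test utterances.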
4. Experiments and Results
In this section, we analyze the performance of the proposed two
instance-based methods on the speech material provided by the
Sincerity sub-challenge in ComParE 2016 [3].
4.1. Features and Datasets
Each audio file is described by 6,373 features that were extracted by the organizers with the OpenSMILE toolbox [29]. The training and test sets comprise 655 and 256 utterances recorded by 22 and 10 speakers, respectively. Since no separate development set was provided, prototyping tests are done with Leave-One-Speaker-Out cross-validation (LOSOCV) on the training set, where the predicted values are pooled from the disjoint outputs of 22 different models. In general, each learning process for each training/test combination in LOSOCV should be treated as an independent event, so that the cross-validation estimate of metrics such as accuracy and precision can be derived by averaging [30]. However, in the prototyping test, the final Spearman's ρ is computed from 22 batches of predicted values that are highly correlated in the ranking sense. For instance, the ranked values of batch a might be strongly affected by those of the other batches. Moreover, we found that although LOSOCV can help adjust hyper-parameters in neural networks, such as the learning rate η, the regularization parameter λ, the tradeoff coefficient α, and the structure of hidden layers, it offers little help in selecting suitable initial weights and biases or the optimal number of training epochs [31]. Therefore, we singled out 6 speakers from the training data as a development set that satisfies three conditions: 1) the gender ratio is the same as in the test set, 2) the ratio of its sample size to that of the new training set is nearly consistent with the ratio of the test set to the original training set, and 3) the experiment results on the development set are similar to the baseline results provided by the organizers [3].

Figure 3: Spearman's ρ on the training and development sets with respect to the training epoch in spR2NNs and acR2NNs.
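The LOSOCV protocol described in Section 4.1, in which predictions from the per-speaker models are pooled before a single Spearman's ρ is computed, can be sketched as follows. The `fit` and `predict` callables, the speaker labels, and the toy mean predictor are placeholders for illustration, not the models used in the paper.

```python
def loso_splits(speakers):
    """Yield (held_out_speaker, train_indices, test_indices) per speaker."""
    for held_out in sorted(set(speakers)):
        train = [i for i, s in enumerate(speakers) if s != held_out]
        test = [i for i, s in enumerate(speakers) if s == held_out]
        yield held_out, train, test


def pooled_predictions(X, y, speakers, fit, predict):
    """Train one model per held-out speaker and pool the disjoint outputs."""
    pooled = [None] * len(y)
    for _, train, test in loso_splits(speakers):
        model = fit([X[i] for i in train], [y[i] for i in train])
        outputs = predict(model, [X[i] for i in test])
        for i, p in zip(test, outputs):
            pooled[i] = p
    return pooled


# Toy stand-ins: the "model" is simply the mean training label.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)


def predict_mean(model, X_test):
    return [model] * len(X_test)


speakers = ["s1", "s1", "s2", "s3"]
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 2.0, 3.0, 5.0]
pooled = pooled_predictions(X, y, speakers, fit_mean, predict_mean)
print(pooled)
```

Spearman's ρ would then be computed once over `pooled` and `y`, which is why batches produced by different models influence one another's ranks.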
4.2. Training on spR2NNs and acR2NNs
For convenience, the learning rate η, the regularization parameter λ, the tolerance τ, and the maximum number of training epochs are empirically fixed to 0.1, 1.0, 10^-5, and 10^4, respectively. Initial weights are uniformly sampled by the Glorot initialization scheme in a form suited to the rectified linear unit (ReLU) activation function [32]. Model parameters are updated with the adaptive gradient algorithm AdaGrad, which scales the learning rate by dividing it by the square root of the accumulated squared gradients [33]. To avoid overfitting, we recorded the evaluation results in each training epoch, as shown in Figure 3, where the layer structure of spR2NNs is [0, 1]@128, meaning that there are no separate layers and one parameter-shared layer with 128 hidden nodes, and the layer structure of acR2NNs is [0, 1]@256 with only one anchor sample. We can see that 1) the optimization of J on the training set indirectly implies the optimization of Spearman's ρ, and 2) owing to the sampling process, the curve of spR2NNs fluctuates much more than that of acR2NNs, which might be why the training process of spR2NNs takes longer to converge.
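For readers unfamiliar with the two ingredients named in this section, the sketch below shows the standard Glorot uniform initialization and a single AdaGrad update. The exact ReLU-specific variant of the Glorot scheme used in the experiments is not specified in the text, so the plain form with limit sqrt(6 / (fan_in + fan_out)) is assumed here; the small constant `eps` is also an assumption added for numerical safety.

```python
import math
import random


def glorot_uniform(fan_in, fan_out, rng):
    """Sample a fan_out x fan_in weight matrix from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]


def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    """In-place AdaGrad update for a flat parameter list.

    accum holds the running sum of squared gradients per parameter.
    """
    for i, g in enumerate(grad):
        accum[i] += g * g
        w[i] -= lr * g / (math.sqrt(accum[i]) + eps)


# Toy usage: minimize (w - 3)^2 starting from w = 0.
w, accum = [0.0], [0.0]
for _ in range(5):
    grad = [2.0 * (w[0] - 3.0)]
    adagrad_step(w, grad, accum)
print(w)

W = glorot_uniform(fan_in=128, fan_out=2, rng=random.Random(0))
print(len(W), len(W[0]))
```

With the learning rate of 0.1 used in the paper, the first AdaGrad step always moves each parameter by roughly 0.1 regardless of the gradient scale, and later steps shrink as squared gradients accumulate.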
4.3. Results
We compare the results on the development set with various tradeoff coefficients in Figure 4. The layer structures of spR2NNs and acR2NNs are the same as those in Figure 3 except for the number of hidden nodes. We can see that 1) a lower α, which gives a higher weight to the pairwise ranking loss in (3), does not necessarily guarantee a higher Spearman's ρ, 2) acR2NNs outperform spR2NNs, and all the results are better than the baseline shown by the black dashed line, and 3) spR2NNs with 256 hidden nodes do not perform well, perhaps because the number of training epochs is inadequate for the more complex model to reach convergence.
Figure 5 shows the results on the development set with various settings of the layer structure and the number of anchor samples. Note that the numbers of hidden nodes and the tradeoff coefficients used in spR2NNs and acR2NNs are based on the best results in Figure 4. The reason why acR2NNs with larger numbers of anchor samples and more complex layer structures did not reveal their theoretical learning power might lie in the shortage of training samples.

Table 1 shows the final results on the test data provided by the organizers. The performance was obtained from a model trained on the original training set with the optimal parameters determined in the development phase. Note that the stopping criterion in Algorithm 1 (cf. steps 5-12) was adopted because no development data were available while training spR2NNs and acR2NNs. We can see that our proposed methods outperform the baseline by relative improvements of 6.8% and 27.1% on the test and development sets, respectively.

Figure 4: Spearman's ρ derived on the development set with respect to various tradeoff coefficients α and numbers of hidden nodes in spR2NNs and acR2NNs.

Figure 5: Spearman's ρ derived on the development set with respect to various settings of the layer structure and numbers of anchor samples in spR2NNs and acR2NNs.
Table 1: Spearman's ρ derived on the development and test sets with respect to the baseline and our proposed methods.

Method                      LOSOCV   Devel.   Test
Baseline (SVR, C = 10^-4)   .474     .602     .602
spR2NNs ([0, 1]@128)        .499     .706     .629
acR2NNs ([0, 1]@256)        .477     .710     .599
5. Conclusions
In this paper, we have proposed a framework based on simulta-
neous minimization of regression and ranking losses for the task
of automatic sincerity evaluation. The framework has been re-
alized by two kinds of neural network-based architectures. The
experiment results demonstrated the potential of the framework.
6. References
[1] J. Eriksson, “Self-expression, expressiveness, and sincerity,” Acta
Analytica, vol. 25, no. 1, pp. 71–79, 2010.
[2] E. S. Hinchman, “Assertion, sincerity, and knowledge,” Noûs,
vol. 47, no. 4, pp. 613–646, 2013.
[3] B. Schuller, S. Steidl, A. Batliner, J. Hirschberg, J. K. Burgoon,
A. Baird, A. Elkins, Y. Zhang, E. Coutinho, and K. Evanini,
“The INTERSPEECH 2016 computational paralinguistics chal-
lenge: deception, sincerity & native language,” in Proc. Inter-
speech Conf., 2016.
[4] H. S. Cheang and M. D. Pell, “Recognizing sarcasm without lan-
guage: A cross-linguistic study of English and Cantonese,” Prag-
matics & Cognition, vol. 19, no. 2, pp. 203–223, 2011.
[5] J. Tepperman, D. R. Traum, and S. Narayanan, “‘Yeah right’:
Sarcasm recognition for spoken dialogue systems,” in Proc.
Interspeech Conf., 2006.
[6] R. V. Hogg, J. W. McKean, and A. T. Craig, Introduction to Math-
ematical Statistics, 7th ed. Pearson College Division, 2013.
[7] N. Itoh, G. Kurata, R. Tachibana, and M. Nishimura, “A metric for
evaluating speech recognizer output based on human-perception
model,” in Proc. Interspeech Conf., 2015, pp. 1285–1288.
[8] J. Gibson, N. Malandrakis, F. Romero, D. C. Atkins, and S. S.
Narayanan, “Predicting therapist empathy in motivational in-
terviews using language features inspired by psycholinguistic
norms,” in Proc. Interspeech Conf., 2015, pp. 1947–1951.
[9] B. Schuller, S. Steidl, A. Batliner, S. Hantke, F. Hönig, J. R.
Orozco-Arroyave, E. Nöth, Y. Zhang, and F. Weninger, “The
INTERSPEECH 2015 computational paralinguistics challenge:
nativeness, Parkinson’s & eating condition,” in Proc. Interspeech
Conf., 2015.
[10] M. M. Mukaka, “A guide to appropriate use of correlation co-
efficient in medical research,” Malawi Medical Journal, vol. 24,
no. 3, pp. 69–71, 2012.
[11] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification,
2nd ed. Wiley-Interscience, 2000.
[12] B. Schölkopf and A. Smola, Learning with Kernels: Support
Vector Machines, Regularization, Optimization, and Beyond. MIT
Press, 2001.
[13] K. Murphy, Machine Learning: A Probabilistic Perspective. MIT
Press, 2012.
[14] Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li, “Learning to rank:
From pairwise approach to listwise approach,” in Proc. Int. Conf.
Mach. Learning (ICML), 2007, pp. 129–136.
[15] H. Valizadegan, R. Jin, R. Zhang, and J. Mao, “Learning to rank
by optimizing NDCG measure,” in Proc. Conf. Neural Inform.
Process. Syst. (NIPS), 2009, pp. 1883–1891.
[16] H. Pareek and P. Ravikumar, “A representation theory for ranking
functions,” in Proc. Conf. Neural Inform. Process. Syst. (NIPS),
2014, pp. 361–369.
[17] D. Sculley, “Combined regression and ranking,” in Proc. ACM Int.
Conf. Knowledge Discovery and Data Mining (SIGKDD), 2010,
pp. 979–988.
[18] P. L. Bartlett, “The sample complexity of pattern classification
with neural networks: The size of the weights is more important
than the size of the network,” IEEE Trans. Information Theory,
vol. 44, no. 2, pp. 525–536, 1998.
[19] Y. Hamamoto, S. Uchimura, T. Kanaoka, and S. Tomita, “Evalu-
ation of artificial neural network classifiers in small sample size
situations,” in Proc. Int. Joint Conf. Neural Networks (IJCNN),
1993, pp. 1731–1735.
[20] S. Ingrassia and I. Morlini, “Neural network modeling for small
datasets,” Technometrics, vol. 47, no. 3, pp. 297–311, 2005.
[21] T. Joachims, “Optimizing search engines using clickthrough
data,” in Proc. ACM Int. Conf. Knowledge Discovery and Data
Mining (SIGKDD), 2002, pp. 133–142.
[22] R. Collobert and J. Weston, “A unified architecture for natural lan-
guage processing: Deep neural networks with multitask learning,”
in Proc. Int. Conf. Mach. Learning (ICML), 2008, pp. 160–167.
[23] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng,
“Multimodal deep learning,” in Proc. Int. Conf. Mach. Learning
(ICML), 2011.
[24] D. Chen and B. Mak, “Multi-task learning of deep neural net-
works for low-resource speech recognition,” IEEE/ACM Trans.
Audio, Speech, Language Process., vol. 23, no. 7, pp. 1172–1183,
2015.
[25] X. Lu, F. Wu, X. Li, Y. Zhang, W. Lu, D. Wang, and Y. Zhuang,
“Learning multimodal neural network with ranking examples,” in
Proc. ACM Int. Conf. Multimedia (ACMMM), 2014, pp. 985–988.
[26] X. Shu, G.-J. Qi, J. Tang, and J. Wang, “Weakly-shared deep
transfer networks for heterogeneous-domain knowledge propaga-
tion,” in Proc. ACM Int. Conf. Multimedia (ACMMM), 2015, pp.
35–44.
[27] D. Tang, F. Wei, B. Qin, N. Yang, T. Liu, and M. Zhou, “Sentiment
embeddings with applications to sentiment analysis,” IEEE Trans.
Knowl. Data Eng., vol. 28, no. 2, pp. 496–509, 2015.
[28] R. Xia and Y. Liu, “A multi-task learning framework for emotion
recognition using 2D continuous space,” IEEE Trans. Affective
Comput., 2015.
[29] F. Eyben, M. Wöllmer, and B. W. Schuller, “OpenSMILE: the
Munich versatile and fast open-source audio feature extractor,” in
Proc. ACM Int. Conf. Multimedia (ACMMM), 2010, pp. 1459–1462.
[30] R. Kohavi, “A study of cross-validation and bootstrap for accu-
racy estimation and model selection,” in Proc. Int. Joint Conf.
Neural Networks (IJCNN), 1995, pp. 1137–1145.
[31] Y. Bengio, “Practical recommendations for gradient-based train-
ing of deep architectures,” in Neural Networks: Tricks of the
Trade, 2012, pp. 437–478.
[32] X. Glorot and Y. Bengio, “Understanding the difficulty of training
deep feedforward neural networks,” in Proc. Int. Conf. Artificial
Intelligence and Statistics (AISTATS), 2010, pp. 249–256.
[33] J. C. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient meth-
ods for online learning and stochastic optimization,” J. Mach.
Learning Research, vol. 12, pp. 2121–2159, 2011.