Cross-Task Inconsistency Based Active
Learning (CTIAL) for Emotion Recognition
Yifan Xu, Xue Jiang, and Dongrui Wu

Y. Xu, X. Jiang and D. Wu are with the Key Laboratory of the Ministry of Education for Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China. Email: yfxu@hust.edu.cn, xuejiang@hust.edu.cn, drwu09@gmail.com.
Abstract—Emotion recognition is a critical component of affective computing. Training accurate machine learning models for emotion
recognition typically requires a large amount of labeled data. Due to the subtleness and complexity of emotions, multiple evaluators are
usually needed for each affective sample to obtain its ground-truth label, which is expensive. To save the labeling cost, this paper
proposes an inconsistency-based active learning approach for cross-task transfer between emotion classification and estimation.
Affective norms are utilized as prior knowledge to connect the label spaces of categorical and dimensional emotions. Then, the
prediction inconsistency on the two tasks for the unlabeled samples is used to guide sample selection in active learning for the target
task. Experiments on within-corpus and cross-corpus transfers demonstrated that cross-task inconsistency could be a very valuable
metric in active learning. To our knowledge, this is the first work that utilizes prior knowledge on affective norms and data in a different
task to facilitate active learning for a new task, even when the two tasks are from different datasets.
Index Terms—Active learning, transfer learning, emotion classification, emotion estimation.
1 INTRODUCTION
Emotion recognition is an important part of affective comput-
ing, which focuses on identifying and understanding human
emotions from facial expressions [1], body gestures [2], speech
[3], physiological signals [4], etc. It has potential applications in
healthcare and human-machine interactions, e.g., emotion health
surveillance [5] and emotion-based music recommendation [6].
Emotions can be represented categorically (discretely) or di-
mensionally (continuously). A typical example of the former is
Ekman’s six basic emotions [7]. Typical dimensional emotion
representations include the pleasure-arousal-dominance model [8]
(pleasure is often replaced by its opposite, valence), and the
circumplex model [9] that only considers valence and arousal.
Both categorical emotion classification (CEC) and dimensional
emotion estimation (DEE) are considered in this paper.
Accurate emotion recognition usually requires a large amount
of labeled training data. However, labeling affective data is expen-
sive because emotion is subtle and has large individual differences.
Each affective sample needs to be evaluated by multiple annotators
to obtain its ‘ground-truth’ label. Active learning (AL) [10] and
transfer learning (TL) [11] are promising solutions to alleviate the
labeling effort.
AL selects the most useful samples to query for their labels,
improving the model performance for a limited annotation budget.
The key is to define an appropriate measure of sample useful-
ness. For CEC from facial images, Muhammad and Alhamid
[12] adopted entropy as the uncertainty measure and selected
samples with high entropy to annotate. Zhang et al. [13] used the
distance to decision boundary as the uncertainty measure in AL for
speech emotion classification. They selected samples with medium
certainty to avoid noisy ones and allocated different numbers of annotators for each sample adaptively. For AL in DEE, Wu et al.
[14] considered greedy sampling in both the feature space and the
label space. Their approach was later extended to multi-task DEE
[3]. Abdelwahab and Busso [15] also verified the effectiveness
of greedy sampling based AL approaches in valence and arousal
estimation using deep neural networks.
TL utilizes knowledge from relevant (source) tasks to facil-
itate the learning for a new (target) task [16], [17]. In task-
homogeneous TL, i.e., the source and target domains have the
same label space, the key is to reduce their distribution discrep-
ancies [18]. For example, Zhou and Chen [19] employed class-
wise adversarial domain adaptation to transfer from spontaneous
emotional speeches of children to acted corpus of adults. A more
challenging scenario is task-heterogeneous TL, where the source
and target domains have different label spaces. Pre-training on
large datasets of relevant (but may not be exactly the same)
tasks and then fine-tuning on task-specific data was used in
facial emotion recognition [20]–[22]. Zhao et al. [23] adopted
age and gender prediction as source tasks, and supplemented the
extracted features to target tasks of speech emotion classification
and estimation. Park et al. [24] trained models on text corpora with
categorical emotion labels to estimate their dimensional emotions.
They ordered the discrete emotions along valence, arousal, and
dominance dimensions according to domain knowledge [25] and
reduced the earth mover’s distance to optimize the model.
This paper proposes cross-task inconsistency based active
transfer learning (CTIAL) between CEC and DEE, which has not
been explored before. We aim to reduce the labeling efforts in
the task-heterogeneous TL scenario where a homogeneous source
dataset suitable for transfer is hard to obtain, but a heterogeneous
one is available. CTIAL enhances the efficiency of sample se-
lection in AL for the target task, by exploiting the source task
knowledge. Fig. 1 illustrates the difference between cross-task AL
and the traditional within-task AL.
To transfer knowledge between CEC and DEE tasks, we first
train models for CEC and DEE separately using the corresponding
labeled data and obtain their predictions on the unlabeled samples.
Fig. 1. Within-task AL for DEE, and CTIAL for cross-task transfer from CEC to DEE.
Affective norms, normative emotional ratings for words [26], are
utilized as domain knowledge to map the estimated categorical
emotion probabilities into the dimensional emotion space. A
cross-task inconsistency (CTI) is computed in this common label
space and used as an informativeness measure in AL. By further
integrating the CTI with other metrics, e.g., uncertainty in CEC
or diversity in DEE, we can identify the most useful unlabeled
samples to annotate for the target task and merge them into the
labeled dataset to update the model.
Our contributions are:
1) We propose CTI to measure the prediction inconsistency
between CEC and DEE tasks for cross-task AL.
2) We integrate CTI with other metrics in within-task AL to
further improve the sample query efficiency.
3) Within- and cross-corpus experiments demonstrated the
effectiveness of CTIAL in cross-task transfers.
To our knowledge, this is the first work that utilizes prior knowl-
edge on affective norms and data in a different task to facilitate
AL for a new task, even though the two tasks are from different
datasets.
The remainder of the paper is organized as follows: Section 2
introduces the proposed CTIAL approach. Section 3 describes
the datasets and the experimental setup. Sections 4 and 5 present
experiment results on cross-task transfers from DEE to CEC, and
from CEC to DEE, respectively. Section 6 draws conclusions.
2 METHODOLOGY
This section introduces the CTI measure and its application in
cross-task AL.
2.1 Problem Setting
Consider the transfer between CEC and DEE, i.e., one is the
source task, and the other is the target task. The source task has
a large amount of labeled samples, whereas the target task has
only a few. AL selects the most useful target samples from the
unlabeled data pool and queries for their labels.
Denote the datasets with categorical and dimensional emotion annotations as $D_{Cat} = \{(\mathbf{x}_i, \mathbf{y}^{Cat}_i)\}_{i=1}^{N_{Cat}}$ and $D_{Dim} = \{(\mathbf{x}_i, \mathbf{y}^{Dim}_i)\}_{i=1}^{N_{Dim}}$, respectively, where $\mathbf{x}_i$ in $D_{Cat}$ and $D_{Dim}$ are from the same feature space, $\mathbf{y}^{Cat}_i \in \mathbb{R}^{|E|}$ is the one-hot encoded label over the emotion category set $E$, and $\mathbf{y}^{Dim}_i \in \mathbb{R}^{|D|}$ is the dimensional emotion label over the dimension set $D$. An example of the emotion sets is $E$ = {angry, happy, sad, neutral} and $D$ = {valence, arousal, dominance}. The unlabeled data pool $P = \{\mathbf{x}_i\}_{i=1}^{N_P}$ has homogeneous features with $D_{Cat}$ and $D_{Dim}$. The target dataset consists of $P$ and a few labeled samples.
2.2 Expert Knowledge
Affective norms, i.e., dimensional emotion ratings for words [25],
[26], can be utilized as domain knowledge to establish the con-
nection between the label spaces of categorical and dimensional
emotions [24], [27]. This paper uses the NRC Valence-Arousal-
Dominance Lexicon [25], where 20,000 English words were
manually annotated with valence, arousal and dominance scores
via crowd-sourcing. Specifically, the annotators were presented
with a four-word tuple each time, and asked to select the word with
the highest and lowest valence/arousal/dominance, respectively.
The best-worst scaling technique was then used to aggregate the
annotations: the score for a word is the proportion of times it was
chosen as the highest valence/arousal/dominance minus that as the
lowest valence/arousal/dominance. The scores for each emotion
dimension were then linearly mapped into the interval [0, 1].
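As a toy illustration of best-worst scaling (with made-up annotation counts, not actual NRC data, and assuming the raw score in [-1, 1] is mapped linearly to [0, 1]):

```python
# Toy best-worst scaling sketch; the counts below are hypothetical.
def bws_score(n_highest: int, n_lowest: int, n_tuples: int) -> float:
    """Raw score in [-1, 1]: fraction of tuples in which the word was chosen
    as the highest on a dimension minus the fraction chosen as the lowest."""
    return (n_highest - n_lowest) / n_tuples

def to_unit_interval(score: float) -> float:
    """Assumed linear mapping from [-1, 1] to [0, 1]."""
    return (score + 1.0) / 2.0

# A word that appeared in 10 valence tuples, chosen as highest-valence
# 8 times and as lowest-valence once:
print(to_unit_interval(bws_score(8, 1, 10)))  # 0.85
```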
With the help of affective norms, we can exploit datasets of
different emotion representations for TL, by converting categorical
emotion labels into dimensional ones. The affective norms of
different languages demonstrate relatively high correlations [28],
suggesting the feasibility of cross-linguistic transfer. Table 1
presents the dimensional emotion scores of some emotion cate-
gories in the NRC Lexicon.
TABLE 1
Valence, Arousal and Dominance scores of eight typical emotions in the NRC Lexicon.

Emotion      Valence  Arousal  Dominance
Angry        0.122    0.830    0.604
Happy        1.000    0.735    0.772
Sad          0.225    0.333    0.149
Disgusted    0.051    0.773    0.274
Fearful      0.083    0.482    0.278
Surprised    0.784    0.855    0.539
Frustrated   0.080    0.651    0.255
Neutral      0.469    0.184    0.357
Fig. 2. Flowchart for computing the CTI.
2.3 Cross-Task Inconsistency (CTI)
Fig. 2 shows the flowchart for computing the CTI, an informative-
ness measure of the prediction inconsistency between two tasks.
First, we construct the emotion classification and estimation models, $f_{Cat}$ and $f_{Dim}$, using $D_{Cat}$ and $D_{Dim}$, respectively. For an unlabeled sample $\mathbf{x}_i \in P$, the predicted probabilities of the categorical emotions are:
$$\hat{\mathbf{y}}^{Cat}_i = f_{Cat}(\mathbf{x}_i), \quad (1)$$
where each element $\hat{y}^e_i \in \hat{\mathbf{y}}^{Cat}_i$ is the probability of Emotion $e \in E$.
The estimated dimensional emotion values are:
$$\hat{\mathbf{y}}^{Dim}_i = f_{Dim}(\mathbf{x}_i), \quad (2)$$
where $\hat{y}^{dim}_i \in \hat{\mathbf{y}}^{Dim}_i$ is the score of Dimension $dim \in D$.
Utilizing prior knowledge of affective norms, we map the predicted categorical emotion probabilities to dimensional emotion values:
$$\tilde{\mathbf{y}}^{Dim}_i = \sum_{e \in E} \hat{y}^e_i \cdot \mathrm{NRC}[e], \quad (3)$$
where $\mathrm{NRC}[e]$ denotes the scores of Emotion $e$ from the NRC Lexicon.
The CTI is then computed as:
$$I_i = \left\| \hat{\mathbf{y}}^{Dim}_i - \tilde{\mathbf{y}}^{Dim}_i \right\|_2. \quad (4)$$
A high CTI indicates that the two models have a high
disagreement, i.e., the corresponding unlabeled sample has high
uncertainty (informativeness). Those samples with high CTIs may
have inaccurate probability predictions on emotion categories
or imprecise estimations of emotion primitives. Labeling these
informative samples can expedite the model learning. Besides,
the affective norms only represent the most typical (or average)
emotion primitives of each category. In practice, a categorical
emotion’s dimensional primitives may have a large variance. Its
corresponding samples are also likely to have high CTIs, and
annotating them can increase the sample diversity.
Thus, the sample $\mathbf{x}_q$ selected for labeling for the target task is:
$$q = \arg\max_{\mathbf{x}_i \in P} I_i. \quad (5)$$
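As a minimal sketch (not the authors' released implementation), the snippet below computes the CTI of (4) and the query rule of (5) for an unlabeled pool, assuming a scikit-learn-style classifier with predict_proba, a multi-output regressor, and an NRC lookup table whose rows follow the classifier's class order (the four-class values here are taken from Table 1):

```python
import numpy as np

# NRC valence/arousal/dominance rows for (angry, happy, sad, neutral),
# assuming this order matches the class order of the classifier.
NRC = np.array([[0.122, 0.830, 0.604],
                [1.000, 0.735, 0.772],
                [0.225, 0.333, 0.149],
                [0.469, 0.184, 0.357]])

def cross_task_inconsistency(f_cat, f_dim, X_pool, nrc=NRC):
    """CTI of Eq. (4): L2 distance between the regressor's dimensional
    estimates and the NRC-mapped classifier probabilities."""
    p_cat = f_cat.predict_proba(X_pool)      # Eq. (1), shape (n, |E|)
    y_dim_hat = f_dim.predict(X_pool)        # Eq. (2), shape (n, |D|)
    y_dim_tilde = p_cat @ nrc                # Eq. (3), shape (n, |D|)
    return np.linalg.norm(y_dim_hat - y_dim_tilde, axis=1)

def query_by_cti(f_cat, f_dim, X_pool):
    """Eq. (5): index of the pool sample with the largest CTI."""
    return int(np.argmax(cross_task_inconsistency(f_cat, f_dim, X_pool)))
```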
CTI can be integrated with other useful within-task AL indi-
cators for better performance, as introduced next.
2.4 CTIAL for Cross-Task Transfer from DEE to CEC
Uncertainty is a frequently used informativeness metric for AL in
classification [10]. We consider two typical uncertainty measures
in CEC: information entropy and prediction confidence.
The information entropy, also called Shannon entropy [29], is:
$$H_i = -\sum_{e \in E} \hat{y}^e_i \cdot \log \hat{y}^e_i. \quad (6)$$
A large entropy indicates high uncertainty [30].
The prediction confidence directly reflects how certain the classifier is about its prediction:
$$\mathrm{Conf}_i = \max_{e \in E} \hat{y}^e_i. \quad (7)$$
A low confidence indicates high uncertainty [31].
Considering the uncertainty and CTI simultaneously, we select the sample $\mathbf{x}_q$ by:
$$q = \arg\max_{\mathbf{x}_i \in P} I_i \cdot H_i, \quad (8)$$
or by:
$$q = \arg\max_{\mathbf{x}_i \in P} \frac{I_i}{\mathrm{Conf}_i}, \quad (9)$$
and query for its class.
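A brief sketch of this query step (under the same assumptions as the CTI snippet above), combining the CTI with either the entropy of (6) via (8) or the confidence of (7) via (9):

```python
import numpy as np

def select_for_cec(p_cat, cti, measure="confidence", eps=1e-12):
    """One CTIAL query for the CEC target task.

    p_cat: (n, |E|) predicted class probabilities of the unlabeled pool.
    cti:   (n,) cross-task inconsistency values from Eq. (4)."""
    if measure == "entropy":
        entropy = -np.sum(p_cat * np.log(p_cat + eps), axis=1)  # Eq. (6)
        scores = cti * entropy                                   # Eq. (8)
    else:
        confidence = p_cat.max(axis=1)                           # Eq. (7)
        scores = cti / confidence                                # Eq. (9)
    return int(np.argmax(scores))  # index of the sample to query
```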
The pseudo-code of CTIAL for cross-task transfer from DEE
to CEC is given in Algorithm 1.
2.5 CTIAL for Cross-Task Transfer from CEC to DEE
Wu et al. [3], [14] employed greedy sampling in both feature
and label spaces for sample selection and verified the importance
of diversity in AL for regression. Their multi-task improved
greedy sampling (MTiGS) [3] computes the distance between an
unlabeled sample $\mathbf{x}_i$ and the labeled sample set $D_{Dim}$ as:
$$d_i = \min_{\mathbf{x}_j \in D_{Dim}} \|\mathbf{x}_i - \mathbf{x}_j\|_2 \cdot \prod_{dim \in D} |\hat{y}^{dim}_i - y^{dim}_j|. \quad (10)$$
To weight the feature and label spaces equally, we slightly modify MTiGS to:
$$d'_i = \min_{\mathbf{x}_j \in D_{Dim}} \|\mathbf{x}_i - \mathbf{x}_j\|_2 \cdot \|\hat{\mathbf{y}}^{Dim}_i - \mathbf{y}^{Dim}_j\|_2. \quad (11)$$
Considering CTI (informativeness) and MTiGS (diversity) together, we select the sample $\mathbf{x}_q$ by:
$$q = \arg\max_{\mathbf{x}_i \in P} I_i \cdot d'_i, \quad (12)$$
and query for its dimensional emotion values.
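A minimal sketch of this query step, assuming the pool features, the regressor's pool predictions, and the labeled target set are held in numpy arrays; it implements the modified distance (11) and the combined rule (12):

```python
import numpy as np

def mtigs_distance(X_pool, Y_pool_hat, X_lab, Y_lab):
    """Eq. (11): for each pool sample, the minimum over labeled samples of
    (feature distance) x (distance between predicted and true labels)."""
    dx = np.linalg.norm(X_pool[:, None, :] - X_lab[None, :, :], axis=2)
    dy = np.linalg.norm(Y_pool_hat[:, None, :] - Y_lab[None, :, :], axis=2)
    return (dx * dy).min(axis=1)

def select_for_dee(cti, d_prime):
    """Eq. (12): maximize CTI (informativeness) times diversity."""
    return int(np.argmax(cti * d_prime))

# Usage inside one AL iteration (arrays assumed to exist):
# d_prime = mtigs_distance(X_pool, f_dim.predict(X_pool), X_dim, Y_dim)
# q = select_for_dee(cross_task_inconsistency(f_cat, f_dim, X_pool), d_prime)
```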
The pseudo-code of CTIAL for DEE is shown in Algorithm 2.
Algorithm 1: CTIAL for cross-task transfer from DEE to CEC.
Input: Source dataset with dimensional emotion values $D_{Dim} = \{(\mathbf{x}_i, \mathbf{y}^{Dim}_i)\}_{i=1}^{N_{Dim}}$;
    Target dataset, including data with categorical emotion labels $D_{Cat} = \{(\mathbf{x}_i, \mathbf{y}^{Cat}_i)\}_{i=1}^{N_{Cat}}$ and unlabeled data pool $P = \{\mathbf{x}_i\}_{i=1}^{N_P}$;
    $K$, number of samples to be queried;
    Uncertainty measure, entropy or confidence.
Output: Emotion classification model $f_{Cat}$.
Train the model $f_{Cat}$ on $D_{Cat}$ and $f_{Dim}$ on $D_{Dim}$;
Estimate dimensional emotion values $\{\hat{\mathbf{y}}^{Dim}_i\}_{i=1}^{N_P}$ of $P$ using (2);
for $k = 1 : K$ do
    Estimate emotion category probabilities $\{\hat{\mathbf{y}}^{Cat}_i\}_{i=1}^{N_P}$ of $P$ using (1);
    Map $\{\hat{\mathbf{y}}^{Cat}_i\}_{i=1}^{N_P}$ into the dimensional emotion space and obtain $\{\tilde{\mathbf{y}}^{Dim}_i\}_{i=1}^{N_P}$ using (3);
    Calculate CTI $\{I_i\}_{i=1}^{N_P}$ using (4);
    if the uncertainty measure is entropy then
        Compute entropy $\{H_i\}_{i=1}^{N_P}$ using (6);
        Select sample $\mathbf{x}_q$ using (8);
    else  // the uncertainty measure is confidence
        Compute confidence $\{\mathrm{Conf}_i\}_{i=1}^{N_P}$ using (7);
        Select sample $\mathbf{x}_q$ using (9);
    end
    Query for the emotion category $\mathbf{y}^{Cat}_q$ of $\mathbf{x}_q$;
    $D_{Cat} \leftarrow D_{Cat} \cup \{(\mathbf{x}_q, \mathbf{y}^{Cat}_q)\}$;
    $P \leftarrow P \setminus \{\mathbf{x}_q\}$; $N_P \leftarrow N_P - 1$;
    Update $f_{Cat}$ on $D_{Cat}$;
end
2.6 Domain Adaptation in Cross-Corpus Transfer
CTIAL assumes the source model can make reliable predictions
for the target dataset. However, speech emotion recognition cor-
pora vary according to whether they are acted or spontaneous,
collected in lab or in the wild, and so on. These discrepancies may
cause a model trained on one corpus to perform poorly on another,
violating the underlying assumption in CTIAL.
To enable cross-corpus transfer, we adopt two classical domain
adaptation approaches, transfer component analysis (TCA) [32]
and balanced distribution adaptation (BDA) [33].
When the source task is CEC, BDA jointly aligns both the
marginal distributions and the class-conditional distributions with
a balance factor that adjusts the weights of the two corresponding
terms in the objective function. Specifically, the source model
assigns pseudo-labels for the unlabeled target dataset. The source
and target datasets’ average features and their corresponding class’
average features are aligned. The adapted features of the source
dataset are then used to update the source model. This process
iterates till convergence.
When the source task is DEE, TCA adapts the marginal
distributions of the source and target datasets by reducing the
distance between their average features. BDA is not used here,
since the class-conditional probabilities are difficult to compute in
regression problems.
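The following is a rough, linear-kernel sketch of TCA in the spirit of [32] (not the exact configuration or hyperparameters used in our experiments): it learns components that reduce the maximum mean discrepancy between the source and target feature distributions while preserving data variance.

```python
import numpy as np
from scipy.linalg import eigh

def tca_linear(Xs, Xt, n_components=30, mu=1.0):
    """Minimal linear-kernel TCA sketch."""
    X = np.vstack([Xs, Xt])
    ns, nt = len(Xs), len(Xt)
    n = ns + nt

    # MMD coefficient matrix: 1/ns for source rows, -1/nt for target rows.
    e = np.vstack([np.full((ns, 1), 1.0 / ns), np.full((nt, 1), -1.0 / nt)])
    L = e @ e.T

    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    K = X @ X.T                           # linear kernel

    # TCA solves min tr(W'KLKW) + mu tr(W'W) s.t. W'KHKW = I; the solution
    # corresponds to the smallest generalized eigenvalues below.
    A = K @ L @ K + mu * np.eye(n)
    B = K @ H @ K + 1e-6 * np.eye(n)      # small ridge for numerical stability
    _, vecs = eigh(A, B)                  # eigenvalues in ascending order
    W = vecs[:, :n_components]

    Z = K @ W                             # embedded source + target features
    return Z[:ns], Z[ns:]

# The adapted source features would then be used to retrain the source model
# before computing CTI-related predictions on the adapted target pool.
```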
Algorithm 2: CTIAL for cross-task transfer from CEC to DEE.
Input: Source dataset with categorical emotion labels $D_{Cat} = \{(\mathbf{x}_i, \mathbf{y}^{Cat}_i)\}_{i=1}^{N_{Cat}}$;
    Target dataset, including data with dimensional emotion values $D_{Dim} = \{(\mathbf{x}_i, \mathbf{y}^{Dim}_i)\}_{i=1}^{N_{Dim}}$ and unlabeled data pool $P = \{\mathbf{x}_i\}_{i=1}^{N_P}$;
    $K$, number of samples to be queried.
Output: Emotion estimation model $f_{Dim}$.
Train the model $f_{Cat}$ on $D_{Cat}$ and $f_{Dim}$ on $D_{Dim}$;
Estimate category probabilities $\{\hat{\mathbf{y}}^{Cat}_i\}_{i=1}^{N_P}$ of $P$ by (1);
Map $\{\hat{\mathbf{y}}^{Cat}_i\}_{i=1}^{N_P}$ into the dimensional emotion space and obtain $\{\tilde{\mathbf{y}}^{Dim}_i\}_{i=1}^{N_P}$ using (3);
for $k = 1 : K$ do
    Estimate dimensional emotion values $\{\hat{\mathbf{y}}^{Dim}_i\}_{i=1}^{N_P}$ of $P$ using (2);
    Calculate CTI $\{I_i\}_{i=1}^{N_P}$ using (4);
    Compute the distances $\{d'_i\}_{i=1}^{N_P}$ between samples in $P$ and $D_{Dim}$ using (11);
    Select sample $\mathbf{x}_q$ using (12);
    Query for the dimensional emotion values $\mathbf{y}^{Dim}_q$ of $\mathbf{x}_q$;
    $D_{Dim} \leftarrow D_{Dim} \cup \{(\mathbf{x}_q, \mathbf{y}^{Dim}_q)\}$;
    $P \leftarrow P \setminus \{\mathbf{x}_q\}$; $N_P \leftarrow N_P - 1$;
    Update $f_{Dim}$ on $D_{Dim}$;
end
After domain adaptation, the model trained on the source
dataset is applied to $P$ to obtain the predictions for the source
task. More details on the implementations will be introduced in
Section 3.1.
3 EXPERIMENTAL SETUP
This section describes the datasets and the experimental setup.
3.1 Datasets and Feature Extraction
Three public speech emotion datasets, IEMOCAP (Interactive
Emotional Dyadic Motion Capture Database) [34] of semi-
authentic emotions, MELD (Multimodal EmotionLines Dataset)
[35] of semi-authentic emotions, and VAM (Vera am Mittag; Vera
at Noon in English) [36] of authentic emotions, were used to verify
the proposed CTIAL. Specifically, we performed experiments of
within-corpus transfer on IEMOCAP, and cross-corpus transfer
from VAM and MELD to IEMOCAP.
IEMOCAP is a multi-modal dataset annotated with both cat-
egorical and dimensional emotion labels. Audio data from five
emotion categories (angry, happy, sad, frustrated and neutral, with 289, 947, 608, 971 and 1099 samples, respectively; the ‘happy’ class contains the data of ‘happy’ and ‘excited’ in the original annotation, as in multiple previous works) in spontaneous
sessions were used in the within-corpus experiments and cross-
corpus transfer from DEE on VAM. In cross-corpus transfer from
CEC on MELD, only four overlapping classes (angry, happy,
sad and neutral) were used. The valence, arousal and dominance
annotations are in [1, 5], so we linearly rescaled the scores in the
NRC Lexicon from [0, 1] to [1, 5].
MELD is a multi-modal dataset collected from the TV series
Friends. Each utterance was labeled by five annotators from seven
classes (anger, disgust, sadness, joy, surprise, fear, and neutral),
and majority vote was applied to generate the ground-truth labels.
We only used the four overlapping categories with IEMOCAP in
the training set (angry, happy, sad and neutral with 1109, 1743,
683 and 4709 samples, respectively).
VAM consists of 947 utterances collected from 47 guests
(11m/36f) in a German TV talk-show Vera am Mittag. Each
sentence was annotated by 6 or 17 evaluators for valence, arousal
and dominance values, and the weighted average was used to
obtain the ground-truth labels. We linearly rescaled their values
from [-1, 1] to [1, 5] to match the range of IEMOCAP.
The wav2vec 2.0 model [37], pre-trained on 960 hours of
unlabeled audio from the LibriSpeech dataset [38] and fine-tuned
for automatic speech recognition on the same audio with tran-
scripts, was used for feature extraction. For each audio segment,
we took the average output of 12 transformer encoder layers, and
averaged these features again along the time axis, obtaining a 768-
dimensional feature for each utterance.
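A sketch of this feature extraction with the Hugging Face transformers library, assuming the facebook/wav2vec2-base-960h checkpoint (a base model pre-trained and fine-tuned for ASR on the 960-hour LibriSpeech audio; the exact checkpoint is an assumption) and 16 kHz mono input:

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

CKPT = "facebook/wav2vec2-base-960h"  # assumed checkpoint
extractor = Wav2Vec2FeatureExtractor.from_pretrained(CKPT)
model = Wav2Vec2Model.from_pretrained(CKPT, output_hidden_states=True).eval()

def utterance_embedding(waveform, sr=16000):
    """768-d utterance feature: average the outputs of the 12 transformer
    encoder layers, then average over time frames."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states has 13 entries (feature projection + 12 encoder layers);
    # drop the first entry and stack the 12 encoder-layer outputs.
    layers = torch.stack(out.hidden_states[1:], dim=0)  # (12, 1, T, 768)
    return layers.mean(dim=(0, 2)).squeeze(0).numpy()   # (768,)
```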
In cross-task AL, we combined the source and target datasets
and used principal component analysis to reduce the feature di-
mensionality (maintaining 90% variance). In cross-corpus transfer,
an additional feature adaptation step was performed using TCA or
BDA to obtain more accurate source task predictions for samples
in P.
Before applying AL to P, we applied principal component
analysis to the original 768-dimensional features of the target
dataset in both cross-task AL and within-task AL baselines.
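A brief scikit-learn sketch of this dimensionality reduction, assuming the 768-d features are stacked as numpy arrays; passing a float to PCA keeps enough components to retain that fraction of variance:

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_features(X_a, X_b, variance=0.90):
    """Fit PCA on the concatenation of two feature sets (e.g., the source
    and target datasets) and keep 90% of the variance."""
    pca = PCA(n_components=variance)
    pca.fit(np.vstack([X_a, X_b]))
    return pca.transform(X_a), pca.transform(X_b)
```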
3.2 Experimental Setup
We used logistic regression (LR) and ridge regression (RR) as the
base model for CEC and DEE, respectively. The weight of the
regularization term in each model, i.e., $1/C$ in LR and $\alpha$ in RR, was chosen from {1, 5, 10, 50, 1e2, 5e2, 1e3, 5e3} by three-fold cross-validation on the corresponding training data.
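A sketch of the base models and the hyperparameter search with scikit-learn (solver settings and other defaults are assumptions, not specified in the paper); since C in logistic regression is the inverse regularization weight, the grid is inverted:

```python
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import GridSearchCV

REG_WEIGHTS = [1, 5, 10, 50, 1e2, 5e2, 1e3, 5e3]

def fit_cec_model(X, y):
    """Logistic regression for CEC; 1/C is the regularization weight."""
    grid = {"C": [1.0 / w for w in REG_WEIGHTS]}
    return GridSearchCV(LogisticRegression(max_iter=1000), grid, cv=3).fit(X, y)

def fit_dee_model(X, Y):
    """Ridge regression for DEE (one multi-output model for all dimensions)."""
    grid = {"alpha": REG_WEIGHTS}
    return GridSearchCV(Ridge(), grid, cv=3).fit(X, Y)
```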
In cross-corpus transfer experiments, feature dimensionality in
TCA and BDA was set to 30 and 40, respectively. BDA used 10
iterations to estimate the class labels of the target dataset, align
the marginal and conditional distributions, and update the classi-
fier. The balance factor was selected from {0.1, 0.2, ..., 0.9} to
minimize the sum of the maximum mean discrepancy metrics [39]
on all data and in each class.
3.3 Performance Evaluation
We aimed to classify all samples in the target dataset, some by
human annotators and the remaining by the trained classifier.
Therefore, we evaluated the classification performance on the entire target dataset, including the manually labeled samples and $P$. Specifically, we concatenated the ground-truth labels $\{\mathbf{y}^{Cat}_i\}_{i=1}^{N_{Cat}}$ of $D_{Cat}$ and the predictions $\{\hat{\mathbf{y}}^{Cat}_i\}_{i=1}^{N_P}$ of $P$ to compute the performance metrics.
In within-corpus transfer, either from DEE to CEC or from
CEC to DEE on IEMOCAP, each time we used two sessions as
the source dataset and the other three sessions as the target dataset,
resulting in 10 source-target dataset partitions. Experiments on
each dataset partition were repeated three times with different
initial labeled samples.
Cross-corpus experiments considered transfer from the source
DEE task on VAM to the target CEC task on IEMOCAP and
from the source CEC task on MELD to the target DEE task on
IEMOCAP. Experiments were repeated 10 times with different
initial labeled sample sets.
In each run of the experiment, the initial 20 labeled samples
were randomly selected. In the subsequent 200 iterations of sample
selection, one sample was chosen for annotation at a time.
Balanced classification accuracy (BCA), i.e., the average of the per-class accuracies, was used as the performance measure in CEC, since IEMOCAP has significant class imbalance. Root mean
squared error (RMSE) and correlation coefficient (CC) were used
as performance measures in DEE.
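The evaluation can be sketched as follows (a simplified illustration with scikit-learn and scipy, not the authors' evaluation script): manually labeled samples count as correctly 'predicted', and RMSE/CC are computed per dimension for DEE:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import balanced_accuracy_score, mean_squared_error

def evaluate_cec(y_labeled, y_pool_true, y_pool_pred):
    """BCA over the entire target dataset: the ground-truth labels of the
    queried samples are concatenated with the classifier's pool predictions."""
    y_true = np.concatenate([y_labeled, y_pool_true])
    y_pred = np.concatenate([y_labeled, y_pool_pred])
    return balanced_accuracy_score(y_true, y_pred)

def evaluate_dee(Y_true, Y_pred):
    """Per-dimension RMSE and Pearson correlation coefficient."""
    rmse = np.sqrt(mean_squared_error(Y_true, Y_pred, multioutput="raw_values"))
    cc = np.array([pearsonr(Y_true[:, d], Y_pred[:, d])[0]
                   for d in range(Y_true.shape[1])])
    return rmse, cc
```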
4 EXPERIMENTS ON CROSS-TASK TRANSFER
FROM DEE TO CEC
This section presents the results in cross-task transfer from DEE
to CEC. The DEE and CEC tasks could be from the same corpus,
or different ones.
4.1 Algorithms
We compared the performance of the following sample selection
strategies in cross-task transfer from DEE to CEC:
1) Random sampling (Rand), which randomly selects K
samples to annotate.
2) Entropy (Ent) [30], an uncertainty-based within-task AL
approach that selects samples with maximum entropy
computed by (6).
3) Least confidence (LC) [31], an uncertainty-based within-
task AL approach that selects samples with minimum
prediction confidence computed by (7).
4) Multi-task iGS on the source task (Source MTiGS),
which performs AL according to the informativeness
metric in the source DEE task. Specifically, we compute
the distance between samples $\mathbf{x}_i \in P$ and $\mathbf{x}_j \in D_{Cat}$ by (10) and select samples with the maximum distance. Without any dimensional emotion labels of $D_{Cat}$, we replace the true label $y^{dim}_j$ with $\hat{y}^{dim}_j$ estimated from the source model. This simple cross-task AL baseline directly
5) CTIAL, which selects samples with the maximum CTI
by (5).
6) Ent-CTIAL, which integrates entropy with CTI as in-
troduced in Algorithm 1.
7) LC-CTIAL, which integrates the prediction confidence
with CTI as introduced in Algorithm 1.
4.2 Effectiveness of CTIAL
Fig. 3 shows the average BCAs in within- and cross-corpus
cross-task transfers from DEE to CEC. To examine if the perfor-
mance improvements of the integration of the uncertainty-based
approaches and CTIAL were statistically significant, Wilcoxon
signed-rank tests with Holm’s p-value adjustment [40] were per-
formed on the results in each AL iteration. The test results are
shown in Fig. 4.
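A sketch of this test at a single AL iteration, assuming scipy and statsmodels and a one-sided alternative (the exact test configuration is an assumption): the proposed method's BCAs are paired with each baseline's over the repeated runs, and Holm's correction is applied across baselines:

```python
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

def compare_at_iteration(proposed, baselines, alpha=0.05):
    """Paired Wilcoxon signed-rank tests of the proposed approach against
    each baseline, with Holm's p-value adjustment across baselines.

    proposed:  list of scores over repeated runs.
    baselines: dict mapping baseline name -> list of scores (same runs)."""
    names = list(baselines)
    pvals = [wilcoxon(proposed, baselines[n], alternative="greater").pvalue
             for n in names]
    reject, p_adj, _, _ = multipletests(pvals, alpha=alpha, method="holm")
    return {n: (p, bool(r)) for n, p, r in zip(names, p_adj, reject)}
```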
Figs. 3 and 4 demonstrate that:
1) As the number of labeled samples increased, classifiers in
all approaches became more accurate, achieving higher
BCAs.
Fig. 3. Average BCAs of different sample selection approaches in cross-task transfer from DEE to CEC. (a) Within-corpus transfer from DEE on IEMOCAP to CEC on IEMOCAP; and, (b) cross-corpus transfer from DEE on VAM to CEC on IEMOCAP. K is the number of samples to be queried in addition to the initial labeled ones.

2) Between the two uncertainty-based AL approaches, LC outperformed Ent. However, both may not outperform Rand, suggesting that only considering the uncertainty may not be enough.
3) Source MTiGS that performed AL according to the
source DEE task achieved higher BCAs than Rand and
the two within-task AL approaches, because MTiGS
increased the feature diversity, which was also useful for
the target CEC task. Additionally, MTiGS increased the
diversity of the dimensional emotion labels, which in turn
increased the diversity of the categorical emotion labels.
4) CTIAL outperformed Rand, the two within-task AL
approaches (Ent and LC), and the cross-task AL baseline
Source MTiGS when K was small, demonstrating
that the proposed CTI measure properly exploited the
relationship between the CEC and DEE tasks and effec-
tively utilized knowledge from the source task. However,
in cross-corpus transfer, CTIAL and Rand had similar
performance when K was large. The reason may be that
domain shift limited the performance of the source DEE
models on the target dataset, which further resulted in in-
accurate CTI calculation and degraded AL performance.
5) Integrating CTI with uncertainty can further enhance the
classification accuracies by simultaneously considering
within- and cross-task informativeness: both LC-CTIAL
and Ent-CTIAL statistically significantly outperformed
the other baselines.
6) LC-CTIAL generally achieved the best performance
among all seven approaches.
Fig. 4. Statistical significance of the performance improvements of Ent-CTIAL and LC-CTIAL over the other approaches. (a) Within-corpus transfer from DEE on IEMOCAP to CEC on IEMOCAP; and, (b) cross-corpus transfer from DEE on VAM to CEC on IEMOCAP. K is the number of samples to be queried in addition to the initial labeled ones. The vertical axis denotes the approaches in comparison with Ent-CTIAL or LC-CTIAL in Wilcoxon signed-rank tests. The red and green markers were placed where the adjusted p-values were smaller than 0.05.
4.3 Effectiveness of TCA
CTIAL assumes that the model trained on the source dataset is
reliable for the target dataset. We performed within-corpus and
cross-corpus experiments to verify this assumption.
Fig. 5 shows the average RMSEs and CCs in valence, arousal
and dominance estimation on IEMOCAP in within- and cross-
corpus transfers. In within-corpus transfer, the average RMSE
and CC were 0.6667 and 0.5659, respectively. Although models
trained on VAM were inferior to those on IEMOCAP, TCA still
achieved lower RMSEs and higher CCs than direct transfer.
Fig. 5. Average RMSEs and CCs in valence, arousal and dominance estimation on IEMOCAP in within-corpus transfer (DEE on IEMOCAP to DEE on IEMOCAP; red line), direct cross-corpus transfer (DEE on VAM to DEE on IEMOCAP; blue line), and cross-corpus transfer using TCA (DEE on VAM to DEE on IEMOCAP; black curve). d is the feature dimensionality. The markers on the red and blue lines mean that the feature dimensionality after principal component analysis was 46 and 40, respectively.
Generally, the assumption that the source model is reliable
was satisfied in both within-corpus transfer and more challenging
cross-corpus transfer, with the help of TCA.
5 EXPERIMENTS ON CROSS-TASK TRANSFER
FROM CEC TO DEE
This section presents the experiment results in cross-task transfer
from CEC to DEE, where the two tasks may be from the same or
different datasets.
5.1 Algorithms
The following approaches were compared in transferring from
CEC to DEE:
1) Direct mapping by NRC Lexicon (NRC Mapping),
where the dimensional emotion estimates of all the sam-
ples are obtained by (3). This non-AL approach only
utilizes the information of the source CEC task and
domain knowledge.
2) Random sampling (Rand), which randomly selects K
samples to annotate.
3) Multi-task iGS (MTiGS) [3], a diversity-based within-
task AL approach that selects samples with the furthest
distance to labeled data computed by (10).
4) Least confidence on the source task (Source LC),
which selects samples with the minimum source model
prediction confidence defined by (7). This cross-task AL
baseline depends solely on the source CEC model.
5) Cross-task iGS (CTiGS), a variant of MTiGS that further
considers the information in the source CEC task. Specif-
ically, we first obtain the predicted emotion categories
for the target data using the source model. The distance
calculation is only conducted between unlabeled and la-
beled samples predicted with the same emotion category.
For an emotion category with only unlabeled samples, we
calculate the distance between them and all the labeled
samples using (10), as in MTiGS. The subsequent sample
selection process is the same as MTiGS.
6) CTIAL, which selects samples with the maximum CTI
by (5).
7) MTiGS-CTIAL, which integrates MTiGS and CTI as
introduced in Algorithm 2.
5.2 Effectiveness of CTIAL
Fig. 6 shows the average performance in DEE on IEMOCAP using
different sample selection approaches. Wilcoxon signed-rank tests
with Holm’s p-value adjustment [40] were again used to examine
if the performance improvements of MTiGS-CTIAL over other
approaches were statistically significant in each AL iteration. The
test results are shown in Fig. 7.
Figs. 6 and 7 show that:
1) The regression models in all approaches performed better
as the number of labeled samples increased, except the
non-AL baseline NRC Mapping.
2) NRC Mapping had poor performance, especially in
dominance estimation. Models trained on only a small
amount of labeled data outperformed NRC Mapping,
regardless of the sample selection approach. As stated
in Section 2.3, each emotion category was mapped to a
single tuple of dimensional emotion values according to
the affective norms. In practice, samples belonging to the
same emotion category may have diverse emotion prim-
itives. Direct mapping oversimplified the relationship
between categorical emotions and dimensional emotions.
3) MTiGS on average achieved lower RMSEs and higher
CCs than Rand. However, it performed similarly to
Rand for some emotion primitives.
4) Source LC performed much worse than Rand because
it only considered the classification information in the
source task but ignored the characteristics of the target
task. Samples with low classification confidence may not
be very useful to the target regression tasks.
5) CTiGS initially performed worse than MTiGS, but gradually outperformed it on some dimensions as the number of labeled samples increased. The reason is that CTiGS emphasizes the within-class sample diversity, whereas MTiGS focuses on the global sample diversity. The latter is more beneficial for the regression models to learn global patterns, which is important when labeled samples are rare. As K increases, enriching the within-class sample diversity helps the models learn local patterns.
6) In within-corpus transfer, CTIAL outperformed both
Rand and MTiGS initially, but gradually became inferior
to Rand. In cross-corpus transfer, CTIAL performed
better than Rand on arousal but worse on valence and
dominance. The possible reason is similar to that in
transfer from DEE to CEC: the source classification
models were not accurate enough, resulting in inaccurate
CTI calculation and unsatisfactory AL performance.
7) MTiGS-CTIAL achieved the best overall performance
in both within- and cross-corpus transfers, indicating that
considering both CTI and diversity in AL helped improve
the regression performance even when the CTI was not
very accurate.
5.3 Effectiveness of BDA
Fig. 8 shows the classification results in CEC on IEMOCAP using
different training data and projection dimensionality for BDA in
cross-corpus transfer. In within-corpus validation, we used two
sessions as the training data and the rest three sessions as the test
data. The average BCA in 10 training-test partitions for four-class
(E={angry, happy, sad, neutral}) emotion recognition reached
0.6477. In cross-corpus transfer, directly transferring the model
trained on MELD to IEMOCAP only achieved a BCA of 0.3904.
BDA boosted the BCA to 0.5378 when setting the projected
feature dimensionality to 45.
6 CONCLUSIONS
Human emotions can be described by both categorical and dimen-
sional representations. Usually, training accurate emotion classifi-
cation and estimation models requires large labeled datasets. How-
ever, manual labeling of affective samples is expensive, due to the
subtleness and complexity of emotions. This paper integrates AL
and TL to reduce the labeling effort. We proposed CTI to measure
the prediction inconsistency between the CEC and DEE tasks. To
further consider other useful AL metrics, CTI was integrated with
uncertainty in CEC and diversity in DEE for enhanced reliability.
Fig. 6. Average RMSEs and CCs of different sample selection approaches in cross-task transfer from CEC to DEE. (a) Within-corpus transfer from CEC on IEMOCAP to DEE on IEMOCAP; and, (b) cross-corpus transfer from CEC on MELD to DEE on IEMOCAP. K is the number of samples to be queried in addition to the initial labeled ones. Panels show RMSE and CC for Valence, Arousal, Dominance, and their average (Avg).

Experiments on three speech emotion datasets demonstrated the effectiveness of CTIAL in within- and cross-corpus transfers
between DEE and CEC tasks. To our knowledge, this is the first
work that utilizes prior knowledge on affective norms and data in a
different task to facilitate AL for a new task, regardless of whether
the two tasks are from the same dataset or not.
REFERENCES
[1] M. Pantic and L. Rothkrantz, “Automatic analysis of facial expressions:
The state of the art,” IEEE Trans. on Pattern Analysis and Machine
Intelligence, vol. 22, no. 12, pp. 1424–1445, 2000.
[2] R. E. Kaliouby and P. Robinson, “Real-time inference of complex mental
states from facial expressions and head gestures,” in Proc. Int’l Conf. on
Computer Vision and Pattern Recognition, Washington DC, June 2004,
p. 154.
[3] D. Wu and J. Huang, “Affect estimation in 3D space using multi-task
active learning for regression,” IEEE Trans. on Affective Computing,
vol. 13, no. 41, pp. 16–27, 2022.
[4] D. Wu, B.-L. Lu, B. Hu, and Z. Zeng, “Affective brain-computer
interfaces (aBCIs): A tutorial,” Proc. of the IEEE, 2023, in press.
[5] Y. Jiang, W. Li, M. S. Hossain, M. Chen, A. Alelaiwi, and M. Al-
Hammadi, “A snapshot research and implementation of multimodal
information fusion for data-driven emotion recognition,” Information
Fusion, vol. 53, pp. 209–221, 2020.
[6] D. Ayata, Y. Yaslan, and M. E. Kamasak, “Emotion based music recom-
mendation system using wearable physiological sensors,” IEEE Trans.
on Consumer Electronics, vol. 64, no. 2, pp. 196–203, 2018.
[7] P. Ekman, W. V. Friesen, M. O’sullivan, A. Chan, I. Diacoyanni-Tarlatzis,
K. Heider, R. Krause, W. A. LeCompte, T. Pitcairn, P. E. Ricci-Bitti
et al., “Universals and cultural differences in the judgments of facial
expressions of emotion.” Journal of Personality and Social Psychology,
vol. 53, no. 4, p. 712, 1987.
[8] A. Mehrabian, Basic dimensions for a general psychological theory:
Implications for personality, social, environmental, and developmental
studies. Cambridge, MA: Oelgeschlager, Gunn & Hain, 1980.
[9] J. A. Russell, “A circumplex model of affect.” Journal of Personality and
Social Psychology, vol. 39, no. 6, p. 1161, 1980.
[10] B. Settles, “Active learning literature survey,” University of Wisconsin–
Madison, Computer Sciences Technical Report 1648, 2009.
[11] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans. on
Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
[12] G. Muhammad and M. F. Alhamid, “User emotion recognition from a
larger pool of social network data using active learning,” Multimedia
Tools and Applications, vol. 76, no. 8, pp. 10881–10892, 2017.
[13] Y. Zhang, E. Coutinho, Z. Zhang, C. Quan, and B. Schuller, “Dynamic
active learning based on agreement and applied to emotion recognition in
spoken interactions,” in Proc. of the ACM on Int’l Conf. on Multimodal
Interaction, Seattle, Washington, WA, Nov. 2015, pp. 275–278.
[14] D. Wu, C.-T. Lin, and J. Huang, “Active learning for regression using
greedy sampling,” Information Sciences, vol. 474, pp. 90–105, 2019.
Fig. 7. Statistical significance of the performance improvements of MTiGS-CTIAL over the other approaches. (a) Within-corpus transfer from CEC on IEMOCAP to DEE on IEMOCAP; and, (b) cross-corpus transfer from CEC on MELD to DEE on IEMOCAP. K is the number of samples to be queried in addition to the initial labeled ones. The vertical axis denotes the approaches in comparison with MTiGS-CTIAL in Wilcoxon signed-rank tests. The red markers were placed where the adjusted p-values were smaller than 0.05.

Fig. 8. BCAs in CEC on IEMOCAP in within-corpus transfer (CEC on IEMOCAP to CEC on IEMOCAP; red line), direct cross-corpus transfer (CEC on MELD to CEC on IEMOCAP; blue line), and cross-corpus transfer using BDA (CEC on MELD to CEC on IEMOCAP; black curve). d denotes the feature dimensionality. The markers on the red and blue lines mean that the feature dimensionality after principal component analysis was 45 and 55, respectively.
[15] M. Abdelwahab and C. Busso, “Active learning for speech emotion
recognition using deep neural network,” in Proc. Int’l Conf. on Affective
Computing and Intelligent Interaction, Cambridge, UK, Sep. 2019, pp.
1–7.
[16] D. Wu, Y. Xu, and B.-L. Lu, “Transfer learning for EEG-based brain-
computer interfaces: A review of progress made since 2016,” IEEE Trans.
on Cognitive and Developmental Systems, vol. 14, no. 1, pp. 4–19, 2020.
[17] W. Zhang, L. Deng, L. Zhang, and D. Wu, “A survey on negative
transfer,” IEEE/CAA Journal of Automatica Sinica, vol. 10, no. 2, pp.
305–329, 2023.
[18] W. Li, W. Huan, B. Hou, Y. Tian, Z. Zhang, and A. Song, “Can emotion
be transferred?—A review on transfer learning for EEG-based emotion
recognition,” IEEE Trans. on Cognitive and Developmental Systems,
vol. 14, no. 3, pp. 833–846, 2021.
[19] H. Zhou and K. Chen, “Transferable positive/negative speech emotion
recognition via class-wise adversarial domain adaptation,” in Proc. IEEE
Int’l Conf. on Acoustics, Speech and Signal Processing, Brighton, UK,
May 2019, pp. 3732–3736.
[20] H. Kaya, F. Gürpınar, and A. A. Salah, “Video-based emotion recognition
in the wild using deep transfer learning and score fusion,” Image and
Vision Computing, vol. 65, pp. 66–75, 2017.
[21] T. Q. Ngo and S. Yoon, “Facial expression recognition on static images,”
in Proc. Future Data and Security Engineering, Nha Trang City, Viet-
nam, Nov. 2019, pp. 640–647.
[22] N. Sugianto and D. Tjondronegoro, “Cross-domain knowledge transfer
for incremental deep learning in facial expression recognition,” in Proc.
Int’l Conf. on Robot Intelligence Technology and Applications, Daejeon,
South Korea, Nov. 2019, pp. 205–209.
[23] H. Zhao, N. Ye, and R. Wang, “Speech emotion recognition based
on hierarchical attributes using feature nets,” Int’l Journal of Parallel,
Emergent and Distributed Systems, vol. 35, no. 3, pp. 354–364, 2020.
[24] S. Park, J. Kim, J. Jeon, H. Park, and A. Oh, “Toward dimensional emo-
tion detection from categorical emotion annotations,” arXiv:1911.02499,
2019.
[25] S. M. Mohammad, “Obtaining reliable human ratings of Valence,
Arousal, and Dominance for 20,000 English words,” in Proc. Annual
Conf. of the Association for Computational Linguistics, Melbourne,
Australia, Jul. 2018.
[26] M. M. Bradley and P. J. Lang, “Affective norms for English words
(ANEW): Instruction manual and affective ratings,” The Center for
Research in Psychophysiology, University of Florida, Tech. Rep., 1999.
[27] F. Zhou, S. Kong, C. C. Fowlkes, T. Chen, and B. Lei, “Fine-grained
facial expression analysis using dimensional emotion model,” Neurocom-
puting, vol. 392, pp. 38–49, 2020.
[28] A. B. Warriner, V. Kuperman, and M. Brysbaert, “Norms of valence,
arousal, and dominance for 13,915 English lemmas,” Behavior Research
Methods, vol. 45, pp. 1191–1207, 2013.
[29] C. E. Shannon, “A mathematical theory of communication,” The Bell
System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
[30] B. Settles and M. Craven, “An analysis of active learning strategies for
sequence labeling tasks,” in Proc. Conf. on Empirical Methods in Natural
Language Processing, Honolulu, HI, Oct. 2008, pp. 1070–1079.
[31] A. Culotta and A. McCallum, “Reducing labeling effort for structured
prediction tasks,” in Proc. AAAI Conf. on Artificial Intelligence, vol. 5,
Pittsburgh, PA, Jul. 2005, pp. 746–751.
[32] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via
transfer component analysis,” IEEE Trans. on Neural Networks, vol. 22,
no. 2, pp. 199–210, 2010.
[33] J. Wang, Y. Chen, S. Hao, W. Feng, and Z. Shen, “Balanced distribution
adaptation for transfer learning,” in Proc. IEEE Int’l Conf. on Data
Mining, New Orleans, LA, November 2017, pp. 1129–1134.
[34] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, “IEMOCAP: Interactive
emotional dyadic motion capture database,” Language Resources and
Evaluation, vol. 42, no. 4, pp. 335–359, 2008.
[35] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihal-
cea, “MELD: A multimodal multi-party dataset for emotion recognition
in conversations,” in Proc. 57th Annual Meeting of the Association for
Computational Linguistics, Florence, Italy, Jul. 2019, pp. 527–536.
[36] M. Grimm, K. Kroschel, and S. Narayanan, “The Vera am Mittag German
audio-visual emotional speech database,” in Proc. IEEE Int’l Conf. on
Multimedia and Expo, Hannover, Germany, Jun. 2008, pp. 865–868.
[37] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A
framework for self-supervised learning of speech representations,” in
Proc. Int’l Conf. on Neural Information Processing Systems, vol. 33,
Virtual Event, Dec. 2020, pp. 12449–12460.
[38] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An
ASR corpus based on public domain audio books,” in Proc. Int’l Conf.
on Acoustics, Speech and Signal Processing, South Brisbane, Australia,
Apr. 2015, pp. 5206–5210.
[39] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf,
and A. J. Smola, “Integrating structured biological data by kernel max-
imum mean discrepancy,” Bioinformatics, vol. 22, no. 14, pp. e49–e57,
2006.
[40] S. Holm, “A simple sequentially rejective multiple test procedure,”
Scandinavian Journal of Statistics, vol. 6, no. 2, pp. 65–70, 1979.