Cross-Task Inconsistency Based Active Learning (CTIAL) for Emotion Recognition

Yifan Xu, Xue Jiang, and Dongrui Wu

Y. Xu, X. Jiang and D. Wu are with the Key Laboratory of the Ministry of Education for Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China. Email: yfxu@hust.edu.cn, xuejiang@hust.edu.cn, drwu09@gmail.com.
Abstract—Emotion recognition is a critical component of affective computing. Training accurate machine learning models for emotion recognition typically requires a large amount of labeled data. Due to the subtleness and complexity of emotions, multiple evaluators are usually needed for each affective sample to obtain its ground-truth label, which is expensive. To save the labeling cost, this paper proposes an inconsistency-based active learning approach for cross-task transfer between emotion classification and estimation. Affective norms are utilized as prior knowledge to connect the label spaces of categorical and dimensional emotions. Then, the prediction inconsistency on the two tasks for the unlabeled samples is used to guide sample selection in active learning for the target task. Experiments on within-corpus and cross-corpus transfers demonstrated that cross-task inconsistency could be a very valuable metric in active learning. To our knowledge, this is the first work that utilizes prior knowledge on affective norms and data in a different task to facilitate active learning for a new task, even when the two tasks are from different datasets.

Index Terms—Active learning, transfer learning, emotion classification, emotion estimation.
1 INTRODUCTION
Emotion recognition is an important part of affective computing, which focuses on identifying and understanding human
emotions from facial expressions [1], body gestures [2], speech
[3], physiological signals [4], etc. It has potential applications in
healthcare and human-machine interactions, e.g., emotion health
surveillance [5] and emotion-based music recommendation [6].
Emotions can be represented categorically (discretely) or dimensionally (continuously). A typical example of the former is
Ekman’s six basic emotions [7]. Typical dimensional emotion
representations include the pleasure-arousal-dominance model [8]
(pleasure is often replaced by its opposite, valence), and the
circumplex model [9] that only considers valence and arousal.
Both categorical emotion classification (CEC) and dimensional
emotion estimation (DEE) are considered in this paper.
Accurate emotion recognition usually requires a large amount of labeled training data. However, labeling affective data is expensive because emotion is subtle and has large individual differences.
Each affective sample needs to be evaluated by multiple annotators
to obtain its ‘ground-truth’ label. Active learning (AL) [10] and
transfer learning (TL) [11] are promising solutions to alleviate the
labeling effort.
AL selects the most useful samples to query for their labels,
improving the model performance for a limited annotation budget.
The key is to define an appropriate measure of sample usefulness. For CEC from facial images, Muhammad and Alhamid
[12] adopted entropy as the uncertainty measure and selected
samples with high entropy to annotate. Zhang et al. [13] used the
distance to decision boundary as the uncertainty measure in AL for
speech emotion classification. They selected samples with medium
certainty to avoid noisy ones and allocated different numbers of annotators for each sample adaptively. For AL in DEE, Wu et al.
[14] considered greedy sampling in both the feature space and the
label space. Their approach was later extended to multi-task DEE
[3]. Abdelwahab and Busso [15] also verified the effectiveness
of greedy sampling based AL approaches in valence and arousal
estimation using deep neural networks.
TL utilizes knowledge from relevant (source) tasks to facilitate the learning for a new (target) task [16], [17]. In task-homogeneous TL, i.e., the source and target domains have the same label space, the key is to reduce their distribution discrepancies [18]. For example, Zhou and Chen [19] employed class-wise adversarial domain adaptation to transfer from spontaneous emotional speech of children to an acted corpus of adults. A more challenging scenario is task-heterogeneous TL, where the source and target domains have different label spaces. Pre-training on
large datasets of relevant (but may not be exactly the same)
tasks and then fine-tuning on task-specific data was used in
facial emotion recognition [20]–[22]. Zhao et al. [23] adopted
age and gender prediction as source tasks, and supplemented the
extracted features to target tasks of speech emotion classification
and estimation. Park et al. [24] trained models on text corpora with
categorical emotion labels to estimate their dimensional emotions.
They ordered the discrete emotions along valence, arousal, and
dominance dimensions according to domain knowledge [25] and
reduced the earth mover’s distance to optimize the model.
This paper proposes cross-task inconsistency based active transfer learning (CTIAL) between CEC and DEE, which has not been explored before. We aim to reduce the labeling effort in the task-heterogeneous TL scenario, where a homogeneous source dataset suitable for transfer is hard to obtain but a heterogeneous one is available. CTIAL enhances the efficiency of sample selection in AL for the target task by exploiting the source task knowledge. Fig. 1 illustrates the difference between cross-task AL and the traditional within-task AL.
To transfer knowledge between CEC and DEE tasks, we first
train models for CEC and DEE separately using the corresponding
labeled data and obtain their predictions on the unlabeled samples.
Fig. 1. Within-task AL for DEE, and CTIAL for cross-task transfer from CEC to DEE.
Affective norms, normative emotional ratings for words [26], are
utilized as domain knowledge to map the estimated categorical
emotion probabilities into the dimensional emotion space. A
cross-task inconsistency (CTI) is computed in this common label
space and used as an informativeness measure in AL. By further
integrating the CTI with other metrics, e.g., uncertainty in CEC
or diversity in DEE, we can identify the most useful unlabeled
samples to annotate for the target task and merge them into the
labeled dataset to update the model.
Our contributions are:
1) We propose CTI to measure the prediction inconsistency
between CEC and DEE tasks for cross-task AL.
2) We integrate CTI with other metrics in within-task AL to
further improve the sample query efficiency.
3) Within- and cross-corpus experiments demonstrated the
effectiveness of CTIAL in cross-task transfers.
To our knowledge, this is the first work that utilizes prior knowledge on affective norms and data in a different task to facilitate AL for a new task, even though the two tasks are from different datasets.
The remainder of the paper is organized as follows: Section 2
introduces the proposed CTIAL approach. Section 3 describes
the datasets and the experimental setup. Sections 4 and 5 present
experiment results on cross-task transfers from DEE to CEC, and
from CEC to DEE, respectively. Section 6 draws conclusions.
2 METHODOLOGY
This section introduces the CTI measure and its application in
cross-task AL.
2.1 Problem Setting
Consider the transfer between CEC and DEE, i.e., one is the source task, and the other is the target task. The source task has a large number of labeled samples, whereas the target task has only a few. AL selects the most useful target samples from the unlabeled data pool and queries for their labels.
Denote the datasets with categorical and dimensional emotion annotations as $\mathcal{D}^{Cat}=\{(x_i,y^{Cat}_i)\}_{i=1}^{N^{Cat}}$ and $\mathcal{D}^{Dim}=\{(x_i,y^{Dim}_i)\}_{i=1}^{N^{Dim}}$, respectively, where $x_i$ in $\mathcal{D}^{Cat}$ and $\mathcal{D}^{Dim}$ are from the same feature space, $y^{Cat}_i\in\mathbb{R}^{|E|}$ is the one-hot encoded label over the emotion category set $E$, and $y^{Dim}_i\in\mathbb{R}^{|D|}$ is the dimensional emotion label over the dimension set $D$. An example of the emotion sets is $E=\{$angry, happy, sad, neutral$\}$ and $D=\{$valence, arousal, dominance$\}$. The unlabeled data pool $\mathcal{P}=\{x_i\}_{i=1}^{N^P}$ has homogeneous features with $\mathcal{D}^{Cat}$ and $\mathcal{D}^{Dim}$. The target dataset consists of $\mathcal{P}$ and a few labeled samples.
2.2 Expert Knowledge
Affective norms, i.e., dimensional emotion ratings for words [25],
[26], can be utilized as domain knowledge to establish the connection between the label spaces of categorical and dimensional
emotions [24], [27]. This paper uses the NRC Valence-Arousal-
Dominance Lexicon [25], where 20,000 English words were
manually annotated with valence, arousal and dominance scores
via crowd-sourcing. Specifically, the annotators were presented
with a four-word tuple each time, and asked to select the word with
the highest and lowest valence/arousal/dominance, respectively.
The best-worst scaling technique was then used to aggregate the
annotations: the score for a word is the proportion of times it was
chosen as the highest valence/arousal/dominance minus that as the
lowest valence/arousal/dominance. The scores for each emotion
dimension were then linearly mapped into the interval [0, 1].
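For illustration, a minimal sketch of this best-worst scaling aggregation is given below, with hypothetical counts; the linear map from [-1, 1] to [0, 1] is our reading of the description above rather than the lexicon's documented procedure.

```python
def bws_score(n_best: int, n_worst: int, n_appearances: int) -> float:
    """Best-worst scaling: proportion of tuples in which a word was chosen
    as the highest minus the proportion in which it was chosen as the
    lowest, linearly mapped from [-1, 1] to [0, 1]."""
    raw = (n_best - n_worst) / n_appearances  # in [-1, 1]
    return (raw + 1.0) / 2.0                  # in [0, 1]

# Hypothetical word: appears in 8 tuples, chosen as highest valence 5 times
# and as lowest valence once.
print(bws_score(5, 1, 8))  # 0.75
```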
With the help of affective norms, we can exploit datasets of
different emotion representations for TL, by converting categorical
emotion labels into dimensional ones. The affective norms of
different languages demonstrate relatively high correlations [28], suggesting the feasibility of cross-linguistic transfer. Table 1 presents the dimensional emotion scores of some emotion categories in the NRC Lexicon.
TABLE 1
Valence, arousal and dominance scores of eight typical emotions in the NRC Lexicon.

Emotion      Valence  Arousal  Dominance
Angry        0.122    0.830    0.604
Happy        1.000    0.735    0.772
Sad          0.225    0.333    0.149
Disgusted    0.051    0.773    0.274
Fearful      0.083    0.482    0.278
Surprised    0.784    0.855    0.539
Frustrated   0.080    0.651    0.255
Neutral      0.469    0.184    0.357
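As an illustration, the class-level scores in Table 1 can be stored in a simple lookup table for converting categorical labels into dimensional ones. The sketch below (values copied from Table 1 for a subset of categories) is ours, not part of the NRC Lexicon distribution.

```python
import numpy as np

# Valence, arousal, dominance scores from Table 1 (NRC VAD Lexicon).
NRC = {
    "angry":      np.array([0.122, 0.830, 0.604]),
    "happy":      np.array([1.000, 0.735, 0.772]),
    "sad":        np.array([0.225, 0.333, 0.149]),
    "frustrated": np.array([0.080, 0.651, 0.255]),
    "neutral":    np.array([0.469, 0.184, 0.357]),
}

def label_to_vad(label: str) -> np.ndarray:
    """Map a categorical emotion label to the prototypical VAD scores of its class."""
    return NRC[label]

print(label_to_vad("sad"))  # [0.225 0.333 0.149]
```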
Fig. 2. Flowchart for computing the CTI.
2.3 Cross-Task Inconsistency (CTI)
Fig. 2 shows the flowchart for computing the CTI, an informativeness measure of the prediction inconsistency between two tasks.
First, we construct the emotion classification and estimation models, $f^{Cat}$ and $f^{Dim}$, using $\mathcal{D}^{Cat}$ and $\mathcal{D}^{Dim}$ respectively. For an unlabeled sample $x_i\in\mathcal{P}$, the prediction probabilities for the categorical emotions are:
$$\hat{y}^{Cat}_i=f^{Cat}(x_i), \qquad (1)$$
where each element $\hat{y}^{e}_i\in\hat{y}^{Cat}_i$ is the probability of Emotion $e\in E$.

The estimated dimensional emotion values are:
$$\hat{y}^{Dim}_i=f^{Dim}(x_i), \qquad (2)$$
where $\hat{y}^{dim}_i\in\hat{y}^{Dim}_i$ is the score of Dimension $dim\in D$.

Utilizing prior knowledge of affective norms, we map the predicted categorical emotion probabilities to dimensional emotion values:
$$\tilde{y}^{Dim}_i=\sum_{e\in E}\hat{y}^{e}_i\cdot NRC[e], \qquad (3)$$
where $NRC[e]$ denotes the scores of Emotion $e$ from the NRC Lexicon.

The CTI is then computed as:
$$I_i=\left\|\hat{y}^{Dim}_i-\tilde{y}^{Dim}_i\right\|_2. \qquad (4)$$
A high CTI indicates that the two models have a high
disagreement, i.e., the corresponding unlabeled sample has high
uncertainty (informativeness). Those samples with high CTIs may
have inaccurate probability predictions on emotion categories
or imprecise estimations of emotion primitives. Labeling these
informative samples can expedite the model learning. Besides,
the affective norms only represent the most typical (or average)
emotion primitives of each category. In practice, a categorical
emotion’s dimensional primitives may have a large variance. Its
corresponding samples are also likely to have high CTIs, and
annotating them can increase the sample diversity.
Thus, the sample $x_q$ selected for labeling for the target task is:
$$q=\arg\max_{x_i\in\mathcal{P}} I_i. \qquad (5)$$
CTI can be integrated with other useful within-task AL indicators for better performance, as introduced next.
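A minimal sketch of (1)-(5) is given below, assuming $f^{Cat}$ and $f^{Dim}$ are already fitted scikit-learn-style models (with predict_proba and multi-output predict, respectively) and nrc_matrix is a $|E|\times|D|$ array whose rows hold the lexicon scores of the categories in classifier-class order; the helper names are ours, not the authors' implementation.

```python
import numpy as np

def cross_task_inconsistency(f_cat, f_dim, X_pool, nrc_matrix):
    """CTI of each unlabeled sample: the L2 distance between the DEE
    prediction and the NRC-mapped CEC probabilities, as in (1)-(4)."""
    y_cat_hat = f_cat.predict_proba(X_pool)     # (N, |E|), Eq. (1)
    y_dim_hat = f_dim.predict(X_pool)           # (N, |D|), Eq. (2)
    y_dim_tilde = y_cat_hat @ nrc_matrix        # (N, |D|), Eq. (3)
    return np.linalg.norm(y_dim_hat - y_dim_tilde, axis=1)  # Eq. (4)

def select_by_cti(f_cat, f_dim, X_pool, nrc_matrix):
    """Index of the unlabeled sample with the largest CTI, as in (5)."""
    return int(np.argmax(cross_task_inconsistency(f_cat, f_dim, X_pool, nrc_matrix)))
```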
2.4 CTIAL for Cross-Task Transfer from DEE to CEC
Uncertainty is a frequently used informativeness metric for AL in
classification [10]. We consider two typical uncertainty measures
in CEC: information entropy and prediction confidence.
The information entropy, also called Shannon entropy [29], is:
$$H_i=-\sum_{e\in E}\hat{y}^{e}_i\cdot\log\hat{y}^{e}_i. \qquad (6)$$
A large entropy indicates high uncertainty [30].
The prediction confidence directly reflects how certain the classifier is about its prediction:
$$Conf_i=\max_{e\in E}\hat{y}^{e}_i. \qquad (7)$$
A low confidence indicates high uncertainty [31].
Considering the uncertainty and CTI simultaneously, we select the sample $x_q$ by:
$$q=\arg\max_{x_i\in\mathcal{P}} I_i\cdot H_i, \qquad (8)$$
or by:
$$q=\arg\max_{x_i\in\mathcal{P}} \frac{I_i}{Conf_i}, \qquad (9)$$
and query for its class.
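The two combined criteria in (8) and (9) can be sketched as follows, reusing the CTI values from the routine sketched above; this is an illustrative sketch rather than the authors' exact implementation.

```python
import numpy as np

def select_entropy_cti(f_cat, X_pool, cti):
    """Select the sample maximizing I_i * H_i, as in (6) and (8)."""
    p = np.clip(f_cat.predict_proba(X_pool), 1e-12, 1.0)
    entropy = -np.sum(p * np.log(p), axis=1)        # Eq. (6)
    return int(np.argmax(cti * entropy))            # Eq. (8)

def select_confidence_cti(f_cat, X_pool, cti):
    """Select the sample maximizing I_i / Conf_i, as in (7) and (9)."""
    conf = f_cat.predict_proba(X_pool).max(axis=1)  # Eq. (7)
    return int(np.argmax(cti / conf))               # Eq. (9)
```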
The pseudo-code of CTIAL for cross-task transfer from DEE
to CEC is given in Algorithm 1.
2.5 CTIAL for Cross-Task Transfer from CEC to DEE
Wu et al. [3], [14] employed greedy sampling in both feature and label spaces for sample selection and verified the importance of diversity in AL for regression. Their multi-task improved greedy sampling (MTiGS) [3] computes the distance between an unlabeled sample $x_i$ and the labeled sample set $\mathcal{D}^{Dim}$ as:
$$d_i=\min_{x_j\in\mathcal{D}^{Dim}}\|x_i-x_j\|_2\cdot\prod_{dim\in D}\left|\hat{y}^{dim}_i-y^{dim}_j\right|. \qquad (10)$$
To weight the feature and label spaces equally, we slightly modify MTiGS to:
$$d'_i=\min_{x_j\in\mathcal{D}^{Dim}}\|x_i-x_j\|_2\cdot\left\|\hat{y}^{Dim}_i-y^{Dim}_j\right\|_2. \qquad (11)$$
Considering CTI (informativeness) and MTiGS (diversity) together, we select the sample $x_q$ by:
$$q=\arg\max_{x_i\in\mathcal{P}} I_i\cdot d'_i, \qquad (12)$$
and query for its dimensional emotion values.
The pseudo-code of CTIAL for DEE is shown in Algorithm 2.
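A sketch of the modified diversity term (11) and the combined criterion (12) is given below; X_labeled and y_labeled (the currently labeled dimensional-emotion data) and y_pool_pred (the DEE predictions on the pool) are hypothetical variable names.

```python
import numpy as np
from scipy.spatial.distance import cdist

def mtigs_distance(X_pool, y_pool_pred, X_labeled, y_labeled):
    """Modified MTiGS distance d'_i in (11): for each unlabeled sample, the
    minimum over labeled samples of the product of the feature-space and
    label-space Euclidean distances."""
    d_x = cdist(X_pool, X_labeled)        # feature-space distances
    d_y = cdist(y_pool_pred, y_labeled)   # predicted vs. true label distances
    return (d_x * d_y).min(axis=1)

def select_mtigs_cti(cti, d_prime):
    """Select the sample maximizing I_i * d'_i, as in (12)."""
    return int(np.argmax(cti * d_prime))
```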
Algorithm 1: CTIAL for cross-task transfer from DEE to CEC.
Input: Source dataset with dimensional emotion values $\mathcal{D}^{Dim}=\{(x_i,y^{Dim}_i)\}_{i=1}^{N^{Dim}}$;
    Target dataset, including data with categorical emotion labels $\mathcal{D}^{Cat}=\{(x_i,y^{Cat}_i)\}_{i=1}^{N^{Cat}}$ and unlabeled data pool $\mathcal{P}=\{x_i\}_{i=1}^{N^P}$;
    $K$, the number of samples to be queried;
    Uncertainty measure, entropy or confidence.
Output: Emotion classification model $f^{Cat}$.

Train the model $f^{Cat}$ on $\mathcal{D}^{Cat}$ and $f^{Dim}$ on $\mathcal{D}^{Dim}$;
Estimate dimensional emotion values $\{\hat{y}^{Dim}_i\}_{i=1}^{N^P}$ of $\mathcal{P}$ using (2);
for $k=1:K$ do
    Estimate emotion category probabilities $\{\hat{y}^{Cat}_i\}_{i=1}^{N^P}$ of $\mathcal{P}$ using (1);
    Map $\{\hat{y}^{Cat}_i\}_{i=1}^{N^P}$ into the dimensional emotion space and obtain $\{\tilde{y}^{Dim}_i\}_{i=1}^{N^P}$ using (3);
    Calculate CTI $\{I_i\}_{i=1}^{N^P}$ using (4);
    if the uncertainty measure is entropy then
        Compute entropy $\{H_i\}_{i=1}^{N^P}$ using (6);
        Select sample $x_q$ using (8);
    else // the uncertainty measure is confidence
        Compute confidence $\{Conf_i\}_{i=1}^{N^P}$ using (7);
        Select sample $x_q$ using (9);
    end
    Query for the emotion category $y^{Cat}_q$ of $x_q$;
    $\mathcal{D}^{Cat}\leftarrow\mathcal{D}^{Cat}\cup(x_q,y^{Cat}_q)$;
    $\mathcal{P}\leftarrow\mathcal{P}\setminus x_q$; $N^P\leftarrow N^P-1$;
    Update $f^{Cat}$ on $\mathcal{D}^{Cat}$;
end
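A compact sketch of the Algorithm 1 loop with entropy as the uncertainty measure is given below, reusing the cross_task_inconsistency and select_entropy_cti routines sketched in Sections 2.3 and 2.4; query_label is a hypothetical oracle that returns the true category of a selected sample, and the sketch is illustrative rather than the authors' implementation.

```python
import numpy as np
from sklearn.base import clone

def ctial_dee_to_cec(f_cat, f_dim, X_cat, y_cat, X_pool, nrc_matrix,
                     query_label, K=200):
    """CTIAL for cross-task transfer from DEE to CEC (Algorithm 1)."""
    for _ in range(K):
        cti = cross_task_inconsistency(f_cat, f_dim, X_pool, nrc_matrix)
        q = select_entropy_cti(f_cat, X_pool, cti)
        # Query the oracle and move the selected sample to the labeled set.
        X_cat = np.vstack([X_cat, X_pool[q:q + 1]])
        y_cat = np.append(y_cat, query_label(X_pool[q]))
        X_pool = np.delete(X_pool, q, axis=0)
        # Retrain the target-task classifier on the enlarged labeled set.
        f_cat = clone(f_cat).fit(X_cat, y_cat)
    return f_cat
```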
2.6 Domain Adaptation in Cross-Corpus Transfer
CTIAL assumes the source model can make reliable predictions
for the target dataset. However, speech emotion recognition corpora vary according to whether they are acted or spontaneous,
collected in lab or in the wild, and so on. These discrepancies may
cause a model trained on one corpus to perform poorly on another,
violating the underlying assumption in CTIAL.
To enable cross-corpus transfer, we adopt two classical domain
adaptation approaches, transfer component analysis (TCA) [32]
and balanced distribution adaptation (BDA) [33].
When the source task is CEC, BDA jointly aligns both the
marginal distributions and the class-conditional distributions with
a balance factor that adjusts the weights of the two corresponding
terms in the objective function. Specifically, the source model
assigns pseudo-labels for the unlabeled target dataset. The source
and target datasets’ average features and their corresponding class’
average features are aligned. The adapted features of the source
dataset are then used to update the source model. This process
iterates till convergence.
When the source task is DEE, TCA adapts the marginal
distributions of the source and target datasets by reducing the
distance between their average features. BDA is not used here,
since the class-conditional probabilities are difficult to compute in
regression problems.
Algorithm 2: CTIAL for cross-task transfer from CEC to DEE.
Input: Source dataset with categorical emotion labels $\mathcal{D}^{Cat}=\{(x_i,y^{Cat}_i)\}_{i=1}^{N^{Cat}}$;
    Target dataset, including data with dimensional emotion values $\mathcal{D}^{Dim}=\{(x_i,y^{Dim}_i)\}_{i=1}^{N^{Dim}}$ and unlabeled data pool $\mathcal{P}=\{x_i\}_{i=1}^{N^P}$;
    $K$, the number of samples to be queried.
Output: Emotion estimation model $f^{Dim}$.

Train the model $f^{Cat}$ on $\mathcal{D}^{Cat}$ and $f^{Dim}$ on $\mathcal{D}^{Dim}$;
Estimate category probabilities $\{\hat{y}^{Cat}_i\}_{i=1}^{N^P}$ of $\mathcal{P}$ by (1);
Map $\{\hat{y}^{Cat}_i\}_{i=1}^{N^P}$ into the dimensional emotion space and obtain $\{\tilde{y}^{Dim}_i\}_{i=1}^{N^P}$ using (3);
for $k=1:K$ do
    Estimate dimensional emotion values $\{\hat{y}^{Dim}_i\}_{i=1}^{N^P}$ of $\mathcal{P}$ using (2);
    Calculate CTI $\{I_i\}_{i=1}^{N^P}$ using (4);
    Compute the distance $\{d'_i\}_{i=1}^{N^P}$ between samples in $\mathcal{P}$ and $\mathcal{D}^{Dim}$ using (11);
    Select sample $x_q$ using (12);
    Query for the dimensional emotion values $y^{Dim}_q$ of $x_q$;
    $\mathcal{D}^{Dim}\leftarrow\mathcal{D}^{Dim}\cup(x_q,y^{Dim}_q)$;
    $\mathcal{P}\leftarrow\mathcal{P}\setminus x_q$; $N^P\leftarrow N^P-1$;
    Update $f^{Dim}$ on $\mathcal{D}^{Dim}$;
end
After domain adaptation, the model trained on the source dataset is applied to $\mathcal{P}$ to obtain the predictions for the source task. More details on the implementations will be introduced in Section 3.1.
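For completeness, a simplified linear-kernel TCA sketch following the standard formulation of [32] is given below; it is not the authors' implementation, and the hyperparameters (dim, mu) are illustrative only.

```python
import numpy as np
import scipy.linalg

def linear_tca(Xs, Xt, dim=30, mu=1.0):
    """Simplified linear-kernel TCA: learn an embedding that reduces the
    maximum mean discrepancy between source and target marginals."""
    X = np.vstack([Xs, Xt])
    ns, nt = len(Xs), len(Xt)
    n = ns + nt
    K = X @ X.T                                  # linear kernel matrix
    e = np.vstack([np.full((ns, 1), 1.0 / ns),   # MMD coefficient vector
                   np.full((nt, 1), -1.0 / nt)])
    L = e @ e.T
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    # Leading eigenvectors of (K L K + mu I)^{-1} K H K.
    A = K @ L @ K + mu * np.eye(n)
    B = K @ H @ K
    w, V = scipy.linalg.eigh(B, A)               # generalized eigenproblem
    W = V[:, np.argsort(-w)[:dim]]               # top-dim components
    Z = K @ W                                    # adapted features
    return Z[:ns], Z[ns:]
```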
3 EXPERIMENTAL SETUP
This section describes the datasets and the experimental setup.
3.1 Datasets and Feature Extraction
Three public speech emotion datasets, IEMOCAP (Interactive
Emotional Dyadic Motion Capture Database) [34] of semi-
authentic emotions, MELD (Multimodal EmotionLines Dataset)
[35] of semi-authentic emotions, and VAM (Vera am Mittag; Vera
at Noon in English) [36] of authentic emotions, were used to verify
the proposed CTIAL. Specifically, we performed experiments of
within-corpus transfer on IEMOCAP, and cross-corpus transfer
from VAM and MELD to IEMOCAP.
IEMOCAP is a multi-modal dataset annotated with both categorical and dimensional emotion labels. Audio data from five emotion categories (angry, happy$^1$, sad, frustrated and neutral, with 289, 947, 608, 971 and 1099 samples, respectively) in spontaneous sessions were used in the within-corpus experiments and cross-corpus transfer from DEE on VAM. In cross-corpus transfer from CEC on MELD, only four overlapping classes (angry, happy, sad and neutral) were used. The valence, arousal and dominance annotations are in [1, 5], so we linearly rescaled the scores in the NRC Lexicon from [0, 1] to [1, 5].

$^1$The 'happy' class contains data of 'happy' and 'excited' in the original annotation, as in multiple previous works.
MELD is a multi-modal dataset collected from the TV series Friends. Each utterance was labeled by five annotators from seven classes (anger, disgust, sadness, joy, surprise, fear, and neutral), and majority vote was applied to generate the ground-truth labels. We only used the four categories overlapping with IEMOCAP in the training set (angry, happy, sad and neutral, with 1109, 1743, 683 and 4709 samples, respectively).
VAM consists of 947 utterances collected from 47 guests
(11m/36f) in a German TV talk-show Vera am Mittag. Each
sentence was annotated by 6 or 17 evaluators for valence, arousal
and dominance values, and the weighted average was used to
obtain the ground-truth labels. We linearly rescaled their values
from [-1, 1] to [1, 5] to match the range of IEMOCAP.
The wav2vec 2.0 model [37], pre-trained on 960 hours of unlabeled audio from the LibriSpeech dataset [38] and fine-tuned for automatic speech recognition on the same audio with transcripts, was used for feature extraction. For each audio segment, we took the average output of the 12 transformer encoder layers, and averaged these features again along the time axis, obtaining a 768-dimensional feature for each utterance.
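The utterance-level features can be approximated with the HuggingFace transformers implementation of wav2vec 2.0; the checkpoint name below and the exact layer-averaging details are our assumptions based on the description above, not necessarily the authors' setup.

```python
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2Model

# Assumed checkpoint: wav2vec 2.0 base, pre-trained on 960 h of LibriSpeech
# and fine-tuned for ASR on the same audio.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

def utterance_feature(waveform: np.ndarray, sr: int = 16000) -> np.ndarray:
    """768-d utterance feature: average over the 12 transformer layers, then over time."""
    inputs = processor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = model(inputs.input_values, output_hidden_states=True)
    # hidden_states: tuple of 13 tensors (feature projection + 12 layers).
    layers = torch.stack(out.hidden_states[1:], dim=0)  # (12, 1, T, 768)
    return layers.mean(dim=(0, 1, 2)).numpy()           # (768,)
```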
In cross-task AL, we combined the source and target datasets and used principal component analysis to reduce the feature dimensionality (maintaining 90% variance). In cross-corpus transfer, an additional feature adaptation step was performed using TCA or BDA to obtain more accurate source task predictions for samples in $\mathcal{P}$.
Before applying AL to $\mathcal{P}$, we applied principal component analysis to the original 768-dimensional features of the target dataset in both cross-task AL and within-task AL baselines.
3.2 Experimental Setup
We used logistic regression (LR) and ridge regression (RR) as the base models for CEC and DEE, respectively. The weight of the regularization term in each model, i.e., 1/C in LR and $\alpha$ in RR, was chosen from {1, 5, 10, 50, 1e2, 5e2, 1e3, 5e3} by three-fold cross-validation on the corresponding training data.
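A sketch of the base models and the regularization-weight search (the grid is the one stated above; the array names in the usage comment are hypothetical):

```python
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import GridSearchCV

reg_weights = [1, 5, 10, 50, 1e2, 5e2, 1e3, 5e3]

# CEC: logistic regression, regularization weight 1/C.
lr_search = GridSearchCV(LogisticRegression(max_iter=1000),
                         param_grid={"C": [1.0 / w for w in reg_weights]},
                         cv=3)

# DEE: ridge regression (supports multi-output VAD targets), weight alpha.
rr_search = GridSearchCV(Ridge(),
                         param_grid={"alpha": reg_weights},
                         cv=3)

# Usage on hypothetical arrays:
# lr_search.fit(X_cat, y_cat_labels); rr_search.fit(X_dim, y_dim_values)
```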
In cross-corpus transfer experiments, the feature dimensionality in TCA and BDA was set to 30 and 40, respectively. BDA used 10 iterations to estimate the class labels of the target dataset, align the marginal and conditional distributions, and update the classifier. The balance factor was selected from {0.1, 0.2, ..., 0.9} to minimize the sum of the maximum mean discrepancy metrics [39] on all data and in each class.
3.3 Performance Evaluation
We aimed to classify all samples in the target dataset, some by human annotators and the remaining by the trained classifier. Therefore, we evaluated the classification performance on the entire target dataset, including the manually labeled samples and $\mathcal{P}$. Specifically, we concatenated the ground-truth labels $\{y^{Cat}_i\}_{i=1}^{N^{Cat}}$ of $\mathcal{D}^{Cat}$ and the predictions $\{\hat{y}^{Cat}_i\}_{i=1}^{N^P}$ of $\mathcal{P}$ to compute the performance metrics.
In within-corpus transfer, either from DEE to CEC or from CEC to DEE on IEMOCAP, each time we used two sessions as the source dataset and the remaining three sessions as the target dataset, resulting in 10 source-target dataset partitions. Experiments on
each dataset partition were repeated three times with different
initial labeled samples.
Cross-corpus experiments considered transfer from the source
DEE task on VAM to the target CEC task on IEMOCAP and
from the source CEC task on MELD to the target DEE task on
IEMOCAP. Experiments were repeated 10 times with different
initial labeled sample sets.
In each run of the experiment, the initial 20 labeled samples
were randomly selected. In the subsequent 200 iterations of sample
selection, one sample was chosen for annotation at a time.
Balanced classification accuracy (BCA), i.e., the average of the per-class accuracies, was used as the performance measure in CEC, since IEMOCAP has significant class imbalance. Root mean squared error (RMSE) and correlation coefficient (CC) were used as the performance measures in DEE.
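These metrics can be computed with standard library routines; a minimal sketch, assuming the predictions and ground truth are one-dimensional numpy arrays per metric call:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, mean_squared_error

def bca(y_true, y_pred):
    """Balanced classification accuracy: average of the per-class accuracies."""
    return balanced_accuracy_score(y_true, y_pred)

def rmse_and_cc(y_true, y_pred):
    """RMSE and Pearson correlation coefficient for one emotion dimension."""
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    cc = np.corrcoef(y_true, y_pred)[0, 1]
    return rmse, cc
```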
4 EXPERIMENTS ON CROSS-TASK TRANSFER
FROM DEE TO CEC
This section presents the results in cross-task transfer from DEE
to CEC. The DEE and CEC tasks could be from the same corpus,
or different ones.
4.1 Algorithms
We compared the performance of the following sample selection
strategies in cross-task transfer from DEE to CEC:
1) Random sampling (Rand), which randomly selects K
samples to annotate.
2) Entropy (Ent) [30], an uncertainty-based within-task AL
approach that selects samples with maximum entropy
computed by (6).
3) Least confidence (LC) [31], an uncertainty-based within-
task AL approach that selects samples with minimum
prediction confidence computed by (7).
4) Multi-task iGS on the source task (Source MTiGS), which performs AL according to the informativeness metric in the source DEE task. Specifically, we compute the distance between samples $x_i\in\mathcal{P}$ and $x_j\in\mathcal{D}^{Cat}$ by (10) and select samples with the maximum distance. Since $\mathcal{D}^{Cat}$ has no dimensional emotion labels, we replace the true label $y^{dim}_j$ with $\hat{y}^{dim}_j$ estimated from the source model. This simple cross-task AL baseline directly
5) CTIAL, which selects samples with the maximum CTI
by (5).
6) Ent-CTIAL, which integrates entropy with CTI as in-
troduced in Algorithm 1.
7) LC-CTIAL, which integrates the prediction confidence
with CTI as introduced in Algorithm 1.
4.2 Effectiveness of CTIAL
Fig. 3 shows the average BCAs in within- and cross-corpus cross-task transfers from DEE to CEC. To examine whether the performance improvements from integrating the uncertainty-based approaches with CTIAL were statistically significant, Wilcoxon signed-rank tests with Holm's p-value adjustment [40] were performed on the results in each AL iteration. The test results are shown in Fig. 4.

Fig. 3. Average BCAs of different sample selection approaches in cross-task transfer from DEE to CEC. (a) Within-corpus transfer from DEE on IEMOCAP to CEC on IEMOCAP; and (b) cross-corpus transfer from DEE on VAM to CEC on IEMOCAP. $K$ is the number of samples to be queried in addition to the initial labeled ones.
Figs. 3 and 4 demonstrate that:
1) As the number of labeled samples increased, classifiers in
all approaches became more accurate, achieving higher
BCAs.
2) Between the two uncertainty-based AL approaches, LC outperformed Ent. However, both may not outperform Rand, suggesting that only considering the uncertainty may not be enough.
3) Source MTiGS that performed AL according to the
source DEE task achieved higher BCAs than Rand and
the two within-task AL approaches, because MTiGS
increased the feature diversity, which was also useful for
the target CEC task. Additionally, MTiGS increased the
diversity of the dimensional emotion labels, which in turn
increased the diversity of the categorical emotion labels.
4) CTIAL outperformed Rand, the two within-task AL approaches (Ent and LC), and the cross-task AL baseline Source MTiGS when $K$ was small, demonstrating that the proposed CTI measure properly exploited the relationship between the CEC and DEE tasks and effectively utilized knowledge from the source task. However, in cross-corpus transfer, CTIAL and Rand had similar performance when $K$ was large. The reason may be that domain shift limited the performance of the source DEE models on the target dataset, which further resulted in inaccurate CTI calculation and degraded AL performance.
5) Integrating CTI with uncertainty can further enhance the
classification accuracies by simultaneously considering
within- and cross-task informativeness: both LC-CTIAL
and Ent-CTIAL statistically significantly outperformed
the other baselines.
6) LC-CTIAL generally achieved the best performance
among all seven approaches.
Fig. 4. Statistical significance of the performance improvements of Ent-CTIAL and LC-CTIAL over the other approaches. (a) Within-corpus transfer from DEE on IEMOCAP to CEC on IEMOCAP; and (b) cross-corpus transfer from DEE on VAM to CEC on IEMOCAP. $K$ is the number of samples to be queried in addition to the initial labeled ones. The vertical axis denotes the approaches in comparison with Ent-CTIAL or LC-CTIAL in Wilcoxon signed-rank tests. The red and green markers were placed where the adjusted p-values were smaller than 0.05.
4.3 Effectiveness of TCA
CTIAL assumes that the model trained on the source dataset is
reliable for the target dataset. We performed within-corpus and
cross-corpus experiments to verify this assumption.
Fig. 5 shows the average RMSEs and CCs in valence, arousal
and dominance estimation on IEMOCAP in within- and cross-
corpus transfers. In within-corpus transfer, the average RMSE
and CC were 0.6667 and 0.5659, respectively. Although models
trained on VAM were inferior to those on IEMOCAP, TCA still
achieved lower RMSEs and higher CCs than direct transfer.
Fig. 5. Average RMSEs and CCs in valence, arousal and dominance estimation on IEMOCAP in within-corpus transfer (DEE on IEMOCAP to DEE on IEMOCAP; red line), direct cross-corpus transfer (DEE on VAM to DEE on IEMOCAP; blue line), and cross-corpus transfer using TCA (DEE on VAM to DEE on IEMOCAP; black curve). $d$ is the feature dimensionality. The markers on the red and blue lines mean that the feature dimensionality after principal component analysis was 46 and 40, respectively.
Generally, the assumption that the source model is reliable was satisfied in both within-corpus transfer and the more challenging cross-corpus transfer, with the help of TCA.
5 EXPERIMENTS ON CROSS-TASK TRANSFER
FROM CEC TO DEE
This section presents the experiment results in cross-task transfer
from CEC to DEE, where the two tasks may be from the same or
different datasets.
5.1 Algorithms
The following approaches were compared in transferring from
CEC to DEE:
1) Direct mapping by NRC Lexicon (NRC Mapping),
where the dimensional emotion estimates of all the samples are obtained by (3). This non-AL approach only
utilizes the information of the source CEC task and
domain knowledge.
2) Random sampling (Rand), which randomly selects K
samples to annotate.
3) Multi-task iGS (MTiGS) [3], a diversity-based within-
task AL approach that selects samples with the furthest
distance to labeled data computed by (10).
4) Least confidence on the source task (Source LC),
which selects samples with the minimum source model
prediction confidence defined by (7). This cross-task AL
baseline depends solely on the source CEC model.
5) Cross-task iGS (CTiGS), a variant of MTiGS that further
considers the information in the source CEC task. Specif-
ically, we first obtain the predicted emotion categories
for the target data using the source model. The distance
calculation is only conducted between unlabeled and la-
beled samples predicted with the same emotion category.
For an emotion category with only unlabeled samples, we
calculate the distance between them and all the labeled
samples using (10), as in MTiGS. The subsequent sample
selection process is the same as MTiGS.
6) CTIAL, which selects samples with the maximum CTI
by (5).
7) MTiGS-CTIAL, which integrates MTiGS and CTI as
introduced in Algorithm 2.
5.2 Effectiveness of CTIAL
Fig. 6 shows the average performance in DEE on IEMOCAP using different sample selection approaches. Wilcoxon signed-rank tests with Holm's p-value adjustment [40] were again used to examine whether the performance improvements of MTiGS-CTIAL over the other approaches were statistically significant in each AL iteration. The test results are shown in Fig. 7.

Fig. 6. Average RMSEs and CCs of different sample selection approaches in cross-task transfer from CEC to DEE. (a) Within-corpus transfer from CEC on IEMOCAP to DEE on IEMOCAP; and (b) cross-corpus transfer from CEC on MELD to DEE on IEMOCAP. $K$ is the number of samples to be queried in addition to the initial labeled ones.

Fig. 7. Statistical significance of the performance improvements of MTiGS-CTIAL over the other approaches. (a) Within-corpus transfer from CEC on IEMOCAP to DEE on IEMOCAP; and (b) cross-corpus transfer from CEC on MELD to DEE on IEMOCAP. $K$ is the number of samples to be queried in addition to the initial labeled ones. The vertical axis denotes the approaches in comparison with MTiGS-CTIAL in Wilcoxon signed-rank tests. The red markers were placed where the adjusted p-values were smaller than 0.05.
Figs. 6 and 7 show that:
1) The regression models in all approaches performed better
as the number of labeled samples increased, except the
non-AL baseline NRC Mapping.
2) NRC Mapping had poor performance, especially in
dominance estimation. Models trained on only a small
amount of labeled data outperformed NRC Mapping,
regardless of the sample selection approach. As stated
in Section 2.3, each emotion category was mapped to a
single tuple of dimensional emotion values according to
the affective norms. In practice, samples belonging to the
same emotion category may have diverse emotion prim-
itives. Direct mapping oversimplified the relationship
between categorical emotions and dimensional emotions.
3) MTiGS on average achieved lower RMSEs and higher
CCs than Rand. However, it performed similarly to
Rand for some emotion primitives.
4) Source LC performed much worse than Rand because
it only considered the classification information in the
source task but ignored the characteristics of the target
task. Samples with low classification confidence may not
be very useful to the target regression tasks.
5) CTiGS initially performed worse than MTiGS, but gradually outperformed it on some dimensions as the number of labeled samples increased. The reason is that CTiGS emphasizes the within-class sample diversity, whereas MTiGS focuses on the global sample diversity. The latter is more beneficial for the regression models to learn global patterns, which is important when labeled samples are rare. As $K$ increases, enriching the within-class sample diversity helps the regression models learn local patterns.
6) In within-corpus transfer, CTIAL outperformed both
Rand and MTiGS initially, but gradually became inferior
to Rand. In cross-corpus transfer, CTIAL performed
better than Rand on arousal but worse on valence and
dominance. The possible reason is similar to that in
transfer from DEE to CEC: the source classification
models were not accurate enough, resulting in inaccurate
CTI calculation and unsatisfactory AL performance.
7) MTiGS-CTIAL achieved the best overall performance
in both within- and cross-corpus transfers, indicating that
considering both CTI and diversity in AL helped improve
the regression performance even when the CTI was not
very accurate.
5.3 Effectiveness of BDA
Fig. 8 shows the classification results in CEC on IEMOCAP using different training data and projection dimensionalities for BDA in cross-corpus transfer. In within-corpus validation, we used two sessions as the training data and the remaining three sessions as the test data. The average BCA over 10 training-test partitions for four-class ($E$ = {angry, happy, sad, neutral}) emotion recognition reached 0.6477. In cross-corpus transfer, directly transferring the model trained on MELD to IEMOCAP only achieved a BCA of 0.3904. BDA boosted the BCA to 0.5378 when the projected feature dimensionality was set to 45.

Fig. 8. BCAs in CEC on IEMOCAP in within-corpus transfer (CEC on IEMOCAP to CEC on IEMOCAP; red line), direct cross-corpus transfer (CEC on MELD to CEC on IEMOCAP; blue line), and cross-corpus transfer using BDA (CEC on MELD to CEC on IEMOCAP; black curve). $d$ denotes the feature dimensionality. The markers on the red and blue lines mean that the feature dimensionality after principal component analysis was 45 and 55, respectively.
6 CONCLUSIONS
Human emotions can be described by both categorical and dimen-
sional representations. Usually, training accurate emotion classifi-
cation and estimation models requires large labeled datasets. How-
ever, manual labeling of affective samples is expensive, due to the
subtleness and complexity of emotions. This paper integrates AL
and TL to reduce the labeling effort. We proposed CTI to measure
the prediction inconsistency between the CEC and DEE tasks. To
further consider other useful AL metrics, CTI was integrated with
uncertainty in CEC and diversity in DEE for enhanced reliability.
Experiments on three speech emotion datasets demonstrated the
effectiveness of CTIAL in within- and cross-corpus transfers
between DEE and CEC tasks. To our knowledge, this is the first
work that utilizes prior knowledge on affective norms and data in a
different task to facilitate AL for a new task, regardless of whether
the two tasks are from the same dataset or not.
REFERENCES
[1] M. Pantic and L. Rothkrantz, “Automatic analysis of facial expressions:
The state of the art,” IEEE Trans. on Pattern Analysis and Machine
Intelligence, vol. 22, no. 12, pp. 1424–1445, 2000.
[2] R. E. Kaliouby and P. Robinson, “Real-time inference of complex mental
states from facial expressions and head gestures,” in Proc. Int’l Conf. on
Computer Vision and Pattern Recognition, Washington DC, June 2004,
p. 154.
[3] D. Wu and J. Huang, “Affect estimation in 3D space using multi-task
active learning for regression,” IEEE Trans. on Affective Computing,
vol. 13, no. 41, pp. 16–27, 2022.
[4] D. Wu, B.-L. Lu, B. Hu, and Z. Zeng, “Affective brain-computer
interfaces (aBCIs): A tutorial,” Proc. of the IEEE, 2023, in press.
[5] Y. Jiang, W. Li, M. S. Hossain, M. Chen, A. Alelaiwi, and M. Al-
Hammadi, “A snapshot research and implementation of multimodal
information fusion for data-driven emotion recognition,” Information
Fusion, vol. 53, pp. 209–221, 2020.
[6] D. Ayata, Y. Yaslan, and M. E. Kamasak, “Emotion based music recom-
mendation system using wearable physiological sensors,” IEEE Trans.
on Consumer Electronics, vol. 64, no. 2, pp. 196–203, 2018.
[7] P. Ekman, W. V. Friesen, M. O’sullivan, A. Chan, I. Diacoyanni-Tarlatzis,
K. Heider, R. Krause, W. A. LeCompte, T. Pitcairn, P. E. Ricci-Bitti
et al., “Universals and cultural differences in the judgments of facial
expressions of emotion.” Journal of Personality and Social Psychology,
vol. 53, no. 4, p. 712, 1987.
[8] A. Mehrabian, Basic dimensions for a general psychological theory:
Implications for personality, social, environmental, and developmental
studies. Cambridge, MA: Oelgeschlager, Gunn & Hain, 1980.
[9] J. A. Russell, “A circumplex model of affect.” Journal of Personality and
Social Psychology, vol. 39, no. 6, p. 1161, 1980.
[10] B. Settles, “Active learning literature survey,” University of Wisconsin–
Madison, Computer Sciences Technical Report 1648, 2009.
[11] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans. on
Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
[12] G. Muhammad and M. F. Alhamid, “User emotion recognition from a
larger pool of social network data using active learning,” Multimedia
Tools and Applications, vol. 76, no. 8, pp. 10 881–10 892, 2017.
[13] Y. Zhang, E. Coutinho, Z. Zhang, C. Quan, and B. Schuller, “Dynamic
active learning based on agreement and applied to emotion recognition in
spoken interactions,” in Proc. of the ACM on Int’l Conf. on Multimodal
Interaction, Seattle, Washington, WA, Nov. 2015, pp. 275–278.
[14] D. Wu, C.-T. Lin, and J. Huang, “Active learning for regression using
greedy sampling,” Information Sciences, vol. 474, pp. 90–105, 2019.
[15] M. Abdelwahab and C. Busso, “Active learning for speech emotion
recognition using deep neural network,” in Proc. Int’l Conf. on Affective
Computing and Intelligent Interaction, Cambridge, UK, Sep. 2019, pp.
1–7.
[16] D. Wu, Y. Xu, and B.-L. Lu, “Transfer learning for EEG-based brain-
computer interfaces: A review of progress made since 2016,” IEEE Trans.
on Cognitive and Developmental Systems, vol. 14, no. 1, pp. 4–19, 2020.
[17] W. Zhang, L. Deng, L. Zhang, and D. Wu, “A survey on negative
transfer,” IEEE/CAA Journal of Automatica Sinica, vol. 10, no. 2, pp.
305–329, 2023.
[18] W. Li, W. Huan, B. Hou, Y. Tian, Z. Zhang, and A. Song, “Can emotion
be transferred?—A review on transfer learning for EEG-based emotion
recognition,” IEEE Trans. on Cognitive and Developmental Systems,
vol. 14, no. 3, pp. 833–846, 2021.
[19] H. Zhou and K. Chen, “Transferable positive/negative speech emotion
recognition via class-wise adversarial domain adaptation,” in Proc. IEEE
Int’l Conf. on Acoustics, Speech and Signal Processing, Brighton, UK,
May 2019, pp. 3732–3736.
[20] H. Kaya, F. Gürpınar, and A. A. Salah, “Video-based emotion recognition
in the wild using deep transfer learning and score fusion,” Image and
Vision Computing, vol. 65, pp. 66–75, 2017.
[21] T. Q. Ngo and S. Yoon, “Facial expression recognition on static images,”
in Proc. Future Data and Security Engineering, Nha Trang City, Viet-
nam, Nov. 2019, pp. 640–647.
[22] N. Sugianto and D. Tjondronegoro, “Cross-domain knowledge transfer
for incremental deep learning in facial expression recognition,” in Proc.
Int’l Conf. on Robot Intelligence Technology and Applications, Daejeon,
South Korea, Nov. 2019, pp. 205–209.
[23] H. Zhao, N. Ye, and R. Wang, “Speech emotion recognition based
on hierarchical attributes using feature nets,” Int’l Journal of Parallel,
Emergent and Distributed Systems, vol. 35, no. 3, pp. 354–364, 2020.
[24] S. Park, J. Kim, J. Jeon, H. Park, and A. Oh, “Toward dimensional emo-
tion detection from categorical emotion annotations,” arXiv:1911.02499,
2019.
[25] S. M. Mohammad, “Obtaining reliable human ratings of Valence,
Arousal, and Dominance for 20,000 English words,” in Proc. Annual
Conf. of the Association for Computational Linguistics, Melbourne,
Australia, Jul. 2018.
[26] M. M. Bradley and P. J. Lang, “Affective norms for English words
(ANEW): Instruction manual and affective ratings,” The Center for
Research in Psychophysiology, University of Florida, Tech. Rep., 1999.
[27] F. Zhou, S. Kong, C. C. Fowlkes, T. Chen, and B. Lei, “Fine-grained
facial expression analysis using dimensional emotion model,” Neurocom-
puting, vol. 392, pp. 38–49, 2020.
[28] A. B. Warriner, V. Kuperman, and M. Brysbaert, “Norms of valence,
arousal, and dominance for 13,915 English lemmas,” Behavior Research
Methods, vol. 45, pp. 1191–1207, 2013.
[29] C. E. Shannon, “A mathematical theory of communication,” The Bell
System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
[30] B. Settles and M. Craven, “An analysis of active learning strategies for
sequence labeling tasks,” in Proc. Conf. on Empirical Methods in Natural
Language Processing, Honolulu, HI, Oct. 2008, pp. 1070–1079.
[31] A. Culotta and A. McCallum, “Reducing labeling effort for structured
prediction tasks,” in Proc. AAAI Conf. on Artificial Intelligence, vol. 5,
Pittsburgh, PA, Jul. 2005, pp. 746–751.
[32] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via
transfer component analysis,” IEEE Trans. on Neural Networks, vol. 22,
no. 2, pp. 199–210, 2010.
[33] J. Wang, Y. Chen, S. Hao, W. Feng, and Z. Shen, “Balanced distribution
adaptation for transfer learning,” in Proc. IEEE Int’l Conf. on Data
Mining, New Orleans, LA, November 2017, pp. 1129–1134.
[34] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, “IEMOCAP: Interactive
emotional dyadic motion capture database,” Language Resources and
Evaluation, vol. 42, no. 4, pp. 335–359, 2008.
[35] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihal-
cea, “MELD: A multimodal multi-party dataset for emotion recognition
in conversations,” in Proc. 57th Annual Meeting of the Association for
Computational Linguistics, Florence, Italy, Jul. 2019, pp. 527–536.
[36] M. Grimm, K. Kroschel, and S. Narayanan, “The Vera am Mittag German
audio-visual emotional speech database,” in Proc. IEEE Int’l Conf. on
Multimedia and Expo, Hannover, Germany, Jun. 2008, pp. 865–868.
[37] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A
framework for self-supervised learning of speech representations,” in
Proc. Int’l Conf. on Neural Information Processing Systems, vol. 33,
Virtual Event, Dec. 2020, pp. 12 449–12 460.
[38] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An
ASR corpus based on public domain audio books,” in Proc. Int’l Conf.
on Acoustics, Speech and Signal Processing, South Brisbane, Australia,
Apr. 2015, pp. 5206–5210.
[39] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola, “Integrating structured biological data by kernel maximum mean discrepancy,” Bioinformatics, vol. 22, no. 14, pp. 49–57, 2006.
[40] S. Holm, “A simple sequentially rejective multiple test procedure,”
Scandinavian Journal of Statistics, vol. 6, no. 2, pp. 65–70, 1979.