ScienceDirect
Available online at www.sciencedirect.com
Procedia Computer Science 231 (2024) 771–778
1877-0509 © 2024 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the Conference Program Chairs
10.1016/j.procs.2023.12.139
Soft Computing and Intelligent Systems: Theory and Applications
(SCISTA 2023)
November 7-9, 2023, Almaty, Kazakhstan
Fuzzy Approach for Audio-Video Emotion Recognition in
Computer Games for Children
Pavel Kozlov a, Alisher Akram a, Pakizar Shamoi a,∗
a School of Information Technology and Engineering, Kazakh-British Technical University, Almaty, Kazakhstan
∗ Corresponding author. Tel.: +7-701-349-0001. E-mail address: p.shamoi@kbtu.kz
Abstract
Computer games are widespread nowadays and enjoyed by people of all ages. But when it comes to kids, playing these games can be more than just fun: it is a way for them to develop important skills and build emotional intelligence. Facial expressions and sounds that kids produce during gameplay reflect their feelings, thoughts, and moods. In this paper, we propose a novel framework that integrates a fuzzy approach for the recognition of emotions through the analysis of audio and video data. Our focus lies within the specific context of computer games tailored for children, aiming to enhance their overall user experience. We use the FER dataset to detect facial emotions in video frames recorded from the screen during the game. For recognizing emotions in the sounds a kid produces during the game, we use the CREMA-D, TESS, RAVDESS, and SAVEE datasets. Next, a fuzzy inference system is used to fuse the results. In addition, our system can assess emotional stability and emotional diversity during gameplay, which, together with a report on the prevailing emotion, can serve as valuable information for parents worried about the effect of certain games on their kids. The proposed approach has shown promising results in preliminary experiments involving three different video games (fighting, racing, and logic), providing emotion-tracking results for kids in each game. Our study can contribute to the development of child-oriented games that are not only engaging but also account for children's cognitive and emotional states.
© 2024 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the Conference Program Chairs.
Keywords: fuzzy logic, video emotion recognition, audio emotion recognition, computer games, facial expression, user experience.
1. Introduction
Many parents are concerned about how much time their children spend playing computer games, and the effects of video games on mental health and cognitive abilities have become recurring topics in societal dialogue [12]. A majority of parents, 64%, held the belief that video games were
responsible for fostering addiction. Furthermore, more than one in five parents were worried about video games affecting their own child [5].
Fig. 1: Basic human emotions.
Fig. 2: Thayer's arousal-valence emotion plane [24], [27].
At the same time, most parents recognize the benefits of games and allow children to download applications [4]. It
has long been assumed that emotions have a strong influence on human behavior, actions, and mental abilities [14].
Many teachers are convinced that, handled correctly, games can also be useful for a child's development. The development of cognitive skills through games depends on the quality of the content offered by game developers, and the child's emotional state during play largely determines their interest in the game.
There are six classic human emotions: happiness, surprise, fear, disgust, anger, and sadness. However, according
to recent findings [17], basic emotion transmission is divided into four (rather than six) types. Specifically, in the
early phases, anger and disgust, as well as fear and surprise, are perceived identically. A wrinkled nose, for example,
expresses both anger and disgust, while lifted eyebrows communicate surprise and fear. Basic human emotions are
represented in Fig. 1. We use Thayer’s arousal-valence emotion plane [24] as our taxonomy and use seven emotions
(six basic and neutral), each belonging to one of the four quadrants of the emotion plane (see Fig. 2). In adults, the expression of emotions is less spontaneous and is shaped by upbringing and cultural background. The language of emotions is largely universal, but facial expressions and gestures still differ somewhat across people.
The field of emotion recognition (ER) has gained growing interest in recent times. The complexity arising from
factors like different poses, illumination conditions, motion blurring, and more makes the identification of emotions
from audio-video sources a challenging task [29]. Moreover, a limited number of works investigate kids’ ER. One of
the recent studies explored the use of ER to improve online school learning [18].
Most emotion recognition algorithms are still limited to a single modality at the moment. However, in everyday
life, humans frequently conceal their true feelings, which means that single-modal emotion recognition accuracy is relatively poor [26]. The majority of works in this area employ CNN models for ER tasks. Some studies use multiple deep models (CNN and RNN for images; SVM and LSTM for acoustic features) [8]. The authors of another study proposed a multi-modal residual perceptron network for multimodal ER [3]. An interesting approach has been proposed for ER based on generating a representative set of frames from videos using the eigenspace domain and principal component analysis [9]. Another paper introduces spatiotemporal attention-based multimodal deep neural networks for dimensional ER [11].
Parents, wanting to occupy their child with something useful on a smartphone or computer, are not always sure about the effects of such games. Current cognitive learning games very often overlook the relationship between cognition and emotions, characterized by the notion of a "hot executive function" [7]. Game developers tend to focus only on the end result, forgetting about the engaging gameplay and the emotions that children experience in the process. Children's interaction with educational games and applications should be easily accessible and engaging while encouraging them to achieve and complete tasks.
In this article, we introduce a framework that integrates a fuzzy logic approach to identify emotions by analyzing both audio and video data. Our primary emphasis is on computer games designed for children, with the goal
of improving their overall gaming experience. By automatically monitoring children's emotions as they navigate gameplay, developers can pinpoint pivotal moments within the gaming experience and make games that truly connect with kids. Our contributions include the proposed fuzzy fusion technique and the analysis of audio-video emotional stability and diversity in addition to the emotions themselves.
Fig. 3: Methodology representation.
Fig. 4: Some samples from FER with children's emotions.
Fig. 5: Frequency of emotions in the dataset.
2. Methodology
The proposed approach is shown in Fig. 3. Our framework has two main stages: (1) feature extraction and emotion detection, and (2) fuzzy fusion of emotions. From the incoming player video we extract audio and video frames, perform feature extraction, detect emotions, and pass the results to a fuzzy inference system that performs the fusion.
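The paper does not prescribe a particular extraction tool; as one possible implementation, the sketch below samples one frame per second with OpenCV and pulls the audio track with ffmpeg (file and directory names are placeholders).

```python
import os
import subprocess
import cv2

def extract_frames_and_audio(video_path, frames_dir="frames", wav_path="audio.wav"):
    # Sample one video frame per second for facial emotion recognition.
    os.makedirs(frames_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = int(round(cap.get(cv2.CAP_PROP_FPS))) or 30
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % fps == 0:
            cv2.imwrite(f"{frames_dir}/frame_{saved:04d}.png", frame)
            saved += 1
        idx += 1
    cap.release()
    # Extract a mono audio track for speech emotion recognition.
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", wav_path],
                   check=True)
    return saved, wav_path
```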
2.1. FER 2013
We used the Facial Expression Recognition 2013 (FER-2013) emotion dataset [19], introduced in a representation-learning challenge [6]. The database contains 35,887 grayscale images of people's faces with a resolution of 48x48 pixels. All images are divided into seven categories: 0 = Angry, 1 = Disgust, 2 = Fear, 3 = Happy, 4 = Sad, 5 = Surprise, 6 = Neutral. Examples of children's emotions from the database are shown in Fig. 4.
The dataset contains not only fully visible faces but also partially occluded ones (for example, by a hand), low-contrast images, images of people wearing glasses, etc. The dataset is divided into two parts, test and training. The test set is used to compare recognition accuracy across models, and the training set is used for training and optimizing them. The distribution of FER-2013 images across emotion categories is shown in Fig. 5. The FER data is normalized in such a way that it resembles a normal distribution: zero mean and unit variance. The output is registered in seven categories: "angry", "disgust", "fear", "happy", "sad", "surprise", and "neutral", with each emotion scored on a scale from 0 to 1.
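As a small illustration, a per-image standardization consistent with the description above (the exact normalization procedure is not specified in the paper, so this is only an assumption):

```python
import numpy as np

def standardize(face_img):
    # Scale a 48x48 grayscale face image to zero mean and unit variance,
    # matching the normalization described above.
    face_img = face_img.astype(np.float32)
    return (face_img - face_img.mean()) / (face_img.std() + 1e-7)
```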
Facial emotion detection is built on a convolutional neural network-based model from the FER library [23]; the model can also be retrained if needed when it is initialized. The accuracy of the CNN model on test data is 78%. A constructor parameter selects the face detection technique: when it is set to True, the Multi-task Cascaded Convolutional Network (MTCNN) is used to detect faces, and when it is set to False, the library falls back to the default OpenCV Haar cascade classifier.
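For reference, a minimal usage sketch of the FER library [23] described above; the frame file name is a placeholder.

```python
import cv2
from fer import FER

# mtcnn=True selects the MTCNN face detector; False falls back to the
# default OpenCV Haar cascade classifier.
detector = FER(mtcnn=True)

frame = cv2.imread("frame_0001.png")   # a captured gameplay/webcam frame
faces = detector.detect_emotions(frame)
# Each detected face comes with a bounding box and a dict of scores in [0, 1],
# e.g. {'angry': 0.01, 'disgust': 0.0, 'fear': 0.03, 'happy': 0.04,
#       'sad': 0.05, 'surprise': 0.0, 'neutral': 0.87}
if faces:
    print(faces[0]["emotions"])
print(detector.top_emotion(frame))     # (label, score) of the strongest emotion
```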
2.2. Audio Emotion Recognition
Audio, or speech emotion recognition (SER), involves identifying human emotions from audio signals. Voice patterns frequently convey underlying emotions via variations in tone and pitch. To detect emotions in the audio extracted from the player video, we used a speech emotion recognition model [1], which was trained on well-known datasets of audio clips annotated with emotions: the Crowd Sourced Emotional Multimodal Actors Dataset (CREMA-D, 7,442 audio clips) [2], the Toronto Emotional Speech Set (TESS, 2,800 audio files) [16], the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS, 1,440 audio files) [13], and the Surrey Audio-Visual Expressed Emotion database (SAVEE, 480 audio files) [10]. The accuracy of the model on test data is 60.74%.
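As an illustration of this step, a simplified per-second classification sketch; the feature set, label order, and model file are assumptions (the referenced notebook [1] combines several librosa features with a Keras CNN).

```python
import numpy as np
import librosa
from tensorflow.keras.models import load_model

# Label order is an assumption for illustration only.
EMOTIONS = ['angry', 'disgust', 'fearful', 'happy', 'neutral', 'sad', 'surprised']

def mfcc_features(wav_path, sr=22050, n_mfcc=40):
    # Load a short audio chunk and summarize it with time-averaged MFCCs
    # (a simplification of the feature set used in the referenced notebook).
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.mean(mfcc.T, axis=0)

model = load_model("ser_model.h5")      # hypothetical trained SER model
x = mfcc_features("second_03.wav")[np.newaxis, :, np.newaxis]
pred = model.predict(x)
print(EMOTIONS[int(np.argmax(pred))])
```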
2.3. PyAutoGUI
We use PyAutoGUI to automatically capture the screen, showing the game and the child's face, every second. PyAutoGUI is a cross-platform library for automating actions on a computer using Python scripts. With its help, the screen with the game played by the child, together with the child's face, is captured for analysis.
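A minimal sketch of this capture loop (directory name and session length are arbitrary):

```python
import os
import time
import pyautogui

# Capture the full screen, which shows both the game and the child's face,
# once per second for later emotion analysis.
os.makedirs("frames", exist_ok=True)
for i in range(300):                      # e.g. a five-minute session
    shot = pyautogui.screenshot()         # returns a PIL Image
    shot.save(f"frames/frame_{i:04d}.png")
    time.sleep(1)
```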
2.4. Fuzzy Sets and Logic
There are some difficulties in recognizing emotions in faces, such as emotion ambiguity and the small number of emotion classes compared to the richness of human emotions [25]. Fuzzy logic is a powerful tool for handling imprecision and has been used in several studies to express emotions and their intensity [20]. A fuzzy set is a class of objects that has a range of membership grades [28]. The main reason we used fuzzy sets and logic in our study is that they help us rate emotions in a human-consistent manner because they do not have sharply defined bounds. Despite being less precise, a linguistic value is closer to human cognitive processes than a number [22].
We partition the spectrum of possible emotion intensities into linguistic labels [21]. For each emotion we have two input variables, Audio and Video Emotion Intensity, and the output variable is the Overall Emotion Intensity in percentage points (see Fig. 6). As can be seen from the figure, we have 'Low', 'Medium', and 'High' fuzzy sets for both input variables and 'A little bit', 'Sometimes', 'High', 'Very High', and 'Extremely High' for the output variable.
To build fuzzy relationships between the input and output variables, we use fuzzy rules. Our fuzzy inference system has nine rules, shown in Table 1. A detailed example is provided in the Application and Results section.
Table 1: Fuzzy rules used in the fuzzy inference system.
| Rule | Audio Emotion Intensity | Video Emotion Intensity | Overall Emotion Intensity |
| 1 | Low | Low | Little Bit |
| 2 | Low | Medium | Sometimes |
| 3 | Low | High | High |
| 4 | Medium | Low | Sometimes |
| 5 | Medium | Medium | High |
| 6 | Medium | High | Very High |
| 7 | High | Low | Sometimes |
| 8 | High | Medium | Very High |
| 9 | High | High | Extremely High |
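For illustration, the rule base above could be implemented with the scikit-fuzzy control API as sketched below. The membership function shapes and ranges are assumptions standing in for the sets in Fig. 6, so the computed values will not exactly match the paper's.

```python
import numpy as np
import skfuzzy as fuzz
from skfuzzy import control as ctrl

# Input/output universes: emotion intensity in percent.
audio = ctrl.Antecedent(np.arange(0, 101, 1), 'audio_intensity')
video = ctrl.Antecedent(np.arange(0, 101, 1), 'video_intensity')
overall = ctrl.Consequent(np.arange(0, 101, 1), 'overall_intensity')

# Assumed triangular membership functions for the Low/Medium/High input sets.
for var in (audio, video):
    var['low'] = fuzz.trimf(var.universe, [0, 0, 50])
    var['medium'] = fuzz.trimf(var.universe, [25, 50, 75])
    var['high'] = fuzz.trimf(var.universe, [50, 100, 100])

# Assumed output sets for the five overall-intensity labels.
overall['little_bit'] = fuzz.trimf(overall.universe, [0, 0, 25])
overall['sometimes'] = fuzz.trimf(overall.universe, [0, 25, 50])
overall['high'] = fuzz.trimf(overall.universe, [25, 50, 75])
overall['very_high'] = fuzz.trimf(overall.universe, [50, 75, 100])
overall['extremely_high'] = fuzz.trimf(overall.universe, [75, 100, 100])

# The nine rules from Table 1.
rules = [
    ctrl.Rule(audio['low'] & video['low'], overall['little_bit']),
    ctrl.Rule(audio['low'] & video['medium'], overall['sometimes']),
    ctrl.Rule(audio['low'] & video['high'], overall['high']),
    ctrl.Rule(audio['medium'] & video['low'], overall['sometimes']),
    ctrl.Rule(audio['medium'] & video['medium'], overall['high']),
    ctrl.Rule(audio['medium'] & video['high'], overall['very_high']),
    ctrl.Rule(audio['high'] & video['low'], overall['sometimes']),
    ctrl.Rule(audio['high'] & video['medium'], overall['very_high']),
    ctrl.Rule(audio['high'] & video['high'], overall['extremely_high']),
]

sim = ctrl.ControlSystemSimulation(ctrl.ControlSystem(rules))
sim.input['audio_intensity'] = 12    # e.g. Happy intensity from audio, in %
sim.input['video_intensity'] = 85    # e.g. Happy intensity from video, in %
sim.compute()                        # max aggregation, centroid defuzzification
print(sim.output['overall_intensity'])
```

With these assumed membership functions, the 12%/85% example from Section 3 yields a mid-range overall intensity, qualitatively in line with the 47.55% reported there.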
Fig. 6: Input Fuzzy sets for Emotion Intensity (the same for audio and video) and Output Fuzzy Sets for Overall Emotion Intensity.
Fig. 7: Example of the application interface.
3. Application and Results
3.1. Prototype Application
Fig. 7 illustrates the prototype application mockup. As can be seen, it allows tracking of the average emotion, the prevailing emotions in audio and video, and emotional stability and diversity. As a result, parents are offered a report on the feelings their child had while playing different computer games, their emotional stability or instability, and the prevailing emotions associated with each game. Such a report would be helpful for parents: it would let them see the effect of different games and decide how much their child should play. Together with a psychologist, it can help to figure out the best way to manage gaming based on the child's emotions.
3.2. Experimental Results
In this section, we present our preliminary experimental results. We conducted an experiment with a 7-year-old child and looked at his emotions during three games: a fighting, a racing, and a logic game. The speech emotion analysis led to the following results, obtained from 10 seconds of audio extracted from the video and analyzed second by second:
Table 2: Emotion recognition results from video for the Fight game. Five illustrative frames (F1-F5) were chosen from 262 to show how emotions change; they correspond to the frames presented in Fig. 9.
| Emotion | F1 | ... | F2 | ... | F3 | ... | F4 | ... | F5 | Mean | Median | Variance | SD |
| Happy | 0.04 | ... | 0.01 | ... | 0.84 | ... | 0.54 | ... | 0.04 | 0.18 | 0.07 | 0.05 | 0.23 |
| Angry | 0.01 | ... | 0.02 | ... | 0.01 | ... | 0.01 | ... | 0.02 | 0.04 | 0.03 | 0.001 | 0.03 |
| Disgust | 0 | ... | 0 | ... | 0 | ... | 0 | ... | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Fear | 0.03 | ... | 0.06 | ... | 0.01 | ... | 0.02 | ... | 0.02 | 0.06 | 0.05 | 0.003 | 0.05 |
| Neutral | 0.87 | ... | 0.65 | ... | 0.14 | ... | 0.38 | ... | 0.84 | 0.59 | 0.63 | 0.04 | 0.2 |
| Sad | 0.05 | ... | 0.26 | ... | 0 | ... | 0.02 | ... | 0.08 | 0.11 | 0.07 | 0.01 | 0.11 |
| Surprise | 0 | ... | 0 | ... | 0 | ... | 0.02 | ... | 0 | 0.01 | 0.01 | 0.001 | 0.03 |
Fig. 8: Video emotion recognition results. (a) Fight game; (b) Racing game; (c) Logic game (tic-tac-toe).
• Fight game: ['disgust' 'sad' 'sad' 'sad' 'disgust' 'happy' 'fearful' 'sad' 'sad' 'disgust']
• Racing game: ['sad' 'sad' 'sad' 'fearful' 'neutral' 'neutral' 'fearful' 'happy' 'sad' 'disgust']
• Logic game: ['neutral' 'neutral' 'neutral' 'neutral' 'disgust' 'angry' 'neutral' 'neutral' 'sad' 'neutral']
Video emotion recognition results for the Fight game are shown in Fig. 9. Table 2 illustrates the emotion detection results for some video frames. To illustrate how emotions evolve, five exemplary frames (F1-F5) were selected from the set of 262 frames; these frames match those in Fig. 9. Fig. 8 shows emotion tracking results for each of the selected games. As we can see from Fig. 8, the Fight game data exhibits emotional instability and more emotional diversity than the other games, including happy, neutral, sad, and fear emotions. We can also see that for the Logic game there is one leading emotion, neutral. Emotional stability can be related to the standard deviation (Table 2), and emotional diversity to the number of emotions with High or Medium intensity.
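The paper does not give explicit formulas for these two quantities; one possible operationalization consistent with the description above is sketched below (the threshold is an assumption).

```python
import numpy as np

# scores: an (n_frames x 7) array of per-frame emotion intensities in [0, 1],
# with columns ordered as in Table 2.
EMOTIONS = ['happy', 'angry', 'disgust', 'fear', 'neutral', 'sad', 'surprise']

def stability_and_diversity(scores, medium_threshold=0.3):
    # Emotional stability: the lower the average per-emotion standard deviation
    # over the session, the more stable the player's emotional state.
    stability = 1.0 - scores.std(axis=0).mean()
    # Emotional diversity: how many emotions reach at least a medium intensity
    # at some point during the session.
    diversity = int(np.sum(scores.max(axis=0) >= medium_threshold))
    return stability, diversity
```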
We can now simulate our fuzzy system by specifying the inputs and applying the defuzzification method. For example, let us find the overall emotion intensity in the following scenario: the audio and video intensity values for the Happy emotion are 12% and 85%, respectively. The output membership functions are combined using the maximum operator (fuzzy aggregation). Next, to get a crisp answer, we perform defuzzification using the centroid method. As a result of the rule-based aggregation, we get 47.55% as the overall intensity for the Happy emotion. The visualized result is presented in Fig. 10.
4. Conclusion
Our study aimed to highlight the significance of audio-video ER in developing more engaging, safe, and effective
games for children. We proposed a fuzzy logic-based approach to aggregate the emotions detected from video frames
and sounds. For that, we used the FER emotion library as the basis.
Our work can contribute to the advancement of emotionally aware computer games. Using ER software, game developers can identify problems, address them, and enhance the user experience, aiming for
games that connect with players on a deeper level. Parents can control the influence of certain games on their kids and track the emotions associated with them.
Fig. 9: Video emotion detection results for the Fight game and Thayer's arousal-valence emotion planes for each of the five selected frames.
Fig. 10: Simulation results: (a) applying input 12% to the Audio Intensity fuzzy sets; (b) applying input 85% to the Video Intensity fuzzy sets; (c) aggregated membership and result, 47.55%.
The study has certain limitations. An adult's face differs from a child's, but the models were trained on faces of all kinds. Moreover, interpreting emotions solely through audio and video signals might not capture the complete emotional context of gameplay, because children of different ages might have varying emotional responses and cognitive abilities. Despite these limitations, preliminary results demonstrate that the proposed approach is a promising tool with the potential to make computer games more child-oriented based on emotional data.
The study has several open questions. In particular, we want to understand more about how emotions from sound
and video connect and contrast with each other. Several researchers have delved into similar areas [3], [8], [11], [26], [29].
As for future work, we plan to test the system in real settings to see how well it performs. Future experiments will involve more participants of different ages engaging with different game types. After testing, it will be possible to conduct interviews with the participants to ask clarifying questions. According to recent findings, the expression of fear and neutral emotions differs considerably between adults and children [15]. So, we plan to improve the ER framework by training the models on kids' faces and sounds only.
References
[1] Burnwal, S. Speech emotion recognition (Kaggle). http://www.kaggle.com/code/shivamburnwal/speech-emotion-recognition/notebook. Accessed: 2023-08-25.
[2] Cao, H., Cooper, D.G., Keutmann, M.K., Gur, R.C., Nenkova, A., Verma, R., 2014. Crema-d: Crowd-sourced emotional multimodal actors
dataset. IEEE Transactions on Affective Computing 5, 377–390. doi:10.1109/TAFFC.2014.2336244.
[3] Chang, X., Skarbek, W., 2021. Multi-modal residual perceptron network for audio–video emotion recognition. Sensors 21. URL: https://www.mdpi.com/1424-8220/21/16/5452, doi:10.3390/s21165452.
[4] Dore, R.A., Logan, J., Lin, T.J., Purtell, K.M., Justice, L.M., 2020. Associations between children’s media use and language and literacy skills.
Frontiers in Psychology 11. doi:10.3389/fpsyg.2020.01734.
[5] Frontier. How parents perceive their children's video game habits. URL: https://frontier.com/resources/e-is-for-everyone-video-game-study.
[6] Goodfellow, I.J., Erhan, D., Carrier, P.L., Courville, A., Mirza, M., Hamner, B., Cukierski, W., Tang, Y., Thaler, D., Lee, D.H., Zhou, Y.,
Ramaiah, C., Feng, F., Li, R., Wang, X., Athanasakis, D., Shawe-Taylor, J., Milakov, M., Park, J., Ionescu, R., Popescu, M., Grozea, C.,
Bergstra, J., Xie, J., Romaszko, L., Xu, B., Chuang, Z., Bengio, Y., 2013. Challenges in representation learning: A report on three machine
learning contests. arXiv:1307.0414.
[7] Gray, S.I., Robertson, J., Manches, A., Rajendran, G., 2019. Brainquest: The use of motivational design theories to create a cognitive training
game supporting hot executive function. International Journal of Human-Computer Studies 127, 124–149. doi:10.1016/j.ijhcs.2018.08.004.
[8] Guo, X., Polanía, L.F., Barner, K.E., 2020. Audio-video emotion recognition in the wild using deep hybrid networks. arXiv:2002.09023.
[9] Hajarolasvadi, N., Demirel, H., 2020. Deep facial emotion recognition in video using eigenframes. IET Image Processing. doi:10.1049/iet-ipr.2019.1566.
[10] Haq, S., Jackson, P., 2009. Speaker-dependent audio-visual emotion recognition, in: Proc. Int. Conf. on Auditory-Visual Speech Processing
(AVSP’08), Norwich, UK.
[11] Lee, J., Kim, S., Kim, S., Sohn, K., 2018. Audio-visual attention networks for emotion recognition, in: Proceedings of the 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia, Association for Computing Machinery, New York, NY, USA, pp. 27–32. doi:10.1145/3264869.3264873.
[12] Lieberoth, A., Fiskaali, A., 2021. Can worried parents predict effects of video games on their children? a case-control study of cognitive
abilities, addiction indicators and wellbeing. Frontiers in Psychology 11. doi:10.3389/fpsyg.2020.586699.
[13] Livingstone, S.R., Russo, F.A., 2019. RAVDESS emotional speech audio. URL: https://www.kaggle.com/dsv/256618, doi:10.34740/KAGGLE/DSV/256618.
[14] Metcalfe, J., Mischel, W., 1999. A hot/cool-system analysis of delay of gratification: Dynamics of willpower. Psychological Review 106,
3–19. doi:10.1037/0033-295X.106.1.3.
[15] Park, H., Shin, Y., Song, K., Yun, C., Jang, D., 2022. Facial emotion recognition analysis based on age-biased data. Applied Sciences 12.
URL: https://www.mdpi.com/2076-3417/12/16/7992, doi:10.3390/app12167992.
[16] Pichora-Fuller, M.K., Dupuis, K., 2020. Toronto emotional speech set (TESS). doi:10.5683/SP2/E8H2MF.
[17] Jack, R.E., Garrod, O.G.B., Schyns, P.G., 2014. Dynamic facial expressions of emotion transmit an evolving hierarchy of signals over time. Current Biology 24, 187–192.
[18] Rathod, M., Dalvi, C., Kaur, K., Patil, S., Gite, S., Kamat, P., Kotecha, K., Abraham, A., Gabralla, L., 2022. Kids’ emotion recognition using
various deep-learning models with explainable ai. Sensors 22, 8066. doi:10.3390/s22208066.
[19] Sambare, M. FER-2013. URL: https://www.kaggle.com/datasets/msambare/fer2013.
[20] Shamoi, E., Turdybay, A., Shamoi, P., Akhmetov, I., Jaxylykova, A., Pak, A., 2022. Sentiment analysis of vegan related tweets using mutual
information for feature selection. PeerJ Computer Science 8, e1149. doi:10.7717/peerj-cs.1149.
[21] Shamoi, P., Inoue, A., Kawanaka, H., 2016. FHSI: Toward more human-consistent color representation. Journal of Advanced Computational Intelligence and Intelligent Informatics 20. doi:10.20965/jaciii.2016.p0393.
[22] Shamoi, P., Inoue, A., 2012. Computing with words for direct marketing support system, in: Midwest Artificial Intelligence and Cognitive
Science Conference. URL: http://ceur-ws.org/Vol-841/submission_36.pdf.
[23] Shenk, J., CG, A., Arriaga, O., Owlwasrowk, 2021. justinshenk/fer: Zenodo. doi:10.5281/zenodo.5362356.
[24] Thayer, R.E., 2000. Mood regulation and general arousal systems. Psychological Inquiry 11, 202–204. URL: http://www.jstor.org/stable/1449805.
[25] Ualibekova, A., Shamoi, P., 2022. Music emotion recognition using k-nearest neighbors algorithm, in: 2022 International Conference on Smart
Information Systems and Technologies (SIST), pp. 1–6. doi:10.1109/SIST54437.2022.9945814.
[26] Wu, X., Tian, M., Zhai, L., 2022. Icanet: A method of short video emotion recognition driven by multimodal data. arXiv:2208.11346.
[27] Yang, Y.H., Su, Y.F., Lin, Y.C., Chen, H., 2007. Music emotion recognition: The role of individuality, in: Proceedings of the ACM International Multimedia Conference and Exhibition, pp. 13–22. doi:10.1145/1290128.1290132.
[28] Zadeh, L.A., 1965. Fuzzy sets. Information and Control 8, 338–353. doi:10.1016/S0019-9958(65)90241-X.
[29] Zhou, H., Meng, D., Zhang, Y., Peng, X., Du, J., Wang, K., Qiao, Y., 2019. Exploring emotion features and fusion strategies for audio-video
emotion recognition, in: 2019 International Conference on Multimodal Interaction, ACM. doi:10.1145/3340555.3355713.