Fuzzy Approach for Audio-Video Emotion Recognition in Computer Games for Children



Procedia Computer Science 231 (2024) 771–778
Soft Computing and Intelligent Systems: Theory and Applications
(SCISTA 2023)
November 7-9, 2023, Almaty, Kazakhstan
Pavel Kozlov (a), Alisher Akram (a), Pakizar Shamoi (a)
(a) School of Information Technology and Engineering, Kazakh-British Technical University, Almaty, Kazakhstan
Computer games are widespread nowadays and enjoyed by people of all ages. But when it comes to kids, playing these games can
be more than just fun—it’s a way for them to develop important skills and build emotional intelligence. Facial expressions and
sounds that kids produce during gameplay reflect their feelings, thoughts, and moods. In this paper, we propose a novel framework
that integrates a fuzzy approach for the recognition of emotions through the analysis of audio and video data. Our focus lies within
the specific context of computer games tailored for children, aiming to enhance their overall user experience. We use the FER
dataset to detect facial emotions in video frames recorded from the screen during the game. For the audio emotion recognition of
sounds a kid produces during the game, we use the CREMA-D, TESS, RAVDESS, and SAVEE datasets. Next, a fuzzy inference system
is used for the fusion of results. Besides this, our system can detect emotion stability and emotion diversity during gameplay, which,
together with the prevailing emotion report, can serve as valuable information for parents worrying about the effect of certain games on
their kids. The proposed approach has shown promising results in the preliminary experiments we conducted, involving 3 different
video games, namely fighting, racing, and logic games, and providing emotion-tracking results for kids in each game. Our study
can contribute to the advancement of child-oriented game development, which is not only engaging but also accounts for children’s
cognitive and emotional states.
Keywords: fuzzy logic, video emotion recognition, audio emotion recognition, computer games, facial expression, user experience.
1. Introduction
Many parents are concerned about how much time their children spend playing computer games, and concerns
regarding the effects of video games on aspects like mental health and cognitive abilities have become consistent
topics in societal dialogues [12]. A majority of parents, specifically 64%, held the belief that video games were
responsible for fostering addiction. Furthermore, more than one out of every five parents were worried about video
games affecting their own child [5].
Fig. 1: Basic human emotions.
Fig. 2: Thayer’s arousal-valence emotion plane [24], [27].
Corresponding author. Tel.: +7-701-349-0001.
At the same time, most parents recognize the benefits of games and allow children to download applications [4]. It
has long been assumed that emotions have a strong influence on human behavior, actions, and mental abilities [14].
Many teachers are convinced that the correct handling of games can be useful for the development of a child as well.
The development of cognitive skills based on games depends on the quality of the content offered by the developers
of such games. The emotional state of the child during the game largely determines the interest in the computer game.
There are six classic human emotions: happiness, surprise, fear, disgust, anger, and sadness. However, according
to recent findings [17], basic emotion transmission is divided into four (rather than six) types. Specifically, in the
early phases, anger and disgust, as well as fear and surprise, are perceived identically. A wrinkled nose, for example,
expresses both anger and disgust, while lifted eyebrows communicate surprise and fear. Basic human emotions are
represented in Fig. 1. We use Thayer’s arousal-valence emotion plane [24] as our taxonomy and consider seven emotions
(six basic and neutral), each belonging to one of the four quadrants of the emotion plane (see Fig. 2). In adults, the expression
of emotions is less natural and determined by upbringing and cultural code. In general, the language of emotions
is more universal, but there are some differences in facial expressions and gestures among different people.
The field of emotion recognition (ER) has gained growing interest in recent times. The complexity arising from
factors like different poses, illumination conditions, motion blurring, and more makes the identification of emotions
from audio-video sources a challenging task [29]. Moreover, a limited number of works investigate kids’ ER. One of
the recent studies explored the use of ER to improve online school learning [18].
Most emotion recognition algorithms are still limited to a single modality at the moment. However, in everyday
life, humans frequently conceal their true feelings, which leads to the dilemma that single-modal emotion recognition
accuracy is relatively poor [26]. The majority of works in this area employ CNN models for ER tasks. Some studies
use multiple deep models (CNN and RNN for images; SVM and LSTM for acoustic features) [8]. The authors of another
study proposed a multi-modal residual perceptron network for multimodal ER [3]. An interesting approach has been
proposed for ER based on generating a representative set of frames from videos using the eigenspace domain and
principal component analysis [9]. Another paper introduces spatiotemporal attention-based multimodal deep neural
networks for dimensional ER [11].
Parents, wanting to occupy their child with something useful on a smartphone or computer, are not always sure
about the effects of such games. Current cognitive learning games very often forget about the relationship between
cognition and emotions, which is characterized by a “hot executive function” [7]. Game developers are now
focused mostly on the end result, forgetting about the exciting gameplay and emotions that children experience in the
process. The interaction of children with educational games and applications should be easily accessible and engaging
while encouraging them to achieve and complete tasks.
In this article, we introduce a framework that integrates a fuzzy logic approach to identify emotions by
analyzing both audio and video data. Our primary emphasis is on computer games designed for children, with the goal
of improving their overall gaming experience. By automatically monitoring the emotions of children as they navigate
through gameplay, developers can pinpoint pivotal moments within the gaming experience and make games that
truly connect with kids. Our contributions include the proposed fuzzy fusion technique and an exploration of audio-video
emotional stability and diversity in addition to the emotions themselves.
Fig. 3: Methodology representation.
Fig. 4: Some samples from FER with children’s emotions.
Fig. 5: Frequency of emotions in the dataset.
2. Methodology
The proposed approach is shown in Fig. 3. Our framework has two important stages: (1) feature extraction and emotion
detection, and (2) fuzzy fusion of emotions. From the incoming player video data we extract the audio track and video frames, perform
feature extraction, detect emotions, and pass them to a fuzzy inference system to perform the fusion.
2.1. FER 2013
We used the facial expression recognition 2013 (FER 2013) emotion dataset [6], which was presented at the
conference [19]. This database contains 35887 grayscale images of people’s faces with a resolution of 48x48 pixels.
All images were divided into 7 categories: 0=Angry, 1=Disgust, 2=Fear, 3=Happy, 4=Sad, 5=Surprise, 6=Neutral.
An example of children’s emotions from the database is shown in Fig. 4.
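The label coding and image size described above can be illustrated with a short sketch. It assumes that each image is stored as a space-separated string of grey levels, as in the commonly distributed CSV version of the dataset; the helper names `parse_pixels` and `label_name` are hypothetical and are not part of the paper's code.

```python
import numpy as np

# Label order exactly as listed in the text (0=Angry ... 6=Neutral).
FER_LABELS = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]
IMG_SIDE = 48  # images are 48x48 pixels


def parse_pixels(pixel_string):
    """Convert a space-separated string of grey levels into a 48x48 uint8 array.

    ASSUMPTION: the storage format (one string per image) is not stated in the paper.
    """
    values = np.array([int(v) for v in pixel_string.split()], dtype=np.uint8)
    if values.size != IMG_SIDE * IMG_SIDE:
        raise ValueError(f"expected {IMG_SIDE * IMG_SIDE} pixels, got {values.size}")
    return values.reshape(IMG_SIDE, IMG_SIDE)


def label_name(index):
    """Map an integer class index to its emotion name."""
    return FER_LABELS[index]


# Synthetic stand-in for one dataset row (not real FER data).
fake_row = " ".join(str(i % 256) for i in range(IMG_SIDE * IMG_SIDE))
image = parse_pixels(fake_row)
print(image.shape, label_name(3))
```

Keeping the label list in the same order as the numeric codes makes the mapping a plain index lookup, which avoids off-by-one mistakes when model outputs are converted back to emotion names.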
The dataset contains images not only of fully visible faces but also of partially occluded ones (for example, covered by a hand),
low-contrast images, images of people wearing glasses, etc. The dataset is divided into two parts, test and training. The test
set is needed to compare the recognition accuracy with that of other models. The training set is needed for the training and
optimization of models. The distribution of the FER2013 dataset across emotion categories is shown in Fig. 5. The FER dataset is
standardized so that its data has zero mean (expectation) and unit variance. The output
is reported for seven categories: “angry”, “disgust”, “fear”, “happy”, “sad”, “surprise”, and “neutral”. Each emotion
is scored on a scale from 0 to 1.
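The normalization to zero mean and unit variance can be sketched as follows. The text does not say whether the statistics are computed per image or over the whole dataset, so this example assumes global statistics over a batch; the function name `standardize` and the random input are illustrative only.

```python
import numpy as np


def standardize(images, eps=1e-8):
    """Shift and scale pixel values to zero mean and unit variance.

    ASSUMPTION: statistics are computed over the whole batch, not per image.
    """
    images = np.asarray(images, dtype=np.float64)
    mean = images.mean()
    std = images.std()
    # eps guards against division by zero for a constant input.
    return (images - mean) / (std + eps)


# Random stand-in for a batch of sixteen 48x48 grey-level images.
rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(16, 48, 48))
normalized = standardize(batch)
print(normalized.mean(), normalized.std())
```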
Fig. 6: Input Fuzzy sets for Emotion Intensity (the same for audio and video) and Output Fuzzy Sets for Overall Emotion Intensity.
Fig. 7: Example of the application interface.
3. Application and Results
3.1. Prototype Application
Fig. 7 illustrates the prototype application mockup. As can be seen, it allows tracking the average emotion,
the prevailing emotions in audio and video, and emotional stability and diversity. As a result, parents will be offered a report
about the feelings their child had while playing different computer games, the child's emotional stability or instability, and
the prevailing emotions associated with this condition. Such a report lets parents
see the effect of different games and decide how much their child can play. Together with a psychologist, it can help
to figure out the best way to manage gaming based on the child’s emotions.
3.2. Experimental Results
In this section, we present our preliminary experimental results. We conducted an experiment with a 7-year-old child
and observed his emotions during three games: a fighting game, a racing game, and a logic game. The speech emotion analysis
was performed on 10-second audio clips extracted from the video and analyzed second by second, which led to the following results:
Fight game - ['disgust' 'sad' 'sad' 'sad' 'disgust' 'happy' 'fearful' 'sad' 'sad' 'disgust']
Racing game - ['sad' 'sad' 'sad' 'fearful' 'neutral' 'neutral' 'fearful' 'happy' 'sad' 'disgust']
Logic game - ['neutral' 'neutral' 'neutral' 'neutral' 'disgust' 'angry' 'neutral' 'neutral' 'sad' 'neutral']
Table 2: Emotion recognition results from video corresponding to the Fight game. Five illustrative frames (F1, F2, F3, F4, F5) were chosen among 262
to show how emotions change. These frames correspond to the ones presented in Fig. 9.
Emotion   F1    ...  F2    ...  F3    ...  F4    ...  F5    Mean  Median  Variance  SD
Happy     0.04  ...  0.01  ...  0.84  ...  0.54  ...  0.04  0.18  0.07    0.05      0.23
Angry     0.01  ...  0.02  ...  0.01  ...  0.01  ...  0.02  0.04  0.03    0.001     0.03
Disgust   0     ...  0     ...  0     ...  0     ...  0     0.0   0.0     0.0       0.0
Fear      0.03  ...  0.06  ...  0.01  ...  0.02  ...  0.02  0.06  0.05    0.003     0.05
Neutral   0.87  ...  0.65  ...  0.14  ...  0.38  ...  0.84  0.59  0.63    0.04      0.2
Sad       0.05  ...  0.26  ...  0     ...  0.02  ...  0.08  0.11  0.07    0.01      0.11
Surprise  0     ...  0     ...  0     ...  0.02  ...  0     0.01  0.01    0.001     0.03
Fig. 8: Video Emotion Recognition Results. (a) Fight Game (b) Racing Game (c) Logic Game (tic-tac-toe)
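The per-second audio labels listed above can be summarized into a prevailing emotion per game. The sketch below assumes that "prevailing" means the most frequent label in the 10-second clip, which the paper does not state explicitly; the names `AUDIO_LABELS` and `prevailing_emotion` are hypothetical.

```python
from collections import Counter

# Per-second audio labels copied from the results reported in the text.
AUDIO_LABELS = {
    "fight": ["disgust", "sad", "sad", "sad", "disgust",
              "happy", "fearful", "sad", "sad", "disgust"],
    "racing": ["sad", "sad", "sad", "fearful", "neutral",
               "neutral", "fearful", "happy", "sad", "disgust"],
    "logic": ["neutral", "neutral", "neutral", "neutral", "disgust",
              "angry", "neutral", "neutral", "sad", "neutral"],
}


def prevailing_emotion(labels):
    """Return the most frequent label and its share of the clip.

    ASSUMPTION: 'prevailing' is taken to mean 'most frequent'.
    """
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


summary = {game: prevailing_emotion(labels) for game, labels in AUDIO_LABELS.items()}
for game, (label, share) in summary.items():
    print(f"{game}: {label} ({share:.0%} of seconds)")
```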
Video emotion recognition results for the Fight game are shown in Fig. 9. Table 2 illustrates the emotion detection
results for some video frames. To illustrate how emotions evolve, 5 exemplary frames (F1, F2, F3, F4, F5) were
selected from the set of 262 frames. These frames match those in Fig. 9. Fig. 8 shows emotion tracking results for
each of the selected games. As we can see from Fig. 8, the Fight game data exhibits emotional instability and more
emotional diversity than the other games, including happy, neutral, sad, and fear emotions. We can also see that for the
Logic game, there is one leading emotion, neutral. Emotional stability can be related to the standard deviation (Table 2),
and emotional diversity to the number of emotions with High or Medium intensity.
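A minimal sketch of these two indicators is given below. It uses only the five illustrative frames from Table 2, so the standard deviations will not match the Table 2 values, which were computed over all 262 frames. The cut-off `MEDIUM_THRESHOLD` is a hypothetical stand-in for the Medium and High fuzzy sets, whose exact boundaries are not given in the text, and counting an emotion when its peak score reaches the cut-off is likewise an assumption.

```python
import statistics

# Scores for the five illustrative frames F1..F5 of the Fight game (Table 2).
FRAME_SCORES = {
    "happy":    [0.04, 0.01, 0.84, 0.54, 0.04],
    "angry":    [0.01, 0.02, 0.01, 0.01, 0.02],
    "disgust":  [0.0, 0.0, 0.0, 0.0, 0.0],
    "fear":     [0.03, 0.06, 0.01, 0.02, 0.02],
    "neutral":  [0.87, 0.65, 0.14, 0.38, 0.84],
    "sad":      [0.05, 0.26, 0.0, 0.02, 0.08],
    "surprise": [0.0, 0.0, 0.0, 0.02, 0.0],
}

# HYPOTHETICAL cut-off above which an emotion counts as Medium or High.
MEDIUM_THRESHOLD = 0.3


def stability(scores):
    """Population standard deviation per emotion; larger means less stable."""
    return {emotion: statistics.pstdev(values) for emotion, values in scores.items()}


def diversity(scores, threshold):
    """Number of emotions whose peak score reaches the threshold (assumed rule)."""
    return sum(1 for values in scores.values() if max(values) >= threshold)


sd = stability(FRAME_SCORES)
n_diverse = diversity(FRAME_SCORES, MEDIUM_THRESHOLD)
least_stable = max(sd, key=sd.get)
print(least_stable, round(sd[least_stable], 3), n_diverse)
```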
We can now simulate our fuzzy system by simply specifying the inputs and applying the defuzzification method.
For example, let us find out what the overall emotion intensity would be in the following scenario: the audio and
video intensity values for the Happy emotion are 12% and 85%, respectively. The output membership functions
are combined using the maximum operator (fuzzy aggregation). Next, to get a crisp answer, we need to perform
defuzzification, for which we use the centroid method. As a result of performing aggregation based on fuzzy rules, we get
47.55% as the overall intensity for the Happy emotion. The visualized result is presented in Fig. 10.
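The simulation steps described above (rule firing, aggregation with the maximum operator, centroid defuzzification) can be sketched in plain NumPy. The membership functions and the rule base below are hypothetical: the paper's actual fuzzy sets are those of Fig. 6 and its rules are not reproduced in this text, so this sketch is not expected to return exactly 47.55%. The use of the minimum operator for rule firing and clipping (Mamdani-style inference) is also an assumption.

```python
import numpy as np

# Universe of discourse: intensity in percent.
X = np.linspace(0.0, 100.0, 1001)

# HYPOTHETICAL triangular sets given as (left foot, peak, right foot).
SETS = {
    "low": (0.0, 0.0, 50.0),
    "medium": (0.0, 50.0, 100.0),
    "high": (50.0, 100.0, 100.0),
}

# HYPOTHETICAL rule base: (audio term, video term) -> overall term.
RULES = {
    ("low", "low"): "low", ("low", "medium"): "low", ("low", "high"): "medium",
    ("medium", "low"): "low", ("medium", "medium"): "medium", ("medium", "high"): "high",
    ("high", "low"): "medium", ("high", "medium"): "high", ("high", "high"): "high",
}


def trimf(u, a, b, c):
    """Triangular membership; a == b or b == c gives a shoulder at the edge."""
    u = np.asarray(u, dtype=float)
    left = np.ones_like(u) if a == b else (u - a) / (b - a)
    right = np.ones_like(u) if b == c else (c - u) / (c - b)
    return np.clip(np.minimum(left, right), 0.0, 1.0)


def fuse(audio, video):
    """Fuse two intensities (percent) into one overall intensity (percent)."""
    aggregated = np.zeros_like(X)
    for (a_term, v_term), out_term in RULES.items():
        # Rule strength: minimum of the two input memberships (assumed AND).
        strength = min(float(trimf(audio, *SETS[a_term])),
                       float(trimf(video, *SETS[v_term])))
        # Clip the output set at the rule strength, then aggregate with max.
        clipped = np.minimum(strength, trimf(X, *SETS[out_term]))
        aggregated = np.maximum(aggregated, clipped)
    total = aggregated.sum()
    if total == 0.0:
        return 0.0
    # Centroid defuzzification on the uniform grid.
    return float((X * aggregated).sum() / total)


overall = fuse(12.0, 85.0)
print(f"overall intensity: {overall:.2f}%")
```

With a different set of membership functions or rules the crisp output shifts accordingly, which is why the fuzzy sets of Fig. 6 would be needed to reproduce the reported value.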
4. Conclusion
Our study aimed to highlight the significance of audio-video ER in developing more engaging, safe, and effective
games for children. We proposed a fuzzy logic-based approach to aggregate the emotions detected from video frames
and sounds. For that, we used the FER emotion library as the basis.
Our work can contribute to the advancement of emotionally aware computer games. Using ER software, game
developers can identify problems, work on eliminating them, and enhance the user experience, aiming for
games that connect with children on a deeper level. Parents can control the influence of certain games on their kids and
track the emotions associated with them.
Fig. 9: Video emotion detection results for the Fight game and Thayer’s arousal-valence emotion planes for each of the 5 selected frames.
Fig. 10: Simulation Results. (a) Applying input 12% on Audio Intensity fuzzy set (b) Applying input 85% on Video Intensity fuzzy set (c) Aggregated Membership and Result, 47.55%
The study has certain limitations. An adult’s face differs from a child’s, but we trained the models on all kinds of faces.
Moreover, interpreting emotions solely through audio and video signals might not capture the complete emotional
context of gameplay, because children of different ages might have varying emotional responses and cognitive abilities.
Despite these limitations, preliminary results demonstrate that the approach is a promising tool that has the potential to make
computer games more child-oriented based on emotional data.
The study has several open questions. In particular, we want to understand more about how emotions from sound
and video connect and contrast with each other. Researchers like [3], [29], [8], [11], [26] have delved into similar
As for future work, we plan to test the system in real settings to see how well it performs. Future experiments will
involve more participants of different ages engaging with different game types. After testing, it will be possible
to conduct interviews with the participants to ask clarifying questions. According to recent findings, the expression of
fear and neutral emotions differs considerably between children and adults [15]. So, we plan to
improve the ER framework by training the models on kids’ faces and sounds only.
References
[1] Burnwal, S., . Speech emotion recognition (kaggle).
speech-emotion- recognition/notebook. Accessed: 2023-08-25.
[2] Cao, H., Cooper, D.G., Keutmann, M.K., Gur, R.C., Nenkova, A., Verma, R., 2014. Crema-d: Crowd-sourced emotional multimodal actors
dataset. IEEE Transactions on Affective Computing 5, 377–390. doi:10.1109/TAFFC.2014.2336244.
[3] Chang, X., Skarbek, W., 2021. Multi-modal residual perceptron network for audio–video emotion recognition. Sensors 21. URL: https:
//, doi:10.3390/s21165452.
[4] Dore, R.A., Logan, J., Lin, T.J., Purtell, K.M., Justice, L.M., 2020. Associations between children’s media use and language and literacy skills.
Frontiers in Psychology 11. doi:10.3389/fpsyg.2020.01734.
[5] Frontier, . How parents perceive their children’s video game habits. URL:
e-is- for-everyone- video-game- study.
[6] Goodfellow, I.J., Erhan, D., Carrier, P.L., Courville, A., Mirza, M., Hamner, B., Cukierski, W., Tang, Y., Thaler, D., Lee, D.H., Zhou, Y.,
Ramaiah, C., Feng, F., Li, R., Wang, X., Athanasakis, D., Shawe-Taylor, J., Milakov, M., Park, J., Ionescu, R., Popescu, M., Grozea, C.,
Bergstra, J., Xie, J., Romaszko, L., Xu, B., Chuang, Z., Bengio, Y., 2013. Challenges in representation learning: A report on three machine
learning contests. arXiv:1307.0414.
[7] Gray, S.I., Robertson, J., Manches, A., Rajendran, G., 2019. Brainquest: The use of motivational design theories to create a cognitive training
game supporting hot executive function. International Journal of Human-Computer Studies 127, 124–149. doi:10.1016/J.IJHCS.2018.08.004.
[8] Guo, X., Polanía, L.F., Barner, K.E., 2020. Audio-video emotion recognition in the wild using deep hybrid networks. arXiv:2002.09023.
[9] Hajarolasvadi, N., Demirel, H., 2020. Deep facial emotion recognition in video using eigenframes. Image Processing, IET doi:10.1049/
[10] Haq, S., Jackson, P., 2009. Speaker-dependent audio-visual emotion recognition, in: Proc. Int. Conf. on Auditory-Visual Speech Processing
(AVSP’08), Norwich, UK.
[11] Lee, J., Kim, S., Kim, S., Sohn, K., 2018. Audio-visual attention networks for emotion recognition, in: Proceedings of the 2018 Workshop
on Audio-Visual Scene Understanding for Immersive Multimedia, Association for Computing Machinery, New York, NY, USA. pp. 27–32. doi:10.1145/
[12] Lieberoth, A., Fiskaali, A., 2021. Can worried parents predict effects of video games on their children? A case-control study of cognitive
abilities, addiction indicators and wellbeing. Frontiers in Psychology 11. doi:10.3389/fpsyg.2020.586699.
[13] Livingstone, S.R., Russo, F.A., 2019. Ravdess emotional speech audio. URL:, doi:10.34740/
[14] Metcalfe, J., Mischel, W., 1999. A hot/cool-system analysis of delay of gratification: Dynamics of willpower. Psychological Review 106,
3–19. doi:10.1037/0033-295X.106.1.3.
[15] Park, H., Shin, Y., Song, K., Yun, C., Jang, D., 2022. Facial emotion recognition analysis based on age-biased data. Applied Sciences 12.
URL:, doi:10.3390/app12167992.
[16] Pichora-Fuller, M.K., Dupuis, K., 2020. Toronto emotional speech set (TESS). doi:10.5683/SP2/E8H2MF.
[17] Jack, R.E., Garrod, O.G.B., Schyns, P.G., 2014. Dynamic facial expressions of emotion transmit an evolving
hierarchy of signals over time. Accessed: 2014-01-02.
[18] Rathod, M., Dalvi, C., Kaur, K., Patil, S., Gite, S., Kamat, P., Kotecha, K., Abraham, A., Gabralla, L., 2022. Kids’ emotion recognition using
various deep-learning models with explainable ai. Sensors 22, 8066. doi:10.3390/s22208066.
[19] Sambare, M., . Fer13. URL:
[20] Shamoi, E., Turdybay, A., Shamoi, P., Akhmetov, I., Jaxylykova, A., Pak, A., 2022. Sentiment analysis of vegan related tweets using mutual
information for feature selection. PeerJ Computer Science 8, e1149. doi:10.7717/peerj-cs.1149.
[21] Shamoi, P., Inoue, A., Kawanaka, H., 2016. Fhsi: Toward more human-consistent color representation. Journal of Advanced Computational
Intelligence and Intelligent Informatics 20. doi:10.20965/jaciii.2016.p0393.
[22] Shamoi, P., Inoue, A., 2012. Computing with words for direct marketing support system, in: Midwest Artificial Intelligence and Cognitive
Science Conference. URL: 841/submission_36.pdf.
[23] Shenk, J., CG, A., Arriaga, O., Owlwasrowk, 2021. justinshenk/fer: Zenodo. doi:10.5281/zenodo.5362356.
[24] Thayer, R.E., 2000. Mood regulation and general arousal systems. Psychological Inquiry 11, 202–204. URL:
[25] Ualibekova, A., Shamoi, P., 2022. Music emotion recognition using k-nearest neighbors algorithm, in: 2022 International Conference on Smart
Information Systems and Technologies (SIST), pp. 1–6. doi:10.1109/SIST54437.2022.9945814.
[26] Wu, X., Tian, M., Zhai, L., 2022. Icanet: A method of short video emotion recognition driven by multimodal data. arXiv:2208.11346.
[27] Yang, Y.H., Su, Y.F., Lin, Y.C., Chen, H., 2007. Music emotion recognition: The role of individuality. Proceedings of the ACM International
Multimedia Conference and Exhibition, 13–22. doi:10.1145/1290128.1290132.
[28] Zadeh, L.A., 1965. Fuzzy sets. Information and Control 8, 338–353. doi:10.1016/S0019-9958(65)90241-X.
[29] Zhou, H., Meng, D., Zhang, Y., Peng, X., Du, J., Wang, K., Qiao, Y., 2019. Exploring emotion features and fusion strategies for audio-video
emotion recognition, in: 2019 International Conference on Multimodal Interaction, ACM. doi:10.1145/3340555.3355713.
Speech Emotion Recognition, abbreviated as SER, is the task of identifying a person's emotions and affective state from speech. It relies on the fact that the tone and pitch of the voice often reflect the underlying emotion. Emotion recognition has been a fast-growing field of research in recent years. Unlike humans, machines do not have the ability to comprehend and express emotions. However, human communication with the computer can be improved by using automatic emotion recognition, thereby reducing the need for human intervention. In this project, basic emotions such as calm, happiness, fear, disgust, etc. are analyzed from emotional speech. We use a machine learning technique, the multilayer perceptron classifier (MLP Classifier), to classify the given data into groups. Mel-frequency cepstral coefficients (MFCC), chroma, and mel features are extracted from the speech signals and used to train the MLP classifier. To accomplish this, we use Python libraries such as Librosa, sklearn, pyaudio, numpy and audio file to analyze the speech patterns and recognize the emotion. Keywords: Speech emotion recognition, mel cepstral coefficient, artificial neural network, multilayer perceptrons, mlp classifier, python.