December 2024
The human voice is far more complex and dynamic than other physical characteristics: it can produce speech in many languages, with varying accents, and in different emotional states. Our ears receive these signals, and our brain analyses them to determine both how we feel and how the other person feels. Although this seems effortless to us, it is a difficult task for any computing device to replicate. Automatic gender, emotion, and speaker identification systems have numerous applications in social media, robotics, and related fields. Our proposed deep learning model encodes human speech signals as spectrogram images and employs a Vision Transformer for both emotion and gender classification; this model is applied for the first time in the domain of biomedical signal processing. The proposed Vision Transformer is evaluated on the widely used RAVDESS, TESS, and URDU emotion detection datasets. On RAVDESS, we detect emotion with 73.94% accuracy and gender with 98.76% accuracy, while on the TESS and URDU emotion detection datasets we classify emotion with 99.92% and 94.02% accuracy, respectively.
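A minimal sketch of the spectrogram-plus-Vision-Transformer pipeline described above might look as follows. The library choices (librosa for the mel-spectrogram, a pretrained ViT from timm), the file name, the image size, and the class count are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch: speech clip -> mel-spectrogram "image" -> Vision Transformer classifier.
# All hyperparameters below are assumptions for illustration.
import librosa
import numpy as np
import torch
import timm

def audio_to_spectrogram_image(wav_path, sr=16000, n_mels=224, target_frames=224):
    """Load a speech clip and convert it into a 3-channel 224x224 mel-spectrogram tensor."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)  # log-scale, shape (n_mels, frames)

    # Pad or crop the time axis so every clip becomes a fixed-size image.
    if mel_db.shape[1] < target_frames:
        pad = target_frames - mel_db.shape[1]
        mel_db = np.pad(mel_db, ((0, 0), (0, pad)),
                        mode="constant", constant_values=mel_db.min())
    else:
        mel_db = mel_db[:, :target_frames]

    # Normalise to [0, 1] and replicate to the 3 channels expected by standard ViT weights.
    mel_db = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
    img = np.stack([mel_db] * 3, axis=0)  # (3, 224, 224)
    return torch.from_numpy(img).float()

# A pretrained ViT with its head replaced for 8 emotion classes
# (use num_classes=2 for the gender task instead).
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=8)
model.eval()

x = audio_to_spectrogram_image("example.wav").unsqueeze(0)  # add batch dimension
with torch.no_grad():
    logits = model(x)
predicted_class = logits.argmax(dim=1).item()
```

In a sketch like this, the ViT head would be fine-tuned separately on the emotion labels and on the gender labels of each dataset before the reported accuracies could be reproduced.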