Unveiling personality traits through Bangla speech using Morlet wavelet
transformation and BiG
Md. Sajeebul Islam Sk. , Md. Golam Rabiul Alam
Department of Computer Science and Engineering, BRAC University, Merul Badda, Dhaka 1212, Bangladesh
ARTICLE INFO
Keywords:
Bangla speech
Personality classification
MoMF
MEWLP
DistilRo
BiG
ABSTRACT
Speech serves as a potent medium for expressing a wide array of psychologically significant attributes.
While earlier research on deducing personality traits from user-generated speech predominantly focused on
other languages, there is a noticeable absence of prior studies and datasets for automatically assessing user
personalities from Bangla speech. In this paper, our objective is to bridge the research gap by generating speech
samples, each imbued with distinct personality profiles. These personality impressions are subsequently linked
to OCEAN (Openness, Conscientiousness, Extroversion, Agreeableness, and Neuroticism) personality traits. To
gauge accuracy, human evaluators, unaware of the speaker’s identity, assess these five personality factors.
The dataset is predominantly composed of around 90% content sourced from online Bangla newspapers,
with the remaining 10% originating from renowned Bangla novels. We perform feature-level fusion by combining MFCC and LPC features to construct the MELP and MEWLP features. We also introduce the MoMF feature extraction method, which applies a Morlet wavelet transform and fuses the result with MFCC features. We develop two soft voting ensemble models, DistilRo (based on DistilBERT and RoBERTa) and BiG (based on Bi-LSTM and GRU), for personality classification in the speech-to-text and speech modalities, respectively. The DistilRo model achieves an F1 score of 89% in the speech-to-text modality and the BiG model achieves an F1 score of 90% in the speech modality.
1. Introduction
Personality is like the fingerprint of our inner selves. An indi-
vidual is a unique combination of traits, behaviors, and characteris-
tics (Matthews et al., 2003). Some of us light up in social gatherings, while others find solace in quieter moments. Our interests, the way we react to challenges, and our sense of humor are all threads that weave together into the beautiful tapestry of who we are (Pervin, 2003). Nature gives us a head start with certain traits, but
life experiences add their own splash of color to the canvas (Burger,
2014). Personality remains a dynamic and evolving essence, a key to
understanding ourselves and connecting with others on a profound
level.
Speech serves as a potent medium for the expression of a multitude
of psychologically significant phenomena (Ryumina et al.,2024). For
instance, within a matter of a few hundred milliseconds of encountering
speech, humans have the remarkable ability to consistently deduce
an extensive array of details about the speaker (Pisanski and Bryant,
2019). Beyond mere directed dialogs and basic command-and-control
interfaces, the realm of voice-based human–machine interaction is
broadening. Machines must now possess the capability to compre-
hend inputs and generate responses within a distinct context, and
this interaction is influenced by numerous factors, with voice quality playing a prominent role, necessitating a more intricate level of interpretation and output generation (Polzehl et al., 2010). The modern landscape
is reminiscent of the vivid characters that inhabit the pages of famous
novels, each showcasing a unique facet of human nature. The increasing
number of online platforms, including news portals, social media, and
blogs (Rudra et al., 2020), makes it easier for people to raise their voices
on a multitude of subjects. Various systems have been proposed to
characterize an individual’s personality, and a substantial number of
these systems revolve around the framework of the Big Five personality
traits (Goldberg,1993). This model endeavors to depict a person’s
personality by utilizing five distinct factors, somewhat akin to a vector.
Previously, many methodologies have utilized a variety of lexicons,
linguistic elements, psycholinguistic factors, and emotional attributes
within a supervised learning framework to ascertain a user’s person-
ality from textual and spoken interactions. These approaches have
employed a spectrum of learning models, ranging from conventional
SVM (Polzehl et al.,2010), KNN, BPT, TF-IGM (Kamalesh and Bharathi,
2022), naive Bayes, and so on, to the contemporary deep learning
strategies like CNN, MLP (Rudra et al.,2020), Bi-LSTM (Zhou et al.,
2022), GRU (Dey and Salem,2017) and so on. In this study, we
https://doi.org/10.1016/j.nlp.2024.100113
Received 27 March 2024; Received in revised form 6 October 2024; Accepted 11 October 2024
2949-7191/©2024 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license
(http://creativecommons.org/licenses/by/4.0/).
harness the power of state-of-the-art language models to address the
intricacies of personality. Specifically, we leverage the capabilities of
RoBERTa (Robustly Optimized BERT Approach) (Liu et al.,2019) and
DistilBERT (Sanh et al.,2020), two prominent transformer-based mod-
els that have demonstrated exceptional prowess in natural language
understanding tasks. RoBERTa excels in capturing contextual informa-
tion, enabling nuanced comprehension of linguistic nuances (Adoma
et al.,2020a). On the other hand, DistilBERT strikes a balance between
computational efficiency and performance, making it an attractive
choice for tasks with resource constraints (Arslan et al.,2021). Malik
and others (Malik et al.,2023) used RoBERTa for capturing semantics
and contextual information from YouTube comments in both English
and Russian languages. Furthermore, Julianda and others (Julianda
et al.,2023), employed DistilBERT to classify personality traits within
the realm of social media. Ensemble methods work by creating a
team of classifiers that come together to make predictions on new
data (Dietterich,2000). The majority of these techniques have primar-
ily been developed for languages such as English, German (Polzehl
et al.,2010), and others. To the best of our understanding, there exists a
solitary prior effort and dataset in Bangla that deals with the intricacies
of detecting personality from social media Bangla text (Rudra et al.,
2020). However, to the best of our knowledge, no previous dataset
existed for classifying personality traits from Bangla speech. As a result,
it becomes imperative to bridge this gap in research and establish
resources that can pave the way for future investigations in this domain.
In our work, we hire a speaker to generate the speech samples. Our raters are unfamiliar with the speaker, so their observations are derived solely from the speech itself. Consequently, we conduct an initial ex-
periment to ascertain the viability of this approach. In this experiment,
human evaluators gauge a speaker’s personality by listening to diverse
speech samples. These samples are generated by a speaker who has
been instructed to simulate various personalities. The research presents
a set of contributions:
• We classify personality traits based on spoken Bangla. To the best of our knowledge, there has been no prior work specifically focusing on personality classification using Bangla speech. By developing a system adapted to Bangla, we provide a foundational framework that can be expanded upon in future research, thereby enriching the global understanding of speech-based personality classification.
• The lack of existing datasets for classifying personalities from Bangla speech posed a significant barrier to research in this area. By creating and validating a new dataset, we not only enable our own study but also provide a valuable resource for other researchers. This dataset can serve as a benchmark for future studies, promoting further advancements in the field.
• We introduce the Morlet-based Mel-frequency Cepstral Coefficients (MoMF) feature extraction method, which applies a Morlet wavelet transform and fuses the result with MFCC features. By leveraging the Morlet wavelet, MoMF captures both temporal and spectral characteristics of speech, which are crucial for accurately distinguishing between different personality traits. This method has the potential to improve the performance of speech-based classification systems across various applications.
• We develop two soft voting ensemble models for personality classification in the speech-to-text and speech modalities. DistilRo excels at inferring personality from text (speech-to-text modality), achieving an F1 score of 89%, while BiG performs even better directly from speech (speech modality), achieving an F1 score of 90%. These models demonstrate the effectiveness of ensemble techniques in improving classification accuracy and provide a robust framework for future research in this domain.
In selecting our methodology, we aim to address the unique chal-
lenges and opportunities presented by the Bangla language and its
phonetic characteristics. The Morlet Wavelet Transformation is chosen
for its efficacy in capturing both time and frequency information from
speech signals, allowing us to effectively analyze the intricate details
of Bangla speech. This approach provides a robust framework for
extracting features that are sensitive to the nuances of speech, which
are critical for accurate personality classification.
The Morlet-based Mel-frequency Cepstral Coefficients (MoMF) fea-
ture extraction method leverages the Morlet wavelet to capture both
temporal and spectral characteristics of the speech signal. This dual
capability is particularly important for Bangla speech, which features
a rich array of phonetic nuances that can be pivotal in distinguishing
between different personality traits. By combining the Morlet wavelet
with MFCCs, MoMF enhances the representation of speech features,
leading to more effective personality classification.
Moreover, our choice of BiG (Bi-LSTM and GRU) for the speech
modality and DistilRo (DistilBERT and RoBERTa) for the speech-to-
text modality leverages state-of-the-art deep learning architectures that
have demonstrated superior performance in natural language process-
ing tasks. The combination of these models allows us to harness the
strengths of both recurrent neural networks and transformer models,
providing a comprehensive approach to capturing the contextual and
sequential dependencies in speech data.
Ensemble methods, such as the soft voting models we employ,
have been shown to improve predictive performance by combining
the strengths of multiple classifiers (Verhoeven et al.,2013). In our
case, the DistilRo model excels in understanding personality from
text (speech-to-text modality), achieving an F-1 score of 89%, while
the BiG model performs even better at understanding personality di-
rectly from speech (speech modality), achieving an F-1 score of 90%.
By integrating these advanced techniques, our methodology not only
addresses the gap in existing research on Bangla speech but also es-
tablishes a new benchmark for personality classification using speech.
This dual-modality approach ensures that our models are well-equipped
to handle the complexities of speech-based personality assessment,
thereby enhancing the accuracy and reliability of our findings.
Bridging the gap in Bangla speech-based personality classification
holds significant importance for both personality psychology and nat-
ural language processing (NLP). This research expands the scope of
personality psychology by incorporating Bangla, a language spoken by
millions but underrepresented in academic studies. Including Bangla
speech data allows for a broader and more culturally diverse under-
standing of personality traits, which can lead to more comprehensive
and universally applicable personality theories. From an NLP per-
spective, creating robust models for Bangla speech advances speech
processing technologies. The methodologies and models, such as MoMF
feature extraction and ensemble techniques, can be adapted for other
low-resource languages, promoting the development of inclusive NLP
tools.
The remainder of the paper is structured as follows: Section 2 explains the personality traits and prior work. Section 3 describes the methodology for personality classification from Bangla speech. Section 4 analyzes the results of these models, and Section 5 concludes the paper and reports on limitations.
2. Literature review
The majority of prior research approaches the task of automatic
personality detection by employing the Big Five model, which char-
acterizes human personality using five dimensions. Based on the ar-
ticle (Polzehl et al., 2010) and others, we provide an overview of the
general characteristics of these personality traits:
Agreeableness (A): High scores in agreeableness often signify empathy and trustworthiness (Ryumina et al., 2024), while individuals low in agreeableness display traits such as egocentrism, competitiveness, and a predisposition towards skepticism and distrust (Polzehl, 2015).
Conscientiousness (C): High conscientiousness is recognized for precision, attentiveness, reliability, and effective planning skills, while
low conscientiousness is marked by carelessness, thoughtlessness, and a tendency to act imprudently (Matthews et al., 2003).
Extroversion (E): Individuals high in extroversion are often characterized by an outgoing, sociable nature, marked by enthusiasm, a preference for independent roles, and a vibrant, energetic demeanor, whereas introverted individuals are frequently observed to be reserved, deep in thought, and conservative in outlook (Polzehl et al., 2010).
Neuroticism (N): Individuals high in neuroticism exhibit emotional instability, are susceptible to feelings of shock, can be overwhelmed by emotions, and may lack self-confidence (Matthews et al., 2003), while those low in neuroticism are generally composed and emotionally stable, excel under pressure, and remain unflustered (Polzehl, 2015).
Openness (O): High openness reflects an individual's receptivity to new ideas, willingness to embrace novel experiences, curiosity, and adventurous experimentation (Polzehl et al., 2010), whereas low scores tend towards conservatism, favoring conventional wisdom over avant-garde thinking (Polzehl, 2015).
Our raters construct an individual's personality profile by responding to 10 statements from the NEO-FFI questionnaire (Polzehl, 2015), utilizing a five-point Likert scale that ranges from 'strongly disagree' to 'strongly agree', corresponding to numeric values from 1 to 5. Each of the five personality factors can yield a score within the range of 0 to 50. The combination of these scores results in a comprehensive personality profile for the individual.
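As a concrete illustration of this scoring scheme, the following minimal Python sketch sums ten Likert-coded responses into a single factor score; the trait name and response list are hypothetical, and reverse-keyed NEO-FFI items would need to be flipped before summing.

LIKERT = {"strongly disagree": 1, "disagree": 2, "neutral": 3,
          "agree": 4, "strongly agree": 5}

def trait_score(answers):
    """Sum ten Likert responses (1-5) for one personality factor."""
    # Reverse-keyed NEO-FFI items would be flipped (6 - value) before summing.
    return sum(LIKERT[a.lower()] for a in answers)

# Hypothetical responses to the ten Conscientiousness statements.
answers = ["strongly agree"] * 5 + ["agree"] * 4 + ["neutral"]
print(trait_score(answers))  # 44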
Early studies on figuring out people’s personality traits mainly
looked at what users wrote in essays. They checked different features
in supervised learning systems (Pennebaker and King, 1999; Das and Das, 2017). Pennebaker and King (1999) conducted a
project where they got essays from volunteers in a controlled setting.
After that, the authors of the essays were asked to describe their
personalities using the Big Five model. Later, Polzehl (2015) investigated speech-based personality in English and German, used ten personality targets, extracted features with Praat, applied linear regression, support vector machines, and artificial neural networks, and achieved 60% accuracy. The limitation of the
proposed models is their inability to extract deep-level discriminative
features needed for accurate classification of speech-based personality
traits. The First Impressions V2 Corpus from CVPR 2017 (Ponce-López
et al.,2016) introduced a significant dataset for studying first impres-
sions through multimodal cues, including speech characteristics. This
dataset is a publicly available resource that includes labels for the Big Five personality traits. Recently, the Multimodal Personality Traits
Assessment (MuPTA) Corpus from INTERSPEECH 2023 (Ryumina et al.,
2023) provided a comprehensive dataset aimed at assessing personality
traits through multiple modalities, including speech. This corpus en-
abled researchers to explore the interplay between speech patterns and
other modalities to enhance the accuracy and depth of personality trait
classification models.
The authors of Rudra et al. (2020) address text-based personality classification using Facebook and YouTube comments in Bangla, utilizing TF-IDF based feature extraction and deep learning models, and achieve a 37% F1 score on their dataset across the five basic personality dimensions. One limitation of this study
is that only Statistical Machine Learning methods and Deep Learning
approaches (MLP, FastText, C-LSTM) were proposed, which are unable
to capture the semantic relationships between words effectively. Nowadays, Twitter has become popular because it provides quick and up-to-date information. Researchers have proposed different ways to infer people's personality from short texts like tweets. The authors of Hans et al. (2021) address text-based personality and introduce a novel prediction method using a multi-modal deep learning
architecture, incorporating multiple pre-trained language models such
as BERT, RoBERTa, and XLNet. This approach achieves an F1 score of
91% on Facebook datasets and 88% on Twitter datasets.
Further, feature-level fusion provides a comprehensive understanding of the underlying data, promoting synergy and improved representation for robust analysis and decision-making (Gunatilaka and Baertlein,
2001). For instance, in the study outlined (Polzehl,2015), the author
employs the Mel-frequency cepstral coefficients (MFCCs) and linear
predictive coding (LPC) feature extraction techniques to evaluate per-
sonality traits from speech data. The authors (Efat et al.,2023) em-
ployed ensemble methods for the classification of context and emotion
in political speech, achieving accuracies of 73% and 53%, respectively.
Utilizing an ensemble approach, authors integrated insights from per-
sonality recognition in text (Verhoeven et al.,2013;Fu et al.,2021a),
and speech (Alam and Riccardi,2013;Ramezani et al.,2022), demon-
strating the effectiveness of this combined strategy in achieving high
performance across their tasks. Progress in this area for the Bangla language has not been possible due to the lack of a suitable dataset.
However, for the Bangla language, we have not found any previous research on classifying personality traits from speech. This gap inspires us to create a standard dataset and evaluate how well different systems can address this problem.
3. Methodology
Our research has five phases to classify personality from Bangla
speech. We begin with data collection (Phase 1). In Phase 2, we
annotate this data to understand personality traits. Phase 3 involves
reliability checks to ensure the accuracy of our annotations. Phase 4
involves data preprocessing and feature extraction. Lastly, in Phase
5, we develop two soft voting ensemble models called DistilRo and
BiG. The following sections elaborate on each phase, detailing the
procedures and techniques employed to ensure the robustness and
accuracy of our study. In Fig. 1, we show an overview of our work.
3.1. Data collection
The data is carefully collected to ensure high quality and rel-
evance for classifying personality traits from Bangla speech. Prior investigations on various native languages, including English (Greer and Mensing, 2013; Ober, 2007), German (Baker et al., 2002), French (Cotterell et al., 2014), Spanish (Valenzuela et al., 2017), Mandarin (Black, 2009), and Hindi (Mochahary, 2019), have predominantly relied upon data collected from online newspapers and renowned novels. Based on these obser-
vations, we conduct a data collection process that involves gathering
single-sentence text from various popular online Bangla newspapers
and renowned Bangla novels. We choose short texts because they require less processing time and contain concise, focused information.
To obtain relevant content (Rudra et al.,2020), we employ an
empirical approach, using specific keywords aligned with the Big Five
personality traits (Islam et al.,2019). These keywords are meticulously
chosen to ensure that the collected texts are indicative of the respective
traits. For instance, we select three specific Bangla keywords for the trait of Extroversion. This keyword-based
extraction method enables us to gather a wide array of texts that
effectively represent the different personality traits.
To maintain high quality, we hire 6 graduate students from the Bangla Department, who possess a deep understanding of the Big Five per-
sonality traits, to review and verify the collected text. Their expertise
is instrumental in ensuring that the texts accurately represent the
targeted personality traits (Openness, Conscientiousness, Extroversion,
Agreeableness, and Neuroticism) described in Section 2. This
multi-layered verification process helps us to filter out any irrelevant
or low-quality sentences and maintain the correct semantic information
in our data.
Furthermore, we manually filter the texts to remove any content
that does not adequately represent the intended personality traits. This
manual selection of texts is essential to make sure each text accurately reflects the specific trait it is supposed to represent. Finally, we find that 90% of the text scripts are taken from various online Bangla
Fig. 1. Top-level Overview of the Proposed System.
Table 1
Summary of data.
Traits Sample data Website
Agreeableness https://www.prothomalo.com/
Conscientiousness https://www.bbc.com/bengali/
Extroversion https://www.newsbangla24.com/
Neuroticism https://www.ebanglalibrary.com/
Openness https://www.kazinazrulislam.org/
newspapers and the remaining 10% come from well-known Bangla
novels. Examples of the collected texts for each personality trait, along
with their sources, are provided in Table 1to illustrate the diversity
and relevance of the data.
To prepare a realistic speech corpus for our experiments in the
speech-to-text modality, we enlist a professional speaker. The speaker
is given the task of immersing himself in the NEO-FFI descriptions
(Goldberg,1993), which represent 5 personality profiles. The speaker
records a ‘natural’ version of a predefined text, speaking in his ordi-
nary manner without acting. We instruct the speaker to perform this
imitation at least 7 times for each text. All the audio recordings are
conducted in a room that prevents external noise. Then we carefully
examine all the audio recordings and select the best 5 audio samples
for each text. The complete audio database comprises a total of 1750
recorded files.
In the speech modality, we utilize 71% of the text from our pre-
defined material. We provide specific instructions to our speaker to
enact 10 distinct personality variations, resulting in the desired speech
recordings. These instructions are crafted following the guidelines out-
lined in Section 2. The speaker’s performance aims at portraying ex-
treme ends of the personality factors. To illustrate, for the trait of
agreeableness, we direct the speaker to imitate both a highly agreeable
person and a significantly less agreeable person. Since the speaker is
professional, there is a higher likelihood of obtaining authentic and
natural speech samples. We ensure that the speaker carries out this imitation process a minimum of 6 times for each text. We follow the same recording procedure as before. Afterward, we carefully re-
view the audio and select the best 4 recordings for each text. As a result
of this effort, our audio database contains a total of 1000 recorded files.
Following an approach similar to Polzehl (2015), we provide a summary of all the recorded data and conditions in Table 2. We believe that sharing this dataset will benefit other researchers in the field. The dataset can be accessed at the following reference: (Sk., 2024).
Table 2
An overview of the recordings.
Speech-to-text modality Speech modality
Predefined text Yes Yes
Domain Newspapers and novels Newspapers and novels
Speaker-dependency Yes Yes
Acted/non-acted Non-acted Acted
Linguistic diversity "Short Text" "Short Text"
Dataset size 5 h 3 h
Number of speakers 1 1
Audio capturing quality 44.1 kHz, mono 44.1 kHz, mono
3.2. Data annotation
In the speech-to-text modality, we enlist the assistance of 5 graduate
students who are knowledgeable about the Big Five personality traits
and are not acquainted with the speaker. Each of them is allocated 350
randomly selected recorded audio files along with NEO-FFI question-
naires. We select 10 questionnaires from NEO-FFI for each personality
trait. Thus, for the five distinct personality traits, we have a total of
50 questionnaires. These students carefully listen to each audio file
multiple times and complete the questionnaires. Each question in the questionnaire offers five response options, ranging from 'strongly disagree' to 'strongly agree', corresponding to numeric values from 1 to 5. Subsequently, we calculate numerical values for each audio file
based on the NEO-FFI questionnaire responses.
To ensure consistency among the annotators, we implement several
steps. Firstly, all annotators are provided with a detailed annotation
manual that included clear guidelines on how to assess each personality
trait using the NEO-FFI questionnaires. Additionally, we conduct an
initial training session for all annotators. During this session, annotators
practice on a subset of audio files and complete the questionnaires in-
dependently. Their responses are then discussed collectively to identify
any discrepancies and to standardize their understanding of the rating
process.
To further ensure inter-rater reliability, we calculate the inter-rater
reliability coefficient, Cohen’s Kappa, for a randomly selected subset
of annotated audio files. Cohen’s Kappa values ranged from 0.75 to
0.85 across different personality traits, indicating substantial agreement
among the annotators (Landis and Koch,1977). Our dataset’s average
Cohen’s Kappa value across all traits was 0.794.
The audio file that generates the highest value for a particular
trait is selected as representative of that trait. To illustrate the anno-
tation process, we provide an example concerning the Conscientious-
ness trait. Fig. 2shows the responses to NEO-FFI questions related to
Conscientiousness for a specific audio file.
In the speech modality, we engage 5 graduate students who are also
well-versed in the Big Five personality traits. Similar to the speech-
to-text phase, each of them is provided with 200 randomly selected recorded audio files along with NEO-FFI questionnaires, and they are unknown to one another. These students follow the same procedure, listening to the audio files and completing the questionnaires to determine the
personality traits represented in each audio recording. Following the
annotation phase, our findings are presented in both Table 3and
Table 4. All human raters involved in the annotation process pro-
vide informed consent. They are briefed on the nature of the study,
their roles, and data usage. Privacy and confidentiality are maintained
by anonymizing personal identifiers and storing data securely with
restricted access.
3.3. Reliability analysis
We assess the reliability of the personality trait measures used in our
research. The reliability analysis is crucial in ensuring that the measure-
ment items consistently capture the underlying personality constructs.
Table 3
Speech-to-text modality.
Label Number of data
Agreeableness 350
Extroversion 350
Openness 350
Neuroticism 350
Conscientiousness 350
Table 4
Speech modality.
Label Number of data
HighAgree 100
LowAgree 100
HighExtrover 100
LowExtrover 100
HighOpen 100
LowOpen 100
HighNeurotic 100
LowNeurotic 100
HighConscientious 100
LowConscientious 100
Table 5
Speech-to-text modality.
Traits Cronbach’s alpha
Agreeableness 0.83
Extroversion 0.86
Openness 0.83
Neuroticism 0.85
Conscientiousness 0.81
Cronbach’s alpha (Tavakol and Dennick,2011), a well-established mea-
sure of internal consistency, is employed to evaluate the reliability of
each personality trait scale utilized in our research. A high Cronbach’s
alpha value, typically 0.8 or higher, indicates strong
internal consistency. This suggests that the items within the scale are
closely related and collectively contribute to a reliable measurement of
the intended variable (Bonett and Wright,2015).
During the speech-to-text modality, we take the numeric values
that have been previously calculated for each audio file in the data
annotation phase. These numeric values represent various aspects of
personality traits. We use these numbers to compute Cronbach’s alpha
for each personality trait. In Table 5, we provide an overview of the
Cronbach’s alpha values obtained from our data.
In the speech modality, we use the NEO-FFI questionnaire to assess personality through speech and assign ratings using a five-point Likert scale (Polzehl, 2015). Since we have 5 assessors performing this assessment, the total score for each personality trait can range from 0 to 50.
To provide a clear sense of how ratings on this scale correspond
to the various personality traits, we create histograms for each of the
Big 5 personality traits and a Gaussian distribution curve in Fig. 3.
This figure enables a direct comparison between the ratings for high traits (depicted in purple) and low traits (depicted in green). If the difference between high and low traits is not noticeable, the Gaussian curves mostly overlap. However, we observe that, for certain traits like extroversion, there is only a small area of overlap, indicating a clear distinction between high and low trait ratings. In contrast, for other traits such as openness and neuroticism, there is more overlap between the two, suggesting that distinguishing between high and low trait ratings is less straightforward in this dataset. The reason can be that
openness and neuroticism are inherently correlated personality traits.
Individuals high in openness may experience negative emotions, which
can be associated with neuroticism. Another reason can be that the
speaker does not achieve the intended results. Other speakers might do
better. We need to test these hypotheses with different speakers through
Fig. 2. Annotation of Conscientiousness Trait Based on a Single Audio File. The questions evaluated different aspects of Conscientiousness. The responses were rated on a scale
from 1 to 5 and used to determine the numerical value 44 that represented the level of Conscientiousness in the audio file.
repeated experiments in the future. It is also possible that these factors
cannot be accurately judged just by listening to speech samples. If that
is the case, we expect the ratings and results to be inconsistent. So, some
pairs of high and low personality traits are more easily distinguishable
based on the data, while others exhibit more similarity.
3.4. Data preprocessing and feature extraction
We perform data preprocessing in two distinct modalities to prepare
our audio dataset for analysis.
In the first modality (Speech-to-Text), we focus on converting audio
data into text format, which is essential for subsequent text-based
analysis. The labels for the audio files are determined based on the
directory structure of the dataset. To convert audio to text, we employ
the SpeechRecognition library. This library allows us to transcribe
spoken words and convert them into a textual representation, making
the data accessible for text-based processing. For tokenization, we use
the built-in DistilBERT tokenizer and the RoBERTa tokenizer.
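A minimal sketch of this step, assuming the SpeechRecognition and transformers packages and Google's web speech API for Bangla; the file path and pretrained checkpoints are illustrative.

import speech_recognition as sr
from transformers import DistilBertTokenizer, RobertaTokenizer

recognizer = sr.Recognizer()
with sr.AudioFile("openness_sample_001.wav") as source:       # hypothetical file
    audio = recognizer.record(source)
text = recognizer.recognize_google(audio, language="bn-BD")   # Bangla transcription

# Tokenize the transcript for the two baseline models.
distil_tokens = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")(
    text, truncation=True, padding="max_length", return_tensors="pt")
roberta_tokens = RobertaTokenizer.from_pretrained("roberta-base")(
    text, truncation=True, padding="max_length", return_tensors="pt")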
In the second modality (Speech Processing), we use audio files in WAV format. Similar to the speech-to-text modality, the labels for the audio
files are determined based on the directory structure of the dataset. In
our approach, we employ four feature extraction techniques: MFCCs,
MELP, MEWLP, and MoMF.
MFCCs: For feature extraction, we use ‘‘Mel-frequency cepstral co-
efficients’’ (MFCCs) that capture a snapshot of the audio’s acoustic
characteristics and translate them into numerical data. MFCCs trans-
form the speech signal into a frequency domain using Fourier Trans-
form (Trabelsi and Ayed,2012). It divides the audio signal into small
time frames, and for each frame, it calculates the energy. Mel scale is
then applied to approximate the perception of pitch and the intensity
of sound is converted to a logarithmic scale. Then, the data is processed
using DCT to convert the information into a set of coefficients, and
these coefficients capture the unique features of the sound (Alim and
Rashid,2018). From the article (Alim and Rashid,2018), we can define
the MFCCs formula:
$m(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$  (1)

$a_m = \sum_{l=1}^{L} (\log o_l)\,\cos\left[m\left(l - \frac{1}{2}\right)\frac{\pi}{L}\right]$  (2)

where $f$ is the frequency in Hz, $L$ is the number of mel cepstrum coefficients, $o_l$ is the filterbank output, and $a_m$ is the $m$-th coefficient. MFCCs extract rele-
vant features from audio signals with reduced dimensionality of the
data, less sensitivity to background noise, and inherent variations in
pitch (Kamarulafizam et al.,2007).
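The following minimal sketch, assuming the librosa library, extracts 40 MFCCs from a recording and averages them over time to obtain one vector per file; the file name is hypothetical.

import librosa
import numpy as np

def extract_mfcc(path, n_mfcc=40):
    # Load the 44.1 kHz mono recording and compute frame-wise MFCCs.
    y, sr = librosa.load(path, sr=44100, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)          # one 40-dimensional vector per recording

features = extract_mfcc("high_extrover_sample.wav")    # hypothetical file
print(features.shape)                 # (40,)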
3.4.1. Mel-frequency cepstral coefficients with linear predictive coding
(MELP)
One drawback of MFCCs is their relatively poor performance in cap-
turing temporal dynamics and short-duration events in audio signals.
This drawback can be addressed by Linear Predictive Coding (LPC) (Labied and Belangour, 2021), which explicitly models the temporal characteristics of the signal by estimating the parameters of a linear predictive model. So, we perform a feature-level fusion of MFCC and LPC features to form MELP, which captures unique acoustic features in speech signals. LPC tries to calculate
a set of coefficients that describe the filter and tries to predict the next
sound sample based on previous samples (Labied and Belangour,2021).
From Alim and Rashid (2018), we can define the LPC formula:
$b_n = \log\left[\frac{1 - p_n}{1 + p_n}\right]$  (3)

Now from Eq. (2) and Eq. (3), we get:

$MELP = MFCCs \oplus LPC$  (4)

$MELP = a_m \oplus b_n$  (5)
Combining MFCCs and LPC in the MELP feature extraction provides
a thorough representation of speech signals. MFCCs capture spectral
features and details about intensity (Trabelsi and Ayed,2012), and LPC
focuses on the characteristics of the vocal tract (Labied and Belangour,
2021) that offer precise temporal information.
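A minimal sketch of the MELP fusion in Eqs. (4)-(5), assuming librosa; keeping all order+1 LPC terms alongside 40 MFCCs yields an 81-dimensional vector, which would be consistent with the (81, 1) input shape reported in Table 8, although the exact fusion script may differ.

import librosa
import numpy as np

def extract_melp(path, n_mfcc=40, lpc_order=40):
    y, sr = librosa.load(path, sr=44100, mono=True)
    a_m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)  # MFCC part
    b_n = librosa.lpc(y, order=lpc_order)       # LPC part (order + 1 coefficients)
    return np.concatenate([a_m, b_n])           # MELP = MFCCs ⊕ LPC (feature-level fusion)

print(extract_melp("high_agree_sample.wav").shape)   # (81,), hypothetical file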
3.4.2. Mel-frequency cepstral coefficients with wiener linear predictive cod-
ing (MEWLP)
One drawback of MELP is that it struggles to capture both the fine
spectral details and precise temporal dynamics simultaneously. This
limitation can be addressed by Wiener filtering (Chen et al., 2006), which
Fig. 3. The graph shows the probability distributions of acted speech for the five traits: Agreeable (Trait A), Conscientious (Trait C), Extrovert (Trait E), Neurotic (Trait N), and Open (Trait O). On the left, green bars represent ratings for speech acted to convey low personality trait scores; on the right, purple bars represent ratings for speech acted to convey high personality trait scores. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
provides more robust estimation in dynamic environments. For MEWLP
feature extraction, we use the Wiener filter and then compute the LPC
coefficients of the denoised signal. Furthermore, we concatenate WLPC
coefficients of the denoised signal with MFCCs to obtain the feature vector. It is defined as follows, where $y[n]$ is the original audio signal:

$LPC(y, order) = a_k; \quad k = 1, 2, \ldots, order = 40$  (6)

In Eq. (6), $a_k$ are the LPC coefficients.

$ns[n] = y[n] + (nl \cdot rn[n])$  (7)

In Eq. (7), we generate a noisy version of the original signal by adding random noise; $ns$, $nl$, and $rn$ represent the noisy signal, the noise level, and the random noise, respectively.

$W_k = \frac{|a_k|}{|a_k| + \mathrm{var}(y - ns)}$  (8)
In Eq. (8), the Wiener filter coefficients are computed, where $W_k$ is the Wiener filter coefficient for the $k$-th LPC coefficient.

$denoised\_signal[n] = y[n] \cdot W_n; \quad n = 0, 1, \ldots, \mathrm{len}(y) - 1$  (9)

In Eq. (9), the Wiener filter coefficients are applied to each sample of the original signal.

$WLPC(denoised\_signal, order) = a'_k$  (10)

In Eq. (10), the WLPC coefficients of the denoised signal are computed, where $a'_k$ are the WLPC coefficients after denoising.
$MFCC(y, n\_mfcc) = \frac{1}{T}\sum_{t=1}^{T} MFCC_t$  (11)

In Eq. (11), the Mel-frequency cepstral coefficients (MFCC) are extracted, where $MFCC_t$ is the vector of MFCC coefficients at time $t$ and $T$ is the number of frames.

$cf = [MFCC_1, \ldots, MFCC_{n\_mfcc}, a'_1, \ldots, a'_{order}]$  (12)

In Eq. (12), the MFCC and WLPC coefficients are concatenated to obtain the combined feature vector, where $cf$ represents the combined features.
In summary, MEWLP enhances the robustness of the MELP representation, resulting in improved performance in noisy environments and an enhanced signal-to-noise ratio.
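The following is a minimal sketch of one reading of Eqs. (6)-(12), assuming librosa; how the order-length Wiener gains $W_k$ are mapped onto individual samples in Eq. (9) is not fully specified, so here they are simply tiled across the signal, which is our assumption.

import librosa
import numpy as np

def extract_mewlp(path, n_mfcc=40, order=40, noise_level=0.01):
    y, sr = librosa.load(path, sr=44100, mono=True)
    a_k = librosa.lpc(y, order=order)                       # Eq. (6): LPC of the clean signal
    ns = y + noise_level * np.random.randn(len(y))          # Eq. (7): noisy version
    w_k = np.abs(a_k) / (np.abs(a_k) + np.var(y - ns))      # Eq. (8): Wiener coefficients
    gains = np.resize(w_k, len(y))                          # assumption: tile gains over samples
    denoised = y * gains                                    # Eq. (9)
    wlpc = librosa.lpc(denoised, order=order)               # Eq. (10): WLPC of denoised signal
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)  # Eq. (11)
    return np.concatenate([mfcc, wlpc])                     # Eq. (12): combined feature vector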
3.4.3. Morlet-based mel-frequency cepstral coefficients (MoMF)
We introduce a novel feature extraction technique named MoMF
for analyzing both time-domain and frequency-domain characteristics
simultaneously. Unlike traditional Fourier-based methods that provide
limited temporal resolution at high frequencies, the Morlet wavelet en-
hances time-frequency localization, making it well-suited for capturing
transient events and spectral variations in signals. For MoMF feature
extraction, we use a Morlet low-pass filter that effectively suppresses
the higher-frequency components of the Morlet wavelet (Lin and Qu,
2000), resulting in a filter that retains the low-frequency information in
the analysis of time-frequency representations of signals. Since all audio
samples in our dataset feature male voices, this approach effectively
isolates the distinctive lower-frequency components inherent in male
vocalizations that enhance the analysis of key characteristics such as
pitch and fundamental frequency. From Cohen (2019), we can define
the Morlet wavelet formula:
$Morlet(t) = \cos(2\pi f_c t)\cdot e^{-\frac{t^2}{2B^2}}$  (13)

In Eq. (13), $f_c$, $B$, and $t$ represent the center frequency, the bandwidth, and time, respectively. Since we want to capture low-frequency information, we need a convolution operation. The convolution operation is:

$fs[n] = \sum_{k=-\infty}^{\infty} signal[k]\cdot Morlet(n - k)$  (14)

In Eq. (14), $fs$ represents the filtered signal, $signal$ is the input signal, and the Morlet wavelet acts as the convolution kernel. The convolution operation is used here to implement the Morlet low-pass filter.
$x[j] = |FFT(fs[n])|$  (15)

In Eq. (15), we use the Fast Fourier Transform (FFT) to compute the magnitude spectrum $x[j]$ of the filtered signal.
$ms[i] = \sum_{j} mf[i, j]\cdot x[j]$  (16)

In Eq. (16), the Mel filterbank is applied using matrix multiplication, where $i = 0, \ldots, num\_filters - 1$, $j = 0, \ldots, \mathrm{len}(x) - 1$, and $ms$ and $mf$ represent the mel spectrum and the mel filters, respectively. The mel spectrum captures the energy distribution across different frequency bands, emphasizing regions that are more perceptually significant. Then log compression is applied to mimic the human auditory system's sensitivity to differences in loudness:
$lce[i] = \log(\epsilon + ms[i])$  (17)

In Eq. (17), a logarithmic compression is applied to the Mel spectrum, where $lce$ represents the log-compressed energies and $\epsilon$ is a small constant to avoid taking the logarithm of zero.
$fv[j] = \sum_{i=0}^{N-1} lce[i]\,\cos\left(\frac{\pi}{N}\, j\left(i + \frac{1}{2}\right)\right)$  (18)

In Eq. (18), the DCT (Discrete Cosine Transform) is applied to the log-compressed energies to obtain the feature vector, where $fv$ represents the feature vector and $N$ represents the number of coefficients. We then concatenate $fv$ with the MFCCs to get MoMF. From Eq. (18) and Eq. (2), we get:

$MoMF = fv \oplus MFCCs$  (19)

The MFCC part is a column vector of size $(M \times 1)$ and the Morlet part is a column vector of size $(N \times 1)$. After concatenation, we get a single vector of shape $((M + N) \times 1)$ for each signal.
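To make the pipeline concrete, the following is a minimal sketch of MoMF (Eqs. (13)-(19)), assuming librosa and scipy; the time support of the 30-sample Morlet kernel and the use of a single full-signal FFT are our assumptions, not a definitive implementation.

import librosa
import numpy as np
from scipy.fft import dct

def extract_momf(path, fc=1000.0, bandwidth=5.0, kernel_len=30,
                 n_filters=30, n_mfcc=40, eps=1e-5):
    y, sr = librosa.load(path, sr=44100, mono=True)
    t = (np.arange(kernel_len) - kernel_len // 2) / sr           # assumed kernel time axis
    morlet = np.cos(2 * np.pi * fc * t) * np.exp(-t ** 2 / (2 * bandwidth ** 2))  # Eq. (13)
    fs = np.convolve(y, morlet, mode="same")                     # Eq. (14): low-pass filtering
    x = np.abs(np.fft.rfft(fs))                                  # Eq. (15): magnitude spectrum
    mel_fb = librosa.filters.mel(sr=sr, n_fft=len(fs), n_mels=n_filters)
    ms = mel_fb @ x                                              # Eq. (16): mel spectrum
    lce = np.log(eps + ms)                                       # Eq. (17): log compression
    fv = dct(lce, type=2, norm="ortho")[:n_filters]              # Eq. (18): DCT feature vector
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)
    return np.concatenate([fv, mfcc])                            # Eq. (19): MoMF = fv ⊕ MFCCs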
3.5. DistilRo and BiG: Soft voting ensemble models for personality classifi-
cation in speech, and speech-to-text modalities
For the classification of Speech, and Speech-to-Text modalities, we
use different techniques. Ensemble learning models (Zhou,2012) lever-
age the decisions made by various baseline models to enhance overall
performance. For ensemble classifiers to work well, it is important that
the base classifiers are somewhat different from each other (Da Silva
et al.,2014). Brown and his team suggested three ways to make them
diverse: (1) starting from different points in their thinking, (2) having
access to different information, like using different training data, and
(3) using different methods or strategies to make decisions (Brown
et al.,2005). In our research, we focus on the third way. We use
four different base classifiers and combine them in different ways to
make two ensemble classifiers for classifying personality traits. We
chose soft voting (Zhou,2012) as our combination method, which
is a popular method. It works by averaging the probabilities given
by each base classifier for each possible decision. Then, it picks the
decision with the highest average probability as the final result. It
is important to note that ensemble classifiers do not always perform
better than individual base classifiers (Fu et al., 2021b). So, we build two different ensemble classifiers to see if they can do better when classifying personality traits.
For classifying personalities in the speech-to-text modality, we use DistilRo. DistilRo is a soft voting classifier that brings together the advantages of two baseline models: DistilBERT (Sanh et al., 2020) and RoBERTa (Liu et al., 2019). These models collaborate in a soft voting setup to analyze the personality traits, capturing the semantic aspects present in the text. For multi-class classification in the speech modality, we use BiG. BiG is another soft voting classifier, this time utilizing Bi-LSTM (Zhou et al., 2022) and GRU (Dey and Salem, 2017) baseline models. These components work together to make accurate predictions while also considering the semantic subtleties present in the spoken language. We now explain each model and then the parameter tuning used to develop them.
DistilBERT and RoBERTa: DistilBERT (Sanh et al., 2020) is a smaller, distilled version of the BERT (Devlin et al., 2019) model. It uses the same architecture as BERT, which is a transformer model (Ameer et al., 2023).
Based on the model (Sanh et al.,2020), multi-head self-attention of
Fig. 4. An overview of the DistilBERT workflow.
the transformer allows the model to focus on different parts of the input sentence and learn the relationships between them. It is defined as:
$g(Q, K, V) = c(h_1, \ldots, h_t)\, w_0$  (20)

$h_l = a(Q w^Q_l, K w^K_l, V w^V_l), \quad a(Q, K, V) = s\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$  (21)

In Eqs. (20) and (21), multi-head, concat, attention, and softmax are denoted by $g$, $c$, $a$, and $s$, respectively, and $h_l$ is the output of the $l$-th attention head.
The Feed-forward neural network allows the model to apply non-
linear transformations to the input and learn complex features. It is
defined as:
$f(a) = r(a w_1 + d_1) w_2 + d_2$  (22)

In Eq. (22), the ReLU function is denoted by $r$.
The model takes as input a sequence of tokens, which are words or subwords, and converts them into vectors using word embeddings and position embeddings. Word embeddings capture the meaning of each token, and position embeddings capture the order of each token in the sequence (Adoma et al., 2020b). The model then sums the word embeddings and position embeddings to produce a final hidden state for each token (Sanh et al., 2020). The equation for the output of DistilBERT is:

$hidden_l = transformer(e + p)$  (23)

$transformer(a) = f(g(a, a, a))$  (24)
In Eq. (23), e is the embedding matrix, p is the position embedding
matrix, and transformer is the Transformer encoder with l layers.
The final hidden state can be used for different tasks; in our case it is a classification task.
RoBERTa (Liu et al.,2019) uses the same architecture as BERT,
which is based on the Transformer model with more layers and parameters that make it larger and more powerful (Sharma et al., 2021). The working procedures of DistilBERT and RoBERTa are the same. In Fig. 4, we
show an overview of DistilBERT model where we use one openness text
from our dataset.
3.5.1. DistilRo
Our DistilRo model is a soft voting classifier, bringing together two
powerful baseline models: DistilBERT and RoBERTa. In this classifier,
Table 6
Baseline Models parameters of DistilRo.
Baseline Models Parameters
DistilBERT train_batch_size = 16
eval_batch_size = 64
epochs = 11
RoBERTa train_batch_size = 8
eval_batch_size = 32
gradient_accumulation = 4
epochs = 20
each model gives its prediction along with a probability. Then, the
DistilRo classifier combines these probabilities from both models for
each prediction and selects the one with the highest total probability
as the final prediction.
$\hat{z} = \arg\max_k \sum_{l=1}^{p} w_l\, s_{kl}$  (25)

In Eq. (25), $w_l$ signifies the weight attributed to the $l$-th classifier, while $s_{kl}$ denotes its predicted score for class $k$.
We have fine-tuned each baseline model to optimize the perfor-
mance of the soft voting classifier. In Table 6, we provide some specific
parameters that are employed in these baseline models.
We have provided a visual representation of its structure in Fig. 1.
This unique combination of DistilBERT and RoBERTa, working together
as a soft voting classifier, is designed to capture and utilize seman-
tic information, ensuring accurate and reliable results in personality
classification.
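A minimal sketch of the DistilRo voting step, assuming fine-tuned Hugging Face checkpoints saved locally; the checkpoint paths and label order are hypothetical, and the two classifiers are weighted equally as in Eq. (25).

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

TRAITS = ["Agreeableness", "Conscientiousness", "Openness", "Neuroticism", "Extroversion"]

def load(path):
    return (AutoTokenizer.from_pretrained(path),
            AutoModelForSequenceClassification.from_pretrained(path))

distil_tok, distil_model = load("./distilbert-bangla-personality")   # hypothetical path
roberta_tok, roberta_model = load("./roberta-bangla-personality")    # hypothetical path

def distilro_predict(text):
    probs = []
    for tok, model in [(distil_tok, distil_model), (roberta_tok, roberta_model)]:
        inputs = tok(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs.append(torch.softmax(model(**inputs).logits, dim=-1))
    avg = torch.stack(probs).mean(dim=0)        # soft voting: average class probabilities
    return TRAITS[int(avg.argmax())]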
Bi-LSTM: Bi-LSTM (Zhou et al.,2022) is one kind of LSTM (Graves
et al.,2013) that can learn long-term dependencies from sequential
data in both forward and backward directions. It combines both LSTM
and bidirectional processing for sequence learning (Mughees et al.,
2021). Based on Graves et al. (2013), for the forward direction, it can
be defined as:
$p_k = \sigma(w_{ap} a_k + w_{lp} l_{k-1} + d_p)$  (26)

$q_k = \sigma(w_{aq} a_k + w_{lq} l_{k-1} + d_q)$  (27)

$r_k = \sigma(w_{ar} a_k + w_{lr} l_{k-1} + d_r)$  (28)
$\tilde{s}_k = \tanh(w_{as} a_k + w_{ls} l_{k-1} + d_s)$  (29)

$s_k = q_k \odot s_{k-1} + p_k \odot \tilde{s}_k$  (30)

$l_k = r_k \odot \tanh(s_k)$  (31)

where $\sigma$ is the sigmoid function, $\tanh$ is the hyperbolic tangent function, $\odot$ is pointwise multiplication, and $a_k$ is the input vector at time step $k$. $p_k$, $q_k$, $r_k$, and $\tilde{s}_k$ are the input, forget, output, and cell gates, respectively, $s_k$ is the cell state, and $l_k$ is the hidden state at time step $k$. $w$ and $d$ are the weight matrices and biases for each gate.
For the backward direction:

$\overline{p}_k = \sigma(\overline{w}_{ap} a_k + \overline{w}_{lp}\, \overline{l}_{k-1} + \overline{d}_p)$  (32)

$\overline{q}_k = \sigma(\overline{w}_{aq} a_k + \overline{w}_{lq}\, \overline{l}_{k-1} + \overline{d}_q)$  (33)

$\overline{r}_k = \sigma(\overline{w}_{ar} a_k + \overline{w}_{lr}\, \overline{l}_{k-1} + \overline{d}_r)$  (34)

$\tilde{\overline{s}}_k = \tanh(\overline{w}_{as} a_k + \overline{w}_{ls}\, \overline{l}_{k-1} + \overline{d}_s)$  (35)

$\overline{s}_k = \overline{q}_k \odot \overline{s}_{k-1} + \overline{p}_k \odot \tilde{\overline{s}}_k$  (36)

$\overline{l}_k = \overline{r}_k \odot \tanh(\overline{s}_k)$  (37)

For all backward equations, the notation is similar to the forward direction, but with an overline to indicate the backward direction. The final hidden state concatenates the forward and backward directions and is defined as:

$l^{final}_k = [\, l_k \oplus \overline{l}_k \,]$  (38)

where $\oplus$ denotes the concatenation operation. The final hidden state can be used for different tasks; in our case it is a classification task.
GRU: GRU (Dey and Salem,2017) is a simplified version of LSTM to
process a sequence of tokens and consists of two gates, i.e., a reset gate and an update gate. These two gates decide how much information to keep or
discard from the previous and current states (Yu and Markov,2017).
Based on Dey and Salem (2017), the equations can be defined as:
$m_k = \sigma(w_{am} a_k + w_{lm} l_{k-1} + d_m)$  (39)

$n_k = \sigma(w_{an} a_k + w_{ln} l_{k-1} + d_n)$  (40)

$\tilde{l}_k = \tanh(w_{al} a_k + w_{ll} (m_k \odot l_{k-1}) + d_l)$  (41)

$l_k = (1 - n_k) \odot l_{k-1} + n_k \odot \tilde{l}_k$  (42)

where $\sigma$ is the sigmoid function, $\tanh$ is the hyperbolic tangent function, $\odot$ is pointwise multiplication, $a_k$ is the input at time step $k$, and $m_k$ and $n_k$ are the reset and update gates, respectively. $\tilde{l}_k$ is the candidate hidden state and $l_k$ is the hidden state. $w$ and $d$ are the weight matrices and biases for each gate.
3.5.2. BiG
Our BiG model is a soft voting classifier, bringing together the two
baseline models: Bi-LSTM and GRU. In this classifier, each model produces a prediction along with a probability. The BiG classifier then combines these probabilities from both models for each prediction and picks the class with the highest overall probability as the final prediction. We have fine-tuned each baseline model to optimize the performance of the soft voting classifier. In Table 7 and Table 8, we provide the specific parameters employed in these baseline models based on the MFCCs, MoMF, MELP, and MEWLP feature extraction techniques, respectively.
We have provided a visual representation of its structure in Fig. 1.
This combination of Bi-LSTM and GRU, working together as a soft
voting classifier, is designed to capture and utilize the semantic subtleties present in spoken language, ensuring accurate and reliable results in
personality classification.
Table 7
Baseline Models parameters of BiG using MFCCs & MoMF.
Baseline Model Parameters
Bi-LSTM regularizer = 0.01
dropout = 0.2
loss = categorical_crossentropy
epochs = 92 (MFCCs)
epochs = 95 (MoMF)
GRU batch_size = 64
optimizer = adam
epochs = 100 (MFCCs)
epochs = 220 (MoMF)
Table 8
Baseline Models parameters of BiG using MELP & MEWLP.
Baseline Model Parameters
Bi-LSTM input shape = (81,1)
dropout = 0.2
loss = categorical_crossentropy
epochs = 114 (MELP)
epochs = 142 (MEWLP)
GRU batch_size = 64
input shape = (81,1)
optimizer = adam
epochs = 200 (MELP)
epochs = 186 (MEWLP)
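For illustration, the following minimal sketch, assuming TensorFlow/Keras, builds the two BiG branches with the dropout and l2 settings from Table 7 and an assumed hidden size of 64, and averages their softmax outputs at inference; it is a sketch of the ensemble idea rather than the exact training setup.

import numpy as np
from tensorflow.keras import layers, models, regularizers

NUM_CLASSES, FEAT_DIM = 10, 81        # ten high/low trait labels, MELP/MEWLP-sized input

def build_bilstm():
    return models.Sequential([
        layers.Input(shape=(FEAT_DIM, 1)),
        layers.Bidirectional(layers.LSTM(64, kernel_regularizer=regularizers.l2(0.01))),
        layers.Dropout(0.2),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

def build_gru():
    return models.Sequential([
        layers.Input(shape=(FEAT_DIM, 1)),
        layers.GRU(64),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

def big_predict(bilstm, gru, features):
    # Soft voting: average the two branches' class probabilities, then take the arg max.
    x = np.asarray(features, dtype="float32").reshape(1, FEAT_DIM, 1)
    probs = (bilstm(x, training=False).numpy() + gru(x, training=False).numpy()) / 2.0
    return int(probs.argmax(axis=-1)[0])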
4. Results and discussion
In this section, we commence our discussion by addressing the crit-
ical matter of parameter selection for feature extraction. Subsequently, we describe the validation results of our experiments and discuss the models' performance on our dataset of Bangla speech. The dataset is partitioned, with 80% reserved for training data and the remaining 20% allocated for validation data. For training our baseline models, we use the Google Colab Pro platform (Bisong et al., 2019). To gauge how well our models are doing, we use confusion matrices (Deng et al., 2016), precision, recall, and F1 score (Goutte and Gaussier, 2005). These help us understand how well our models classify personality traits.
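As a sketch of the evaluation protocol (assuming scikit-learn; a logistic-regression stand-in replaces our actual models here), the split and metrics can be reproduced as follows.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Stand-in data: X holds feature vectors, y the ten high/low trait labels.
X = np.random.randn(200, 81)
y = np.repeat(np.arange(10), 20)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)      # 80/20 split

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_val)
print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred, digits=2))        # precision, recall, F1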
4.1. Parameter selection for feature extraction
In the context of MELP, each feature vector is composed of the
concatenation of 40 default MFCCs features and 40 LPC features. One-
hot encoding is employed for label representation across all feature
extraction methods. For MEWLP, 40 LPC features are specified for each
audio file, while the noise_level is fixed at 0.01. In the case of MFCCs, the number of frames, denoted as $T$, is set to 40. Concerning MoMF, the parameters are configured as follows: $f_c = 1000$, $B = 5$, and $t = 30$ in Eq. (13). A total of 30 mel filters ($mel\_filters = 30$) are utilized in Eq. (16). To prevent issues associated with the logarithm of zero, $\epsilon$ is set to $1e{-5}$. Furthermore, Eq. (18) involves a specification of $N = 30$, representing the number of coefficients.
4.2. Personality traits classification using DistilRo
DistilRo uses two models called DistilBERT and RoBERTa as its
foundation. These models are designed to work together to improve the
performance of the DistilRo model. Additionally, to prevent overfitting
when dealing with smaller datasets, the models use an l2 regularization mechanism. Using DistilBERT and RoBERTa as baseline models makes DistilRo perform better in classifying the personality traits. In Table 9 and
Table 10, we present the baseline performance of these models when it
comes to categorizing individual personality traits. Table 9 presents the
precision, recall, and F1-score for different personality traits classified
Table 9
RoBERTa classification results.
Traits Precision Recall F1-score
Agreeableness 0.83 0.88 0.85
Conscientiousness 0.73 0.77 0.75
Openness 0.98 0.84 0.91
Extroversion 0.93 0.8 0.86
Neuroticism 0.73 0.88 0.79
Macro Average 0.84 0.83 0.84
Table 10
DistilBERT classification results.
Traits Precision Recall F1-score
Agreeableness 1.00 0.82 0.9
Conscientiousness 0.76 0.77 0.77
Extroversion 0.85 0.9 0.88
Neuroticism 0.81 0.82 0.82
Openness 0.7 0.83 0.76
Macro Average 0.82 0.83 0.83
Table 11
DistilRo classification results.
Traits Precision Recall F1-score
Agreeableness 0.95 0.9 0.92
Conscientiousness 0.9 0.85 0.87
Openness 0.89 0.92 0.9
Neuroticism 1.00 0.8 0.89
Extroversion 0.71 0.97 0.82
Macro Average 0.89 0.88 0.89
by the RoBERTa model. Openness demonstrates the highest precision
(0.98) and a strong F1-score (0.91), indicating the model’s high accu-
racy in correctly identifying this trait, though it slightly underperforms
in recall (0.84), suggesting some instances of Openness are missed.
Extroversion also shows high precision (0.93) and a robust F1-score
(0.86), but a lower recall (0.80), highlighting a tendency to miss some
Extroversion instances. Conversely, Conscientiousness and Neuroticism
exhibit balanced yet moderate performance metrics, with precision and
recall around 0.73 to 0.88, indicating the model’s relatively balanced
but less accurate predictions for these traits. The Macro Average scores
(0.84 precision, 0.83 recall, 0.84 F1-score) reflect a generally strong
overall performance of the model across all traits, with slight variances
in handling specific traits. In Table 10, DistilBERT model achieves
perfect precision (1.00) for Agreeableness but a lower recall (0.82),
resulting in a high F1-score (0.90), indicating excellent accuracy in
identifying this trait but missing some instances. Extroversion exhibits
high performance across all metrics with a precision of 0.85, recall of
0.90, and F1-score of 0.88, signifying robust identification and coverage
of this trait. Neuroticism shows balanced and solid performance with
all metrics around 0.81 to 0.82, reflecting consistent classification.
Conscientiousness has moderate precision (0.76) and recall (0.77),
leading to a stable F1-score (0.77), indicating moderate accuracy and
coverage. Openness presents the lowest precision (0.70) but better
recall (0.83), with an F1-score of 0.76, suggesting the model often
misclassifies this trait but detects most instances. The Macro Average
values (0.82 precision, 0.83 recall, 0.83 F1-score) indicate an overall
balanced and effective performance of the model across all traits, with
some variation in handling specific traits.
Figs. 5 and 6 display the confusion matrices of our baseline models. Table 11 provides detailed classification results for
the ensemble model DistilRo, including its precision, recall, and F1-
score for each personality trait. Agreeableness achieves a high precision
of 0.95 and a recall of 0.90, resulting in a strong F1-score of 0.92,
indicating that the model accurately identifies this trait and captures
most instances. Conscientiousness also shows high performance with
Table 12
Bi-LSTM classification results based on MFCCs.
Traits Precision Recall F1-score
HighAgree 0.73 0.84 0.76
HighConscientious 0.92 0.90 0.91
HighExtrover 0.97 0.94 0.96
HighNeurotic 0.71 0.81 0.75
HighOpen 0.76 0.68 0.70
LowAgree 0.86 0.68 0.77
LowConscientious 0.83 0.95 0.88
LowExtrover 0.97 0.94 0.95
LowNeurotic 0.74 0.85 0.79
LowOpen 0.49 0.53 0.50
Macro Average 0.80 0.79 0.79
a precision of 0.90, recall of 0.85, and an F1-score of 0.87, reflecting
balanced and reliable classification. Openness exhibits a precision of
0.89 and recall of 0.92, leading to a robust F1-score of 0.90, suggest-
ing the model effectively identifies and covers this trait. Neuroticism
demonstrates perfect precision (1.00) but a lower recall (0.80), yield-
ing an F1-score of 0.89, indicating the model’s ability to precisely
identify Neuroticism but missing some instances. Extroversion has a
lower precision of 0.71 but a high recall of 0.97, with an F1-score
of 0.82, implying that while the model captures nearly all instances
of Extroversion, it frequently misclassifies other traits as Extroversion.
The Macro Average values (0.89 precision, 0.88 recall, 0.89 F1-score)
highlight the model’s overall balanced and high performance across
all traits, with slight variations in precision and recall for specific
traits. The main reason for the reduced performance in some areas is feature overlap between certain personality traits, which makes it hard for the model to tell them apart accurately. Additionally, the limited amount of training data further hampers the model's ability to generalize to new instances. Fig. 7 displays the confusion matrix of the DistilRo model. Table 9, Table 10, and Table 11 report the performance of the baseline models and the ensemble model. We can see that RoBERTa performs well for Extroversion and DistilRo performs well for all other traits. In terms of overall performance, DistilBERT achieves an 83% F-1 score, RoBERTa an 84% F-1 score, and DistilRo an 89% F-1 score in the speech-to-text modality.
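To make the soft-voting step concrete, the sketch below shows one way the two baselines' class probabilities could be averaged into a DistilRo-style prediction. The random probability matrices are placeholders for the real model outputs, and the shapes are illustrative assumptions rather than the authors' actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_classes = 100, 5   # five personality trait classes in the text modality

# Stand-ins for the per-class probabilities produced by the two fine-tuned
# baselines on the same test utterances (replace with real model outputs).
p_distilbert = rng.dirichlet(np.ones(n_classes), size=n_samples)
p_roberta = rng.dirichlet(np.ones(n_classes), size=n_samples)

# Soft voting: average the class probabilities, then take the most likely class.
p_ensemble = (p_distilbert + p_roberta) / 2.0
y_pred = p_ensemble.argmax(axis=1)
```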
4.3. Personality traits classification using BiG
BiG uses two models, Bi-LSTM and GRU, as its foundation. These models work together to improve the performance of the BiG ensemble. Additionally, to prevent overfitting on smaller datasets, we use an l2 regularizer and an early stopping mechanism. With Bi-LSTM and GRU as baseline models, the BiG model performs better in classifying the personality traits.
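A minimal sketch of this setup is given below, assuming Keras/TensorFlow, padded feature sequences, and illustrative layer sizes and hyperparameters; it is not the exact BiG configuration. The dummy arrays stand in for real feature data.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, regularizers, callbacks

NUM_CLASSES = 10          # high/low levels of the five traits
INPUT_SHAPE = (200, 39)   # assumed (time steps, feature dimensions) per utterance

def build_bilstm():
    # Bi-LSTM baseline with an l2-regularized recurrent layer.
    return tf.keras.Sequential([
        layers.Input(shape=INPUT_SHAPE),
        layers.Bidirectional(layers.LSTM(64, kernel_regularizer=regularizers.l2(1e-4))),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

def build_gru():
    # GRU baseline with the same regularization scheme.
    return tf.keras.Sequential([
        layers.Input(shape=INPUT_SHAPE),
        layers.GRU(64, kernel_regularizer=regularizers.l2(1e-4)),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

# Early stopping guards against overfitting on a small dataset.
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True)

# Dummy stand-ins for the real feature arrays (replace with actual data).
x_train = np.random.rand(32, *INPUT_SHAPE).astype("float32")
y_train = np.random.randint(0, NUM_CLASSES, size=32)
x_val, y_val = x_train[:8], y_train[:8]
x_test = x_train[:8]

models = [build_bilstm(), build_gru()]
for m in models:
    m.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
    m.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=30, callbacks=[early_stop])

# BiG-style soft voting: average the two baselines' class probabilities.
probs = np.mean([m.predict(x_test) for m in models], axis=0)
y_pred = probs.argmax(axis=1)
```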
MFCCs-based findings: In Table 12 and Table 13, we present the baseline performance of these models in categorizing individual personality traits. Table 12 shows that the HC, HE, LC, and LE traits are identified well by Bi-LSTM based on MFCCs, with promising results for the HA, LA, HN, LN, and HO traits, while it struggles with the LO trait. Furthermore, Table 13 shows that the HC, HE, and LE traits are identified well by GRU based on MFCCs, with promising results for the HA, LC, and LN traits, while it struggles with the HN, LA, and LO traits.
Figs. 8 and 9 display the confusion matrices of our baseline models. Table 14 provides detailed classification results for BiG, including its precision, recall, and F1-score for each personality trait, where HighAgree, LowAgree, HighExtrover, LowExtrover, HighOpen, LowOpen, HighNeurotic, LowNeurotic, HighConscientious, and LowConscientious are abbreviated as HA, LA, HE, LE, HO, LO, HN, LN, HC, and LC, respectively. The BiG model performs best in classifying HC, HE, and LE, still achieves good results for HA, HN, LN, LC, and LA, and struggles with HO and LO. Fig. 10 displays the confusion matrix of the BiG model. Bi-LSTM performs well for the HO, LA, and LC traits and BiG performs
Fig. 5. Confusion matrix of RoBERTa. RoBERTa effectively captures the Agreeable and Open personality traits. However, 17% of the Conscientious samples are misaligned with the Agreeable trait. Additionally, a 14% mismatch is identified between the Neurotic and Open traits.
Fig. 6. Confusion matrix of DistilBERT. DistilBERT effectively captures the Extrover personality trait. However, a 13% mismatch is identified between the Conscientious and Extrover traits, and a 14% mismatch between the Neurotic and Open traits.
well for all other traits. In terms of overall performance, Bi-LSTM achieves a 79% F-1 score, GRU a 73% F-1 score, and BiG an 81% F-1 score in the speech modality.
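The confusion matrices referenced throughout this section can be reproduced from true and predicted labels with standard tooling; a short sketch follows, in which the random labels merely stand in for real predictions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

labels = ["HA", "HC", "HE", "HN", "HO", "LA", "LC", "LE", "LN", "LO"]

# Stand-ins for the evaluation outputs (replace with real labels/predictions).
rng = np.random.default_rng(0)
y_true = rng.choice(labels, size=200)
y_pred = rng.choice(labels, size=200)

cm = confusion_matrix(y_true, y_pred, labels=labels)
ConfusionMatrixDisplay(cm, display_labels=labels).plot(xticks_rotation=45)
plt.tight_layout()
plt.show()
```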
MoMF-based findings: In Table 15 and Table 16, we present the baseline performance of these models in categorizing individual personality traits. Table 15 shows that the HA, HC, HE, HN, and LC traits are identified well by Bi-LSTM based on MoMF, with promising results for the HO, LA, LE, LN, and LO traits. Furthermore, Table 16 shows that the HA, HC, HE, and LE traits are identified well by GRU based on MoMF, with promising results for the LC, LN, and LO traits, while it struggles with the HN, HO, and LA traits.
Figs. 11 and 12 display the confusion matrices of our baseline models. Table 17 provides a detailed classification report for BiG, including its precision, recall, and F1-score for each personality trait. The BiG model performs best in classifying HA, HC, HE, LE, and LN, still achieves good results for HN, LA, LC, and LO, and struggles with HO. Fig. 13 displays the confusion matrix of the BiG model. Bi-LSTM performs well for the HN, HO, and LC traits, GRU performs well for the HC and LO traits, and BiG performs well for all other traits. In terms of overall performance, Bi-LSTM achieves an 80% F-1 score, GRU a 79% F-1 score, and BiG an 84% F-1 score in the speech modality.
MELP-based findings: In Table 18 and Table 19, we present the baseline performance of these models in categorizing individual personality traits. Table 18 shows that the HA, HC, HE, HN, LA, LC, and LE traits are identified well by Bi-LSTM based on MELP, with promising results for the HO, LN, and LO traits. Furthermore, Table 19 shows that the HC, HE, LC, and LE traits are identified well by GRU based
Fig. 7. Confusion matrix of DistilRo. DistilRo captures the Agreeable, Conscientious, Neurotic, and Open personality traits well. It encounters challenges in accurately representing the Extrover trait, particularly in distinguishing it from the Neurotic and Open personality traits.
Fig. 8. Confusion matrix of Bi-LSTM based on MFCCs. Bi-LSTM effectively captures all the personality traits.
Table 13
GRU classification results based on MFCCs.
Traits Precision Recall F1-score
HighAgree 0.59 0.93 0.72
HighConscientious 0.97 0.90 0.93
HighExtrover 0.92 0.98 0.95
HighNeurotic 0.55 0.84 0.67
HighOpen 0.55 0.31 0.40
LowAgree 0.97 0.36 0.53
LowConscientious 0.72 0.90 0.80
LowExtrover 0.88 0.98 0.93
LowNeurotic 0.72 0.74 0.73
LowOpen 0.67 0.39 0.49
Macro Average 0.75 0.73 0.73
Table 14
BiG classification results based on MFCCs.
Traits Precision Recall F1-score
HighAgree 0.77 1.00 0.87
HighConscientious 0.94 1.00 0.97
HighExtrover 1.00 1.00 1.00
HighNeurotic 0.74 0.87 0.80
HighOpen 0.73 0.50 0.59
LowAgree 0.94 0.62 0.75
LowConscientious 0.68 0.94 0.79
LowExtrover 0.97 0.97 0.97
LowNeurotic 0.87 0.87 0.87
LowOpen 0.56 0.50 0.53
Macro Average 0.82 0.83 0.81
Fig. 9. Confusion matrix of GRU based on MFCCs. GRU effectively captures all the personality traits except LA, HO, and LO. Most LA samples are confused with the LC, LN, and HA traits. Additionally, 52 HO samples and 40 LO samples are classified as the HN and HA traits, respectively.
Fig. 10. Confusion matrix of BiG based on MFCCs. BiG effectively captures all the personality traits except HO and LO.
Table 15
Bi-LSTM classification results based on MoMF.
Traits Precision Recall F1-score
HighAgree 0.89 0.91 0.90
HighConscientious 0.74 0.97 0.84
HighExtrover 0.96 0.76 0.85
HighNeurotic 0.86 0.96 0.91
HighOpen 0.67 0.75 0.71
LowAgree 0.89 0.59 0.71
LowConscientious 0.74 0.89 0.81
LowExtrover 0.73 0.87 0.79
LowNeurotic 0.77 0.67 0.72
LowOpen 0.86 0.62 0.72
Macro Average 0.81 0.80 0.80
Table 16
GRU classification results based on MoMF.
Traits Precision Recall F1-score
HighAgree 0.95 0.86 0.90
HighConscientious 0.95 0.95 0.95
HighExtrover 0.99 0.97 0.98
HighNeurotic 0.66 0.67 0.67
HighOpen 0.66 0.54 0.59
LowAgree 0.89 0.56 0.69
LowConscientious 0.62 0.86 0.72
LowExtrover 0.90 0.94 0.92
LowNeurotic 0.79 0.69 0.74
LowOpen 0.66 0.89 0.76
Macro Average 0.80 0.79 0.79
Fig. 11. Confusion matrix of Bi-LSTM based on MoMF. Bi-LSTM effectively captures all the personality traits, although a discernible portion of the LA samples is misaligned with LC.
Fig. 12. Confusion matrix of GRU based on MoMF. GRU effectively captures all the personality traits except LA and HO. Most LA samples are identified as LC, and for HO samples the model does not separate the HO and HN traits adequately.
Table 17
BiG classification results based on MoMF.
Traits Precision Recall F1-score
HighAgree 0.95 1.00 0.98
HighConscientious 0.89 1.00 0.94
HighExtrover 1.00 1.00 1.00
HighNeurotic 0.69 0.96 0.80
HighOpen 0.71 0.45 0.56
LowAgree 1.00 0.62 0.77
LowConscientious 0.64 1.00 0.78
LowExtrover 0.91 1.00 0.95
LowNeurotic 0.93 0.87 0.90
LowOpen 0.86 0.60 0.71
Macro Average 0.86 0.85 0.84
on MELP, with promising results for the HA, HN, LA, and LN traits, while it struggles with the HO and LO traits.
Figs. 14 and 15 display the confusion matrices of our baseline models. Table 20 provides detailed classification results for BiG, including its precision, recall, and F1-score for each personality trait. The BiG model performs best in classifying HA, HC, HE, HN, and LE, while still achieving good results for LA, HO, LC, LN, and LO. Fig. 16 displays the confusion matrix of the BiG model. Bi-LSTM performs well for the HN trait, GRU performs well for the HC, LE, LA, and LC traits, and BiG performs well for all other traits. In terms of overall performance, Bi-LSTM achieves an 85% F-1 score, GRU an 82% F-1 score, and BiG an 88% F-1 score in the speech modality.
MEWLP-based findings: In Table 21 and Table 22, we present the baseline performance of these models in categorizing individual personality traits. Table 21 shows that the HA, HC, HE, HN,
Fig. 13. Confusion matrix of BiG based on MoMF. BiG effectively captures all the personality traits, but the model struggles with HO trait data.
Fig. 14. Confusion matrix of Bi-LSTM based on MELP. Bi-LSTM effectively captures all the personality traits.
Table 18
Bi-LSTM classification results based on MELP.
Traits Precision Recall F1-score
HighAgree 0.89 0.90 0.89
HighConscientious 0.91 0.90 0.91
HighExtrover 0.94 0.93 0.92
HighNeurotic 0.91 0.96 0.94
HighOpen 0.70 0.73 0.71
LowAgree 0.94 0.81 0.87
LowConscientious 0.86 0.91 0.88
LowExtrover 0.89 0.90 0.89
LowNeurotic 0.79 0.81 0.80
LowOpen 0.90 0.63 0.73
Macro Average 0.87 0.85 0.85
Table 19
GRU classification results based on MELP.
Traits Precision Recall F1-score
HighAgree 0.65 0.97 0.78
HighConscientious 0.97 0.91 0.94
HighExtrover 0.97 0.98 0.98
HighNeurotic 0.72 0.84 0.78
HighOpen 0.68 0.63 0.65
LowAgree 0.96 0.81 0.88
LowConscientious 0.91 0.96 0.94
LowExtrover 0.90 0.99 0.94
LowNeurotic 0.82 0.79 0.81
LowOpen 0.79 0.41 0.54
Macro Average 0.84 0.83 0.82
Fig. 15. Confusion matrix of GRU based on MELP. GRU effectively captures all the personality traits except LO. Additionally, most LO samples are classified as the HA trait.
Fig. 16. Confusion matrix of BiG based on MELP. BiG correctly captures all the personality traits but struggles with the LA trait, which it sometimes confuses with LC.
LA, LC, and LE traits are identified well by Bi-LSTM based on MEWLP, with promising results for the HO, LN, and LO traits. Furthermore, Table 22 shows that the HC and HE traits are identified well by GRU based on MEWLP, with promising results for the LE and HN traits, while it struggles with the HA, HO, LA, LC, LN, and LO traits.
Figs. 17 and 18 display the confusion matrices of our baseline models. Table 23 provides detailed classification results for BiG, including its precision, recall, and F1-score for each personality trait. The BiG model performs best in classifying the HA, HC, HE, HN, LE, and LN traits while still achieving good results for the HO, LA, LC, and LO traits. Fig. 19 displays the confusion matrix of the BiG model. Bi-LSTM performs well for the HA, LA, LC, and LO traits, and BiG performs well for all other traits. In terms of overall performance, Bi-LSTM achieves an 87% F-1 score, GRU a 57% F-1 score, and BiG a 90% F-1 score in the speech modality.
Our analysis of the classification results reveals several significant
patterns in the performance metrics, particularly precision, recall, and
F1-score.
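For reference, the per-class metrics discussed below follow the standard definitions, and the Macro Average rows in the tables are unweighted means over the K trait classes:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad
\text{Macro-}F_1 = \frac{1}{K} \sum_{k=1}^{K} F_1^{(k)}
```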
For the HighExtrover trait in the BiG model, we achieve perfect scores (P = R = F1 = 1.00). This indicates an ideal classification scenario
where all instances of HighExtrover are correctly identified without any
false positives or false negatives, demonstrating the model’s robustness
and accuracy for this specific trait.
The DistilBERT model shows perfect precision (P=1.00) for Agree-
ableness but has a lower recall (R=0.82), resulting in an F1-score of
0.90. This suggests that while the model is highly accurate in iden-
tifying Agreeableness instances, it tends to miss some true positives.
The model’s conservative approach minimizes false positives but at
the cost of a reduced recall. For Extroversion, the model exhibits high
recall (R=0.90) but lower precision (P=0.85), leading to an F1-score of
Fig. 17. Confusion matrix of Bi-LSTM based on MEWLP. Bi-LSTM effectively captures all the personality traits.
Fig. 18. Confusion matrix of GRU based on MEWLP. GRU cannot capture all the personality traits; the HA, HO, LA, LC, and LO traits are confused with the LN, HN, LC, LE, and LN traits, respectively.
Table 20
BiG classification results based on MELP.
Traits Precision Recall F1-score
HighAgree 1.00 1.00 1.00
HighConscientious 0.94 0.89 0.91
HighExtrover 0.96 1.00 0.98
HighNeurotic 0.84 0.94 0.89
HighOpen 0.94 0.71 0.81
LowAgree 1.00 0.64 0.78
LowConscientious 0.68 1.00 0.81
LowExtrover 0.88 1.00 0.93
LowNeurotic 0.89 0.86 0.88
LowOpen 0.78 0.88 0.82
Macro Average 0.89 0.89 0.88
Table 21
Bi-LSTM classification results based on MEWLP.
Traits Precision Recall F1-score
HighAgree 0.93 0.96 0.94
HighConscientious 0.90 0.94 0.92
HighExtrover 1.00 0.93 0.95
HighNeurotic 0.88 0.91 0.89
HighOpen 0.75 0.80 0.77
LowAgree 0.93 0.81 0.86
LowConscientious 0.80 0.96 0.88
LowExtrover 0.91 0.86 0.87
LowNeurotic 0.77 0.95 0.85
LowOpen 0.90 0.70 0.79
Macro Average 0.88 0.88 0.87
Fig. 19. Confusion matrix of BiG based on MEWLP. BiG effectively captures all the personality traits except LO.
Table 22
GRU classification results based on MEWLP.
Traits Precision Recall F1-score
HighAgree 0.53 0.20 0.29
HighConscientious 0.96 0.97 0.97
HighExtrover 0.97 0.98 0.98
HighNeurotic 0.48 0.83 0.61
HighOpen 0.31 0.16 0.21
LowAgree 0.77 0.37 0.50
LowConscientious 0.57 0.47 0.51
LowExtrover 0.64 0.96 0.77
LowNeurotic 0.41 0.90 0.56
LowOpen 0.50 0.17 0.25
Macro Average 0.61 0.60 0.57
Table 23
BiG classification results based on MEWLP.
Traits Precision Recall F1-score
HighAgree 0.87 1.00 0.93
HighConscientious 1.00 1.00 1.00
HighExtrover 1.00 1.00 1.00
HighNeurotic 0.88 0.96 0.92
HighOpen 0.81 0.77 0.79
LowAgree 0.94 0.71 0.81
LowConscientious 0.68 0.94 0.79
LowExtrover 0.97 1.00 0.98
LowNeurotic 0.94 1.00 0.97
LowOpen 0.93 0.65 0.76
Macro Average 0.91 0.90 0.90
0.88. This indicates the model’s ability to capture most Extroversion
instances, though it also includes some false positives. The model’s
inclusive strategy increases recall but slightly compromises precision.
The classification for Neuroticism shows balanced metrics with
precision and recall both around 0.82, and an F1-score of 0.82. This
balance indicates consistent performance without significant trade-offs
between precision and recall.
The macro average scores across different models and traits re-
flect generally strong performance, with variations highlighting areas
for further refinement. By balancing precision and recall, the models
can optimize F1-scores, ensuring both accuracy and completeness in
personality trait classification.
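In practice, the per-trait and macro-averaged figures of the kind reported in the tables above can be obtained in a single call; the snippet below is a generic sketch with placeholder labels, not the exact evaluation script used in this work.

```python
from sklearn.metrics import classification_report

# Placeholder labels and predictions; replace with the real evaluation split.
y_true = ["HA", "LA", "HE", "HE", "LO"]
y_pred = ["HA", "LA", "HE", "LE", "LO"]

# Per-class precision, recall, and F1, plus a "macro avg" row analogous to the
# Macro Average rows reported in the tables.
print(classification_report(y_true, y_pred, digits=2))
```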
To evaluate our models, we test both the DistilRo and BiG models
using a separate set of 50 data samples and achieve an F-1 score of
40%. The low F-1 score is partly due to the small size of the test data,
which is not enough to fully demonstrate the models’ capabilities.
5. Conclusion
In summary, this research has successfully addressed the gap in
personality classification using Bangla speech by developing a robust
dataset and advanced classification models. Our comprehensive dataset
of Bangla speech samples annotated with Big Five personality traits
serves as a foundational resource for future research in this domain.
The dataset comprises 90% content from online Bangla newspapers and
10% from renowned Bangla novels, providing a diverse range of speech
samples.
We introduce the Morlet-based Mel-frequency Cepstral Coefficients
(MoMF) feature extraction method, which captures both temporal and
spectral characteristics of speech. This dual capability enhances the
representation of speech features, leading to more effective personality
classification. The effectiveness of this feature extraction method is
demonstrated in the classification performance of our models.
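As a rough illustration only, one way to combine Morlet wavelet information with MFCCs is sketched below using PyWavelets and librosa. The scale range, the summary statistics, and fusion by simple concatenation are assumptions made for the example and do not reproduce the exact MoMF procedure.

```python
import numpy as np
import librosa
import pywt

def momf_like_features(path, sr=16000, n_mfcc=13, n_scales=32):
    """Illustrative fusion of Morlet-CWT statistics with MFCCs (not the exact MoMF)."""
    y, sr = librosa.load(path, sr=sr)

    # MFCCs summarized over time (mean of each coefficient).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

    # Continuous wavelet transform with the Morlet wavelet; per-scale log energy.
    scales = np.arange(1, n_scales + 1)
    coeffs, _ = pywt.cwt(y, scales, "morl", sampling_period=1.0 / sr)
    cwt_energy = np.log1p(np.abs(coeffs)).mean(axis=1)

    # Feature-level fusion by concatenation.
    return np.concatenate([mfcc, cwt_energy])
```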
Our primary research focus centers around the classification of
personality traits using Bangla speech, conducted in two distinct modal-
ities: speech-to-text and speech analysis. We create our own dataset,
comprising 1750 speeches for the speech-to-text modality and 1000
acted speeches for the speech modality. In the former, we categorize the data into five distinct personality trait classes, while the latter features ten classes. The experimental dataset is recorded by a professional speaker. Our assessors are unfamiliar with the speaker and are only given the speech to annotate.
Our two soft voting ensemble models, DistilRo and BiG, show re-
markable performance in personality classification. The DistilRo model
achieves an F-1 score of 89% in the speech-to-text modality, while the
BiG model achieves an F-1 score of 90% in the speech modality. These
results highlight the potential of ensemble techniques in improving
classification accuracy. Additionally, BiG achieves an F-1 score of 81%
based on MFCCs, 84% based on MoMF, 88% based on MELP, and 90%
based on MEWLP in the speech modality. The total energy consumed during training is 50.40 kWh, with carbon emissions of 30.24 kg.
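The measurement tooling behind these figures is not stated here; as an assumption, one common way to log training energy and estimate emissions is the codecarbon tracker, sketched below.

```python
from codecarbon import EmissionsTracker

tracker = EmissionsTracker()   # logs energy use and estimates CO2 emissions
tracker.start()
# ... run model training here ...
emissions_kg = tracker.stop()  # estimated emissions in kg CO2-eq
print(f"Estimated emissions: {emissions_kg:.2f} kg CO2-eq")
```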
We observe that in the DistilRo model, DistilBERT struggles some-
what in classifying conscientiousness and openness. Similarly, in the
BiG model, the GRU faces confusion when distinguishing between high
and low agreeableness (HA, LA) and high and low openness (HO, LO). In most cases, the model treats openness trait data as neuroticism trait data. These difficulties stem from the limited amount of data available for each personality trait; we also notice that the Gaussian curves of the openness trait largely overlap, which means distinguishing between its high and low levels is less straightforward.
However, the models perform well in discriminating between high and
low extroversion in both phases.
To ensure the consistency of our annotations, we conduct inter-
rater reliability checks using Cohen’s Kappa coefficient. The average
Cohen’s Kappa value of 0.794 indicates substantial agreement among
annotators, confirming the reliability of our dataset. This step is crucial
in validating the credibility of our data and the robustness of our
findings.
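Assuming the reported value is an average over annotator pairs (the averaging scheme is an assumption here), the inter-rater check can be reproduced along these lines:

```python
from itertools import combinations
import numpy as np
from sklearn.metrics import cohen_kappa_score

# One row of trait labels per annotator (placeholder values shown).
ratings = np.array([
    ["HA", "LO", "HE", "HN", "LC"],
    ["HA", "LO", "HE", "LN", "LC"],
    ["HA", "HO", "HE", "HN", "LC"],
])

# Cohen's kappa for every annotator pair, then the mean across pairs.
pairwise = [cohen_kappa_score(ratings[i], ratings[j])
            for i, j in combinations(range(len(ratings)), 2)]
print("Average pairwise Cohen's kappa:", round(float(np.mean(pairwise)), 3))
```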
For future research, we intend to focus on factor reduction, particu-
larly exploring the correlations between NEO-FFI factors and prosodic
and acoustic signal-based features. The current database is recorded
from one professional speaker under lab conditions. In future exper-
iments, we plan to include at least 30 professional speakers and use
a wider variety of speech materials. Additionally, we see potential in
investigating the identification of emotions and exploring the relation-
ship between emotions and personality traits within speech, offering
promising avenues for further research in this field.
The methodologies and models developed in this study can be
adapted for other low-resource languages, promoting the development
of inclusive natural language processing tools. The implications of our
work extend to practical applications in areas such as human–computer
interaction, psychological assessment, and social media analysis. Future
research can build upon our dataset and models to explore further
improvements in personality classification accuracy and robustness.
These findings provide a strong foundation for advancing the field
of speech-based personality classification, particularly for underrepre-
sented languages like Bangla. The dataset and models developed in this
study contribute not only to the academic understanding of personality
traits but also have practical applications in various domains.
CRediT authorship contribution statement
Md. Sajeebul Islam Sk.: Writing – original draft, Visualization, Validation, Software, Resources, Methodology, Formal analysis, Data curation, Conceptualization. Md. Golam Rabiul Alam: Writing – review & editing, Supervision, Investigation, Conceptualization.
Declaration of competing interest
The authors declare that they have no known competing finan-
cial interests or personal relationships that could have appeared to
influence the work reported in this paper.
Acknowledgments
We thank all participants and human raters who contributed to the creation and annotation of the dataset used in this study.