
Generative Transformer Chatbots for Mental Health Support: A Study on Depression and Anxiety

Authors:
Jordan J. Bird
Department of Computer Science, Nottingham Trent
University
Nottingham, United Kingdom
jordan.bird@ntu.ac.uk
Ahmad Lotfi
Department of Computer Science, Nottingham Trent
University
Nottingham, United Kingdom
ahmad.lot@ntu.ac.uk
ABSTRACT
Mental health is a critical issue worldwide and eective treatments
are available. However, incidence of social stigma prevents many
from seeking the support they need. Given the rapid developments
in the eld of large-language models, this study explores the po-
tential of chatbots to support people experiencing depression and
anxiety. The focus of this research is on the engineering aspect
of building chatbots, and through topology optimisation nd an
eective hyperparameter set that can predict tokens with 88.65%
accuracy and with a performance of 96.49% and 97.88% regarding
the correct token appearing in the top 5 and 10 predictions. Exam-
ples of how optimised chatbots can eectively answer questions
surrounding mental health are provided, generalising information
from veried online sources. The results of this study demonstrate
the potential of chatbots to provide accessible and anonymous sup-
port to individuals who may otherwise be deterred by the stigma
associated with seeking help for mental health issues. However,
the limitations and challenges of using chatbots for mental health
support must also be acknowledged, and future work is suggested
to fully understand the potential and limitations of chatbots and to
ensure that they are developed and deployed ethically and respon-
sibly.
CCS CONCEPTS
• Information systems → Information retrieval; • Theory of computation → Design and analysis of algorithms; • Human-centered computing → Interactive systems and tools.
KEYWORDS
Chatbots, Natural Language Processing, Transformers, Mental Health
ACM Reference Format:
Jordan J. Bird and Ahmad Lot. 2023. Generative Transformer Chatbots for
Mental Health Support: A Study on Depression and Anxiety. In Proceedings
of the 16th International Conference on PErvasive Technologies Related to
Assistive Environments (PETRA ’23), July 5–7, 2023, Corfu, Greece. ACM, New
York, NY, USA, 6 pages. https://doi.org/10.1145/3594806.3596520
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
PETRA ’23, July 5–7, 2023, Corfu, Greece
©2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-0069-9/23/07. . . $15.00
https://doi.org/10.1145/3594806.3596520
1 INTRODUCTION
Mental health is a critical issue that aects millions of people around
the world. According to the World Health Organisation (WHO),
an estimated 5% of all adults suer from depression [WHO, 2021].
The WHO also note that, although eective treatment is available,
75% of those categorised as low- and middle-income do not receive
treatment. Indeed, awareness and acceptance of poor mental health
have steadily improved [Frank and Glied, 2006, Jones and Wessely,
2005], but there is still a signicant stigma about the need for
professional help [Sickel et al
.
, 2014]. Mental health stigma can
act as a barrier for people experiencing depression, anxiety, or
other mental health challenges, preventing them from accessing
the support they need. The prevalence of mental health stigma
has led many people to view online alternatives favourably over
physical human interaction [Hanley and Wyatt, 2021].
This knowledge leads to the concept of the online chatbot. In
recent years, advances in Natural Language Processing (NLP) have
led to the development of chatbots as a tool for promoting mental
well-being. Chatbots are computer programs that can simulate a
natural conversation, providing support through textual input and
output. Given their accessibility and anonymity, they have the
potential to help alleviate the stigma associated with seeking help
for mental health issues [Abd-Alrazaq et al., 2019].
This paper focuses on the engineering aspect of chatbots for mental health support, with a specific focus on answering questions about depression and anxiety. The study explores the hyperparameter space to build chatbots based on attention mechanisms and transformers, which are large language models. These models have shown great success in various natural language processing tasks and have the potential to provide effective and engaging support to individuals experiencing mental health challenges. Furthermore, the paper presents examples of interactions with optimised chatbots to demonstrate their effectiveness and usability. The main goal of this work is to contribute to ongoing research in the field of mental health and technology by exploring the potential of chatbots to provide accessible and effective support for people experiencing depression and anxiety.
The remaining parts of this paper are organised as follows: background and related work are presented in Section 2, followed by the proposed method in Section 3. The results and observations are presented in Section 4. Section 5 presents the conclusion and future work.
2 BACKGROUND AND RELATED WORK
Chatbots are Human-Computer Interaction (HCI) models that allow users to converse with machines through natural language [Bansal and Khan, 2018]. Most often in the modern literature, chatbots make use of artificial intelligence and machine learning to process an input and produce a response in the form of text [Suhaili et al., 2021], and have grown rapidly more prominent in research since 2015.
A recent scoping review of chatbots in mental health revealed several pieces of interesting information within the field [Abd-Alrazaq et al., 2021]. Namely, the majority of chatbots focus on support for depression and autism, and control the conversation for therapy, training, and screening. The approach in this work is that of question answering; that is, the goal of the model is to generalise online resources to provide answers to questions that people may have about the included categories.
Bhagchandani and Nayak proposed the combination of two natural language processing models for a mental health chatbot framework [Bhagchandani and Nayak, 2022]. In this study, the authors first perform text classification using sentiment analysis to discern whether the user should be directed to a chatbot for a generic chat or another for therapy-based conversation. A similar approach was proposed in CareBot [Crasto et al., 2021], where conversational data were used along with the PHQ-9 and WHO-5 screening questionnaires to train a chatbot using a multimodal approach. The study recorded lower perplexity values for transformers compared to recurrent methods, and experimental observations revealed that 63% of the participants preferred the response generated by the Transformer, over 22% for Long Short-Term Memory (LSTM) networks and 15% for the Recurrent Neural Network (RNN).
In 2021, Deshpande and Warren proposed an additional module for a mental health chatbot which could detect users at risk of self-harm [Deshpande and Warren, 2021]. In their study, text classification experiments noted that Bidirectional Encoder Representations from Transformers (BERT) could achieve 97% accuracy in recognising the risk within scraped Reddit data that were not part of the training dataset. BERT representations were also applied in a recent work, which found them a promising approach compared to classical approaches for the detection of mental health status from Reddit posts [Jiang et al., 2020]. Alongside the use of attention, several other methods have also been proposed to improve chatbots. These include data augmentation by paraphrasing [Bird et al., 2021, Joglekar, 2022], transfer learning [Prakash et al., 2020, Syed et al., 2021], reinforcement learning [Cuayáhuitl et al., 2019, Liu et al., 2020], and ensemble learning [Almansor et al., 2021, Bali et al., 2019].
Transformers are a type of neural network that has recently seen a rapid rise in popularity, achieving state-of-the-art performance in natural language processing, image captioning, image synthesis, classification, and audio processing [Lin et al., 2022]. Most relevant to this study are the works exploring how transformer models achieve the current best performance metrics for the synthesis of text and the answering of questions [Devlin and Chang, 2018, Lukovnikov et al., 2019, Radford et al., 2019, Shao et al., 2019].
According to the original paper [Vaswani et al., 2017], the attention values are calculated as the scaled dot product; weights are calculated for each token within the input text as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (1)$$
where $Q$ is the query token, an embedded representation of a word within a sequence, $K$ represents keys, vectors of the sequence of tokens presented to the model, and $V$ are values that are calculated when querying keys. In this study, $Q$, $K$, and $V$ are from the same data source, and therefore the operation is described as self-attention. Each block also contains several attention heads, and thus the approach that this study implements is known as multi-headed self-attention ($MH$). This is simply calculated via the concatenation of $h$ attention heads as follows:

$$MH(Q, K, V) = \mathrm{Concatenate}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O} \qquad (2)$$
The application of multi-headed attention has shown a signi-
cant improvement in ability compared to the conventional approach
It is suggested that a shallower, wider model is more stable during
the training process.
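To ground Equations 1 and 2, the following is a minimal NumPy sketch of scaled dot-product attention and its multi-headed extension; the toy dimensions, random weights, and function names are illustrative assumptions rather than the configuration trained in this study.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Equation 1: softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads):
    # Equation 2: project X to Q, K, V, attend per head,
    # concatenate the heads, and apply the output projection W^O.
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        heads.append(attention(Q[:, s], K[:, s], V[:, s]))
    return np.concatenate(heads, axis=-1) @ W_o

# Toy example: 7 tokens, model width 64, 8 heads (hypothetical sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(7, 64))
W_q, W_k, W_v, W_o = (rng.normal(size=(64, 64)) * 0.1 for _ in range(4))
out = multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads=8)
print(out.shape)  # (7, 64)
```

Because $Q$, $K$, and $V$ are all projections of the same sequence `X`, this is self-attention in the sense described above.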
Fig. 1 shows a diagram of how the model uses embeddings as input and output, with a tokeniser used to transform strings into encodings and vice versa.
3 METHOD
Within this section, the proposed methodology will be discussed,
followed by work on optimisation of chatbots to answer mental
health questions. The general approach of this work can be observed
in Fig. 2; this section details each step of this process.
Initially, data from various sources were collected to form a large dataset. No single modern dataset is viable for large neural language models, given their data requirements for effective generalisation [Sezgin et al., 2022]. Due to this, data from CounselChat¹, the Brain & Behaviour Research Foundation², the NHS³,⁴, Wellness in Mind⁵, and the White Swan Foundation⁶ were selected. Questions and answers are extracted, and questions are manually generated depending on the information available; e.g., for the NHS definition of depression, questions such as "what is depression?" are imputed.
For preprocessing, all texts were converted to lowercase, and punctuation was removed in order to reduce the learning of irrelevant tokens. For example, the tokens "Hello", "hello", "Hello!", and "hello?" would all be treated as separate learnable tokens prior to this step. The vocabulary was then limited to the most common 30,000 tokens to remove uncommon occurrences that cannot be generalised. Following these steps, queries and answers are denoted in the dataset with the markup tags <Q> ... </Q> and <A> ... </A>, which are useful for several purposes: (i) to condition the model on separate types of text, (ii) to present the model with queries, and (iii) to aid in the logic of ending the prediction loop when an answer has been generated.
¹ Available online: https://counselchat.com [Last Accessed: 09/05/2023]
² Available online: https://www.bbrfoundation.org/faq/frequently-asked-questions-about-depression [Last Accessed: 09/05/2023]
³ Available online: https://www.nhs.uk/mental-health/conditions/clinical-depression [Last Accessed: 09/05/2023]
⁴ Available online: https://www.nhs.uk/mental-health/conditions/generalised-anxiety-disorder/overview [Last Accessed: 09/05/2023]
⁵ Available online: https://www.wellnessinmind.org/frequently-asked-questions/ [Last Accessed: 09/05/2023]
⁶ Available online: https://www.whiteswanfoundation.org/mental-health-matters/understanding-mental-health/mental-illness-faqs [Last Accessed: 09/05/2023]
Figure 1: Diagram showing the use of a tokeniser to transform the text. Inputs are encoded and used for inference; output encodings are then transformed back into readable strings.
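As a concrete illustration of the round trip in Fig. 1, the sketch below builds a toy word-level vocabulary and converts strings to integer encodings and back; the vocabulary, special tags, and function names are illustrative assumptions, not the exact tokeniser used in this study.

```python
# Toy word-level tokeniser mirroring Fig. 1 (illustrative assumptions).
corpus = ["<q> what is gad </q>",
          "<a> gad stands for general anxiety disorder </a>"]

# Build the vocabulary from whitespace-split tokens, reserving 0 for unknowns.
vocab = {"<unk>": 0}
for text in corpus:
    for token in text.split():
        vocab.setdefault(token, len(vocab))
inverse_vocab = {i: t for t, i in vocab.items()}

def encode(text):
    # Token -> encoding (t1, t2, t3 in Fig. 1).
    return [vocab.get(token, 0) for token in text.lower().split()]

def decode(ids):
    # Encoding -> token, rebuilding a readable string.
    return " ".join(inverse_vocab.get(i, "<unk>") for i in ids)

ids = encode("<q> what is gad </q>")
print(ids)           # [1, 2, 3, 4, 5]
print(decode(ids))   # "<q> what is gad </q>"
```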
Figure 2: General diagram for the data preprocessing and training process for an optimised conversational chatbot model.
With regard to the preprocessed data, a batch search of model hyperparameters was implemented for the generative transformer model. Starting from a random weight distribution, topologies of {2, 4, 8, 16} attention heads were engineered and attached to one layer of {64, 128, 256, 512} rectified linear units. Shallow networks are produced due to the data requirements of deeper models; although a large dataset was collected, it is relatively close to the minimum requirements of a model following this learning method. In the future, given more data, deeper networks could be explored. Models are trained and compared based on the validation metrics of accuracy and loss, with consideration also given to top-k accuracy for k = 5 and k = 10. Top-k metrics are important for a deeper comparison of similarly performing models, since they provide a further measure of how incorrect a wrong prediction is. For example, two models selecting the correct token half of the time will both score 50% accuracy, but one model's second choice may more often be correct, suggesting that it is on a better track to generalise the data compared to the other.
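A sketch of the 4 x 4 batch search might look like the following; it uses Keras purely for illustration, and the context length, embedding width, pooled next-token head, and optimiser are assumptions, a deliberately simplified stand-in for the generative transformer rather than the authors' exact training code.

```python
import itertools
import tensorflow as tf

VOCAB_SIZE = 30_000   # most common tokens kept after preprocessing
SEQ_LEN = 64          # assumed context length (not stated in the paper)

def build_model(n_heads, n_dense):
    # One multi-headed self-attention block followed by a single layer
    # of rectified linear units, as searched in the experiments.
    inputs = tf.keras.Input(shape=(SEQ_LEN,), dtype="int32")
    x = tf.keras.layers.Embedding(VOCAB_SIZE, 128)(inputs)
    x = tf.keras.layers.MultiHeadAttention(
        num_heads=n_heads, key_dim=128 // n_heads)(x, x)
    x = tf.keras.layers.GlobalAveragePooling1D()(x)
    x = tf.keras.layers.Dense(n_dense, activation="relu")(x)
    outputs = tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=[
            "sparse_categorical_accuracy",
            tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5, name="top5"),
            tf.keras.metrics.SparseTopKCategoricalAccuracy(k=10, name="top10"),
        ],
    )
    return model

# The 4 x 4 batch search over heads and dense units (16 experiments).
results = {}
for n_heads, n_dense in itertools.product([2, 4, 8, 16], [64, 128, 256, 512]):
    model = build_model(n_heads, n_dense)
    # Training data is assumed to exist; uncomment to run an experiment:
    # history = model.fit(x_train, y_train, validation_data=(x_val, y_val))
    # results[(n_heads, n_dense)] = history.history
```

The top-5 and top-10 metrics here correspond directly to the top-k comparison described above: a model is credited whenever the correct token appears among its k highest-probability predictions.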
To conclude the methodology shown in Fig. 2, a general diagram
for the process of interfacing with the chatbot and inferring a
response from the input query is shown in Fig. 3.
Table 1: Loss values for the transformer topology tuning experiments.

Dense Neurons | 2 Heads | 4 Heads | 8 Heads | 16 Heads
64            |   0.64  |   0.56  |   0.47  |   0.91
128           |   0.65  |   0.58  |   0.47  |   1.16
256           |   0.65  |   0.59  |   0.48  |   1.37
512           |   0.64  |   0.59  |   1.42  |   1.72
4 RESULTS AND OBSERVATIONS
In this section, the metrics observed during the topology engineering for the transformer-based chatbots are presented, before exploring some examples of the chatbot's usage after training.

Table 1 and Table 2 show the loss and accuracy metrics for the 16 individual experiments, respectively. Two equally scoring models outperformed all others: eight attention heads succeeded by either 64 or 128 rectified linear units. Both of these models could predict the next token 88.65% of the time. Further to the loss and accuracy metrics, Tables 3 and 4 show the top-k accuracy for k = 5 and k = 10, respectively. Beyond the initial results, these tables show that using 128 neurons in the layer prior to token prediction gives slightly higher results: 96.49% (against 96.41%) and 97.88% (against 97.82%). The 8-headed, 128-neuron model is therefore selected as the best candidate for further exploration.
Figure 3: Diagram of the inference process for the trained chatbot model interface.
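Following Fig. 3, a minimal sketch of the inference loop is shown below; the greedy decoding, the helper names, and the stop condition on the closing </A> tag illustrate the markup-based stopping logic from Section 3 under assumed interfaces for the model and tokeniser.

```python
import numpy as np

def respond(query, model, encode, decode, max_tokens=128):
    """Answer a user query following the Fig. 3 pipeline (illustrative sketch).

    `model` is assumed to map a token-id sequence to next-token
    probabilities; `encode`/`decode` are tokeniser helpers.
    """
    # Preprocess and tag the query as during training.
    text = "".join(c for c in query.lower() if c.isalnum() or c.isspace())
    ids = encode(f"<q> {text.strip()} </q> <a>")

    # Greedily predict tokens until the answer tag closes.
    for _ in range(max_tokens):
        probs = model(ids)               # next-token distribution
        next_id = int(np.argmax(probs))  # greedy decoding
        ids.append(next_id)
        if decode([next_id]) == "</a>":  # stop once the answer is complete
            break

    # Strip the tags and return the generated answer text.
    answer = decode(ids)
    return answer.split("<a>", 1)[-1].replace("</a>", "").strip()
```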
Table 2: Accuracy values (%) for the transformer topology tuning experiments.

Dense Neurons | 2 Heads | 4 Heads | 8 Heads | 16 Heads
64            |  84.13  |  86.23  |  88.65  |  79.50
128           |  83.88  |  85.95  |  88.65  |  74.30
256           |  83.81  |  85.46  |  88.02  |  69.98
512           |  84.01  |  85.44  |  68.90  |  63.86
Table 3: Top-5 accuracy values (%) for the transformer topology tuning experiments.

Dense Neurons | 2 Heads | 4 Heads | 8 Heads | 16 Heads
64            |  94.86  |  95.73  |  96.41  |  90.89
128           |  94.84  |  95.53  |  96.49  |  87.15
256           |  94.78  |  95.43  |  96.34  |  84.26
512           |  94.85  |  95.31  |  87.70  |  78.79
Table 4: Top-10 accuracy values (%) for the transformer topology tuning experiments.

Dense Neurons | 2 Heads | 4 Heads | 8 Heads | 16 Heads
64            |  96.92  |  97.64  |  97.82  |  93.92
128           |  96.94  |  97.36  |  97.88  |  90.92
256           |  96.83  |  97.29  |  97.77  |  88.78
512           |  96.95  |  97.23  |  88.48  |  84.12
To provide examples of interaction with the trained model, with <Q> queries provided by a human and <A> answers output by the model, the following is an excerpt from a conversation about depression (note that all grammar has been removed from the training data to enable generalisation of tokens with identical meaning):

<Q> what is depression </Q>
<A> clinical depression is a serious condition that negatively affects how a person thinks feels and behaves </A>
<Q> what are the symptoms of depression </Q>
<A> the symptoms of depression can be complex and
vary widely between people if youre depressed you
may feel sad hopeless and lose interest in things you
used to enjoy </A>
<Q> how is depression diagnosed </Q>
<A> the rst step to being diagnosed is to visit a doctor
for a medical evaluation certain medications and some
medical conditions such as thyroid disorder can cause
similar symptoms as depression </A>
<Q> are certain people more likely to get depressed
than others </Q>
<A> you may be more vulnerable to depression if you
have certain personality traits such as low selfesteem
or being overly selfcritical this may be because of the
genes youve inherited from your parents your early
life experiences or both </A>
<Q> can you tell me about depression after childbirth
</Q>
<A> sometimes new mothers fathers or partners develop depression after they have a baby this is known
as postnatal depression and its treated in a similar way
to other types of depression with talking therapies
and antidepressant medicines </A>
In addition, the following is an excerpt from a conversation about
anxiety:
<Q> is anxiety normal </Q>
<A> everyone has feelings of anxiety at some point
in their life for example you may feel worried and
anxious about sitting an exam or having a medical
test or job interview during times like these feeling
anxious can be perfectly normal </A>
<Q> if i feel too anxious should i see a doctor </Q>
<A> although feelings of anxiety at certain times are
completely normal you should see a gp if anxiety is
aecting your daily life or causing you distress </A>
<Q> what are the symptoms of anxiety </Q>
<A> gad can cause both psychological mental and
physical symptoms these vary from person to person
but can include feeling restless or worried having
trouble concentrating or sleeping dizziness or heart
palpitations </A>
<Q> what does gad mean </Q>
<A> gad stands for general anxiety disorder </A>
<Q> is there any way i can feel less anxious </Q>
<A> there are many things that can do yourself to
help reduce your anxiety such as going on a selfhelp
course exercising regularly stopping smoking looking
after your physical health </A>
As can be observed from the aforementioned conversations, interaction with the most optimal model leads to examples where queries can be effectively answered and advice given following training from the verified sources. Terms such as GAD (General Anxiety Disorder) are more likely to appear in the outputs since they were abbreviated more often than not within the training data; in this case, it was possible to ask the chatbot to clarify this term. Reducing the number of unique tokens by removing grammar aided in training with a dataset of this size, but results in no grammar being output. In the future, more natural conversation would be enabled either by learning from a grammatically correct dataset or by correcting the chatbot output prior to the response being printed to an interface.
5 CONCLUSION AND FUTURE WORK
In this work, the engineering and applications of transformer-based chatbots are explored to answer questions with a focus on mental health support. Specifically, the focus is on queries surrounding depression and anxiety, drawn from respected and verified sources. To conclude, chatbots have the potential to play a significant role in supporting people affected by mental health stigma. The use of attention mechanisms to build chatbots from transformers, which are large language models, seems to lead to the creation of engaging conversational systems. The results of this study demonstrate the potential of chatbots to provide easily accessible and anonymous support to people who may otherwise be discouraged from seeking help due to stigma. However, with these findings considered, it is also important to acknowledge the limitations and challenges of using chatbots for mental health support. More research from medical and psychological backgrounds is needed to fully understand the limitations of chatbots and to ensure that they are developed and deployed ethically and responsibly.
Alongside future work regarding ethics, there are also limitations to this study that should be explored. Firstly, data availability is a concern; although we collected a large dataset for this study, this laborious process yielded only the minimal amount of data needed to train such models. In the future, more data could be collected and the experiments reimplemented to improve generalisation. Additionally, methods such as transfer learning and data augmentation could be explored as alternatives to alleviate this limitation. To engineer the topologies, we performed a batch search; this could be further improved through metaheuristic hyperparameter optimisation to automate the process. Although this would likely lead to a better model, it would require far more computational resources and time. In addition to future experiments, examples such as the chatbot outputting "GAD" (instead of General Anxiety Disorder) show how the model can be affected when the majority of terms are abbreviated within the training data. In the future, the application may be more informative if abbreviations are replaced with definitions as an added data preprocessing step.
Finally, this study highlights the importance of continued research and development in the field of mental health technology. By exploring the potential of chatbots to provide support to individuals experiencing depression and anxiety, we can work toward creating innovative and effective solutions to promote mental well-being.
REFERENCES
Alaa A Abd-Alrazaq, Mohannad Alajlani, Ali Abdallah Alalwan, Bridgette M Bewick,
Peter Gardner, and Mowafa Househ. 2019. An overview of the features of chatbots
in mental health: A scoping review. International Journal of Medical Informatics
132 (2019), 103978.
Alaa A Abd-Alrazaq, Mohannad Alajlani, Nashva Ali, Kerstin Denecke, Bridgette M
Bewick, and Mowafa Househ. 2021. Perceptions and opinions of patients about
mental health chatbots: scoping review. Journal of medical Internet research 23, 1
(2021), e17828.
Ebtesam Hussain Almansor, Farookh Khadeer Hussain, and Omar Khadeer Hussain.
2021. Supervised ensemble sentiment-based framework to measure chatbot quality
of services. Computing 103 (2021), 491–507.
Manish Bali, Samahit Mohanty, Subarna Chatterjee, Manash Sarma, and Rajesh Puravankara. 2019. Diabot: a predictive medical chatbot using ensemble learning. International Journal of Recent Technology and Engineering 8, 2 (2019), 6334–6340.
Himanshu Bansal and Rizwan Khan. 2018. A review paper on human computer
interaction. Int. J. Adv. Res. Comput. Sci. Softw. Eng 8, 4 (2018), 53.
Anushka Bhagchandani and Aryan Nayak. 2022. Deep Learning Based Chatbot Framework for Mental Health Therapy. In Advances in Data and Information Sciences: Proceedings of ICDIS 2021. Springer, 271–281.
Jordan J Bird, Anikó Ekárt, and Diego R Faria. 2021. Chatbot Interaction with Articial
Intelligence: human data augmentation with T5 and language transformer ensemble
for text classication. Journal of Ambient Intelligence and Humanized Computing
(2021), 1–16.
Reuben Crasto, Lance Dias, Dominic Miranda, and Deepali Kayande. 2021. CareBot: A Mental Health ChatBot. In 2021 2nd International Conference for Emerging Technology (INCET). IEEE, 1–5.
Heriberto Cuayáhuitl, Donghyeon Lee, Seonghan Ryu, Yongjin Cho, Sungja Choi,
Satish Indurthi, Seunghak Yu, Hyungtak Choi, Inchul Hwang, and Jihie Kim. 2019.
Ensemble-based deep reinforcement learning for chatbots. Neurocomputing 366
(2019), 118–130.
Saahil Deshpande and Jim Warren. 2021. Self-Harm Detection for Mental Health Chatbots. In MIE. 48–52.
Jacob Devlin and Ming-Wei Chang. 2018. Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing. Google AI Blog. [Online] Available from: https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html [Accessed 4 December 2019].
Richard G Frank and Sherry A Glied. 2006. Better but not well: Mental health policy in
the United States since 1950. JHU Press.
Terry Hanley and Claire Wyatt. 2021. A systematic review of higher education students' experiences of engaging with online therapy. Counselling and Psychotherapy Research 21, 3 (2021), 522–534.
Zheng Ping Jiang, Sarah Ita Levitan, Jonathan Zomick, and Julia Hirschberg. 2020.
Detection of mental health from reddit via deep contextualized representations. In
Proceedings of the 11th International Workshop on Health Text Mining and Information
Analysis. 147–156.
Chaitanya Joglekar. 2022. WOzBot: A Wizard of Oz Based Method for Chatbot Response
Improvement. Master’s thesis. Trinity College Dublin.
Edgar Jones and Simon Wessely. 2005. Shell shock to PTSD: Military psychiatry from
1900 to the Gulf War. Psychology Press.
Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. 2022. A survey of
transformers. AI Open (2022).
Jianfeng Liu, Feiyang Pan, and Ling Luo. 2020. Gochat: Goal-oriented chatbots with
hierarchical reinforcement learning. In Proceedings of the 43rd International ACM
SIGIR Conference on Research and Development in Information Retrieval. 1793–1796.
Denis Lukovnikov, Asja Fischer, and Jens Lehmann. 2019. Pretrained transformers for
simple question answering over knowledge graphs. In International Semantic Web
Conference. Springer, 470–486.
Kolla Bhanu Prakash, Y Nagapawan, N Lakshmi Kalyani, and V Pradeep Kumar. 2020.
Chatterbot implementation using transfer learning and LSTM encoder-decoder
architecture. International Journal 8, 5 (2020).
Alec Radford, Je Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever.
2019. Language Models are Unsupervised Multitask Learners. (2019).
Emre Sezgin, Joseph Sirrianni, and Simon L Linwood. 2022. Operationalizing and implementing pretrained, large artificial intelligence linguistic models in the US health care system: outlook of generative pretrained transformer 3 (GPT-3) as a service model. JMIR Medical Informatics 10, 2 (2022), e32875.
Taihua Shao, Yupu Guo, Honghui Chen, and Zepeng Hao. 2019. Transformer-based
neural network for answer selection in question answering. IEEE Access 7 (2019),
26146–26156.
Amy E Sickel, Jason D Seacat, and Nina A Nabors. 2014. Mental health stigma update:
A review of consequences. Advances in Mental Health 12, 3 (2014), 202–215.
Sinarwati Mohamad Suhaili, Naomie Salim, and Mohamad Nazim Jambli. 2021. Service chatbots: A systematic review. Expert Systems with Applications 184 (2021), 115461.
Zeeshan Haque Syed, Asma Trabelsi, Emmanuel Helbert, Vincent Bailleau, and Christian Muths. 2021. Question answering chatbot for troubleshooting queries based on transfer learning. Procedia Computer Science 192 (2021), 941–950.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In
Advances in neural information processing systems. 5998–6008.
WHO. 2021. Depression. https://www.who.int/news-room/fact-sheets/detail/
depression
... Those domains that stand out in the literature relate to health and education. In the area of health, LLM application areas include public health [44], health care [52], [53], mental health [54], dentistry [55], biomedical research [56], surgery [57], in vitro fertilization [53], pediatrics [58], and clinical practice [42]. In education, studies covered academia, in general [11], [59], [60], while some focused on higher education and research [1], engineering education [61], standardized admissions test [43], creative writing [62], language learning [63], and peer feedback [64]. ...
... Other concerns that have been raised by literature on the role of AI research and development side of using LLMs on the political economy includes: environmental footprints, and future social effects [50] towards ethical and responsible development and deployment [54], such as the commodification of LLMs as evidenced by paid API access and usage, and the continuous divide between those with and without access to computing resources [50]. Although the AI divide becomes wider, LLM applications are also perceived as enablers towards inclusivity of particular sectors of society such as an e-mail writing assistant for adults with dyslexia [86]. ...
Article
Full-text available
This paper conducted a systematic review of Scopus-indexed publications on large language models (LLMs) and natural language processing (NLP) extracted in October 2023 to address the dearth of literature on their opportunities and challenges. Through bibliometric analysis, from the 1,600 relevant documents, the study explored research productivity, revealing both opportunities and challenges spanning research and real-world applications in education, medicine, and health care, citations, and keyword co-occurrence networks. Results highlighted distribution patterns and dominant players like Google LLC and Stanford University. Opportunities such as technological development in generative artificial intelligence (AI), were contrasted with challenges such as biases and ethical concerns. The intellectual structure analysis revealed prominent application areas in health and education and also emphasized issues such as AI divide and human-AI partnership. Improvement on the technology performance of LLM and NLP remains to be a challenge. Recommendations include further exploration of open research problems and bibliometric studies using other research databases given the research bias towards Scopus-indexed English publications.
... Regarding cancer myths and misconceptions, 97% of expert reviews deemed answers from ChatGPT to be accurate [56]. In addition, Bird and Lotfi optimized a chatbot that could answer mental health-related questions with an accuracy of 89% [57]. Overall, LLMs, particularly ChatGPT, demonstrate an impressive performance in public education in health. ...
... Of greater concern is data availability. Healthcare institutions have shared no identifiable health information with widely accessible LLMs like ChatGPT due to privacy concerns and legal compliances [7] and it is arduous to collect new data for LLM training [57]. ChatGPT, for example, was not trained on patients' clinical data [4]. ...
Preprint
Full-text available
Background: The launch of the Chat Generative Pre-trained Transformer (ChatGPT) in November 2022 has attracted public attention and academic interest to large language models (LLMs), facilitating the emergence of many other innovative LLMs. These LLMs have been applied in various fields, including healthcare. Numerous studies have since been conducted regarding how to employ state-of-the-art LLMs in health-related scenarios to assist patients, doctors, and public health administrators. Objective: This review aims to summarize the applications and concerns of applying conversational LLMs in healthcare and provide an agenda for future research on LLMs in healthcare. Methods: We utilized PubMed, ACM, and IEEE digital libraries as primary sources for this review. We followed the guidance of Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRIMSA) to screen and select peer-reviewed research articles that (1) were related to both healthcare applications and conversational LLMs and (2) were published before September 1st, 2023, the date when we started paper collection and screening. We investigated these papers and classified them according to their applications and concerns. Results: Our search initially identified 820 papers according to targeted keywords, out of which 65 papers met our criteria and were included in the review. The most popular conversational LLM was ChatGPT from OpenAI (60), followed by Bard from Google (1), Large Language Model Meta AI (LLaMA) from Meta (1), and other LLMs (5). These papers were classified into four categories in terms of their applications: 1) summarization, 2) medical knowledge inquiry, 3) prediction, and 4) administration, and four categories of concerns: 1) reliability, 2) bias, 3) privacy, and 4) public acceptability. There are 49 (75%) research papers using LLMs for summarization and/or medical knowledge inquiry, and 58 (89%) research papers expressing concerns about reliability and/or bias. We found that conversational LLMs exhibit promising results in summarization and providing medical knowledge to patients with a relatively high accuracy. However, conversational LLMs like ChatGPT are not able to provide reliable answers to complex health-related tasks that require specialized domain expertise. Additionally, no experiments in our reviewed papers have been conducted to thoughtfully examine how conversational LLMs lead to bias or privacy issues in healthcare research. Conclusions: Future studies should focus on improving the reliability of LLM applications in complex health-related tasks, as well as investigating the mechanisms of how LLM applications brought bias and privacy issues. Considering the vast accessibility of LLMs, legal, social, and technical efforts are all needed to address concerns about LLMs to promote, improve, and regularize the application of LLMs in healthcare.
... Além disso, os pacientes podem sentir-se mais confortáveis compartilhando informações negativas ou sensíveis com a IA em vez de um humano, o que pode ser benéfico no processo terapêutico.Outrossim, a utilização de uma interface entre pacientes deprimidos e profissionais de saúde mental por meio de um chatbot é capaz de simular uma conversa natural e oferecer suporte por meio de entrada e saída de texto. Essa abordagem pode contribuir para a redução do estigma associado à busca de ajuda para questões de saúde mental, proporcionando uma forma mais acessível e envolvente de suporte.(15) De maneira similar, o Terabot, sistema de diálogo que emprega LLMs para facilitar a interação entre pacientes deprimidos e profissionais de saúde mental interage com os pacientes, auxiliando-os a acalmar suas emoções acirradas e incentivando a prática de exercícios relaxantes. ...
Article
Full-text available
Objetivo: Este estudo revisa o uso de Modelos de Linguagem de Grande Escala (LLMs) na área da saúde mental, focando especificamente no tratamento da depressão. Método: Foram analisados 18 artigos de um total inicial de 121, explorando como os LLMs auxiliam na tomada de decisões clínicas e na interação entre profissionais de saúde mental e pacientes deprimidos. Resultados: Os resultados principais mostram que os LLMs podem aumentar a precisão na detecção de sintomas e melhorar as intervenções terapêuticas por meio de interfaces conversacionais avançadas. Conclusão: O resumo aponta para lacunas na pesquisa existente e ressalta a contribuição do estudo para uma melhor compreensão da aplicabilidade dos LLMs em contextos clínicos.
... The findings of that study suggest 77% accuracy in recognising depression from natural language using Hierarchical Attention Networks and Long-sequence Transformer models. In [2], a self-attention transformer architecture was shown to be capable of predicting tokens in a sequence with 88.65% top-1 and 96.49% top-5 accuracy, given a dataset of mental health support questions and answers. Regarding the use of AI in interventions, Fitzpatrick et al. [9] note that conversational agents appear feasible, engaging and effective when applied to the delivery of cognitive behavioural therapy (CBT). ...
Conference Paper
Full-text available
Technological intervention to support care areas that some people may not have access to is of paramount importance to promote sustainable development of good health and wellbeing. This study aims to explore the linguistic similarities and differences between human professionals and Generative Artificial Intelligence (AI) conversational agents in therapeutic dialogues. Initially, the MISTRAL-7B Large Language Model (LLM) is instructed to generate responses to patient queries to form a synthetic equivalent to a publicly available psychology dataset. A large set of linguistic features (e.g., text metrics, lexical diversity and richness, readability scores, sentiment, emotions, and named entities) is extracted and studied from both the expert and synthetically-generated text. The results suggest a significantly richer vocabulary in humans than the LLM approach. Similarly, the use of sentiment was significantly different between the two, suggesting a difference in the supportive or objective language used and that synthetic linguistic expressions of emotion may differ from those expressed by an intelligent being. However, no statistical significance was observed between human professionals and AI in the use of function words, pronouns and several named entities; possibly reflecting an increased proficiency of LLMs in modelling some language patterns, even in a specialised context (i.e., therapy). However, current findings do not support the similarity in sentimental nuance and emotional expression, which limits the effectiveness of contemporary LLMs as standalone agents. Further development is needed towards clinically validated algorithms.
... Despite these advancements, the application of generative models in the behavioral health domain remains underexplored. Recent studies illustrate the application of large language models (LLMs) in augmenting empathy in online peer support platforms 23 , supporting behavioral health services 24 , building GAI integrated chatbot to support patients 25,26 , and addressing behavioral health information seeking activities 27 . Of the many proposed uses of GAI in behavioral health 28 , image generation has yet to be explored 29 . ...
Article
Full-text available
There have been considerable advancements in artificial intelligence (AI), specifically with generative AI (GAI) models. GAI is a class of algorithms designed to create new data, such as text, images, and audio, that resembles the data on which they have been trained. These models have been recently investigated in medicine, yet the opportunity and utility of GAI in behavioral health are relatively underexplored. In this commentary, we explore the potential uses of GAI in the field of behavioral health, specifically focusing on image generation. We propose the application of GAI for creating personalized and contextually relevant therapeutic interventions and emphasize the need to integrate human feedback into the AI-assisted therapeutics and decision-making process. We report the use of GAI with a case study of behavioral therapy on emotional recognition and management with a three-step process. We illustrate image generation-specific GAI to recognize, express, and manage emotions, featuring personalized content and interactive experiences. Furthermore, we highlighted limitations, challenges, and considerations, including the elements of human emotions, the need for human-AI collaboration, transparency and accountability, potential bias, security, privacy and ethical issues, and operational considerations. Our commentary serves as a guide for practitioners and developers to envision the future of behavioral therapies and consider the benefits and limitations of GAI in improving behavioral health practices and patient outcomes.
... With a deliberate integration, the models wound up providing utility as an assembly line of sorts-each singularly focused yet part of an integrated machine-aiming for precision and efficiency in depression detection. Bird and Lotfi, (2023) [5], investigate for people's mental health issuesanxiety or depression to-be-treatedand pondered the efficiency of high-level model chatbots. Chatbots were specifically constructed. ...
Research
Full-text available
Exchanging conversations for increasing mindfulness and combatting loneliness helps reduce and prevent mental health problems. Anxiety, depression, and stress among other mental health disorders are growing in today's world that is driven by speed and changes, severely affecting the lives and well-being of people across all ages and diversities. The present research aims to design a virtual companion which resolves problems with an inclusive approach, is aimed at looking at a problem through different perspectives to get a holistic understanding, using the best natural language processing models. This is completed by utilizing modern text generation models like DialoGPT and T5 Transformers. The model works by using Prompt engineering, Retrieval Augmented Generation (RAG), and fine-tuning techniques on an augmented dataset, text-to-speech engines such as Speech T5 along with Meta's Massively Multilingual Speech (MMS), that can produce naturalistic computer speech in multiple languages with speed that carries the nuances of human communication. The goal is to offer an empathetic, friendly solution that meets the mental health needs of people with varied diversities. Using the unique blend of AI and concern, this research tries to develop an innovative tool for empathic care and support people to be resilient in the face of life's trials, by generating real-world based solutions and responses, including humor as per the user's receptivity to build rapport and indulge the user in engaging conversations and journalling.
Article
Full-text available
Background The launch of ChatGPT (OpenAI) in November 2022 attracted public attention and academic interest to large language models (LLMs), facilitating the emergence of many other innovative LLMs. These LLMs have been applied in various fields, including health care. Numerous studies have since been conducted regarding how to use state-of-the-art LLMs in health-related scenarios. Objective This review aims to summarize applications of and concerns regarding conversational LLMs in health care and provide an agenda for future research in this field. Methods We used PubMed, ACM, and the IEEE digital libraries as primary sources for this review. We followed the guidance of PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) to screen and select peer-reviewed research articles that (1) were related to health care applications and conversational LLMs and (2) were published before September 1, 2023, the date when we started paper collection. We investigated these papers and classified them according to their applications and concerns. Results Our search initially identified 820 papers according to targeted keywords, out of which 65 (7.9%) papers met our criteria and were included in the review. The most popular conversational LLM was ChatGPT (60/65, 92% of papers), followed by Bard (Google LLC; 1/65, 2% of papers), LLaMA (Meta; 1/65, 2% of papers), and other LLMs (6/65, 9% papers). These papers were classified into four categories of applications: (1) summarization, (2) medical knowledge inquiry, (3) prediction (eg, diagnosis, treatment recommendation, and drug synergy), and (4) administration (eg, documentation and information collection), and four categories of concerns: (1) reliability (eg, training data quality, accuracy, interpretability, and consistency in responses), (2) bias, (3) privacy, and (4) public acceptability. There were 49 (75%) papers using LLMs for either summarization or medical knowledge inquiry, or both, and there are 58 (89%) papers expressing concerns about either reliability or bias, or both. We found that conversational LLMs exhibited promising results in summarization and providing general medical knowledge to patients with a relatively high accuracy. However, conversational LLMs such as ChatGPT are not always able to provide reliable answers to complex health-related tasks (eg, diagnosis) that require specialized domain expertise. While bias or privacy issues are often noted as concerns, no experiments in our reviewed papers thoughtfully examined how conversational LLMs lead to these issues in health care research. Conclusions Future studies should focus on improving the reliability of LLM applications in complex health-related tasks, as well as investigating the mechanisms of how LLM applications bring bias and privacy issues. Considering the vast accessibility of LLMs, legal, social, and technical efforts are all needed to address concerns about LLMs to promote, improve, and regularize the application of LLMs in health care.
Article
Full-text available
Generative pretrained transformer models have been popular recently due to their enhanced capabilities and performance. In contrast to many existing artificial intelligence models, generative pretrained transformer models can perform with very limited training data. Generative pretrained transformer 3 (GPT-3) is one of the latest releases in this pipeline, demonstrating human-like logical and intellectual responses to prompts. Some examples include writing essays, answering complex questions, matching pronouns to their nouns, and conducting sentiment analyses. However, questions remain with regard to its implementation in health care, specifically in terms of operationalization and its use in clinical practice and research. In this viewpoint paper, we briefly introduce GPT-3 and its capabilities and outline considerations for its implementation and operationalization in clinical practice through a use case. The implementation considerations include (1) processing needs and information systems infrastructure, (2) operating costs, (3) model biases, and (4) evaluation metrics. In addition, we outline the following three major operational factors that drive the adoption of GPT-3 in the US health care system: (1) ensuring Health Insurance Portability and Accountability Act compliance, (2) building trust with health care providers, and (3) establishing broader access to the GPT-3 tools. This viewpoint can inform health care practitioners, developers, clinicians, and decision makers toward understanding the use of the powerful artificial intelligence tools integrated into hospital systems and health care.
Article
Full-text available
In this work we present the Chatbot Interaction with Artificial Intelligence (CI-AI) framework as an approach to the training of a transformer based chatbot-like architecture for task classification with a focus on natural human interaction with a machine as opposed to interfaces, code, or formal commands. The intelligent system augments human-sourced data via artificial paraphrasing in order to generate a large set of training data for further classical, attention, and language transformation-based learning approaches for Natural Language Processing (NLP). Human beings are asked to paraphrase commands and questions for task identification for further execution of algorithms as skills. The commands and questions are split into training and validation sets. A total of 483 responses were recorded. Secondly, the training set is paraphrased by the T5 model in order to augment it with further data. Seven state-of-the-art transformer-based text classification algorithms (BERT, DistilBERT, RoBERTa, DistilRoBERTa, XLM, XLM-RoBERTa, and XLNet) are benchmarked for both sets after fine-tuning on the training data for two epochs. We find that all models are improved when training data is augmented by the T5 model, with an average increase of classification accuracy by 4.01%. The best result was the RoBERTa model trained on T5 augmented data which achieved 98.96% classification accuracy. Finally, we found that an ensemble of the five best-performing transformer models via Logistic Regression of output label predictions led to an accuracy of 99.59% on the dataset of human responses. A highly-performing model allows the intelligent system to interpret human commands at the social-interaction level through a chatbot-like interface (e.g. “Robot, can we have a conversation?”) and allows for better accessibility to AI by non-technical users.
Chapter
Full-text available
Chatbots potentially address deficits in availability of the traditional health workforce and could help to stem concerning rates of youth mental health issues including high suicide rates. While chatbots have shown some positive results in helping people cope with mental health issues, there are yet deep concerns regarding such chatbots in terms of their ability to identify emergency situations and act accordingly. Risk of suicide/self-harm is one such concern which we have addressed in this project. A chatbot decides its response based on the text input from the user and must correctly recognize the significance of a given input. We have designed a self-harm classifier which could use the user’s response to the chatbot and predict whether the response indicates intent for self-harm. With the difficulty to access confidential counselling data, we looked for alternate data sources and found Twitter and Reddit to provide data similar to what we would expect to get from a chatbot user. We trained a sentiment analysis classifier on Twitter data and a self-harm classifier on the Reddit data. We combined the results of the two models to improve the model performance. We got the best results from a LSTM-RNN classifier using BERT encoding. The best model accuracy achieved was 92.13%. We tested the model on new data from Reddit and got an impressive result with an accuracy of 97%. Such a model is promising for future embedding in mental health chatbots to improve their safety through accurate detection of self-harm talk by users.
Article
Full-text available
Aim The prevalence of mental health difficulties and the demand for psychological support for students in higher education (HE) appear to be increasing. Online therapy is a widely accessible resource that could provide effective support; however, little is known about such provision. The aim of this study was therefore to answer the research question ‘What factors serve to influence higher education students' levels of engagement with online therapy?’ Method A systematic review of qualitative scholarly and peer‐reviewed literature was conducted across 10 databases. Six papers met the inclusion criteria, were assessed for quality and were analysed using thematic synthesis. Findings Factors that serve to motivate HE students to engage with online therapy included the perception that it might enhance the quality of the therapeutic relationship, that it would facilitate more autonomy in the work, and that it might enable them to be anonymous and avoid face‐to‐face contact. In contrast, demotivating factors were primarily practical in nature. Fitting therapeutic work into their busy lives, technological challenges and persisting mental health stigma proved important factors. Conclusion This review synthesises the reasons why HE students might engage with or withdraw from online therapy. It highlights that students appear to view online therapy positively, but they can be inhibited by both personal and practical issues. Therapeutic services therefore need to ensure that information about the work they offer online is clear and transparent and that the platforms they work on are secure and stable. Finally, the need for further research, to keep abreast of technological developments, is recommended.
Conference Paper
Full-text available
We address the problem of automatic detection of psychiatric disorders from the linguistic content of social media posts. We build a large scale dataset of Reddit posts from users with eight disorders and a control user group. We extract and analyze linguistic characteristics of posts and identify differences between diagnostic groups. We build strong classification models based on deep contextualized word representations and show that they out-perform previously applied statistical models with simple linguistic features by large margins. We compare user-level and post-level classification performance, as well as an ensembled multiclass model.
Article
Full-text available
Developing an intelligent chatbot has evolved in the last few years to become a trending topic in the area of computer science. However, a chatbot often fails to understand the user’s intent, which can lead to the generation of inappropriate responses that cause dialogue breakdown and user dissatisfaction. Detecting the dialogue breakdown is essential to improve the performance of the chatbot and increase user satisfaction. Recent approaches have focused on modeling conversation breakdown using serveral approaches, including supervised and unsupervised approaches. Unsupervised approach relay heavy datasets, which make it challenging to apply it to the breakdown task. Another challenge facing predicting breakdown in conversation is the bias of human annotation for the dataset and the handling process for the breakdown. To tackle this challenge, we have developed a supervised ensemble automated approach that measures Chatbot Quality of Service (CQoS) based on dialogue breakdown. The proposed approach is able to label the datasets based on sentiment considering the context of the conversion to predict the breakdown. In this paper we aim to detect the affect of sentiment change of each speaker in a conversation. Furthermore, we use the supervised ensemble model to measure the CQoS based on breakdown. Then we handle this problem by using a hand-over mechanism that transfers the user to a live agent. Based on this idea, we perform several experiments across several datasets and state-of-the-art models, and we find that using sentiment as a trigger for breakdown outperforms human annotation. Overall, we infer that knowledge acquired from the supervised ensemble model can indeed help to measure CQoS based on detecting the breakdown in conversation.
Chapter
With the rising popularity of chatbots, the research on their underlying technology has expanded to provide increased support to the users. One such sphere has been mental health support. As we train chatbots to better understand human emotions, we can also employ them to assist users in dealing with their emotions and improving their mental well-being. This paper presents a novel approach toward building a chatbot framework that can converse with the users and also provide therapeutic advice based on assessment of the user’s mood. The framework employs sentiment analysis for analyzing the user behavior which classifies the use of our chatbot architecture. Depending on the classification, the framework present two trained chatbot model based on self-attention mechanism to engage user in generic or therapy based conversations. Hence, the framework is designed with an emphasis on using natural language processing and machine learning techniques to ameliorate the onset of mental health disorders.
Article
Open-Domain Question Answering (ODQA) is a technique for finding an answer to a given query from a large set of documents. In this paper, we present an experimentation study to compare ODQA candidate solutions in the context of troubleshooting documents. We mainly focus on a well known open-source framework which is called Haystack. This framework comprises two key components which are the Retriever and the Reader. The Haystack Framework comes with several Retriever-Reader combinations and the choice of the best one is still unanswered till now. In this paper, we conduct an experimentation study to compare different Retriever-Reader combinations. Our aim is to come up with the best combination of components in regard to the speed and the processing power within the context of troubleshooting queries.
Article
Chatbots or Conversational agents are the next significant technological leap in the field of conversational services, that is, enabling a device to communicate with a user upon receiving user requests in natural language. The device uses artificial intelligence and machine learning to respond to the user with automated responses. While this is a relatively new area of study, the application of this concept has increased substantially over the last few years. The technology is no longer limited to merely emulating human conversation but is also being increasingly used to answer questions, either in academic environments or in commercial uses, such as situations requiring assistants to seek reasons for customer dissatisfaction or recommending products and services. The primary purpose of this literature review is to identify and study the existing literature on cutting-edge technology in developing chatbots in terms of research trends, their components and techniques, datasets and domains used, as well as evaluation metrics most used between 2011 and 2020. Using the standard SLR guidelines designed by Kitchenham, this work adopts a systematic literature review approach and utilizes five prestigious scientific databases for identifying, extracting, and analyzing all relevant publications during the search. The related publications were filtered based on inclusion/exclusion criteria and quality assessment to obtain the final review paper. The results of the review indicate that the exploitation of deep learning and reinforcement learning architecture is the most used technique to understand users’ requests and to generate appropriate responses. Besides, we also found that the Twitter dataset (open domain) is the most popular dataset used for evaluation, followed by Airline Travel Information Systems (ATIS) (close domain) and Ubuntu Dialog Corpora (technical support) datasets. The SLR review also indicates that the open domain provided by the Twitter dataset, airline and technical support are the most common domains for chatbots. Moreover, the metrics utilized most often for evaluating chatbot performance (in descending order of popularity) were found to be accuracy, F1-Score, BLEU (Bilingual Evaluation Understudy), recall, human-evaluation, and precision.