Challenges in Chatbot Development:
A Study of Stack Overflow Posts
Ahmad Abdellatif*, Diego Costa*, Khaled Badran*, Rabe Abdalkareem**, Emad Shihab*
*Data-driven Analysis of Software (DAS) Lab
Concordia University, Montreal, Canada
{a_bdella,d_damasc,k_badran,eshihab}@encs.concordia.ca
**Software Analysis and Intelligence Lab (SAIL)
Queen’s University, Kingston, Canada
abdrabe@gmail.com
ABSTRACT
Chatbots are becoming increasingly popular due to their benets in
saving costs, time, and eort. This is due to the fact that they allow
users to communicate and control dierent services easily through
natural language. Chatbot development requires special expertise
(e.g., machine learning and conversation design) that dier from the
development of traditional software systems. At the same time, the
challenges that chatbot developers face remain mostly unknown
since most of the existing studies focus on proposing chatbots to
perform particular tasks rather than their development.
Therefore, in this paper, we examine the Q&A website, Stack
Overow, to provide insights on the topics that chatbot develop-
ers are interested and the challenges they face. In particular, we
leverage topic modeling to understand the topics that are being
discussed by chatbot developers on Stack Overow. Then, we exam-
ine the popularity and diculty of those topics. Our results show
that most of the chatbot developers are using Stack Overow to
ask about implementation guidelines. We determine 12 topics that
developers discuss (e.g., Model Training) that fall into ve main
categories. Most of the posts belong to chatbot development, in-
tegration, and the natural language understanding (NLU) model
categories. On the other hand, we nd that developers consider
the posts of building and integrating chatbots topics more helpful
compared to other topics. Specically, developers face challenges
in the training of the chatbot’s model. We believe that our study
guides future research to propose techniques and tools to help the
community at its early stages to overcome the most popular and
dicult topics that practitioners face when developing chatbots.
ACM Reference Format:
Ahmad Abdellatif*, Diego Costa*, Khaled Badran*, Rabe Abdalkareem**, Emad Shihab*. 2020. Challenges in Chatbot Development: A Study of Stack Overflow Posts. In 17th International Conference on Mining Software Repositories (MSR '20), October 5–6, 2020, Seoul, Republic of Korea. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3379597.3387472
1 INTRODUCTION
More than 50 years after Weizenbaum introduced the first computer program to have a conversation with humans [68], chatbots have become the main conduit between humans and services [58]. Propelled by the recent advances in artificial intelligence and natural language processing [33], chatbots are the primary interface in a variety of services, from smart homes [10, 65] and personal assistants [8, 29] to health care [18] and e-commerce [59]. Given how chatbots reduce the operational costs of services, the usage of chatbots will only increase: experts predict that 85% of users' interactions with services will be done through chatbots by 2021 [41].
Due to their importance and popularity, developing and maintaining chatbots is becoming more important. In addition, the development of chatbots requires expertise in specialized areas, such as machine learning and natural language processing, which distinguishes it from traditional software development [19]. While recently introduced chatbot frameworks (e.g., the Microsoft Bot Framework [39]) have reduced the barrier to entry for creating chatbots, e.g., by providing the components for user interaction and natural language understanding platforms, little is known about the specific challenges that chatbot developers face when developing chatbots. Understanding such challenges is of paramount importance, helping the research community provide more effective tools for chatbot development, improving chatbot quality, and ultimately increasing chatbots' adoption and usefulness among users.
In this paper, we provide the first attempt at understanding the challenges of chatbot development by investigating what chatbot developers are asking about on Stack Overflow. We study Stack Overflow since it is the most prominent code-centric Q&A website and is used constantly by the development community to communicate challenges and issues, provide solutions, and foster discussions about all aspects of software development [2, 52]. Our investigation dives into the chatbot-related posts on Stack Overflow to pinpoint the major topics surrounding the discussions on chatbot development. We use well-known topic modeling techniques to group the posts into cohesive topics and apply a series of quantitative analyses, both through metrics and manual analysis. Specifically, our work investigates the following research questions:
RQ1: What topics are chatbot developers asking about?
We find that chatbot developers ask about 12 main topics that can be grouped into 5 main categories. The categories are related to chatbot integration, development, natural language understanding (NLU), user interaction, and user input. The most popular questions include those related to chatbot creation, integration, and user interface.
RQ2: What types of questions are chatbot developers asking?
Chatbot developers use Stack Overflow primarily as a source of guidance for specific implementation routines, working examples, and troubleshooting. This shows a need for better documentation that provides real-world scenarios and more information about the NLU models used by chatbots.
RQ3: Which topics are the most difficult to answer?
The most difficult topics are related to training the chatbot NLU models. On the other hand, posts related to traditional software development, e.g., chatbot development frameworks, are answered more frequently, although we did not find any statistically significant correlation between the popularity and difficulty of the chatbot topics in our study.
In addition to identifying the chatbot topics on Stack Overflow, we discuss the evolution of the chatbot topics over time and find that chatbot-related discussions have increased substantially since 2016. The activity of some categories is linked to the releases of chatbot platforms. Also, we compare the chatbot topics to other mature SE fields (e.g., mobile and security) in terms of popularity and difficulty. Our results show that the chatbot community needs more effort to reach the maturity level of similar SE fields. Our findings show that platform owners need to improve their current documentation and integration with popular third parties. Moreover, we believe that our study guides future research to focus on the most popular and challenging chatbot topics.
Paper Organization. The rest of the paper is organized as follows. Section 2 describes our methodology. Section 3 reports our empirical study results. Section 4 discusses our results and the implications of our findings. Section 5 presents the work related to our study. Section 6 discusses the threats to validity, and Section 7 concludes the paper.
2 METHODOLOGY
The main goal of our study is to examine what chatbot developers are asking about. To achieve this goal, we analyze the developers' discussions on Stack Overflow, as it provides a rich dataset and has been used by similar investigations in other domains, such as concurrency [6], cryptography APIs [42], and deep learning [31]. While providing structured data with questions, answers, and their respective metadata (e.g., accepted answers), Stack Overflow does not contain any fine-grained topic information related to chatbots. Hence, we first need to identify the posts from Stack Overflow that are related to chatbots, group them according to their dominant topic, and then conduct our analysis. As Figure 1 shows, we perform the selection of chatbot-related posts in a five-step methodology, which we detail in the remainder of this section.
Step 1: Download & extract Stack Overflow dump. We download the entire Stack Overflow dump (last updated 4 September 2019) [23], containing user questions, answers, and the metadata of the posts (e.g., view count, creation date) for the period between August 2008 and September 2019. The initial dataset contains approximately 46 million questions and 75 million answer posts.
Step 2: Identify chatbot tags. Stack Overflow holds posts on a myriad of different software development topics (e.g., Java, security, and blockchain). Posts are typically tagged by their authors with commonly used tags (e.g., chatbot, web) to improve the posts' visibility and chances of being answered [13]. To identify the most relevant chatbot-related tags, we follow the approach used by prior work [11, 54] and create a tag set using the following procedure. First, we retrieve all posts with the 'chatbot' tag, yielding a set of 2,116 posts. We refrain from adding any other tags in this initial step to reduce the chances of introducing noise, as this set will be used to identify other chatbot-related tags. Second, we extract all the tags that co-exist with the 'chatbot' tag from the chatbot-tagged posts. Next, we use two heuristic metrics used in prior work to obtain a bigger set of chatbot-related tags [54, 67]. The first metric is the tag relevance threshold (TRT), a measure of how related a specific tag is to the chatbot-tagged posts. This measure calculates the ratio of the chatbot-related posts (posts that include the 'chatbot' tag) for a specific tag to the total number of posts for that tag. Specifically, the TRT is measured as:

TRT_tag = (no. of chatbot posts for the tag) / (total no. of posts for the tag)

For example, 'rasa' is a tag with a TRT of 21.2%, which means that 21.2% of the posts tagged with 'rasa' are also tagged with 'chatbot'. By using the TRT we are able to eliminate the irrelevant tags from our set.
However, some tags that have a small number of posts (e.g., the 'botlibre' tag has only 3 posts) can have a high TRT (33.3%) because a single one of their posts is chatbot-related, and this may introduce insignificant tags. Therefore, we use a second metric, the tag significance threshold (TST), which is a measure of how prominent a specific tag is in the chatbot-tagged posts [54, 67]. This metric uses the total number of chatbot posts for that tag and the total number of chatbot posts for the most popular tag (the 'chatbot' tag, with 2,116 posts):

TST_tag = (no. of chatbot posts for the tag) / (no. of chatbot posts for the most popular tag)

For example, the 'rasa' tag has a TST of 0.3%, which means that the number of posts tagged with both 'rasa' and 'chatbot' is equal to 0.3% of the total number of chatbot-related posts for the 'chatbot' tag.
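For concreteness, the sketch below shows one way the two thresholds could be computed. It is a minimal illustration, not the authors' actual script; the `posts` argument is a hypothetical iterable of per-question tag lists extracted from the dump.

from collections import Counter

def tag_thresholds(posts, anchor="chatbot"):
    """Compute TRT and TST for every tag that co-occurs with `anchor`.

    `posts` is a hypothetical iterable of tag lists, one per question,
    e.g., [["chatbot", "rasa"], ["rasa", "python"], ...].
    """
    tag_total = Counter()        # all posts carrying a given tag
    tag_with_anchor = Counter()  # posts carrying both the tag and `anchor`
    for tags in posts:
        has_anchor = anchor in tags
        for t in tags:
            tag_total[t] += 1
            if has_anchor:
                tag_with_anchor[t] += 1
    anchor_count = tag_with_anchor[anchor]
    return {
        t: (n_both / tag_total[t],   # TRT: share of the tag's posts that are chatbot posts
            n_both / anchor_count)   # TST: share relative to all 'chatbot' posts
        for t, n_both in tag_with_anchor.items()
    }

# Tags passing TRT > 11% and TST > 0.14% (the paper's thresholds) form the tag set.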
We consider a tag to be significant and relevant to the chatbot posts if its corresponding TRT and TST are above a certain threshold. The first three authors, with varying degrees of chatbot development experience, independently examined the tags with different TRT and TST thresholds. For each tag, we inspected a randomly selected sample of posts to identify when the tags become less relevant and less specific to chatbots, and thereby the most appropriate TRT and TST thresholds. This method has been used by several previous similar studies [11, 54] and has the goal of selecting tags relevant to chatbots without including too much noise in the dataset. The first three authors then discussed their choices to reach a consensus on the optimal TRT and TST values. We find that tags with a TRT value higher than 11% and a TST value higher than 0.14% yield an appropriate balance between the inclusion of more posts related to chatbots (i.e., a more representative dataset) and the filtering of posts that are unrelated to chatbots (i.e., less noise). It is important to note that our thresholds are in line with the thresholds used by previous studies that adopted the same approach [6, 11, 72]. Finally, we use the selected TRT and TST thresholds to identify our tag set. Table 1 shows the tags obtained in our tag set and their respective TRT and TST values.
[Figure: five-step pipeline — (1) download SOF data; (2) identify chatbot tags (extract 'chatbot'-tagged posts, extract co-existing tags, filter relevant tags via TRT/TST); (3) extract chatbot posts using the filtered tags; (4) data preprocessing (extract titles and preprocess for LDA); (5) LDA topic modelling (select optimal number of topics, categorize posts into topics, label topics using keywords).]
Figure 1: Overview of the methodology of our study.
Table 1: The tag set used to identify the chatbot-related posts. The TRT and TST are expressed in percentages.

Tag Name               TRT    TST
chatbot                100    100
facebook-chatbot       42.1   6.2
amazon-lex             22.2   4.3
rasa-nlu               18.4   2.9
aiml                   27.6   2.6
rasa-core              22.6   2.4
wit.ai                 13.1   1.9
chatterbot             25.4   1.6
api-ai                 11.4   0.8
web-chat               13.6   0.8
aws-lex                14.3   0.6
gupshup                27.1   0.6
sap-conversational-ai  50     0.5
chatfuel               26.3   0.5
pandorabots            41.2   0.3
rasa                   21.2   0.3
chatbase               18.2   0.3
chatscript             30.8   0.2
rivescript             28.6   0.2
program-o              37.5   0.1
botpress               33.3   0.1
lita                   25     0.1
Step 3: Extract chatbot posts. After obtaining the chatbot-related tag set, we use those tags (see Table 1) to extract the posts that constitute our chatbot dataset throughout this study. We extract this corpus by querying all posts on Stack Overflow that are tagged with one of the tags in our tag set. This process yielded a dataset containing 3,890 chatbot posts and their respective metadata.
Step 4: Preprocess chatbot posts. We filter out irrelevant information before applying the topic modeling techniques. In this analysis, we focus only on the posts' titles, as opposed to their body contents, as the content in the posts' bodies can introduce noise to our analysis. This approach of using the posts' titles has been used in prior investigations [54], as a post's title has been shown to be representative of the post body [20, 71]. After extracting the posts' titles, we prepare the data to be used in the topic modelling process. To do so, we leverage the Python NLTK [43] and Gensim [27] tools to perform the preprocessing steps on our dataset. First, we remove stopwords, such as 'how', 'a', and 'can', using the NLTK stopwords corpus [44], as those words hinder the process of differentiating between topics. Next, we build a bigram model using Gensim, since we notice that some words commonly appear together (e.g., 'Rasa NLU' and 'Bot Framework') and the topic modelling technique should consider them together. Moreover, we lemmatize the words to map them to their base form (e.g., 'development' is mapped to 'develop'). These steps output a preprocessed dataset that is ready to be input to the topic modelling technique in our next step.
Step 5: Identify chatbot topics. To identify the topics that are discussed by chatbot developers on Stack Overflow, we use the Latent Dirichlet Allocation (LDA) topic modeling technique [14], which has been widely used in software engineering studies [11, 54]. LDA groups the posts of our dataset into a set of topics based on the word frequencies and their co-occurrences in the posts. In particular, LDA assigns to each post a series of probabilities (one per topic) that indicate the chances of a post being related to a topic. The topic with the highest probability for a particular post (i.e., the topic whose keywords the post contains most of) is considered to be the post's dominant topic. We use the Mallet implementation of LDA in our methodology [36].

The main challenge of using LDA is to identify the optimal number of topics K that the LDA uses to group the posts. If the K value is too high, topics may become too specific to draw any relevant analysis. On the other hand, if the K value is small, the yielded topics may be too generic, encompassing posts of many different aspects. To overcome this issue, we examine different K values ranging from 5 to 20 in steps of 1 and calculate the coherence metric value of the topics. The coherence metric measures the understandability of the topics resulting from the LDA using different confirmation measures, and has been shown to be highly correlated with human understandability [53]. Thus, the first two authors ran the LDA with varying K values and stored the resulting coherence score from each run. We find that K values in the range of 10 to 14 have very similar coherence scores (i.e., the difference is very small). To ensure that we select the best K value, the first two authors examined a randomly selected sample of 30 posts from each topic for K values from 10 to 14. Based on this examination, we find that a K value of 12 (i.e., 12 topics) provides an optimal set of topics that balances the generalizability and the specificity (i.e., most coherent posts) of the resulting chatbot topics.
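As an illustration of this selection loop, the sketch below computes coherence scores over a range of K values. It is a simplified stand-in that uses Gensim's built-in LdaModel rather than the Mallet implementation the paper relies on; the pass count and random seed are assumptions.

from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

def coherence_per_k(docs, k_values=range(5, 21)):
    """`docs` is the list of preprocessed, tokenized titles from Step 4."""
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    scores = {}
    for k in k_values:
        lda = LdaModel(corpus=corpus, id2word=dictionary,
                       num_topics=k, passes=10, random_state=42)
        scores[k] = CoherenceModel(model=lda, texts=docs,
                                   dictionary=dictionary,
                                   coherence="c_v").get_coherence()
    return scores

# The dominant topic of a post is the topic with the highest probability:
#   dominant = max(lda.get_document_topics(bow), key=lambda pair: pair[1])[0]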
Table 2: The chatbot topics, categories, and their popularity.

Main Category     Topic                         # Posts  Avg. Views  Avg. Favourites  Avg. Scores
Integration       API Calls                     264      354.2       1.2              0.5
Integration       Messenger Integration         463      638.0       1.4              0.7
Integration       NLU Integration/Slots         388      406.0       1.1              0.8
Development       General Creation/Integration  250      671.6       3.1              0.6
Development       Development Frameworks        375      513.3       1.6              0.8
Development       Implementation Technologies   320      619.2       1.5              0.7
NLU               Intents & Entities            437      516.3       1.7              1.0
NLU               Model Training                347      524.3       1.4              0.7
User Interaction  Chatbot Response              253      409.1       1.2              0.7
User Interaction  Conversation                  278      510.5       1.9              0.6
User Interaction  User Interface                208      536.8       2.6              0.8
User Input        User Input                    307      402.7       1.2              0.6
3 CASE STUDY RESULTS
In this section, we present the analysis of the chatbot posts and
topics to answer our research questions.
3.1 RQ1: What topics are chatbot developers asking about?
Motivation: Chatbot development has some particularities that distinguish it from traditional software development [19]. For example, chatbot developers require specific expertise in natural language processing, machine learning, and conversation design, which is often unnecessary or overlooked in most conventional software development tasks. Hence, the challenges faced by chatbot developers are likely to differ from the challenges of traditional software development. Since developers use Q&A websites to communicate both problems and solutions, the goal of this research question is to dive into the invaluable data of Stack Overflow to identify the most common and pressing chatbot topics and the issues that are most frequently encountered by the chatbot community. Moreover, identifying the widely discussed chatbot topics is the initial step towards highlighting the topics that are gaining traction and are difficult for the chatbot community to answer.
Approach: We use LDA to identify the different topics that developers discuss on Stack Overflow, as described in Section 2. The first three authors (annotators) labelled the set of topics based on the posts' overall theme. In particular, each of the annotators individually inspected the top 20 keywords and a random sample of at least 30 posts from each topic in order to label it with a title that best represents the posts of that topic. Then, the authors discussed each of the 12 topics' labels to reach a consensus on the titles of all topics. We observe that some topics that discuss similar aspects of the chatbot development process, or that are related to the same chatbot component, can be further grouped into categories. For example, one topic with keywords related to 'response', 'webhook', and 'card' and another topic with 'display', 'trigger', and 'prompt' keywords are both related to chatbot user interaction. Therefore, we further categorize those topics to obtain a hierarchical view of the chatbot discussions on Stack Overflow. We also examine the most popular chatbot topics among developers. To investigate this, we use three complementary measurements of popularity that have been adopted in prior work [6, 11, 12, 42] (a computation sketch follows the list):
(1) The average number of views (avg. views) of the posts from both registered and unregistered users. Our intuition here is that if a post is viewed by a large number of developers, then this post is popular among chatbot developers. Overall, this metric measures the interest of the community by telling us how often a post is viewed.

(2) The average number of posts marked as favourite (avg. favourites) by Stack Overflow users. This metric captures the issues and solutions that developers deemed helpful and likely to recur during the development of chatbots.

(3) The average score (avg. scores) of the posts. Stack Overflow allows its members to up-vote posts that they consider interesting and useful. The votes are then aggregated as a score, which we use as a metric of perceived community value.
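The aggregation itself is straightforward; a minimal sketch, assuming a pandas DataFrame with one row per post, a dominant-topic label, and illustrative metadata column names from the dump:

import pandas as pd

def popularity_by_topic(posts: pd.DataFrame) -> pd.DataFrame:
    """`posts` is assumed to carry a `topic` label per post plus the
    `view_count`, `favorite_count`, and `score` columns from the dump."""
    return (posts.groupby("topic")
                 .agg(n_posts=("topic", "size"),
                      avg_views=("view_count", "mean"),
                      avg_favourites=("favorite_count", "mean"),
                      avg_score=("score", "mean"))
                 .sort_values("avg_views", ascending=False))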
Results: Table 2 shows the 12 topic titles, which are grouped into 5 main categories. It also shows the number of posts that belong to each topic and the topics' popularity through our popularity metrics: views, favourites, and the scores received by developers on Stack Overflow. As seen from the table, developers ask about different topics in chatbot development, and the number of posts varies across the topics.

The 12 chatbot topics can be grouped into five categories: 'Integration', 'Development', 'NLU', 'User Interaction', and 'User Input'. Next, we discuss those categories in more detail.
Integration: This category contains three topics, namely Messenger Integration, NLU Integration/Slots, and API Calls. It deals with the integration between chatbot platforms, APIs, and websites. About 28.6% of the posts in our dataset belong to this category. We also see that the Messenger Integration topic has the highest number of posts in our dataset. In this topic, developers mainly ask about how to create and integrate chatbots with messenger applications. One of the reasons for the widespread adoption of chatbots is the global adoption of messaging platforms (e.g., Slack) [33]. For example, Facebook reported in 2018 that more than 300,000 active chatbots are deployed on its Messenger platform [15]. An example of posts under this topic is a developer asking on Stack Overflow "Facebook Chatbot (PHP webhook) sending multiple replies" [47]. As chatbots are used to integrate various services [33], chatbot developers are more exposed to the challenges of multi-service and platform integration.
Development: The posts of this category are related to building chatbots using different development frameworks, asking about special configurations and features, and specific implementations using those frameworks. For example, a developer posted on Stack Overflow "How to start a conversation from Nodejs client to Microsoft bot" [45]. The Development Frameworks, Implementation Technologies, and General Creation/Integration topics form this category. In our study, this category is the second largest, containing 24.3% of the posts in our dataset. This shows that developers tend to rely heavily on chatbot frameworks.
Natural Language Understanding (NLU): This category contains posts related to the definition of intents (the purpose/intention behind the user's input) and entities (important pieces of information in the user's input, such as city names), handling and manipulating those intents and entities, customizing and configuring NLUs, and improving the performance of the NLU models. This category comprises 20.2% of the posts in our dataset and contains the Intents & Entities and Model Training topics. Those topics are related to the chatbot's capability of understanding the user's input and replying accordingly, which has a direct impact on user satisfaction [3]. The post "How can I improve the accuracy of chatbot built using Rasa?" [46] is an example of posts from this category. Currently, large IT companies are investing in building NLUs (e.g., Microsoft developed the LUIS platform [38]), which is an indicator of their importance and popularity. Moreover, NLU platforms are nowadays considered one of the critical components of chatbots [55]. Leveraging an NLU platform allows developers to focus on the core functionalities of their chatbots rather than having to analyze the user input and manage the conversation with the user themselves.
User Interaction: This category contains posts about conversation design, generating reply messages to users, and designing the chatbot's graphical user interface. For example, developers ask "How to resume or restart paused conversation in RASA?" [51] and "How to add custom choices displayed through Prompt options [...] using C#?" [48]. This category includes the User Interface, Chatbot Response, and Conversation topics and forms 19% of the posts in our dataset. We believe that managing the conversation flow with the user is not an easy task, since chatbot users might deviate from the designed conversation flow (i.e., change to another topic).
User Input: The posts of this category are related to checking/validating and storing the user input, e.g., "How to store and retrieve the chat history of the dialogflow?" [50]. There is only one topic in this category, and it contains 7.9% of the posts in our dataset. Having a single topic as a group indicates that parsing and storing chatbot users' input is a more independent problem among the chatbot topics.
From our results, we observe that the categories cover the end-to-end development of chatbots. The User Interaction category covers the creation of the chatbot interface, while the User Input category covers the manipulation of the users' input received through the User Interaction component. The NLU category includes posts about understanding the users' input and optimizing the NLU model of the chatbot, the Development category covers the back-end development of the core functionalities of the chatbot, and finally, the Integration category covers the integration of all the chatbot components together (user interface, NLU, back end, etc.). This shows that developers are facing various challenges and seeking knowledge about each phase of the chatbot development process. Moreover, the topics within each category reflect specific concerns and issues within that category. For example, in the NLU category, developers ask questions about defining/handling intents and entities and improving the performance of the NLU model.
In the second part of our analysis, we investigate the popularity of the chatbot topics. We find that the most popular topics fall into the Development and NLU categories. Table 2 shows that the General Creation/Integration topic contains the most viewed and most favourited posts by chatbot developers. This topic contains posts with basic questions about chatbot creation, and its high popularity can be explained by the introductory nature of the topic; that is, any newcomer will look for these posts to start developing their first chatbot. Another aspect of this topic's popularity might be the lack of proper chatbot introductory documentation and support for newcomers. The most viewed and favourited post in our dataset, "Any tutorials for developing chatbots?", with more than 71,565 views and 104 members marking it as a favourite, evidences this documentation concern. Interestingly, our findings suggest that the chatbot development community should give special attention to providing more extensive and accessible documentation on how to develop chatbots from scratch. Intents & Entities is the topic with the highest average post score; the process of handling intents and entities is one of the most specialized aspects of chatbot development, which might explain why developers have a higher (relative) praise for posts from this particular topic.
Chatbot developers ask about every aspect and phase of the chatbot development process, including Integration, NLU, Development, User Input, and User Interaction. The most popular topics in the chatbot dataset are related to General Creation/Integration.
3.2 RQ2: What types of questions are chatbot developers asking?
Motivation: After understanding the topics most interesting to chatbot developers, we set out to examine the types of posts that they ask in each chatbot category. Prior work [54] shows that developers ask different types of questions (i.e., how, why, what) to address distinct challenges; hence, this analysis helps us identify the nature of the challenges encountered during chatbot development.
Approach: To achieve this, we follow an approach similar to that used by prior work to identify the types of posts on Stack Overflow [54, 63]. In particular, we randomly sample posts from each of the five main chatbot categories with a confidence level of 95% and a confidence interval of 5%. Our random sample yields a total of 1,241 posts: 286 Integration posts, 273 Development posts, 258 NLU posts, 253 User Interaction posts, and 171 User Input posts.
Table 3: Chatbot post types on Stack Overflow.

Main Categories   % How  % Why  % What  % Other
Integration       66.4   22.7   10.8    0.0
Development       57.9   23.4   18.3    0.4
NLU               54.3   29.5   15.9    0.4
User Interaction  66.8   22.5   10.3    0.4
User Input        68.4   14.6   14.6    2.3
Chatbots (all)    61.8   25.4   11.7    1.2
The first three authors then individually examined the sampled posts' titles and bodies and labelled each post with one of the following types, which were used by prior work [54]:
How: Used for posts that ask about a method or technique to implement something [54]. Posts with this type differ from the 'why' posts in that the developer has a particular goal in mind and asks for the steps to achieve this goal (e.g., "how to get user name in Microsoft bot framework in C# using V4?").

Why: Posts where the developer asks about the reason, cause, or purpose of something [54]. Posts of the 'why' type are often related to troubleshooting, where the developer expects an explanation of a particular (and unexpected) behavior (e.g., "why is Wordpress blocking the js livechat window?").

What: Posts where the developer is asking for a particular piece of information [54]. Often, the user wants to clarify a doubt in order to make more informed decisions (e.g., "what are "implicit triggers" in a Google Action package?").

Other: We assign this type to posts that do not fall under any of the above types (e.g., "chatbot conversation objects, your approach?").
To measure the quality of our classification of the random sample, we use Cohen's Kappa [37] to measure the level of inter-rater agreement among the annotators. Overall, the annotators achieved substantial agreement (kappa = 0.62) on the 1,241 classified posts, which is higher than the agreement reached in similar studies [54]. For the cases where the annotators failed to agree, they revisited the questions together and discussed them to reach an agreement.
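Cohen's kappa is defined for two raters; with three annotators, one common convention (an assumption here, since the paper does not spell out the variant used) is to average the pairwise kappas:

from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def mean_pairwise_kappa(*annotations):
    """Average Cohen's kappa over all annotator pairs.
    Each argument is one annotator's list of labels for the same posts."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa_score(a, b) for a, b in pairs) / len(pairs)

# e.g., mean_pairwise_kappa(labels_author1, labels_author2, labels_author3)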
Results: Table 3 shows the percentage of post types for each chatbot category. We see that more than half of the posts (61.8%) are of the 'how' type, followed by 'why' (25.4%) and 'what' (11.7%). This shows that developers are looking for working examples, debugging help, and information. The User Interaction category has the most 'how' posts (66.8%), showing a need for more sources of guidance on designing and managing the conversation flow between the user and the chatbot. The NLU category has the most 'why' posts (29.5%), suggesting the need for discussion forums and better documentation on how the NLU models work, especially given that most NLUs are closed source. The Development category has the most 'what' posts (18.3%), suggesting that providing general information about the supported features of the chatbot frameworks is appreciated by the community.
Table 4: The difficulty per topic.

Topic                         Posts w/o Accepted (%)  Median Time (h)
General Creation/Integration  72.0                    8.2
Intents & Entities            71.4                    19.5
User Interface                70.7                    7.0
Model Training                70.2                    22.4
Messenger Integration         70.0                    22.6
User Input                    66.8                    9.3
NLU Integration/Slots         66.5                    12.8
Conversation                  65.5                    6.9
Chatbot Response              65.2                    11.3
Implementation Technologies   64.7                    15.5
API Calls                     63.7                    16.2
Development Frameworks        63.7                    15.6
Chatbot developers mainly (61.8%) look for implementation guidance by posting 'how' posts, followed by 'why' (25.4%) and 'what' (11.7%) posts. Developers are most concerned with the 'how' aspect in the User Interaction category, whereas the highest share of 'why' posts comes from the NLU category, and of 'what' posts from the Development category.
3.3 RQ3: Which topics are the most difficult to answer?
Motivation: Now that we know the popular topics and their types of posts, we want to investigate the difficulty of answering posts in each topic. Finding out whether some topics are harder to answer than others helps us identify the topics that need more attention from the community. It also allows us to highlight the topics where better tools/frameworks are needed to support developers in addressing chatbot development challenges.
Approach: We measure the difficulty of each topic by applying two metrics that have been used in prior work [11, 54, 72]:

(1) The percentage of posts of a topic without accepted answers (% w/o accepted answers). For each chatbot topic, we measure the percentage of posts that have no accepted answer. While many answers can be posted to a question, the post's author has the sole authority to mark an answer as accepted if it satisfies and solves the original question. Therefore, topics with fewer accepted answers are considered more difficult [11, 54].

(2) The median time in hours for an answer to be accepted (Median Time to Answer (Hrs.)). We measure the median time in hours for posts to receive an accepted answer. This metric considers the creation time of the accepted answer and not the time at which the answer is marked as accepted. The longer it takes for a post to be properly answered (i.e., receive an accepted answer), the harder the post is [11, 54].
Our dataset includes some posts that did not have sufficient time to receive an answer. In our dataset of chatbot-related posts, questions take a median of 14.8 hours to be answered; hence, we remove from this analysis posts that were created less than 14.8 hours before the data collection date (September 4, 2019).
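A minimal sketch of these two metrics, assuming a pandas DataFrame with one row per question and illustrative column names (`creation_date`, a boolean `has_accepted`, and `hours_to_accepted`, which is NaN for unanswered posts):

import pandas as pd

def difficulty_by_topic(posts: pd.DataFrame) -> pd.DataFrame:
    # Drop questions created less than 14.8 hours before data collection.
    cutoff = pd.Timestamp("2019-09-04") - pd.Timedelta(hours=14.8)
    old_enough = posts[posts["creation_date"] <= cutoff]
    return (old_enough.groupby("topic")
            .agg(pct_wo_accepted=("has_accepted",
                                  lambda s: 100.0 * (1 - s.mean())),
                 median_hours_to_answer=("hours_to_accepted", "median"))
            .sort_values("pct_wo_accepted", ascending=False))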
Table 5: Correlation of topics' popularity and difficulty (correlation coefficient / p-value).

Difficulty Metric             Avg. Views   Avg. Score   Avg. Favourite
% w/o Accepted Answers        0.524/0.084  0.147/0.651  0.419/0.176
Median Time to Answer (Hrs.)  0.105/0.749  0.223/0.485  0.335/0.287
Results: Table 4 shows the percentage of posts without accepted answers and the median time (in hours) to receive an accepted answer for each of the topics identified in Section 3.1. The topics in Table 4 are ordered by the percentage of posts without accepted answers. The most popular topic, General Creation/Integration, is also the one with the largest share of posts without accepted answers. The posts in this topic, however, take a median time of only 8.2 hours to receive an accepted answer, which is the third fastest median time among our topics. To understand the reason behind the high percentage of posts with no accepted answers (72%), we examine the posts of this topic. We find that the posts without an accepted answer are given low scores (on average 0.17) by developers on Stack Overflow. This might be due to unclear or ill-formed questions, which effectively reduces the chances of getting an accepted answer.
If we analyze the median time to answer a topic, we see a higher variation among the topics. Messenger Integration, Intents & Entities, and Model Training are the most difficult topics based on their time to receive accepted answers. Interestingly, Intents & Entities and Model Training belong to the NLU category, which discusses how to load and train NLU models and how to identify and handle intents and entities. The results show that the topics related to the NLU are harder for the Stack Overflow community to answer. This may be due to the black-box implementation of most popular NLUs, which prevents chatbot developers from fully understanding and solving NLU-related issues.
On the other hand, posts related to Development Frameworks have the highest percentage of accepted answers and a median time to answer in line with the overall chatbot topics (15.6 hours). This topic includes posts on how to implement chatbot routines using a certain technology (e.g., "How to send location from Facebook messenger platform?") or comparisons of different platforms (e.g., "Comparison between Luis.ai vs Api.ai vs Wit.ai?"). These are also tasks that are more closely related to traditional software development, which could explain why the Stack Overflow respondents tend to answer this topic faster and more frequently.
To have a full view of the chatbot-related posts, we examine whether there is a statistically significant correlation between difficulty and popularity. In particular, we use the Spearman rank correlation coefficient [57] to verify the correlations between the three popularity metrics (avg. views, avg. favourites, and avg. scores) and the two difficulty metrics (% w/o accepted answers and median time to answer). We choose Spearman's rank correlation since it does not make any assumption about the normality of the data distribution. As shown in Table 5, we do not find any statistically significant correlation between the popularity and difficulty metrics, since all correlations have a p-value > 0.05. In other words, the difficult topics are not necessarily popular among developers, and vice versa.
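This test is a one-liner with SciPy; a small sketch, assuming each metric is given as a per-topic list in the same topic order:

from scipy.stats import spearmanr

def correlate(popularity_metric, difficulty_metric):
    """Spearman rank correlation between one popularity metric and one
    difficulty metric (e.g., avg. views vs. % w/o accepted answers)."""
    rho, p_value = spearmanr(popularity_metric, difficulty_metric)
    return rho, p_value  # p_value > 0.05 means not statistically significant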
[Figure: line chart; x-axis: Years (2008–2020); y-axis: Relative Impact (10³).]
Figure 2: Relative growth of chatbot related posts over time.
Topics related to training chatbot models are the most difficult in chatbot development, while the most popular topic, General Creation/Integration, contains the largest share of unanswered posts. On the other hand, posts related to the Development Frameworks topic tend to be answered more frequently.
4 DISCUSSION & IMPLICATIONS
In this section, we discuss the evolution of the chatbot topics and compare our findings with those of prior work. Then, we delve into the data to identify the prevalent topics on different platforms and discuss the implications of our results.
4.1 Chatbot Topics Evolution
Chatbots are an emerging topic that is getting more attention from developers in different domains (e.g., security [22] and software engineering [62]). To examine the evolution of a topic, we utilize two measures: the absolute growth, which measures the change in the total number of posts over time, and the relative growth, which represents the relative change in the total number of posts for a specific topic compared to the change in the total number of posts for the entire Stack Overflow dataset. To highlight the evolution of the chatbot topics, we examine the relative growth of all chatbot topics compared to Stack Overflow over time, from August 2008 to September 2019. Figure 2 shows the evolution of the chatbot topics in terms of relative growth compared to Stack Overflow. As seen in the figure, the relative growth of the chatbot topics shows an increasing trend that started in 2016. This increase in the last few years shows that chatbots are gaining more attention from the community over time.
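A sketch of the relative-growth computation, assuming two series of post creation timestamps; the paper's exact "relative impact" normalization may differ from this simple monthly share:

import pandas as pd

def relative_growth(chatbot_dates: pd.Series, all_dates: pd.Series) -> pd.Series:
    """Monthly number of chatbot posts divided by all Stack Overflow posts."""
    chatbot_monthly = chatbot_dates.dt.to_period("M").value_counts().sort_index()
    all_monthly = all_dates.dt.to_period("M").value_counts().sort_index()
    return (chatbot_monthly / all_monthly).fillna(0.0)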
To better understand the evolution of the different chatbot development activities, we measure the absolute growth of each of the five categories over time. We find that all of our categories grow positively over time, as shown in Figure 3. This means that the number of posts for every category is increasing over time, which in turn indicates the increasing trend of the various chatbot development activities represented by the different categories.
We further investigate the reasons behind the sudden increases (i.e., spikes) in the number of posts during specific periods of time and find two interesting cases, as shown in Figure 3. The first case is related to the Integration category, which has its highest spike (46 posts) in June 2017. We find that most of the discussions during this spike are related to the integration of the Amazon Lex platform [7], which was released in April 2017 [9]. The second sudden increase can be observed in the NLU category during November 2016. Posts in that spike ask about the intents and entities in the Wit.ai platform [24], which was released in April 2016 [60].

Figure 3: Chatbot categories evolution over time.
Table 6: Comparison of popularity and difficulty between different fields.

Metrics                   Chatbot  Mobile     Security  Big Data
# of Posts                3,890    1,604,483  94,541    125,671
Avg. ViewCount            512.4    2,300      2,461.1   1,560.4
Avg. FavoriteCount        1.6      2.8        3.8       1.9
Avg. Score                0.7      2.1        2.7       1.4
Avg. AnswerCount          1.0      1.5        1.6       1.1
% w/o Answers             67.7     52         48.2      60.3
Med. TimeToAnswer (Hrs.)  14.8     0.7        0.9       3.3
Although we show the results of the chatbot categories' evolution over time here, we share the evolution results of each of the topics in a publicly available online dataset [56]. In general, we can see a trend of chatbot development activities gaining traction among developers. Our findings also show that the chatbot community tends to quickly pick up new platforms, as shown in the cases of Amazon Lex and Wit.ai.
4.2 Chatbot Compared to Other SE Fields
In the previous sections, we nd that chatbot discussions only
started to become more active in 2016. As a new and emerging eld,
we set out to investigate how the topics of chatbot compares against
discussions of more consolidated Software Engineering (SE) elds
such as mobile, big data and security (topics that were similarly
studied in the past)? To answer this question, we examine the
diculty and popularity of the chatbot topics and compare it against
other disciplines, by including data from similar studies on Stack
Overow, focused on the topics of mobile apps [54], security [72],
and big data [
11
]. Those studies were conducted in a dierent time
frame, therefore, we use their reported keywords to construct an
updated dataset and calculate the popularity and diculty metrics
for each of those elds.
Table 6 shows the results of the popularity and difficulty metrics for the four fields. From the sheer number of posts, the chatbot topic is, by a few orders of magnitude, smaller than mobile, security, and big data. Second, the chatbot posts are considerably more difficult compared to the other fields, which is also a consequence of having a small and niche crowd. There is a big gap in the time to receive an accepted answer for the chatbot-related posts compared to other topics. Most mobile and security posts are answered in less than an hour, while most chatbot posts take at least 14 hours. This corroborates the emerging nature of the chatbot topic and indicates that much needs to be done to put the chatbot development community on par with other mature fields such as mobile and security.
4.3 Implications
The results of our study can help the chatbot community better focus its efforts on the most pressing issues in chatbot development. In the following, we describe how our results can be used to guide practitioners, researchers, and educators in improving the practice and learning of chatbot development.
To help identify the most pressing issues, we present in Figure 4 a bubble plot that positions the topics in terms of their popularity and difficulty. The size of the bubble represents the number of posts for a particular topic, and we visually split the figure into four quadrants to show the relative importance and difficulty of the topics. We use the average number of views as a proxy for popularity and the percentage of posts without accepted answers as a proxy for difficulty.
Implication for Practitioners. As shown in Figure 4, albeit being the most popular topic, beginner questions on how to build chatbots (General Creation/Integration) remain largely unanswered. The development community should use this finding to devise better tutorials and documentation aimed at reducing the entry barrier for developing chatbots.
Our ndings can help chatbot developers better prioritize their
work by taking into account the areas of the most dicult topics in
chatbot development. Topics related to NLU, such as Model Training
and Intents & Entities, are among the topics with the highest share
of posts without accepted answers. Software managers can take
that into account by assigninig more resources (development time)
to tasks that involve training NLU models, especially given that
NLU has the highest share of troubleshooting posts (Section 3.2),
indicating that developers experience issues more frequently with
this kind of tasks.
The evidence of the difficulty of NLU-related topics can be used to motivate better and more intuitive NLU frameworks. Practitioners can improve the current documentation of the NLU frameworks, and companies that develop and publish NLU platforms should focus on improving the expressiveness of their current framework APIs. For instance, some platforms (e.g., Google Dialogflow [28] and Microsoft LUIS [38]) offer a graphical interface for training the NLU model, in an attempt to extend the model training to users less familiar with software programming [21, 40].
Figure 4 also shows that Messenger Integration is the largest topic in our dataset. In fact, Integration is the category with the highest number of posts on Stack Overflow. Chatbots are expected to communicate between multiple services and integrate with messengers to make use of existing social network platforms (e.g., Facebook). Practitioners should invest more resources into facilitating the integration of their platforms and tools with other services. For instance, Dialogflow offers developers a one-click integration feature for some of the most popular chatting platforms, such as Slack, Twitter, and Skype [30]. As chatbot developers find integration a pressing issue, providing straightforward approaches to integration would allow developers to focus on the core chatbot functionalities, reducing the time and effort overhead of developing multi-service chatbots.
Implication for Researchers. Our findings confirm that chatbot developers discuss topics, such as Conversation, NLU Integration/Slots, and Chatbot Response, that differentiate chatbot development from traditional software development. As shown in Figure 4, NLU-related topics are notoriously difficult, and research effort can be put into some of the problems faced by chatbot developers when training their NLU models. One such problem is the acquisition of a high-quality dataset, frequently asked about by developers on Stack Overflow [49, ?]. A high-quality dataset that represents well the intents and entities supported by the chatbot is paramount for the chatbot's performance. New comprehensive datasets and approaches that focus on generating labelled data can help alleviate this challenge faced by developers. Another problem is related to methods for extracting intents and entities, which has received some attention from the research community [26, 71, 73, 74, 76] but remains a challenging problem in chatbot development.
Implication for Educators. Educators can use our topics and categories as a roadmap to design their chatbot-related courses. The Development category also has a high number of discussions looking for the most appropriate framework and best practices ('what' posts); hence, educators can introduce their audience to the several existing chatbot development frameworks and discuss best practices and standards to be followed during the chatbot development phase. As mentioned before, special attention should be given to the NLU topics, which have been shown to be difficult (Figure 4). In particular, since NLU has the highest share of 'why' posts, this indicates that chatbot developers are in need of theoretical explanations of NLU machine-learning algorithms and models.
[Figure: bubble plot; x-axis: Difficulty (% w/o Accepted Answers, 60–74); y-axis: Popularity (Avg. Views, 310–710); quadrants from least to most popular and least to most difficult; one bubble per topic.]
Figure 4: Chatbot topics' popularity vs. difficulty.
There are many aspects that practitioners, researchers, and educators can take into consideration when deciding where to focus their efforts. Nevertheless, we believe that our findings and implications can help improve this decision-making process.
5 RELATED WORK
In this section, we present the studies related to chatbots in the SE domain and discuss the work that leverages and analyzes Stack Overflow data to gain more insights into developers' perspectives.
Software Chatbots. A number of studies have focused on implementing chatbots to help developers in their daily tasks [3, 16, 61, 64, 69, 71]. For example, Bradley et al. [16] developed Devy to assist developers in their basic development tasks (e.g., committing code). Abdellatif et al. [3] developed MSRBot, which leverages repository (i.e., Git and Jira) data to answer questions related to software projects through natural language. Moreover, chatbots are used to assist in customer service [70], answer student admission questions [5], and support the health care domain [17].
The rise of chatbots in academia and industry motivates us to examine the issues and challenges that face chatbot developers in their implementations. We believe that our work provides insights to the research community on the areas that require more investigation, allowing developers to focus on the core functionalities of the chatbot and lowering the barrier to entry for new practitioners in the chatbot domain.
Using Stack Overflow Data. There are a number of studies that use Stack Overflow data to study its users' commenting activities [75], the impact of code reuse from Stack Overflow on mobile apps [1], and the generation of code comments for code snippets [4]. The work closest to ours is the work that applied LDA on Stack Overflow. Rosen and Shihab [54] summarized the mobile-related questions on Stack Overflow and the specific issues of the different mobile platforms. Similarly, Bagherzadeh and Khatchadourian [11] used topic modelling to extract the big data topics and big data developers' interests from Stack Overflow. Wan et al. [67] use Stack Overflow to understand the challenges and needs of blockchain developers. Venkatesh et al. [66] examine the challenges that client developers face when using Web APIs using the Stack Overflow dump. Yang et al. [72] conduct a large-scale study on Stack Overflow to identify the security-related questions asked by practitioners. Jin et al. [32] used Stack Overflow to investigate the issues that developers face when implementing or using biometric APIs. Han et al. [31] conducted a large-scale study on Stack Overflow and GitHub using LDA to identify the topics discussed among developers about three deep learning frameworks (TensorFlow, PyTorch, and Theano). Ahmed and Bagherzadeh [6] used LDA on Stack Overflow to identify the challenges and interests of concurrency developers.
To the best of our knowledge, there is no prior work that studied chatbot-related posts on Stack Overflow. We believe that our study complements prior work on Stack Overflow by analyzing chatbot-related posts. We extracted the chatbot topics and categorized them. Also, we examined the popularity, difficulty, and growth of those topics compared to other studies. We believe that our work sheds light, for the research community, on the areas that chatbot developers find interesting and challenging at this early stage of the evolution of chatbots.
6 THREATS TO VALIDITY
Internal Validity: Internal validity concerns factors that could have influenced our results. We use tags from Stack Overflow to identify chatbot-related posts, and it might be the case that some chatbot-related posts are mislabelled (i.e., missing tags or having incorrect tags) and therefore omitted from our dataset. We mitigate this threat by examining all tags that co-exist with the 'chatbot' tag and selecting a set of tags that are related to chatbots using the TST and TRT measures. Those measures have been used in prior work to achieve better coverage of a certain topic's posts and to limit the noise in the dataset [11, 54, 67, 72]. Moreover, we find that the TST and TRT thresholds that we obtain in our study are in line with previous studies [6, 11, 72].
One potential threat is that we select K = 12 as the optimal number of topics for the LDA topic modelling technique. The number of topics (K) has a direct influence on the quality of the topics resulting from the LDA, and selecting an optimal number is known to be difficult. To alleviate this threat, we follow the approach used in similar studies to select the number of topics [31, 67]. Specifically, we experiment with different values of K and examine the coherence of the topics to select the optimal K value that balances the generalizability and relevance of the chatbot topics.
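A minimal sketch of this selection loop, assuming gensim [27] and pre-tokenized posts, could look as follows; the range of K values and the LDA parameters shown here are illustrative only.

```python
# Illustrative K-selection via topic coherence; `docs` is an assumed
# list of token lists produced by the preprocessing steps.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

for k in range(5, 21):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   passes=10, random_state=42)
    # c_v coherence tends to correlate with human judgments of topic quality
    coherence = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                               coherence="c_v").get_coherence()
    print(f"K={k}: coherence={coherence:.4f}")
```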
The labelling of post types is another threat to the validity of our results, due to the subjectivity of the process. We mitigate this threat by performing three independent classifications and evaluating the interrater agreement using the Cohen's Kappa test, which indicated substantial agreement among the annotators.
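For illustration, the agreement between one pair of annotators can be checked with scikit-learn as sketched below; the labels are toy values, and with three annotators the statistic is computed pairwise.

```python
# Toy example of measuring interrater agreement with Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["how", "why", "how", "conceptual", "how", "error"]
annotator_b = ["how", "why", "how", "how", "how", "error"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
# Values above 0.61 are commonly interpreted as substantial agreement.
print(f"Cohen's kappa: {kappa:.2f}")
```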
Construct Validity: Construct validity considers the relationship between theory and observation, in case the measured variables do not measure the actual factors. The labels of the topics resulting from the LDA might not reflect the posts associated with those topics. To minimize this threat, the first three authors individually examine the keywords and more than 30 randomly selected posts from each topic, and then discuss each topic's label to reach a consensus on the label that reflects the posts of that topic. We use different metrics to measure the popularity and difficulty of the chatbot topics, which might be a threat to construct validity. These metrics have been used in similar studies [6, 11, 12, 42, 54, 72].
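To make these metrics concrete, the sketch below derives per-topic popularity and difficulty indicators from post metadata, assuming a pandas DataFrame whose column names are hypothetical stand-ins for fields in the Stack Exchange data dump [23]; the exact metric set follows the cited prior work rather than this sketch.

```python
# Hypothetical per-topic popularity/difficulty metrics; column names
# are illustrative stand-ins for fields of the Stack Exchange dump.
import pandas as pd

def topic_metrics(posts: pd.DataFrame) -> pd.DataFrame:
    grouped = posts.groupby("topic")
    return pd.DataFrame({
        # popularity: how much attention a topic's posts attract
        "avg_views": grouped["view_count"].mean(),
        "avg_score": grouped["score"].mean(),
        "avg_favorites": grouped["favorite_count"].mean(),
        # difficulty: how hard a topic's questions are to answer
        "pct_no_accepted": 1.0 - grouped["has_accepted_answer"].mean(),
        "median_hrs_to_accept": grouped["hours_to_accepted"].median(),
    })
```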
External Validity: Threats to external validity concern the generalization of our findings. Our study focused on and collected data from posts on Stack Overflow; however, there are other forums that may host developers' discussions regarding chatbots. We believe that using Stack Overflow allows for the generalizability of our results, as Stack Overflow is a very popular platform that hosts a large number of questions and answers from developers with a wide variety of domains and expertise. We also believe that this study can be improved by including discussions from different forums or by surveying software developers about the issues they face when building chatbots.
The focus of this study is on chatbots, which are considered to be a sub-category of software bots [34, 35]. Therefore, our observations and results cannot be generalized to other types of bots, such as agents. However, we believe that our observations are still relevant and contribute to the larger (software bot) community. We encourage other researchers to conduct similar studies on other types of bots and to compare the results across the different types to paint a full picture of bots in general.
7 CONCLUSION
In this paper, we analyze Stack Overflow posts to identify the most pressing issues facing chatbot development. We find that developers discuss 12 chatbot-related topics that fall under five main categories, namely Integration, Development, NLU, User Interaction, and User Input. Chatbot developers are highly interested in posts that are related to chatbot creation and integration into websites. On the other hand, training the NLU model of the chatbot proves to be a challenging task for developers. We also find that chatbot practitioners show considerable interest in understanding the behavior of NLUs, while also seeking good recommendations regarding chatbot development platforms and best practices. We believe that our results are useful to the chatbot community as they guide future research to focus on the more pressing and difficult aspects of chatbot development. Moreover, our findings help platform owners understand the issues faced by chatbot developers when using their platforms, and to overcome those challenges. Chatbot educators can take into consideration the discussed topics and categories and their respective difficulty to better design their courses.
Our study opens the door for chatbot researchers and practitioners to further understand the challenges of chatbot development. Nevertheless, we plan in the future to examine developers' discussions from other forums to draw more accurate and generalizable conclusions. We also plan to investigate developers' discussions regarding bots in general, which would allow us to compare our results with other bot types. Finally, we intend to investigate chatbot repositories and analyze the commits and bug reports to obtain further insights into the various issues faced by chatbot developers and their attempts to solve them.
REFERENCES
[1] Rabe Abdalkareem, Emad Shihab, and Juergen Rilling. 2017. On Code Reuse from StackOverflow. Information and Software Technology 88, C (Aug. 2017), 148–158. https://doi.org/10.1016/j.infsof.2017.04.005
[2] R. Abdalkareem, E. Shihab, and J. Rilling. 2017. What Do Developers Use the Crowd For? A Study Using Stack Overflow. IEEE Software 34, 2 (Mar 2017), 53–60. https://doi.org/10.1109/MS.2017.31
[3] Ahmad Abdellatif, Khaled Badran, and Emad Shihab. 2020. MSRBot: Using Bots to Answer Questions from Software Repositories. Empirical Software Engineering (EMSE) (2020). https://doi.org/10.1007/s10664-019-09788-5
[4] E. Aghajani, G. Bavota, M. Linares-Vásquez, and M. Lanza. 2018. Automated Documentation of Android Apps. IEEE Transactions on Software Engineering (2018), 1–1. https://doi.org/10.1109/TSE.2018.2890652
[5] H. Agus Santoso, N. Anisa Sri Winarsih, E. Mulyanto, G. Wilujeng Saraswati, S. Enggar Sukmana, S. Rustad, M. Syaifur Rohman, A. Nugraha, and F. Firdausillah. 2018. Dinus Intelligent Assistance (DINA) Chatbot for University Admission Services. In 2018 International Seminar on Application for Technology of Information and Communication. IEEE Press, 417–423.
[6] Syed Ahmed and Mehdi Bagherzadeh. 2018. What Do Concurrency Developers Ask About? A Large-scale Study Using Stack Overflow. In Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM '18). ACM, New York, NY, USA, Article 30, 10 pages. https://doi.org/10.1145/3239235.3239524
[7] Amazon. 2019. Amazon Lex - Build Conversation Bots. https://aws.amazon.com/lex/. (Dec 2019). (Accessed on 12/12/2019).
[8] Apple. 2020. Siri - Apple. https://www.apple.com/ca/siri/. (2020). (Accessed on 01/08/2020).
[9] Amazon AWS. 2019. Document History for Amazon Lex - Amazon Lex. https://docs.aws.amazon.com/lex/latest/dg/doc-history.html. (2019). (Accessed on 12/12/2019).
[10] C. J. Baby, F. A. Khan, and J. N. Swathi. 2017. Home automation using IoT and a chatbot using natural language processing. In 2017 Innovations in Power and Advanced Computing Technologies (i-PACT). IEEE Press, 1–6. https://doi.org/10.1109/IPACT.2017.8245185
[11] Mehdi Bagherzadeh and Raffi Khatchadourian. 2019. Going Big: A Large-Scale Study on What Big Data Developers Ask. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2019). Association for Computing Machinery, New York, NY, USA, 432–442. https://doi.org/10.1145/3338906.3338939
[12] Kartik Bajaj, Karthik Pattabiraman, and Ali Mesbah. 2014. Mining Questions Asked by Web Developers. In Proceedings of the 11th Working Conference on Mining Software Repositories (MSR 2014). ACM, New York, NY, USA, 112–121. https://doi.org/10.1145/2597073.2597083
[13] Anton Barua, Stephen W. Thomas, and Ahmed E. Hassan. 2014. What are developers talking about? An analysis of topics and trends in Stack Overflow. Empirical Software Engineering 19, 3 (01 Jun 2014), 619–654.
[14] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. The Journal of Machine Learning Research 3 (March 2003), 993–1022. http://dl.acm.org/citation.cfm?id=944919.944937
[15] Marion Boiteux. 2018. Messenger at F8 2018 - Messenger Developer Blog. https://blog.messengerdevelopers.com/messenger-at-f8-2018-44010dc9d2ea. (2018). (Accessed on 12/21/2019).
[16] Nick C. Bradley, Thomas Fritz, and Reid Holmes. 2018. Context-Aware Conversational Developer Assistants. In Proceedings of the 40th International Conference on Software Engineering (ICSE '18). Association for Computing Machinery, New York, NY, USA, 993–1003. https://doi.org/10.1145/3180155.3180238
[17] Gillian Cameron, David Cameron, Gavin Megaw, Raymond Bond, Maurice Mulvenna, Siobhan O'Neill, Cherie Armour, and Michael McTear. 2018. Best Practices for Designing Chatbots in Mental Healthcare: A Case Study on iHelpr. In Proceedings of the 32nd International BCS Human Computer Interaction Conference (HCI '18). BCS Learning & Development Ltd., Swindon, GBR, Article 129, 5 pages. https://doi.org/10.14236/ewic/HCI2018.129
[18] PACT Care. 2020. Florence - Your health assistant. https://www.florence.chat/. (2020). (Accessed on 01/08/2020).
[19] Chatbot application Life cycle 2019. Chatbot application Life cycle - Data Driven Investor - Medium. https://medium.com/datadriveninvestor/chatbot-application-life-cycle-8b2d083650a8. (June 2019). (Accessed on 12/16/2019).
[20] G. Chen, C. Chen, Z. Xing, and B. Xu. 2016. Learning a dual-language vector space for domain-specific cross-lingual question retrieval. In 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). 744–755.
[21] Google Dialogflow. 2020. Build an agent. https://cloud.google.com/dialogflow/docs/quick/build-agent. (2020). (Accessed on 01/16/2020).
[22] Saurabh Dutta, Ger Joyce, and Jay Brewer. 2018. Utilizing Chatbots to Increase the Efficacy of Information Security Practitioners. In Advances in Human Factors in Cybersecurity, Denise Nicholson (Ed.). Springer International Publishing, 237–243.
[23] Stack Exchange. 2019. Stack Exchange Data Dump. https://archive.org/details/stackexchange. (Sept. 2019).
[24] Facebook. 2019. Wit.ai. https://wit.ai/. (2019). (Accessed on 12/12/2019).
[25] Sally Fincher and Josh Tenenberg. 2005. Making sense of card sorting data. Expert Systems 22, 3 (2005), 89–93. https://doi.org/10.1111/j.1468-0394.2005.00299.x
[26] J. Gao, J. Chen, S. Zhang, X. He, and S. Lin. 2019. Recognizing Biomedical Named Entities by Integrating Domain Contextual Relevance Measurement and Active Learning. In 2019 IEEE 3rd Information Technology, Networking, Electronic and Automation Control Conference (ITNEC). 1495–1499. https://doi.org/10.1109/ITNEC.2019.8728991
[27] Gensim. 2019. gensim: Topic modelling for humans. https://radimrehurek.com/gensim/. (2019). (Accessed on 12/03/2019).
[28] Google. 2020. Dialogflow. https://dialogflow.com/. (2020). (Accessed on 01/16/2020).
[29] Google. 2020. Google Assistant, your own personal Google. https://assistant.google.com/. (2020). (Accessed on 01/08/2020).
[30] Google. 2020. Integrations - Dialogflow Documentation. https://cloud.google.com/dialogflow/docs/integrations/. (2020). (Accessed on 01/16/2020).
[31] Junxiao Han, Emad Shihab, Zhiyuan Wan, Shuiguang Deng, and Xin Xia. 2019. What do Programmers Discuss about Deep Learning Frameworks. Empirical Software Engineering (EMSE) (2019), To Appear.
[32] Z. Jin, K. Y. Chee, and X. Xia. 2019. What Do Developers Discuss about Biometric APIs?. In 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME). 348–352. https://doi.org/10.1109/ICSME.2019.00053
[33] C. Lebeuf, M. Storey, and A. Zagalsky. 2018. Software Bots. IEEE Software 35, 1 (January 2018), 18–23. https://doi.org/10.1109/MS.2017.4541027
[34] C. Lebeuf, A. Zagalsky, M. Foucault, and M. Storey. 2019. Defining and Classifying Software Bots: A Faceted Taxonomy. In 2019 IEEE/ACM 1st International Workshop on Bots in Software Engineering (BotSE). 1–6. https://doi.org/10.1109/BotSE.2019.00008
[35] Carlene R. Lebeuf. 2018. A taxonomy of software bots: towards a deeper understanding of software bot characteristics. Ph.D. Dissertation.
[36] Andrew Kachites McCallum. 2002. MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu/. (2002). (Accessed on 12/03/2019).
[37] Mary McHugh. 2012. Interrater reliability: The kappa statistic. Biochemia Medica 22 (10 2012), 276–82. https://doi.org/10.11613/BM.2012.031
[38] Microsoft. 2019. LUIS (Language Understanding) - Cognitive Services - Microsoft Azure. https://www.luis.ai/home. (2019). (Accessed on 12/12/2019).
[39] Microsoft. 2020. Microsoft Bot Framework. https://dev.botframework.com/. (2020). (Accessed on 01/16/2020).
[40] Microsoft. 2020. Quickstart: Create a new app in the LUIS portal - Azure Cognitive Services | Microsoft Docs. https://docs.microsoft.com/en-us/azure/cognitive-services/luis/get-started-portal-build-app. (2020). (Accessed on 01/16/2020).
[41] Milja Milenkovic. 2019. The Future Is Now - 37 Fascinating Chatbot Statistics. https://www.smallbizgenius.net/by-the-numbers/chatbot-statistics/. (Oct. 2019). (Accessed on 12/18/2019).
[42] Sarah Nadi, Stefan Krüger, Mira Mezini, and Eric Bodden. 2016. Jumping Through Hoops: Why Do Java Developers Struggle with Cryptography APIs?. In Proceedings of the 38th International Conference on Software Engineering (ICSE '16). ACM, New York, NY, USA, 935–946. https://doi.org/10.1145/2884781.2884790
[43] Natural Language Toolkit (NLTK). 2019. Natural Language Toolkit - NLTK 3.4.5 documentation. https://www.nltk.org/. (2019). (Accessed on 12/12/2019).
[44] Natural Language Toolkit (NLTK). 2019. NLTK's list of english stopwords. https://gist.github.com/sebleier/554280. (2019). (Accessed on 12/23/2019).
[45] Stack Overflow. 2017. node.js - How to start a conversation from nodejs client to microsoft bot - Stack Overflow. https://stackoverflow.com/questions/46183295/how-to-start-a-conversation-from-nodejs-client-to-microsoft-bot. (2017). (Accessed on 12/20/2019).
[46] Stack Overflow. 2017. Rasa nlu parse request giving wrong intent result - Stack Overflow. https://stackoverflow.com/questions/46466222/rasa-nlu-parse-request-giving-wrong-intent-result. (2017). (Accessed on 12/20/2019).
[47] Stack Overflow. 2018. Facebook Chat bot (PHP webhook) sending multiple replies - Stack Overflow. https://stackoverflow.com/questions/36609549/facebook-chat-bot-php-webhook-sending-multiple-replies. (2018). (Accessed on 12/20/2019).
[48] Stack Overflow. 2019. botframework - How to add custom choices displayed through Prompt options inside Cards & trigger actions on choice click in BOT V4 using c#? - Stack Overflow. (2019). https://stackoverflow.com/questions/56280689/how-to-add-custom-choices-displayed-through-prompt-options-inside-cards-trigge. (Accessed on 01/16/2020).
[49] Stack Overflow. 2019. nlp - Is there a dataset that provides shopping conversations? - Stack Overflow. https://stackoverflow.com/questions/55324833/is-there-a-dataset-that-provides-shopping-conversations. (2019). (Accessed on 01/13/2020).
[50] Stack Overflow. 2019. node.js - How to store and retrieve the chat history of the dialogflow? - Stack Overflow. https://stackoverflow.com/questions/49665510/how-to-store-and-retrieve-the-chat-history-of-the-dialogflow. (2019). (Accessed on 12/21/2019).
[51] Stack Overflow. 2019. python - How to resume or restart paused conversation in RASA - Stack Overflow. https://stackoverflow.com/questions/57365685/how-to-resume-or-restart-paused-conversation-in-rasa. (2019). (Accessed on 12/20/2019).
[52] Stack Overflow. 2019. Stack Overflow Developer Survey 2019. https://insights.stackoverflow.com/survey/2019. (2019). (Accessed on 01/09/2020).
[53] Michael Röder, Andreas Both, and Alexander Hinneburg. 2015. Exploring the Space of Topic Coherence Measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (WSDM '15). ACM, New York, NY, USA, 399–408. https://doi.org/10.1145/2684822.2685324
[54] Christoffer Rosen and Emad Shihab. 2016. What Are Mobile Developers Asking About? A Large Scale Study Using Stack Overflow. Empirical Software Engineering 21, 3 (June 2016), 1192–1223. https://doi.org/10.1007/s10664-015-9379-3
[55] B. Rychalska, H. Glabska, and A. Wroblewska. 2018. Multi-Intent Hierarchical Natural Language Understanding for Chatbots. In 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS). 256–259. https://doi.org/10.1109/SNAMS.2018.8554770
[56] A. Abdellatif, D. Costa, K. Badran, R. Abdalkareem, and E. Shihab. 2020. Dataset. https://zenodo.org/record/3610714. (2020). (Accessed on 01/16/2020).
[57] Spearman. 2008. Spearman Rank Correlation Coefficient. Springer New York, New York, NY, 502–505. https://doi.org/10.1007/978-0-387-32833-1_379
[58] Margaret-Anne Storey and Alexey Zagalsky. 2016. Disrupting Developer Productivity One Bot at a Time. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2016). ACM, New York, NY, USA, 928–931. https://doi.org/10.1145/2950290.2983989
[59] Sumo. 2020. 5 Ecommerce Chatbots (Plus How To Build Your Own In 15 Minutes). https://sumo.com/stories/ecommerce-chatbot-marketing. (2020). (Accessed on 01/08/2020).
[60] TechCrunch. 2017. Wit.ai is shutting down Bot Engine as Facebook rolls NLP into its updated Messenger Platform. (2017). https://techcrunch.com/2017/07/27/wit-ai-is-shutting-down-bot-engine-as-facebook-rolls-nlp-into-its-updated-messenger-platform. (Accessed on 12/12/2019).
[61] Y. Tian, F. Thung, A. Sharma, and D. Lo. 2017. APIBot: Question answering bot for API documentation. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). 153–158. https://doi.org/10.1109/ASE.2017.8115628
[62] Carlos Toxtli, Andrés Monroy-Hernández, and Justin Cranshaw. 2018. Understanding Chatbot-mediated Task Management. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI '18). ACM, New York, NY, USA, Article 58, 6 pages. https://doi.org/10.1145/3173574.3173632
[63] Christoph Treude, Ohad Barzilay, and Margaret-Anne Storey. 2011. How Do Programmers Ask and Answer Questions on the Web? (NIER Track). In Proceedings of the 33rd International Conference on Software Engineering (ICSE '11). ACM, New York, NY, USA, 804–807. https://doi.org/10.1145/1985793.1985907
[64] Simon Urli, Zhongxing Yu, Lionel Seinturier, and Martin Monperrus. 2018. How to Design a Program Repair Bot? Insights from the Repairnator Project. In Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP '18). ACM, New York, NY, USA, 95–104. https://doi.org/10.1145/3183519.3183540
[65] Stefano Valtolina, Barbara Rita Barricelli, and Serena Di Gaetano. 2020. Communicability of traditional interfaces VS chatbots in healthcare and smart home domains. Behaviour & Information Technology 39, 1 (2020), 108–132. https://doi.org/10.1080/0144929X.2019.1637025
[66] P. K. Venkatesh, S. Wang, F. Zhang, Y. Zou, and A. E. Hassan. 2016. What Do Client Developers Concern When Using Web APIs? An Empirical Study on Developer Forums and Stack Overflow. In 2016 IEEE International Conference on Web Services (ICWS). 131–138. https://doi.org/10.1109/ICWS.2016.25
[67] Z. Wan, X. Xia, and A. E. Hassan. 2019. What is Discussed about Blockchain? A Case Study on the Use of Balanced LDA and the Reference Architecture of a Domain to Capture Online Discussions about Blockchain Platforms across the Stack Exchange Communities. IEEE Transactions on Software Engineering (2019), 1–1. https://doi.org/10.1109/TSE.2019.2921343
[68] Joseph Weizenbaum. 1966. ELIZA - A Computer Program for the Study of Natural Language Communication between Man and Machine. Commun. ACM 9, 1 (Jan. 1966), 36–45. https://doi.org/10.1145/365153.365168
[69] Marvin Wyrich and Justus Bogner. 2019. Towards an Autonomous Bot for Automatic Source Code Refactoring. In Proceedings of the 1st International Workshop on Bots in Software Engineering (BotSE '19). IEEE Press, Piscataway, NJ, USA, 24–28. https://doi.org/10.1109/BotSE.2019.00015
[70] Anbang Xu, Zhe Liu, Yufan Guo, Vibha Sinha, and Rama Akkiraju. 2017. A New Chatbot for Customer Service on Social Media. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI '17). Association for Computing Machinery, New York, NY, USA, 3506–3510. https://doi.org/10.1145/3025453.3025496
[71] Bowen Xu, Zhenchang Xing, Xin Xia, and David Lo. 2017. AnswerBot: Automated Generation of Answer Summary to Developers' Technical Questions. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE 2017). IEEE Press, Piscataway, NJ, USA, 706–716. http://dl.acm.org/citation.cfm?id=3155562.3155650
[72] Xin-Li Yang, David Lo, Xin Xia, Zhi-Yuan Wan, and Jian-Ling Sun. 2016. What Security Questions Do Developers Ask? A Large-Scale Study of Stack Overflow Posts. Journal of Computer Science and Technology 31, 5 (01 Sep 2016), 910–924.
[73] D. Ye, Z. Xing, C. Y. Foo, Z. Q. Ang, J. Li, and N. Kapre. 2016. Software-Specific Named Entity Recognition in Software Engineering Social Content. In 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), Vol. 1. 90–101. https://doi.org/10.1109/SANER.2016.10
[74] Shayan Zamanirad, Boualem Benatallah, Moshe Chai Barukh, Fabio Casati, and Carlos Rodriguez. 2017. Programming Bots by Synthesizing Natural Language Expressions into API Invocations. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE 2017). IEEE Press, 832–837.
[75] H. Zhang, S. Wang, T. Chen, and A. E. Hassan. 2019. Reading Answers on Stack Overflow: Not Enough! IEEE Transactions on Software Engineering (2019), 1–1. https://doi.org/10.1109/TSE.2019.2954319
[76] Qi Zhang, Jinlan Fu, Xiaoyu Liu, and Xuanjing Huang. 2018. Adaptive Co-attention Network for Named Entity Recognition in Tweets. (2018). https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16432