A SURVEY OF TEXT MINING METHODS APPLIED ON CONVERSATIONAL DATA
Gloria Hristova
Sofia University “St. Kliment Ohridski”, Faculty of Economics and Business Administration
Abstract
Social media and a wide variety of instant messaging platforms now produce massive amounts of
textual data in the form of chats, which contain valuable observations on people’s expressed
views, beliefs, sentiments and opinions. The current study is motivated by the need to process
this type of data efficiently and to utilize it better. This paper provides a literature review of
text mining methods applied to chat data in order to extract insights that can be valuable for
both business and the social sciences. Results are presented as a structured summary of key
characteristics of the reviewed methods (for example, the algorithms applied for text processing
and modeling and the semantic resources used), followed by a discussion. The current study not
only summarizes state-of-the-art methods for chat analysis but also helps to outline gaps in the
literature, thus supporting further research in this emerging field.
Key words: text mining, chat analysis, dialogue data, NLP, topic modeling, text classification
1. Introduction
Advances in modern technology during the last decade have dramatically changed communication
between people, shifting it from mostly spoken to largely written. Along with social media, recent
years have seen a boom in the development of platforms that allow users to send instant messages
to their personal network. Businesses wasted no time in adopting these technologies in order to
stay close to their customers and interact with them on a regular basis. The result of instant
messaging communication (informally referred to as “chatting”) is a vast amount of textual data
that contains important observations on people’s expressed views, beliefs, sentiments and opinions.
Companies have started to realize the significance of chat data and that it can be used to extract
valuable insights into customers’ attitudes towards the brand, the issues they have experienced and
their overall expectations of the provided goods and services. However, the volume of
conversational data usually does not allow for effective and thorough manual analysis. To overcome
this obstacle, various text mining methods can be used to automate knowledge extraction from
textual data and its manipulation.
In addition to adopting modern channels of communication, more innovative companies increasingly
use dialogue systems to automate communication with clients and to deal more efficiently with
their everyday issues and frequently asked questions. Chat data plays a central role in the
development of dialogue systems because it not only carries important information about basic
patterns in communication (in terms of general conversational flow) but also serves as a source of
training data. In light of the above, conversational data in the form of chats is an important
asset for companies and can help them achieve more than one of their strategic goals. Customer
relationship management (CRM), automation, quality assurance, call centers and innovation are all
examples of business functions that can benefit greatly from the manipulation and analysis of chat
data generated between the company and its customers.
The vast majority of research articles whose empirical subject is chat data focus mainly on how it
can be used to develop chatbots. Few articles, by contrast, view chats not solely as training data
and instead focus on how to manipulate and analyze this type of data effectively in order to
extract knowledge. This gap motivates the current study, which has the following main aims:
discovering up-to-date techniques for text analysis of chat data, outlining existing challenges in
the pre-processing of such data, and identifying business problems that can be addressed by using
it. To accomplish these aims, a survey of recent research in the field is carried out and key
aspects of the reviewed approaches and methods are summarized. To the author’s knowledge, there
have so far been no documented attempts to review text mining methods applied to conversational
data. The current survey is not intended to be exhaustive. However, it summarizes some of the main
up-to-date methods for chat analysis, and its findings can be used to outline gaps in the
literature, thus supporting further research in the field.
2. Results – Summary of findings documented in the literature
A summary of the findings of the current study is presented in Table 1. Only research published in
2016 or later is included in the literature review. Initially, a keyword search yielded around 60
papers in which the object of the empirical study is conversational data. After the irrelevant
ones were filtered out (those analyzing conversation transcriptions or focusing on issues other
than knowledge extraction and statistical analysis of chat data), 23 research papers remained for
detailed review. Each article is evaluated according to the key aspects specified in the first row
of Table 1. A discussion of some of the major conclusions drawn from the current survey follows.
3. Discussion
The literature review of research on chat data analysis shows that the problems addressed and the
application areas of studies in the field are very diverse. Examples include automation of
processes in call centers, boosting the efficiency of dialogue systems, enhancing the customer
experience in online communication and entertainment platforms, fraud and crime detection (such as
online identity theft), and improving education through innovative techniques that ease the
process of learning. Clearly, not all of the studies have arisen solely out of business needs;
some have socially beneficial purposes. More innovative experiments have also emerged recently,
such as that of Laddha, Hanoosh, and Mukherjee (2019), in which a sticker recommender system for
an instant messaging platform is developed with the aim of saving typing time by replacing the
user’s message with relevant, expressive stickers. It should be noted that most of the research
focuses on applying analytical techniques with the main aim of improving conversational agents in
various respects. For dialogue systems in corporate use, the goal is usually to improve
communication with clients and thus increase their overall engagement and satisfaction. Roy et al.
(2016) develop a quality assurance system for call center dialogues, while Oraby, Gundecha,
Mahmud, Bhuiyan, and Akkiraju (2017) discover interesting patterns between call center operators’
actions and customer satisfaction level. There are also attempts to create more intelligent
conversational agents. For example, Inaba and Takahashi (2018) develop an analytical tool that
calculates users’ interests from their utterances. The aim is to help personalize dialogue systems
by focusing the dialogue on topics that interest the user while avoiding unappealing ones. The
results of Chen, Hsu, Kuo, and Ku (2018) are also valuable for the development of emotionally
intelligent conversational agents with more human-like responses.
Table 1: Summary of key characteristics of research articles included in the survey.
The main objectives of a study determine the choice of methods employed in it. Based on the
general text mining methods applied to chat data, the research articles in Table 1 can be divided
into three broad groups, each focusing mainly (but not solely) on one of the following:
descriptive analysis, topic modeling, or text classification. Studies in the first group are
generally devoted to becoming familiar with the characteristics and peculiarities of the data,
exploring ways to process it efficiently, and identifying which semantic resources can be used and
which problems can arise when dealing with this type of data. For example, Rungruangthum and Todd
(2017) examine linguistic characteristics of chat communication that can signal that one chat
party is trying to mislead the other. Research focusing on topic discovery methods applied to chat
data in different domains generally has the following objectives: tracking topics or expressed
interests over time, detecting specific topics occurring in chats, and linking the emergence of
specific topics to particular points in time or events. For example, Musabirov, Bulygin, Okopny,
and Konstantinova (2018) apply topic modeling techniques to streams of chat data. Carnein,
Assenmacher, and Trautmann (2017) develop an algorithm for real-time stream clustering of chat
messages, which is implemented in the R stream library and can be applied to texts of different
lengths, languages, and content.
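To make the idea of clustering a stream of chat messages more concrete, a minimal sketch is given below. It is not the algorithm of Carnein, Assenmacher, and Trautmann (2017). It is a simplified single-pass, threshold-based scheme in which each incoming message either joins the most similar existing cluster or opens a new one. The function names, the similarity threshold and the sample messages are assumptions made purely for illustration.

```python
import re
from collections import Counter


def tokenize(message):
    """Lowercase a message and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", message.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_stream(messages, threshold=0.3):
    """Assign each incoming message to the most similar cluster, or open a new one.

    The threshold of 0.3 is an arbitrary illustrative value, not a recommendation.
    """
    centroids = []  # one aggregated term-frequency Counter per cluster
    labels = []     # cluster index of every message, in arrival order
    for message in messages:
        vector = Counter(tokenize(message))
        best, best_sim = None, 0.0
        for idx, centroid in enumerate(centroids):
            sim = cosine(vector, centroid)
            if sim > best_sim:
                best, best_sim = idx, sim
        if best is not None and best_sim >= threshold:
            centroids[best].update(vector)  # merge the message into the cluster
            labels.append(best)
        else:
            centroids.append(vector)        # start a new cluster
            labels.append(len(centroids) - 1)
    return labels


# Invented sample messages: two about payments, two about a sports match.
stream = [
    "my card payment failed again",
    "the payment with my card failed",
    "great goal by the home team",
    "what a goal for the team",
]
print(cluster_stream(stream))  # [0, 0, 1, 1]
```

Because every message is processed once and only the cluster summaries are kept, a scheme of this kind can in principle run on an unbounded stream. Real stream clustering methods typically add mechanisms that this sketch omits, such as fading out old clusters.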
The last group of research articles, again defined by the main text mining methods applied, is the
one in which text classification techniques are used, usually with the goal of automating
different processes. Examples include discovering particular types of behavior among chat
participants, tracking conversational flow, estimating user interest in the chat dialogue, emotion
detection, and quality control. In this regard, intent classification proves to be one of the most
commonly tackled problems. To stimulate research in the field, Qu et al. (2018) create a dataset
of dialogues in which utterances are annotated with user intent. Similarly, Chen, Hsu, Kuo, and Ku
(2018) create a chat corpus annotated with emotion at the utterance level, which allows modeling
the sequence of emotions expressed in a dialogue. On the subject of emotion detection, the study
of Chen, Lee, and Huang (2018) should also be mentioned. The authors take a somewhat different
approach to sentiment analysis: they use a valence-arousal emotion space model in which emotions
are represented in Euclidean space as a function of valence and emotional intensity. In this way,
more diverse emotional states can be described than with the ordinary polarity model. The ultimate
aim of the study is to continuously track and analyze users’ emotions while they are chatting.
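The difference between a polarity label and a point in a valence-arousal space can be illustrated with a small sketch. The coordinates and emotion labels below are hypothetical values chosen for illustration and are not taken from Chen, Lee, and Huang (2018). The sketch only shows how a two-dimensional point can be mapped to the nearest labeled emotion by Euclidean distance, and how a sequence of such points yields a trace of emotions over a conversation.

```python
import math

# Hypothetical (valence, arousal) coordinates in [-1, 1]; illustrative values only.
EMOTION_COORDINATES = {
    "excited": (0.7, 0.7),
    "calm": (0.6, -0.6),
    "angry": (-0.7, 0.7),
    "sad": (-0.6, -0.6),
}


def nearest_emotion(valence, arousal):
    """Return the labeled emotion closest to the point in Euclidean distance."""
    return min(
        EMOTION_COORDINATES,
        key=lambda name: math.dist((valence, arousal), EMOTION_COORDINATES[name]),
    )


def track_emotions(points):
    """Map a sequence of (valence, arousal) estimates to emotion labels."""
    return [nearest_emotion(v, a) for v, a in points]


# Invented per-message estimates for one user during a conversation.
conversation = [(0.5, 0.8), (-0.2, 0.6), (-0.8, -0.4)]
print(track_emotions(conversation))  # ['excited', 'angry', 'sad']
```

Note that "angry" and "sad" would both collapse into a single negative class under a plain polarity model, whereas the second dimension keeps them apart.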
When it comes to natural language processing, the domain (source) and language of textual data are
of crucial importance, since they determine the choice of suitable pre-processing techniques, the
available semantic resources and more. It is therefore not surprising that more than half of the
reviewed articles use chat data in English, the most common language choice in text mining
research. However, there are also attempts to analyze dialogues in other languages. Furthermore,
many of the studies aim at improving dialogue systems for corporate use, but surprisingly few of
them use chat data generated in a call center or a similar source within a given company. A
possible reason is that such data is less available due to privacy issues. The most common sources
of chat data used in research turn out to be social networks and chat rooms.
Another finding worth mentioning is that a key differentiator among research articles analyzing
conversational data is the number of participants in the analyzed communication (specifically,
whether there are two or many). Chats with multiple participants may require different
pre-processing techniques. For example, segmentation into threads is often necessary when multiple
participants are chatting in one place about various topics. An example of a research article
dealing with the problems that arise from many participants in conversational data is the study of
Wang, Huang, and Gan (2016).
The current survey provides valuable insights into the usage and performance of different learning
algorithms and text pre-processing techniques applied to chat data. Exploring Table 1 in detail
allows a more precise analysis and evaluation of all the text mining methods found in the
literature. One general observation is that word embeddings have been used more extensively in
chat analysis since 2018 and that deep learning solutions tend to replace traditional machine
learning methods for text classification. Indeed, some studies aim to compare the performance of
such traditional techniques (among which Naïve Bayes and SVM prove to be the most popular) with
that of deep learning models. As for topic modeling methods, the current survey suggests that LDA
(Latent Dirichlet Allocation) is among the most widely adopted. Broadly speaking, the key
challenges in analyzing conversational data arise from its noisy nature. Some of the fundamental
pre-processing techniques that are almost unavoidable are anonymization, spell-check and
abbreviation normalization.
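As an illustration of the traditional techniques mentioned above, the following is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing, applied to intent classification of short chat utterances. It does not reproduce the setup of any reviewed study. The class name, the training utterances and the intent labels are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase an utterance and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # number of utterances per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        # Tokens never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(doc_count / total_docs)  # log prior
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            scores[label] = score
        return max(scores, key=scores.get)


# Invented training utterances with hypothetical intent labels.
train_texts = [
    "where is my order",
    "track my order please",
    "i want a refund",
    "refund my money now",
]
train_labels = ["order_status", "order_status", "refund", "refund"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("please track the order"))  # order_status
print(model.predict("i need my money back"))    # refund
```

A comparison of the kind described in the surveyed studies would train a model like this and a deep learning model on the same annotated corpus and evaluate both on held-out data.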
In general, a drawback of the text analytics field is its dependency on semantic resources for
accomplishing different natural language processing tasks. Findings from the current survey
support this observation. Most of the research on conversational data depends heavily on semantic
resources, not only in the pre-processing phase but also in the statistical analysis. This makes
experiments difficult to replicate, since similar semantic resources may not be available for the
target language. Moreover, many manual tasks (for example, spell check or annotation) are
performed in the experiments. Future research should be devoted to exploring new methods that
overcome the difficult and cumbersome manual tasks during text pre-processing and ease the whole
analytical process.
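Some of the pre-processing steps named in this survey, such as anonymization and abbreviation normalization, can be partly automated with simple rules. The sketch below is one possible rule-based approach and is not drawn from any of the reviewed studies. The regular expressions, the placeholder tokens and the abbreviation dictionary are hypothetical, and the dictionary itself is exactly the kind of language-specific resource whose availability is discussed above. Spell-checking is not covered.

```python
import re

# Hypothetical abbreviation dictionary; a real one would be domain- and language-specific.
ABBREVIATIONS = {
    "pls": "please",
    "thx": "thanks",
    "u": "you",
    "asap": "as soon as possible",
}

# Deliberately simple patterns that will miss many real-world formats.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_PATTERN = re.compile(r"\+?\d[\d\s-]{6,}\d")


def anonymize(message):
    """Replace e-mail addresses and phone-like digit sequences with placeholders."""
    message = EMAIL_PATTERN.sub("<EMAIL>", message)
    return PHONE_PATTERN.sub("<PHONE>", message)


def normalize_abbreviations(message):
    """Expand whole-word abbreviations found in the dictionary."""
    def expand(match):
        word = match.group(0)
        return ABBREVIATIONS.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", expand, message)


def preprocess(message):
    """Anonymize first, so that personal data is removed before any other step."""
    return normalize_abbreviations(anonymize(message))


raw = "pls call me asap on +359 888 123 456 or mail jane.doe@example.com thx"
print(preprocess(raw))
# please call me as soon as possible on <PHONE> or mail <EMAIL> thanks
```

Rules like these reduce manual effort but do not remove it, since the patterns and the dictionary still have to be built and maintained for each language and domain.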
Almost none of the articles provide a practical implementation of the discussed techniques in the
form of an analytical tool that can be employed to automate text analysis on a daily basis. One
exception is Roy et al. (2016), who developed an end-to-end real-time quality assurance system for
contact centers with an interactive dashboard providing views of ongoing dialogues at different
granularity. The current survey leads to the conclusion that more has to be done towards the
development of comprehensive analytical tools for chat data analysis and the adoption of effective
visualization techniques in experiments.
4. Acknowledgements
The presentation and dissemination of these research results are supported in part by Sofia
University Science Fund Project 80-10-71/12.04.2019 "Empirical Study of the Adaptive Market
Hypothesis through the Lens of Behavioral Finance".
5. References
Adanir, G. A. (2019). Detecting Topics of Chat Discussions in A Computer Supported
Collaborative Learning (CSCL) Environment. Turkish Online Journal of Distance Education,
20(1), 96-114.
Carnein, M., Assenmacher, D., & Trautmann, H. (2017, November). Stream clustering of chat
messages with applications to twitch streams. In International Conference on Conceptual
Modeling (pp. 79-88). Springer, Cham.
Chatterjee, A., Gupta, U., Chinnakotla, M. K., Srikanth, R., Galley, M., & Agrawal, P. (2019).
Understanding emotions in text using deep learning and big data. Computers in Human Behavior,
93, 309-317.
Chen, C. H., Lee, W. P., & Huang, J. Y. (2018). Tracking and recognizing emotions in short text
messages from online chatting services. Information Processing & Management, 54(6), 1325-
1344.
Chen, S. Y., Hsu, C. C., Kuo, C. C., & Ku, L. W. (2018). Emotionlines: An emotion corpus of
multi-party conversations. arXiv preprint arXiv:1802.08379.
Damnati, G., Guerraz, A., & Charlet, D. (2016, May). Web chat conversations from contact
centers: a descriptive study. Proceedings of the Tenth International Conference on Language
Resources and Evaluation (LREC 2016) (pp. 2017-2021).
Ebrahimi, M., Suen, C. Y., Ormandjieva, O., & Krzyzak, A. (2016). Recognizing predatory chat
documents using semi-supervised anomaly detection. Electronic Imaging, 2016(17), 1-9.
Inaba, M., & Takahashi, K. (2018, July). Estimating User Interest from Open-Domain Dialogue.
In Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue (pp. 32-40).
Laddha, A., Hanoosh, M., & Mukherjee, D. (2019). Understanding Chat Messages for Sticker
Recommendation in Hike Messenger. arXiv preprint arXiv:1902.02704.
Maity, S. (2019, June). TDBot at SemEval-2019 Task 3: Context Aware Emotion Detection Using
A Conditioned Classification Approach. In Proceedings of the 13th International Workshop on
Semantic Evaluation (pp. 335-339).
Musabirov, I., Bulygin, D., Okopny, P., & Konstantinova, K. (2018). Between an arena and a
sports bar: Online chats of esports spectators. arXiv preprint arXiv:1801.02862.
Nasr, A., Damnati, G., Guerraz, A., & Bechet, F. (2016, September). Syntactic parsing of chat
language in contact center conversation corpus. In Proceedings of the 17th Annual Meeting of the
Special Interest Group on Discourse and Dialogue (pp. 175-184).
Oraby, S., Gundecha, P., Mahmud, J., Bhuiyan, M., & Akkiraju, R. (2017, March). How May I
Help You?: Modeling Twitter Customer Service Conversations Using Fine-Grained Dialogue
Acts. Proceedings of the 22nd International Conference on Intelligent User Interfaces (pp. 343-
355). ACM.
Ozeran, M., & Martin, P. (2019). “Good Night, Good Day, Good Luck”. Information Technology
and Libraries, 38(2), 49-57.
Pronoza, E., Pronoza, A., & Yagunova, E. (2018, October). Extraction of Typical Client Requests
from Bank Chat Logs. Mexican International Conference on Artificial Intelligence (pp. 156-164).
Cham: Springer.
Qu, C., Yang, L., Croft, W. B., Trippas, J. R., Zhang, Y., & Qiu, M. (2018, June). Analyzing and
characterizing user intent in information-seeking conversations. In The 41st International ACM
SIGIR Conference on Research & Development in Information Retrieval (pp. 989-992). ACM.
Qu, C., Yang, L., Croft, W. B., Zhang, Y., Trippas, J. R., & Qiu, M. (2019, March). User intent
prediction in information-seeking conversations. In Proceedings of the 2019 Conference on
Human Information Interaction and Retrieval (pp. 25-33). ACM.
Roy, S., Mariappan, R., Dandapat, S., Srivastava, S., Galhotra, S., & Peddamuthu, B. (2016,
March). QART: A system for real-time holistic quality assurance for contact center dialogues.
Thirtieth AAAI conference on artificial intelligence.
Rungruangthum, M., & Todd, R. W. (2017). Differences in Language Used by Deceivers and
Truth-Tellers in Thai Online Chat. Journal of the Southeast Asian Linguistics Society, 10(2), 90-
114.
Shibani, A., Koh, E., Lai, V., & Shim, K. J. (2017). Assessing the language of chat for teamwork
dialogue. Journal of Educational Technology & Society, 20(2), 224-237.
Shrivastava, S., & Singh, P. (2016). Impostor Detection Through Chat Analysis. Procedia
Computer Science, 89, 540-548.
Wang, T., Huang, Z., & Gan, C. (2016). On mining latent topics from healthcare chat logs.
Journal of biomedical informatics, 61, 247-259.
Willis, A., Evans, A., Kim, J. H., Bryant, K., Jagvaral, Y., & Glass, M. (2017). Identifying domain
reasoning to support computer monitoring in typed-chat problem solving dialogues. Journal of
Computing Sciences in Colleges, 33(2), 11-19.