In: Third International Conference, DTGS 2018, St. Petersburg, Russia, May 30 – June 2, 2018, Revised Selected Papers, Part II
Anomaly Detection for Short Texts: Identifying
Whether Your Chatbot Should Switch From
Goal-Oriented Conversation to Chit-Chatting
Amir Bakarov1,2, Vasiliy Yadrintsev2,4, and Ilya Sochenkov2,3
1The National Research University Higher School of Economics, Moscow, Russia
2Federal Research Center ‘Computer Science and Control’ of Russian Academy of
Sciences, Moscow, Russia
3Skolkovo Institute of Science and Technology, Moscow, Russia
4Peoples’ Friendship University of Russia (RUDN University), Moscow, Russia
amirbakarov@gmail.com, vvyadrincev@gmail.com, i.sochenkov@skoltech.ru
Abstract. Goal-oriented conversational agents are systems able to converse with humans in natural language in order to help them reach a certain goal. The number of goals (or domains) about which an agent can converse is limited, and one open issue is to identify whether a user is talking about an unknown domain (in order to report a misunderstanding or to switch to a chit-chatting mode). We argue that this issue can be resolved by treating it as an anomaly detection task, a well-studied problem in machine learning. The scientific community has developed a broad range of methods for this task, but their applicability to short text data has never been investigated. The aim of this work is to compare the performance of six different anomaly detection methods on Russian and English short texts modeling conversational utterances, proposing the first evaluation framework for this task. As a result of the study, we find that a simple threshold on cosine similarity works better than the other methods for both of the considered languages.
Keywords: anomaly detection, novelty detection, conversational agent,
chatbot, distributional semantics, word embeddings
1 Introduction
The task of anomaly detection (also called outlier detection) is to find, in a given set, objects that deviate strongly from the others. Such deviating objects are called anomalies, or outliers [1]. Anomaly detection is usually considered a supervised machine learning task and is similar to classification; the primary difference is that the number of positive (non-deviating) samples in the training set dominates, while the number of negative samples (deviating samples, anomalies) is low (e.g. the training set could contain only 1% anomalous examples). Anomaly detection is often confused with novelty detection. The goal of the latter is essentially the same (to find outlying objects), but there are no anomalous objects in its training set: the model is trained on a single class of objects and learns to find objects that deviate strongly from those seen during training [2]. This is why novelty detection is also called one-class classification.
Neither anomaly detection nor novelty detection is widespread in natural language processing. We are aware of a number of cases where anomaly detection systems work under the hood of recommender systems or of document classification models dealing with large documents [3]. In this work, however, we propose another application of this task to NLP: we apply it to textual data consisting of very short texts (one or two sentences), considering the problem of automated intent classification in conversational agents.
Conversational agents (also called dialog systems) are systems that are able to converse with a human in natural language, imitating a dialogue with a real human being [4]. The usual taxonomy of conversational agents proposes two distinct axes. On the first axis, agents differ in the amount of their world knowledge: there are open-domain bots, which can converse about an unlimited number of domains of human knowledge (sports, science, literature), and closed-domain bots, which support only one or two topics of conversation [5]. On the other axis, conversational agents differ in the purpose of their use: there are so-called general conversation agents, or chatbots, which do not pursue a certain goal of the dialogue and can just chat about anything, and there are goal-oriented conversational agents, which should help a user reach a goal through a short conversation (for example, to order a pizza).
Goal-oriented agents are the main interest of business and industry nowadays, since they help to automate some human work (of a call center, for instance) or to provide a much friendlier interface to certain complicated systems (for example, they can help with searching through FAQs) [6]. Usually such agents are not limited to a single possible goal, so they support different goals in the same domain (or even in different ones): to order a pizza, to reserve a table in a pizza restaurant, and so on. Extending an agent to multiple domains or multiple goals is usually not a hard task: it can be considered a classification problem, which is widely known in the NLP and machine learning community and has been successfully resolved from different perspectives [7].
The main issue arises, however, when one wants to extend a goal-oriented conversational agent into a chatbot, i.e. to implement both behavior models (goal-oriented talk and general talk). In this case, the agent should recognize whether a user wants to reach a certain goal or just wants to talk about something, switching between these modes [8]. Reducing this task to the aforementioned classification problem is unlikely to work, since utterances used in general talk can differ strongly from each other (by domain, by style, and in other respects), and if we consider all general utterances as a single class, it will be a highly heterogeneous one.
We argue that in this case the issue can be treated as the anomaly detection (or even novelty detection) problem described at the start of this section. We have a number of homogeneous objects for which we can generate a training set (in our case, example utterances of a certain domain or goal), and we have objects that could come from an unlimited number of possible domains (utterances of general talk), for which we cannot generate a training set. The former can then be considered normal objects, and the latter anomalous objects. We can try to train a model to distinguish the former from the latter (inducing the anomaly detection task), or try to measure how much a new object deviates from the known ones (inducing the novelty detection task). We can even create a system able to work across multiple domains or goals by using a separate model: the anomaly detection model would separate general utterances from goal-oriented ones, while a second model (a simple multi-class classifier) would classify the goal within a narrower domain.
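This two-stage architecture can be sketched as follows: a novelty detector trained only on known-domain utterances gates a multi-class intent classifier. This is an illustrative sketch, not the paper's actual pipeline; the toy two-dimensional "utterance vectors" and the domain names stand in for real embeddings and real intents.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Toy utterance vectors for two known goal domains ("pizza", "booking").
pizza = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
booking = rng.normal(loc=[3, 3], scale=0.5, size=(50, 2))
X_known = np.vstack([pizza, booking])
y_known = np.array(["pizza"] * 50 + ["booking"] * 50)

# Stage 1: novelty detector trained only on known-domain utterances.
gate = OneClassSVM(nu=0.1, gamma="scale").fit(X_known)

# Stage 2: intent classifier over the known goals.
intents = LogisticRegression().fit(X_known, y_known)

def dispatch(x):
    """Route an utterance vector: chit-chat if novel, else classify the goal."""
    if gate.predict(x.reshape(1, -1))[0] == -1:  # -1 marks an outlier
        return "chit-chat"
    return intents.predict(x.reshape(1, -1))[0]

print(dispatch(np.array([0.1, -0.2])))    # deep inside the pizza cluster
print(dispatch(np.array([10.0, -10.0])))  # far from both clusters
```

A vector far from both training clusters is rejected by the gate and routed to chit-chat, while an in-domain vector passes through to the intent classifier.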
The idea is that such a system should help a user reach a certain goal if the utterance belongs to one of the known domains (defined by the developers), and enable a chit-chatting mode or report a misunderstanding (saying something like I'm not able to help you with this question) otherwise. Conversational agents usually recognize the domain by comparing the semantics of a new utterance with the already known semantics of each of the domains (which could be defined with keywords, for example). Semantic processing is usually performed with semantic modeling approaches such as distributional semantic models. Such models have recently been successful in a broad range of natural language processing tasks, and in this work we also rely on distributional semantics to model the meaning of utterances and to measure the degree of similarity between pairs of them.
So, the main aim of our work is to apply different anomaly detection algorithms to the problem of detecting unknown (off-topic) conversational utterances. To our knowledge, the task of anomaly detection has never before been considered in the natural language processing community from the perspective of conversational agents and short text data, so we consider our work a first step towards exploring the application of anomaly detection methods to short texts. Our main contribution thus consists in comparing and evaluating different anomaly detection methods on a benchmark of short texts.
Another major contribution is the creation of a cross-lingual evaluation benchmark for this task. We propose it for two different languages, Russian and English. We crawl data from Web forums and manually annotate it to create the first datasets for off-topic anomaly detection on short texts. Moreover, in the Russian natural language processing community the task of anomaly detection (as well as novelty detection) has never been considered before, so we are the first to introduce and investigate this task for Russian.
This study is organized as follows. In Section 2 we put our paper in the context of previous work. In Section 3 we extensively describe our dataset, while in Section 4 we describe the setup of our experiments. In Section 5 we present the obtained results and discuss them. Section 6 concludes the paper.
2 Related Work
The roots of the anomaly detection task go back to the 19th century [9], when the task was first formulated. Nowadays the scientific community is aware of a broad range of anomaly detection methods, such as One-Class SVM and Isolation Forest; a survey of existing methods is out of the scope of this work, and the interested reader can consult the survey of modern anomaly detection methods by Chandola et al. [1] or a survey of novelty detection techniques [10].
Although it is a mainstream problem in many fields of machine learning, anomaly detection has rarely been applied to natural language processing. We are aware of only a few works that rely on detecting anomalies in textual data. Baker et al. were the first to propose such a task, considering novelty detection from the perspective of topic detection and tracking [11]. The first extensive work on anomaly detection for textual data addressed document classification through One-Class SVM [12], and it was then extended by Guthrie, who dealt with detecting documents of unusual genre or sentiment in a document collection [13]. Later, Kumaraswamy et al. explored the importance of domain knowledge provided in first-order logic for anomaly detection in textual data [14]. In 2016, Camacho-Collados and Navigli proposed outlier detection in word sets as an evaluation benchmark for word embeddings [15], while Pande and Ahuja were the first to investigate the application of word embeddings to the task of anomaly detection [16].
Thus, most work on anomaly detection for textual data has been performed as part of the document classification task, processing linguistic or stylometric features (like average word length) and sparse vectors. The main difference of our work is that we process short texts and sentences, relying on compositional textual representations obtained as a function of the dense vectors produced by distributional semantic models. Additionally, all of the studies mentioned above considered only English data; we are not aware of any work on anomaly detection for Russian. In other words, our work can be considered a first in several respects.
3 Short Text Anomaly Detection Dataset
We are not aware of any suitable dataset for the anomaly detection task on short texts, so this study presents such a dataset. We expect that it will help other researchers evaluate different anomaly detection techniques on similar tasks. Our dataset consists of English and Russian parts, and so can be called cross-lingual.

Since we approach the anomaly detection task from the perspective of a conversational agent, we built our dataset from real conversational utterances. We considered Web forums the best source of data, because user messages there usually have a conversational style and come as short texts of one or two sentences. Moreover, posts on Web forums are taxonomically separated into different domains (in other words, the range of covered topics is wide), which allows a multi-domain analysis.
To build this dataset, we crawled two collections of posts from among the most popular Web forums in each language: Dvach in Russian (https://2ch.hk) and Reddit in English (https://www.reddit.com)5. Each part consisted of 11 domains of 100 posts each, of which 10 domains were homogeneous (each post in such a domain was topically related to all other posts in the same domain) and 1 was heterogeneous (its posts were not necessarily topically related to each other). In other words, the 10 homogeneous domains modeled the known class and the heterogeneous domain modeled the anomalous (or novel) class.
We used the already defined Web forum taxonomy of domains to assign a domain to each post (each post could have only one domain). To create the assessments for English, we used the subreddit structure of Reddit, which is organized as a pool of topical sub-forums dedicated to discussions of a certain topic. We sampled 10 different subreddits, trying to pick domains as diverse as possible, and then sampled 100 posts from each one:
- r/science (discussions on news in science and technology)
- r/politics (discussions on political news and events)
- r/askhistorians (discussions on historical science)
- r/space (discussions on news related to space and astronomy)
- r/minecraft (discussions on the Minecraft video game)
- r/sex (discussions on sexual activities)
- r/guns (discussions on guns and pistols)
- r/food (discussions on food and cooking)
- r/music (discussions on different artists, genres and news of the music industry)
- r/motorcycles (discussions on motorcycles)
To create the anomalous class we used the jokes subreddit (r/jokes), which is expected to contain short texts not limited to a single domain (so there could be jokes about politics, about school, and so on).
As for the Russian part, we used the hierarchical taxonomy of Dvach, which proposes different subforums (dubbed "boards") split into threads discussing narrower topics. Following the methodology above, we picked the threads whose domains were, in our opinion, the most diverse:

- Greek Literature (discussions on the literature of ancient Greece)
- Borussia Dortmund (discussions on the Borussia Dortmund football club)
- Coffee (discussions on coffee)
- Java (discussions on the Java programming language)
- Fountain Pens (discussions on fountain pens)
- Bread Bakery (discussions on bread baking)
- Hairstyles (discussions on hairstyles)
- Keyboards (discussions on computer keyboards)
- Higher School of Economics (discussions on the National Research University Higher School of Economics)
- macOS (discussions on the Macintosh operating system)

To create the anomalous class we used randomly sampled posts from the /b board, which is not limited to a single topic of discussion and allows conversations on every possible topic.

5 The collection of Reddit posts is based on an already crawled corpus available at https://github.com/linanqiu/reddit-dataset
We also asked three bilingual Russian-English volunteers to check the accuracy of the automated assessments, i.e. whether each post actually belongs to its assigned domain. If an assessor marked a domain assignment as incorrect, we re-sampled the post from that domain and re-checked it with all three assessors. We did this iteratively until all 100 posts in each of the 10 domains were considered correct by all assessors. All in all, each part of our cross-lingual dataset consists of 1,100 posts, so the whole dataset contains 2,200 posts.
4 Experimental Setting
In this work we testes applicability of the following techniques of anomaly de-
tection (they were mentioned but not compared by performance at [10]):
1. One-Class Support Vector Machine. Draws a soft boundary around the training objects, considering all objects falling outside the boundary as anomalous [12].
2. Isolation Forest. Isolates objects of the dataset by sampling features for which a threshold value is randomly selected between the maximum and minimum values of that feature. Objects falling outside the threshold are considered anomalies [17].
3. Local Outlier Factor. Computes the local density deviation of every object in the dataset, considering as anomalies the samples that have a substantially lower density than their neighbors [18].
4. Threshold for Standard Deviation of Classifier's Predictions. The idea is to train a multi-class classifier (we used logistic regression in this study) on the "normal" classes, and then for each new object compute the standard deviation of the classifier's predicted class probabilities. If the standard deviation is lower than a threshold (which should be defined in advance), the object is labeled as anomalous.
5. Threshold for Distance to Topical Keywords. The idea is very similar to the previous method, but here we create a set of references (bags of keywords) for each class of the training data. Each reference is generated as a set of keywords describing the topic of a set of short texts, obtained through a topic modeling technique (we used LDA in this study [19]). For each new object we then compute the similarity (we used the cosine measure) between the new object and every reference; if the similarity to each reference is lower than the threshold, the object is labeled as anomalous.
6. Threshold for Reconstruction Error. The idea is to train an autoencoder [20] on the training data and, for each new object, compute the reconstruction error between the regression target and the actual value. Objects whose error is higher than the threshold are considered anomalies.
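Method 4 can be sketched as follows, with toy two-dimensional feature vectors standing in for real utterance embeddings and an illustrative threshold value (not the one tuned in the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Toy utterance vectors for three known ("normal") domains.
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([rng.normal(c, 0.4, size=(40, 2)) for c in centers])
y = np.repeat([0, 1, 2], 40)

clf = LogisticRegression(max_iter=1000).fit(X, y)

def is_anomalous(x, threshold=0.25):
    """Label x anomalous when the classifier is uncertain, i.e. when the
    standard deviation of its predicted class probabilities is low."""
    probs = clf.predict_proba(x.reshape(1, -1))[0]
    return float(np.std(probs)) < threshold

print(is_anomalous(np.array([0.0, 0.1])))  # deep inside a known domain
print(is_anomalous(np.array([2.0, 2.0])))  # equidistant from all three centers
```

Methods 1-3 have ready-made implementations in scikit-learn (`OneClassSVM`, `IsolationForest`, and `LocalOutlierFactor` with `novelty=True`), so in practice only the thresholding methods require custom code along these lines.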
In our experiments the data was lemmatized and stripped of stop-words using the NLTK stop-word lists [21]. As morphological analyzers, we used pymorphy2 [22] for Russian and UDPipe [23] for English.
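The preprocessing step amounts to lemmatizing each token and dropping stop-words. In this minimal sketch, a tiny hand-written stop-word list stands in for NLTK's, and a toy lookup table stands in for a real morphological analyzer such as pymorphy2 or UDPipe:

```python
# Hand-written stand-ins (illustrative only, not NLTK's or pymorphy2's data).
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and"}
LEMMAS = {"ordering": "order", "pizzas": "pizza"}  # hypothetical toy table

def preprocess(text):
    """Lowercase, lemmatize via the toy table, and drop stop-words."""
    tokens = text.lower().split()
    tokens = [LEMMAS.get(t, t) for t in tokens]
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("Ordering a couple of pizzas"))  # ['order', 'couple', 'pizza']
```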
To obtain representations of the short texts (conversational utterances), we used the averaged vectors of all the words in a sentence, considering this the most effective and robust approach to obtaining compositional distributional representations (out-of-vocabulary words were dropped) [24]. To obtain the word embeddings we used two Word2Vec models, one for each language [25]: one trained on the Russian data of Dvach [26] and one trained on an English news corpus. For each method, the best hyperparameters were tuned on each dataset by grid search; the code, datasets and links to the models are available on our GitHub6.
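The averaging scheme can be sketched as follows; the tiny embedding table here is a stand-in for a trained Word2Vec model:

```python
import numpy as np

# Toy embedding table standing in for a trained Word2Vec model.
EMB = {
    "order": np.array([0.9, 0.1]),
    "pizza": np.array([0.8, 0.3]),
    "rocket": np.array([0.1, 0.9]),
}

def sentence_vector(tokens):
    """Average the vectors of in-vocabulary words; OOV words are dropped."""
    vecs = [EMB[t] for t in tokens if t in EMB]
    if not vecs:
        return np.zeros(2)  # no known words at all
    return np.mean(vecs, axis=0)

print(sentence_vector(["order", "pizza", "tonight"]))  # "tonight" is OOV
```

With a real model, `EMB` would be replaced by the model's vocabulary lookup, and the dimensionality would be that of the trained embeddings rather than 2.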
5 Results
Method                                  Dvach   Reddit
One-Class SVM                           0.50    0.53
Isolation Forest                        0.47    0.47
Local Outlier Factor                    0.47    0.47
Threshold for Standard Deviation        0.68    0.71
Threshold for Distance to Keywords      0.50    0.50
Threshold for Reconstruction Error      0.50    0.50

Table 1: Performance of each of the compared methods on each dataset, measured in accuracy.
For evaluation we used the whole anomalous class and 100 posts from the shuffled data of the other classes; the remaining posts (900 posts from 10 domains) were used for training the models. Quantitative results of the comparison are presented in Table 1, where we report the accuracy of each compared method on each dataset (we use this measure since the classes are balanced, and we are not particularly interested in high precision or recall, but rather in a general performance score). It is observable that a simple thresholding technique, the threshold on the standard deviation of the classifier's predictions, had the highest score. We could explain this by the fact that the other methods are not capable of working with high-dimensional data where it is unknown which vector components are actually significant.

6 https://github.com/bakarov/conversational-anomaly
To analyze the obtained results more thoroughly, we made a graphical representation using t-SNE [27]. We projected the whole dataset into a two-dimensional space, marking anomalies with red points and normal objects with black points. These visualizations illustrate how well each method works, comparing each picture with the gold-standard topology shown in the leftmost picture.

Fig. 1: Topology of anomalous (red) and normal (black) data in a two-dimensional space for both datasets. The leftmost column of pictures shows the gold-standard topology that should be obtained; the others illustrate the topology predicted by each method.
The figures suggest that it is hard to visually distinguish separate clusters of classes (perhaps only a few of them), and that the anomalous data is strongly mixed with the normal data in both datasets. We think this could explain the low results of all the employed methods. However, it is visually observable that the autoencoder marked data far from the centroid as anomalous, so it can be concluded that it could work better if the normal and anomalous classes were not so strongly mixed.
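A visualization of this kind can be produced along the following lines; random vectors stand in for the utterance embeddings, and the plotting itself (e.g. with matplotlib, coloring the projected points by label) is omitted:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(2)
normal = rng.normal(0, 1, size=(50, 20))     # stand-ins for "normal" vectors
anomalies = rng.normal(4, 1, size=(10, 20))  # clearly shifted "anomalies"
X = np.vstack([normal, anomalies])

# Project the high-dimensional vectors down to 2-D for plotting.
proj = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(X)
print(proj.shape)  # one 2-D point per utterance vector
```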
6 Conclusions
In this study we investigated the applicability of mainstream methods for anomaly
and novelty detection to the data consisting of short texts, and created two pub-
licly available datasets for off-topic anomaly detection based on data crawled
from Russian and English web forums. We compared different techniques for
anomaly detection on these datasets, concluding that using threshold on SD of
predictions of a metric classifier is the most efficient method on our datasets.
The proposed survey and obtained results could help researchers and indus-
trial teams to improve their results in making the most efficient architecture of
a conversational agent. Another important contribution is that out work is the
first towards the task of anomaly detection applied to textual data in Russian.
We think that obtained results could be called interesting and worth further
investigation. In the future we plan to extend this work, considering different
approaches to distributional compositionality. We also want to more properly
explore impact of different features of used word embeddings model (window
size, training algorithm, pick of a corpus, and so on), proposing the task of
anomaly detection as an extrinsic evaluation benchmark for word embeddings.
Extending the dataset to other languages (especially the low-resource ones, like
Chuvash) also goes to our plans.
Acknowledgements
We thank three anonymous reviewers for helpful and attentive reviews. We also
thank our colleague, Andrey Kutuzov, for productive discussion on this paper.
This study was supported by the Ministry of Education and Science of the
Russian Federation (grant 14.756.31.0001) and partially funded by RFBR ac-
cording to the research project 15-29-06031.
References
1. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: A survey. ACM
Comput. Surv. 41 (2009) 15:1–15:58
2. Markou, M., Singh, S.: Novelty detection: a review—part 1: statistical approaches.
Signal processing 83(12) (2003) 2481–2497
3. Guthrie, D., Guthrie, L., Allison, B., Wilks, Y.: Unsupervised anomaly detection.
In: IJCAI. (2007) 1624–1628
4. Lester, J., Branting, K., Mott, B.: Conversational agents. The Practical Handbook
of Internet Computing (2004) 220–240
5. Chen, H., Liu, X., Yin, D., Tang, J.: A survey on dialogue systems: Recent advances
and new frontiers. arXiv preprint arXiv:1711.01731 (2017)
6. Cui, L., Huang, S., Wei, F., Tan, C., Duan, C., Zhou, M.: Superagent: A cus-
tomer service chatbot for e-commerce websites. Proceedings of ACL 2017, System
Demonstrations (2017) 97–102
7. Venkatesh, A., Khatri, C., Ram, A., Guo, F., Gabriel, R., Nagar, A., Prasad, R.,
Cheng, M., Hedayatnia, B., Metallinou, A., et al.: On evaluating and comparing
conversational agents. arXiv preprint arXiv:1801.03625 (2018)
8. Mathur, V., Singh, A.: The rapidly changing landscape of conversational agents.
arXiv preprint arXiv:1803.08419 (2018)
9. Edgeworth, F.Y.: XLI. On discordant observations. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 23(143) (1887) 364–375
10. Pimentel, M.A., Clifton, D.A., Clifton, L., Tarassenko, L.: A review of novelty
detection. Signal Processing 99 (2014) 215–249
11. Baker, L.D., Hofmann, T., McCallum, A., Yang, Y.: A hierarchical probabilistic
model for novelty detection in text. In: Proceedings of International Conference
on Machine Learning. (1999)
12. Manevitz, L.M., Yousef, M.: One-class SVMs for document classification. Journal of Machine Learning Research 2 (2001) 139–154
13. Guthrie, D.: Unsupervised Detection of Anomalous Text. PhD thesis, Citeseer
(2008)
14. Kumaraswamy, R., Wazalwar, A., Khot, T., Shavlik, J.W., Natarajan, S.: Anomaly
detection in text: The value of domain knowledge. In: FLAIRS Conference. (2015)
225–228
15. Camacho-Collados, J., Navigli, R.: Find the word that does not belong: A frame-
work for an intrinsic evaluation of word vector representations. In: Proceedings
of the 1st Workshop on Evaluating Vector-Space Representations for NLP. (2016)
43–50
16. Pande, A., Ahuja, V.: Weac: Word embeddings for anomaly classification from
event logs. In: Big Data (Big Data), 2017 IEEE International Conference on,
IEEE (2017) 1095–1100
17. Liu, F.T., Ting, K.M., Zhou, Z.H.: Isolation forest. In: Data Mining, 2008.
ICDM’08. Eighth IEEE International Conference on, IEEE (2008) 413–422
18. Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J.: LOF: Identifying density-based local outliers. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2000) 93–104
19. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. Journal of machine
Learning research 3(Jan) (2003) 993–1022
20. Ng, A.: Sparse autoencoder. CS294A Lecture notes 72(2011) (2011) 1–19
21. Bird, S.: Nltk: the natural language toolkit. In: Proceedings of the COLING/ACL
on Interactive presentation sessions, Association for Computational Linguistics
(2006) 69–72
22. Korobov, M.: Morphological analyzer and generator for russian and ukrainian
languages. In: International Conference on Analysis of Images, Social Networks
and Texts, Springer (2015) 320–332
23. Straka, M., Hajic, J., Straková, J.: Udpipe: Trainable pipeline for processing conll-
u files performing tokenization, morphological analysis, pos tagging and parsing.
In: LREC. (2016)
24. Li, B., Liu, T., Zhao, Z., Tang, B., Drozd, A., Rogers, A., Du, X.: Investigating
different syntactic context types and context representations for learning word
embeddings. In: Proceedings of the 2017 Conference on Empirical Methods in
Natural Language Processing. (2017) 2421–2431
25. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed repre-
sentations of words and phrases and their compositionality. In: Advances in neural
information processing systems. (2013) 3111–3119
26. Bakarov, A., Gureenkova, O.: Automated detection of non-relevant posts on the
russian imageboard “2ch”: Importance of the choice of word representations. In: In-
ternational Conference on Analysis of Images, Social Networks and Texts, Springer
(2017) 16–21
27. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9 (2008) 2579–2605