Probabilistic Word Association for Dialogue Act Classification with Recurrent Neural Networks: 19th International Conference, EANN 2018, Bristol, UK, September 3-5, 2018, Proceedings
Probabilistic Word Association for Dialogue Act
Classification with Recurrent Neural Networks
Nathan Duran [0000-0001-6084-4406] and Steve Battle [0000-0002-7154-7869]
University of the West of England,
Coldharbour Ln, Bristol BS16 1QY
{nathan.duran,steve.battle}@uwe.ac.uk
Abstract. The identification of Dialogue Acts (DAs) is an important aspect in determining the meaning of an utterance for many applications that require natural language understanding, and recent work using recurrent neural networks (RNNs) has shown promising results when applied to the DA classification problem. This work presents a novel probabilistic method of utterance representation and describes an RNN sentence model for out-of-context DA classification. The utterance representations are generated from keywords selected for their frequency of association with certain DAs. The proposed probabilistic representations are applied to the Switchboard DA corpus and performance is compared with pre-trained word embeddings using the same baseline RNN model. The results indicate that the probabilistic method achieves 75.48% overall accuracy, an improvement of 1.8% over the word embedding representations. This demonstrates the potential utility of statistical utterance representations that capture word-DA relationships for the purpose of DA classification.
Keywords: Dialogue Acts · Neural Networks · Probabilistic
1 Introduction
The notion of a Dialogue Act (DA) originated from John Austin’s ‘illocutionary
act’ theory [1] and was later developed by John Searle [25], as a method of
defining the semantic content and communicative function of a single utterance
of dialogue. The utility of DAs as a set of labels for a semantic interpretation
of a given utterance has led to their use in many applications requiring Natural
Language Understanding (NLU). In dialogue management systems they have
been used as a representation of user and system dialogue turns [5] or as a set
of possible system actions [3]. For spoken language translation, Kumar et al. [13] utilized the contextual information provided by DAs to improve accuracy in phrase-based statistical speech translation.
To facilitate their use in such systems, utterances must first be assigned a single DA label, a task sometimes called short text classification. Previously, many different approaches have been applied to the DA classification problem, including Support Vector Machines (SVM) and Hidden Markov Models (HMM) [27], n-grams [17] and Bayesian networks [11]. More recently, Artificial Neural Network
(ANN) based approaches have led to increased performance, particularly Convolutional Neural Networks (CNN) [10] and Recurrent Neural Networks (RNN) [8, 14, 22]. Many of these approaches consider the DA classification task on both
a sentential and discourse level. The sentence level is concerned with how the
order and meaning of words compose to form the meaning of a sentence [15].
Similarly, on a discourse level, the order and meaning of sentences compose to
form the meaning of sequences in dialogue [24]. Certainly, a given utterance and its associated DA is often directly influenced by the nature of the preceding utterances and the current context of the dialogue. For example, 'okay' could be an acknowledgement of understanding or an agreement to a request; the intention depends on the utterance it is responding to. This view has led much of the neural network based research to model the semantic content of a sentence, in conjunction with some other contextual information, such as previous utterance or DA sequences, or a change in speaker turn, to predict the appropriate DA for the current utterance [10, 14]. Including such historical and contextual information has been shown to improve classification accuracy [16], and likely must be considered for any sophisticated classification model. However, motivated by an examination of the importance of different lexical and syntactic features, and their contribution towards DA classification, in this work each utterance is considered individually and out-of-context, that is, without any other contextual or historical information. Further, Cerisara et al. [2] determined that traditional word embeddings, such as GloVe [23] and Word2vec [20], have a limited impact on the DA classification task; this work therefore explores an alternative approach to utterance representation.
This work presents a simple, yet effective, probabilistic method of utterance
representation and DA classification using RNN architectures. The utterance
representations are generated from the probability distribution over all DAs for
each word in the utterance. Intuitively, each word is represented by a vector of
the probabilities that it is associated with each DA, and an utterance is then
a matrix formed from the vector representations of the words it contains. This
representation method was inspired by the intuition that certain keywords may
be associated with certain DA types and therefore act as indicators for the DA
of the utterances that contain them [26]. A Long-Short Term Memory (LSTM)
network based model is then used to classify the utterances according to their
associated DA types. A description of the model and utterance representations
can be found in section 3. Experimentally this method is applied to the DA
classification task using the Switchboard Dialogue Act (SwDA) corpus (section
4) and yields results comparable with more sophisticated approaches that also consider utterance or dialogue context information. Performance for representations generated from different word frequencies is compared to traditional word vector representations using the same LSTM model in section 5. The following section discusses neural network architectures for DA classification, and cue-phrase and n-gram approaches with similar motivations to this probabilistic word representation method.
2 Related Work
The ability of RNN to model long term dependencies in sequential data has led
to their widespread use in many Natural Language Processing (NLP) tasks [21],
and recently both LSTM and CNN have been applied to the DA classification
problem with great success. These approaches commonly employ a combination
of a sentence model and a discourse or context model. The sentence model acts
at the utterance level and encodes sentences, often from word embeddings, into
a fixed length vector representation. The encoded utterances are then used as
input to a discourse level model that incorporates some other contextual information and classifies the current utterance. For example, Lee and Dernoncourt [14] experiment with both LSTM and CNN sentence models for encoding short text representations, followed by various ANN architectures for classifying the
encoded utterances based on the current, and up to two of the previous, utterances. To try to capture interactions between speakers, Kalchbrenner and Blunsom [10] used a Hierarchical CNN sentence model in conjunction with an RNN discourse model, conditioning the recurrent and output weights on the current speaker. Liu et al. [16] examined several different CNN and LSTM based architectures that incorporate different context information, such as speaker change and dialogue history. Their work shows that including context information consistently yielded improvements over their baseline system.
These approaches all use pre-trained word embeddings to construct utterance representations as input to a sentence model. While word embeddings carry some semantic similarity information useful for many language classification tasks [18], they do not convey any relational information between the words in an utterance and its associated DA. The work of Cerisara et al. [2] showed that pre-trained embeddings did not help the DA classification task, likely because the word vector training corpora are commonly non-conversational.
In an effort to incorporate word-DA relationships into a model similar to those already discussed, Papalampidi, Iosif and Potamianos [22] explored the use of keywords that are representative of DAs. First, a set of keywords was constructed based on word frequency and saliency with respect to each DA. The keywords were then used to add a weighting value to pre-trained word embeddings for input to an LSTM sentence model. A two-layer ANN then classified utterances based on the current and preceding two utterances, in a similar fashion to [14].
The notion that certain words or phrases can act as indicators of utterance DA labels has more often been explored via probabilistic methods. Garner et al. [4] described a theory of word frequencies for dialogue move recognition and concluded that better performance could be achieved using a more involved n-gram
model, such as that applied by Louwerse and Crossley [17]. Webb and Hepple [28] selected a set of cue phrases based on the probability of an n-gram occurring within a given DA, keeping only those with 'predictivity' values over a certain threshold. An utterance is then classified by identifying the cue phrases it contains and assigning a DA label based on the phrase with the highest predictivity value. The probabilistic methodology in this paper combines
aspects of the previously described approaches, using an LSTM sentence model with utterance representations generated from probabilistic word-DA relations.
3 Model
3.1 LSTM Sentence Model
The sentence model is similar to those used by Papalampidi, Iosif and Potamianos [22], and also Khanpour, Guntakandla and Nielsen [12], and is based on a standard LSTM network as described by Hochreiter and Schmidhuber [7]. A given utterance that contains n words is converted into a sequence of n m-dimensional vectors V1, V2, ..., Vn, where m is either the dimension of the word embeddings or the number of DAs in the case of the probabilistic representations (see section 3.2). The lexical order of the words in the original sentence is represented as a sequence of successive time-steps in V. This sequence is given as input to the LSTM, which produces an h-dimensional vector at each time-step, where h is the size of the LSTM hidden dimension. A pooling layer then combines the output vectors from each time-step h1, h2, ..., hn into a single vector representation s of the utterance. Finally, a single feed-forward layer computes the probability distribution over all DAs using the softmax activation function.
Fig. 1. LSTM Sentence Model
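The pooling and output stages described above can be sketched in a few lines of NumPy. This is a minimal illustration only: the per-time-step LSTM outputs H are assumed to come from any standard LSTM implementation, and the function and variable names are illustrative rather than taken from the paper.

```python
import numpy as np

def pool_and_classify(H, W, b, pooling="max"):
    """Pooling + softmax output stages of the sentence model.

    H: (n, h) array of per-time-step LSTM outputs h1..hn.
    W: (h, m) output-layer weights, b: (m,) bias, m = number of DA tags.
    Returns a probability distribution over the m DA tags.
    """
    # Pool the n time-step vectors into a single utterance vector s.
    s = H.max(axis=0) if pooling == "max" else H.mean(axis=0)
    logits = s @ W + b
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()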
Following the initialisation parameters proposed by Ji, Haffari and Eisenstein [8], with the exception of the bias (b) and output (U) matrices, all LSTM weights are initialised in the range ±√(6 / (d1 + d2)), where d1 and d2 are the input and output dimensions respectively. U is initialised with a random uniform distribution in the range ±0.05 and b is initialised to 0. Optimisation is performed using the RMSProp algorithm with an initial learning rate α = 0.001 and decay rate γ = 0.001.
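As a rough sketch of this initialisation scheme (the function name and matrix shapes are illustrative assumptions, not the paper's code):

```python
import numpy as np

def init_weights(d1, d2, seed=0):
    """Sketch of the initialisation described above.

    LSTM weights: uniform in +/- sqrt(6 / (d1 + d2)) (Glorot-style),
    output matrix U: uniform in +/- 0.05, bias b: zeros.
    """
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (d1 + d2))
    W = rng.uniform(-limit, limit, size=(d1, d2))  # LSTM weight matrix
    U = rng.uniform(-0.05, 0.05, size=(d2, d2))    # output matrix U
    b = np.zeros(d2)                               # bias b
    return W, U, b
```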
3.2 Probabilistic Word Representations
The probabilistic word vector representations are generated by calculating a probability distribution over the DAs for each word that occurs at, or above, a certain frequency in the corpus vocabulary. First, a set of n keywords is created, keeping only those words that occur at a frequency equal to or greater than a threshold value (see section 5.2). Using this set of keywords, a probability matrix X is created of size n × m, where m is the number of DAs used in the corpus. Thus, an element xij in X is the probability that the ith word in n appears in an utterance that has the corresponding jth DA tag in m. Each row in X is a probability distribution over all DAs for the ith word in n, and is effectively a probabilistic word embedding for creating the utterance representations described in section 3.1.
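A minimal sketch of this construction follows. Co-occurrence is counted once per word occurrence, which is an assumption on our part: the text does not specify the exact counting convention, and the names are illustrative.

```python
from collections import Counter

def build_probability_matrix(utterances, labels, threshold=2):
    """Build the keyword-DA probability matrix X described above.

    utterances: list of token lists; labels: parallel list of DA tags.
    Keeps words occurring >= threshold times; each row of X is a
    probability distribution over the DA tags for one keyword.
    """
    word_freq = Counter(w for utt in utterances for w in utt)
    keywords = sorted(w for w, c in word_freq.items() if c >= threshold)
    da_tags = sorted(set(labels))
    counts = {w: Counter() for w in keywords}
    for utt, da in zip(utterances, labels):
        for w in utt:
            if w in counts:
                counts[w][da] += 1  # one count per word occurrence
    X = [[counts[w][da] / sum(counts[w].values()) for da in da_tags]
         for w in keywords]
    return keywords, da_tags, X
```

Each row of X can then be looked up per word to form the n × m utterance matrix fed to the sentence model.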
4 Experimental Dataset
The Switchboard Dialogue Act (SwDA) corpus contains 1155 transcripts of 5-minute telephone conversations between two speakers who did not know each other and were provided a topic for discussion. Each of the 205,000 utterances is annotated with one of 42 DA tags using the Discourse Annotation and Markup System of Labelling (DAMSL) [9]. The transcripts were split into the same 1115 for training and 19 for testing as used by Stolcke et al. [26] and others [8, 10]. The remaining 21 transcripts were used as a validation set, and 300 were randomly selected from the training set for development purposes.
Table 1. Datasets

Dataset      # of Transcripts  # of Utterances
Training     1115              192,768
Development  300               51,611
Test         19                4,088
Validation   21                3,196
The transcripts were pre-processed to remove the disfluency (breaks or irregularities) and other annotation symbols, in order to convert each utterance
into a plain text sentence. Additionally, any utterances tagged as ‘Non-Verbal’,
such as laughter or coughing, were removed from the transcript, as these do not
contain any relevant lexical information. In this work therefore, 41 of the original
42 DAMSL tags were used, reducing the utterance count by 2%.
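As an illustration, a cleanup step in this spirit might look like the following. The marker set handled here ({F ...} fillers, [...] repairs, <...> non-verbal events, '+' repair joins, '/' slash-unit terminators) is an assumption about SwDA-style annotation, not the paper's exact procedure.

```python
import re

def clean_utterance(text):
    """Strip SwDA-style annotation symbols to leave a plain-text sentence."""
    text = re.sub(r"<[^>]*>", " ", text)     # non-verbal events, e.g. <Laughter>
    text = re.sub(r"\{[A-Z]", " ", text)     # opening filler/repair markers, e.g. {F
    text = re.sub(r"[}\[\]+/#]", " ", text)  # remaining structural symbols
    text = re.sub(r"-{2,}", " ", text)       # interruption dashes
    return re.sub(r"\s+", " ", text).strip()
```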
Table 2. Most frequent Switchboard DA tags.
Dialogue Act Tag Example Count %
Statement-non-opinion sd Me, I’m in the legal department. 75,138 37%
Acknowledge (Backchannel) b Uh-huh. 38,233 19%
Statement-opinion sv I think it’s great 26,422 13%
Abandoned or Turn-Exit %- So, - 15,545 7%
Agree/Accept aa That’s exactly it. 11,123 5%
Appreciation ba I can imagine. 4,759 2%
Yes-No-Question qy Do you have special training? 4,726 2%
Yes answers ny Yes. 3,031 1%
5 Results and Discussion
5.1 Parameter Tuning
As previously stated, the same LSTM model is used for comparison between the traditional word embedding utterance representations and the probabilistic representations. Parameters were first tuned using traditional word embeddings and then applied to the probabilistic representations. Keeping the hyperparameters fixed, different word-to-vector techniques and dimensions were tested; each parameter was then tuned one at a time to determine the optimum configuration. Both l2-regularization and decay rate were also tested, though both were shown to have only a negative impact on performance; therefore l2-regularization is not used and decay is fixed at γ = 0.001. For all parameter tuning and utterance representation testing, the development set described in section 4 is used. Results shown are averaged over 5 runs of 10 epochs.
Word Embeddings Two pre-trained sets of word embeddings were tested: word2vec [19] trained on the Google News corpus and GloVe [23] trained on a Wikipedia corpus. In addition, a second word2vec embedding was trained on the SwDA corpus itself using the Gensim Python package.¹ Dimensions in the range 100-300 were tested and hyperparameters were kept fixed with the values dropout = 0.3, hidden dimension = 64, and max-pooling. Table 3 shows the best classification accuracy was achieved using word2vec trained on Google News with 300 dimensions, and this configuration is adopted for the rest of the experiments.
¹ https://radimrehurek.com/gensim/
Table 3. Word embeddings performance.

Word Embeddings          Embedding Dimension  Test Set Accuracy %  Validation Set Accuracy %
GloVe (Wikipedia)        100                  70.62                74.22
GloVe (Wikipedia)        200                  71.31                74.98
GloVe (Wikipedia)        300                  70.45                73.68
Word2vec (Google News)   100                  69.41                72.97
Word2vec (Google News)   200                  71.14                75.06
Word2vec (Google News)   300                  71.65                75.07
Word2vec (SwDA)          100                  68.93                72.58
Word2vec (SwDA)          200                  68.71                72.29
Word2vec (SwDA)          300                  69.18                72.81
Dropout Dropout is a regularisation method generally used to prevent overfitting in neural networks [6]. The results in Table 4 concur with others' findings [12, 14, 22] that a value of 0.3 is optimal.
LSTM Hidden Dimension The LSTM hidden dimension corresponds to the dimensionality of the output vectors of an LSTM cell at each time-step h1, h2, ..., hn, and therefore determines the dimension of the utterance representation s that is generated by the pooling layer. Table 4 shows minimal impact on performance provided the hidden dimension is close to the maximum sentence length (107 words).
Pooling Two different pooling mechanisms were tested. Max-pooling keeps the element-wise maximum of the h vectors output by the LSTM, and mean-pooling averages the h vectors.
Table 4. Hyperparameter tuning performance.

Dropout  Hidden Dimension  Pooling  Test Set Accuracy %  Validation Set Accuracy %
0.0      64                Max      70.91                74.45
0.1      64                Max      71.36                74.82
0.2      64                Max      71.58                74.82
0.3      64                Max      71.62                75.17
0.4      64                Max      71.26                74.97
0.3      128               Max      72.32                75.66
0.3      256               Max      72.01                75.51
0.3      128               Mean     68.51                72.09
5.2 Probabilistic Word Embeddings
Different word frequency thresholds (see section 3.2) were tested to explore their impact on the performance of the probabilistic word representation vectors. The thresholds are simply the minimum number of times a given word must occur in the SwDA corpus. Table 5 shows that accuracy tends to decrease with larger thresholds. This is likely due to the utterance representations becoming too sparse as fewer words are included in the probability matrix; for example, a threshold of 2 already eliminates around half of the words in the vocabulary. The minimum threshold value was kept at 2 to maximise the likelihood of words appearing in at least one of the test datasets.
Table 5. Performance for different word frequency thresholds.

Word Frequency Threshold  Test Set Accuracy %  Validation Set Accuracy %
2                         74.68                77.38
4                         74.14                77.10
6                         73.66                75.70
8                         73.01                76.35
10                        72.69                76.04
Table 6 shows a subset of the probability matrix created using the method described in section 3.2 for the SwDA corpus. It can be seen that certain words correlate strongly with specific DAs. These are particularly useful features for differentiating between DAs that are otherwise semantically similar, for example, Statement-opinion and Statement-non-opinion. However, certain DAs, such as Abandoned/Turn-Exit, do not correlate strongly with any words.
Table 6. Example of word probabilities (%) for the five most common DAs.

Word    Statement Non-Opinion  Acknowledge (Backchannel)  Statement Opinion  Abandoned/Turn-Exit  Agree/Accept
My      86.67                  0.03                       4.88               1.52                 0.08
Yeah    0.08                   71.49                      0.07               1.54                 16.68
Should  26.22                  0.00                       59.19              0.40                 0.40
Um      36.94                  15.56                      9.14               20.30                0.78
True    5.85                   0.30                       14.67              0.10                 62.83
5.3 Results
Evaluation of the probabilistic and pre-trained word embedding representations
was performed on the full training set and results shown are an average over
10 runs for the test dataset. Table 7 shows the highest classification accuracies
achieved on the test dataset for both word representation methods using the
RNN model. The proposed model trained with word embeddings resulted in a similar accuracy (73.68%) to the sentence model in the work of Papalampidi et al. [22]. The model trained on the probabilistic representation shows an improvement of 1.8% over the word embeddings baseline. Though direct comparisons are difficult, due to differences in pre-processing, models and other methodology, Table 7 also shows the probabilistic model is comparable to methods from the literature where context information was also used.
Table 7. Performance of the RNN model and other methods from the literature.

Model                                              Classification Accuracy %
Sentence Level
Proposed LSTM - Probabilistic                      75.48
Proposed LSTM - Word Embeddings                    73.68
Sentence (Papalampidi et al., 2017)                73.8
Sentence and Discourse Level
Sentence and Discourse (Papalampidi et al., 2017)  75.6
LSTM (Lee and Dernoncourt, 2016)                   69.6
CNN (Lee and Dernoncourt, 2016)                    73.1
RCNN (Kalchbrenner and Blunsom, 2013)              73.9
DRLM-joint training (Ji et al., 2016)              74.0
DRLM-conditional training (Ji et al., 2016)        77.0
Inter-annotator Agreement (Stolcke et al., 2000)   84.0
Majority DA baseline                               32.2
6 Conclusion
This work has presented a novel probabilistic approach to utterance representation, and an LSTM sentence model, for the purpose of DA classification. When applied to the SwDA corpus in an out-of-context fashion, the probabilistic representations improve DA classification accuracy by 1.8% compared to traditional word embeddings. Further, the overall highest classification accuracy achieved is competitive with approaches from the literature that use more sophisticated classifier models and also consider contextual information, and improves
on previously published results that use only a sentence-level model. These findings also concur with previous work [2] in showing that the traditional word embedding approach to utterance representation may not improve accuracy on the DA classification task. This highlights the need for alternative representation methods for DA classification, such as the proposed probabilistic keyword-DA relationships.
Regarding future work, it would be beneficial to determine whether probabilistic representations yield similar improvements in accuracy when combined with additional contextual information and more sophisticated discourse and sentence models, which have been shown to improve results [14]. Additionally, it may be valuable to assess the portability of the approach by applying keywords gathered from one corpus to DA classification on another, distinct corpus [29]. This would help to determine whether certain keywords generalise to new corpora, and perhaps reduce the amount of training data required.
References
1. Austin, J.L.: How To Do Things With Words. Oxford University Press, London
(1962)
2. Cerisara, C., Král, P., Lenc, L.: On the Effects of Using Word2vec Representations in Neural Networks for Dialogue Act Recognition. Comput. Speech Lang. 47, 175–193 (2017). https://doi.org/10.1016/j.csl.2017.07.009
3. Cuayáhuitl, H., Yu, S., Williamson, A., Carse, J.: Deep Reinforcement Learning for Multi-Domain Dialogue Systems. In: NIPS Work. Deep Reinf. Learn. pp. 1–9. Barcelona (2016)
4. Garner, P.N., Browning, S.R., Moore, R.K., Russell, M.J.: A Theory of
Word Frequencies and its Application to Dialogue Move Recognition. In: 4th
Int. Conf. Spok. Lang. Process. (ICSLP 96). ISCA, Philadelphia, PA (1996).
https://doi.org/10.1109/ICSLP.1996.607999
5. Griol, D., Hurtado, L., Segarra, E., Sanchis, E.: A Statistical Approach to Spoken
Dialog Systems Design and Evaluation. Speech Commun. 50(8-9), 666–682 (2008).
https://doi.org/10.1016/j.specom.2008.04.001
6. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580 (2012)
7. Hochreiter, S., Schmidhuber, J.: Long Short-Term Memory. Neural Comput. 9(8),
1735–1780 (1997)
8. Ji, Y., Haffari, G., Eisenstein, J.: A Latent Variable Recurrent Neural Network for
Discourse Relation Language Models. In: Proc. NAACL-HLT 2016. pp. 332–342.
ACL, San Diego (2016)
9. Jurafsky, D., Shriberg, E., Biasca, D.: Switchboard SWBD-DAMSL Shallow-
Discourse-Function Annotation Coders Manual. Tech. rep. (1997)
10. Kalchbrenner, N., Blunsom, P.: Recurrent Convolutional Neural Networks for Discourse Compositionality. In: Proc. Work. Contin. Vector Sp. Model. their Compos. pp. 119–126. ACL, Sofia, Bulgaria (2013)
11. Keizer, S.: A Bayesian Approach to Dialogue Act Classification. In: BI-DIALOG
2001 Proc. 5th Work. Form. Semant. Pragmat. Dialogue. pp. 210–218 (2001)
Probabilistic Dialogue Act Classification with RNN 11
12. Khanpour, H., Guntakandla, N., Nielsen, R.: Dialogue Act Classification in
Domain-Independent Conversations Using a Deep Recurrent Neural Network. In:
COLING 2016, 26th Int. Conf. Comput. Linguist. pp. 2012–2021. Osaka (2016)
13. Kumar, V., Sridhar, R., Narayanan, S., Bangalore, S.: Enriching Spoken Language Translation with Dialog Acts. In: Proc. 46th Annu. Meet. Assoc. Comput. Linguist. Hum. Lang. Technol. Short Pap. (HLT '08). p. 225 (2008). https://doi.org/10.3115/1557690.1557755
14. Lee, J.Y., Dernoncourt, F.: Sequential Short-Text Classification with Recurrent
and Convolutional Neural Networks. In: NAACL 2016 (2016)
15. Li, D.: The Pragmatic Construction of Word Meaning in Utterances. J. Chinese
Lang. Comput. 18(3), 121–137 (2012)
16. Liu, Y., Han, K., Tan, Z., Lei, Y.: Using Context Information for Dialog Act
Classification in DNN Framework. In: Proc. 2017 Conf. Empir. Methods Nat. Lang.
Process. pp. 2160–2168. ACL, Copenhagen, Denmark (2017)
17. Louwerse, M., Crossley, S.: Dialog Act Classification Using N-Gram Algorithms.
In: FLAIRS Conf. 2006. pp. 758–763. Melbourne Beach (2006)
18. Mandelbaum, A., Shalev, A.: Word Embeddings and their Use in Sentence Classi-
fication Tasks. arXiv (2016)
19. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Distributed Representations of Words and Phrases and Their Compositionality. In: NIPS'13 Proc. 26th Int. Conf. Neural Inf. Process. Syst. pp. 3111–3119. Lake Tahoe (2013)
20. Mikolov, T., Corrado, G., Chen, K., Dean, J.: Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781 (2013)
21. Mikolov, T., Karafiát, M., Burget, L., Khudanpur, S.: Recurrent neural network based language model. In: INTERSPEECH. pp. 1045–1048. Makuhari (2010)
22. Papalampidi, P., Iosif, E., Potamianos, A.: Dialogue Act Semantic Representation and Classification Using Recurrent Neural Networks. In: SEMDIAL 2017 Work. Semant. Pragmat. Dialogue. pp. 77–86 (2017). https://doi.org/10.21437/SemDial.2017-9
23. Pennington, J., Socher, R., Manning, C.: Glove: Global Vectors for Word Repre-
sentation. In: Proc. 2014 Conf. Empir. Methods Nat. Lang. Process. pp. 1532–1543
(2014). https://doi.org/10.3115/v1/D14-1162
24. Schegloff, E.A.: Sequence Organization in Interaction: A Primer in Conversation Analysis I. Cambridge University Press, Cambridge (2007). https://doi.org/10.1017/CBO9780511791208
25. Searle, J.: Speech Acts: An Essay in the Philosophy of Language. Cambridge Uni-
versity Press, London (1969)
26. Stolcke, A., Ries, K., Coccaro, N., Shriberg, E., Bates, R., Jurafsky, D., Taylor,
P., Martin, R., Van Ess-Dykema, C., Meteer, M.: Dialogue Act Modeling for Au-
tomatic Tagging and Recognition of Conversational Speech. Comput. Linguist.
26(3), 339–373 (2000). https://doi.org/10.1162/089120100561737
27. Surendran, D., Levow, G.A.: Dialog Act Tagging with Support Vector Machines
and Hidden Markov Models. In: Interspeech 2006 9th Int. Conf. Spok. Lang. Pro-
cess. pp. 1950–1953. Pittsburgh (2006)
28. Webb, N., Hepple, M.: Dialogue Act Classification Based on Intra-Utterance Fea-
tures. In: Proc. AAAI Work. Spok. Lang. Underst. (2005)
29. Webb, N., Liu, T.: Investigating the portability of corpus-derived cue phrases for dialogue act classification. In: Proc. 22nd Int. Conf. Comput. Linguist., Vol. 1. pp. 977–984. Manchester (2008)
Conference Paper
The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large num- ber of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alterna- tive to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of “Canada” and “Air” cannot be easily combined to obtain “Air Canada”. Motivated by this example,we present a simplemethod for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.