
DEEP QUATERNION NEURAL NETWORKS

FOR SPOKEN LANGUAGE UNDERSTANDING

Titouan Parcollet1,2, Mohamed Morchid1, Georges Linarès1

1LIA, University of Avignon (France)

2Orkis (France)

{firstname.lastname}@univ-avignon.fr

ABSTRACT

Deep Neural Networks (DNN) have received great interest from researchers due to their capability to construct robust abstract representations of heterogeneous documents in a latent subspace. Nonetheless, mere real-valued deep neural networks require an appropriate adaptation, such as the convolution process, to capture latent relations between input features. Moreover, real-valued deep neural networks reveal little in the way of document internal dependencies, by considering the words or topics contained in a document only as isolated basic elements. Quaternion-valued multi-layer perceptrons (QMLP) and autoencoders (QAE) have been introduced to capture such latent dependencies, as well as to represent multidimensional data. Nonetheless, a three-layered neural network does not benefit from the high abstraction capability of DNNs. This paper first proposes to extend the hyper-complex algebra to deep neural networks (QDNN) and then introduces pre-trained deep quaternion neural networks (QDNN-AE) with dedicated quaternion encoder-decoders (QAE). The experiments conducted on a theme identification task of spoken dialogues from the DECODA data set show, inter alia, that the QDNN-AE reaches a promising gain of 2.2% compared to the standard real-valued DNN-AE.

Index Terms— Quaternions, deep neural networks, spoken language understanding, autoencoders, machine learning.

1. INTRODUCTION

Deep neural networks (DNN) have become ubiquitous in a broad spectrum of domain-specific applications, such as image processing [1, 2], speech recognition [3], or spoken language understanding (SLU) [4]. State-of-the-art approaches involve different neural-based structures to construct abstract representations of documents in a low-dimensional subspace, such as deep neural networks [5], recurrent neural networks (RNN) [6, 7, 8, 9], convolutional neural networks (CNN) [1] and, more recently, generative adversarial neural networks (GAN) [10]. However, in a standard real-valued neural structure, the latent relations between input features are difficult to represent. Indeed, multidimensional features must be reduced to a one-dimensional vector before the learning process, while an appropriate solution is to process a multidimensional input as a single homogeneous entity. In other words, real-valued representations reveal little in the way of document internal structure, by considering the words or topics contained in a document only as isolated basic elements. Therefore, quaternion multi-layer perceptrons (QMLP) [11, 12, 13] and quaternion autoencoders (QAE) [14] have been introduced to capture such latent dependencies, thanks to the four-dimensionality of hyper-complex numbers along with the Hamilton product [15]. Nonetheless, previous quaternion-based studies focused on three-layered neural networks, while the efficiency and the effectiveness of DNNs have already been demonstrated [16, 5].

Therefore, this paper first proposes to extend QMLPs to deep quaternion neural networks (QDNN) for theme identification of telephone conversations. Indeed, the high abstraction capability of DNNs, added to the quaternion latent relation representations, fully exposes the potential of hyper-complex-based neural structures. Nevertheless, in [17], the authors highlighted the non-local optimum convergence and the high overfitting probability of training deep neural networks. To alleviate these weaknesses, different methods have been proposed, such as adding noise during learning to prevent the overfitting phenomenon [18], or a pre-training process to more easily converge to a non-local optimum [19], with a Restricted Boltzmann Machine (RBM) [20] or an encoder-decoder neural network (AE) [21].

The paper then proposes to compare the randomly initialized QDNN with a greedy layer-wise pre-trained QDNN using QAEs, called "QDNN-AE", to fully expose the capabilities of the quaternion deep neural structure on a SLU task. The experiments are conducted on the DECODA telephone conversations framework and show a promising gain of the QDNN compared to the QMLP. Moreover, the experiments underline the impact of pre-training a QDNN with a dedicated autoencoder. Finally, the proposed quaternion-based models are compared to the real-valued ones.

The rest of the paper is organized as follows: Section 2 presents the quaternion deep neural networks and quaternion encoder-decoders, and Section 3 details the experimental protocol. The results are discussed in Section 4 before concluding in Section 5.

2. DEEP QUATERNION NEURAL NETWORKS (QDNN) AND QUATERNION AUTOENCODERS (QAE)

The proposed QDNN combines the well-known real-valued deep neural network¹ with the quaternion algebra. Section 2.1 details the fundamental quaternion properties required to define and understand the QDNN algorithms presented in Section 2.2.

2.1. Quaternion algebra

The quaternion algebra Q is an extension of the complex numbers, defined in a four-dimensional space as a linear combination of four basis elements, denoted 1, i, j, k, to represent a rotation. A quaternion Q is written as:

Q = r1 + xi + yj + zk    (1)

In a quaternion, r is its real part, while xi + yj + zk is the imaginary part (I), or the vector part. There is a set of basic quaternion properties needed for the QDNN definition:

• all products of i, j, k: i² = j² = k² = ijk = −1

• the quaternion conjugate Q* of Q is: Q* = r1 − xi − yj − zk

• the dot product between two quaternions Q1 and Q2 is: ⟨Q1, Q2⟩ = r1r2 + x1x2 + y1y2 + z1z2

• the quaternion norm is: |Q| = √(r² + x² + y² + z²)

• the normalized quaternion is: Q/ = Q / |Q|

• the Hamilton product ⊗ between Q1 and Q2 encodes latent dependencies and is defined as follows:

Q1 ⊗ Q2 = (r1r2 − x1x2 − y1y2 − z1z2)
        + (r1x2 + x1r2 + y1z2 − z1y2)i
        + (r1y2 − x1z2 + y1r2 + z1x2)j
        + (r1z2 + x1y2 − y1x2 + z1r2)k    (2)

This encoding capability has been confirmed by [22]. Indeed, the authors demonstrated the rotation, transformation and scaling capabilities of a single quaternion due to the Hamilton product. Moreover, it performs an interpolation between two rotations following a geodesic over a sphere in the R³ space.
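The algebra above can be sketched in a few lines of plain Python (an illustrative sketch, not the authors' code), representing a quaternion Q = r1 + xi + yj + zk as a (r, x, y, z) tuple:

```python
import math

def hamilton(q1, q2):
    """Hamilton product of Eq. (2)."""
    r1, x1, y1, z1 = q1
    r2, x2, y2, z2 = q2
    return (r1*r2 - x1*x2 - y1*y2 - z1*z2,
            r1*x2 + x1*r2 + y1*z2 - z1*y2,
            r1*y2 - x1*z2 + y1*r2 + z1*x2,
            r1*z2 + x1*y2 - y1*x2 + z1*r2)

def conjugate(q):
    """Quaternion conjugate Q* = r1 - xi - yj - zk."""
    r, x, y, z = q
    return (r, -x, -y, -z)

def norm(q):
    """Quaternion norm |Q|."""
    return math.sqrt(sum(c * c for c in q))

def normalize(q):
    """Normalized quaternion Q/ = Q / |Q|."""
    n = norm(q)
    return tuple(c / n for c in q)

# Sanity checks: i ⊗ j = k, and Q ⊗ Q* is |Q|^2 on the real axis.
i, j = (0.0, 1.0, 0.0, 0.0), (0.0, 0.0, 1.0, 0.0)
assert hamilton(i, j) == (0.0, 0.0, 0.0, 1.0)
q = (1.0, 2.0, 3.0, 4.0)
assert abs(hamilton(q, conjugate(q))[0] - norm(q) ** 2) < 1e-9
```

Note how the Hamilton product mixes all four components of both operands, which is what lets a single quaternion weight relate the four dimensions of its input.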

¹ A multilayer perceptron (MLP) with more than one hidden layer.

Given a segmentation S = {s1, s2, s3, s4} of a document p ∈ P, following the document segmentation detailed in [12], and a set of topics z = {z1, …, zi, …, zT} from a latent Dirichlet allocation (LDA) [23], each topic zi of a document p is represented by the quaternion:

Qp(zi) = x¹_p(zi)1 + x²_p(zi)i + x³_p(zi)j + x⁴_p(zi)k    (3)

where x^m_p(zi) is the prior of the topic zi in segment s_m of document p, as described in [12]. This quaternion is then normalized to obtain the input Q/_p(zi) of the QMLP.

More about hyper-complex numbers can be found in [24, 25, 26], and more precisely about the quaternion algebra in [27].
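As a hedged illustration of Eq. (3), the input quaternion of one topic can be built from its four segment-wise priors and then normalized; the prior values below are made up for the example, not DECODA statistics:

```python
import math

def topic_quaternion(priors):
    """Eq. (3): Q_p(z_i) = x1*1 + x2*i + x3*j + x4*k, then normalized."""
    q = tuple(float(p) for p in priors)
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

# Made-up priors of a topic z_i in segments s1..s4 of one document.
q = topic_quaternion((0.1, 0.4, 0.2, 0.3))
assert abs(sum(c * c for c in q) - 1.0) < 1e-9  # unit norm, as required
```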

2.2. Deep Quaternion Neural Networks (QDNN)

This section details the QDNN structure and algorithms (Figure 1). The QDNN differs from the real-valued DNN in each learning subprocess, and all elements of the QDNN (inputs Qp, labels t, weights w, biases b, outputs γ, …) are quaternions.

Fig. 1. Illustration of a Quaternion Deep Neural Network with 2 hidden layers (M = 4); the inputs Qp form the first layer l1 and the outputs t form the last layer lM.

Activation function

The activation function β is the split [11] ReLU function α, applied to each element of the quaternion Q = r1 + xi + yj + zk as follows:

β(Q) = α(r)1 + α(x)i + α(y)j + α(z)k    (4)

where α is

α(x) = max(0, x)    (5)
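A minimal sketch of the split activation of Eqs. (4)-(5): the real-valued ReLU applied independently to each of the four components of a quaternion tuple.

```python
def split_relu(q):
    """Split ReLU: alpha(x) = max(0, x) applied component-wise (Eqs. 4-5)."""
    return tuple(max(0.0, c) for c in q)

assert split_relu((1.5, -2.0, 0.0, 3.0)) == (1.5, 0.0, 0.0, 3.0)
```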

Forward phase

Let N_l be the number of neurons contained in layer l (1 ≤ l ≤ M) and M be the number of layers of the QDNN, including the input and output layers. θ^l_n is the bias of neuron n (1 ≤ n ≤ N_l) of layer l. Given a set of P normalized quaternion input patterns Q/_p (1 ≤ p ≤ P, denoted Q_p for convenience in the rest of the paper) and a set of labels t_p associated to each input Q_p, the output γ^l_n (with γ^0_n = Q_p,n and γ^M_n = t_n) of the neuron n of layer l is given by:

γ^l_n = β(S^l_n), with S^l_n = Σ_{m=0..N_(l−1)} w^l_nm ⊗ γ^(l−1)_m + θ^l_n    (6)
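The forward pass of Eq. (6) for one quaternion layer can be sketched as follows (an illustrative sketch with made-up weights, not the authors' implementation); `hamilton` re-implements the Eq. (2) product so the block is self-contained:

```python
def hamilton(q1, q2):
    """Hamilton product of Eq. (2)."""
    r1, x1, y1, z1 = q1
    r2, x2, y2, z2 = q2
    return (r1*r2 - x1*x2 - y1*y2 - z1*z2,
            r1*x2 + x1*r2 + y1*z2 - z1*y2,
            r1*y2 - x1*z2 + y1*r2 + z1*x2,
            r1*z2 + x1*y2 - y1*x2 + z1*r2)

def q_add(a, b):
    return tuple(u + v for u, v in zip(a, b))

def split_relu(q):
    return tuple(max(0.0, c) for c in q)

def layer_forward(weights, biases, prev):
    """Eq. (6): gamma_n = beta(sum_m w_nm ⊗ gamma_m + theta_n).

    weights[n][m] and biases[n] are quaternion tuples; prev holds the
    quaternion outputs of the previous layer."""
    out = []
    for w_n, theta_n in zip(weights, biases):
        s = theta_n
        for w_nm, g_m in zip(w_n, prev):
            s = q_add(s, hamilton(w_nm, g_m))
        out.append(split_relu(s))
    return out

prev = [(0.5, 0.1, -0.2, 0.3)]          # one input quaternion
weights = [[(1.0, 0.0, 0.0, 0.0)]]      # identity weight: 1 ⊗ q = q
biases = [(0.0, 0.0, 0.0, 0.0)]
gamma = layer_forward(weights, biases, prev)
assert gamma == [(0.5, 0.1, 0.0, 0.3)]  # the split ReLU clips -0.2 to 0
```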

Learning phase

The error e observed between the expected output t and the result γ^M of the forward phase is evaluated for the output layer (l = M) as follows:

e^l_n = t_n − γ^l_n    (7)

and for the hidden layers (2 ≤ l ≤ M − 1):

e^l_n = Σ_{h=1..N_(l+1)} w*^(l+1)_hn ⊗ δ^(l+1)_h,    (8)

The gradient δ is computed with:

δ^l_n = e^l_n × ∂β(S^l_n)/∂S^l_n, where ∂β(S^l_n)/∂S^l_n = 1 if S^l_n > 0, and 0 otherwise    (9)

Update phase

The weights w^l_nm and the bias values θ^l_n are respectively updated to ŵ^l_nm and θ̂^l_n:

ŵ^l_nm = w^l_nm + δ^l_n ⊗ β*(S^l_n)    (10)

θ̂^l_n = θ^l_n + δ^l_n.    (11)
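Eqs. (7)-(11) for the output layer can be sketched as below, with made-up values; reading the β*(·) notation of Eq. (10) as a conjugated quantity is our assumption, flagged in the comments:

```python
def hamilton(q1, q2):
    """Hamilton product of Eq. (2)."""
    r1, x1, y1, z1 = q1
    r2, x2, y2, z2 = q2
    return (r1*r2 - x1*x2 - y1*y2 - z1*z2,
            r1*x2 + x1*r2 + y1*z2 - z1*y2,
            r1*y2 - x1*z2 + y1*r2 + z1*x2,
            r1*z2 + x1*y2 - y1*x2 + z1*r2)

def conjugate(q):
    r, x, y, z = q
    return (r, -x, -y, -z)

def relu_grad(s):
    """Eq. (9): derivative of the split ReLU, component-wise."""
    return tuple(1.0 if c > 0 else 0.0 for c in s)

def output_delta(t, gamma, s):
    """Eqs. (7) and (9): delta = (t - gamma) * dbeta/dS."""
    e = tuple(a - b for a, b in zip(t, gamma))
    return tuple(ec * gc for ec, gc in zip(e, relu_grad(s)))

def update_weight(w, d, act):
    """Eq. (10), ASSUMING beta*(.) denotes a conjugated activation term."""
    prod = hamilton(d, conjugate(act))
    return tuple(wc + pc for wc, pc in zip(w, prod))

t = (1.0, 0.0, 0.0, 0.0)            # target quaternion
gamma = (0.5, 0.25, -0.25, 0.0)     # forward output
s = gamma                           # pre-activation, for the gradient mask
d = output_delta(t, gamma, s)
assert d == (0.5, -0.25, 0.0, 0.0)  # components with S <= 0 are masked out
w_hat = update_weight((0.1, 0.0, 0.0, 0.0), d, gamma)
assert len(w_hat) == 4
```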

2.3. Quaternion Autoencoder (QAE)

The QAE [14] is a three-layered (M = 3) neural network made of an encoder and a decoder, where N1 = N3, as depicted in Figure 2. The well-known autoencoder (AE) is obtained with the same algorithm as the QAE, but with real numbers, and with the Hamilton product replaced by the mere dot product.

Given a set of P normalized inputs Q_p (1 ≤ p ≤ P), the encoder computes a hidden representation l2 of Q_p, while the decoder attempts to reconstruct the input vector Q_p from this hidden vector l2 to obtain the output vector Q̃_p. The learning phase follows the algorithm previously described in Section 2.2. Indeed, the QAE attempts to reduce the reconstruction error e_MSE between Q̃_p and Q_p, using the traditional Mean Square Error (MSE) [19] between all m (1 ≤ m ≤ N1) quaternions Q_m and estimated Q̃_m composing the pattern Q_p:

e_MSE(Q̃_m, Q_m) = ||Q̃_m − Q_m||²    (12)

so as to minimize the total reconstruction error L_MSE:

L_MSE = (1/P) Σ_{p∈P} Σ_{m∈M} e_MSE(Q̃_m, Q_m).    (13)

Fig. 2. Illustration of a quaternion autoencoder; the inputs Q_p form layer l1 and the reconstructed outputs Q̃_p form layer l3.
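The reconstruction losses of Eqs. (12)-(13) reduce to a few lines; the batch below is a made-up one-pattern example, not DECODA data:

```python
def e_mse(q_hat, q):
    """Eq. (12): squared norm of the quaternion difference."""
    return sum((a - b) ** 2 for a, b in zip(q_hat, q))

def l_mse(batch_hat, batch):
    """Eq. (13): mean over the P patterns of the summed per-quaternion errors."""
    total = sum(e_mse(qh, q)
                for p_hat, p in zip(batch_hat, batch)
                for qh, q in zip(p_hat, p))
    return total / len(batch)

batch = [[(1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0)]]    # one pattern
batch_hat = [[(0.5, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0)]]
assert abs(l_mse(batch_hat, batch) - 0.25) < 1e-12
```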

2.4. QDNN initialized with dedicated QAEs

The learning process of deep neural networks is impacted by a wide variety of issues related to the large number of parameters [3], such as the vanishing or exploding gradient and the overfitting phenomenon. Different techniques have been proposed to address these drawbacks [18, 28, 29, 30]: additive noise, normalization preprocessing, adaptive learning rates, and pre-training. The pre-training process allows the neural network structure to converge faster to a non-local optimum, using a pre-learning phase on an unsupervised task. Indeed, an autoencoder is employed for learning each of the weight matrices composing the QDNN, except the last one, which is randomly initialized, as illustrated in Figure 3-(b)-(c). Therefore, the auto-encoded neural networks (DNN-AE, QDNN-AE) are able to map effectively the initial input features into a homogeneous subspace, learned during an unsupervised training process with dedicated encoder-decoder neural networks.

Fig. 3. Illustration of a pre-trained Quaternion Deep Neural Network (a) based on 2 dedicated Quaternion encoder-decoders (b-c); the inputs Qp form layer l1 and the outputs t form layer lM.
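The greedy layer-wise initialization of Figure 3 can be sketched as the following control flow. This is an illustrative skeleton only: `train_autoencoder_stub` is a placeholder standing in for the QAE training of Section 2.3 (it just produces random weights and the hidden representation), not the authors' procedure.

```python
import random

def train_autoencoder_stub(data, hidden_size):
    """Placeholder for one autoencoder: returns encoder weights and the
    hidden representation the next autoencoder will be trained on."""
    weights = [[random.random() for _ in range(len(data[0]))]
               for _ in range(hidden_size)]
    hidden = [[sum(w * x for w, x in zip(row, sample)) for row in weights]
              for sample in data]
    return weights, hidden

def pretrain(data, layer_sizes):
    """Initialize every hidden layer from a dedicated autoencoder; the
    last (output) layer is randomly initialized, as in Figure 3."""
    weights, reps = [], data
    for size in layer_sizes[:-1]:          # hidden layers: from autoencoders
        w, reps = train_autoencoder_stub(reps, size)
        weights.append(w)
    n_in = layer_sizes[-2] if len(layer_sizes) > 1 else len(data[0])
    weights.append([[random.random() for _ in range(n_in)]
                    for _ in range(layer_sizes[-1])])  # last layer: random
    return weights

data = [[0.1, 0.2, 0.3, 0.4]]              # made-up input pattern
ws = pretrain(data, [8, 8, 2])             # 2 hidden layers + output layer
assert [len(w) for w in ws] == [8, 8, 2]
```

The initialized weight matrices would then be fine-tuned with the supervised learning phase of Section 2.2.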

3. EXPERIMENTAL PROTOCOL

The efficiency and the effectiveness of the proposed QDNN and QDNN-AE are evaluated on a spoken language understanding task of theme identification of telephone conversations, described in Section 3.1. The conversations come from the DECODA framework detailed in Section 3.2. Section 3.3 presents the dialogue features employed as inputs of the autoencoders, as well as the configuration of each neural network.

3.1. Spoken Language Understanding task

The application considered in this paper, depicted in Figure 4, concerns the automatic analysis of telephone conversations [31] between an agent and a customer in the call center of the Paris public transport authority (RATP) [32]. The most important speech analytics for the application are the conversation themes. Relying on the ontology provided by the RATP, we have identified 8 themes related to the main reason of the customer call, such as time schedules, traffic states, special offers, lost and found, etc.

A conversation involves a customer, who is calling from an unconstrained environment (typically from a train station or the street, using a mobile phone), and an agent, who is supposed to follow a conversation protocol to address customer requests or complaints. The conversation tends to vary according to the model of the agent protocol. This paper describes a theme identification method that relies on features related to this underlying structure of agent-customer conversation.

Here, the identification of the conversation theme encounters two main problems. First, speech signals may contain very noisy segments that are decoded by an Automatic Speech Recognition (ASR) system. In such difficult environments, ASR systems frequently fail, and the theme identification component has to deal with high Word Error Rates (WER ≈ 49%).

Second, themes can be quite ambiguous, many speech acts being theme-independent (and sometimes confusing) due to the specificities of the applicative context: most conversations evoke traffic details or issues, station names, time schedules, etc. Moreover, some of the dialogues contain secondary topics, augmenting the difficulty of dominant theme identification. On the other hand, dialogues are redundant and driven by the RATP agents, who try to follow, as much as possible, standard dialogue schemes.

3.2. Spoken dialogue data set

The DECODA corpus [32] contains real-life human-human telephone conversations collected in the Customer Care Service of the Paris transportation system (RATP). It is composed of 1,242 telephone conversations, corresponding to about 74 hours of signal, split into a train set (739 dialogues), a development set (175 dialogues) and a test set (327 dialogues). Each conversation is annotated with one of the 8 themes. Themes correspond to customer problems or inquiries about itinerary, lost and found, time schedules, transportation cards, state of the traffic, fares, fines and special offers. The LIA-Speeral Automatic Speech Recognition (ASR) system [33] is used for automatically transcribing each conversation. Acoustic model parameters are estimated from 150 hours of telephone speech. The vocabulary contains 5,782 words. A 3-gram language model (LM) is obtained by adapting a basic LM with the training set transcriptions. Automatic transcriptions are obtained with word error rates (WERs) of 33.8%, 45.2% and 49% on the train, development and test sets respectively. These high rates are mainly due to speech disfluencies of casual users and to adverse acoustic environments in metro stations and streets.

Agent: Hello
Customer: Hello
Agent: Speaking...
Customer: I call you because I was fined today, but I still have an Imagine card suitable for zone 1 [...] I forgot to use my Navigo card for zone 2
Agent: You did not use your Navigo card, that is why they gave you a fine, not for a zone issue [...]
Customer: Thanks, bye
Agent: Bye

Fig. 4. Example of a dialogue from the DECODA corpus for the SLU task of theme identification, shown here in its English translation (the original dialogue is in French). This dialogue has been labeled with the theme "Transportation cards" ("Cartes de transport").

3.3. Input features and neural network configurations

Input features: [12] shows that an LDA [23] space with 25 topics, and a specific user-agent document segmentation in which the quaternion Q = r1 + xi + yj + zk is built with the user part of the dialogue in the first imaginary value x, the agent part in y, and the topic prior of the whole dialogue in z, achieve the best results over 10 folds with the QMLP. Therefore, we keep this segmentation and concatenate the 10 representations of size 25 into a single input vector of size 250. Indeed, the concatenation of the 10 folds into a single input vector gives the QDNNs more features to generalize patterns. For a fair comparison, a QMLP with the same input vector is tested.

Neural network configurations: First of all, the appropriate size of a single layer for both the DNN (MLP) and the QDNN (QMLP) has to be investigated by varying the number of neurons N, before extending to multiple layers. Different QMLPs and MLPs have thus been learned by varying the hidden layer size from 8 to 1,024. Finally, we trained multiple DNNs and QDNNs by varying the number of layers from 1 to 5. Indeed, it is not straightforward to investigate all the possible topologies using 8 to 1,024 neurons in 1 to 5 layers. Therefore, each layer contains the same fixed number of neurons. During the experiments, a dropout rate of 50% is used for each layer to prevent overfitting.

4. EXPERIMENTAL RESULTS

Section 4.1 details the experiments to find out the "optimal" number of hidden neurons N_l with a real-valued (MLP) and a quaternion-valued (QMLP) neural network. Then, the DNN/QDNN and their pre-trained equivalents DNN-AE/QDNN-AE are compared in Section 4.2. Finally, the performances of all neural network models (real- and quaternion-valued) are summarized in Section 4.3.

4.1. QMLP vs. MLP

Figure 5 shows the accuracies obtained on the development and test data sets with a real-valued and a quaternion-based neural network composed of a single hidden layer (M = 3). To stick with a realistic case, the optimal number of neurons in the hidden layer is chosen with respect to the results obtained on the development data set, by varying the number of neurons in the hidden layer. The best accuracies on the development data set for both the MLP and the QMLP are observed with a hidden layer of 512 neurons. Indeed, the QMLP and the MLP reach an accuracy of 90.38% and 91.38% respectively. Moreover, both MLP and QMLP performances decrease once the hidden layer contains more than 512 neurons.

Fig. 5. Accuracies in % obtained on the development (left) and test (right) data sets by varying the number of neurons (from 8 to 1,024) in the hidden layer of the QMLP and MLP respectively (mean accuracies range from 81.3% to 82.4% on development, and from 74.6% to 75.4% on test).

4.2. Quaternion- and real-valued Deep Neural Networks

This section details the performances obtained for the DNN, QDNN, DNN-AE and QDNN-AE with 1, 2, 3, 4 and 5 layers of 512 neurons each.

DNN vs. QDNN randomly initialized. Tables 1 and 2 show the performances obtained for the straightforward real-valued and the proposed quaternion deep neural networks, trained without autoencoders and learned with dropout noise [18] to prevent overfitting.

The "Real Test" accuracies are the Test-set accuracies obtained with the configuration reaching the best accuracy on the Development data set.

The "Best Test" accuracy is obtained with the best configuration (number of hidden neurons for the MLP/QMLP, or number of hidden layers for the DNN/QDNN) on the Test data set.

Topology   Dev     Best Test   Real Test   Epochs
2-Layers   91.38   84.92       84.30       609
3-Layers   90.80   84.00       84.00       649
4-Layers   86.76   85.23       82.39       413
5-Layers   87.36   80.02       77.36       728

Table 1. Summary of accuracies in % obtained by the DNN.

It is worth emphasizing that, as depicted in Table 1, the results observed for the DNN on the development and test data sets drastically decrease as the number of layers increases. This is due to the small size of the training data set (739 documents). Indeed, there are not enough patterns for the DNN to construct a highly abstract representation of the documents. Conversely, Table 2 shows that the proposed QDNN achieves stable performances, with a standard deviation of barely 0.6 on the development set, while the DNN gives more than 2.0. Indeed, the DNN accuracies drop from 85% with 2/3/4 hidden layers to 80% with 5 hidden layers. This can be easily explained by the random initialization of the large number of neural parameters, which makes it difficult for DNNs to converge to a non-local optimum. Indeed, the Hamilton product of the QDNN constrains the model to learn the latent relations between each component. Therefore, the best DNN results are observed with only 2 hidden layers, with 91.38% and 84.30%, while the QDNN obtains 92.52% and 85.23% with 4 layers, for the development and test data sets respectively. Finally, the QDNN converged about 6 times faster than the DNN with the same topology (148 epochs for the QDNN versus 728 for the DNN with 5 hidden layers, for example).

Topology   Dev     Best Test   Real Test   Epochs
2-Layers   91.95   86.46       84.00       140
3-Layers   91.95   85.53       85.23       113
4-Layers   92.52   86.46       85.23       135
5-Layers   90.80   85.84       84.00       148

Table 2. Summary of accuracies in % obtained by the QDNN.

DNN vs. QDNN initialized with dedicated encoder-decoders. Tables 3 and 4 report the results of the DNN and QDNN pre-trained with dedicated autoencoders (AE for the DNN-AE, QAE for the QDNN-AE). It is worth underlining that the number of epochs required by the DNN-AE to converge is lower than that of the DNN for all topologies, as depicted in Table 1. Moreover, the accuracies reported for the DNN-AE are more stable and increase with the number of layers.

Topology   Dev     Best Test   Real Test   Epochs
2-Layers   90.23   84.00       82.46       326
3-Layers   90.80   84.92       83.69       415
4-Layers   91.52   85.23       84.64       364
5-Layers   91.95   85.23       84.30       411

Table 3. Summary of accuracies in % obtained by the DNN-AE.

The same phenomenon is observed with the QDNN-AE, but with a smaller gain in the number of epochs, along with better reported accuracies. Indeed, the DNN-AE gives an accuracy of 84.30% on the test data set, while the QDNN-AE obtains an accuracy of 86.46% in real conditions, a gain of 2.16 points.

Topology   Dev     Best Test   Real Test   Epochs
2-Layers   92.52   86.46       84.61       100
3-Layers   93.57   86.23       85.83       95
4-Layers   92.52   86.46       86.46       88
5-Layers   93.57   86.76       86.46       132

Table 4. Summary of accuracies in % obtained by the QDNN-AE.

Overall, the pre-training process allows each model (DNN-AE/QDNN-AE) to perform better on the theme identification task of telephone conversations. Indeed, both the DNN-AE and the QDNN-AE need fewer epochs (and thus less processing time) and reach better accuracies, thanks to their pre-training process based on dedicated encoder-decoders, which lets them converge quickly to an optimal configuration (weight matrices w) during the fine-tuning phase.

4.3. QDNN-AE vs. other neural networks

Table 5 sums up the results obtained on the theme identification task of telephone conversations from the DECODA corpus, with different real-valued and quaternion neural networks. The first remark is that the proposed QDNN-AE obtains the best accuracy (86.46%) on both the development and test data sets compared to the real-valued neural networks (deep stacked autoencoder DSAE (82%), MLP (83.38%), DNN (84%) and DNN-AE (84.3%)). Moreover, the randomly initialized QDNN also outperforms all real-valued neural networks, with an accuracy of 85.23%. We can point out that each quaternion-based neural network performs better than its real-valued equivalent, thanks to the Hamilton product (+2.61% for the QMLP, for example). Finally, the QDNN presents a gain of roughly 1.25% compared to the real-valued DNN, and the pre-trained QDNN-AE shows an improvement of 2.16% compared to the DNN-AE.

Models      Type   Dev.    Real Test   Epochs   Impr.
DSAE [34]   R      88.00   82.00       -        -
MLP         R      91.38   83.38       499      +1.38
QMLP        Q      90.38   84.61       381      +2.61
DNN         R      91.38   84.00       609      -
QDNN        Q      92.52   85.23       135      +1.23
DNN-AE      R      91.95   84.30       411      -
QDNN-AE     Q      93.57   86.46       132      +2.16

Table 5. Summary of accuracies in % obtained by the different neural networks on the DECODA framework.

5. CONCLUSION

Summary. This paper proposes a promising deep neural network framework based on the quaternion algebra, coupled with a well-adapted pre-training process made of quaternion encoder-decoders. The initial intuition that the QDNN-AE better captures latent abstract relations between input features, and can generalize from a small corpus thanks to the high dimensionality added by multiple layers, has been demonstrated. It has been shown that a well-suited pre-training process, along with an increased number of neural parameters, allows the QDNN-AE to outperform all the previously investigated models on the DECODA SLU task. Moreover, this paper shows that quaternion-valued neural networks consistently perform better and faster than real-valued ones, achieving impressive accuracies on the small DECODA corpus with a small number of input features and, therefore, few neural parameters.

Limitations and Future Work. The document segmentation process is a crucial issue when it comes to better capturing latent, temporal and spatial information, and thus needs more investigation to expose the full potential of quaternion-based models. Moreover, the proposed DNN algorithms are adapted from real-valued ones and do not take into account the entire set of specificities of the quaternion algebra. Therefore, future work will consist of investigating different neural network structures, such as recurrent and convolutional ones, and of proposing well-tailored learning algorithms adapted to hyper-complex numbers (rotations).

6. REFERENCES

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[2] Dan Ciregan, Ueli Meier, and Jürgen Schmidhuber, "Multi-column deep neural networks for image classification," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 3642–3649.

[3] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," Signal Processing Magazine, IEEE, vol. 29, no. 6, pp. 82–97, 2012.

[4] George E. Dahl, Dong Yu, Li Deng, and Alex Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.

[5] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, "Speech recognition with deep recurrent neural networks," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 6645–6649.

[6] Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur, "Recurrent neural network based language model," in Interspeech, 2010, vol. 2, p. 3.

[7] Ken-ichi Funahashi and Yuichi Nakamura, "Approximation of dynamical systems by continuous time recurrent neural networks," Neural Networks, vol. 6, no. 6, pp. 801–806, 1993.

[8] Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins, "Learning to forget: Continual prediction with LSTM," 1999.

[9] Alex Graves and Jürgen Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5, pp. 602–610, 2005.

[10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.

[11] Paolo Arena, Luigi Fortuna, Giovanni Muscato, and Maria Gabriella Xibilia, "Multilayer perceptrons to approximate quaternion valued functions," Neural Networks, vol. 10, no. 2, pp. 335–342, 1997.

[12] Titouan Parcollet, Mohamed Morchid, Pierre-Michel Bousquet, Richard Dufour, Georges Linarès, and Renato De Mori, "Quaternion neural networks for spoken language understanding," in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 362–368.

[13] Mohamed Morchid, Georges Linarès, Marc El-Beze, and Renato De Mori, "Theme identification in telephone service conversations using quaternions of speech features," in Interspeech. ISCA, 2013.

[14] Teijiro Isokawa, Tomoaki Kusakabe, Nobuyuki Matsui, and Ferdinand Peper, "Quaternion neural network and its application," in Knowledge-Based Intelligent Information and Engineering Systems. Springer, 2003, pp. 318–324.

[15] William Rowan Hamilton, Elements of Quaternions, Longmans, Green, & Company, 1866.

[16] Ronan Collobert and Jason Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 160–167.

[17] Xavier Glorot and Yoshua Bengio, "Understanding the difficulty of training deep feedforward neural networks," in International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.

[18] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[19] Yoshua Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.

[20] Ruslan Salakhutdinov and Geoffrey Hinton, "Deep Boltzmann machines," in Artificial Intelligence and Statistics, 2009, pp. 448–455.

[21] Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle, et al., "Greedy layer-wise training of deep networks," Advances in Neural Information Processing Systems, vol. 19, pp. 153, 2007.

[22] Toshifumi Minemoto, Teijiro Isokawa, Haruhiko Nishimura, and Nobuyuki Matsui, "Feed forward neural network with random quaternionic neurons," Signal Processing, vol. 136, pp. 59–68, 2017.

[23] David M. Blei, Andrew Y. Ng, and Michael I. Jordan, "Latent Dirichlet allocation," The Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.

[24] I. L. Kantor, A. S. Solodovnikov, and A. Shenitzer, Hypercomplex Numbers: An Elementary Introduction to Algebras, Springer-Verlag, 1989.

[25] Jack B. Kuipers, Quaternions and Rotation Sequences, Princeton University Press, Princeton, NJ, USA, 1999.

[26] Fuzhen Zhang, "Quaternions and matrices of quaternions," Linear Algebra and its Applications, vol. 251, pp. 21–57, 1997.

[27] J. P. Ward, Quaternions and Cayley Numbers: Algebra and Applications, vol. 403, Springer, 1997.

[28] Diederik Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[29] Matthew D. Zeiler, "Adadelta: An adaptive learning rate method," arXiv preprint arXiv:1212.5701, 2012.

[30] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio, "Why does unsupervised pre-training help deep learning?," Journal of Machine Learning Research, vol. 11, no. Feb, pp. 625–660, 2010.

[31] John J. Godfrey, Edward C. Holliman, and Jane McDaniel, "Switchboard: Telephone speech corpus for research and development," in Acoustics, Speech, and Signal Processing, 1992. ICASSP-92., 1992 IEEE International Conference on. IEEE, 1992, vol. 1, pp. 517–520.

[32] Frederic Bechet, Benjamin Maza, Nicolas Bigouroux, Thierry Bazillon, Marc El-Beze, Renato De Mori, and Eric Arbillot, "DECODA: a call-centre human-human spoken conversation corpus," in LREC, 2012, pp. 1343–1347.

[33] Georges Linares, Pascal Nocéra, Dominique Massonie, and Driss Matrouf, "The LIA speech recognition system: from 10xRT to 1xRT," in Text, Speech and Dialogue. Springer, 2007, pp. 302–308.

[34] Killian Janod, Mohamed Morchid, Richard Dufour, Georges Linares, and Renato De Mori, "Deep stacked autoencoders for spoken language understanding," ISCA Interspeech, vol. 1, pp. 2, 2016.