Content uploaded by Titouan Parcollet

Author content

All content in this area was uploaded by Titouan Parcollet on Jun 05, 2018

Content may be subject to copyright.

Recently, the connectionist temporal classification (CTC) model, coupled with recurrent (RNN) or convolutional neural networks (CNN), has made it easier to train speech recognition systems in an end-to-end fashion. However, in real-valued models, time frame components such as mel-filter-bank energies and the cepstral coefficients obtained from them, together with their first and second order derivatives, are processed as individual elements, while a natural alternative is to process such components as composed entities. We propose to group such elements in the form of quaternions and to process these quaternions using the established quaternion algebra. Quaternion numbers and quaternion neural networks have shown their efficiency in processing multidimensional inputs as entities, encoding internal dependencies, and solving many tasks with fewer learning parameters than real-valued models. This paper proposes to integrate multiple feature views in a quaternion-valued convolutional neural network (QCNN), to be used for sequence-to-sequence mapping with the CTC model. Promising results are reported using simple QCNNs in phoneme recognition experiments with the TIMIT corpus. More precisely, QCNNs obtain a lower phoneme error rate (PER) with fewer learning parameters than a competing model based on real-valued CNNs.
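
The grouping idea in the abstract can be illustrated with a minimal Python sketch. It packs one time-frequency bin and its derivatives into a single quaternion and multiplies it by a quaternion weight using the Hamilton product. The helper names (`pack_frame`, `hamilton_product`) and the component layout (energy in the real part, first and second derivatives in the i and j parts, k set to zero) are assumptions made for illustration and are not taken from the paper.

```python
from typing import Tuple

# A quaternion stored as (r, i, j, k).
Quaternion = Tuple[float, float, float, float]


def hamilton_product(p: Quaternion, q: Quaternion) -> Quaternion:
    """Hamilton product p * q (non-commutative)."""
    r1, x1, y1, z1 = p
    r2, x2, y2, z2 = q
    return (
        r1 * r2 - x1 * x2 - y1 * y2 - z1 * z2,
        r1 * x2 + x1 * r2 + y1 * z2 - z1 * y2,
        r1 * y2 - x1 * z2 + y1 * r2 + z1 * x2,
        r1 * z2 + x1 * y2 - y1 * x2 + z1 * r2,
    )


def pack_frame(energy: float, delta: float, delta2: float) -> Quaternion:
    """Hypothetical layout: one feature view per quaternion component."""
    return (energy, delta, delta2, 0.0)


# One quaternion weight acts on all views of the feature at once,
# so every output component mixes every input component.
feature = pack_frame(energy=1.2, delta=-0.3, delta2=0.05)
weight = (0.5, -0.1, 0.2, 0.0)
output = hamilton_product(weight, feature)
```

Because each output component depends on all four input components through the same four weight values, the views are processed as one entity rather than as independent real channels.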

... In other words, the degree of freedom of quaternion feedforward is only one-fourth of the corresponding parameters in its real space, thus reducing the parameterization by 75% [23]. Similarly, for complex numbers and quaternions, this parameterization reduction can also be explained by weight sharing [24,25]. Complex numbers, quaternions, and GA share parameters in their real and imaginary parts, reducing the number of parameters. ...
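
The 75% figure quoted above can be checked with a short parameter count. This is a sketch under stated assumptions: a fully connected layer, biases ignored, and input and output widths divisible by four. The function names are hypothetical.

```python
def real_dense_params(n_in: int, n_out: int) -> int:
    """Weights in a real-valued dense layer (biases ignored)."""
    return n_in * n_out


def quaternion_dense_params(n_in: int, n_out: int) -> int:
    """Weights in a quaternion dense layer covering the same real widths.

    The n_in real inputs are grouped into n_in / 4 quaternions, and each
    quaternion weight stores four real numbers.
    """
    if n_in % 4 or n_out % 4:
        raise ValueError("widths must be divisible by 4")
    return (n_in // 4) * (n_out // 4) * 4


real = real_dense_params(256, 256)              # 256 * 256 = 65536
quat = quaternion_dense_params(256, 256)        # 64 * 64 * 4 = 16384
reduction = 1.0 - quat / real                   # 0.75
```

The ratio is independent of layer size: grouping by four divides the weight matrix by sixteen, and each quaternion weight brings back a factor of four.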

... Quaternion model. Quaternion networks effectively represent spatial transformations and analyze multidimensional signals, having been utilized in challenging tasks such as image and language processing [25,31]. In this study, the quaternion model is similarly built upon the real-valued model by substituting each module with quaternion counterparts, where the four pressure layer data of the wind field serve as the four components of quaternions. ...

Typhoons, as highly destructive natural disasters, significantly impact society and the economy, making accurate prediction of their intensity crucial. Artificial intelligence technologies, such as machine learning, have been extensively applied in the meteorological domain due to their advantages in processing large-scale datasets and learning complex nonlinear relationships. However, existing machine learning models for typhoon intensity prediction face issues like high complexity and parameter count, increasing computational demands and risking overfitting. This study leverages the advantages of geometric algebra for the unified expression and computation of multi-dimensional variables and introduces a geometric algebra-based convolutional neural network with spatial attention to enhance typhoon intensity predictions. This study employed typhoon data from the Northwest Pacific from 2000 to 2018. The results indicate that, compared to the traditional forecasting methods of the China Meteorological Administration (CMA), the proposed geometric algebra neural network model improved prediction accuracy by approximately 9% and reduced the root mean square error by about 5% compared to real-valued neural network models. Most importantly, while maintaining predictive performance, the geometric algebra neural network significantly reduced the number of model parameters, achieving up to a fourfold decrease compared to real-valued neural networks. This study offers new insights and technical advancements for using geometric algebra to analyze long-term, large-scale meteorological data.

... To address this, here we have designed a 3D deep learning framework based on quaternion convolution neural networks (QCNN) with self-attention alongside physics-based loss to superresolve high resolution 3D maps using as little data as possible. Using real-valued convolution for quaternion-based data has been shown to be inefficient and to lose the inter-channel relationships that arise from quaternion vector interdependencies 20, which leads to longer training times and larger data burdens. We demonstrate that a quaternion-valued neural network is more efficient and produces better results than real-valued convolution neural networks such as those used in previous work 19. ...

... Inspired by this idea, we propose the use of quaternion self-attention for EBSD super-resolution, using physics-aware quaternion convolution for orientation recognition, a physics-based loss function that is sensitive to material crystal symmetry, and progressive learning to incorporate long-range material relationships. Physics-aware quaternion convolution follows the approach of 20,21,32, where convolution is depth-wise and uses a reduced number of interdependent weights whose connectivity is based on the Hamiltonian. We use a loss function that accurately measures the crystal orientations in EBSD maps and also accounts for the hexagonal close-packed symmetry present in α-phase Ti-6Al-4V and Ti-7Al, the two alloys investigated here. ...

Gathering 3D material microstructural information is time-consuming, expensive, and energy-intensive. Acquisition of 3D data has been accelerated by developments in serial sectioning instrument capabilities; however, for crystallographic information, the electron backscatter diffraction (EBSD) imaging modality remains rate limiting. We propose a physics-based efficient deep learning framework to reduce the time and cost of collecting 3D EBSD maps. Our framework uses a quaternion residual block self-attention network (QRBSA) to generate high-resolution 3D EBSD maps from sparsely sectioned EBSD maps. In QRBSA, quaternion-valued convolution effectively learns local relations in orientation space, while self-attention in the quaternion domain captures long-range correlations. We apply our framework to 3D data collected from commercially relevant titanium alloys, showing both qualitatively and quantitatively that our method can predict missing samples (EBSD information between sparsely sectioned mapping points) as compared to high-resolution ground truth 3D EBSD maps.

... This approach represents the relationship between RGB colors in an image through the rotation of quaternions in the imaginary parts (i, j, and k axes), and has shown higher accuracy in color image processing compared to real-valued convolutional neural networks. The quaternion convolutional neural network, proposed by Parcollet et al. [12], introduces a model that divides the feature map into individual components of quaternions and convolves them. This approach has demonstrated superior recognition capabilities in speech recognition tasks compared to real-valued CNNs. ...

The purpose of this paper is to propose a new multi-layer feedforward quaternion neural network model architecture, Reverse Quaternion Neural Network which utilizes the non-commutative nature of quaternion products, and to clarify its learning characteristics. While quaternion neural networks have been used in various fields, there has been no research report on the characteristics of multi-layer feedforward quaternion neural networks where weights are applied in the reverse direction. This paper investigates the learning characteristics of the Reverse Quaternion Neural Network from two perspectives: the learning speed and the generalization on rotation. As a result, it is found that the Reverse Quaternion Neural Network has a learning speed comparable to existing models and can obtain a different rotation representation from the existing models.

... The reason is that, regarding human speech, the spectrogram plots are hard to identify, as they look similar to one another. Even so, Parcollet et al. [13] stated in their study that convolutional neural network when used with deep learning techniques would enable an AI to improve its classification parameters and identify sounds of individuals, given that a specific speech input has already been made. ...

Deep learning (DL) techniques, which implement deep neural networks, became popular due to the increase of high-performance computing facilities. DL achieves higher power and flexibility due to its ability to process many features when it deals with unstructured data. A DL algorithm passes the data through several layers; each layer extracts features progressively and passes them to the next layer. Initial layers extract low-level features, and succeeding layers combine features to form a complete representation. This research attempts to utilize DL techniques for identifying sounds. The development in DL models has extensively covered classification and verification of objects through images. However, there have not been any notable findings concerning identification and verification of the voice of an individual from among other individuals using DL techniques. Hence, the proposed research aims to develop DL techniques capable of isolating the voice of an individual from a group of other sounds and classifying it using the convolutional neural network models AlexNet and ResNet. On the voice identification problem, the ResNet and AlexNet models achieved classification accuracies of 97.2039% and 65.95%, respectively, with ResNet giving the best result.

... We trained all models for 300 epochs, using the Adam optimizer, with a learning rate of 0.001, batch size of 32, and binary cross-entropy loss function. Although the initialization of the weights can affect significantly the performance of the network, we are not aware of a study addressing the initialization of Clifford deep networks besides complex and quaternion-valued models [32,12,27]. However, Clifford neural networks can be implemented as real-valued models in standard deep-learning libraries and, in our computational experiments, we used the weight initialization methods available in Keras and Tensorflow. ...

... Convolution in the quaternion domain formally can be defined the same as that in the real domain [20,22,23] ...

Since their first applications, Convolutional Neural Networks (CNNs) have solved problems that have advanced the state-of-the-art in several domains. CNNs represent information using real numbers. Despite encouraging results, theoretical analysis shows that representations such as hyper-complex numbers can achieve richer representational capacities than real numbers, and that Hamilton products can capture intrinsic interchannel relationships. Moreover, in the last few years, experimental research has shown that Quaternion-valued CNNs (QCNNs) can achieve similar performance with fewer parameters than their real-valued counterparts. This paper condenses research in the development of QCNNs from its very beginnings. We propose a conceptual organization of current trends and analyze the main building blocks used in the design of QCNN models. Based on this conceptual organization, we propose future directions of research.

This study proposes a set of generic rules to revise existing neural networks for 3D point cloud processing to rotation-equivariant quaternion neural networks (REQNNs), in order to make the feature representations of neural networks rotation-equivariant and permutation-invariant. Rotation equivariance of features means that the feature computed on a rotated input point cloud is the same as applying the same rotation transformation to the feature computed on the original input point cloud. We find that the rotation-equivariance of features is naturally satisfied if a neural network uses quaternion features. Interestingly, we prove that such a network revision also makes gradients of features in the REQNN rotation-equivariant w.r.t. inputs, and the training of the REQNN rotation-invariant w.r.t. inputs. Besides, permutation-invariance examines whether the intermediate-layer features are invariant when we reorder input points. We also evaluate the stability of knowledge representations of REQNNs, and the robustness of REQNNs to adversarial rotation attacks. Experiments have shown that REQNNs outperform traditional neural networks in terms of both classification accuracy and robustness on rotated testing samples.

The field of deep learning has seen significant advancement in recent years. However, much of the existing work has been focused on real-valued numbers. Recent work has shown that a deep learning system using the complex numbers can be deeper for a fixed parameter budget compared to its real-valued counterpart. In this work, we explore the benefits of generalizing one step further into the hyper-complex numbers, quaternions specifically, and provide the architecture components needed to build deep quaternion networks. We go over quaternion convolutions, present a quaternion weight initialization scheme, and present algorithms for quaternion batch-normalization. These pieces are tested in a classification model by end-to-end training on the CIFAR-10 and CIFAR-100 data sets and a segmentation model by end-to-end training on the KITTI Road Segmentation data set. The quaternion networks show improved convergence compared to real-valued and complex-valued networks, especially on the segmentation task.

Speech recognition is largely taking advantage of deep learning, showing that substantial benefits can be obtained by modern Recurrent Neural Networks (RNNs). The most popular RNNs are Long Short-Term Memory networks (LSTMs), which typically reach state-of-the-art performance in many tasks thanks to their ability to learn long-term dependencies and robustness to vanishing gradients. Nevertheless, LSTMs have a rather complex design with three multiplicative gates, which might impair their efficient implementation. An attempt to simplify LSTMs has recently led to Gated Recurrent Units (GRUs), which are based on just two multiplicative gates. This paper builds on these efforts by further revising GRUs and proposing a simplified architecture potentially more suitable for speech recognition. The contribution of this work is two-fold. First, we suggest removing the reset gate in the GRU design, resulting in a more efficient single-gate architecture. Second, we propose to replace tanh with ReLU activations in the state update equations. Results show that, in our implementation, the revised architecture reduces the per-epoch training time by more than 30% and consistently improves recognition performance across different tasks, input features, and noisy conditions when compared to a standard GRU.
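
The two changes described in this abstract (no reset gate, ReLU instead of tanh in the candidate state) can be sketched for a single scalar unit as follows. This is an illustrative reading of the abstract only: the function and parameter names are hypothetical, biases are omitted, and any normalization the authors may apply is left out.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def relu(x: float) -> float:
    return max(0.0, x)


def single_gate_gru_step(x: float, h_prev: float,
                         wz: float, uz: float,
                         wh: float, uh: float) -> float:
    """One step of a GRU variant with only an update gate (scalar case)."""
    # Update gate: how much of the previous state to keep.
    z = sigmoid(wz * x + uz * h_prev)
    # Candidate state: no reset gate on h_prev, ReLU instead of tanh.
    h_cand = relu(wh * x + uh * h_prev)
    # Interpolate between the previous state and the candidate.
    return z * h_prev + (1.0 - z) * h_cand


# Run the cell over a short input sequence with hypothetical weights.
h = 0.0
for x in [0.5, -1.0, 2.0]:
    h = single_gate_gru_step(x, h, wz=0.3, uz=0.1, wh=1.0, uh=0.5)
```

Compared with a standard GRU step, this drops one gate's worth of weights and one gate computation per time step, which is consistent with the reduced per-epoch training time reported above.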

Deep Neural Networks (DNN) have received great interest from researchers due to their capability to construct robust abstract representations of heterogeneous documents in a latent subspace. Nonetheless, mere real-valued deep neural networks require an appropriate adaptation, such as the convolution process, to capture latent relations between input features. Moreover, real-valued deep neural networks reveal little in the way of document internal dependencies, by only considering words or topics contained in the document as isolated basic elements. Quaternion-valued multi-layer perceptrons (QMLP) and autoencoders (QAE) have been introduced to capture such latent dependencies, as well as to represent multidimensional data. Nonetheless, a three-layered neural network does not benefit from the high abstraction capability of DNNs. The paper proposes first to extend the hyper-complex algebra to deep neural networks (QDNN) and, then, introduces pre-trained deep quaternion neural networks (QDNN-AE) with dedicated quaternion encoder-decoders (QAE). The experiments conducted on a theme identification task of spoken dialogues from the DECODA data set show, inter alia, that the QDNN-AE reaches a promising gain of 2.2% compared to the standard real-valued DNN-AE. Index Terms: Quaternions, deep neural networks, spoken language understanding, autoencoders, machine learning.

In the last decades, encoder-decoders or autoencoders (AE) have received great interest from researchers due to their capability to construct robust representations of documents in a low dimensional subspace. Nonetheless, autoencoders reveal little in the way of spoken document internal structure by only considering words or topics contained in the document as isolated basic elements, and tend to overfit with small corpora of documents. Therefore, Quaternion Multi-layer Perceptrons (QMLP) have been introduced to capture such internal latent dependencies, whereas denoising autoencoders (DAE) are composed with different stochastic noises to better process small sets of documents. This paper presents a novel autoencoder based on both the hitherto-proposed DAE (to manage small corpora) and the QMLP (to consider internal latent structures) called "Quaternion denoising encoder-decoder" (QDAE). Moreover, the paper defines an original angular Gaussian noise adapted to the specificity of hyper-complex algebra. The experiments, conducted on a theme identification task of spoken dialogues from the DECODA framework, show that the QDAE obtains promising gains of 3% and 1.5% compared to the standard real-valued denoising autoencoder and the QMLP respectively.

A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or object part. We use the length of the activity vector to represent the probability that the entity exists and its orientation to represent the instantiation parameters. Active capsules at one level make predictions, via transformation matrices, for the instantiation parameters of higher-level capsules. When multiple predictions agree, a higher level capsule becomes active. We show that a discriminatively trained, multi-layer capsule system achieves state-of-the-art performance on MNIST and is considerably better than a convolutional net at recognizing highly overlapping digits. To achieve these results we use an iterative routing-by-agreement mechanism: A lower-level capsule prefers to send its output to higher level capsules whose activity vectors have a big scalar product with the prediction coming from the lower-level capsule.
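
For a vector length to act as a probability it has to stay below one. The abstract does not say how this is enforced; the sketch below uses the "squashing" nonlinearity commonly associated with capsule networks as an assumption, together with the scalar-product agreement score mentioned above. Function names are hypothetical.

```python
import math
from typing import List


def squash(s: List[float], eps: float = 1e-9) -> List[float]:
    """Shrink vector s so its length lies in [0, 1), keeping its direction."""
    sq_norm = sum(c * c for c in s)
    norm = math.sqrt(sq_norm)
    # Length becomes |s|^2 / (1 + |s|^2); eps avoids division by zero.
    scale = sq_norm / (1.0 + sq_norm) / (norm + eps)
    return [scale * c for c in s]


def agreement(prediction: List[float], activity: List[float]) -> float:
    """Scalar product used to score how well a prediction matches a capsule."""
    return sum(p * a for p, a in zip(prediction, activity))


def length(v: List[float]) -> float:
    return math.sqrt(sum(c * c for c in v))


activity = squash([3.0, 4.0])        # input length 5, output length close to 25/26
score = agreement([0.6, 0.8], activity)
```

Short vectors are pushed toward zero length and long vectors toward unit length, while the orientation, which carries the instantiation parameters, is unchanged.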

Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk. Second, we derive a robust initialization method that particularly considers the rectifier nonlinearities. This method enables us to train extremely deep rectified models directly from scratch and to investigate deeper or wider network architectures. Based on our PReLU networks (PReLU-nets), we achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset. This is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66%). To our knowledge, our result is the first to surpass human-level performance (5.1%, Russakovsky et al.) on this visual recognition challenge.
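
The two contributions named in this abstract can be sketched in a few lines of Python. The PReLU definition below follows directly from "generalizes the traditional rectified unit". The standard-deviation formula is the form usually attributed to this initialization method and is not stated in the abstract, so treat it as an assumption; the function names are hypothetical.

```python
import math


def prelu(x: float, a: float) -> float:
    """Parametric ReLU: identity for positive inputs, slope a otherwise.

    a = 0 recovers the ordinary ReLU; in a network, a is learned.
    """
    return x if x > 0 else a * x


def rectifier_init_std(fan_in: int, a: float = 0.0) -> float:
    """Standard deviation for zero-mean Gaussian weights feeding a rectifier.

    Assumed form: sqrt(2 / ((1 + a^2) * fan_in)).
    """
    return math.sqrt(2.0 / ((1.0 + a * a) * fan_in))


# Relative improvement quoted in the abstract: (6.66 - 4.94) / 6.66.
relative_improvement = (6.66 - 4.94) / 6.66   # about 0.258, i.e. roughly 26%
```

The arithmetic at the end confirms that the 26% figure is a relative reduction in top-5 error, not a difference in percentage points.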

Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However, RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.

Segmental conditional random fields (SCRFs) and connectionist temporal classification (CTC) are two sequence labeling objectives used for end-to-end training of speech recognition models. Both models define the transcription probability by marginalizing decisions about latent segmentation alternatives to derive a sequence probability: the former uses a globally normalized joint model of segment labels and durations, and the latter classifies each frame as either an output symbol or a "continuation" of the previous label. In this paper, we train a recognition model by optimizing an interpolation between the SCRF and CTC losses, where the same recurrent neural network (RNN) encoder is used for feature extraction for both outputs. We find that this multi-task objective improves recognition accuracy when decoding with either the SCRF or CTC models. Additionally, we show that CTC can also be used to pretrain the RNN encoder, which improves the convergence rate when learning the joint model.
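
The frame-level view of CTC described here can be made concrete with the usual collapse rule, which maps a per-frame label sequence to an output sequence. The sketch below uses the common blank-symbol formulation, which is an assumption relative to the "continuation" wording in the abstract; the blank character and function name are hypothetical.

```python
from itertools import groupby
from typing import List


def ctc_collapse(frames: List[str], blank: str = "-") -> List[str]:
    """Map per-frame labels to an output sequence.

    Step 1: merge runs of identical consecutive labels.
    Step 2: drop the blank symbol.
    """
    merged = [symbol for symbol, _ in groupby(frames)]
    return [symbol for symbol in merged if symbol != blank]


# Many frame-level paths collapse to the same transcription; CTC sums
# the probabilities of all such paths during training.
path_a = ctc_collapse(list("hh-e-ll-lo"))
path_b = ctc_collapse(list("h-ee-l-l-o"))
```

A blank between two identical labels is what allows a repeated symbol, such as the double "l" above, to survive the merge step.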

Convolutional Neural Networks (CNNs) are effective models for reducing spectral variations and modeling spectral correlations in acoustic features for automatic speech recognition (ASR). Hybrid speech recognition systems incorporating CNNs with Hidden Markov Models/Gaussian Mixture Models (HMMs/GMMs) have achieved the state-of-the-art in various benchmarks. Meanwhile, Connectionist Temporal Classification (CTC) with Recurrent Neural Networks (RNNs), which is proposed for labeling unsegmented sequences, makes it feasible to train an end-to-end speech recognition system instead of hybrid settings. However, RNNs are computationally expensive and sometimes difficult to train. In this paper, inspired by the advantages of both CNNs and the CTC approach, we propose an end-to-end speech framework for sequence labeling, by combining hierarchical CNNs with CTC directly without recurrent connections. By evaluating the approach on the TIMIT phoneme recognition task, we show that the proposed model is not only computationally efficient, but also competitive with the existing baseline systems. Moreover, we argue that CNNs have the capability to model temporal correlations with appropriate context information.