Figure 3 - uploaded by Mirco Ravanelli
2: Acoustic reverberation in a typical enclosure. 

Source publication
Article
Full-text available
Deep learning is an emerging technology that is considered one of the most promising directions for reaching higher levels of artificial intelligence. Among other achievements, building computers that understand speech represents a crucial leap towards intelligent machines. Despite the great efforts of the past decades, however, a natural and r...

Similar publications

Preprint
Full-text available
Recent research has shown that word embedding spaces learned from text corpora of different languages can be aligned without any parallel data supervision. Inspired by this success of unsupervised cross-lingual word embeddings, in this paper we aim to learn a cross-modal alignment between the embedding spaces of speech and text learned from corpo...

Citations

... Modern automatic speech recognition (ASR) systems struggle significantly in more realistic distant-talking scenarios [1,2], despite their promising results in close-talking and controlled conditions. Indeed, distant speech recognition is considerably more difficult, as it often involves speech signals that are highly corrupted by noise and reverberation [3,4]. ...
... State-of-the-art speech recognition systems perform reasonably well in close-talking conditions. However, their performance degrades significantly in more realistic distant-talking scenarios, since the signals are corrupted with noise and reverberation [1][2][3]. A common approach to improve the robustness of distant speech recognizers relies on the adoption of multiple microphones [4,5]. ...
Preprint
Full-text available
Despite the significant progress in automatic speech recognition (ASR), distant ASR remains challenging due to noise and reverberation. A common approach to mitigate this issue consists of equipping the recording devices with multiple microphones that capture the acoustic scene from different perspectives. These multi-channel audio recordings contain specific internal relations between the signals. In this paper, we propose to capture these inter- and intra-structural dependencies with quaternion neural networks, which can jointly process multiple signals as whole quaternion entities. The quaternion algebra replaces the standard dot product with the Hamilton one, thus offering a simple and elegant way to model dependencies between elements. The quaternion layers are then coupled with a recurrent neural network, which can learn long-term dependencies in the time domain. We show that a quaternion long short-term memory neural network (QLSTM), trained on the concatenated multi-channel speech signals, outperforms an equivalent real-valued LSTM on two different tasks of multi-channel distant speech recognition.
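Since the key operation in the abstract above is the replacement of the dot product with the Hamilton product, a minimal NumPy sketch of that product is given below; treating the four microphone channels as one quaternion entity is only an illustrative assumption, not the authors' full QLSTM implementation.

import numpy as np

def hamilton_product(q, p):
    # Multiply two quaternions q = (r, x, y, z) and p = (r', x', y', z').
    r1, x1, y1, z1 = q
    r2, x2, y2, z2 = p
    return np.array([
        r1*r2 - x1*x2 - y1*y2 - z1*z2,   # real part
        r1*x2 + x1*r2 + y1*z2 - z1*y2,   # i component
        r1*y2 - x1*z2 + y1*r2 + z1*x2,   # j component
        r1*z2 + x1*y2 - y1*x2 + z1*r2,   # k component
    ])

# Hypothetical example: four microphone channels at one time step, viewed as one quaternion.
x = np.array([0.10, -0.30, 0.20, 0.05])   # 4-channel sample (assumption)
w = np.array([0.90, 0.10, -0.20, 0.40])   # one quaternion-valued weight (assumption)
print(hamilton_product(w, x))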
... Modern automatic speech recognition (ASR) systems struggle significantly in more realistic distant-talking scenarios [1,2], despite their promising results in close-talking and controlled conditions. Indeed, distant speech recognition is considerably more difficult, as it often involves speech signals that are highly corrupted by noise and reverberation [3]. ...
Preprint
Full-text available
Distant speech recognition remains a challenging application for modern deep learning based Automatic Speech Recognition (ASR) systems, due to complex recording conditions involving noise and reverberation. Multiple microphones are commonly combined with well-known speech processing techniques to enhance the original signals and thus improve the performance of the speech recognizer. These multi-channel signals follow similar input distributions with respect to the global speech information but also contain an important amount of noise. Consequently, the robustness of the input representation is key to obtaining reasonable recognition rates. In this work, we propose a Fusion Layer (FL) based on shared neural parameters. We use it to produce an expressive embedding of multiple microphone signals that can easily be combined with any existing ASR pipeline. The proposed model, called FusionRNN, showed promising results on a multi-channel distant speech recognition task and consistently outperformed baseline models while maintaining an equal training time.
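The abstract does not spell out the exact form of the Fusion Layer, so the following PyTorch sketch is only one plausible reading under stated assumptions: a single linear transform whose parameters are shared across all microphone channels, with the per-channel outputs averaged into one embedding. The module name and the fusion-by-averaging step are hypothetical, not the authors' exact FusionRNN design.

import torch
import torch.nn as nn

class SharedFusionLayer(nn.Module):
    # Hypothetical fusion layer: one linear map shared by every channel,
    # followed by averaging into a single embedding (an assumption).
    def __init__(self, feat_dim, embed_dim):
        super().__init__()
        self.shared = nn.Linear(feat_dim, embed_dim)   # same weights for all microphones

    def forward(self, x):                  # x: (batch, n_channels, feat_dim)
        h = torch.tanh(self.shared(x))     # applied independently to each channel
        return h.mean(dim=1)               # (batch, embed_dim) fused embedding

fused = SharedFusionLayer(40, 128)(torch.randn(8, 4, 40))   # 4 mics, 40-dim features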
... Group 1 focused on the use of self-supervised/unsupervised techniques for improving speech recognition in challenging acoustic environments [25]. This activity extended a recent effort, carried out in the months before JSALT by some of the group members, that developed a problem-agnostic speech encoder (PASE) based on multi-task self-supervised learning [22]. ...
Technical Report
This report summarizes the activities and achievements of the JSALT 2019 workshop on Using Cooperative Ad-hoc Microphone Arrays for ASR. Besides its contents, relevant contributions are given by the attached slides, used during the closing ceremony, and by recent paper submissions to ICASSP 2020. The report is organized in six sections: an introduction to the goals of the JSALT workshop, a section on basic issues and on a conceptual framework addressed during the workshop, and three sections covering the activities and results of each team group. A final section draws conclusions and outlines some future activities. The attached documents are the slides of the closing ceremony, five papers submitted to ICASSP 2020, and an additional document on synthetic datasets. In particular, the fields addressed by the five papers are: sample drop detection, voice/overlap speech activity detection, multi-task self-supervised learning, filterbank design for waveform modeling in CNNs, and speech separation. The rationale behind the organization of the following discussion is that we aim to provide more details for topics which are not addressed in the attached submitted papers.
... Deep learning has shown remarkable success in numerous speech tasks, including speech recognition [1][2][3][4] and speaker recognition [5,6]. The deep learning paradigm aims to describe the world around us by means of a hierarchy of representations that are progressively combined to form representations of higher-level abstractions. ...
Preprint
Full-text available
Learning good representations is of crucial importance in deep learning. Mutual Information (MI) and similar measures of statistical dependence are promising tools for learning these representations in an unsupervised way. Even though the mutual information between two random variables is hard to measure directly in high-dimensional spaces, some recent studies have shown that an implicit optimization of MI can be achieved with an encoder-discriminator architecture similar to that of Generative Adversarial Networks (GANs). In this work, we learn representations that capture speaker identities by maximizing the mutual information between the encoded representations of chunks of speech randomly sampled from the same sentence. The proposed encoder relies on the SincNet architecture and transforms the raw speech waveform into a compact feature vector. The discriminator is fed either positive samples (from the joint distribution of encoded chunks) or negative samples (from the product of the marginals) and is trained to separate them. We report experiments showing that this approach effectively learns useful speaker representations, leading to promising results on speaker identification and verification tasks. Our experiments consider both unsupervised and semi-supervised settings and compare the performance achieved with different objective functions.
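To make the positive/negative sampling described above concrete, the hedged PyTorch sketch below forms positive pairs from two chunks of the same sentence and negative pairs by shuffling the pairing across the batch, and trains a discriminator with a binary cross-entropy loss; the toy encoder, discriminator, and loss details are assumptions rather than the paper's exact recipe.

import torch
import torch.nn as nn

def mi_discriminator_loss(enc, disc, chunks_a, chunks_b):
    # chunks_a[i] and chunks_b[i] are assumed to come from the same sentence.
    za, zb = enc(chunks_a), enc(chunks_b)          # compact feature vectors
    pos = disc(torch.cat([za, zb], dim=1))         # samples of the joint distribution
    zb_shuf = zb[torch.randperm(zb.size(0))]       # break the pairing across the batch
    neg = disc(torch.cat([za, zb_shuf], dim=1))    # samples of the product of marginals
    bce = nn.BCEWithLogitsLoss()
    return bce(pos, torch.ones_like(pos)) + bce(neg, torch.zeros_like(neg))

# Toy encoder/discriminator over 3200-sample waveform chunks (hypothetical sizes).
enc = nn.Sequential(nn.Flatten(), nn.Linear(3200, 256))
disc = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 1))
loss = mi_discriminator_loss(enc, disc, torch.randn(8, 3200), torch.randn(8, 3200))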
... To the best of our knowledge, this paper is the first to show the effectiveness of the proposed SincNet in a speech recognition application. Moreover, this work not only considers standard close-talking speech recognition, but also extends the validation of SincNet to distant-talking speech recognition [49,50,51]. ...
Preprint
Full-text available
Deep learning is currently playing a crucial role toward higher levels of artificial intelligence. This paradigm allows neural networks to learn complex and abstract representations that are progressively obtained by combining simpler ones. Nevertheless, the internal "black-box" representations automatically discovered by current neural architectures often suffer from a lack of interpretability, making the study of explainable machine learning techniques of primary interest. This paper summarizes our recent efforts to develop a more interpretable neural model for directly processing speech from the raw waveform. In particular, we propose SincNet, a novel Convolutional Neural Network (CNN) that encourages the first layer to discover more meaningful filters by exploiting parametrized sinc functions. In contrast to standard CNNs, which learn all the elements of each filter, only the low and high cutoff frequencies of band-pass filters are directly learned from data. This inductive bias offers a very compact way to derive a customized filter-bank front-end that depends only on a few parameters with a clear physical meaning. Our experiments, conducted on both speaker and speech recognition, show that the proposed architecture converges faster, performs better, and is more interpretable than standard CNNs.
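The parametrized sinc filters mentioned above admit a simple closed form: each first-layer filter is the difference of two sinc low-pass filters, so only the two cutoff frequencies are learned. The minimal NumPy sketch below follows that formulation but omits implementation details such as minimum-bandwidth constraints and training-time parametrization.

import numpy as np

def sinc_bandpass(f1, f2, kernel_size=251, fs=16000):
    # Band-pass filter parametrized only by its cutoffs f1 < f2 (in Hz):
    # g[n] = 2*f2*sinc(2*pi*f2*n) - 2*f1*sinc(2*pi*f1*n), smoothed with a Hamming window.
    n = (np.arange(kernel_size) - kernel_size // 2) / fs   # time axis in seconds
    low  = 2 * f1 * np.sinc(2 * f1 * n)                    # np.sinc already includes the pi factor
    high = 2 * f2 * np.sinc(2 * f2 * n)
    return (high - low) * np.hamming(kernel_size)          # learnable parameters: f1 and f2 only

h = sinc_bandpass(300.0, 3400.0)   # e.g. a telephone-band filter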
... To the best of our knowledge, this paper is the first to show the effectiveness of the proposed SincNet in a speech recognition application. Moreover, this work not only considers standard close-talking speech recognition, but also extends the validation of SincNet to distant-talking speech recognition [53][54][55]. ...
Preprint
Full-text available
Deep learning is currently playing a crucial role toward higher levels of artificial intelligence. This paradigm allows neural networks to learn complex and abstract representations that are progressively obtained by combining simpler ones. Nevertheless, the internal "black-box" representations automatically discovered by current neural architectures often suffer from a lack of interpretability, making the study of explainable machine learning techniques of primary interest. This paper summarizes our recent efforts to develop a more interpretable neural model for directly processing speech from the raw waveform. In particular, we propose SincNet, a novel Convolutional Neural Network (CNN) that encourages the first layer to discover more meaningful filters by exploiting parametrized sinc functions. In contrast to standard CNNs, which learn all the elements of each filter, only the low and high cutoff frequencies of band-pass filters are directly learned from data. This inductive bias offers a very compact way to derive a customized filter-bank front-end that depends only on a few parameters with a clear physical meaning. Our experiments, conducted on both speaker and speech recognition, show that the proposed architecture converges faster, performs better, and is more interpretable than standard CNNs.
... Over the last few years, we have witnessed a progressive improvement and maturation of Automatic Speech Recognition (ASR) technologies [1,2], which have reached unprecedented performance levels and are nowadays used by millions of users worldwide. ...
... and are later imported into the python environment using the kaldi-io utilities inherited from the kaldi-io-for-python project (github.com/vesis84/kaldi-io-for-python). The features are then processed by the function load-chunk, which performs context window composition, shuffling, as well as mean and variance normalization. The configuration file is fully described in the project documentation. ...
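As a concrete illustration of the excerpt above, the snippet below is a minimal, hypothetical re-implementation of context window composition followed by mean and variance normalization; the real load-chunk function in the toolkit offers more options, so this is only a sketch of the idea.

import numpy as np

def compose_context(feats, left=5, right=5):
    # feats: (n_frames, feat_dim) array of acoustic features.
    # Returns (n_frames, feat_dim * (left + 1 + right)) by stacking each frame
    # with its neighbours; edges are handled by repeating the border frames.
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    windows = [padded[i:i + len(feats)] for i in range(left + right + 1)]
    return np.concatenate(windows, axis=1)

def mvn(feats, eps=1e-8):
    # Per-chunk mean and variance normalization.
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + eps)

chunk = mvn(compose_context(np.random.randn(300, 40)))   # 300 frames of 40-dim features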
Preprint
Full-text available
The availability of open-source software is playing a remarkable role in the popularization of speech recognition and deep learning. Kaldi, for instance, is nowadays an established framework used to develop state-of-the-art speech recognizers. PyTorch is used to build neural networks with the Python language and has recently spawned tremendous interest within the machine learning community thanks to its simplicity and flexibility. The PyTorch-Kaldi project aims to bridge the gap between these popular toolkits, trying to inherit the efficiency of Kaldi and the flexibility of PyTorch. PyTorch-Kaldi is not merely a simple interface between these toolkits; it also embeds several useful features for developing modern speech recognizers. For instance, the code is specifically designed to naturally plug in user-defined acoustic models. As an alternative, users can exploit several pre-implemented neural networks that can be customized using intuitive configuration files. PyTorch-Kaldi supports multiple feature and label streams as well as combinations of neural networks, enabling the use of complex neural architectures. The toolkit is publicly released along with rich documentation and is designed to work properly both locally and on HPC clusters. Experiments conducted on several datasets and tasks show that PyTorch-Kaldi can effectively be used to develop modern state-of-the-art speech recognizers.
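The "plug in user-defined acoustic models" point can be illustrated with an ordinary PyTorch module: custom models are essentially standard nn.Module classes mapping a feature chunk to per-frame posteriors. The class name and layer sizes below are hypothetical and chosen only for illustration, not taken from the toolkit.

import torch
import torch.nn as nn

class MyAcousticModel(nn.Module):
    # Hypothetical user-defined acoustic model: a small MLP mapping
    # context-window features to context-dependent state posteriors.
    def __init__(self, input_dim=440, hidden_dim=1024, n_states=3480):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(), nn.Dropout(0.15),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Dropout(0.15),
            nn.Linear(hidden_dim, n_states),
        )

    def forward(self, x):                               # x: (batch, input_dim)
        return torch.log_softmax(self.net(x), dim=1)    # log-posteriors for HMM decoding

out = MyAcousticModel()(torch.randn(16, 440))           # 16 frames, 11-frame context of 40-dim features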