Table 4 - uploaded by Stefano Squartini
NSegSRR values for processed audio files of meeting IS1009b. 

Source publication
Article
This paper proposes a real-time person activity detection framework operating in the presence of multiple sources in reverberated environments. The framework is composed of two main parts: the speech enhancement front-end and the activity detector. The aim of the former is to automatically reduce the distortions introduced by room reverberation in t...

Contexts in source publication

Context 1
... calculating the NSegSRR value, the involved signals are assumed to be time-aligned. Table 4 reports the NSegSRR values for the processed audio files of meeting IS1009b, for each source and for every reverberation time. For comparison, the NSegSRR of the non-processed audio files has been evaluated as well. ...
Context 2
... is consistent with the NSegSRR values shown in Sec. 4.2, in which the same behaviour can be observed in the non-processed (Table 4) and processed (Table 5) results. ...
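
The contexts above refer to the NSegSRR (normalized segmental signal-to-reverberation ratio) computed on time-aligned signals. The exact definition used in the source publication is not reproduced on this page, so the following is only a minimal sketch of one common formulation: the processed signal is rescaled by a gain `alpha`, and the ratio between the direct-path energy and the residual energy is averaged over frames, in dB. The function name `nsegsrr`, the frame length, the hop size, and the least-squares choice of `alpha` are all assumptions made for illustration.

```python
import numpy as np


def nsegsrr(direct, processed, frame_len=512, hop=256, eps=1e-12):
    """Mean segmental signal-to-reverberation ratio in dB.

    Hypothetical helper: `direct` is the direct-path (reference) signal,
    `processed` is the signal under test. Both are assumed time-aligned.
    """
    direct = np.asarray(direct, dtype=float)
    processed = np.asarray(processed, dtype=float)
    n = min(len(direct), len(processed))
    direct, processed = direct[:n], processed[:n]

    # Assumption: normalize with the least-squares gain alpha that best
    # maps the direct signal onto the processed one.
    alpha = np.dot(processed, direct) / (np.dot(direct, direct) + eps)
    if abs(alpha) < eps:
        alpha = 1.0
    normalized = processed / alpha

    values = []
    for start in range(0, n - frame_len + 1, hop):
        d = direct[start:start + frame_len]
        residual = normalized[start:start + frame_len] - d
        energy = np.sum(d ** 2)
        if energy < eps:
            continue  # skip frames where the reference is silent
        values.append(10.0 * np.log10(energy / (np.sum(residual ** 2) + eps)))

    return float(np.mean(values)) if values else float("nan")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000)
    degraded = clean + 0.1 * rng.standard_normal(16000)
    print(f"NSegSRR sketch: {nsegsrr(clean, degraded):.1f} dB")
```

Higher values indicate less residual reverberation relative to the direct-path signal, which is how a comparison between processed and non-processed files would be read under this formulation.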

Similar publications

Article
We present Deep Voice, a production-quality text-to-speech system constructed entirely from deep neural networks. Deep Voice lays the groundwork for truly end-to-end neural speech synthesis. The system comprises five major building blocks: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme durati...
Conference Paper
This work proposes a dominance detection framework operating in reverberated environments. The framework is composed of a speech enhancement front-end, which automatically reduces the distortions introduced by room reverberation in the speech signals, and a dominance detector, which processes the enhanced signals and estimates the most and least do...

Citations

... In this paper, PU is defined as an indicator of the uncertainty level of the perception of an emotional state for a given observed sample. As mentioned in the section "Introduction", emotion prediction is a subjective task that differs from many other objective pattern recognition tasks, such as object detection [22] and speaker identification [43], where there is a ground truth. In contrast, to obtain a gold standard for a subjective task like emotion recognition, it is common that a number of raters are required to annotate the same sample to minimise the individual bias in perception and rating as much as possible. ...
Article
Predicting emotions automatically is an active field of research in affective computing. Because the perception of emotion is subjective, the label of an emotional instance is usually created from the opinions of multiple annotators. That is, the labelled instance is often accompanied by the corresponding inter-rater disagreement information, which we call here the perception uncertainty. Such uncertainty information, as shown in previous studies, can provide supplementary information for better recognition performance in such a subjective task. In this paper, we propose a multi-task learning framework that leverages the knowledge of perception uncertainty to improve the prediction performance. In particular, in our novel framework, the perception uncertainty is exploited explicitly to adjust an initial prediction dynamically, in contrast to merely estimating the emotional state and perception uncertainty simultaneously, as done in a conventional multi-task learning framework. To evaluate the feasibility and effectiveness of the proposed method, we perform extensive experiments for time- and value-continuous emotion predictions in audiovisual conversation and music listening scenarios. Compared with other state-of-the-art approaches, our approach yields remarkable performance improvements on both datasets. The obtained results indicate that integrating the perception uncertainty information can enhance the learning process.
... Last but not least, the application of the proposed framework to automatic speech recognition will be analyzed: some work has already been done by the authors [37], but more effort is needed to take the presence of noise into account and to suitably integrate the framework with the feature extraction front-end [31,40]. Other relevant application scenarios to be investigated in the near future are keyword spotting [43,44], activity detection [32], dominance estimation [20,33], and emotion understanding and recognition [5,7,38]. ...
Article
This paper deals with speech enhancement in noisy reverberated environments where multiple speakers are active. The authors propose an advanced real-time speech processing front-end aimed at automatically reducing the distortions introduced by room reverberation in distant speech signals, also considering the presence of background noise, and thus to achieve a significant improvement in speech quality for each speaker. The overall framework is composed of three cooperating blocks, each one fulfilling a specific task: speaker diarization, room impulse responses identification and speech dereverberation. In particular, the speaker diarization algorithm pilots the operations performed in the other two algorithmic stages, which have been suitably designed and parametrized to operate with noisy speech observations. Extensive computer simulations have been performed by using a subset of the AMI database under different realistic noisy and reverberated conditions. Obtained results show the effectiveness of the approach.
... Usually, speaker diarization is a pre-processing stage of automatic speech recognizers [10], but recently it has been successfully employed also in dereverberation-robust front-ends for speaker activity detection [11]. The diarization algorithm here addressed is based on [12], which describes its real-time implementation on the BeagleBoard platform. ...
Conference Paper
The ever-increasing energy requirements of supercomputers and server farms are driving the scientific and industrial communities to give deeper consideration to the energy efficiency of computing equipment. This contribution addresses the issue by proposing a cluster of ARM processors for high-performance computing. The cluster is composed of five BeagleBoard-xM boards, with one board managing the cluster and the other boards executing the actual processing. The software platform is based on the Angstrom GNU/Linux distribution and is equipped with a distributed file system to ease sharing data and code among the nodes of the cluster, and with tools for managing tasks and monitoring the status of each node. The computational capabilities of the cluster have been assessed through High-Performance Linpack and a cluster-wide speaker diarization algorithm, while power consumption has been measured using a clamp meter. Experimental results obtained in the speaker diarization task showed that the energy efficiency of the BeagleBoard-xM cluster is comparable to that of a laptop computer equipped with an Intel Core2 Duo T8300 running at 2.4 GHz. Furthermore, once the bottleneck due to the Ethernet interface is removed, the BeagleBoard-xM cluster is able to achieve superior energy efficiency.
Article
In this paper, we present a new model-based approach for human gait recognition in the sagittal plane via deterministic learning (DL) theory. Side silhouette lower limb joint angles characterize the gait system dynamics and are selected as the gait feature. Locally accurate identification of the gait system dynamics is achieved by using radial basis function (RBF) neural networks through DL. The obtained knowledge of the approximated gait system dynamics is stored in constant RBF networks. A gait signature is then derived from the extracted gait system dynamics along the phase portrait of joint angles. A bank of estimators is constructed using constant RBF networks to represent the training gait patterns. By comparing the set of estimators with a test gait pattern, a set of recognition errors is generated, and the average L1 norms of the errors are taken as the similarity measure between the dynamics of the training gait patterns and the dynamics of the test gait pattern. Therefore, the test gait pattern can be rapidly recognized according to the smallest error principle. In contrast to other existing approaches, the main focus of this paper is on obtaining and reusing the knowledge of the gait system dynamics. Finally, experiments are carried out on the CASIA-A and CASIA-B gait databases to benchmark the effectiveness of the proposed approach.