ABSTRACT: The purpose of this paper is to present a method for real-time augmented reality sound production from virtual sources located in a real environment. In the performed experiments, we initially emphasize augmenting audio information, beyond the existing environmental sounds, using headphones. The main goal of the approach is to produce a virtual sound that sounds natural, so that the user becomes immersed and perceives a context-aware synthetic sound. The necessary data, such as the spatial coordinates of source and listener, the relative distance and relative velocity between them, the room dimensions, and potential obstacles between virtual source and listener, are given as input to the proposed framework. Real-time techniques, fast and effective enough to meet the high performance requirements, are used for data processing. The resulting sound gives the listener the impression that the virtual source is part of the real environment. Any dynamic change in these parameters results in a simultaneous real-time change in the produced sound.
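The geometric inputs listed above (source and listener positions, relative distance) can drive a very small stereo renderer. The sketch below is an illustrative assumption, not the paper's framework: it applies an inverse-distance gain and an interaural time difference (ITD) from a fixed ~9 cm head half-width, in a toy 2-D geometry.

```python
import numpy as np

def spatialize(mono, src, lst, fs=44100, c=343.0):
    """Toy stereo spatializer: inverse-distance gain plus an interaural
    time difference derived from the source azimuth. `src` and `lst`
    are 2-D positions in metres (hypothetical helper, not the paper's)."""
    src, lst = np.asarray(src, float), np.asarray(lst, float)
    dist = np.linalg.norm(src - lst)
    gain = 1.0 / max(dist, 1.0)                  # inverse-distance law
    azimuth = np.arctan2(src[1] - lst[1], src[0] - lst[0])
    itd = 0.09 * np.sin(azimuth) / c             # ~9 cm head half-width
    shift = int(round(abs(itd) * fs))            # ITD in whole samples
    delayed = np.pad(mono, (shift, 0))[: len(mono)]
    left = delayed if itd > 0 else mono          # positive ITD: delay left ear
    right = delayed if itd < 0 else mono
    return gain * np.stack([left, right])
```

Any change in `src` or `lst` immediately changes gain and ITD, which is the simplest form of the dynamic behaviour the abstract describes.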
ABSTRACT: Audio analysis of a speaker's surroundings is a first step for several processing systems that support the speaker's mobility throughout daily life. These algorithms usually operate with short-time analysis, decomposing the incoming events in the time and frequency domains. In this paper, an automatic sound recognizer is studied, which investigates audio events of interest from an urban environment. Our experiments were conducted using a closed set of audio events, from which well-known and commonly used audio descriptors were extracted, and models were trained using powerful machine learning algorithms. The best urban sound recognition performance was achieved by SVMs, with an accuracy of approximately 93%.
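As an illustration of the short-time analysis pipeline, the sketch below frames a signal and extracts two classic descriptors (short-time energy and zero-crossing rate), then separates two toy "urban" classes with a nearest-centroid rule. Both the two-descriptor set and the centroid classifier are simplifying stand-ins for the paper's descriptor set and SVMs.

```python
import numpy as np

def frame_features(x, fs, win=0.025, hop=0.010):
    """Short-time energy and zero-crossing rate per 25 ms analysis frame."""
    n, h = int(win * fs), int(hop * fs)
    feats = []
    for start in range(0, len(x) - n, h):
        f = x[start:start + n]
        energy = np.mean(f ** 2)
        zcr = np.mean(np.abs(np.diff(np.sign(f)))) / 2
        feats.append((energy, zcr))
    return np.array(feats)

# two toy classes: a low tone vs. broadband noise, one centroid per class
fs = 8000
t = np.arange(fs) / fs
rng = np.random.default_rng(0)
c_tone = frame_features(np.sin(2 * np.pi * 200 * t), fs).mean(axis=0)
c_noise = frame_features(rng.normal(scale=0.3, size=fs), fs).mean(axis=0)

def classify(x):
    """Nearest-centroid decision on the utterance-level mean feature vector."""
    v = frame_features(x, fs).mean(axis=0)
    return "tone" if np.linalg.norm(v - c_tone) < np.linalg.norm(v - c_noise) else "noise"
```

A tonal signal has high energy and low zero-crossing rate, while noise shows the opposite pattern, so even these two descriptors separate the toy classes cleanly.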
ABSTRACT: The localization of eye centers and the tracking of gaze constitute an integral component of many human–computer interaction applications. A number of constraints, including intrusiveness, mobility, robustness, and the high price of eye tracking systems, have hindered the integration of eye trackers into everyday applications. Several 'passive' systems based on a single camera have lately been proposed in the literature, though they exhibit inferior precision compared to commercial, hardware-based eye tracking devices. In this paper we introduce an automatic, non-intrusive method for precise eye center localization in low-resolution images acquired from single low-cost cameras. To this end, the proposed system uses color information to derive a novel eye map that emphasizes the iris area, and a radial symmetry transform that operates on both the original eye images and the eye map. The performance of the proposed method is extensively evaluated on four publicly available databases containing low-resolution images and videos. Experimental results demonstrate high accuracy in challenging cases and resilience to pose and illumination variations, achieving significant improvement over existing methods.
Full-text available · Article · Apr 2015 · Image and Vision Computing
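The radial symmetry transform used above for eye-center localization can be sketched in a simplified, numpy-only form: each strong-gradient pixel casts one vote a fixed `radius` against its gradient direction, so votes accumulate at the centre of a dark, radially symmetric region such as an iris. This single-radius voting scheme is a toy approximation of the full transform, not the paper's implementation.

```python
import numpy as np

def radial_symmetry_votes(img, radius):
    """Single-radius voting accumulator: each strong-gradient pixel votes
    `radius` pixels along its negative gradient (toward the centre of a
    dark blob, since gradients point from dark to bright)."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    acc = np.zeros_like(mag)
    ys, xs = np.nonzero(mag > 0.1 * mag.max())
    for y, x in zip(ys, xs):
        vy = int(round(y - radius * gy[y, x] / mag[y, x]))
        vx = int(round(x - radius * gx[y, x] / mag[y, x]))
        if 0 <= vy < acc.shape[0] and 0 <= vx < acc.shape[1]:
            acc[vy, vx] += 1
    return acc

# toy example: a dark disk (an "iris") of radius 10 centred at (32, 32)
yy, xx = np.mgrid[:64, :64]
img = np.where((yy - 32) ** 2 + (xx - 32) ** 2 <= 10 ** 2, 0.0, 1.0)
acc = radial_symmetry_votes(img, radius=10)
peak = np.unravel_index(acc.argmax(), acc.shape)
```

The accumulator peak lands near the disk centre; a full implementation would search over several radii and smooth the accumulator.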
ABSTRACT: In this report we present an overview of the approaches and techniques used in the task of automatic audio segmentation. Audio segmentation aims to find change points in the content of an audio stream. Initially, we present the basic steps of an automatic audio segmentation procedure. Afterwards, the basic categories of segmentation algorithms, more specifically the unsupervised, the data-driven, and the mixed algorithms, are presented. For each category, the segmentation analysis is followed by details about proposed architectural parameters, such as the audio descriptor set, the mathematical functions in unsupervised algorithms, and the machine learning algorithms of data-driven modules. Finally, we review architectures proposed in the automatic audio segmentation literature, with details about the experimental audio environment (the database used and the list of audio events of interest), the basic modules of the procedure (categorization of the algorithm, audio descriptor set, architectural parameters, and potential optional modules), and the maximum achieved accuracy.
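A minimal distance-based (unsupervised) change-point detector of the kind surveyed above can be sketched as follows; the window length, Euclidean metric, and fixed threshold are illustrative choices, not a specific algorithm from the report.

```python
import numpy as np

def detect_changes(feats, win=10, thr=2.0):
    """Flag frame i as a change point when the mean feature vectors of
    the `win` frames before and after i differ by more than `thr`
    (Euclidean distance between window means)."""
    changes = []
    for i in range(win, len(feats) - win):
        left = feats[i - win:i].mean(axis=0)
        right = feats[i:i + win].mean(axis=0)
        if np.linalg.norm(left - right) > thr:
            changes.append(i)
    return changes

# toy stream: 100 quiet frames followed by 100 loud frames
feats = np.vstack([np.zeros((100, 1)), 5 * np.ones((100, 1))])
changes = detect_changes(feats, win=10, thr=2.0)
```

Real systems replace the Euclidean distance with statistically grounded criteria (e.g. BIC) and post-process the flagged frames into a single boundary per change.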
ABSTRACT: Expression of emotional state is considered a core facet of an individual's emotional competence. Emotional processing in BN has not often been studied and has not been considered from a broad perspective. This study aimed at examining implicit and explicit emotional expression in BN patients, in the acute state and after recovery. Sixty-three female participants were included: 22 BN, 22 recovered BN (R-BN), and 19 healthy controls (HC). The clinical cases were drawn from consecutive admissions and diagnosed according to DSM-IV-TR diagnostic criteria. Self-reported (explicit) emotional expression was measured with the State-Trait Anger Expression Inventory-2, the State-Trait Anxiety Inventory, and the Symptom Checklist-90-Revised. Emotional facial expression (implicit) was recorded by means of an integrated camera (detecting Facial Feature Tracking) during a 20-minute therapeutic video game. In the acute illness, explicit emotional expression [anxiety (p<0.001) and anger (p<0.05)] was increased. In the recovered group it decreased to an intermediate level between the acute illness and healthy controls [anxiety (p<0.001) and anger (p<0.05)]. In the implicit measurement of emotional expression, patients with acute BN expressed more joy (p<0.001) and less anger (p<0.001) than both healthy controls and those in the recovered group. These findings suggest that there are differences in implicit and explicit emotional processing in BN, which are significantly reduced after recovery, suggesting an improvement in emotional regulation.
Full-text available · Article · Jul 2014 · PLoS ONE
ABSTRACT: Aiming at the automatic detection of non-linguistic sounds in vocalizations, we investigate the applicability of various subsets of audio features, formed on the basis of ranking the relevance and individual quality of several audio features. Specifically, based on the ranking of a large set of audio descriptors, we selected subsets and evaluated them on the non-linguistic sound recognition task. During the audio parameterization process, every input utterance is converted to a single feature vector consisting of 207 parameters. Next, a subset of this feature vector is fed to a classification model, which directly estimates the unknown sound class. The experimental evaluation showed that the feature vector composed of the 50 best-ranked parameters provides a good trade-off between computational demands and accuracy, while the best recognition accuracy is observed for the 150-best subset.
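Feature ranking followed by subset selection can be illustrated with a numpy-only Fisher-ratio ranking; the Fisher score is one common relevance criterion and is an assumption here, not necessarily the ranking used in the paper.

```python
import numpy as np

def fisher_scores(X, y):
    """Per-feature Fisher ratio: between-class variance over
    within-class variance, computed independently for each column."""
    mu = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - mu) ** 2
        den += len(Xc) * Xc.var(axis=0)
    return num / (den + 1e-12)

# toy data: 5 features, only feature 2 carries class information
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 5))
X[:, 2] += 3 * y                          # strong class-dependent shift
ranking = np.argsort(fisher_scores(X, y))[::-1]
top3 = ranking[:3]                        # keep the 3 best-ranked features
```

Slicing `X[:, ranking[:k]]` then yields the k-best subset evaluated against the classifier, mirroring the 50-best vs. 150-best comparison in the abstract.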
ABSTRACT: Monitoring of animal communities is necessary to assess the conservation status of threatened species and to implement efficient conservation measures. However, classical observer-based survey techniques are expensive and time-consuming. Automated acoustic monitoring provides a solution for monitoring sound-emitting animals, such as mammals, birds, amphibians, and insects. Several Autonomous Recording Units (ARUs) can be operated simultaneously in 24/7 mode.
ABSTRACT: We report on the development of an automated acoustic bird recognizer with improved noise robustness, which is part of a long-term project aiming at the establishment of an automated biodiversity monitoring system on Hymettus Mountain near Athens, Greece. In particular, a typical audio processing strategy, which has proved quite successful in various audio recognition applications, was amended with a simple and effective mechanism for the integration of temporal contextual information into the decision-making process. In the present implementation, we consider integration of temporal contextual information by joint post-processing of the recognition results for a number of preceding and subsequent audio frames. In order to evaluate the usefulness of the proposed scheme on the task of acoustic bird recognition, we experimented with six widely used classifiers and a set of real-field audio recordings of seven bird species commonly present on Hymettus Mountain. The highest recognition accuracy obtained on the real-field data was approximately 93%, while experiments with additive noise showed significant robustness in low signal-to-noise ratio setups. In all cases, the integration of temporal contextual information was found to improve the overall accuracy of the recognizer.
Full-text available · Article · Jul 2013 · International Journal of Intelligent Systems Technologies and Applications
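The joint post-processing of preceding and subsequent frame decisions described above can be approximated by a simple majority vote over a symmetric context window; this voting rule is an assumption here, only one of several ways to integrate temporal context.

```python
from collections import Counter

def smooth_decisions(frame_labels, context=2):
    """Replace each per-frame decision by the majority label over the
    `context` preceding and `context` following frames (truncated at
    the sequence boundaries)."""
    n = len(frame_labels)
    out = []
    for i in range(n):
        window = frame_labels[max(0, i - context):min(n, i + context + 1)]
        out.append(Counter(window).most_common(1)[0][0])
    return out
```

An isolated misclassified frame is outvoted by its neighbours, which is exactly the kind of error this post-processing removes.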
ABSTRACT: The MoveOn speech and noise database was purposely designed and implemented in support of research on spoken dialogue interaction in a motorcycle environment. The distinctiveness of the MoveOn database results from the requirements of the application domain—an information support and operational command and control system for the two-wheel police force—and also from the specifics of the adverse open-air acoustic environment. In this article, we first outline the target application, motivating the database design and purpose, and then report on the implementation details. The main challenges related to the choice of equipment, the organization of recording sessions, and some difficulties that were experienced during this effort, are discussed. We offer a detailed account of the database statistics, the suggested data splits in subsets, and discuss results from automatic speech recognition experiments which illustrate the degree of complexity of the operational environment.
Article · Jun 2013 · Language Resources and Evaluation
ABSTRACT: The performance of recent dereverberation methods for reverberant speech preprocessing prior to Automatic Speech Recognition (ASR) is compared for an extensive range of room and source-receiver configurations. It is shown that room acoustic parameters such as the clarity (C50) and the definition (D50) correlate well with the ASR results. When available, such room acoustic parameters can provide insight into reverberant speech ASR performance and potential improvement via dereverberation preprocessing. It is also shown that the application of a recent dereverberation method based on perceptual modelling can be used in the above context and achieve significant Phone Recognition (PR) improvement, especially under highly reverberant conditions.
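The clarity index C50 mentioned above has a standard definition: the ratio, in dB, of the impulse-response energy arriving within the first 50 ms to the energy arriving later. A small sketch, using a synthetic exponential-decay impulse response as a toy stand-in for a measured room response:

```python
import numpy as np

def clarity_c50(h, fs):
    """Clarity index C50 in dB: early (first 50 ms) to late energy
    ratio of a room impulse response `h` sampled at `fs` Hz."""
    n50 = int(0.05 * fs)
    early = np.sum(h[:n50] ** 2)
    late = np.sum(h[n50:] ** 2)
    return 10.0 * np.log10(early / late)

# toy impulse response at 16 kHz with ~60 dB energy decay per second
fs = 16000
t = np.arange(fs) / fs
h = np.exp(-6.9 * t)
```

D50 (definition) is the closely related early-to-total energy ratio; both reward rooms whose reflections arrive early, which is why they track ASR performance on reverberant speech.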
ABSTRACT: We report on recent progress with the development of an automated bioacoustic bird recognizer, which is part of a long-term project aiming at the establishment of an automated biodiversity monitoring system on Hymettus Mountain near Athens. In particular, employing a classical audio processing strategy, which has proved quite successful in various audio recognition applications, we evaluate the appropriateness of six classifiers for the bird species recognition task. In the experimental evaluation of the acoustic bird recognizer, we made use of real-field audio recordings of two bird species known to be present on Hymettus Mountain. Encouraging recognition accuracy was obtained on the real-field data, and further experiments with additive noise demonstrated significant noise robustness in low-SNR conditions.
ABSTRACT: We describe the novel design, implementation, and evaluation of a speech interface that is part of a platform for the development of serious games. The speech interface consists of a speech recognition component and an emotion recognition from speech component. It relies on a platform designed and implemented to support the development of serious games for cognitive-based treatment of patients with mental disorders. The implementation of the speech interface is based on the Olympus/RavenClaw framework. This framework has been extended for the needs of the specific serious games and the respective application domain by integrating new components, such as emotion recognition from speech. The evaluation of the speech interface utilized a purposely collected domain-specific dataset. The speech recognition experiments show that emotional speech moderately affects the performance of the speech interface. Furthermore, the emotion detectors demonstrated satisfactory performance for the emotional states of interest, Anger and Boredom, and contributed towards successful modelling of the patient's emotional status. The performance achieved for speech recognition and for the detection of the emotional states of interest was satisfactory. A recent evaluation of the serious games showed that the patients started to show new coping styles with negative emotions in normal stressful life situations.
Article · Sep 2012 · Expert Systems with Applications
ABSTRACT: Automatic recognition of sound events can be valuable for efficient situation analysis of audio scenes. In this article we address the problem of detecting human activities in natural environments based solely on the acoustic modality. The primary goal is the continuous acoustic surveillance of a particular natural scene for illegal human activities (trespassing, hunting, etc.) in order to promptly alert an authorized officer to take the appropriate measures. We constructed a novel system that is mainly characterized by its hierarchical structure as well as by its acoustic parameters. Each sound class is represented by a hidden Markov model created using descriptors from the time, frequency, and wavelet domains. The system has the ability to automatically adapt to the acoustic conditions of different scenes via a feedback loop that serves unsupervised model refinement. We conducted extensive experiments assessing the performance of the system with respect to its recognition and detection capabilities. To this end we employed confusion matrices and Detection Error Tradeoff curves, and we report that high performance was achieved for both detection and recognition.
Article · Sep 2012 · Journal of the Audio Engineering Society
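Per-class generative scoring of feature frames, as in the HMM-based system above, can be miniaturized with a single-state diagonal-Gaussian model per sound class: a deliberate simplification of an HMM with the state transitions removed. The class names and features below are toy assumptions, not the article's classes or descriptors.

```python
import numpy as np

class GaussianClass:
    """Diagonal-Gaussian acoustic model: a single-state stand-in for a
    per-class HMM (one such model is trained per sound class)."""
    def fit(self, frames):
        self.mu = frames.mean(axis=0)
        self.var = frames.var(axis=0) + 1e-6   # floor to avoid division by zero
        return self
    def log_likelihood(self, frames):
        z = (frames - self.mu) ** 2 / self.var
        return float(np.sum(-0.5 * (z + np.log(2 * np.pi * self.var))))

# toy training data: two well-separated classes of 6-D feature frames
rng = np.random.default_rng(3)
models = {
    "gunshot": GaussianClass().fit(rng.normal(loc=2.0, size=(200, 6))),
    "birdsong": GaussianClass().fit(rng.normal(loc=-1.0, size=(200, 6))),
}

def recognize(frames):
    """Maximum-likelihood class decision over all trained models."""
    return max(models, key=lambda c: models[c].log_likelihood(frames))
```

A real HMM adds temporal structure (state transitions over the frame sequence), which matters for sounds whose spectral content evolves over time.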
ABSTRACT: This paper gives an overview of the assessment and evaluation methods which have been used to determine the quality of the INSPIRE smart home system. The system allows different home appliances to be controlled via speech, and consists of speech and speaker recognition, speech understanding, dialogue management, and speech output components. The performance of these components is first assessed individually, and then the entire system is evaluated in an interaction experiment with test users. Initial results of the assessment and evaluation are given, in particular with respect to the transmission channel impact on speech and speaker recognition, and the assessment of speech output for different system metaphors.
ABSTRACT: We propose a two-stage phone duration modelling scheme, which can be applied for the improvement of prosody modelling in speech synthesis systems. This scheme builds on a number of independent feature constructors (FCs) employed in the first stage, and a phone duration model (PDM) which operates on an extended feature vector in the second stage. The feature vector, which acts as input to the first stage, consists of numerical and non-numerical linguistic features extracted from text. The extended feature vector is obtained by appending the phone duration predictions estimated by the FCs to the initial feature vector. Experiments on the American-English KED TIMIT and on the Modern Greek WCL-1 databases validated the advantage of the proposed two-stage scheme, improving prediction accuracy over the best individual predictor, and over a two-stage scheme which just fuses the first-stage outputs. Specifically, when compared to the best individual predictor, a relative reduction in the mean absolute error and the root mean square error of 3.9% and 3.9% on the KED TIMIT, and of 4.8% and 4.6% on the WCL-1 database, respectively, is observed.
Full-text available · Article · Aug 2012 · Computer Speech & Language
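The two-stage stacking idea above (append the first-stage predictions to the feature vector, then train a second-stage model on the extended vector) can be sketched with plain least-squares regressors standing in for the FCs and the PDM; all data and model choices below are toy assumptions, not the paper's predictors.

```python
import numpy as np

def fit_lr(A, t):
    """Least-squares linear model with a bias column; returns a predictor."""
    A1 = np.hstack([A, np.ones((len(A), 1))])
    w, *_ = np.linalg.lstsq(A1, t, rcond=None)
    return lambda B: np.hstack([B, np.ones((len(B), 1))]) @ w

# toy "linguistic" features and phone durations
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, -0.2, 0.1, 0.3]) + 0.05 * rng.normal(size=200)

# stage 1: two independent feature constructors (here: models on subsets)
fc1 = fit_lr(X[:, :2], y)
fc2 = fit_lr(X[:, 2:], y)

# stage 2: extended vector = original features + stage-1 predictions
X_ext = np.hstack([X, fc1(X[:, :2])[:, None], fc2(X[:, 2:])[:, None]])
pdm = fit_lr(X_ext, y)
mae = np.mean(np.abs(pdm(X_ext) - y))
```

Because the second stage sees both the raw features and the FC outputs, it can learn when to trust each first-stage predictor, which is what distinguishes this scheme from simply fusing the first-stage outputs.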