Recommendation BS.1116-1, Methods for the Subjective Assessment of Small Impairments in Audio Systems Including Multichannel Sound Systems
... We test this by comparing the performance of two in-room models based on 1/20-octave and 1/3-octave data. The results have broad ramifications throughout the audio industry because 1/3-octave measurements are commonly used to diagnose and equalize loudspeakers in rooms, and are endorsed in many international standards [16]. ...
... Our premise is supported by a substantial body of scientific evidence from previous loudspeaker studies [5]–[16], including the test results reported in Part One [1]. Together these studies show that the frequency response of the loudspeaker is the most important factor related to perceived sound quality. ...
... Measurements based on 1/3-octave analyzers with fixed center-frequency filters will likely produce even worse results. The ITU-R BS.1116 recommendation specifies an in-room 1/3-octave loudspeaker response of ±3 dB between 50 Hz and 2 kHz, rising to +3/−6 dB at 16 kHz [16]. Our study clearly shows this cannot adequately distinguish good loudspeakers from mediocre ones. ...
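As an illustration of the tolerance window quoted above, the short Python sketch below checks a 1/3-octave response against the ±3 dB / +3/−6 dB limits. The linear widening of the lower limit on a log-frequency axis between 2 kHz and 16 kHz is an assumption for illustration; the excerpt does not spell out how the limit transitions.

```python
import numpy as np

def bs1116_inroom_mask(freq_hz):
    """Upper/lower tolerance (dB) of the in-room 1/3-octave response cited
    above: +/-3 dB from 50 Hz to 2 kHz, widening to +3/-6 dB at 16 kHz.
    The log-frequency interpolation above 2 kHz is an assumption."""
    f = np.asarray(freq_hz, dtype=float)
    upper = np.full_like(f, 3.0)
    lower = np.where(
        f <= 2000.0,
        -3.0,
        -3.0 - 3.0 * (np.log10(f / 2000.0) / np.log10(16000.0 / 2000.0)),
    )
    return upper, lower

def within_mask(freq_hz, response_db):
    """True where a measured 1/3-octave response (dB re. mid-band mean)
    stays inside the tolerance window."""
    upper, lower = bs1116_inroom_mask(freq_hz)
    r = np.asarray(response_db, dtype=float)
    return (r <= upper) & (r >= lower)

# Example: nominal 1/3-octave centres from 50 Hz to 16 kHz, flat response
centres = np.array([50, 63, 80, 100, 125, 160, 200, 250, 315, 400, 500, 630,
                    800, 1000, 1250, 1600, 2000, 2500, 3150, 4000, 5000,
                    6300, 8000, 10000, 12500, 16000])
print(within_mask(centres, np.zeros(len(centres))).all())  # a flat response passes
```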
A controlled listening test was conducted on 31 different models of around-ear (AE) and on-ear (OE) headphones to determine listeners' sound quality preferences. One hundred thirty listeners, both trained and untrained, rated the headphones on preference using a virtual headphone method, in which a single replicator headphone was equalized to match the magnitude and minimum-phase responses of the different headphones. Listeners rated seven different headphones in each trial, which included high (the new Harman AE-OE target curve) and low anchors. On average, both trained and untrained listeners preferred the high anchor to the 31 other choices. Using machine learning, a model was developed that predicts listeners' headphone preference ratings based on the deviation in magnitude response from the Harman target curve.
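The preference model described above maps deviations from the target curve to ratings. A minimal sketch of that idea follows; the error-curve features (standard deviation and log-frequency slope) and the linear weights are illustrative assumptions, not the published model or its fitted coefficients.

```python
import numpy as np

def error_curve_features(freq_hz, response_db, target_db):
    """Summarise the deviation of a headphone response from a target curve
    (e.g. the Harman AE-OE target) by the standard deviation and the
    log-frequency slope of the error curve (illustrative choice)."""
    err = np.asarray(response_db, dtype=float) - np.asarray(target_db, dtype=float)
    logf = np.log10(np.asarray(freq_hz, dtype=float))
    slope, _ = np.polyfit(logf, err, 1)          # dB per decade
    return float(np.std(err)), float(abs(slope))

def toy_preference(std_dev, abs_slope, offset=10.0, w_std=1.0, w_slope=1.0):
    """Toy linear predictor: the rating drops as the deviation grows.
    Weights are placeholders, not fitted values."""
    return offset - w_std * std_dev - w_slope * abs_slope

freqs = np.geomspace(20, 20000, 200)
target = np.zeros_like(freqs)                      # stand-in target curve (dB)
measured = target + 2.0 * np.sin(np.log10(freqs))  # stand-in measured response (dB)
print(toy_preference(*error_curve_features(freqs, measured, target)))
```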
... The 5.1 "home ..." system: the base width B between the L and R loudspeakers is preferably between 2 and 3 metres according to Recommendation ITU-R BS.1116 (1997) and can reach 5 metres in suitable rooms. The spatial configuration of the loudspeakers of a 5.1 system is of prime importance because it directly determines the listening quality and the realism of the sound effects. The reference listening position is called the "sweet spot" and is located at the centre of the circle on which ...
... As stated previously, the versions or objects evaluated must have undergone medium or strong degradations (Soulodre and Lavoie, 1999). The degradations are pronounced and detecting the impairments is not difficult. By contrast, Recommendation ITU-R BS.1116 (1997), detailed in Section I.2.2, is dedicated to the evaluation of audio quality for small degradations, that is, for high-quality systems. It is specified that the quality of the versions evaluated with the MUSHRA method must lie in the lower half of the scale proposed by the ITU-R BS.1116 (1997) standard. ...
Today, sound capture and reproduction technologies are being developed with the aim of delivering scenes with a spatialized rendering. Before distribution, audio excerpts can be evaluated in terms of quality using methods recommended by the International Telecommunication Union (evaluation of compression codecs, of capture or reproduction processes, etc.). However, these evaluation standards show certain weaknesses, notably regarding the quality attributes to be evaluated: the spatial dimension is not specifically taken into account. In this work, a methodology dedicated to the quality evaluation of spatial audio is established, in particular to address the identified biases. Using free categorization and multidimensional analysis, twenty-eight attributes were grouped into three families of attributes: Timbre, Space, and Defects. These three general attributes were included in a listening test, which proceeds in two phases: evaluation of overall quality followed by evaluation of the three attributes simultaneously on the same interface. The tests are carried out without an explicit reference; the original file serves as a hidden reference. In addition, three anchor signals, each specific to one of the three attributes, were defined and then superimposed to create a single, triply degraded anchor. The method was tested both on a headphone reproduction system with binaural content and on a 5.1 multichannel system. The evaluation of stimuli of intermediate quality is recommended, as well as content with a pronounced spatial effect. The multi-criteria evaluation proved useful under certain conditions and thus makes it possible to identify which characteristics are degraded. The Defects and Timbre attributes showed a strong influence on overall quality, whereas the weight of the Space attribute is more debatable.
... [6] states that, since the majority of tests are for consumer products, "most listening tests should be done in rooms whose essential acoustical parameters are similar to those of typical domestic room". Given the variety of listening environments that could be considered, several standards were instead developed which aim to define the ideal listening room [7,8] for subjective listening tests. ...
... ITU-R BS.1534 [10] introduced the Multi-Stimulus test with Hidden Reference and Anchor (MUSHRA). It was developed specifically for evaluating small differences in audio codec performances as an alternative to ITU-R BS.1116 [8] which was unsuitable for discriminating between small differences [11]. One important aspect of this interface was the requirement to add an anchor to the pool of evaluated content. ...
Subjective experiments are a cornerstone of modern research, with a variety of tasks being undertaken by subjects. In the field of audio, subjective listening tests provide validation for research and aid fair comparison between techniques or devices, such as coding performance, speakers, mixes, and source separation systems. Several interfaces have been designed to mitigate biases and to standardise procedures, enabling indirect comparisons. The number of different combinations of interface and test design makes it extremely difficult to conduct a truly unbiased listening test. This paper resolves the largest of these variables by identifying the impact the interface itself has on a purely auditory test. This information is used to make recommendations for specific categories of listening tests.
... One goal of this type of training is the consistent rating of predefined quality attributes [38]. Separate methods are recommended for the assessment of small [39] and medium [40] quality differences. For tests assessing small differences, Recommendation BS.1116-3 of the International Telecommunication Union (ITU), for example, recommends letting the subjects audition all possible stimuli in advance. ...
... For the rating phase, this standard recommends making it possible to switch between the stimuli under comparison as quickly as possible. The intention is that the rating of the quality differences should be based primarily on short-term memory: "Since long- and medium-term aural memory is unreliable, the test procedure should rely exclusively on short-term memory" [39]. ...
The goal of technical evolutions in the context of entertainment electronics is to improve the user experience by providing visuals and acoustics in the best possible way. With modern virtual and augmented reality devices and applications the goal of a reproduction indistinguishable from reality became more tangible. When the listener is no longer able to distinguish artificial sound sources from real ones the term auditory illusion is used. In order to achieve such illusions different technical challenges need to be mastered. But the assumption that the exact replica of the ear signals leads to the same perceptions as in the corresponding real-life situation is not correct. Fundamental mechanisms of human perception such as the integration of cues from different modalities and the dependency on expectations and experience add another layer of complexity. These expectations can change depending on prior sound exposure. In the context of spatial hearing this means that listeners are probably able to learn how to interpret spatial cues. Such mechanisms and their effect on the perceived quality of spatial sound reproduction systems are the scope of this work. Perceptual studies investigate the learning of spatial localization cues and adaptation mechanisms related to room acoustic perception. Quality deficits due to mismatched ear signals are measured and it is shown how quality ratings can change depending on training. The results suggest that learning and adaptation processes are a key factor for the establishment of an auditory illusion. The practical relevance of such effects and their underlying principles are discussed.
... The PEAQ compares the reference signal and the signal under test and gives a score of 1 to 5, corresponding to poor to excellent. The signal differences are analyzed in frequency and time domains by a cognitive model that was validated by the subjective listening test conducted in ITU-R Recommendation BS.1116 [23]. Because PEAQ is based on generally accepted psychoacoustic principles [24], in this study, objective measurements were made using PEAQ to indicate whether the proposed system provides tolerable audio quality even if the listener is out of the listening area. ...
During the COVID-19 pandemic, smart home requirements have shifted toward entertainment at home. The purpose of this research project was therefore to develop a robotic audio system for home automation. High-end audio systems normally refer to multichannel home theaters. Although multichannel audio systems enable people to enjoy surround sound as they do at the cinema, stereo audio systems have been popularly used since the 1980s. The major shortcoming of a stereo audio system is its narrow listening area. If listeners are out of the area, the system has difficulty providing a stable sound field. This is because of the head-shadow effect blocking the high-frequency sound. The proposed system, by integrating computer vision and robotics, can track the head movement of a user and adjust the directions of loudspeakers, thereby helping the sound wave travel through the air. Unlike previous studies, in which only a diminutive scenario was built, in this work, the idea was applied to a commercial 2.1 audio system, and listening tests were conducted. The theory and the simulation coincide with the experimental results. The approximate rate of audio quality improvement is 31%. The experimental results are encouraging, especially for high-pitched music.
... One of the cases where the proper reproduction of recorded sound is highly important is the conduct of listening experiments. On this topic, the ITU-R BS.1116-1 recommendation specifies several criteria that a sound laboratory should meet in order to qualify as suitable for conducting listening experiments [6]. The specifications included in this document should first of all serve as a guideline to be followed during the construction of the room and the selection of the loudspeakers and reproduction system. ...
... Tolerance limits for the operational room response at the listening position, as defined by the ITU-R BS.1116-1 recommendation [6]. ...
The problem of loudspeaker/room equalisation is a common topic in the world of audio engineering. There are many different ways to approach it, ranging from manual tuning of frequencies to, more recently, automated solutions. In this paper, a combined equalisation approach using a limited number of Infinite Impulse Response (IIR) filters is presented. The parameters of the needed filters were calculated through an algorithm that received room measurements as input, while a Graphical User Interface (GUI) was developed in order to visualise the resulting equalisation in the room. The motivation for this equalisation was the correction of problematic resonances in the room response of the main sound lab at Fraunhofer IIS according to the ITU-R BS.1116-1 specifications. The developed approach was compared with a commercial room equalisation solution of Finite Impulse Response (FIR) filters, by means of numerical evaluation and an ABX listening experiment. The results indicate that the selection of room positions and the averaging of their measurements significantly influence the effectiveness of the resulting equalisation.
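For context on the kind of IIR section such an equaliser cascades, here is a minimal Python sketch of a peaking-EQ biquad in the RBJ audio-EQ-cookbook form. The paper's algorithm for deriving centre frequency, gain, and Q from room measurements is not reproduced, and the 55 Hz room-mode example values are hypothetical.

```python
import numpy as np
from scipy.signal import lfilter

def peaking_biquad(f0, gain_db, q, fs):
    """Peaking-EQ biquad coefficients (RBJ cookbook form), normalised so a0 = 1."""
    a = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2.0 * q)
    num = np.array([1 + alpha * a, -2 * np.cos(w0), 1 - alpha * a])
    den = np.array([1 + alpha / a, -2 * np.cos(w0), 1 - alpha / a])
    return num / den[0], den / den[0]

# Attenuate a hypothetical 55 Hz room mode by 6 dB with Q = 4 at fs = 48 kHz
b, a = peaking_biquad(55.0, -6.0, 4.0, 48000)
x = np.random.randn(48000)      # stand-in signal
y = lfilter(b, a, x)            # equalised signal
```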
... Another method for categorizing distortion is presented in [10], where the authors propose the diagnostic acceptability measure (DAM). The ITU-R BS.1116 [11] defines how to test the quality of an audio sequence. ...
... Figure 11: H.264 encoder [70]. ...
This thesis concerns audio and video transmission over wireless networks adopting the IEEE 802.11x family of standards. In particular, the original contribution of
this study involves the discussion of four issues: the adaptive retransmission of video
packets with distortion and delay requirements, the reliability of different video quality assessments for retry limit adaptation purposes, the fast distortion estimation, and
the joint adaptation of the retry limits of both voice and video packets. The material
presented in this dissertation is the result of four years of study performed within the
Telecommunication Group of the Department of Engineering and Architecture at the
University of Trieste during the course of the Doctorate in Information Engineering.
Multimedia applications have increased tremendously in recent years. The
demand for video data represents a large portion of the total data traffic on the Internet, and
hence improving the Quality of Service (QoS) at the end-user is a very important topic
in today's telecommunication research. To introduce QoS control at the MAC layer
of an 802.11 network, task group e (TGe) developed the 802.11e amendment, which
extends the functionalities of the legacy Distributed Coordination Function (DCF) by
adopting the Enhanced Distributed Channel Access (EDCA). The EDCA enables a prioritization of the traffic during the contention period by defining four Access Categories
(ACs): Voice (VO), Video (VI), Best Effort (BE), and Background (BK), whose differentiation is based on parameters such as transmission opportunity, arbitration interframe space (AIFS), and minimum and maximum contention windows. Another parameter associated with each AC is the maximum retry limit associated with each packet, which
represents the maximum number of times that a packet can be retransmitted. Even if
the EDCA settings are specified to provide a higher priority to the voice and video
ACs, collisions involving audio/video packets may still occur, thus making proper traffic prioritization policies necessary. Accordingly, several studies have focused on this issue, aiming to prioritize audio and video signals according to an importance metric. More precisely, the scientific literature offers solutions for audio transmission based on pre-computed descriptors of the audio signals, multi-resolution techniques for 3D audio rendering, and prioritization mechanisms relying on the perceptual quality of audio signals exposed to packet loss. Furthermore, for the transmission of video flows, solutions based on lightweight prioritization schemes or on more sophisticated pixel-based techniques have been presented. An interesting possibility is the manipulation of the retry limit associated with each packet. Several studies have investigated this issue. The main objective of these methods is to adapt the number of retransmissions to the acceptable delay and the perceived distortion. More precisely, for packets containing information whose loss would imply high distortion in the video sequence, it is necessary to set a higher retry limit value; for packets with information less important to the decoding process, the corresponding retry limit
can be lower. This leads to the derivation of elaborate optimization strategies, able to
provide significant performance improvements with respect to those achievable using
the 802.11e default settings.
In the context of wireless networks adopting the 802.11e standard, the core of this thesis is the development of fast algorithms that calculate the best retry limit for each packet in the audio and video queues, choosing it in accordance with the distortion associated with that packet so as to provide better multimedia quality at the end user. The novel aspects of this study lie in the theoretical and numerical modeling, which accounts for the presence of the other ACs in the evaluation of the best retry limit while always keeping the computational cost of evaluating the network behavior very low. Furthermore, the thesis presents a study on the quality assessment best suited to retransmission purposes, seeking not only the most suitable metric to adopt but also a low-cost approximation of the values it provides, thus making the adoption of this index suitable for scenarios characterized by low delay.
More precisely, the first proposed algorithm relies on a distortion estimation method capable of reliably evaluating the Mean Square Error (MSE)-based distortion of a video sequence. The proposed algorithm, which requires low computational cost, evaluates the retry limits by accounting for the presence of the other access categories and using the available distortion values. In order to find the best video quality assessment to employ in these retry limit adaptation scenarios, a second study on the most suitable video quality assessment has been carried out. The study shows that the Structural SIMilarity (SSIM)-based distortion can outperform the MSE-based one when used for adaptive retransmission purposes. On the other hand, although the SSIM proves more suitable than the MSE, it is computationally expensive to evaluate. To overcome this drawback, an algorithm capable of evaluating the SSIM-based distortion in low CPU time has been developed, making this quality assessment adoptable also in transmissions characterized by low acceptable delays. Finally, the last part of the thesis focuses on extending the first proposed algorithm to a scenario involving the transmission not only of video content but also of audio content, both of which are usually present in multimedia flows.
This thesis is organized in two parts. The first part provides the background material, while the second part is dedicated to the original results. With reference to the
first part, the fundamentals of multimedia transmission over Wi-Fi Networks are briefly
summarized in the first chapter. An overview of the most common audio and video
coding standards is presented in the second chapter, focusing mainly on the two standards adopted in the second part of this dissertation: the G.729 speech coding standard and the H.264 video coding standard. The third chapter introduces the most significant
aspects of distributed wireless networks, considering both the Physical (PHY) and the
Medium Access Control (MAC) layers.
The second part describes the original results obtained in the field of 802.11e retry
limit adaptation and low-cost SSIM-based estimation. In particular, the fourth chapter
presents an algorithm for the fast evaluation of the best retry limit associated with each
video packet in an 802.11e contention-based scenario. This algorithm, in accordance
with the estimated distortion and the maximum cumulative delay for each packet, selects the best retry limit with a low computational burden. Given that the MSE often fails to reflect the visually perceived quality of the scene, chapter five focuses on comparing, in a retry limit adaptation scenario, the adoption of the MSE with the adoption of another video quality assessment, the SSIM. The aim is to explore the possibility of adopting a video quality assessment capable of better measuring the perceived video distortion. Chapter six focuses on overcoming the drawback introduced by the SSIM, namely the high computational burden required for evaluating the SSIM-based distortion. In this chapter, a fast distortion estimation based on the structural similarity for videos encoded with the H.264 standard is presented. Finally, the last chapter of this dissertation extends the model presented in chapter four, aiming to jointly evaluate the best retry limits for audio and video flows, both of which are present in multimedia transmissions.
The intent of the work presented hereafter is to develop and test computationally cheap
solutions for improving the quality of audio/video delivery in 802.11-based wireless
networks, focusing on the careful selection of the retransmission strategy and the reliable estimation of the content distortion.
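As a toy illustration of the per-packet retry-limit adaptation idea summarised above (not the thesis's optimisation algorithm; the linear distortion-to-retries mapping and the delay cap are assumptions), a minimal sketch in Python:

```python
def assign_retry_limit(distortion, delay_budget_ms, per_retry_delay_ms,
                       min_retries=1, max_retries=7):
    """Toy per-packet retry-limit policy in the spirit described above:
    packets whose loss would cause high distortion get more retransmission
    attempts, capped by the extra delay those attempts may add. The linear
    mapping and the caps are illustrative assumptions."""
    # largest retry count whose worst-case added delay still fits the budget
    delay_cap = int(delay_budget_ms // per_retry_delay_ms)
    # scale retries with the normalised distortion in [0, 1]
    d = min(max(distortion, 0.0), 1.0)
    wanted = min_retries + round((max_retries - min_retries) * d)
    return max(min_retries, min(wanted, delay_cap, max_retries))

# a high-distortion packet with a generous delay budget gets the full 7 retries
print(assign_retry_limit(distortion=1.0, delay_budget_ms=40, per_retry_delay_ms=5))
```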
... The methodology for this type of subjective test for telephony is given in ITU-T P.800 [1]. Tests suitable for more general audio and broadcast tests are described in ITU-R BS.1116 [2]. A more detailed summary of typical experimental methods can be found in [3]. ...
... Speech data was recorded in a listening room fulfilling the requirements of ITU-R BS.1116-1 [32]. A total of 25 male and 25 female speakers participated in the data collection. ...
... participate in three consecutive subjective experiments carried out at FORCE Technology SenseLab including I: listening/audio, II: viewing/video, and III: AV test. The experiments were performed in a standardized listening room that meets the acoustical requirements of EBU 3276 [23] and ITU-R BS.1116-3 [24] and is compliant for listening and VR experiments with a head-mounted display. The experimental setup for the AV experiment is depicted in Fig. 1. ...
To open up new possibilities to assess the multimodal perceptual quality of omnidirectional media formats, we proposed a novel open source 360 audiovisual (AV) quality dataset. The dataset consists of high-quality 360 video clips in equirectangular (ERP) format and higher-order ambisonic (4th order) along with the subjective scores. Three subjective quality experiments were conducted for audio, video, and AV with the procedures detailed in this paper. Using the data from subjective tests, we demonstrated that this dataset can be used to quantify perceived audio, video, and audiovisual quality. The diversity and discriminability of subjective scores were also analyzed. Finally, we investigated how our dataset correlates with various objective quality metrics of audio and video. Evidence from the results of this study implies that the proposed dataset can benefit future studies on multimodal quality evaluation of 360 content.
... Both Experiment 1 and Experiment 2 were conducted in a Recommendation ITU-R BS.1116-3 [13]-compliant listening room at the Applied Psychoacoustics Lab of the University of Huddersfield (6.2 m × 5.2 m × 3.5 m, RT = 0.25 s, NR = 12). A total of 22 loudspeakers were used in this experiment (7 Genelec 8331As and 15 Genelec 8040As). ...
The present study subjectively evaluated loudspeaker reproductions of four different classical recordings in 0+2+0 (stereo), 0+5+0 (surround), 4+5+0 (surround with four height channels), each of which was downmixed from the original 9+10+3 (i.e. NHK 22.2), in terms of four attributes: listener envelopment (LEV), presence (i.e. sense of being there), overall tonal quality (OTQ) and overall listening experience (OLE). Prior to the main experiment, the playback levels of the upper and bottom loudspeaker layers relative to the middle layer level were subjectively adjusted for each of the original 9+10+3 recordings. It was found that the preferred levels of the upper and bottom layers were around 4 dB and 6 dB lower than that of the middle layer, on average. From multiple comparison listening tests, the perceived degradation from the original 9+10+3 to 4+5+0 was found to be significantly dependent on the recording technique used as well as the programme material. It was also found that 0+5+0 was not significantly different from 4+5+0 in general. Overall, LEV was most correlated with OLE, whilst Presence and OTQ tended to have a strong association.
... The results are reported in Table 3. Perceptive test. Finally, we evaluate the global quality using 40 native English-speaking evaluators and a MUSHRA test [28]. We selected 20 sentences from our corpus and for each sentence, we presented the listeners with a reference audio clip (generated with the full sentence context) and then asked them to assign a similarity score to five test clips: the hidden reference (identical to the reference and used as the MUSHRA high anchor), k = 0 (used as the low anchor), Ground-Truth k = 1, GPT-2 prediction k = 1, and random prediction k = 1. ...
... We evaluate the perceptual impact of the lookahead k using a MUSHRA listening test [24]. To that purpose, we selected 20 sentences and generated each at multiple values of k, namely k = 1, 2, 4, 6. k = 1 corresponds to a lookahead of one space (or one punctuation mark). ...
... In addition to this, van der Veen and van Maanen [6] provide analytic expressions for the non-linearity of capacitors for a variety of circuit topologies. Dodds et al. [7] show that a trained panel of listeners with "critical listening experience" is capable of discernment through a detailed listening test methodology (ITU-R BS.1116-1 [8]). ...
Different electrically equivalent capacitors are known to affect the sonic signature of an audio circuit. In this study, the non-linear behaviour of five different coupling capacitors of equivalent capacitance (marketed as "audio capacitors") is characterised, one at a time. A dataset containing the input and output signals of a non-linear amplifier is logged, its audio features are extracted, and the non-linear behaviour is analysed. Machine learning is then applied to the dataset to supplement the analysis of the Total Harmonic Distortion (THD). The five capacitors' THD performance seems to fall into two categories: below 200 Hz, there is a significant standard deviation of 14.1 dBc; above 200 Hz, the capacitors show somewhat similar behaviour, with only 0.01 dBc standard deviation. This separation, however, does not hold in regions below 0.2 V. A support vector machine model is trained and classifies the five capacitors well above chance: the best classification at 84% and the worst at 36%. The methodology introduced here may also be used to meaningfully assess the complicated behaviour of other audio electronic components.
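A THD figure in dBc like the ones quoted above can be estimated from a recorded sine response with a basic FFT sketch. Bin-picking at exact harmonic multiples and the synthetic test tone are simplifications; the study's full feature extraction and SVM classifier are not shown.

```python
import numpy as np

def thd_dbc(signal, fs, f0, n_harmonics=5):
    """Total harmonic distortion of a sine response in dBc (harmonic power
    relative to the fundamental). Picking the nearest FFT bin to each exact
    harmonic frequency is a simplification."""
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    freqs = np.fft.rfftfreq(len(signal), 1.0 / fs)
    def peak(f):
        return spectrum[np.argmin(np.abs(freqs - f))]
    fundamental = peak(f0)
    harmonics = np.array([peak(k * f0) for k in range(2, n_harmonics + 2)])
    return 20.0 * np.log10(np.sqrt(np.sum(harmonics ** 2)) / fundamental)

fs = 48000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t) + 0.001 * np.sin(2 * np.pi * 3000 * t)
print(thd_dbc(x, fs, 1000.0))   # roughly -60 dBc for this synthetic tone
```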
... Perceptive test. Finally, we evaluate the global quality using 40 native English-speaking human evaluators and a MUSHRA test [26]. We selected 20 sentences from our corpus and for each sentence, we presented the listeners with a reference audio clip (generated with the full sentence context) and then asked them to assign a similarity score to five test clips: the hidden reference (identical to the reference and used as the MUSHRA high anchor) ... We did not evaluate error in the internal FastSpeech 2 pitch predictions because we observed a few extreme prediction values which did not materialize in the resultant audio. ...
The prosody of a spoken word is determined by its surrounding context. In incremental text-to-speech synthesis, where the synthesizer produces an output before it has access to the complete input, the full context is often unknown which can result in a loss of naturalness in the synthesized speech. In this paper, we investigate whether the use of predicted future text can attenuate this loss. We compare several test conditions of next future word: (a) unknown (zero-word), (b) language model predicted, (c) randomly predicted and (d) ground-truth. We measure the prosodic features (pitch, energy and duration) and find that predicted text provides significant improvements over a zero-word lookahead, but only slight gains over random-word lookahead. We confirm these results with a perceptive test.
... Sufficient shielding from external noise, in combination with the low inherent noise levels and high output signal-to-noise ratio, renders the system a suitable environment for sensitive measurement applications and listening experiments. Since the noise floor lies below the NR15 curve, the experimental setup additionally fulfils the requirements for perceptual assessment of audio systems as per ITU-R BS.1116-3 [81] regarding maximum permissible BNL. ...
Introduction: Surrounding spherical loudspeaker arrays facilitate the application of various spatial audio reproduction methods and can be used for a broad range of acoustic measurements and perceptual evaluations.
Methods: The design and implementation of such an array of 68 coaxial loudspeakers, installed in an anechoic chamber and sampling a spherical cap with a radius of 1.35 m on an equal-area grid, is presented. A network-based audio backbone enables low-latency signal transmission with low-noise amplifiers providing a high signal-to-noise ratio. To address batch-to-batch variations, the loudspeaker transfer functions were equalised by individually designed 512-tap finite impulse response filters. Time delays and corresponding level adjustments further helped to minimise radial mounting imperfections.
Results: The equalised loudspeaker transfer functions measured under ideal conditions and when mounted, their directivity patterns, and in-situ background noise levels satisfy key criteria towards applicability. Advantages and shortcomings of the selected decoders for panning-based techniques, as well as the influence of loudspeaker positioning errors, are analysed in terms of simulated performance metrics. An evaluation of the achievable channel separation allows deriving recommendations of feasible subset layouts for loudspeaker-based binaural reproduction.
Conclusion: The combination of electroacoustic properties, simulated sound field synthesis performance and measured channel separation classifies the system as suitable for its target applications.
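The per-loudspeaker time-delay and level adjustments mentioned in the Methods above can be illustrated with a minimal sketch. The 1/r gain law, the 1.35 m reference radius, and the sample values are assumptions for illustration, not the system's actual calibration.

```python
import numpy as np

def radial_alignment(radii_m, fs=48000, c=343.0, r_ref=1.35):
    """Per-loudspeaker delay (in samples) and gain compensating small radial
    mounting errors: delays align arrival times to the farthest driver, and
    a 1/r law (an assumption) evens out the level differences."""
    r = np.asarray(radii_m, dtype=float)
    delays = np.round((r.max() - r) / c * fs).astype(int)
    gains = r / r_ref
    return delays, gains

# Example: three loudspeakers mounted a few millimetres off the nominal radius
print(radial_alignment([1.348, 1.352, 1.350]))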
... Fraunhofer IIS in Erlangen, Germany, is one of the most active and innovative audio research organizations, contributing to many of the commercially successful open standards-based audio compression schemes such as MPEG Layer-3 and MPEG AAC. All those efforts and contributions are based on scientific findings which always have to be implemented and tested in a standardized environment – in the case of audio this has to be a listening room built according to the well-known ITU-R BS.1116-1 recommendation [1]. In its former building the audio and multimedia departments of Fraunhofer IIS had such a room [2], where thousands of listening tests of all types were conducted over the last 15 years. ...
The new audio laboratory rooms of the Fraunhofer IIS and their technical design are presented here. The vision behind them is driven by the very high demands of a leading edge audio research organization with more than 100 scientists and engineers. The 300 m2 sound studio complex was designed with the intention of providing capabilities that are in combination far more extensive than those available in common audio research or production facilities. The reproduction room for listening tests follows the strict recommendations of ITU-R BS 1116. The results of the qualification measurements regarding direct sound, reflected sound, and steady state sound field will be shown and the construction efforts needed to achieve these values are explained. The connection from all the computers in the server room to more than 70 loudspeakers in the reproduction rooms, other audio interfaces, and the projection screens is done by an audio and video routing system. The architecture of the advanced control software of this routing system is presented. It allows easy and flexible access for each class of user to all the possibilities made available by this completely new system.
In incremental text to speech synthesis (iTTS), the synthesizer produces an audio output before it has access to the entire input sentence. In this paper, we study the behavior of a neural sequence-to-sequence TTS system when used in an incremental mode, i.e. when generating speech output for token n, the system has access to n + k tokens from the text sequence. We first analyze the impact of this incremental policy on the evolution of the encoder representations of token n for different values of k (the lookahead parameter). The results show that, on average, tokens travel 88% of the way to their full context representation with a one-word lookahead and 94% after 2 words. We then investigate which text features are the most influential on the evolution towards the final representation using a random forest analysis. The results show that the most salient factors are related to token length. We finally evaluate the effects of lookahead k at the decoder level, using a MUSHRA listening test. This test shows results that contrast with the above high figures: the speech synthesis quality obtained with a 2-word lookahead is significantly lower than that obtained with the full sentence.
... The acoustic perception tests were carried out in the listening room of the Acoustics Laboratory of the Applied Physics II Department of the School of Architecture of the University of Seville (Fig. 1b). The room, rectangular in shape (with dimensions 5.1 × 7.5 × 3.0 m) in accordance with the ITU [13] requirements, is semi-anechoic and has low background noise (LAeq = 24.6 dB), with an average mid-frequency reverberation time of 0.2 s. The walls of the room have been treated with a pyramidal absorbent foam material and with bass traps in the corners. ...
The aim of this work is to present the methodology implemented for the assessment of the human perception of sound and of the degree of acoustic comfort of occupants in an ancient Roman theatre. The evaluation is carried out through a visual and acoustic experience in a virtual environment. The textured 3D visual model of the space, and the binaural auralisations based on either on-site empirical measurements or on acoustic simulations, are displayed in a listening room designed with a very short reverberation time and low background noise. By means of sophisticated equipment for 3D virtual environment reproduction to groups of people, this listening room enables the physical ambience of the Roman theatre of Cartagena, which is located in the southeast of Hispania (Spain), to be recreated. Groups of people can therefore subjectively assess the intelligibility of speech and the clarity for music of this open-air performance venue. The results accentuate the strong correlation between audio and visual perceptual aspects and contribute towards a more comprehensive understanding of the architectural aural experience.
... Subjective experiments were performed in the FORCE Technology SenseLab's listening room, which fulfills the requirements of EBU 3276 and ITU-R BS.1116-3 [17]. The test sequences were evaluated separately in terms of audio quality, video quality, and audiovisual quality, in this order. ...
This paper studies the quality of multimedia content focusing on 360 video and ambisonic spatial audio reproduced using a head-mounted display and a multichannel loudspeaker setup. Encoding parameters following basic video quality test conditions for 360 videos were selected and a low-bitrate codec was used for the audio encoder. Three subjective experiments were performed for audio, video, and audiovisual quality, respectively. Peak signal-to-noise ratio (PSNR) and its variants for 360 videos were computed to obtain objective quality metrics and subsequently correlated with the subjective video scores. This study shows that a Cross-Format SPSNR-NN has a slightly higher linear and monotonic correlation over all video sequences. Regarding the audiovisual model, a power model shows the highest correlation between test data and predicted scores. We concluded that to enable the development of a superior predictive model, a high-quality, critical, synchronized audiovisual database is required. Furthermore, comprehensive assessor training may be beneficial prior to the testing to improve the assessors' discrimination ability, particularly with respect to multichannel audio reproduction. In order to further improve the performance of audiovisual quality models for immersive content, in addition to developing broader and critical audiovisual databases, the subjective testing methodology needs to be evolved to provide greater resolution and robustness.
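The linear and monotonic correlations referred to above are conventionally reported as the Pearson (PLCC) and Spearman (SROCC) coefficients. A minimal sketch with placeholder numbers (not values from the paper):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-sequence scores: an objective metric (e.g. a 360-video PSNR
# variant) against mean subjective ratings; the values are placeholders.
psnr_variant = np.array([28.1, 31.4, 33.0, 35.7, 38.2, 40.5])
mos = np.array([1.8, 2.6, 3.1, 3.9, 4.3, 4.6])

plcc, _ = pearsonr(psnr_variant, mos)    # linear correlation
srocc, _ = spearmanr(psnr_variant, mos)  # monotonic (rank) correlation
print(f"PLCC={plcc:.3f}, SROCC={srocc:.3f}")
```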
This paper studies the quality of multimedia content focusing on 360 video and ambisonic spatial audio format reproduced using a head-mounted display and a multichannel loudspeaker setup. Encoding parameters following basic video quality test conditions for 360 videos were selected and a low-bitrate codec was used for the audio encoder. Subjective evaluations were performed independently for the audio, video, and audiovisual quality by using optimal design. Peak signal-to-noise ratio (PSNR) and its variant for 360 videos were computed to obtain objective quality metrics and subsequently correlated with the subjective scores. This study shows that a multiplicative model of audiovisual quality has a moderately high correlation between subjective and predicted scores. However, we conclude that for audiovisual quality research, a high quality, synchronized, critical, audio-video dataset would be required in order to develop a superior predictive model of immersive audiovisual systems. Furthermore, comprehensive assessor training would be beneficial prior to the testing to improve the assessors' discrimination ability particularly with respect to multichannel audio reproduction.
In order to further improve the performance of audiovisual quality models for immersive content, in addition to developing broader and critical audiovisual databases, the subjective testing methodology needs to be evolved to provide greater resolution and robustness.
... The subjects were placed in an audio listening room. The room complied with ITU-R recommendations [31] in terms of acoustic properties, reverberation time, and background noise. All subjects conducted the listening test with the same hardware, which included an external sound card (Focusrite Scarlett 6i6), high-quality headphones (Sennheiser HD 650), and a test interface running on a MacBook Pro computer. ...
We present a spatial audio coding method which can extend existing speech/audio codecs, such as EVS or Opus, to represent first-order ambisonic (FOA) signals at low bit rates. The proposed method is based on principal component analysis (PCA) to decorrelate ambisonic components prior to multi-mono coding. The PCA rotation matrices are quantized in the generalized Euler angle domain; they are interpolated in quaternion domain to avoid discontinuities between successive signal blocks. We also describe an adaptive bit allocation algorithm for an optimized multi-mono coding of principal components. A subjective evaluation using the MUSHRA methodology is presented to compare the performance of the proposed method with naive multi-mono coding using a fixed bit allocation. Results show significant quality improvements at bit rates in the range of 52.8 kbit/s (4 × 13.2) to 97.6 kbit/s (4 × 24.4) using the EVS codec.
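A bare-bones sketch of the per-block PCA decorrelation described above, applied to one frame of first-order ambisonic signals; the paper's Euler-angle quantization, quaternion interpolation between blocks, and adaptive bit allocation are omitted.

```python
import numpy as np

def pca_rotate_block(foa_block):
    """Decorrelate one block of FOA signals (shape 4 x N, channels W, X, Y, Z)
    by a PCA rotation before multi-mono coding. Minimal sketch only."""
    c = np.cov(foa_block)                    # 4 x 4 covariance of the block
    eigvals, eigvecs = np.linalg.eigh(c)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # strongest principal component first
    rotation = eigvecs[:, order].T           # orthogonal 4 x 4 rotation matrix
    return rotation @ foa_block, rotation

block = np.random.randn(4, 960)              # e.g. one 20 ms frame at 48 kHz
principal, R = pca_rotate_block(block)
# decoder side (R is orthogonal): foa_hat = R.T @ decoded_principal
```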
... As listening room and equipment may influence the outcome of the experiment, listening quality tests are commonly performed with each participant using the same listening conditions, usually a quiet chamber of controlled dimensions [16]. This, however, does not represent realistic scenarios encountered in real life. ...
Estimating the perceived quality of an audio signal is critical for many multimedia and audio processing systems. Providers strive to offer optimal and reliable services in order to increase the user quality of experience (QoE). In this work, we present an investigation of the applicability of neural networks for non-intrusive audio quality assessment. We propose three neural network-based approaches for mean opinion score (MOS) estimation. We compare our results to three instrumental measures: the perceptual evaluation of speech quality (PESQ), the ITU-T Recommendation P.563, and the speech-to-reverberation energy ratio. Our evaluation uses a speech dataset contaminated with convolutive and additive noise, labeled through a crowd-based QoE evaluation; performance is assessed by the Pearson correlation with the MOS labels and the mean squared error of the estimated MOS. Our proposed approaches outperform the aforementioned instrumental measures, with a fully connected deep neural network using Mel-frequency features providing the best correlation (0.87) and the lowest mean squared error (0.15).
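As a rough illustration of non-intrusive MOS estimation from Mel-frequency features, the sketch below pairs per-utterance MFCC statistics with a small fully connected regressor. The feature summary, the network size, and the random stand-in data are assumptions, not the paper's model or dataset.

```python
import numpy as np
import librosa
from sklearn.neural_network import MLPRegressor

def mel_features(wav_path, n_mfcc=20):
    """Per-utterance feature vector: mean and standard deviation of MFCCs.
    These summary statistics are an illustrative choice, not the paper's front end."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Train a small fully connected regressor mapping features to MOS.
# Layer sizes and the random stand-in data are placeholders.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))            # stand-in feature matrix (2 * n_mfcc dims)
y_mos = rng.uniform(1.0, 5.0, size=200)   # stand-in crowd-sourced MOS labels
model = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
model.fit(X, y_mos)
print(model.predict(X[:3]))               # MOS estimates, no reference signal needed
```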
... The "precedence effect" is a well-researched psychophysical auditory phenomenon utilizing a two-loudspeaker, inter-channel lead-lag signal arrangement [1][2][3][4]. However, the vast majority of precedence-oriented investigations have been delivered utilizing ear-level, horizontal loudspeaker configurations, such as ITU BS.1116-1 [5]. Considering the rise in multi-channel "immersive" audio systems employing height, side, and floor channels [6][7][8][9], it's worth considering the precedence effect in alternative loudspeaker configurations. ...
The effects of inter-channel time difference (ICTD) on a sound source’s perceived location are well understood for horizontal loudspeaker configurations. This experiment tested the effect of novel loudspeaker configurations on a listener’s ability to localize the leading signal in ICTD scenarios. The experiment was designed to be a comparison to standard horizontal precedence-effect experiments but with non-traditional loudspeaker arrangements. Examples of such arrangements include vertical, elevated, and lowered configurations. Data will be analyzed using sign and ANOVA tests with listeners’ responses being visualized graphically. Outcomes are expected to follow a predicted precedence-based suppression model assuming localization will be concentrated at the leading loudspeaker.
... The IAL qualifies at most points as a reference listening room according to ITU-R BS.1116-1 [1]. However, the average reverberation time is significantly shorter than the ITU requirement. ...
The newly established Immersive Audio Lab at Hamburg University of Applied Sciences' Media Technology Department comprises a 33.2 High Density Loudspeaker Array (HDLA), suitable for a diverse set of spatial audio formats including HOA, set up in an octagon-shaped two story high studio.
The Immersive Audio Lab has been designed for audio production and scientific as well as artistic research, with topics as diverse as sound design and music production in various spatial audio formats including HOA, research in perception, aesthetics and technology of spatial audio, and research in perception of complex auditory scenes, virtual soundscapes and virtual acoustics.
... IIS (Silzle et al., 2009). This room has a mean reverberation time of Tm = 0.33 s and was designed to fulfill the strict recommendations of ITU-R BS.1116 (ITU-R, 1997). For every single person, three clap signals with a uniform length of T = 20 s were recorded. ...
Applause is the sound of many people gathered in one place and clapping their hands. It is an expression of enthusiasm and appreciation. In live recordings, it is a vital part evoking the notion of actual live-feeling and participation in an event. Depending on factors such as the audience size, location, and event type, applause signals can be very different in character. For very few people clapping, it consists of very sparse and well-organized isolated claps, ranging up to extremely dense and completely noise-like signals with no identifiable structure when a huge crowd is applauding. Because of this diversity and its impulsiveness, applause proved to be a very critical signal in the context of digital signal processing and in particular in perceptual audio coding. Specifically, the mixture of noise-like signal components (with large entropy), and individually perceivable and temporally fine-granular impulses provides only very little redundancy or irrelevancy that could be used for bit rate reduction. As a consequence, such signals frequently suffer from audible quantization noise due to coding that alters the perceived character of the decoded applause signal. Quantifying these perceptual differences proved to be difficult since traditionally used perceptual attributes do not address the salient properties of applause-like sounds. Finding such an attribute, however, would make it possible to identify and adjust the corresponding signal properties, thereby enabling an improved processing of applause signals. This thesis consists of two parts. In its first part, new insights into the psychoacoustics of applause-like sounds are presented. The question of a perceptual attribute capturing the predominant characteristics of applause sounds is also addressed by proposing and investigating "density" as such an attribute. In a number of perceptual experiments, the notion of the density attribute is refined by two additional subordinated attributes, and several hypotheses about spatial and temporal structures within applause sounds are explored. In the second part of the thesis, these findings are transferred into actual signal processing algorithms enabling a better suited handling of applause-like signals. As a basic building block, a novel semantic applause decomposition method into foreground claps and noise-like background is introduced and shown to be beneficial in several applications. In particular, it is applied in the prediction of an applause sound's overall density, in a novel blind mono-to-stereo upmix method that exploits elements of the spectral and temporal structure of applause-like sounds, and in a new post-processing scheme to enhance the perceived quality of perceptually coded applause signals.
... In the subjective evaluation, the double-blind triple-stimulus with hidden reference method standardized in ITU-R BS.1116 [15] was used. Figure 5 shows the experimental procedure, where stimulus R denotes the reference sound, namely, the PCM sound, and stimuli A and B denote the sounds used for evaluation. ...
Even if three-dimensional multichannel sound becomes available in broadcasting, backward compatibility to conventional sound systems will be necessary. There are two transmission formats that can achieve this requirement. One is the simulcast format, and the other is the channel-scalable format. Although the channel-scalable format is advantageous over the simulcast format in terms of the required data rate, the unmasking artifact cannot be avoided when matrix operations are used to realize scalability. To solve this problem, this paper proposes a novel approach that models the quantization noise signal with a polynomial expansion of a decoded signal and removes it from the decoded signal. A subjective evaluation revealed that the proposed method can alleviate the unmasking artifact in the scalable coding of 8.1- and 22.2-channel audio signals.
... 1116 (ITU-R BS.1116) [76], and the multiple stimulus with hidden reference and anchors (MUSHRA) [78], which is presented in ITU-R BS.1534 [78]. In the double-blind triple-stimulus method the subjects are presented with three stimuli ... MUSHRA is a sorting and grading process, whereas the double-blind triple stimulus is a detection and grading process. ...
Conventionally, the quality of the urban sound environment is evaluated using assessments that involve measurements and noise maps. However, noise sources in the urban environment vary with time and cannot be assessed by equivalent sound levels and spectral content only. Moreover, natural sounds such as running water can have a positive impact on the perception of environments even though they increase the overall level of the sound field. This explains the need for auralization, which is a technique used to make the sound field of an environment audible with the presence of all sound sources.
The propagation modeling for auralization can be done both with geometrical acoustics and wave-based methods. Urban environments are acoustically complex and geometrical acoustics methods have limitations in capturing modal effects and multiple diffractions from building edges inside inner city configurations, especially at low frequencies. Since noise sources such as cars in the urban environment have a low frequency content, a method that provides high accuracy at low frequencies could be needed. This can be achieved with the use of wave-based methods, but their downside is that they are computationally demanding.
In this thesis both geometrical acoustics and wave-based methods have been used for auralization purposes. This thesis contains two main subjects: 1) Modeling of directivity in the wave-based acoustics pseudospectral-time domain method (PSTD). With regards to acoustic modeling and auralization, directivity has a clear influence on the perceived sound field and needs to be included in computations. 2) Designing and evaluating car pass-by auralizations, using impulse responses computed with PSTD, a hybrid geometrical acoustics method and measurements.
Firstly, a method for the incorporation of directivity in PSTD is presented. PSTD is a time-domain method that provides an efficient way to solve the linear acoustics equations. First, a given frequency dependent source directivity is decomposed into spherical harmonic functions. The directive source is then implemented through spatial distributions in PSTD that relate to the spherical harmonics, and time-dependent functions are assigned to the spatial distributions in order to obtain the frequency content of the directivity. Since any directivity function can be expressed as a summation of a series of spherical harmonics, the approach can be used to model any type of directive source. The method was evaluated with two computational examples: 1) Modeling of an analytical directivity function in a 3D PSTD simulation; 2) Modeling of horizontal plane head-related transfer functions (HRTFs) in a 2D simulation. Results of the 1st example showed that this approach yields accurate directivity results in PSTD. It was also observed that the accuracy of the results is dependent on the distance of the recording point from the center of the modeled source. The average error across all angles and frequency bands was approximately 0.9 dB. In the 2nd example, horizontal plane HRTFs were modeled up to a frequency of 7.5 kHz. Almost perfect matching was achieved up to approximately 5 kHz.
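The first step of the directivity method described above, decomposing a sampled directivity pattern into spherical-harmonic coefficients, can be sketched as a least-squares fit. The simple real-valued basis and the cardioid-like example pattern are illustrative assumptions; injecting the coefficients into PSTD as spatial source distributions, as the thesis does, is not reproduced here.

```python
import numpy as np
from scipy.special import sph_harm

def sh_decompose(directivity, azimuth, colatitude, max_order=3):
    """Least-squares fit of sampled directivity values to a real-valued
    spherical-harmonic basis up to max_order (illustrative basis built from
    the real/imaginary parts of scipy's complex harmonics)."""
    basis = []
    for n in range(max_order + 1):
        for m in range(-n, n + 1):
            y = sph_harm(m, n, azimuth, colatitude)
            basis.append(y.real if m >= 0 else y.imag)
    basis = np.stack(basis, axis=1)              # (n_points, n_coefficients)
    coeffs, *_ = np.linalg.lstsq(basis, directivity, rcond=None)
    return coeffs

# Example: a cardioid-like pattern sampled on a coarse sphere grid
az, col = np.meshgrid(np.linspace(0, 2 * np.pi, 24, endpoint=False),
                      np.linspace(0.1, np.pi - 0.1, 12))
pattern = 0.5 * (1 + np.sin(col) * np.cos(az))
c = sh_decompose(pattern.ravel(), az.ravel(), col.ravel(), max_order=2)
print(c.shape)                                    # (9,) coefficients for order 2
```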
Secondly, a method for auralization of a car pass-by in a street is explored using PSTD, including the technique developed for the directivity modeling. The transfer paths between sound source locations and a listener are represented via binaural impulse responses, which are computed with the PSTD method. A dry synthesized car signal is convolved with the binaural impulse responses of the different locations in the street, and cross-fade windows are used in order to make the transition between the source positions smooth and continuous. The auralizations were performed for the simplified scenarios where buildings are absent, and for an environment where a long flat wall is located behind the car. A same/different listening test was carried out in order to investigate if increasing the angular spacing between the discrete source positions affects the perception of the auralizations. Signal detection theory (SDT) was used for the design and the analysis of the listening test. Results showed that differences exist, although they are difficult to notice. On average, 52.3% of the subjects found it difficult to impossible to spot any difference between auralizations with larger angular spacing (up to 10°) and the reference auralization (2° angular spacing). This auralization methodology was extended for an urban street canyon. This time the auralizations were implemented using binaural impulse responses measured inside a real street canyon and simulated with geometrical acoustics software. The same listening experiment was again conducted. Results showed that subjects could detect the difference between the reference auralization and auralizations with larger angular increments much better than in the previous case, both for the auralizations synthesized from the measurements as well as those from the simulations.
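For the SDT analysis mentioned above, the sensitivity index d′ is commonly computed as z(hit rate) − z(false-alarm rate). The sketch below uses the basic yes/no formula with a 0.5 count correction; a same/different design strictly calls for a differencing-model correction, and the counts are placeholders.

```python
from scipy.stats import norm

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate). The 0.5
    count correction avoids infinite z-scores for perfect rates; a strict
    same/different analysis would add a differencing-model correction."""
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

# placeholder counts from a hypothetical same/different block of 100 trials
print(d_prime(hits=30, misses=20, false_alarms=12, correct_rejections=38))
```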
... The most reliable method for quality assessment is via subjective testing with a group of listeners (Brachmański, 2015). Usually, a variant of the MOS (Mean Opinion Score) is applied (ITU, 1996; ITU, 1997). One of the most frequently used methods is DCR (Degradation Category Rating), where listeners compare the quality of two samples on a 5-step scale. ...
In the age of digital media, delivering high-quality content to consumers is one of the most demanding tasks. Numerous broadcasting standards exist, each with different pros and cons, and the DAB/DAB+ (Digital Audio Broadcasting) system is one of the most popular among them. From an engineer's perspective, efficient resource management under limited bandwidth conditions has always been a challenge. In this paper, a subjective quality assessment study of the DAB and DAB+ broadcasting systems is performed on a representative group of signal samples. The paper describes the radio link, including a fully functional transmitter designed for the purpose of this test, as well as the receiver side, represented by a commercially available consumer device, for a truly real-time, end-to-end quality evaluation.
This paper proposes a method to reduce the computational cost of the spectral division method that synthesizes moving sources. The proposed method consists of two approximations: that of the secondary source driving function and that of the trajectory of the moving sources. Combining these two approximations simplifies the integral calculations that traditionally appear in the driving functions, replacing them with a correction of the frequency magnitude and phase of the source signals. Numerical simulations and subjective experiments show that the computational cost can be reduced by a factor of 50–100 compared to the conventional method without significantly affecting the synthesized sound field and the sense of localization.
Coding methods used for bit-rate reduction (data compression) have the task of reducing the amount of data required to transmit or store digital signals, either losslessly or with as little loss of quality as possible. They are employed either for economic reasons, such as the cost savings afforded by lower required transmission bandwidths, or for technical reasons, such as limited storage space or restricted transmission capacity. Coding methods are used in data networks such as the Internet for the multimedia distribution of music and films, in streaming services and broadcasting, in cinemas and in telecommunications, but also on physical media such as the DVD (Digital Versatile Disc), for archiving large amounts of data on hard disks, and on memory cards in portable media players. This chapter conveys the technical fundamentals of efficient, perceptually motivated audio coding. In addition, some common standardized methods for measuring subjective audio quality are explained, and an overview of widely used lossless and lossy audio coding schemes and their qualitative classification is given.
This thesis is set in the context of the rise of immersive content. In recent years, technologies for capturing and reproducing immersive sound have developed considerably. This new content has created the need for new methods dedicated to spatial audio compression, particularly in the field of telephony and conversational services. There are several ways to represent spatial audio; in this thesis we focus on first-order Ambisonics. In a first step, our work addressed a solution for improving multi-mono coding. This solution consists of a pre-processing stage applied upstream of the multi-mono codec to decorrelate the signals of the ambisonic components. Particular attention was paid to guaranteeing signal continuity between frames and to the quantization of the spatial metadata. In a second step, we studied how knowledge of the spatial distribution of the signal energy, also called the spatial image, can be used to create new coding methods. The use of this spatial image led to two compression methods. The first proposed approach is based on a spatial correction of the decoded signal. This correction relies on the difference between the spatial images of the original and decoded signals in order to attenuate spatial alterations. In a second approach, this principle was extended to a parametric coding method. In a final, more exploratory part of this thesis, we studied a neural-network compression approach inspired by variational autoencoder models for image compression.
The full text is available at https://research.chalmers.se/en/publication/527070.
The presentation of extended reality for consumer and professional applications requires major advancements in the capture and reproduction of its auditory component to provide a plausible listening experience. A spatial representation of the acoustic environment needs to be considered to allow for movement within, or interaction with, the augmented or virtual reality. This thesis focuses on the application of capturing a real-world acoustic environment by means of a spherical microphone array with subsequent head-tracked binaural reproduction to a single listener via headphones. The introduction establishes the fundamental concepts and relevant terminology for non-experts of the field. Furthermore, the specific challenges of the method due to spatial oversampling of the sound field, as well as physical limitations and imperfections of the microphone array, are presented to the reader. The first objective of this thesis was to develop software in the Python programming language that is capable of performing all required computations for the acoustic rendering of the captured signals in real time. The implemented processing pipeline was made publicly available under an open-source license. Secondly, specific parameters of the microphone array hardware as well as of the rendering software that are required for a perceptually high reproduction quality were identified and investigated by means of multiple user studies. Lastly, the results provide insights into how unwanted additive noise components in the captured microphone signals from different spherical array configurations contribute to the reproduced ear signals.
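As a highly reduced sketch of such a real-time rendering loop (not the published open-source pipeline), block-wise FIR filtering with overlap-add of a single captured channel to the two ears could look as follows; fir_l and fir_r stand in for pre-computed array-to-ear filters, and every incoming block is assumed to have the same length.

import numpy as np

def block_render(mic_blocks, fir_l, fir_r, block_len):
    # Block-wise binaural rendering of one captured channel by overlap-add
    # FIR filtering; yields one (2, block_len) binaural frame per input block.
    tail = len(fir_l) - 1
    overlap = np.zeros((2, tail))
    for block in mic_blocks:          # each block is assumed to have block_len samples
        out = np.stack([np.convolve(block, fir_l),
                        np.convolve(block, fir_r)])
        out[:, :tail] += overlap      # add the tail of the previous block
        overlap = out[:, block_len:]  # keep the new tail for the next iteration
        yield out[:, :block_len]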
The paper discusses a case study of sensing analytics and technology in acoustics applied to reverberant conditions. Reverberation is one of the issues that make speech in indoor spaces difficult to understand. This problem is particularly critical in large spaces with few absorbing or diffusing surfaces. One natural remedy for improving speech intelligibility in such conditions is to speak more slowly, and algorithms exist that reduce the rate of speech (RoS) in real time. The study therefore aims to find recommended RoS values as a function of the STI (speech transmission index) in different acoustic environments. In the experiments, speech intelligibility for six impulse responses recorded in spaces with different STIs is investigated using a sentence test (for the Polish language). Fifteen subjects with normal hearing participated in these tests. The results of the analysis enabled us to propose a curve specifying the maximum RoS values that still yield intelligible speech under given acoustic conditions. This curve can be used in speech processing control technology as well as in compressive reverse acoustic sensing.
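As a minimal offline illustration of reducing the rate of speech (the study itself concerns real-time RoS control), a phase-vocoder time stretch such as the one provided by librosa can be used; the file name below is hypothetical.

import librosa

# Hypothetical input file; rate < 1 slows the utterance down (lower RoS),
# making it roughly 25 % longer here without changing its pitch.
y, sr = librosa.load("sentence_pl.wav", sr=None)
slower = librosa.effects.time_stretch(y, rate=0.8)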
Steganography is a technique that makes it possible to hide additional information (the payload) in an original signal (the cover work). This paper focuses on hiding information in a speech signal. One of the major problems in steganographic systems is ensuring synchronization. The paper presents four new and effective mechanisms that allow synchronization to be achieved on the receiving side. Three of the developed synchronization methods operate directly on the acoustic signal, while the fourth works in a higher layer, analyzing the structure of the decoded steganographic data stream. Results concerning both the evaluation of signal quality and the effectiveness of synchronization are presented. Signal quality was assessed with both objective and subjective methods. The research confirmed the effectiveness of the developed synchronization methods during transmission of steganographic data over a VHF radio link and a VoIP channel.
This chapter broadly introduces the reader to sound quality. The concept of sound quality has a relatively long history: probably the oldest sounds associated with a quality judgment were human speech and singing, followed by theatre and music-making, including musical instruments. The development of physics and the related mathematics began to make it possible to relate objective factors to the subjective quality of sound. The chapter discusses the concept of sound quality from a methodological point of view and in different problem domains, such as speech transmission, concert hall and auditorium acoustics, audio reproduction of sound, noise quality, and the general concept of product sound quality. It covers basic concepts and aspects of speech communication over different channels, which are key factors of sound quality in the context of speech. The chapter also reviews sound quality measures for audio and for speech in telecommunications.
The chapter outlines the concepts of Sound Quality and Quality of Experience (QoE). Building on these, it describes a conceptual model of sound quality perception and experience during active listening in a spatial-audio context. The presented model of sound quality perception considers both bottom-up (signal-driven) as well as top-down (hypothesis-driven) perceptual functional processes. Different studies by the authors and from the literature are discussed in light of their suitability to help develop implementations of the conceptual model. As a key prerequisite, the underlying perceptual ground-truth data required for model training and validation are discussed, as well as means for deriving these from respective listening tests. Both feature-based and more holistic modeling approaches are analyzed. Overall, open research questions are summarized, deriving trajectories for future work on spatial-audio Sound Quality and Quality of Experience modeling.
The central topic of this thesis is Reverberation. Reverberation is used as a global term to describe a series of physical and perceptual phenomena that occur in enclosed environments and relate to the acoustical interaction between a sound source and the enclosure.
This work focuses on the effects of reverberation that are likely to occur within common listening environments, such as car cabins and ordinary residential listening rooms. In the first study, a number of acoustical fields were captured in a physically modified car cabin and evaluated by expert listeners in a laboratory using a spatial reproduction system. In the second study, nine acoustical conditions from four ordinary listening rooms were perceptually evaluated by experienced listeners.
The results indicated the importance of decay times in these types of enclosures, even though such decay times are theoretically short and non-dominant quantities. It was shown that a number of perceived attributes were evoked by the alterations of the fields, both within the same enclosure and between different ones.
The studies made use of a novel assessment framework, which forms a significant part of this work. The proposed framework overcomes previously identified challenges in the perceptual evaluation of room acoustics relating to the acquisition and presentation of the acoustical fields, as well as to the perceptual evaluation of such complex sound stimuli. It was shown that this framework was able to decompose the phenomena that underlie the perceived sensations across assessors. The related multivariate analysis techniques employed the conjoint interpretation of both the physical and perceptual properties of the fields in a factorial space and effectively enabled the direct investigation of their relationships.
Overall, the work described in this thesis contributes to: (1) understanding the perceptual effects imposed on reproduced sound within automotive and residential enclosures, and (2) the design and implementation of a perceptual assessment protocol for evaluating room acoustics.
The thesis contains two parts. In the first, the background and rationale of the research project are presented. The second part includes four articles that describe in detail the research undertaken.
This work aims at modifying speech signal samples and testing them with objective speech quality indicators after mixing the original signals with noise or with an interfering signal. The modifications applied to the signal are related to Lombard speech characteristics, i.e., pitch shifting, changes in utterance duration, vocal tract scaling, and manipulation of formants. A set of words and sentences in Polish, recorded in silence as well as in the presence of interfering signals, i.e., pink noise and so-called babble speech (also referred to as the "cocktail-party" effect), is used. The speech samples were then processed and measured with objective indicators to check whether the modifications applied to the signal in the presence of noise increased the value of the speech quality index, i.e., the PESQ (Perceptual Evaluation of Speech Quality) standard.
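A hedged sketch of one measurement step of this kind, assuming the third-party pesq and librosa packages and hypothetical file names; the paper's actual set of modifications and mixing conditions is broader.

import numpy as np
import librosa
from pesq import pesq  # ITU-T P.862 implementation from the third-party 'pesq' package

fs = 16000
clean, _ = librosa.load("word_clean.wav", sr=fs)   # hypothetical recordings
noise, _ = librosa.load("pink_noise.wav", sr=fs)

# One Lombard-style manipulation, used here purely as an illustration:
# raise the pitch by two semitones without changing the duration.
modified = librosa.effects.pitch_shift(clean, sr=fs, n_steps=2)

def mix_at_snr(speech, interferer, snr_db):
    # Scale the interferer so that the mixture has the requested SNR;
    # the interferer is assumed to be at least as long as the speech.
    interferer = interferer[:len(speech)]
    gain = np.sqrt(np.sum(speech**2) / (np.sum(interferer**2) * 10**(snr_db / 10)))
    return speech + gain * interferer

for name, sig in [("original", clean), ("modified", modified)]:
    noisy = mix_at_snr(sig, noise, snr_db=0)
    print(name, pesq(fs, clean, noisy, 'wb'))  # wideband PESQ against the clean reference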
The human auditory system manages to handle very different tasks, ranging from orientation in complex traffic situations to speech communication at a crowded party or via mobile devices, even in highly adverse situations where the target signal is disturbed by different types of maskers such as environmental noise, interfering talkers, detrimental sound reflections, or distortions from signal processing. Experimental methods from different fields of hearing research, such as psychoacoustics (discrimination or detection thresholds), speech intelligibility, and audio quality, are therefore required to capture the abilities and limitations of the auditory system. Only a few rather complex auditory models have been demonstrated to be applicable to predicting data from psychoacoustics, speech intelligibility, and audio quality, the three areas of auditory perception considered in this thesis. However, some parameters (e.g., the frequency range of the auditory filterbank) were often adapted to the individual experiments. A generalized modeling approach that consistently uses identical model parameters and processing stages for the extraction of auditory features in the model front end, in combination with a task-dependent decision stage (back end), would be required to identify and understand which features are universal and capture information relevant for predictions across the three areas of auditory perception considered here. Moreover, with regard to the computational efficiency required for applications such as online monitoring of speech quality in hearing-aid signal processing, it is unclear to what extent such a generalized auditory modeling approach can be simplified while still providing reasonable prediction performance.
Hence, the aim of this thesis is to provide a modeling approach with low complexity, consisting of a joint front end that includes only the basic auditory processing stages required to account for the most relevant masking effects, and a task-dependent back end for predicting effects of psychoacoustic masking, speech intelligibility, and audio quality. The first part of this thesis (chapter 2) suggests an auditory modeling approach based on the power spectrum model (PSM; Fletcher, 1940; Patterson and Moore, 1986) and the envelope power spectrum model (EPSM; Ewert and Dau, 2000) as a front end to predict psychoacoustic masking and speech intelligibility on the basis of spectral and temporal features. The proposed model was assessed with a critical set of psychoacoustic and speech intelligibility experiments and achieved a prediction performance comparable to state-of-the-art models for predicting psychoacoustic and speech intelligibility data. Motivated by findings from Schubotz et al. (2016), which imply the relevance of short-time power features for speech intelligibility predictions, the second part (chapter 3) provides a revised spectral feature analysis within the PSM pathway of the model suggested in the first part. This revised model was successfully evaluated with the same set of experiments used in the first part of this work and with the speech intelligibility experiments carried out in Schubotz et al. (2016).
An analysis of the PSM and EPSM pathways of the revised model provides information about the contribution of spectral and temporal cues to speech intelligibility predictions for different maskers. The third part of this thesis (chapter 4) extends the auditory models presented in chapters 2 and 3 to account for signal degradations in terms of audio quality. The suggested audio quality model was successfully evaluated on four databases with different types of distortions that cover a broad range of quality-influencing factors, and it offered better average prediction performance across the four databases than other state-of-the-art quality models. So far, the proposed modeling approaches rely only on monaural cues, while binaural cues are not considered. The fourth part of this thesis (chapter 5) contributes towards a binaural extension of these models by providing an experimental evaluation framework that can be applied as a benchmark test for binaural speech intelligibility models. In chapter 5, based on the studies of Schubotz et al. (2016) and Ewert et al. (2017), the effect of different room acoustical properties on speech reception thresholds and on the spatial release from masking was assessed. The findings of this study indicate the importance of spatial cues for speech intelligibility in reverberant surroundings. Taken together, this thesis offers a generalized modeling approach for predicting data from psychoacoustic masking, speech intelligibility, and audio quality experiments. Additionally, the thesis provides benchmark databases that can be utilized for the development and evaluation of auditory models.
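As a highly reduced, single-channel sketch of the envelope-power idea underlying the EPSM front end (not the thesis model, which uses a full auditory filterbank and multiple modulation filters), the following computes a normalized envelope power around one modulation frequency after gammatone filtering; it assumes a recent SciPy that provides signal.gammatone.

import numpy as np
from scipy import signal

def envelope_power(x, fs, fc=1000.0, fmod=4.0):
    # Gammatone filtering at fc, Hilbert envelope, then the power in a
    # one-octave modulation band around fmod, normalized by the DC envelope power.
    b, a = signal.gammatone(fc, 'iir', fs=fs)
    sub = signal.lfilter(b, a, x)
    env = np.abs(signal.hilbert(sub))
    env_ac = env - np.mean(env)
    f, pxx = signal.periodogram(env_ac, fs)
    band = (f > fmod / np.sqrt(2)) & (f < fmod * np.sqrt(2))
    return np.sum(pxx[band]) / (np.mean(env) ** 2 + 1e-12)

fs = 16000
t = np.arange(fs) / fs
carrier = np.random.randn(fs)
am = (1 + 0.5 * np.sin(2 * np.pi * 4 * t)) * carrier   # 4 Hz amplitude modulation
print(envelope_power(am, fs))        # larger than for the unmodulated carrier
print(envelope_power(carrier, fs))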
The objective of this research project is to present the results of ray-tracing computer simulations of the main design proposals for recording-studio control rooms from the 1960s onward. Although the validity of this technique may be questioned for low-frequency simulations, it does acquaint us with the pattern of early reflections, which helps control the quality and consistency of the listening conditions in studio control rooms. The simulated impulse responses allow a comparative evaluation of objective acoustic parameters, although not all of them are applicable to rooms of small volume. The characteristics of each design philosophy will be compared and discussed in the light of these results.
Few studies have examined the influence of stereoscopy on the perception of an audio mix in the cinema. Statements by re-recording mixers and scientific articles nevertheless show a wide range of opinions on the subject. Some consider this influence negligible, while others argue that our conception of the soundtrack must be completely rethought, both in terms of mixing and of playback. A first series of experiments looked at the perception of ambient sounds. Eight sequences, in their stereoscopic (3D-s) and non-stereoscopic (2D) versions, were played in a cinema to subjects with several different mixes. For each presentation, the subjects had to rate how far the mix seemed too frontal or, on the contrary, too "surround", the aim being to reveal a possible influence of stereoscopy on the perception of the front/surround balance of an audio mix. The results were consistent with those of a preliminary experiment conducted in a mixing auditorium, where the subjects took the role of the mixer and adjusted the front/surround balance themselves: the influence of stereoscopy was weak and appeared only for a few sequences. Further studies then addressed the perception of sound objects such as dialogue and effects. A fourth experiment examined the ventriloquism effect in elevation: when audio and visual stimuli that are temporally coincident but spatially disparate are presented, subjects sometimes perceive the sound at the same location as the visual stimulus. This phenomenon is called the ventriloquism effect because it recalls the illusion created by a ventriloquist, whose voice seems to come from the puppet rather than from the ventriloquist's own mouth. It has been studied extensively in the horizontal plane and, to a lesser extent, in distance, but very few studies have addressed elevation. In this experiment, subjects were shown audiovisual sequences of a man speaking. His voice could be reproduced over different loudspeakers, creating larger or smaller disparities in azimuth and elevation between the sound and the image. For each presentation, the subjects indicated whether or not the voice seemed to come from the same direction as the actor's mouth. The results showed that the ventriloquism effect is very strong in elevation, which suggests that it may not be necessary to seek audiovisual coherence in elevation in the cinema.
The need to assess audio quality using objective metrics is discussed. The literature review conducted showed an inability of standard objective metrics to relate to perceptual quality. This project investigates the dependencies of audio quality, focussing on the perceptual and cognitive processes involved in quality assessments, as an attempt to improve the existing objective metrics. The perceptual and cognitive properties of the auditory system were found to be essential tools in objective audio quality metrics. A literature-based investigation into current perceptual measurement methods revealed their unreliability and limitations: their output is questionable because such schemes are based only on physical properties of the auditory system, they are application-specific, and they provide only a very restricted global evaluation. The lower-level attributes of timbre were found to be a necessary addition to these schemes as an attempt to improve their reliability. An experiment was carried out using a wide range of processed sounds with various 'brightness' alterations. Listening tests revealed that a slight increase in brightness tends to improve the perceived quality, while further brightness alteration affects the quality significantly. The results of the experiments demonstrate the importance of assessing quality in terms of perceptual attributes rather than single BAQ ratings, which allows a better understanding of the sound characteristics during objective evaluation, similarly to parametric quality assessment via listening tests. Further work is suggested to investigate the major timbral and spatial attributes involved in audio quality. Incorporating perceptual attributes into these schemes would be a significant step towards improving the reliability, effectiveness, and usefulness of perceptually based schemes, and might result in a measurement scheme that describes 'how a product sounds' rather than how well it hides distortions compared with older methods.
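The spectral centroid is a common objective correlate of perceived brightness; as a hedged illustration (the project may have operationalized brightness differently), it can be computed as follows.

import numpy as np

def spectral_centroid(x, fs):
    # Spectral centroid in Hz, an illustrative brightness correlate,
    # not necessarily the measure used in the project described above.
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    return np.sum(freqs * mag) / (np.sum(mag) + 1e-12)

# Example: boosting high-frequency content raises the centroid ("brighter" sound)
fs = 44100
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 440 * t) + 0.1 * np.sin(2 * np.pi * 4400 * t)
brighter = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 4400 * t)
print(spectral_centroid(tone, fs), spectral_centroid(brighter, fs))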
In ambisonics, the number of loudspeakers must be greater than or equal to the minimum required by the ambisonic order. On the one hand, if the number of loudspeakers in an ambisonic system satisfies the minimum requirement exactly, localization in the lateral regions can be poor. On the other hand, the use of a large number of loudspeakers in an ambisonic array induces spectral impairments. In this paper, we present a binaural ambisonic decoder equipped with a 1/3-octave equalizer that demonstrates how to improve sound localization without perceptual spectral impairment. The equalization decoding estimates the level of spectral impairment in a virtual high-density loudspeaker array and uses a 1/3-octave filterbank to equalize the frequency components that are low-pass filtered or comb filtered, so that the magnitude of the treated signal is nearly uniform from low to high frequencies. Both objective and subjective listening tests were conducted. The experimental results show that the proposed method reinforces sound localization, especially at low frequencies. Additionally, the use of a 1/3-octave filterbank facilitates the higher-order extension of binaural ambisonic decoding.
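As a generic illustration of the band-wise equalization idea (not the decoder proposed in the paper), the sketch below estimates 1/3-octave band energies of a processed signal against a reference and derives per-band correction gains with Butterworth band-pass filters; the 50 Hz lower limit and band edges at fc * 2**(1/6) are arbitrary choices.

import numpy as np
from scipy import signal

def third_octave_gains(x, target, fs, f_lo=50.0, f_hi=16000.0):
    # Estimate the energy of x and of a target signal in 1/3-octave bands and
    # return per-band gains that would bring x towards the target spectrum.
    fc = f_lo * 2 ** (np.arange(int(np.log2(f_hi / f_lo) * 3) + 1) / 3.0)
    gains = []
    for f in fc:
        lo, hi = f / 2 ** (1 / 6), min(f * 2 ** (1 / 6), 0.499 * fs)
        sos = signal.butter(4, [lo, hi], btype='bandpass', fs=fs, output='sos')
        e_x = np.mean(signal.sosfilt(sos, x) ** 2)
        e_t = np.mean(signal.sosfilt(sos, target) ** 2)
        gains.append(np.sqrt(e_t / (e_x + 1e-12)))
    return fc, np.array(gains)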
(Open Access: https://doi.org/10.1121/1.5040489)
Binaural rendering of Ambisonic signals is of great interest in the fields of virtual reality, immersive media, and virtual acoustics. Typically, the spatial order of head-related impulse responses (HRIRs) is considerably higher than the order of the Ambisonic signals. The resulting order reduction of the HRIRs has a detrimental effect on the binaurally rendered signals, and perceptual evaluations indicate limited externalization, localization accuracy, and altered timbre. In this contribution, a binaural renderer is presented that is computed using a frequency-dependent time alignment of HRIRs followed by a minimization of the squared error subject to a diffuse-field covariance matrix constraint. The frequency-dependent time alignment retains the interaural time difference (at low frequencies) and results in an HRIR set with lower spatial complexity, while the constrained optimization controls the diffuse-field behavior. Technical evaluations in terms of sound coloration, interaural level differences, diffuse-field response, and interaural coherence, as well as findings from formal listening experiments, show a significant improvement of the proposed method compared to state-of-the-art methods.
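As a loose illustration of only the frequency-dependent time-alignment idea (the paper's renderer additionally involves a constrained least-squares optimization, which is not reproduced here), the sketch below keeps the original phase below a cutoff frequency and replaces it above the cutoff with the phase of a crude per-HRIR broadband delay estimate.

import numpy as np

def time_align_hrirs(hrirs, fs, f_cut=1500.0):
    # hrirs: (n_measurements, n_taps) array. Below f_cut the original phase
    # (and hence the ITD) is kept; above f_cut the phase is replaced by that
    # of a broadband delay, crudely estimated from the peak of each HRIR.
    n = hrirs.shape[-1]
    H = np.fft.rfft(hrirs, axis=-1)
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    delays = np.argmax(np.abs(hrirs), axis=-1) / fs
    aligned = np.abs(H) * np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])
    H_out = np.where((freqs >= f_cut)[None, :], aligned, H)
    return np.fft.irfft(H_out, n, axis=-1)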
In this paper, we propose a binaural system for virtual auditory space synthesis and reproduction. We introduce a subjective listening test for head-related transfer function (HRTF) customization and real-time implementation of ambisonic surround sound over headphones. In this test, a listener first selects the closest-matching HRTF dataset from an existing database. The selected HRTF dataset must fulfill the criteria of front-back discrimination and up-down discrimination. Secondly, the selected HRTF dataset is utilized to build a binaural ambisonic system composed of an encoder for synthesizing multiple sound sources, an image-source model for listening-room simulation, an ambisonic rotator responding to head motion in three axes, and a binaural ambisonic decoder. Additionally, we propose a system optimization technique that minimizes computational cost and enhances audio quality. Finally, we present both subjective and objective measurements of the perceived audio quality that were conducted to validate this approach. The audio quality assessment examined the frequency hearing impairment, localization error, and localization blur. The proposed system is especially suitable for mixing and rendering immersive sound in virtual reality.
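For orientation, the encoder and rotator stages mentioned above can be illustrated with first-order Ambisonics in a few lines of numpy; this is a generic textbook formulation (traditional B-format with a 1/sqrt(2)-weighted W channel), not necessarily the convention used in the paper, and the sign of the yaw rotation depends on whether the head or the sound field is rotated.

import numpy as np

def foa_encode(s, az, el):
    # First-order B-format encoding of a mono signal s at azimuth az and
    # elevation el (radians); W uses the traditional 1/sqrt(2) weighting.
    w = s / np.sqrt(2.0)
    x = s * np.cos(az) * np.cos(el)
    y = s * np.sin(az) * np.cos(el)
    z = s * np.sin(el)
    return np.stack([w, x, y, z])

def foa_rotate_yaw(bformat, yaw):
    # Sound-field rotation about the vertical axis (the operation a yaw-only
    # head-tracker requires): W and Z are invariant, X and Y are rotated.
    w, x, y, z = bformat
    c, s = np.cos(yaw), np.sin(yaw)
    return np.stack([w, c * x - s * y, s * x + c * y, z])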