
Quality Assessment of Two Fullband Audio Codecs Supporting Real-Time Communication

Recent audio codecs enable high-quality signals up to fullband (20 kHz), which is usually associated with the maximal audible bandwidth. Following previous studies on speech coding assessment, we survey in this novel study the music coding ability of two real-time codecs with fullband capability: the IETF-standardized Opus codec and the 3GPP-specified EVS codec. We tested both codecs with vocal, instrumental and mixed music signals. For evaluation, we predicted human assessments using the instrumental POLQA method, which was primarily designed for speech assessment. Additionally, we performed two listening tests as a reference with a total of 21 young adults. Opus and EVS show a similar music coding performance. The quality assessment mainly depends on the specific music characteristics and on the tested bitrates from 16.4 to 64 kbit/s. The POLQA measure and the listening results correlate, whereas the absolute ratings of the young listeners reach much lower MOS values.
Part I
The final publication is available at Springer via
http://dx.doi.org/10.1007/978-3-319-43958-7
M. Maruschke, O. Jokisch, M. Meszaros, F. Trojahn, M. Hoffmann
Leipzig University of Telecommunications (HfTL), Germany
maruschke@hft-leipzig.de, jokisch@hft-leipzig.de
http://www.hft-leipzig.de
Abstract. Draft only; the complete publication is available at Springer via
http://dx.doi.org/10.1007/978-3-319-43958-7
Recent audio codecs enable high-quality signals up to fullband (20 kHz),
which is usually associated with the maximum audible bandwidth. Following
previous studies on speech coding assessment, we survey in this novel
study the music coding ability of two real-time codecs with fullband
capability: the IETF-standardized Opus codec and the 3GPP-specified
EVS codec. We test both codecs with vocal, instrumental
and mixed music signals. For evaluation, we predict human assessments
with the instrumental POLQA method, which was primarily designed for
speech assessment. Additionally, we perform a listening test with 21
young adults as a reference. Opus and EVS show a similar music coding
performance. The quality assessment mainly depends on the specific
music characteristics and on the tested bitrates from 16.4 to 64 kbit/s.
The POLQA measures and the listening results correlate, whereas
the absolute ratings of the young listeners reach lower MOS values.
Keywords: Opus, EVS, music coding, POLQA, listening test
1 Introduction
In current IP-based real-time communication, two fullband audio codecs
predominate: the Internet-driven Opus codec [1] and the Enhanced Voice
Services (EVS) codec [2], which is backed by the telecommunication carriers.
The widespread Opus codec is pre-installed in popular web browsers such as
Google Chrome and Firefox and supports their so-called Web-based Real-Time
Communication (WebRTC) functionality [3]. At nearly the same time, the
telecommunication industry standardized the EVS coder, aiming at a flexible
and sustainable fullband audio codec that is backward compatible with
existing wideband speech codecs in public cellular networks (e.g. the AMR-WB
codec). Despite the different motivations behind their development, both Opus
and EVS support the audio bandwidth categories listed in Table 1. Both
codecs are intended and specified to serve different needs: fullband (FB)
speech coding as well as high-quality music communication applications such as
live music streaming or web-radio broadcasting [1] [2]. It is therefore
reasonable to assess the codec quality of such “all-rounders”, most notably
for different kinds of music. In this contribution, we focus on comparing the
fullband music performance of Opus and EVS. To pose diverse challenges, we
test samples of the following music categories:
Singing voice (a-capella music),
Musical instruments,
Mixed music (instrumental parts and singing voice).
We intend to detect audible differences by using these music styles in
varying codec operating modes. The test design targets adequate bitrate
conditions for both the Opus and the EVS codec in FB mode and uses the
instrumental assessment method Perceptual Objective Listening Quality
Assessment (POLQA) [4] in FB music operation. It should be noted that the
perceptual model of POLQA was not developed for music assessment. Our study
therefore tests the limits of this method; such a test has not been published
so far. Furthermore, we performed a listening test with human listeners, using
the widely used audio quality rating score, the Mean Opinion Score (MOS).
Following an overview of previous Opus- and EVS-related research in section 2,
we introduce our test design in section 3. Afterwards, we discuss the
instrumental and perceptual assessment results for Opus and EVS in section 4
and formulate some conclusions.
Beyond these two, there are other fullband codecs, namely the AAC-ELD family
[5] and G.719 [6]. For practical reasons (e.g. missing support in WebRTC
environments), however, they are not widely used.
2 Previous studies on Opus and EVS
Standardized in RFC 6716 [1] by the Internet Engineering Task Force (IETF),
Opus is designed as an all-purpose interactive speech and audio codec. It is
suitable for use cases such as Voice over IP, videoconferencing, online gaming
and audio on demand, and it covers low-bitrate speech as well as very high
quality stereo music. To achieve high quality and dynamic characteristics,
Opus combines the linear prediction-based SILK codec with the Modified
Discrete Cosine Transform (MDCT)-based Constrained Energy Lapped Transform
(CELT) codec. For flexible use, the Opus codec supports the frequency band
types NB, WB, SWB and FB. Consequently, the Opus codec provides speech and
music (alternatively mono or stereo) within a bit rate range from 6 kbit/s to
510 kbit/s and low-delay coding (2.5 ms to 60 ms) for all relevant sample
rates from 8 kHz up to 48 kHz. Opus supports constant and variable bitrate
(VBR); the latter is the default operational mode. Since its introduction in
2010/2011, the Opus codec has passed several listening test campaigns,
supplemented by comparisons with other speech and audio codecs (Speex NB/WB,
iLBC, G.722.1/G.722.1C, AMR NB/WB, HE-AAC, Vorbis). The results are examined
and summarized by Hoene et al. [7]; Opus outperformed all other codecs, in
particular in the wider bands where applicable. However, the mentioned tests
were carried out on the stand-alone codec without disruptions.
Table 1. Audio bandwidths and corresponding quality levels
Abbreviation  Meaning        Pass-band          Quality expectation
NB            narrowband     0.3 ... 3.4 kHz    traditional phone voice
WB            wideband       50 Hz ... 7 kHz    AM radio / HD voice
SWB           superwideband  50 Hz ... 14 kHz   FM radio / full HD voice + music
FB            fullband       20 Hz ... 20 kHz   CD quality / full HD voice + music
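The bandwidth categories of Table 1 can be written down as a small lookup, which makes the relation between pass-band and sample rate explicit. The following is only an illustrative sketch: the function names are hypothetical, and the minimum sample rate is the plain Nyquist bound (twice the upper band edge), not a value taken from either codec specification.

```python
# Pass-bands from Table 1 in Hz: (lower edge, upper edge).
BANDS = {
    "NB": (300, 3_400),
    "WB": (50, 7_000),
    "SWB": (50, 14_000),
    "FB": (20, 20_000),
}

def min_sample_rate_hz(band: str) -> int:
    """Nyquist bound: smallest sample rate that can represent the upper band edge."""
    return 2 * BANDS[band][1]

def narrowest_band_for(upper_edge_hz: float) -> str:
    """Return the narrowest Table 1 category whose pass-band reaches upper_edge_hz."""
    for name in ("NB", "WB", "SWB", "FB"):
        if upper_edge_hz <= BANDS[name][1]:
            return name
    raise ValueError("content exceeds the fullband pass-band")

if __name__ == "__main__":
    for name in BANDS:
        print(name, min_sample_rate_hz(name), "Hz")
```

Under this bound, the 48 kHz sample rate mentioned for both codecs is sufficient for the 20 kHz upper edge of the FB category.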
EVS was standardized by the 3GPP in 2014 as the successor to the AMR-WB codec
and is designed for packet-switched networks as well as for mobile
communication such as VoLTE [2]. As an all-purpose codec, it is comparable to
Opus. To provide high quality and dynamic characteristics, EVS combines
several working modes: it can switch on command between Linear Prediction
(LP)-based, frequency-domain and inactive-signal (Comfort Noise Generation
(CNG)) coding. To be applicable in multiple switched-network use cases, EVS
provides higher resilience against packet losses and errors. Furthermore, EVS
supports an interoperable mode for interworking with AMR-WB. It supports all
frequency band types, with FB being optional. Like Opus, EVS handles speech
and music, at bit rates from 7.2 kbit/s up to 128 kbit/s and with low delay
times from 30.9 ms to 32 ms for all relevant sample rates (8 kHz NB up to
48 kHz FB). Currently, EVS is not yet ready for stereo music processing.
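The codec parameters quoted in this section can be collected in a small data structure to check which test conditions both codecs can serve. This is a minimal sketch using only the figures stated in the text; the class and function names are hypothetical, and treating the bitrate range as continuous is a simplification (a real codec may only offer discrete operating points).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CodecSpec:
    name: str
    min_kbps: float      # lowest bit rate quoted in the text
    max_kbps: float      # highest bit rate quoted in the text
    min_delay_ms: float
    max_delay_ms: float
    stereo: bool         # stereo music support as described in the text

# Values as stated in this section.
OPUS = CodecSpec("Opus", 6.0, 510.0, 2.5, 60.0, stereo=True)
EVS = CodecSpec("EVS", 7.2, 128.0, 30.9, 32.0, stereo=False)

def supports(spec: CodecSpec, kbps: float, channels: int = 1) -> bool:
    """True if the bit rate lies in the quoted range and the channel count is supported."""
    in_range = spec.min_kbps <= kbps <= spec.max_kbps
    return in_range and (channels == 1 or spec.stereo)

def common_bitrate_range(a: CodecSpec, b: CodecSpec) -> tuple[float, float]:
    """Overlap of the two quoted bit rate ranges."""
    return max(a.min_kbps, b.min_kbps), min(a.max_kbps, b.max_kbps)

if __name__ == "__main__":
    print(common_bitrate_range(OPUS, EVS))
    print(supports(EVS, 64.0, channels=2))
```

The stereo flag shows why a like-for-like comparison has to fall back to mono operation.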
The ITU-T P.863 recommendation (POLQA) describes an objective method
for predicting the overall listening speech quality, from narrowband (NB) up
to superwideband (SWB) telecommunication scenarios, as perceived by the user
in an ITU-T P.800 or ITU-T P.830 Absolute Category Rating (ACR) listening-only
test. POLQA supports two operational modes, one for narrowband and one
for superwideband. Because of an internal frequency limitation to 14 kHz, the
POLQA method in superwideband mode (P.863 SWB) cannot differentiate
between clean, unprocessed 14 kHz SWB and 20 kHz FB test signals.
Nevertheless, we used the POLQA prediction tool for our quality survey: our
motivation is to check to what extent the POLQA method in SWB mode is suitable
for evaluating FB music tests.
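The consequence of a 14 kHz analysis limit can be illustrated with a toy experiment. The sketch below is not the POLQA perceptual model; it only shows that any measure restricted to the band below 14 kHz sees the same energy in an SWB signal and in an FB signal that differs solely above 14 kHz. The tone frequencies, frame length and function names are assumptions chosen for illustration.

```python
import cmath
import math

FS = 48_000            # sample rate in Hz
N = 480                # 10 ms frame, so DFT bins are 100 Hz apart
SWB_LIMIT_HZ = 14_000  # analysis limit discussed in the text

def tones(freqs_hz):
    """Sum of unit-amplitude sine tones, N samples at FS."""
    return [sum(math.sin(2 * math.pi * f * n / FS) for f in freqs_hz)
            for n in range(N)]

def band_energy(signal, limit_hz):
    """Energy of the one-sided DFT bins from 0 Hz up to limit_hz."""
    top_bin = int(limit_hz * N / FS)
    energy = 0.0
    for k in range(top_bin + 1):
        x_k = sum(s * cmath.exp(-2j * math.pi * k * n / N)
                  for n, s in enumerate(signal))
        energy += abs(x_k) ** 2
    return energy

swb_signal = tones([1_000, 12_000])          # content only below 14 kHz
fb_signal = tones([1_000, 12_000, 18_000])   # same, plus an 18 kHz component

if __name__ == "__main__":
    print(band_energy(swb_signal, SWB_LIMIT_HZ), band_energy(fb_signal, SWB_LIMIT_HZ))
    print(band_energy(swb_signal, 20_000), band_energy(fb_signal, 20_000))
```

Only the analysis up to 20 kHz separates the two signals; the band-limited view treats them as identical.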
Currently, three studies by other institutions or authors compare Opus and
EVS under various conditions. The first investigation is by Anssi Rämö and
Henri Toukomaa (both from Nokia Networks) [8]. It is a listening test using a
discrete nine-point MOS scale at bit rates from 4.7 kbit/s up to 128 kbit/s
with clean speech and mixed content. The second test, provided by ITU-T Study
Group 12, is a P.800 ACR-based listening test to evaluate the prediction
performance of the POLQA assessment method [9]. Therein, the EVS codec was
tested, focusing on bit rates between 7.2 kbit/s and 24.4 kbit/s. Based on the
MOS scores, the prediction performance of the POLQA superwideband mode
(rev 2.4) was validated. The last known study is provided by 3GPP itself [10].
It is an ITU-T P.800 listening test using the five-point MOS scale. The tests
were conducted under laboratory conditions using all frequency bandwidths
(NB to FB) with bit rates between 4.7 kbit/s and 24.4 kbit/s. This quality
test characterizes the performance of the EVS codec.
After a careful review of the literature, we found no direct comparisons
between Opus and EVS in FB mode at bit rates higher than 32 kbit/s using a
five-point MOS scale. This question is therefore posed and answered for the
first time in this paper. Furthermore, exclusively FB audio/music codec
testing with the POLQA assessment is new.
3 Test design
3.1 Fullband Opus and EVS testing concept
The two codecs under examination require different minimum bitrates
for the FB operating mode. While this minimum bitrate is around 20 kbit/s
for Opus (VBR mode), it is 16.4 kbit/s for EVS. Furthermore, both
codecs were tested at bitrates of 32 kbit/s and 64 kbit/s.
As EVS does not support stereo operation, we tested both codecs
solely in mono operation mode. We do not expect better quality at bitrates
higher than 64 kbit/s because the codecs reach saturation there (in mono
mode), as demonstrated in [8].
As Sound Quality Assessment Material (SQAM), we selected recorded samples
for subjective tests provided by the European Broadcasting Union (EBU) [11].
These lossless EBU SQAM sound samples are available free of charge for
research and development use. To achieve a well-balanced variety of music
types, we picked 6 sound samples from this database (2 singing voice,
2 musical instrument and 2 mixed music samples).
3.2 Instrumental POLQA method and perceptual test
In order to validate the POLQA (ITU-T P.863) prediction method, an ACR
listening-only test was conducted.
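Validating an instrumental prediction against a listening test usually comes down to comparing two lists of scores per test condition. The sketch below computes the Pearson correlation and the root-mean-square error (RMSE) between predicted and subjective scores. These two metrics are a common choice and an assumption here, not necessarily the evaluation used in this study, and the numbers are hypothetical placeholders rather than results from this paper.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)

def rmse(x, y):
    """Root-mean-square error between two equally long score lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Hypothetical placeholder scores per condition (NOT data from this paper).
predicted_mos = [3.1, 3.9, 4.4, 4.6]   # instrumental prediction
listening_mos = [2.2, 3.0, 3.6, 3.7]   # subjective listening test

if __name__ == "__main__":
    print(round(pearson(predicted_mos, listening_mos), 3))
    print(round(rmse(predicted_mos, listening_mos), 3))
```

The placeholder values are chosen to show that a high correlation and a large absolute offset can occur together, which is why both metrics are worth reporting.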
Figure 1 shows the procedure for determining the audio quality, expressed
as MOS ACR for the listening test and as MOS Listening Quality Objective (LQO)
for the POLQA method.
Fig. 1. Experimental setup including the POLQA method.
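For the listening branch, the MOS of one stimulus is the arithmetic mean of the individual ACR ratings. A minimal sketch follows, assuming the five-point ACR listening-quality scale of ITU-T P.800 and a normal-approximation 95% confidence interval. The interval formula and the function name are assumptions for illustration; for small panels a Student-t quantile would be more appropriate.

```python
import math
import statistics

# Five-point ACR listening-quality scale (ITU-T P.800).
ACR_SCALE = {5: "excellent", 4: "good", 3: "fair", 2: "poor", 1: "bad"}

def mos_with_ci(ratings, z=1.96):
    """Return (MOS, half-width of the approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r not in ACR_SCALE for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    mean = statistics.fmean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width

if __name__ == "__main__":
    # Hypothetical ratings of one stimulus by a small panel (illustration only).
    mos, ci = mos_with_ci([4, 3, 4, 5, 3, 4, 4])
    print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```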
For the instrumental assessment test, we utilized the SQuadAnalyzer (version
2.4.0.4) software, which supports the POLQA algorithm in SWB operating mode.
The selected stereo sound samples of the SQAM database were preprocessed
into mono reference samples for our test setup.
To find an appropriate listening test environment, we conducted two
test variants under different acoustic room conditions. To minimize
background noise, the first listening test variant took place in a
sound-insulating cabinet according to the P.800 requirements. The second test
variant was performed in a regular lecture room. It turned out that the
differences in the results of the cabinet variant were 0.2 MOS points compared
to the lecture room variant. We consider these minor differences
insignificant. Summarizing the results of both variants yields the following
overall test scenario:
21 students as naive listeners,
20 to 28 years of age,
frequent music listeners,¹
6 different SQAM samples (violin, glockenspiel, quartet, soprano, two pop music examples),
42 audio samples in total (each SQAM sample at three different bitrates for Opus as well as for EVS, plus the FB reference signal),
three training sequences,
five-point ACR MOS scale.
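The total of 42 audio samples follows from the test design: each of the 6 SQAM samples appears at three bitrates per codec plus once as the uncoded reference, i.e. 6 × (3 × 2 + 1) = 42. The enumeration below sketches this. The two pop sample labels are placeholders, and 20 kbit/s stands for the "around 20 kbit/s" minimum FB bitrate stated for Opus in section 3.1.

```python
# Sample labels from the scenario list; "pop 1" and "pop 2" are placeholder names.
SAMPLES = ["violin", "glockenspiel", "quartet", "soprano", "pop 1", "pop 2"]

# Tested bitrates in kbit/s per codec (section 3.1).
BITRATES_KBPS = {
    "Opus": [20.0, 32.0, 64.0],
    "EVS": [16.4, 32.0, 64.0],
}

def build_stimuli():
    """List every (sample, condition, bitrate) combination of the listening test."""
    stimuli = []
    for sample in SAMPLES:
        stimuli.append((sample, "reference", None))  # uncoded FB reference
        for codec, rates in BITRATES_KBPS.items():
            stimuli.extend((sample, codec, rate) for rate in rates)
    return stimuli

if __name__ == "__main__":
    print(len(build_stimuli()))
```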
4 Results and discussion
4.1 Effect of the codec bitrate
Figure 2 compares POLQA measures and listening test results for both
codecs, Opus and EVS, summarized over all test samples as a function of
bitrate. As a reference, the mean assessments of the uncoded samples at
768 kbit/s (PCM 48 kHz, 16 bit) are given. The quality degradation between the
reference and the 64 kbit/s version is obviously low (Opus = / EVS = ..).
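The reference bitrate of 768 kbit/s follows directly from the PCM parameters: 48 000 samples/s × 16 bit × 1 channel. The short sketch below reproduces this arithmetic and derives the compression ratio at each tested bitrate; the function names are illustrative.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> float:
    """Bitrate of uncompressed linear PCM in kbit/s."""
    return sample_rate_hz * bits_per_sample * channels / 1000

# Mono reference as used in the test setup (PCM 48 kHz, 16 bit).
REFERENCE_KBPS = pcm_bitrate_kbps(48_000, 16)

def compression_ratio(coded_kbps: float) -> float:
    """How many times smaller the coded stream is than the PCM reference."""
    return REFERENCE_KBPS / coded_kbps

if __name__ == "__main__":
    for kbps in (16.4, 20.0, 32.0, 64.0):
        print(f"{kbps:5.1f} kbit/s -> ratio {compression_ratio(kbps):.1f}:1")
```

Even the highest tested bitrate of 64 kbit/s thus corresponds to a 12:1 reduction relative to the reference.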
4.2 Influence of music characteristics
Figure 3 illustrates the influence of the music characteristics of different
music pieces on the Opus codec.
In contrast to the Opus results (cf. Figure 3), Figure 4 summarizes the
EVS codec performance on the different music pieces.
¹ Listening to music several hours a day, using different audio player
techniques (HD stereo headsets, high quality sound systems).
Fig. 2. Overall results including POLQA and listening test.
Fig. 3. Music coding by Opus – listening test results.
5 Conclusion
Will follow soon.
Fig. 4. Music coding by EVS – listening test results.
Acknowledgment
We would like to thank SwissQual AG (a Rohde & Schwarz company) for
supplying the POLQA tool SQuadAnalyzer, and in particular Jens Berger for
the detailed discussions. Further acknowledgments go to André Schuster for
supporting the experiments in the sound-insulating cabinet of HfT Leipzig and
to all student volunteers in the listening tests.
References
1. J. Valin, K. Vos, and T. Terriberry, “Definition of the Opus Audio Codec,” RFC
6716 (Proposed Standard), Internet Engineering Task Force, Sep. 2012. [Online].
Available: http://www.ietf.org/rfc/rfc6716.txt
2. 3GPP, “EVS Codec General Overview,” 3rd Generation Partnership Project
(3GPP), TS 26.441 v12.1.0, Dec. 2014. [Online]. Available:
http://www.3gpp.org/DynaReport/26441.htm
3. Google Inc. (2014, Sep.) WebRTC. [Online]. Available: http://www.webrtc.org/
4. ITU-T, “Methods for objective and subjective assessment of speech quality
(POLQA): Perceptual Objective Listening Quality Assessment,” International
Telecommunication Union (Telecommunication Standardization Sector), REC P.863,
Sep. 2014. [Online]. Available: http://www.itu.int/rec/T-REC-P.863-201409-I/en
5. Fraunhofer IIS, “The AAC-ELD Family For High Quality Communication
Services,” Fraunhofer IIS, Technical Paper, Dec. 2015. [Online]. Available:
http://www.iis.fraunhofer.de/content/dam/iis/de/doc/ame/wp/FraunhoferIIS_Technical-Paper_AAC-ELD-family.pdf
6. ITU-T, “Low-complexity, full-band audio coding for high-quality, conversational
applications,” International Telecommunication Union (Telecommunication
Standardization Sector), REC G.719, Jun. 2008. [Online]. Available:
http://www.itu.int/rec/T-REC-G.719-200806-I/en
7. C. Hoene, J. Valin, K. Vos, and J. Skoglund, “Summary of OPUS listening test
results draft-ietf-codec-results-03,” Internet-Draft, Internet Engineering Task Force,
Jan. 2014. [Online]. Available: http://tools.ietf.org/html/draft-ietf-codec-results-03
[retrieved: April 2015].
8. A. Rämö and H. Toukomaa, “Subjective Quality Evaluation of the 3GPP EVS
codec,” IEEE International Conference on Acoustics, Speech and Signal Processing,
Brisbane, Australia, April 2015.
9. ITU-T. (2016, Jan.) P.Imp863: Implementer’s Guide on assessment of
EVS coded speech with Recommendation ITU-T P.863. [Online]. Available:
http://www.itu.int/rec/T-REC-P.Imp863-201601-I!Oth1/en
10. ETSI, “Universal Mobile Telecommunications System (UMTS); LTE; Codec
for Enhanced Voice Services (EVS); Performance characterization,” European
Telecommunications Standards Institute (ETSI), TR 126 952 v13.0.0, Jan. 2016.
[Online]. Available:
http://www.etsi.org/deliver/etsi_tr/126900_126999/126952/13.00.00_60/tr_126952v130000p.pdf
11. European Broadcasting Union. (2008, Oct.) Sound Quality Assessment Material
recordings for subjective tests. [Online]. Available:
https://tech.ebu.ch/publications/sqamcd
... Additionally, audio evaluation by the authors in [12] assess the performance of codecs Opus and Enhanced Voice Services (EVS), which were tested through vocal, instrumental, and mixed music signals using POLQA [11]. The effect of lower bitrates (16.4 kbit/s and 20 kbit/s respectively for EVS and OPUS), indicate high degradation. ...
... The authors of [11] also focus on the audio and speech quality assessment of the Opus codec within the real-time communication mode. As in the work by the authors of [12], digital signals are assessed through POLQA, and additionally, through the non-standardized Audio Quality Analyzer (AQuA) from Sevana company. The authors state that the prediction of the instrumental measures should closely correspond to quality scores from a human listening test, considered a subjective test. ...
... However, there is discrepancy as to which methods of audio evaluation are superior [1], [12]. The ITU-R BS.1534-3 Multi Stimulus test with Hidden Reference and Anchor (MUSHRA) has been designed as a subjective measurement of intermediate quality level of audio systems and has been used to evaluate technology such as headsets, speech codecs for telecommunication, and audio codecs [19]. ...
Article
Full-text available
Music Real-time Communication applications (M-RTC) enable music making (musiking) for musicians simultaneously across geographic distance. When used for musiking, M-RTC such as Zoom and JackTrip, require satisfactorily received acoustical perception of the transmitted music to the end user; however, degradation of audio can be a deterrent to using M-RTC for the musician. Specific to the audio quality of M-RTC, we evaluate the quality of the audio, or the Quality of Experience (QoE), of five network music conferencing applications through quantitative perceptual analysis to determine if the results are commensurate with data analysis. The ITU-R BS.1534-3 Multi Stimulus test with Hidden Reference and Anchor (MUSHRA) analysis is used to evaluate the perceived audio quality of the transmitted audio files in our study and to detect differences between the transmitted audio files and the hidden reference file. A comparison of the signal-to-noise ratio (SNR) and total harmonic distortion (THD) analysis to the MUSHRA analysis shows that the objective metrics may indicate that SNR and THD are factors in perceptual evaluation and may play a role in perceived audio quality; however, the SNR and THD scores do not directly correspond to the MUSHRA analysis and do not adequately represent the preferences of the individual listener. Since the benefits of improved M-RTC continue to be face-to-face communication, face-to-face musiking, reduction in travel costs, and depletion of travel time, further testing with statistical analysis of a larger sample size can provide the additional statistical power necessary to make conclusions to that end.
... POLQA was standardized by the ITU-T in 2014 with recommendation P.863 [34]. The perceptual model of POLQA is able to assess speech of up to Super-Wideband (SWB), although recent studies show that POLQA can even be utilized for the assessment of FB music [12]. POLQA outputs its results as MOS-Listening Quality Objective -Super-Wideband (LQO sw ). ...
... .1: Audio bandwidths and corresponding quality levels[12] ...
... POLQA additionally offers a NB mode, but the SWB mode can assess frequency content from NB up to SWB and under certain limitations, the scores of FB audio assessed by POLQA in SWB mode, can also be considered as valid[12]. Therefore, the SWB mode is preferred.3.6 WebRTC performance parameters ...
Thesis
Full-text available
In general, the prevalence of Internet communication services based on WebRTC is rising significantly. Their success relies on usability and the provided quality, which is dependent on several network parameters such as bitrate, packet loss, delay and jitter. On top of network related parameters, WebRTC offers its own set of variables which may influence the call quality. This thesis will focus on identifying the WebRTC performance parameters of calls through the example of the \immmr application. Additionally, the metrics that affect these parameters will be evaluated. Furthermore, an audio monitoring solution will be concepted and realized. The implementation of a call quality assessment and the detection of the parameters with most impact to WebRTC call quality is of great value for users and application developers.
... Many studies have already addressed the question of how this audio compression affects the acoustic speech signal and its reception. However, the vast majority of these studies focus on questions of speech intelligibility or speaker identification [24,25]. The speakers' perceived traits or impact on the audience have rarely been dealt with so far [26,27,28]. ...
... Speech compression for mobile communication is mostly used to reduce the bandwidth for transmission, the transmission delay as well as required system memory and storage [25,33]. A numerous variety of different compression techniques for different applications is available, so that a selection of certain codecs necessary for an initial investigation is needed. ...
Conference Paper
Full-text available
Emotions are an integral part of a speaker's charismatic impact. Previous studies took started from this impact examined the associated emotional features on the part of the speaker and the recipient. We start here from the emotions themselves and test with a view to, e.g., everyday business communication and based on isolated, enacted stimulus sentences, which emotions make speakers sound more or less charismatic and how this interacts with speaker gender and speech compression. The results of a perception experiment with 21 listeners show that high-arousal emotions make speakers sound more charismatic than low-arousal emotions. Moreover, some compression codes, including the popular MP3 codec, perform surprisingly poorly at differentiating emotions in terms of perceived speaker charisma, particularly in combination with female speakers' voices.
... M. Maruschke, O. Jokisch, M. Meszaros, F. Trojahn, and M. Hoffmann have evaluated the quality performance of EVS audio coder for violin musical instrument and pop music at different output bitrates. They found out that EVS introduces good quality for input musical signals [8]. Guillaume Fuchs and et al have evaluated the quality performance of the EVS audio coder, the coder operates at WB and SWB, this test was done for musical and mixed content input audio signals at different output bitrates. ...
... EVS was evaluated only for orchestral violin musical instrument as described in [8]. The evaluation results were compared with this research results. ...
Article
Full-text available
A new 3GPP audio coder, this coder called Enhanced Voice Services (EVS), which is a high definition full band audio coder (HD), that provides new features for improving the real-time audio communication systems. The ordinary speech coders operate in the time domain. To improve the quality of the audio encoders, these encoders operate in the frequency domain. EVS introduces a new novel switching between speech and audio coder without adding any artificial errors and with negligible delay. EVS produces highly encoding quality for speech, music, and mixed content. The EVS audio quality has not been evaluated yet for oriental and orchestral musical instruments. This paper evaluates the EVS audio coder performance for oriental and orchestral musical instruments. Extensive testing done over 10 musical instruments and more than 80 audio recorded signals at different output bitrates. The evaluation was done by both subjective (i.e. P.800 MOS) with 22 naïve listeners and objective evaluation technique (Perceptual Evaluation of Audio Quality PEAQ algorithm). Both of MOS and PEAQ results shown that the average quality of the EVS audio coder for orchestral musical instruments is slightly better than the quality for oriental musical instruments at different output bitrates.
... A large variety of different compression techniques for different speech applications is available, aiming to optimize the processing in different aspects, i.e. transmission delay, required memory, and computation complexity [26], [27]. Therefore, usually a selection of certain codecs highlighting specific aspects is necessary. ...
Conference Paper
Far-field speech recognition gained a lot of attention in the last years. In particular, the appearance of commercial voice assistants has taken research to a new level in case of recognition, understanding and applications. This technology has become one of the mainstay products, with well-known examples like ALEXA, Siri, or Cortana from Amazon, Apple, or Microsoft. One of the main reasons behind this, is the given naturalness of speaking as a form of communication, in contrast to the use of additional external periphery. In this sense, the usage of a voice-controlled mobile phone does not differ substantially from the voice control of a smart home application. Besides making the operation of a technical system as simple as possible, voice assistants should also lead to a natural interaction. In this contribution, a natural interaction is characterised by understanding the full scale of human expressions and as a consequence of this the engagement of people allowing them to interact seamlessly with each other and the technical environment. Actual voice assistants are very good in recognising and understanding the speech content in sense of a factual information, but lack in the understanding of further aspects of the message (e.g. appeal, relationship). In many cases the factual information is just one perspective and not always the most important one and a lack in the understanding of further perspectives (e.g. affect, charisma) lead to many misinterpretations. During the last decades the research community is actively developing techniques that allows to recognise and detect affective information of the speaker and include them into the decision process of technical assistants. Unfortunately, as these investigations are mostly data-driven and based on AI-modelling, the impact of different acoustic environments or transmission channels are often neglected. 
The contribution will therefore explore the influence of different channel coding and different room- acoustics on the feature representation and recognition performance of affective states or charismatic information.
... In today's (mobile) communication systems, speech compression is heavily used, as it reduces the bandwidth for transmission, the transmission delay as well as required system memory and storage [19,20]. A number of studies investigated the impact of compression on spectral quality and acoustic features [21,22]. ...
Chapter
Full-text available
Previous research by the authors showed that signal compression codecs used in remote meetings and mobile communications have a substantial negative effect on perceived speaker charisma. Moreover, this effect size varied as a function of speaker gender. Following up from this previous study, we conducted a multipara-metric acoustic analysis of a set of sentences elicited from male and female speakers in order to detail the effect of speech-signal compression on charisma-related acoustic-prosodic feature settings. Results show that all compression algorithms caused significant acoustic changes compared to the baseline condition. Almost all of them go in an unfavorable direction concerning speaker charisma. The six compression methods also performed differently well. While OPUS and MP3 caused the fewest negative effects, SPEEX and AMRNB resulted in the most negative effects; GSMFR took a middle position. Moreover, evidence is found for gender-specific effects in terms of both the number of negatively affected acoustic features and their type. The results are discussed with respect to their conceptual implications of perceived speaker charisma and the further development of codecs.
... For data transmission, compression is heavily used within modern (mobile/remote) systems. Compression allows reducing the transmission bandwidth while retaining the speech intelligibility (ITU-T, 1996;ITU-T, 2014;Maruschke et al., 2016). Several codecs have been developed to meet various applications with different quality requirements (Siegert et al., 2016a). ...
Article
Full-text available
Remote meetings via Zoom, Skype, or Teams limit the range and richness of nonverbal communication signals. Not just because of the typically sub-optimal light, posture, and gaze conditions, but also because of the reduced speaker visibility. Consequently, the speaker's voice becomes immensely important, especially when it comes to being persuasive and conveying charismatic attributes. However, to offer a reliable service and limit the transmission bandwidth, remote meeting tools heavily rely on signal compression. It has never been analyzed how this compression affects a speaker's persuasive and overall charismatic impact. Our study addresses this gap for the audio signal. A perception experiment was carried out in which listeners rated short stimulus utterances with systematically varied compression rates and techniques. The scalar ratings concerned a set of charismatic speaker attributes. Results show that the applied audio compression significantly influences the assessment of a speaker's charismatic impact and that, particularly female speakers seem to be systematically disadvantaged by audio compression rates and techniques. Their charismatic impact decreases over a larger range of different codecs; and this decrease is additionally also more strongly pronounced than for male speakers. We discuss these findings with respect to two possible explanations. The first explanation is signal-based: audio compression codecs could be generally optimized for male speech and, thus, degrade female speech more (particularly in terms of charisma-associated features). Alternatively, the explanation is in the ears of the listeners who are less forgiving of signal degradation when rating female speakers' charisma.
... For music transmission, the compression is optimized to remove certain parts of the original sound signal that go beyond the auditory resolution ability [24], [13], [15]. Voice quality and intelligibility are of interest for voice data transmission [25]. Losing information while compressing audio without drastically changing it is an essential aspect of the audio augmentation method at the data level. ...
Conference Paper
To train end-to-end automatic speech recognition models, it requires a large amount of labeled speech data. This goal is challenging for languages with fewer resources. In contrast to the commonly used feature level data augmentation, we propose to expand the training set by using different audio codecs at the data level. The augmentation method consists of using different audio codecs with changed bit rate, sampling rate, and bit depth. The change reassures variation in the input data without drastically affecting the audio quality. Besides, we can ensure that humans still perceive the audio, and any feature extraction is possible later. To demonstrate the general applicability of the proposed augmentation technique, we evaluated it in an end-to-end automatic speech recognition architecture in four languages. After applying the method, on the Amharic, Dutch, Slovenian, and Turkish datasets, we achieved a 1.57 average improvement in the character error rates (CER) without integrating language models. The result is comparable to the baseline result, showing CER improvement of 2.78, 1.25, 1.21, and 1.05 for each language. On the Amharic dataset, we reached a syllable error rate reduction of 6.12 compared to the baseline result.
Article
Introduction: Calls via video apps, mobile phones and similar digital channels are a rapidly growing form of speech communication. Such calls are not only, and perhaps less and less, about exchanging content, but about creating, maintaining, and expanding social and business networks. In the phonetic code of speech, these social and emotional signals are considerably shaped by (or encoded in) prosody. However, according to previous studies, it is precisely this prosody that is significantly distorted by modern compression codecs. As a result, the identification of emotions becomes blurred and can even be lost to the extent that opposing emotions like joy and anger or disgust and sadness are no longer differentiated on the recipients' side. The present study searches for the acoustic origins of these perceptual findings. Method: A set of 108 sentences from the Berlin Database of Emotional Speech served as speech material in our study. The sentences were realized by professional actors (2m, 2f) with seven different emotions (neutral, fear, disgust, joy, boredom, anger, sadness) and acoustically analyzed in the original uncompressed (WAV) version as well as in strongly compressed versions based on the four popular codecs AMR-WB, MP3, OPUS, and SPEEX. The analysis included 6 tonal (i.e., f0-related) and 7 non-tonal prosodic parameters (e.g., formants as well as acoustic-energy and spectral-slope estimates). Results: Results show significant, codec-specific distortion effects on all 13 prosodic parameter measurements compared to the WAV reference condition. Mean values of the automatic measurements can, across sentences, deviate by up to 20% from the values of the WAV reference condition. Moreover, the effects go in opposite directions for tonal and non-tonal parameters. While tonal parameters are distorted by speech compression such that the acoustic differences between emotions are increased, compressing non-tonal parameters makes the acoustic-prosodic profiles of emotions more similar to each other, particularly under MP3 and SPEEX compression. Discussion: The term “flat affect” comes from the medical field and describes a person's inability to express or display emotions. So, does strong compression of emotional speech create a “digital flat affect”? The answer to this question is a conditional “yes”. We provided clear evidence for a “digital flat affect”. However, it seems less strongly pronounced in the present acoustic measurements than in previous perception data, and it manifests itself more strongly in non-tonal than in tonal parameters. We discuss the practical implications of our findings for the everyday use of digital communication devices and critically reflect on the generalizability of our findings, also with respect to their origins in the codecs' inner mechanics.
Conference Paper
Highly efficient technologies and algorithms, including speech and audio coding, play an important role in human-machine interaction, human communication and user interface design. This contribution summarizes selected results from our performance studies on speech and audio codecs, mainly within the internet-oriented WebRTC scenario, in comparison to our tests with codecs designed for cellular phone networks. Furthermore, some implications of transcoding are surveyed. Finally, we address the research potential with regard to both the Opus codec and the Enhanced Voice Services (EVS) codec.
Conference Paper
The Opus codec is used in several application fields of speech and audio communication. This article describes the instrumental quality assessment of Opus-coded speech in web browser-based real-time communication using the POLQA and AQuA methods. Furthermore, we tested Opus with mixed vocal and music signals and also performed a perceptual test. WebRTC framework-coded speech achieves a similar MOS assessment compared to standalone coding. The observed degradations depend on signal bandwidth, on variations in speech (e.g., by emotions) or in music (vocal vs. instrumental) and on the assessment method.
Conference Paper
This paper discusses the voice and audio quality characteristics of EVS, the recently standardized 3GPP codec. EVS was compared to Opus, the IETF-driven open-source codec, to the industry-standard voice codecs 3GPP AMR and AMR-WB and ITU-T G.718B, G.722.1C and G.719, and to direct signals at varying bandwidths. Voice and audio quality was evaluated with three subjective listening tests containing clean and noisy speech as well as a mixed-condition test containing both speech and music intermixed. A subjective mean opinion score on a nine-point scale was calculated for all tested conditions.
Conference Paper
The Internet Engineering Task Force (IETF), the open Internet standards-development body, considers the Opus codec a highly versatile audio codec for interactive voice and music transmission. In this review we survey the dynamic functioning of the Opus codec within a Web Real-Time Communication (WebRTC) framework based on the Google Chrome browser. The codec behavior and the effectively utilized features during the active communication process are tested and analyzed under various testing conditions. In the experiments, we verify the Opus performance and interactivity. Relevant codec parameters can easily be adapted in application development. In addition, WebRTC framework-coded speech achieves a similar MOS assessment compared to stand-alone Opus coding.
J. Valin, K. Vos, and T. Terriberry, "Definition of the Opus Audio Codec," RFC 6716 (Proposed Standard), Internet Engineering Task Force, Sep. 2012. [Online].
Fraunhofer IIS, "The AAC-ELD Family For High Quality Communication Services," Fraunhofer IIS, Technical Paper, Dec. 2015. [Online]. Available: http://www.iis.fraunhofer.de/content/dam/iis/de/doc/ame/wp/Fraunhof erIIS Technical-Paper AAC-ELD-family.pdf
ITU-T. (2016, Jan.) P.Imp863: Implementer's Guide on assessment of EVS coded speech with Recommendation ITU-T P.863. [Online]. Available: http://www.itu.int/rec/T-REC-P.Imp863-201601-I!Oth1/en
ETSI, "Universal Mobile Telecommunications System (UMTS); LTE; Codec for Enhanced Voice Services (EVS); Performance characterization," European Telecommunications Standards Institute (ETSI), TR 126 952 v13.0.0, Jan. 2016. [Online]. Available: http://www.etsi.org/deliver/etsi tr/126900 126999/126952/13. 00.00 60/tr 126952v130000p.pdf
C. Hoene, J. Valin, K. Vos, and J. Skoglund, "Summary of OPUS listening test results draft-ietf-codec-results-03," Internet-Draft, Internet Engineering Task Force, Jan. 2014. [Online]. Available: http://tools.ietf.org/html/draft-ietf-codec-results-03 [retrieved: Apr. 2015].
European Broadcasting Union. (2008, Oct.) Sound Quality Assessment Material recordings for subjective tests. [Online]. Available: https://tech.ebu.ch/publications/sqamcd