Can we still use PEAQ? A Performance Analysis of
the ITU Standard for the Objective Assessment of
Perceived Audio Quality
Pablo M. Delgado
International Audio Laboratories Erlangen
Erlangen, Germany
pablo.delgado@audiolabs-erlangen.de
Jürgen Herre
International Audio Laboratories Erlangen
Erlangen, Germany
juergen.herre@audiolabs-erlangen.de

(The International Audio Laboratories Erlangen are a joint institution between the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer IIS, Germany.)
Abstract—The Perceptual Evaluation of Audio Quality (PEAQ)
method as described in the International Telecommunication
Union (ITU) recommendation ITU-R BS.1387 has been widely
used for computationally estimating the quality of perceptually
coded audio signals without the need for extensive subjective
listening tests. However, many reports have highlighted clear
limitations of the scheme after the end of its standardization,
particularly involving signals coded with newer technologies such
as bandwidth extension or parametric multi-channel coding. Until
now, no other method for measuring the quality of both speech
and audio signals has been standardized by the ITU. There-
fore, a further investigation of the causes for these limitations
would be beneficial to a possible update of said scheme. Our
experimental results indicate that the performance of PEAQ’s
model of disturbance loudness is still as good as (and sometimes
superior to) other state-of-the-art objective measures, albeit with
varying performance depending on the type of degraded signal
content (i.e. speech or music). This finding evidences the need
for an improved cognitive model. In addition, results indicate
that an updated mapping of Model Output Values (MOVs) to
PEAQ’s Distortion Index (DI) based on newer training data can
greatly improve performance. Finally, some suggestions for the
improvement of PEAQ are provided based on the reported results
and comparison to other systems.
Index Terms—PEAQ, ViSQOL, PEMO-Q, objective quality
assessment, audio quality, speech quality, auditory model, audio
coding.
I. INTRODUCTION
Efforts initiated in 1994 by the ITU-R to identify and
recommend a method for the objective measurement of per-
ceived audio quality culminated in 2001 with recommenda-
tion BS.1387 [1], most commonly known as the Perceptual
Evaluation of Audio Quality (PEAQ) method. This method
is based on generally accepted psychoacoustic principles and
has been successfully adopted by the perceptual audio codec development community and the broadcasting industry [2].
Like numerous other objective audio quality assessment
systems, PEAQ takes as inputs a possibly degraded signal
(e.g. by bit rate reduction strategies in audio coding) and an
unprocessed reference and then compares them in a perceptual
domain (i.e. the internal representations of said signals [4]), aided by a peripheral ear model and feature extraction stage.
Fig. 1. High-level representation of the PEAQ method [3]: the reference and processed signals pass through a peripheral model and a feature extraction stage that produces MOVs; a regression stage trained on subjective data maps the MOVs to a quality grade (DI or ODG).
The method then calculates one or more objective degradation
indices related to some aspect of perceived quality degradation
(Fig. 1). In PEAQ, these unidimensional indices (i.e. features)
are termed Model Output Values (MOVs) [1]. The features are
mapped to a perceived overall quality scale using regression
models trained with subjective data. In the case of PEAQ,
several MOVs are combined and mapped to an objective
quality scale termed Objective Difference Grade (ODG) using
an Artificial Neural Network (ANN). In addition to the ODG
output, a Distortion Index (DI) is derived by removing the
output layer of the ANN ODG mapping [3] to provide a scale-
independent condensed degradation index. Both the MOV
combination to a DI and the mapping to an ODG scale are
carried out using training data from subjective experiments
with Subjective Difference Grade (SDG) scores as defined in
recommendation ITU-R BS.1116 [5].
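To make this structure concrete, the following Python sketch (not part of the original paper) shows how a vector of MOVs could be passed through a one-hidden-layer network whose pre-output value acts as a scale-independent DI before a final squashing onto an ODG-like scale. All weights, layer sizes and scaling constants are hypothetical placeholders rather than the values standardized in [1].

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def map_movs_to_di_odg(movs, w_hid, b_hid, w_out, b_out,
                       odg_min=-4.0, odg_max=0.2):
    # Hidden layer combines the (normalized) MOVs.
    hidden = sigmoid(w_hid @ movs + b_hid)
    # The pre-output value acts as a scale-independent distortion index (DI).
    di = w_out @ hidden + b_out
    # A final squashing maps the DI onto a bounded ODG-like scale
    # (odg_min/odg_max are illustrative placeholders).
    odg = odg_min + (odg_max - odg_min) * sigmoid(di)
    return di, odg

# Example with made-up weights for five MOVs and five hidden nodes.
rng = np.random.default_rng(0)
di, odg = map_movs_to_di_odg(rng.random(5),
                             rng.normal(size=(5, 5)), rng.normal(size=5),
                             rng.normal(size=5), 0.0)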
Limitations of objective quality assessment systems can
be related to missing models of psychoacoustic phenomena,
unsuitable parameter tuning or errors in the degradation index-
to-quality mapping. Some of PEAQ's limitations are addressed in PEMO-Q [6] through the introduction of a modulation filter bank and a degradation measure that considers instantaneous values instead of averages. Various PEAQ extensions for better stereo and multi-channel audio prediction were proposed with improved performance (e.g. [7]). However,
these extensions did not address existing problems related to
timbral quality estimation.
Predicting perceived audio quality degradation by perform-
ing comparisons in the (partial) loudness domain [3], [4] has
proven to be a successful approach incorporated in many
standard objective quality measurement systems [1], [8]. The PEAQ MOVs based on the partial loudness domain model are meant to work reliably only within the realm of small disturbances near the masked threshold in the distorted signals. The
Perceptual Objective Listener Quality Assessment (POLQA
[8]) method extended the disturbance loudness model to con-
sider larger disturbances as well.
Regarding the cognitive modeling of distortion perception,
the asymmetric weighting of added and missing time fre-
quency components between degraded and reference internal
representations is also widely used [1], [6], [8]. It is generally
assumed that added disturbances in the degraded signal contribute more strongly to the perceived distortion than missing components. However, there has been only sparse literature analyzing whether this relationship holds equally for all kinds of signals.
Most previous performance comparisons of PEAQ use either
the DI or ODG outputs (e.g. [2], [9]), which are dependent on
the original training data. This fact can be problematic as the
mapping can interfere with the performance evaluation of the
intrinsic perceptual model. Alternatively, the work presented
in [10] showed that separate MOVs of PEAQ have good
prediction ability for different types of distortions, including
those degradations related to newer audio coding technologies
[11] like tonality mismatch or unmasked noise.
II. METHOD
A. PEAQ’s disturbance loudness model evaluation
As a first objective, we want to evaluate the performance of
PEAQ’s partial loudness disturbance model in the context of
the perceived degradations caused by newer audio codecs in a
wide quality range. Secondly, we want to investigate whether
the relationship between the different types of disturbances due
to added or missing time/frequency components is dependent
on signal content type (i.e. speech or music).
For this purpose, we consider the three internal comparisons performed by PEAQ's advanced version in the disturbance loudness domain separately. These three internal comparisons should describe possible degradations due to linear distortions (AvgLinDistA), due to additive disturbances (RmsNoiseLoudnessA) and due to missing components (RmsMissingComponents) [1] in the degraded signal under test in comparison to the reference. From these three comparisons, only two MOVs are fed into the final regression stage: AvgLinDistA and

RmsNoiseLoudAsymA = RmsNoiseLoudnessA + 0.5 · RmsMissingComponents.   (1)
The independent analysis of added versus missing components requires that we analyze all three comparisons separately and not the combined MOV values. For this, we use our own MATLAB implementation of PEAQ [1]¹ and consider RmsNoiseLoudnessA and RmsMissingComponents as two different MOVs instead of RmsNoiseLoudAsymA.
¹Compared for reproducibility against MOVs of the PEAQ OPERA 3.5 distribution with the same settings as [9]. The RMSE values of all MOV differences represent at most 5% of the SD of said MOVs for the used database.
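For clarity, Eq. (1) and the way the two terms are kept apart in this analysis can be written as the following Python sketch (the numerical MOV values are hypothetical):

def rms_noise_loud_asym(rms_noise_loudness_a, rms_missing_components):
    # Eq. (1): added components enter with full weight,
    # missing components with a weight of 0.5.
    return rms_noise_loudness_a + 0.5 * rms_missing_components

# In this analysis the two terms are treated as separate MOVs rather than
# combined (hypothetical values for illustration):
movs = {"AvgLinDistA": 0.7,
        "RmsNoiseLoudnessA": 1.2,
        "RmsMissingComponents": 0.4}
combined = rms_noise_loud_asym(movs["RmsNoiseLoudnessA"],
                               movs["RmsMissingComponents"])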
B. Subjective audio quality database
We use the Unified Speech and Audio Coding Verifica-
tion Test 1 Database (USAC VT1) [12] as ground truth for
performance testing. Compared to the codecs used in [1], the database contains newer-generation audio codecs that employ newer coding technologies [11] causing artifacts known to challenge current objective quality measurement systems.
The USAC VT1 database contains a total of 27 reference
mono audio excerpts (samples) corresponding to different
sample types: 9 music-only samples, 9 speech-only samples
and 9 mixed speech and music samples, which makes it
convenient for separately studying the performance of the
objective measurement systems for speech and music signals.
The database includes a total of 12 treatments of the samples, comprising 3.5 and 7.5 kHz anchors plus three perceptual audio codecs (AMR-WB+, HE-AAC v2 and USAC) at a wide range of bit rates (8 kbps to 24 kbps) as degraded signals (i.e. test items).
The subjective quality assessment test was carried out following the MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA) method [13] using 62 listeners from 13 different test sites with previous training and post-screening. The gender ratio of the listeners was not reported. The subjective scores, ranging from 0 points (bad quality) to 100 points (excellent quality), are pooled into mean quality MUSHRA scores and 95% confidence intervals for each sample/treatment combination over all listeners. The subjective scores averaged 37 points (SD 16.79) for the worst quality codec and 77 points (SD 14.35) for the best-sounding codec, so the database can be considered to cover a wide audio quality range and not just small disturbances. In total, 324 condensed data points result, which are used as training and validation data as explained in the following paragraphs.
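A minimal Python sketch of this pooling step, assuming a t-distribution based two-sided 95% confidence interval and using hypothetical listener scores, is given below:

import numpy as np
from scipy import stats

def pool_mushra_scores(scores):
    # Pool raw MUSHRA scores (0-100) of one sample/treatment condition over
    # listeners into a mean score and a two-sided 95% confidence interval.
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    mean = scores.mean()
    sem = scores.std(ddof=1) / np.sqrt(n)
    half_width = stats.t.ppf(0.975, df=n - 1) * sem
    return mean, half_width

mean_score, ci95 = pool_mushra_scores([62, 70, 55, 66, 73, 58])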
C. MOV to objective quality score mapping
Most of the objective measurement systems use some kind
of regression procedure to map one or more objective quality
degradation measures to quality scores by using subjective test
data [1], [6], [8], [9]. As is the case with PEAQ’s DI, most
systems also provide outputs representing a generalization of
the objective quality without any mapping to a specific quality
scale (i.e. the degradation indices). These indices should map monotonically to quality scores, but should not depend on particular scales or scale anchors. As explained in the Introduction, PEAQ's DI is
scale-independent, but still depends on the original subjective
training data [3].
This work will require the remapping of the degradation
measures (excluding each system’s individual quality scale
outputs from previous training) to the subjective data in order
to abstract the system performance from the regression stage
to focus on the underlying signal processing model.
Our mapping procedure is carried out using a MAT-
LAB implementation of the Multivariate Adaptive Regression
Splines (MARS) model [14], which is a piecewise-linear
and piecewise-cubic regression model. The model performs
a Generalized Cross Validation (GCV) on the data used for building the model, which approximates the leave-one-out cross-validation method. This technique is known to be
robust to overfitting when a limited amount of training data is
available.
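The mapping itself relies on the ARESLab MARS toolbox [14]. As a rough, self-contained illustration of the piecewise-linear part of such a mapping, the Python sketch below fits hinge basis functions max(0, x - k) and max(0, k - x) by ordinary least squares; the knot selection and GCV-based pruning that MARS performs automatically are omitted, and all data are hypothetical.

import numpy as np

def hinge_features(x, knots):
    # Expand a 1-D feature into an intercept plus hinge basis functions.
    cols = [np.ones_like(x)]
    for k in knots:
        cols.append(np.maximum(0.0, x - k))
        cols.append(np.maximum(0.0, k - x))
    return np.column_stack(cols)

def fit_piecewise_linear(x, y, knots):
    basis = hinge_features(x, knots)
    coeffs, *_ = np.linalg.lstsq(basis, y, rcond=None)
    return lambda x_new: hinge_features(x_new, knots) @ coeffs

# Hypothetical example: map a single MOV to MUSHRA-like scores.
rng = np.random.default_rng(1)
mov = rng.uniform(0.0, 3.0, size=100)
mushra = 90.0 - 20.0 * np.minimum(mov, 1.5) + rng.normal(0.0, 5.0, size=100)
predict = fit_piecewise_linear(mov, mushra, knots=[0.5, 1.0, 1.5, 2.0])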
D. Bootstrapping and inference statistics
One question related to the problem of mapping degradation measures to a quality scale is, on the one hand, whether a given mapping can reliably predict audio quality outside of the realm of the available training data. On the other hand, there is the need to evaluate the feature extraction stages (which need to show a strong correlation with some aspect of subjective quality) separately from the mapping/regression stages. Our approach to tackling this problem with the available training data is to use a bootstrap technique based on Monte Carlo simulations. A similar approach
has also made its way into the latest revision of the MUSHRA
recommendation [13] for subjective quality data analysis.
For each realization of the experiment, we divide the data
points into two disjoint sets: the model-building data (training
and cross-validation, 80% of the items) and test data (remain-
ing 20%). The model-building dataset is used to train and
cross-validate a MARS model that maps the different degrada-
tion measures (i.e. MOVs) to a single objective quality score.
The trained model remains unchanged and is in turn evaluated
in its ability to predict the subjective quality scores of the
test dataset from the MOVs calculated for the corresponding
test items. This procedure is repeated N = 2000 times; each realization randomly samples (with replacement) the data points belonging to the training and test datasets, respectively.
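One possible reading of this procedure is sketched below in Python; fit_mapping stands in for the MARS regression of Section II-C, and the exact resampling details may differ from those of the original experiments:

import numpy as np
from scipy import stats

def monte_carlo_performance(movs, subj, fit_mapping,
                            n_runs=2000, test_frac=0.2, seed=0):
    rng = np.random.default_rng(seed)
    n = len(subj)
    n_test = int(round(test_frac * n))
    r_pearson, r_spearman = [], []
    for _ in range(n_runs):
        idx = rng.permutation(n)                 # disjoint 80/20 split
        test, train = idx[:n_test], idx[n_test:]
        boot = rng.choice(train, size=train.size, replace=True)  # resampling
        model = fit_mapping(movs[boot], subj[boot])
        pred = model(movs[test])
        r_pearson.append(stats.pearsonr(pred, subj[test])[0])
        r_spearman.append(stats.spearmanr(pred, subj[test])[0])
    return np.mean(r_pearson), np.mean(r_spearman)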
Resampling the data presents advantages in generalization
when small amounts of data are available. Additionally, prob-
lems with the regularization and normalization of scores from
different experiments/laboratories are less severe as the data
was likely gathered with the same testing procedure and initial
conditions. The bootstrap method allows us, to a certain degree, to focus on the performance of the feature extraction stage while still using real-world data for validation.
By taking mean system performance figures (see Section II-E) over a sufficiently large set of trained regression models, we effectively decouple the influence of the training data from the performance measurement of the feature extraction stage. However, for a general final system performance analysis, as many and as diverse subjective test databases as possible should be used.
E. System performance metrics
System performance metrics are always calculated by assessing how well the system's objective measurement output predicts subjective scores that have not been used for training (i.e. from the test dataset). This work uses Pearson's (Rp) and Spearman's (Rs) correlation between objective and subjective scores and the Absolute Error Score (AES) as measures. The AES can be seen as an extension of the usual RMS error that also considers weighting by the confidence intervals (CI) of the mean subjective scores [1].
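The exact AES formula is specified in [1]; the following Python sketch only mirrors the idea stated above, counting for each item only the part of the deviation that exceeds the 95% CI of the mean subjective score and pooling it as an RMS value. It is a simplified interpretation, not the standardized definition:

import numpy as np

def aes_sketch(objective, subjective_mean, subjective_ci):
    # Error beyond the confidence interval, RMS-pooled over all items.
    excess = np.maximum(np.abs(np.asarray(objective) - np.asarray(subjective_mean))
                        - np.asarray(subjective_ci), 0.0)
    return np.sqrt(np.mean(excess ** 2))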
For the Monte Carlo simulations, point estimates of the mean performance metric values (indicated with an additional m subscript) and their associated 95% CI are provided. Confidence intervals from the Monte Carlo simulations enable us to provide a notion of the significance of the performance margins among systems in a direct way.
III. RESULTS AND DISCUSSION
Table I shows the overall baseline system performance for
the used database. This overall performance was obtained by
comparing each system’s output objective scores against the
totality of the subjective scores of the database without any
bootstrapping or retraining. The objective scores PEAQ ODG
and PEAQ DI represent the output of the advanced version
of our implementation of PEAQ’s Objective Difference Grade
(ODG) and Distortion Index (DI) respectively. PEMO-Q ODG
and PEMO-Q PSMt are the ODG and similarity outputs of
the PEMO-Q implementation found in [15]. ViSQOLA MOS
and ViSQOLA NSIM are the Mean Opinion Score (MOS)
and similarity outputs respectively of the ViSQOL Audio
implementation found in [16]. All the model outputs related
to ODG or MOS contain a built-in regression stage trained
with listening test subjective scores obtained with different test
methodologies. Outputs ViSQOLA NSIM, PEMO-Q PSMt
and PEAQ DI are the degradation indices, of which only
PEAQ DI still depends on built-in training data, as explained
in section II-C.
All objective measures except those of ViSQOL Audio show a weak correlation against subjective scores. Still, a higher Spearman correlation value Rs, which measures rank preservation, indicates that the systems could be used in the context of one-to-one comparisons (e.g. testing effects of new coding tools within the same audio codec).
ViSQOL Audio’s degradation index NSIM is clearly the best
performing feature. However, as PEAQ DI (also a degradation index) is still dependent on the original training data, it might
not represent the true performance of the underlying perceptual
model. We present a performance analysis of different MOVs
without any built-in training in Table II.
TABLE I
BASELINE SYSTEM PERFORMANCE

Objective Measure    Rp     Rs     AES
PEAQ ODG             0.65   0.7    6.88
PEAQ DI              0.65   0.7    2.85
PEMO-Q ODG           0.65   0.71   3.05
PEMO-Q PSMt          0.64   0.71   3.56
ViSQOLA MOS          0.76   0.83   2.84
ViSQOLA NSIM         0.83   0.83   2.47
Table II shows the average system performance measures
obtained by the bootstrap method from Section II-D for
combinations of outputs of the evaluated objective measure-
ment systems. The table is also divided into performance
measures for music-only items, speech-only items and all
items together. Multiple realizations of MOV-to-quality-score
mappings of the shown objective measures were trained and
tested with disjoint sets of data points. Performance measures
are calculated exclusively on the test data for each realization
of the experiment. The number of iterations N has proven to be sufficient for the 95% CIs not to overlap. Therefore, the differences in performance figures support significant effect sizes. Additionally, the rank order in performance of the
objective measures that have been evaluated in Table I and
Table II is preserved, which speaks for the reliability of the
bootstrap method.
According to Table II, ViSQOL Audio achieves a high
and stable performance for all items overall. As the items in
the database do not show complex temporal misalignments,
the more complex (in comparison to PEAQ and PEMO-Q)
alignment algorithm [9] cannot be the cause of its high effec-
tiveness. Instead, a perceptually motivated similarity measure
such as NSIM might be promising as an alternative to a
disturbance loudness difference model for evaluating speech
and music signals in a unified system. ViSQOLA NSIM presents the most balanced (and high) performance for all categories, although not the best in the category “All Samples” for the evaluated data. The PEMO-Q similarity measure PSMt is among the best performing for speech items and among the worst performing for music items. It is also the worst performing degradation index overall for the “All Samples” item category.
The PEAQ DI quality measure still shows a low performance for all the combinations of sample types. However, PEAQ's AvgLinDistA shows the best performance of all the single degradation indices tested for the sample type category “All Samples” and the second best for “Music Only”. For music, RmsNoiseLoudA is the worst performing (except DI), but it is the best performing single degradation index for speech samples. These observations support the hypothesis that PEAQ still has an acceptable perceptual model for the quality estimation task despite the comparatively bad performance of PEAQ DI. Furthermore, the results of Table II show that added components and missing components play different roles in different signal types.
Further data visualization is provided in Figures 2 and 3, showing SDG scores and mean objective scores obtained using the three different disturbance loudness MOVs described in Section II-A, separated into speech-only and music-only items.
Given the subjective database characteristics, the strong MOV-subjective score correlation associated with the partial loudness of disturbances model implies that this model is also suitable for a wider quality range, namely the intermediate quality level as defined in [13], given a proper mapping. The lack of a proper remapping could explain the low discrimination power that PEAQ showed on lower bit rate codecs as reported in [2]. However, a visual examination of Fig. 2 shows that the discrimination power of the additive disturbance loudness model for speech items is, on average, in line with the subjective scores in all quality ranges. The observation that the performance of PEAQ DI is low in all experiments, while that of certain PEAQ MOVs is not, suggests that a different MOV weighting in PEAQ is needed for an improved distortion index. Table II shows that retraining PEAQ with the disturbance loudness-related MOVs greatly improves performance in comparison to PEAQ DI, which confirms that a more suitable mapping is beneficial. Remapping has also been reported to improve the performance of POLQA for music content [2]. Future work might consider including “POLQA Music”, an objective quality system suitable for speech and audio, in an analysis similar to the one presented in this work when it becomes available.
The experimental data does not provide evidence that the
underlying perceptual model of PEAQ would be the weak link
in the processing chain, except maybe for the model of additive
distortions in music items.
Although remapping greatly improves performance, mea-
surements also show some possible limitations of the per-
ceptual model. Considering the strong correlation values, it
is clear that measuring the disturbance loudness of the added
components represents a very meaningful descriptor of quality
degradation in the case of speech signals, but is much less
meaningful in the case of music signals. Besides retraining,
further performance improvement could be achieved with
the addition of a speech/music discriminator as part of the
cognitive model, which in turn would allow the use of a
suitable disturbance loudness analysis mode weighting (added
versus missing components). PEMO-Q also weights added components more strongly than missing components in its PSMt measure [6], which is in accordance with the data in Table II showing a better performance of PSMt for speech-only items than for music-only items.
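As a purely hypothetical sketch of the speech/music-dependent weighting suggested above, a discriminator could switch between different weightings of the added- and missing-component terms; the weights below are placeholders only and would have to be derived from subjective data:

def combined_disturbance(rms_noise_loudness_a, rms_missing_components,
                         content_type):
    # Hypothetical content-dependent weights (added, missing).
    weights = {"speech": (1.0, 0.3), "music": (0.5, 1.0)}
    w_added, w_missing = weights[content_type]
    return w_added * rms_noise_loudness_a + w_missing * rms_missing_components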
Analyzing loudness differences of missing components has
proven a good strategy in the case of music signals, but
not so much for speech. These results suggest that a better
model for measuring the loudness of added components in
music is needed. Recent studies indicate there might be
different cognitive processes involved in the quality evaluation
of speech and audio signals [17]. An extended cognitive model
considering these results in conjunction with the disturbance
loudness model may be a promising update to PEAQ’s current
perceptual model. The role of an extended disturbance loud-
ness model on the overall system performance could then be
analyzed with respect to the contributions of a proper MOV-
to-quality-score mapping.
The AvgLinDistA MOV presents a comparatively high performance in the Monte Carlo simulations in all categories, although most of the signal degradations in the database are not due to linear distortions. By definition, linear distortions do not introduce additional frequency components. Therefore, part of the performance could be explained by the fact that the MOV also performs an indirect analysis of missing components. However, its superior performance in comparison to the actual measurement of missing components hints at some additional benefits (compare the performance of AvgLinDistA and RmsMissingComponents in Fig. 3).
TABLE II
BOOTSTRAPPED MEAN SYSTEM PERFORMANCE (BEST SINGLE-MOV VALUES IN BOLD)ᵃ

                                                           Music Only     Speech Only    All Samples
Objective Measure                                          Rpm    AESm    Rpm    AESm    Rpm    AESm
PEAQ DI                                                    0.55   2.96    0.77   2.6     0.72   2.56
PEAQ AvgLinDistA                                           0.84   1.8     0.81   2.4     0.84   1.99
PEAQ RmsNoiseLoudA                                         0.56   2.9     0.9    1.76    0.8    2.23
PEAQ RmsMissingComponentsᵇ                                 0.76   2.27    0.75   2.72    0.78   2.31
PEMO-Q PSMt                                                0.61   2.79    0.87   1.97    0.67   2.76
ViSQOLA NSIM                                               0.86   1.79    0.86   2.09    0.82   2.09
PEAQ AvgLinDistA + RmsNoiseLoudA + RmsMissingComponents    0.84   1.9     0.9    1.81    0.88   1.76
PEAQ ADVANCED RETRAIN                                      0.9    1.53    0.97   1       0.91   1.54

ᵃ All two-sided 95% confidence intervals in the range of: Rp < ±0.01, AES < ±0.09
ᵇ Derived from RmsNoiseLoudAsymA [1]
Further analysis of Fig. 3 shows a more stable performance of AvgLinDistA than of RmsMissingComponents in the higher and lower quality extremes (smaller errors). The data suggest that a deeper analysis of the inner workings of the AvgLinDistA calculation might shed some light on its performance, especially concerning the level and pattern adaptation stages described in [3].
The best performance in Table II is given by retraining all
five MOVs from PEAQ Advanced. Accordingly, future work
might include a similar analysis as the one presented in this
work for the two remaining MOVs, with additional analysis of
MOV interaction according to different sample type categories.
IV. SUMMARY AND CONCLUSION
This work used Monte Carlo simulations to compare some
of the most used objective quality measurement systems of
perceptually coded audio signals while keeping focus on their
underlying perceptual model rather than the quality scale map-
pings. Individual MOVs showed an acceptable performance
whereas mapping-dependent distortion measures did not. The data support the finding that PEAQ's perceptual model, particularly the modeling of disturbance loudness, is still appropriate for newer audio codecs, even in the intermediate quality range. Consequently, remapping the perceptual model's MOVs to a single quality score using training data from listening tests performed on newer audio codecs can improve the overall performance of PEAQ.
The disturbance loudness model showed a significant differ-
ence in performance depending on the sample type category
evaluated. The performance of some of PEAQ's MOVs is on par with that of other, newer systems. PEAQ's performance for the joint evaluation of the quality of speech and audio signals could be improved further, provided that the correct disturbance loudness analysis mode is used for each case. An improved
cognitive model applied to the pattern adaptation stage could
also improve PEAQ on this matter.
REFERENCES
[1] ITU-R Rec. BS.1387, Method for objective measurements of perceived
audio quality, Geneva, Switzerland, 2001.
[2] P. Počta and J. G. Beerends, “Subjective and objective assessment of
perceived audio quality of current digital audio broadcasting systems and
web-casting applications,” IEEE Transactions on Broadcasting, vol. 61,
no. 3, pp. 407–415, Sep. 2015.
[3] T. Thiede, W. C. Treurniet, R. Bitto, C. Schmidmer, T. Sporer, J. G.
Beerends, and C. Colomes, “PEAQ - the ITU standard for objective
measurement of perceived audio quality,” J. Audio Eng. Soc., vol. 48,
no. 1/2, pp. 3–29, January/February 2000.
[4] J. G. Beerends and J. A. Stemerdink, “A perceptual audio quality
measure based on a psychoacoustic sound representation,” J. Audio Eng.
Soc, vol. 40, no. 12, pp. 963–978, 1992.
[5] ITU-R Rec. BS.1116, Methods for the subjective assessment of small
impairments in audio systems, Geneva, Switzerland, 2015.
[6] R. Huber and B. Kollmeier, “PEMO-Q—a new method for objective
audio quality assessment using a model of auditory perception,” IEEE
Transactions on Audio, Speech, and Language Processing, vol. 14, no. 6,
pp. 1902–1911, Nov 2006.
[7] S. Kämpf, J. Liebetrau, S. Schneider, and T. Sporer, “Standardization of PEAQ-MC: Extension of ITU-R BS.1387-1 to Multichannel Audio,” in
40th AES International Conference on Spatial Audio, Tokyo, Oct 2010.
[8] ITU-T Rec. P.863, Perceptual Objective Listening Quality Assessment,
Geneva, Switzerland, 2014.
[9] C. Sloan, N. Harte, D. Kelly, A. C. Kokaram, and A. Hines, “Objective assessment of perceptual audio quality using ViSQOLAudio,” IEEE
Transactions on Broadcasting, vol. PP, no. 99, pp. 1–13, 2017.
[10] M. Torcoli and S. Dick, “Comparing the effect of audio coding artifacts on objective quality measures and on subjective ratings,” in Audio
Engineering Society Convention 144, May 2018.
[11] M. Neuendorf et al., “The ISO/MPEG unified speech and audio coding
standard—consistent high quality for all content types and at all bit
rates,” J. Audio Eng. Soc, vol. 61, no. 12, pp. 956–977, 2013.
[12] ISO/IEC JTC1/SC29/WG11, “USAC verification test report N12232,” International Organisation for Standardisation, Tech. Rep., 2011. [Online]. Available: https://mpeg.chiariglione.org/standards/mpeg-d/unified-speech-and-audio-coding/unified-speech-and-audio-coding-verification-test
[13] ITU-R Rec. BS.1534, Method for the subjective assessment of interme-
diate quality levels of coding systems, Geneva, Switzerland, 2015.
[14] G. Jekabsons, “ARESLab: Adaptive Regression Splines toolbox for MATLAB,” http://www.cs.rtu.lv/jekabsons/regression.html, 2019.
[15] V. Emiya, E. Vincent, N. Harlander, and V. Hohmann, “Subjective and objective quality assessment of audio source separation,” IEEE
Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7,
pp. 2046–2057, Sep. 2011.
[16] A. Hines, E. Gillen, D. Kelly, J. Skoglund, A. Kokaram, and N. Harte, “ViSQOL Audio MATLAB implementation,” http://www.sigmedia.tv/tools, accessed 2019.
[17] R. Huber, S. Rählmann, T. Bisitz, M. Meis, S. Steinhauser,
and H. Meister, “Influence of working memory and attention on
sound-quality ratings,” The Journal of the Acoustical Society of
America, vol. 145, no. 3, pp. 1283–1292, 2019. [Online]. Available:
https://doi.org/10.1121/1.5092808
Fig. 2. Subjective difference grade scores [13] (MUSHRA) and Monte Carlo average objective scores on test data (speech-only) after training with the RmsNoiseLoudA MOV, which incorporates a quality model based on the loudness of additive disturbances [3] (objective score: R = 0.90201, AES = 1.761).
Fig. 3. Subjective difference grade scores [13] (MUSHRA) and Monte Carlo average objective scores on test data (music-only) after training with the AvgLinDistA and RmsMissingComponents MOVs (group A, AvgLinDistA: R = 0.85102, AES = 1.8097; group B, RmsMissingComponents: R = 0.76749, AES = 2.2705).
We aim to assess the perceived quality of estimated source signals in the context of audio source separation. These signals may involve one or more kinds of distortions, including distortion of the target source, interference from the other sources or musical noise artifacts. We propose a subjective test protocol to assess the perceived quality with respect to each kind of distortion and collect the scores of 20 subjects over 80 sounds. We then propose a family of objective measures aiming to predict these subjective scores based on the decomposition of the estimation error into several distortion components and on the use of the PEMO-Q perceptual salience measure to provide multiple features that are then combined. These measures increase correlation with subjective scores up to 0.5 compared to nonlinear mapping of individual state-of-the-art source separation measures. Finally, we released the data and code presented in this paper in a freely-available toolkit called PEASS.