Access to this full-text is provided by MDPI.
Content available from Applied Sciences
This content is subject to copyright.
applied
sciences
Article
Influence of the Quality of Consumer Headphones in
the Perception of Spatial Audio
Pablo Gutierrez-Parera * and Jose J. Lopez
Institute of Telecommunications and Multimedia Applications (ITEAM), Universitat Politècnica de València,
Valencia 46022, Spain; jjlopez@dcom.upv.es
*Correspondence: pabgupa@iteam.upv.es; Tel.: +34-963-877-007
Academic Editor: Vesa Valimaki
Received: 29 February 2016; Accepted: 12 April 2016; Published: 22 April 2016
Abstract:
High quality headphones can generate a realistic sound immersion reproducing binaural
recordings. However, most people commonly use consumer headphones of inferior quality, as
the ones provided with smartphones or music players. Factors, such as weak frequency response,
distortion and the sensitivity disparity between the left and right transducers could be some of the
degrading factors. In this work, we are studying how these factors affect spatial perception. To this
purpose, a series or perceptual tests have been carried out with a virtual headphone listening test
methodology. The first experiment focuses on the analysis of how the disparity of sensitivity between
the two transducers affects the final result. The second test studies the influence of the frequency
response relating quality and spatial impression. The third test analyzes the effects of distortion using
a Volterra kernels scheme for the simulation of the distortion using convolutions. Finally, the fourth
tries to relate the quality of the frequency response with the accuracy on azimuth localization. The
conclusions of the experiments are: the disparity between both transducers can affect the localization
of the source; the perception of quality and spatial impression has a high correlation; the distortion
produced by the range of headphones tested at a fixed level does not affect the perception of binaural
sound; and that some frequency bands have an important role in the front-back confusions.
Keywords:
headphones; spatial sound; quality; perception; binaural; subjective test; distortion;
frequency response; front-back confusion
PACS: 43.66.Pn, 43.66.Yw
1. Introduction
With the advent of high definition TV, 3D video and mobile devices, spatial audio technologies
have gained great popularity in recent years. Speaker sets have evolved from the classic stereo systems
into many channels, not only considering 2D configurations, but also height speakers. The variety
of formats (from 5.1 to 22.2 and also headphone systems) and the reproduction techniques (Vector
Base Amplitude Panning (VBAP) [
1
], Wave Field Synthesis (WFS) [
2
], Ambisonics, etc.) open up
many possibilities for the recreation of acoustic environments and especially the creation of new
musical experiences. Audio reproduction systems based on loudspeakers are the most popular, but the
headphone-based systems are increasing in popularity because of the private hearing they provide in
any type of environment, as well as the widespread use of mobile devices nowadays. Headphones are
commonly employed to reproduce stereo recordings, but binaural material represents a step forward.
The reproduction of binaural sound over headphones uses the principles of the human auditory
system [
3
]. It assumes that, if we are able to reproduce in the listener’s ears with headphones the same
pressures that the listener experiences in a natural environment, a realistic acoustic immersion can be
simulated [4].
Appl. Sci. 2016,6, 117; doi:10.3390/app6040117 www.mdpi.com/journal/applsci
Appl. Sci. 2016,6, 117 2 of 18
To have a correct sense of spatial immersion, high quality microphones should be employed
in conjunction with acoustic mannequins. In addition, high quality headphones should be used for
playback. However, low end headphones are widely used in most cases, either for economic reasons or
simply because they are included with mobile devices. It is generally known that low cost headphones
usually provide a poorer sense of immersion, but the degrading factors that cause such a loss in quality
have not been sufficiently studied, as well as the level of their effects. In this research, we are laying
the groundwork for a strategy to study the factors that affect the spatial sensation on listening with
headphones and their relationship to perceived quality.
Hypothesis and Planning of the Study
Different factors can affect the perception of the spatial sound image. Our hypothesis states that
three main factors are responsible for this degradation. Some of these degrading factors could be the
frequency response, the distortion and the disparity between the left-right transducers, especially in
low cost headphones. To determine this, we propose a series of perceptual tests [
5
] to particularly
study these factors.
Section 2describes the methodology, the headphones employed in the study, as well as the
technique used to measure and simulate them. Sections 3–6explain a series of perceptual tests that
constitute the bulk of this research. Firstly, Section 3presents a perceptual test carried out to study the
influence of the sensitivity disparity between left and right transducers and to establish the degree to
which perception of the sound source position in the azimuth is affected. Although in high quality
headphones, manufacturers match transducers with similar sensibilities, these low cost headphones
have different sensibilities due to broader manufacturing tolerances. Another second subjective
perceptual test described in Section 4was conducted to evaluate the effect of the frequency response in
the perception of quality and spatial impression with headphones. As frequency response is the factor
that varies most among different headphones due to their quality, this test is of particular interest to
better understand how frequency response affects the spatial sound impression. Section 5outlines the
third perceptual test planned to evaluate the effect of harmonic distortion in listening with headphones.
Distortion can be considerable if high dynamic sound and high reproduction levels are employed.
Section 6explains the fourth and last test, which studies the relation of the frequency response with
the accuracy of localization in the horizontal plane. The capacity of a headphone to generate a good
spatial immersion can be different from its capacity to generate precise locations. To explore this point,
azimuth localization is tested here for different kinds of headphones. The discussion and conclusions
of these experiments are presented in Section 7.
2. Headphones Measurements and Virtual Headphone Simulation
It is well known in loudspeaker testing that visual cues play an undesirable role in the results
provided by test subjects. Similarly, when testing headphones, tactile cues can also influence
results. Consequently, it can be challenging to conduct a double-blind comparative listening test
for headphones. It is difficult to hide the possible influencing variables, such as brand, design or price.
In addition, the manual substitution of different headphones on the subject’s head can be disruptive
and introduces useless fatigue on the subject [
6
]. Moreover, the fitting and tactile sensations are
impossible to remove, making them an important bias factor [7].
In order to avoid these effects, it is appropriate to use a virtual headphone simulation to
perform the listening tests [
8
,
9
]. This method employs one reference headphone to simulate the
different headphones under test. In this way, listeners can evaluate the simulated versions of the
different headphones wearing just the reference headphone, therefore avoiding the manual change of
headphones and removing the visual and tactile biases. Some other advantages are obtained with this
virtual method: listeners can have immediate access to the different headphones, and the procedure
test becomes more flexible, transparent, controlled and repeatable.
Appl. Sci. 2016,6, 117 3 of 18
The reliability of this virtual simulation method has been previously studied, finding good
correlation between standard listening tests using real headphones and the virtual simulation method.
However, in some cases, some discrepancy related to a specific model or sound signal [
8
] has been
found due to the visual and tactile bias present in the standard test [10].
Due to the great advantages of a virtual test over a standard one, this study used a virtual
headphone listening test methodology. This will remove the strong bias that would appear in this
study due to the great difference in appearance and fitting characteristics among the consumer
headphones and high quality ones used in this test.
2.1. Headphone Selection
Different headphones were selected in order to represent a range of commercial and
readily-available headphones. According to this principle and the scope of the study described
in previous sections, seven different headphones were selected plus a high quality reference one. A
Sennheiser HD800 (Sennheiser, Wedemark, Germany) was chosen as the reference headphone (REF).
The reason for this selection is due to its great fidelity, response, low distortion and accurate timbral
reproduction. The other seven headphones were selected to cover a wide range of possible common
uses. The brands and models of the rest of the headphones will be omitted, as they are not necessary
for the result analysis.
The headphones used in the study were classified as:
(a)
REF, Reference, Sennheiser HD800 headphone (open and circumaural)
(b)
HQop, High Quality open headphone (circumaural)
(c)
MQcl, Medium Quality closed headphone (circumaural)
(d)
BDso, Big Diaphragm semi-open headphone (circumaural)
(e)
LCmul, Low Cost multimedia headphone (supra-aural)
(f)
AirL, Airline headphone (supra-concha)
(g)
Woh, Wireless open headphone (circumaural)
(h)
LCmul2, Low Cost multimedia headphone 2 (supra-aural)
The reference headphone was the only one that participants used, saw and had contact with
during the tests. The rest of the headphones were simulated through the reference one. Then, all of the
participants performed the test using the same high quality reference headphone (REF, Sennheiser
HD800). The resulting signals for the rest of the headphones (used in Tests 2, 3 and 4 and described in
their sections) were simulated by means of proper signal processing algorithms and heard through the
reference headphone.
2.2. Frequency Responses Measures
To measure the response of the different headphones, a swept-sine method was employed [
11
]
using a Head and Torso Simulator (HATS) Model B & K Type 4100 (Brüel & Kjær, Nærum, Denmark)
(Figure 1). This technique gave us both the frequency response, as well as the first and second distortion
harmonics needed for the simulation of the different headphones.
Figure 1. Set-up for measuring the headphones with the Head and Torso Simulator (HATS).
Appl. Sci. 2016,6, 117 4 of 18
To avoid differences in the amplitude level of the measures, the selected criterion was to achieve
the same equivalent power between 100 Hz to 10 kHz for all of the headphones (for calibration, we
employed band pass pink noise between 100 Hz to 10 kHz instead of 20 Hz to 20 kHz in order to
minimize the influence of roll-off in low and high frequencies in low quality headphones). This decision
allowed us to measure all of the headphones in the same reproduction conditions and to achieve
the same reproduction level in this band of frequencies. The reproduced pressure level for all of the
headphones was equivalent to 69 Sound Pressure Level dB (dBSPL) of pink noise in the reference
headphones. This level was selected in informal tests as a pleasant listening level. Besides, this level
allowed the measurement of the different headphone models without any saturation distortion in
equivalent conditions.
Each of the headphones, including the reference one, were measured with the mentioned
swept-sine method. The resulting impulse responses (
hi[n]
) were truncated to 50 ms (2205 samples for
a 44,100-Hz sampling frequency) and windowed with a half Hamming window. This length provides
good resolution in low frequencies until 20 Hz. To minimize errors related to headphone positioning
on the ear of the HATS simulator, five resets of the headphones were done and measured. The curves
shown in Figure 2are based on the average of those measures.
The first curve corresponds to the reference headphone (a)-REF, which shows a smooth response
and flat below 3 kHz. The next three, (b)-HQop, (c)-MQcl, (d)-BDso headphones, were chosen as
good mid-quality range with different characteristics: open, closed and semi-open. Their frequency
responses below 6 kHz are quite flat, with the exception of some irregularities in the (c)-MQcl curve
and a peak down at 4.5 kHz that decreases to
−
14 dB. There is another peak up in the curve (d)-BDso
at 6 kHz of 15 dB. The next curves (e to h) represent the frequency responses of the multimedia
(e)-LCmul, airline (f)-AirL, wireless (g)-Woh and another multimedia (h)-LCmul2 headphones, that
were chosen to be an example of mid- and poor quality headphones. Their frequency responses have
important peaks and valleys that affect the sound. Curve (e)-LCmul has a reinforcement in frequencies
around 1.5 kHz and a big dip in 3.5 kHz, and curve (f)-AirL has a strong peak in 140 Hz, as well as other
distortions up to 4.5 kHz. Curve (g)-Woh is flatter in the mid frequencies with a small reinforcement
in 1.5 kHz and a decay around 4.5 kHz. In the case of curve (h)-LCmul2, it is important to note
the rapid decline above 3 kHz and the lack of proper high frequency beyond 5 kHz. All of these
headphones are intended to be a small representation of quality range in commercial headphones.
2.3. Headphones Frequency Response Simulation
The seven headphones under study were simulated to be reproduced with the reference
headphones ((a)-REF-Sennheiser HD800). The simulation of each headphone was done by filtering
with its frequency response, but compensating the effect of the reference headphone using its
inverted frequency response. Equation (1) shows the process for the simulation, where
Hi(ω)
is
the measured response of the headphone to simulate,
HHD800 (ω)
is the measured response of the
reference headphone and
Hi corrected(ω)
is the response of the simulated headphone, which is applied
to the corresponding stimulus.
Hi corrected(ω) = Hi(ω)
HHD800 (ω)(1)
These virtual headphone equalizations include not only the magnitude response, but also the
phase of the headphone measured. Although it is generally accepted that phase does not seem to
affect the perceived accuracy of the simulations [
12
], especially if the stimuli material is a typical music
program, it can be noticed with pink noise stimuli. All of the impulse responses of the headphones
measured, the correction of the reference headphone and its application convolving with the stimulus,
respect and keep the original phases. Moreover, accurate phase processing guaranties that our filtering
will not alter in any way the Interaural Time Difference (ITD) between left and right transducers.
Appl. Sci. 2016,6, 117 5 of 18
Frequency (Hz)
102103104
Relative level (dB)
-50
-40
-30
-20
-10
0
10
REF
(a)
Frequency (Hz)
102103104
Relative level (dB)
-50
-40
-30
-20
-10
0
10
LCmul
(e)
Frequency (Hz)
102103104
Relative level (dB)
-50
-40
-30
-20
-10
0
10
HQop
(b)
Frequency (Hz)
102103104
Relative level (dB)
-50
-40
-30
-20
-10
0
10
AirL
(f)
Frequency (Hz)
102103104
Relative level (dB)
-50
-40
-30
-20
-10
0
10
MQcl
(c)
Frequency (Hz)
102103104
Relative level (dB)
-50
-40
-30
-20
-10
0
10
Woh
(g)
Frequency (Hz)
102103104
Relative level (dB)
-50
-40
-30
-20
-10
0
10
BDso
(d)
Frequency (Hz)
102103104
Relative level (dB)
-50
-40
-30
-20
-10
0
10
LCmul2
(h)
Figure 2.
Frequency response of the headphones used in the study. Solid curve, left channel; dashed
curve, right channel. (
a
) REF, Reference headphone; (
b
–
h
) headphones under study. (
b
) HQop, High
Quality open headphone; (
c
) MQcl, Medium Quality closed headphone; (
d
) BDso, Big Diaphragm
semi-open headphone; (
e
) LCmul, Low Cost multimedia headphone; (
f
) AirL, Airline headphone;
(g) Woh, Wireless open headphone; (h) LCmul2, Low Cost multimedia headphone 2.
Appl. Sci. 2016,6, 117 6 of 18
The filter implementation of Equation (1) was carried out in MATLAB (Matrix Laboratory, R2015a,
MathWorks Inc., Natick, MA, USA, 2015) in the time domain, using Equation (2); where
hi corrected[n]
is the response for the simulation of the virtual headphone,
hi[n]
is the impulse response of the
headphone to simulate and hI
HD800 [n]is the inverted impulse response of the reference headphone.
hi corrected[n] = hi[n]∗hI
HD800 [n](2)
To obtain
hI
HD800 [n]
, we firstly recorded the impulse response of the reference headphone
hHD800
with 2205 sample points (50 ms, fs = 44,100 Hz). Secondly, the Fast Fourier Transform (FFT) of the
response was computed, with zero padding up to a size of 4096, which guaranties a spectral resolution
of 10 Hz. This is low enough to see details of the frequency response. Thirdly, the resulting FFT was
inverted, taking into account a boost limitation of +15 dB. This limitation was included to avoid an
excess of boost at a couple of very narrow notches of the
hHD800
response (see Figure 2a), assuring
that final signals are inside the reproducible dynamic margin and free from artifacts. Lastly, the
inverted and limited response was then used to properly compute the inverse FFT and next Hamming
windowed to obtain the
hI
HD800 [n]
. This process guaranties the avoidance of undesirable effects, such
as circular convolution or others.
Finally, the different headphones were simulated applying the simulation filter
hi corrected[n]
to the
sound materials for each test, obtaining the different stimuli. This was the procedure used for Tests 2
(Section 4) and 4 (Section 6).
2.4. Non-Linear Distortion Simulations
As commented on before, the swept-sine method employed to measure the frequency response of
the headphones provides, apart from the frequency response, distortion harmonics simultaneously.
Figure 3shows the frequency response and the second and third distortion harmonic of the reference
((a)-REF) and the airline ((b)-AirL) headphones. Both of these headphones are a good example of low
(a) and high distortion (b).
Frequency (Hz)
102103104
Magnitude and Distortion (dB)
-60
-50
-40
-30
-20
-10
0
10
REF
(a)
Frequency (Hz)
102103104
Magnitude and Distortion (dB)
-60
-50
-40
-30
-20
-10
0
10
AirL
(b)
Figure 3.
Frequency response with distortion of two headphones (left channel). Solid curve, magnitude;
dashed curve, second order distortion harmonic; dotted curve, third order distortion harmonic. (
a
) REF,
Reference headphone; (b) AirL, Airline headphone.
To simulate the non-linear distortion of each headphone, the method described in [
13
], which uses
Volterra kernels and a series of linear convolutions, was chosen. With this method, the transfer function
of a system is described by means of a Volterra series expansion. The output signal can be represented
as the sum of the linear convolution of the measured impulse responses with the input signal and
the corresponding frequency-shifted version. Applying Fourier transforms to these series results in
a linear equation system. The solution of this system allows the computation of the diagonal Volterra
Appl. Sci. 2016,6, 117 7 of 18
kernels obtaining the impulse response terms for the main response and the first two distortion orders;
Equation (3).
H1=H0
1+H0
3
H2=−2ˆ
H0
2
H3=−4H0
3
(3)
where
H0
1
,
H0
2
,
H0
3
are the measured frequency responses and
H1
,
H2
,
H3
are the Volterra kernels
( ˆ represents the Hilbert transform).
Using these equations, the second and third distortion orders were simulated by convolution,
applying them to Equation (4), where
x(n)
is the input signal and
M
is the number of samples of
the kernel:
y(n) =
M−1
∑
i=0
h1(i)·x(n−i) +
M−1
∑
i=0
h2(i)·x2(n−i) +
M−1
∑
i=0
h3(i)·x3(n−i)(4)
More details of this technique can be found in [
13
]. This procedure was followed for
Test 3 (Section 5).
2.5. Binaural Room Impulse Responses Measurements
In order to generate the spatiality of sound sources, some Binaural Room Impulse Responses
(BRIR) [14] were measured with a HATS B & K Type 4100.
Reverberation is an influential factor for spatial localization [
3
,
15
], and because of this, we
decided to record our own BRIR with natural reverberation instead of using dry responses from a
library. The impulse responses were recorded in a rectangular room with a volume of 132 m
3
and
a reverberation time of about 0.7 s. Nine different azimuth angles were recorded (0
◦
, 30
◦
, 60
◦
, 90
◦
,
135◦, 225◦, 270◦, 300◦, 330◦) in the horizontal plane at 1.5 m of distance.
These measures were used to simulate binaural sound source positions in Test 4 (Section 6).
3. Test 1. Sensitivity Disparity between Left-Right Transducers
3.1. Test Description
The idea of this test is to evaluate how sensitivity disparity between the left and right transducers
affects the perception of the source azimuth. To do that, a subjective perceptual test was carried out
applying some volume level variations to different binaural sounds and checking how this affects the
accuracy of horizontal localization.
In this test, participants had to listen, wearing headphones, to some binaural recordings obtained
with a HATS on specific angles in the horizontal plane. Different variations of the original level between
left and right transducers were applied to these sounds and then presented to the listeners. Participants
should then indicate the direction of arrival, marking the angle in a Graphical User Interface (GUI).
The volume level variations applied were 0 (no modification), 1, 2 or 4 dB more on the left channel
than the right one. Four different angles of direction of arrival were chosen,
−
30
◦
, 0
◦
, 65
◦
and 90
◦
of
azimuth in the horizontal plane. Besides, the influence of different types of sounds was also studied.
These sounds were specifically recorded for this test using a binaural mannequin (B & K
Model 4100) at the specific angles under study. A 44,100-Hz sampling frequency was employed,
obtaining full audio band recordings. The mannequin was in a semianecoic room, and sources
were placed around it at 1 m apart. Four different sounds were recorded: a timbal drum hit, voice,
a whistle and pink noise. The impulsivity of the timbal hit is an interesting characteristic regarding
sound localization, also interesting for its low frequency content. Both voice and whistle are easily
recognizable common sounds, which make them useful for the test. Moreover, the reduced spectral
content of the whistle can be an interesting feature that can affect the test. The voice signal was the
syllables “ba-be-bi-bo-bu”, pronounced by a male voice. This sound has diverse vocalic contents and
Appl. Sci. 2016,6, 117 8 of 18
bilabial consonantal phoneme /b/, which produces impulsive sound. Pink noise was employed to
evaluate a wide spectrum signal. All of these sounds were reproduced by the Sennheiser HD800
reference headphones.
According to the different types of sounds described above, the total number of stimuli presented
to each participant in this test was: 4 angles
×
4 types of sounds
×
4 level variations = 64 stimuli.
These stimuli were randomly presented, and the participant could listen to each of them as many times
as he or she wanted.
During the test, participants also had the possibility of hearing a reference stimulus at any time,
choosing between −90◦,−45◦, 0◦, 45◦and 90◦of azimuth.
To perform the test, a simple Graphical User Interface (GUI) was developed in MATLAB that
brings the user full control of the test. The participant could select the perceived sound source direction
angle in an arc of
−
90
◦
to 90
◦
of azimuth (with a 5
◦
resolution). It was also possible for the subject to
freely control and listen to the reference stimulus.
The test was performed by 20 people, 10 men and 10 women (21 to 45 years, with an average age
of 32). The average runtime of the test was 9 min. Every participant did a training session before taking
the test, so all could listen to all of the stimuli and become familiar with the GUI and the assigned task.
Some preliminary results of this test were previously published by the authors in [16].
3.2. Results
Figure 4a shows the average of the answered angles (for all of the level variation cases) according
to the reproduced angle. The average of the answers has a deviation to the left-hand side. This is
expected since the variations (0, 1, 2, 4 dB) were always more in favor of the left channel than the right.
Angle of sound
90650-30
Answered Angle 95% CI
75
50
25
0
-25
-50
(a)
Level Variation
4dB2dB1dB0dB
Deviation Answered Angle 95% CI
-5
-10
-15
-20
-25
(b)
Figure 4.
(
a
) Average of the answered angles versus reproduced angles (degrees); (
b
) average of the
deviation of the answered angles (degrees) versus level variation (dB).
The tendency of this angle deviation to the left can be seen in Figure 4b, considering the level
variation applied (0, 1, 2, 4 dB).
An Analysis of Variance (ANOVA) indicates that the level variation has a very significant influence
(F= 27.338, df = 3, p< 0.001) over the deviation in the answers.
If we consider just the central angles used in the experiment (0
◦
and 65
◦
), a smaller average
deviation can be seen (Figure 5). This leads us to believe that listeners tended to divert the location of
the sounds perceived on the sides more, which means that the introduced level variations made the
lateral angles disperse more than the central ones.
Appl. Sci. 2016,6, 117 9 of 18
Level Variation
4dB2dB1dB0dB
Deviation Answered Angle 95% CI
5
0
-5
-10
-15
-20
Figure 5.
Average deviation of the answered angles (degrees) versus level variation (dB), considering
only the angles 0◦and 65◦.
On the other hand, the influence of the type of sound (timbal, voice, whistle or pink noise) on
the deviation in responses can be seen in Figure 6a. Voice and pink noise have lower deviation than
timbal and whistle sounds, especially in cases of 0 and 1 dB of deviation. Besides, voice stimuli and
pink noise manifest a more separate and clear deviation at varying levels.
Level Variation
4dB2dB1dB0dB
Deviation Answered Angle 95% CI
10
0
-10
-20
-30
4dB2dB1dB0dB 4dB2dB1dB0dB 4dB2dB1dB0dB
Sound Type
Pink NoiseWhistleVoiceTimbal
(a)
Level Variation
4dB2dB1dB0dB
Deviation Answered Angle 95% CI
10
0
-10
-20
-30
-40
4dB2dB1dB0dB 4dB2dB1dB0dB 4dB2dB1dB0dB
Angle of Sound
90650-30
(b)
Figure 6.
Average deviation of the answered angles (degrees) versus the level variation (dB):
(a) considering the type of sound; (b) considering the angle reproduction of sound.
The influence of the type of sound over the deviation of answers is significant (F= 4.409, df = 3,
p= 0.004) according to an analysis of variance. The sound angle reproduction has a very significant
influence (F= 54.932, df = 3, p< 0.001) over the deviation of the answers. In Figure 6b, the deviation of
the answers for each sound angle reproduction is represented. Angles 0
◦
and 65
◦
present less deviation
to the left. The biggest deviation of the answers corresponds to the angle
−
30
◦
, and it could be due to
the fact that it was the only angle on the left side.
Appl. Sci. 2016,6, 117 10 of 18
4. Test 2. Frequency Response about Quality and Spatial Impressions
4.1. Test Description
In this test, participants listened to some excerpts of sound with headphones and rated their
quality and their sound spatial image. These different headphones were simulated as described in
Section 2.3 by means of the convolution of their frequency responses with the stimuli sounds, and all
of them were reproduced with the reference headphones.
Due to the fact that different frequency responses produce noticeable effects, the perceptual
test was designed according to the recommendation International Telecommunication Union,
recommendation by Radiocommunication sector (ITU-R) 1534-2 [
17
], which describes the MUltiple
Stimuli with Hidden Reference and Anchor (MUSHRA) perceptual test. This kind of test describes
a method to assess intermediate quality audio systems and also all of the requirements needed
to accomplish the test with rigor. Besides, this test sets a zero to 100 continuous scale (zero–bad;
100–excellent
) to evaluate quality and other parameters of sounds and systems, always using a
reference sound. All systems are compared to a reference of maximum quality, and the different
systems are also compared between them.
Two different tasks were evaluated during the test by the participants. The first task was to
indicate the quality of the sound with respect to the reference. The second task was to evaluate the
spatial impression (locations, sensations of depth, immersion, reality of the audio event) [
18
] with
respect to the reference.
Five different excerpts of audio (12 to 14 s) were employed as source material (see Table 1), and
all of them were reproduced simulating the different headphones under study. All of these sound
fragments were chosen by their spatial, stereophonic and timbral attributes.
Table 1. Music program used for listening Tests 2 and 3.
Artist Track CD Description
Bettina Flater Haugebonden Women en Mi female voice and guitar
Paco de Lucía Zambra Gitana Canción Andaluza male voice and guitar
Jerry Glodsmith Night Boarders OST The Mummy high dynamic orchestral
The Chad Fisher Group Basin Street Blues live jazz (binaural)
Smashing Pumpkins audience sound live audience and drums (binaural)
In this test, five headphones simulations were done, corresponding to headphones (b)-HQop,
(c)-MQcl, (d)-BDso, (e)-LCmul and (f)-AirL (described in Section 2.1, with frequency responses in
Figure 2). Each of the five sound excerpts previously mentioned were reproduced by the virtual
headphone simulation described in Section 2.3. A virtual headphone simulation for each sound was
presented randomly in series to the listeners, as well as a hidden reference ((a)-REF) and also two
anchor signals. The first Anchor signal (ANC1) was a 7-kHz low pass filtered version of the sound
(according to the mid-quality anchor of the ITU recommendation 1534-2 [
17
]), and the second Anchor
signal (ANC2) was a monaural version of the sound. This second anchor was determined to set
a reference for the spatial impression question.
To perform the test, a GUI was developed in MATLAB according to the recommendation [
17
],
which allowed participants to freely listen to each of the sounds and to the reference, as many times
as they wanted. The different sound fragments were presented randomly as a series with all of the
different headphone simulations, to compare to the reference sound. Once the participant had scored
all of the simulations of a series, a new sound excerpt was presented to be evaluated. This process was
repeated twice, once for each question of the test (the first about quality and the second about spatial
impression), with a pause in between.
Appl. Sci. 2016,6, 117 11 of 18
The number of stimuli of this test was: (5 headphones simulations + 1 hidden reference + 2 anchor
signals)
×
5 sound excerpts = 40 stimuli, presented in five series of eight stimuli plus the reference.
As commented before, these 40 stimuli were presented twice in a different random order, to answer
the two different questions.
The test was performed by 11 people, seven men and four women (21 to 37 years, with an average
age of 30). As the test had two different questions, they were separated into two parts with a rest pause
in the middle. The average runtime of the test was 22 min for the first part and 16 min for the second.
Every participant did a training session before preforming the actual test, so all of them could listen to
all of the stimuli and become familiar with the GUI and the assigned tasks.
4.2. Results
Figure 7a shows the average of the normalized (zero to 100) quality answers for the hidden
reference, all five headphones simulated and the two anchors. As shown, the reference has been
properly identified in most cases. The three supposedly good quality headphones have high scores;
meanwhile, the two supposedly poor quality ones have the lowest scores. Both anchors remain in the
middle of the scores of these two groups.
Reference, Headphones and Anchors
ANC2ANC1AirLLCmulBDsoMQclHQopREF
Average Answered Quality 95% CI
100
80
60
40
20
0
(a)
Reference, Headphones and Anchors
ANC2ANC1AirLLCmulBDsoMQclHQopREF
Average Answered Spatial 95% CI
100
80
60
40
20
0
(b)
Figure 7.
(
a
) Average answered quality versus reference, headphones and anchors (minutes);
(b) average answered spatial impression versus reference, headphones and anchors.
An analysis of variance confirms that the headphones have a very significant influence (F= 58.33,
df = 7, p< 0.001) over the quality perceived.
Figure 7b shows the average of the normalized (zero to 100) spatial impression answers for the
hidden reference, the five headphones simulated and the two anchors. The results seem to be similar
to the answers about quality, with a high correlation of r
2
= 0.648. Nevertheless, in this case, the
confidence intervals are a bit wider, and the scores have some differences. The three supposedly good
quality headphones have high scores again, but the confidence intervals do not separate them very
much. There is a bigger difference between the two supposedly poor quality headphones, and the low
cost multimedia ((e)-LCmul) ones are in the same range as both anchor signals. It is also noticeable
that the Anchor Signal 2 (ANC2) as a monaural signal does not have a lower score.
In any case, an ANOVA confirms that the headphones have a very significant influence (F= 58.33,
df = 7, p< 0.001) over the perceived spatial impression. No significant influence of the type of sound
has been detected, even though some of them were binaural recordings.
Appl. Sci. 2016,6, 117 12 of 18
5. Test 3. Non-Linear Distortion
5.1. Test Description
The objective of this test is to evaluate how the effect of harmonic distortion in headphones affects
the spatial impression.
Several stimuli with and without the simulation of their harmonic distortion were presented to
the participants that had to score their perception.
The effect of these distortions is very subtle. For that reason, the perceptual test was designed
according to the recommendation ITU-R 1116-2 [
19
], which describes a method to assess small
impairments in audio systems. This recommendation also establishes rigorous requirements of
room, equipment and other arrangements. A continuous scale from one to five (1 – very annoying;
5 – imperceptible) is used to evaluate degradations with respect to a reference signal. The
recommendation proposes an ABC test in which two stimuli, A and B, are presented to be compared
against a known reference. One of these two stimuli, A or B, is always a hidden reference, and the
other a degraded signal.
One single question was presented to the participants: “What degradation of quality and spatial
impression do you hear with respect to the reference?”
The same five audio excerpts previously described in Test 2 were used here (see Table 1), as well
as the same five virtual headphone simulations (b)-HQop, (c)-MQcl, (d)-BDso, (e)-LCmul and (f)-AirL
(described in Section 2.1, with frequency responses in Figure 2). No anchors beyond the proposed
scale were used this time.
Two different versions of the headphones simulations were presented in this test. One without
and the other with the distortion simulated with the method described in Section 2.4. These two
versions of the same stimulus were presented each time to the participants. They have then to
rate the distorted against the not distorted version of the same sound in a double-blind manner (A vs. B).
In each trial, there was always a non-distorted version sound that acted as the known reference (C sound),
which according to the recommendation [19] has to be compared to the A and B sounds.
The number of stimuli of this test was then: 5 headphones simulations
×
2 versions (with and
without distortion)
×
5 sound excerpts = 50 stimuli, presented in twenty five series of two stimuli plus
the reference. All of these pairs were presented randomly to each participant.
To perform the test, a GUI was developed according to the recommendation, which allowed
participants to freely listen to each of the sounds to evaluate and the reference, as many times as
they wanted.
The five headphones under study were simulated (including distortion) to be reproduced with
the reference headphones ((a)-REF, frequency response in Figure 2).
This test was performed by the same 11 people of the previous Test 2; seven men and four
women (21 to 37 years, with an average age of 30). The average runtime of the test was 16 min. Every
participant did a training session before preforming this test, so all of them could listen to all of the
stimuli and become familiar with the GUI and the assigned task.
5.2. Results
According to the recommendation [
19
], the difference between the score of the hidden reference
and the score of the degraded signal is analyzed. Figure 8shows these differences for each of the
headphones simulated.
Appl. Sci. 2016,6, 117 13 of 18
Headphones
AirLLCmulBDsoMQclHQop
REF-Distorted Difference 95% IC
0,2
0,1
0,0
-0,1
-0,2
-0,3
-0,4
Figure 8. Difference between hidden reference and distorted signals versus headphones.
No significance has been found. Then, distortion can be considered as imperceptible.
Therefore, it has no effect in spatial perception, at least with the fixed level used to simulate all
headphones (69 dBSPL).
6. Test 4. Frequency Response about Azimuth Localization
6.1. Test Description
The results obtained in Test 2 are significant, but do not provide information about the accuracy
in the localization of sources. For that reason, a test to evaluate the influence of frequency response on
this accuracy was carried out.
Attempts to describe different spatial attributes have been a constant pursuit in the field of spatial
audio [
18
,
20
,
21
]. The diffuse term employed in Test 2 to ask about spatial characteristics (spatial
impression) was intended to relate in a simple way the perception of quality with the feeling of
spaciousness. A more specific study of spatial attributes is then necessary to better evaluate the
performing of the different headphones. In this direction, the localization accuracy in azimuth is one of
the most studied spatial attributes [
22
–
25
] and therefore a good anchor point to contrast the previous
Test 2 with a localization experiment. Therefore, this test tries to establish a relation of the influence of
the frequency response on the azimuth localization in the horizontal plane.
As commented on in Section 2.5, to simulate the position of the sound sources in the horizontal
plane, recordings of BRIRs in a medium-sized room were done. Nine different azimuth angles, 0
◦
, 30
◦
,
60◦, 90◦, 135◦, 225◦, 270◦, 300◦and 330◦, were used.
Four types of sound were employed: door, voice (female), guitar and pink noise. A closing door
is an impulsive sound with quite low frequency content, which can be useful for sound localization.
The guitar sound was composed by various impulsive sounds in different main frequencies, one for
each chord. Voice is an easily-recognizable common sound, and female was chosen to have some
energy in high frequencies. The words “estímulo sonoro” (sound stimulus in Spanish) were employed.
They present the repeated fricative phoneme /s/ with high frequency content and the phoneme /t/, a
occlusive articulation that generates impulsive sound. Pink noise was employed to evaluate a wide
spectrum signal.
For this test, seven different headphones plus a hidden reference were simulated (Section 2.1).
Besides these, an additional anchor auralization (low pass filtered (LPF) sounds at 7 kHz) for each
angle was employed (ANC1).
Appl. Sci. 2016,6, 117 14 of 18
Therefore, the number of stimuli in this test was: 9 angles
×
4 types of sound
×
(7 headphones
simulation + 1 hidden reference + 1 anchor auralization) = 324 stimuli. These stimuli were presented
in random order in two parts of 162 stimuli, with a rest in between.
To perform the test, a GUI was developed in MATLAB, which allowed participants to freely listen
to the stimuli from a random list as many times as they wanted. Participants should indicate the
perceived angle of the sound source. The GUI consists of a circle of points, which represents the top
view of the listener, with a 5
◦
resolution. Additionally, it included a parallel control to freely listen to
a reference sound (pink noise) in the angles of 0, 45, 90, 135, 180, 225, 270 and 315 degrees.
The test was performed by 16 people, 10 men and 6 women (21 to 36 years, average age of 30).
The average runtime was of 21 and 17 min for each part.
6.2. Results
A Cronbach’s alpha analysis over the answers has been performed giving a value of
α=
0.982,
which shows a high internal consistency.
A one-way ANOVA showed a significant influence between the headphones and the deviation of
the answered angle (deviation = answered angle–real angle) (F= 2.399; df = 8; p= 0.014).
A first exploration of the participants’ answers reveals that several front-back confusions [
26
,
27
]
occur. For this reason, an evaluation of the amount of front-back confusions was performed for each
of the headphones simulated. An ANOVA showed that there is a very significant influence of the
type of headphones on the number of front-back confusions (F= 46.307; df = 8; p< 0.001). In Figure 9,
we can see that headphones (f)-AirL and (h)-LCmul2 produce an average of nearly 50% of front-back
confusions. This can be logical, as both headphones are supposed to be in the low quality range.
However, the (c)-MQcl headphone stands out in the group of high quality ones, as it has 30.2% of
front-back confusions, more confusions than the (e)-LCmul headphone, with a significant difference.
A comparison of the frequency response of the headphones that produce more front-back confusions
((f)-AirL, (h)-LCmul2 and (c)-MQcl) reveals that they share in common strong irregularities in the band
of 100 to 1600 Hz. On the other side, other headphones of medium and low quality ranges that have
less front-back confusions do not present these strong irregularities in that four-octave band. Because
of that, we suspect this can be an affecting factor disturbing the front-back discrimination.
Headphones
ANC1LCmul2WohAirLLCmulBDsoMQclHQopREF
Front-Back confusion (%) 95% CI
60
50
40
30
20
10
Figure 9. Percentage of front-back confusions for the reference, headphones and the anchor.
There is no significant influence of the type of sound crossed with the headphones. The sound
guitar is the only one that produces slightly less front-back confusions for all of the headphones.
Due to the strong front-back confusion, the analysis of the deviation of the perceived sound with
respect to the reproduced sound will produce large angle errors with complicated analysis of the
results. A front-back confusion produces a bigger error for sources in the median plane than lateral
sources, avoiding an analysis of the deviation angle (perceived angle–reproduced angle) with respect
to the source position.
Appl. Sci. 2016,6, 117 15 of 18
To overcome this setback, we propose a modified analysis of the error consisting of a preprocessing
of the listener responses based on reflecting to the correct semi-plane the ones that have front-back
confusion, leaving untouched the ones that do not. This correction eliminates big jumps in the deviation,
focusing the experiment in the performance analysis of the headphones reproducing correctly the
main spatial cues as ITD and the low frequency part of Interaural Level Difference (ILD). The high
frequency part is more related to the pinna effect that is not considered with the reflection applied.
Taking into account the strong front-back confusion, the analysis of the answer deviation from the
reproduction angle of the sound was performed introducing the correction of the front-back confusion.
Therefore, a symmetric image of the responses in the back (90◦to 270◦) is brought to the front.
Figure 10 shows the deviation angle of the answers for the reproduction angle of the sounds, both
of them front-back corrected. We can see that the deviations are quite uniform across the different
headphones, except for the angles 90
◦
and 270
◦
in the cases of (f)-AirL and (h)-LCmul2. Looking at
Figure 2, it is easy to see that the frequency responses of these two headphones present irregularities
and deep level drops between 4 and 7 kHz. It is noticeable that the anchor LPF 7-kHz sounds auralized
in the different angles (ANC1) are not affected by this problem, supporting the suspicion that the
commented band is important for sources located in lateral positions.
330
315
300
270
90
60
45
30
0
330
315
300
270
90
60
45
30
0
330
315
300
270
90
60
45
30
0
330
315
300
270
90
60
45
30
0
ANC1LCmul2WohAirL
330
315
300
270
90
60
45
30
0
Deviation Angle (degrees) 95% CI
60
40
20
0
330
315
300
270
90
60
45
30
0
330
315
300
270
90
60
45
30
0
330
315
300
270
90
60
45
30
0
330
315
300
270
90
60
45
30
0
LCmulBDsoMQclHQopREF
Reference, Headphones and Anchor
Angle of sound Front-Back corrected
Figure 10.
Deviation in degrees of the answers for every reproduced angle of sound. The reference,
headphones under testing and anchor are represented.
7. Conclusions
This study outlines the influence of different quality parameters in headphones in the context of
spatial sound reproduction. Four different perceptual tests have been done to analyze: (1) the effects
of the sensitivity disparity between the transducers; (2) the influence of the frequency response over
the perception of quality and the spatial impression; (3) the effects of non-linear distortion; and (4) the
influence of the frequency response over azimuth localization.
The following main conclusions can be drawn:
1.
The sensitivity disparities between left and right transducers affect the localization of sound
sources, starting from level differences of 1 dB.
2.
The quality and uniformity of the frequency response have an important influence in the
spatial impression.
3. Additionally, the spatial impression has a high correlation with the subjective perceived quality.
4.
The binaural recordings do not obtain significant better results for the parameter spatial impression
compared to two-channel stereo mixes.
Appl. Sci. 2016,6, 117 16 of 18
5.
The distortion introduced by consumer level low quality headphones does not affect the
perception of the spatial sound image.
6.
It has been ratified that much front-back confusion is produced, both for high and low
quality headphones.
7.
We found that irregularities of the frequency response in the band of 100 to 1600 Hz seem to
especially affect the front-back discrimination.
8.
We also found that a poor response in the band of 4 to 7 kHz degrades the accuracy in lateral
position localization.
All of these conclusions have been supported with statistical and ANOVA analysis. Some other
interesting comments and clarifications about these conclusions can be added:
In addition to Conclusion 1, the angles chosen in the disparity test are a determining factor,
whereby the more lateralized the angle, the larger the deviation. An increased number of angular
positions may be of interest in later studies.
In relation to Conclusions 2 and 3, it is worth remarking that the mono anchor signal (ANC2)
has obtained equal or even better results for spatial impression than some headphones ((e)-LCmul,
(f)-AirL) and the stereo LPF anchor (ANC1). This fact seems to be in relation to a deficient high
frequency reproduction and the general listening sensation, as evidenced by the high correlation
statistics obtained with the parameter perceived quality.
In relation to Conclusion 5, other works, such as [
28
], have not found significant perception of the
distortion. However, this earlier study used high quality headphones, while ours does so also with
low quality consumer headphones, and we have also analyzed the influence on spatial reproduction.
Finally, taking into account these three characteristics, perceived quality,spatial impression and
accuracy in azimuth localization, we have concluded that the first two are highly correlated. Surprisingly,
and contrary to how it might seem a priori, there is virtually no correlation between spatial impression
and accuracy in localization, because the strong influence that the subjective perceived quality has
over the spatial image perception. An illustrating example can be seen with the (f)-LCmul headphone.
It would be interesting to deepen this relationship in future work.
Based on the results of this study, some general guidelines for the design of headphones suitable
for spatial sound reproduction can be suggested. A sensitivity difference between left-right transducers
less than 1 dB should be assured in the manufacturing process to avoid azimuth localization errors.
A flat frequency response between 100 to 1600 Hz is desirable to reduce front-back confusion. Finally,
a good frequency response in the band 4 to 7 kHz would guarantee a good accuracy in the localization
of lateral sources.
Acknowledgments:
The Spanish Ministry of Economy and Competitiveness supported this work under the
projects TEC2012-37945-C02-01, TEC2015-68076-R and the grant BES-2013-065034.
Author Contributions:
All authors discussed the contents of the manuscript. Jose Javier Lopez contributed to the
research idea and the framework of this study. Pablo Gutierrez-Parera performed the experimental work and
analyzed the data.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
ANOVA: Analysis of Variance
BRIR: Binaural Room Impulse Responses
dBSPL: Sound Pressure Level dB
FFT: Fast Fourier Transform
GUI: Graphical User Interface
HATS: Head and Torso Simulator
Appl. Sci. 2016,6, 117 17 of 18
ILD: Interaural Level Difference
ITD: Interaural Time Difference
ITU-R: International Telecommunication Union, recommendation by Radiocommunication sector
LPF: Low Pass Filter
MATLAB: Matrix Laboratory, MathWorks software
MUSHRA: MUltiple Stimuli with Hidden Reference and Anchor
OST: Original film Soundtrack
VBAP: Vector Base Amplitude Panning
WFS: Wave Field Synthesis
References
1.
Pulkki, V. Virtual sound source positioning using vector base amplitude panning. J. Audio Eng. Soc.
1997
,45,
456–466.
2. Berkhout, A.; de Vries, D. Acoustic control by wave field synthesis. J. Acoust. Soc. Am. 1993,93, 2764–2778.
3.
Blauert, J. Spatial Hearing: The Psychophysics of Human Sound Localization; MIT Press: Cambridge, MA,
USA, 1997.
4.
Begault, D.R. 3D Sound for Virtual Reality and Multimedia Applications; Academic Press Professional Inc.:
San Diego, CA, USA, 1994.
5.
Bech, S.; Zacharov, N. Perceptual Audio Evaluation—Theory, Method and Application; John Wiley & Sons Ltd.:
Sussex, UK, 2006.
6.
Olive, S.E.; Welti, T. The relationship between perception and measurement of headphone sound quality.
In Proceedings of the 133rd AES Convention, San Francisco, CA, USA, 26–29 October 2012.
7.
Opitz, M. Headphones listening tests. In Proceedings of the 121st AES Convention, San Francisco, CA, USA,
5–8 October 2006.
8.
Hirvonen, T.; Vaalgamaa, M.; Backman, J.; Karjalainen, M. Listening test methodology for headphone
evaluation. In Proceedings of the 114th AES Convention, Amsterdam, The Netherlands, 22–25 March 2003.
9.
Briolle, F.; Voinier, T. Transfer function and subjective quality of headphones: Part 2, subjective quality
evaluations. In Proceedings of the 11th AES International Conference, Portland, OR, USA, 29–31 May 1992.
10.
Olive, S.E.; Welti, T.; McMullin, E. A virtual headphone listening test methodology. In Proceedings of the
51st AES International Conference, Helsinki, Finland, 22–24 August 2013.
11.
Farina, A. Simultaneous measurement of impulse response and distortion with a swept-sine technique.
In Proceedings of the 108th AES Convention, Paris, France, 18–22 February 2000.
12.
Lindau, A.; Brinkmann, F. Perceptual evaluation of headphone compensation in binaural synthesis based on
non-individual recordings. J. Audio Eng. Soc. 2012,60, 54–62.
13.
Farina, A.; Armelloni, E. Emulation of not-linear, time-variant device by the convolution technique.
In Proceedings of the Congresso AES Italy 2005, Como, Italy, 3–5 November 2005.
14.
Pulkki, V.; Karjalainen, M. Communication Acoustics: An Introduction to Speech, Audio and Psychoacoustics;
John Wiley & Sons Ltd: Sussex, UK, 2015.
15. Rumsey, F. Spatial Audio; Focal Press: Oxford, UK, 2001.
16.
Gutierrez-Parera, P.; Lopez, J.J.; Aguilera, E. On the influence of headphones quality in the spatial
immersion produced by binaural recordings. In Proceedings of the 138th AES Convention, Warsaw, Poland,
7–10 May 2015.
17.
Rec. ITU-R BS. 1534-2. Method for the Subjective Assessment of Intermediate Quality Level of Audio Systems;
International Telecommunication Union (ITU): Geneva, Switzerland, 2014.
18.
Rumsey, F. Spatial quality evaluation for reproduced sound: Terminology, meaning and a scene-based
paradigm. J. Audio Eng. Soc. 2002,50, 651–666.
19.
Rec. ITU-R BS. 1116-2. Methods for the Subjective Assessment of Small Impairments in Audio Systems; International
Telecommunication Union (ITU): Geneva, Switzerland, 2014.
20.
Letowski, T. Sound quality assessment: Cardinal concepts. In Proceedings of the 87th AES Convention,
New York, NY, USA, 18–21 October 1989.
21.
Zacharov, N.; Koivuniemi, K. Unravelling the perception of spatial sound reproduction. In Proceedings of
the 19th AES International Conference, Bavaria, Germany, 21–24 June 2001.
Appl. Sci. 2016,6, 117 18 of 18
22.
Shinn-Cunningham, B. Learning reverberation: Considerations for spatial auditory displays. In Proceedings
of the International Conference on Auditory Display (ICAD), Atlanta, GA, USA, 2–5 April 2000.
23.
Minnair, P.; Olesen, S.K.; Christensen, F.; Møller, H. Localization with binaural recordings from artificial and
human heads. J. Audio Eng. Soc. 2001,49, 323–336.
24.
Santala, O.; Pulkki, V. Directional perception of distributed sound sources. J. Acoust. Soc. Am.
2011
,129,
1522–1530.
25.
Mendoça, C.; Campos, G.; Dias, P.; Santos, J.A. Learning auditory space: Generalization and long-term
effects. PLoS ONE 2013,8, e77900.
26.
So, R.H.Y.; Ngan, B.; Horner, A.; Braasch, J.; Blauert, J.; Leung, K.L. Toward orthogonal non-individualised
head-related transfer functions for forward and backward directional sound: Cluster analysis and
an experimental study. Ergonomics 2010,53, 767–781.
27.
Zhang, P.X.; Hartmann, W.M. On the ability of human listeners to distinguish between front and back.
Hear. Res. 2010,260, 30–46.
28.
Temme, S.; Olive, S.E.; Tatarunis, S.; Welti, T.; McMullin, E. The correlation between distortion audibility and
listener preference in headphones. In Proceedings of the 137th AES Convention, Los Angeles, CA, USA,
9–12 October 2014.
c
2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons by Attribution
(CC-BY) license (http://creativecommons.org/licenses/by/4.0/).
Available via license: CC BY
Content may be subject to copyright.