QUT Digital Repository:
http://eprints.qut.edu.au/

This is the accepted version of the following journal article:
McCowan, Iain, Dean, David B., McLaren, Mitchell L., Vogt, Robert J., & Sridharan, Sridha (2011) The delta-phase spectrum with application to voice activity detection and speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing.

© Copyright 2011 IEEE
The Delta-Phase Spectrum with Application to
Voice Activity Detection and Speaker Recognition
Iain McCowan, Member, IEEE, David Dean, Member, IEEE, Mitchell McLaren, Student Member, IEEE,
Robert Vogt, Member, IEEE, and Sridha Sridharan, Senior Member, IEEE
Abstract—For several reasons, the Fourier phase domain is
less favoured than the magnitude domain in signal process-
ing and modelling of speech. To correctly analyse the phase,
several factors must be considered and compensated, includ-
ing the effect of the step size, windowing function and other
processing parameters. Building on a review of these factors,
this paper investigates a spectral representation based on the
Instantaneous Frequency Deviation, but in which the step size
between processing frames is used in calculating phase changes,
rather than the traditional single sample interval. Reflecting
these longer intervals, the term Delta-Phase Spectrum is used
to distinguish this from instantaneous derivatives. Experiments
show that mel-frequency cepstral coefficient features derived
from the Delta-Phase Spectrum (termed Mel-Frequency Delta-
Phase features) can produce broadly similar performance to
equivalent magnitude domain features for both voice activity
detection and speaker recognition tasks. Further, it is shown
that the fusion of the magnitude and phase representations yields
performance benefits over either in isolation.
Index Terms—phase, instantaneous frequency, speech analysis,
voice activity detection, speaker recognition.
I. INTRODUCTION
Most speech analysis focuses on features derived from the
signal’s magnitude spectrum, with the phase spectrum dis-
carded. This is due both to the mathematical difficulty of analysing
phase as a function, and to psychoacoustic and signal
processing experimental results that have rarely shown the
phase to provide any empirical benefit over magnitude-only
features. While well motivated, however, this still effectively
discards half of the information present in the original signal.
While this discarded information may mostly be redundant
in low noise conditions, when the noise energy becomes
comparable to the signal energy, sources of discriminative
information that may prove complementary to the magnitude
spectrum are desirable.
Many efforts to improve the robustness and discriminative
ability of speech features have focussed on the importance
of encoding temporal information in the feature extraction
process, such as RASTA filtering of spectral trajectories [1],
temporal pattern (TRAPS) classifiers [2], and the modulation
Copyright (c) 2011 IEEE. Personal use of this material is permitted.
However, permission to use this material for any other purposes must be
obtained from the IEEE by sending a request to pubs-permissions@ieee.org.
The authors are with Speech and Audio Research Laboratory, Queensland
University of Technology, 2 George St. Brisbane, QLD, Australia. Phone: +61
7 3138 1608, Fax: +61 7 3138 1516
Mitchell McLaren is also with Centre for Language and Speech Technology,
Radboud University, PO Box 9102, 6500HC Nijmegen, The Netherlands.
Email: iain@ieee.org, ddean@ieee.org, m.mclaren@let.ru.nl,
r.vogt@qut.edu.au, s.sridharan@qut.edu.au
spectrum [3], [4], [5]. Given the Fourier phase domain en-
codes relative timing information between different spectral
components, interest in its use has increased in recent years.
Different approaches have included estimating phase changes
from an interference model [6], using the phase of the signal
autocorrelation at different lags [7], measuring relative phase
difference between frequencies [8], and deriving features based
on the group delay [9] and instantaneous frequency [10], [11].
A recent review of the use of phase information in speech
processing, however, indicates that broadly effective phase-
domain features remain elusive [12].
The main difficulty associated with extracting speech fea-
tures from the phase spectrum is the ambiguity that exists
between angles separated by multiples of 2π radians. While
the principal phase spectrum can be obtained by choosing the
phase angle to lie between ±π, this choice is arbitrary and
results in regular discontinuities from circular wrapping of
values considered over time or frequency. Phase unwrapping
may be performed to restore a continuous phase spectrum
for analysis, but consistent unwrapping is difficult, relying on
different heuristics in practice [12], [13], [14], [15]. This dif-
ficulty in obtaining the phase as a continuous function causes
both analytical problems as well as modeling difficulties due
to inconsistencies in the representation across frames.
This paper commences with a discussion of practical issues
that must be considered in analysing phase domain information
from the short-time Fourier transform (STFT). While works
can be found in the speech signal processing literature that
discuss individual issues to varying degrees, the literature on
these details is sparse. A first contribution of this paper is
therefore to provide a tutorial introduction and brief literature
review of practicalities in dealing with short-time Fourier
phase in discrete-time frame-based processing algorithms. In
particular, compensation must be made for the inter-frame
time step and the effect of the windowing function before the
phase spectrum can be meaningfully analysed. Another issue
concerns the lack of a common temporal origin when making
comparisons of the phase spectrum over different sequences,
such as when developing statistical models of speech. Finally,
there is a need to select processing parameters, such as frame
size and window function, that are appropriate for analysing
phase information, rather than naively applying parameters
that work well for the magnitude domain.
Following this review, this paper investigates the use of
phase changes between analysis frames as a representation that
can be consistently analysed both within and across sequences.
As a temporal difference in phase values, this representation
is equivalent to the Instantaneous Frequency Deviation (IFD)
spectrum [11], [16]; however rather than estimating the instan-
taneous derivative using successive samples, the phase delta is
analysed over a larger time delta corresponding to the inter-
frame step size. As there is no intention to measure the formal
derivative (that is, the limit as the time interval approaches
zero or equivalently single sample difference in its digital
approximation), in order to distinguish from the Instantaneous
Frequency Deviation as commonly computed and used, the
term Delta-Phase Spectrum is used to describe the quantity
used in this article.
The desire to analyse phase differences over frame step
intervals is motivated by the success of approaches to model
the speech signal as the birth and death of individual sinusoidal
components, each lasting several short-term frames [17], [18],
[19], [20]. Measuring the simple difference in phase in narrow
frequency bins across step-sized intervals may effectively
capture information about timing and transitions across the
spectrum and between speech units, such as phonemes, syl-
lables and words, potentially leading to useful features for
detecting voice activity in noise, or distinguishing voices. In
order to demonstrate use of the Delta-Phase Spectrum in prac-
tice, therefore, these two applications are investigated. First,
a simple Gaussian Mixture Model (GMM) based Voice Ac-
tivity Detection (VAD) system is evaluated in different noise
conditions in Section V. Mel-frequency cepstral coefficient
features derived from the Delta-Phase Spectrum, termed Mel-
Frequency Delta-Phase (MFDP) features, are compared to and
combined with standard Mel-Frequency Cepstral Coefficient
(MFCC) features derived from the magnitude spectrum [21].
Similarly, the effectiveness of MFDP features is evaluated for
application to speaker recognition in Section VI.
II. A REVIEW OF IMPLEMENTATION PRACTICALITIES FOR
THE PHASE DOMAIN
A. Short-Time Fourier Analysis of Speech Signals
While short-time Fourier analysis of speech is a well-known
technique, in established use for over 30 years [22], [23], [24],
a brief review is provided here as the basis for the subsequent
discussion.
The short-time Discrete Fourier Transform (DFT) is defined
as:

X_m(k) = \sum_{n=-\infty}^{\infty} w(n - mD)\, x(n)\, e^{-j\omega_k n} \qquad (1)

where m is the frame index, w(n) is a causal window of length
T (i.e., zero-valued outside the range 0 ≤ n ≤ T − 1), D is
the number of samples between successive analysis frames
(the step size, with D ≤ T), and ω_k = 2πk/L, where L is the
number of analysis frequencies being considered in the DFT
(with L ≥ T).
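For concreteness, a minimal numpy sketch of this frame-based analysis is given below; it is illustrative only and not code from the paper, with the function and argument names assumptions. Note that a direct frame-by-frame implementation of this kind resets each frame's time origin to its first sample, yielding the running form discussed in Section II-D; the compensations of Sections II-C and II-D are then required before the phase can be compared across frames.

import numpy as np

def running_stft(x, window, D, L):
    # Frame-based analysis: hop of D samples, frame length T = len(window),
    # L-point DFT (L >= T, zero-padded). Each frame is indexed from its own
    # first sample, so this computes the "running" form discussed in
    # Section II-D rather than Equation (1) directly.
    T = len(window)
    n_frames = 1 + (len(x) - T) // D
    X = np.empty((n_frames, L), dtype=complex)
    for m in range(n_frames):
        X[m] = np.fft.fft(window * x[m * D : m * D + T], n=L)
    return X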
B. Selection of Processing Parameters
The above equation can be interpreted as shifting a short-
time window function w(n)through progressive D-sized de-
lays over the signal x(n), to obtain successive T-length frames
for analysis. Implementation therefore depends on appropriate
choice of the window function, the step size (that is, the frame
rate) and the frame length.
The window function w(n)is necessary to impose a finite
extent on the signal being analysed. Important considerations
include minimising spectral leakage through effective tapering
(enforcing periodicity in the window length) and the ability to
resolve frequency components (effective window bandwidth
and sidelobe level). Different windowing functions are anal-
ysed in [25]. While windows with smooth tapering, such as
the Hamming window, are commonly used in analysing the
magnitude spectrum over short-time segments, some studies
have shown that the rectangular window is more appropriate
when analysing the phase domain [26], [27], [12].
Speech is generally considered to be approximately
piecewise-stationary over a period of approximately 20 mil-
liseconds; however it is noted that some sounds are stationary
over longer or shorter durations, or may be non-stationary.
The choice of analysis frame (that is, window) length T is a
trade-off between desired frequency resolution and temporal
resolution - a longer frame gives better frequency resolution,
however possibly at the expense of blurring out more rapid
speech events. A window duration of between 16-32 ms is
often used in analysis of the speech magnitude spectrum as
an effective balance between these practical considerations.
Most studies on the relevance of phase domain information
in the speech processing literature have however concluded
that longer analysis frames are required than typically used
in magnitude domain analysis. For instance, perception tests
have repeatedly shown that intelligibility of phase-only stimuli
improves over magnitude-only stimuli as the window duration
extends from 100-1000 ms, while the converse is true for
shorter frames [28], [12], [29]. According to a range of studies
on English speech, phonemes typically vary from 50 to 200
milliseconds, syllables from 50 to 500 milliseconds, and words
from 80 to 850 milliseconds [30], [3]. Further, it has been
shown that inter-word pauses during conversational speech can
vary from 100 to 1000 milliseconds [30]. Work modelling the
speech signal as the birth and death of individual sinusoidal
components, each lasting several short-term frames, suggests
that significant phase variations over time and frequency may
be produced by these underlying speech units [17], [18].
In selecting the step-size D, consideration must first be
given to the bandwidth of the window function and the analysis
frame length. For instance, for a Hamming window of length
T, the step-size should be less than or equal to T/4 to avoid
aliasing [23]. This places an upper bound on the step-size,
or equivalently a lower bound on the frame rate (although
it is common practice to implement a step size of T/2). It
is of course necessary that the frame rate be high enough
to capture the temporal dynamics in the signal. Following
the Nyquist sampling principle, the assumption that speech is
quasi-stationary over 20 ms segments motivates new frames
being taken at 10 ms intervals, that is at 100 frames per
second. This frame rate is commonly used in speech analysis
applications. While the preceding paragraph motivated longer
analysis frames for the phase domain, a similar 100 Hz frame
rate (step size) is still motivated in order to effectively sample
these variations in the speech signal.
Fig. 1. Plot of the Phase Spectrum for successive input audio frames for (a) the
uncompensated case, (b) when compensation has been made for the analysis
window, and (c) when both window and frame-shift compensation using
Equation (3) have been applied to the DFT. The solid blue line shows the
current frame, the dashed black line shows the previous frame, and the dot-
dashed red line shows the phase difference between these (calculated using
the phase of the quotient, to avoid wrapping effects). A window and FFT
length of T = L = 128 and a step size of D = 8 were used on a signal at
F_s = 16 kHz.
C. Compensation for the Analysis Window
It is first necessary to understand the effect of the windowing
function on the phase. A description of this effect and a
compensation method can be found in [18, Section 9.3.3],
summarised here for convenience.
Because the window function is commonly symmetric about
its mid-point, and this is aligned with the mid-point of the
current frame in practical implementation of the analysis
procedure, it has a linear Fourier transform phase of −ω_k T/2.
A simple way to compensate for its effect is to implement
a circular shift of the windowed signal in the time-domain
prior to the Fourier transform. Specifically: take the m-th input
frame x_m(n) of length T, apply the window function w(n),
zero-pad as necessary to the FFT length L, then circularly
shift the frame by T/2 samples such that the latter half of the
frame occupies the range 0 ≤ n < T/2 and the earlier half
occupies the range L − (T/2) ≤ n ≤ L − 1. Following this, the
Fourier transform can be taken and the inter-frame time step
compensated as described in the following subsection. An
alternative implementation may instead compensate the phase
modulation in the frequency domain.
The effect of this window compensation on a single analysis
frame is shown in Figure 1(a)-(b).
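A minimal sketch of this circular-shift compensation is given below, assuming a window that is symmetric about its mid-point; function and variable names are illustrative, and this is not the authors' code.

import numpy as np

def window_compensated_fft(frame, window, L):
    # Window the T-sample frame, zero-pad to the FFT length L, then circularly
    # shift by T/2 so that the latter half of the frame occupies indices
    # [0, T/2) and the earlier half occupies [L - T/2, L), before the FFT.
    T = len(window)
    buf = np.zeros(L)
    buf[:T] = window * frame
    buf = np.roll(buf, -(T // 2))
    return np.fft.fft(buf)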
D. Compensation for Inter-frame Time Step
To implement the short-time DFT in practice, typical speech
signal analysis multiplies the window function by T samples
from the signal to obtain an analysis frame. A length-L DFT
is then taken to obtain the spectrum for analysis. Subsequent
analysis frames are obtained by shifting the input signal by
D samples - that is, discarding the first D samples from the
T-length buffer, and appending D new samples at the end.
This procedure in effect implements:

\tilde{X}_m(k) = \sum_{n=-\infty}^{\infty} w(n)\, x(n + mD)\, e^{-j\omega_k n}, \qquad (2)
sometimes referred to as the Running Short-time Fourier
Transform (RSTFT) [31]. That is, it is the signal that is
effectively being shifted (through progressive advancements)
past the fixed window, rather than the window being shifted
over the signal. This distinction is important when considering
the absolute time origin for each analysis frame. In (1), the
temporal origin of each signal frame remains the origin of the
original signal x(n). In (2), however, the absolute position
of the frame within the original sequence is discarded, by
redefining the temporal origin of the frame as mD.
This has no implications for applications that only consider
the magnitude spectrum. Further, when the short-term Fourier
analysis is being done prior to re-synthesis of the signal,
such as when using the overlap-add method to implement
frequency-domain filtering, this difference is eventually com-
pensated for by time-shifting the synthesised frame to its
correct position in the output.
In applications that seek to analyse the phase spectrum
across multiple frames, however, the above distinction has
important consequences. Direct inter-frame comparisons of
phase values calculated in this way are invalid, due to the
changing reference. By accounting for the effect of the step
size, however, it is possible to restore a common reference
point to the phase values for every frame, allowing meaningful
analysis and modelling of the phase information over time.
If we therefore wish to compare the phase spectrum between
frames, it is straightforward to see that (1) is related to (2) by
X_m(k) = \tilde{X}_m(k)\, e^{-j\omega_k m D} \qquad (3)
Returning to the typical procedure for obtaining the spectrum
of successive analysis frames, applying (3) following the
DFT compensates the phase spectrum to restore a common
reference. The effect of correctly compensating for both the
window function and the frame time step is shown in Fig-
ure 1(c).
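In the frequency domain, the compensation of Equation (3) amounts to a per-bin phase rotation; the following short sketch illustrates it under the same assumptions as the earlier snippet (names are illustrative, not from the paper).

import numpy as np

def compensate_frame_step(X_tilde_m, m, D):
    # Equation (3): multiply bin k of the running STFT of frame m by
    # exp(-j * omega_k * m * D), with omega_k = 2*pi*k/L, restoring a
    # common time origin so that phase can be compared across frames.
    L = len(X_tilde_m)
    omega = 2.0 * np.pi * np.arange(L) / L
    return X_tilde_m * np.exp(-1j * omega * m * D)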
III. THE DELTA-PHASE SPECTRUM
Two problems still remain with the phase spectrum follow-
ing the above compensation: the lack of common temporal
reference between different sequences, and possible ambigui-
ties arising from phase wrapping. Absolute values of the phase
spectrum have little meaning without a common reference
point: phase values are by nature relative. While phase values
within a single sequence can be compared once they have been
compensated back to a virtual zero reference time, comparison
of phase values across sequences is problematic due to the
arbitrary start-point for the windowing. When developing
statistical models of the behaviour of the phase spectrum,
therefore, it is necessary to somehow restore some common
reference to the values.
A. Review of Spectral Representations based on Instantaneous
Frequency
One means of achieving a consistent phase-domain quantity
for analysis and modelling is to calculate the temporal deriva-
tive, commonly referred to as the Instantaneous Frequency
(IF). A number of works in the literature have investigated
spectrographic representations which plot the magnitude spec-
trum with the bin location on the time and frequency axes
reassigned according to the instantaneous frequency and group
delay [32], [31], [33]. A technical and historical review relat-
ing these different approaches is presented in [32], including
algorithms for practical digital estimation of the IF using the
STFT. One common method for computing the IF without
explicit differentiation is to first calculate the IF Deviation
as the imaginary part of the ratio between two STFT’s, one
calculated using the standard window and one calculated by
replacing the window with its derivative [31], [34], [33]. The
IF can then be calculated by compensating this for the centre
frequency of each component. Following [35], this is referred
to as the Auger and Flandrin method in [32].
Recently a spectrographic representation that is instead
based directly on the Instantaneous Frequency Deviation was
proposed in [11]. In that work, the IF was first calculated as
the phase difference between two successive STFT’s calculated
with a single sample increment, following [36] and referred
to as the finite difference approximation for IF in [32].
Adapting [11], [31] to the notation from the preceding section,
let us redefine the short-term Fourier transform in terms of the
starting sample q of the frame m (i.e., q = mD), rather than
implying this from the frame index m:

\tilde{X}(k, q) = \sum_{n=-\infty}^{\infty} w(n)\, x(n + q)\, e^{-j\omega_k n}, \qquad (4)

The Instantaneous Frequency can then be calculated as [11],
[32]:

v(k, q) = \arg\left[ \tilde{X}(k, q)\, \tilde{X}^{*}(k, q - 1) \right], \qquad (5)

where (·)^{*} indicates the complex conjugate. The Instantaneous
Frequency Deviation can be calculated as [16], [31], [11]:

\psi(k, q) = v(k, q) - \omega_k \qquad (6)
           = \arg\left[ \tilde{X}(k, q)\, \tilde{X}^{*}(k, q - 1)\, e^{-j\omega_k} \right], \qquad (7)

Having observed that the IF tracks its harmonic frequency
more accurately as the corresponding spectral magnitude in-
creases (i.e., IF deviation is inversely proportional to magni-
tude), the Instantaneous Frequency Deviation Spectrum α was
then defined as [11]:

\alpha(k, q) = |\psi(k, q)|^{-1} \qquad (8)
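For reference, a short sketch of this finite-difference estimate of Equations (5)-(7) is given below, under the same illustrative conventions (and assumptions) as the earlier snippets.

import numpy as np

def if_deviation(x, q, window, L):
    # Equations (5)-(7): finite-difference Instantaneous Frequency Deviation
    # at starting sample q, from two windowed FFTs taken one sample apart
    # (assumes q >= 1).
    T = len(window)
    X1 = np.fft.fft(window * x[q : q + T], n=L)
    X0 = np.fft.fft(window * x[q - 1 : q - 1 + T], n=L)
    omega = 2.0 * np.pi * np.arange(L) / L
    return np.angle(X1 * np.conj(X0) * np.exp(-1j * omega))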
B. Delta-Phase Spectrum
Instead of analysing the instantaneous phase derivatives over
single sample intervals, this paper proposes a related represen-
tation based on the phase difference between successive frames
separated by a step-size time interval. In a similar manner to
the Instantaneous Frequency Deviation above, it can be simply
calculated as the phase of the ratio of successive complex
spectral values:

\phi_m(k) = \arg\left[ \frac{X_m(k)}{X_{m-1}(k)} \right] \qquad (9)

where the use of X from Equation (1) rather than X̃ reflects
the fact that the spectrum has been compensated for the
inter-frame time step and analysis window to implement the
Fourier Transform with a fixed time basis (as described in
Section II). Because the phase modulation introduced by the
analysis window will be the same for all frames, and will thus
be cancelled out during the division, it can be seen that the
Delta-Phase Spectrum may simply be implemented as:

\phi_m(k) = \arg\left[ \frac{\tilde{X}_m(k)\, e^{-j\omega_k m D}}{\tilde{X}_{m-1}(k)\, e^{-j\omega_k (m-1) D}} \right]
          = \arg\left[ \tilde{X}_m(k)\, \tilde{X}^{*}_{m-1}(k)\, e^{-j\omega_k D} \right] \qquad (10)

where X̃ is the uncompensated short-time Fourier spectrum.
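A short sketch of Equation (10) under the same illustrative conventions is given below; it is not the authors' implementation, and the helper names are assumptions.

import numpy as np

def delta_phase(x, m, D, window, L):
    # Equation (10): the analysis-window phase is common to both frames and
    # cancels in the product, so only the step-size compensation
    # exp(-j * omega_k * D) is needed (assumes m >= 1).
    T = len(window)
    Xm  = np.fft.fft(window * x[m * D : m * D + T], n=L)
    Xm1 = np.fft.fft(window * x[(m - 1) * D : (m - 1) * D + T], n=L)
    omega = 2.0 * np.pi * np.arange(L) / L
    return np.angle(Xm * np.conj(Xm1) * np.exp(-1j * omega * D))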
To facilitate direct comparison, the Instantaneous Frequency
Deviation from the preceding section may be restated with the
frame index m explicit as:

\psi_m(k) = \arg\left[ \tilde{X}(k, mD)\, \tilde{X}^{*}(k, mD - 1)\, e^{-j\omega_k} \right], \qquad (11)

while the Delta-Phase Spectrum implements:

\phi_m(k) = \arg\left[ \tilde{X}(k, mD)\, \tilde{X}^{*}(k, mD - D)\, e^{-j\omega_k D} \right], \qquad (12)
These expressions are equivalent in the limiting case of D = 1
(i.e., a single sample step), however the latter is more general
by using the processing parameter D for the time change
interval. While the mathematical difference is minor, different
information is being captured: rather than estimating the phase
derivative at the start time of each individual frame, the simple
change in phase between frames is measured. Because the
“instantaneous” derivative is not measured, to avoid confusion
with the Instantaneous Frequency as commonly calculated
and used, the less constrained term Delta Phase is adopted
in this article. By adopting a distinct term, it is intended to
encourage new interpretation and insights by moving away
from the constraint of single sample intervals. The term Delta-
Phase draws a clear analogy with delta coefficients commonly
used in speech feature vectors derived from the magnitude
spectrum [37]. A further computational difference is that two
STFTs per frame must be calculated to obtain a spectrogram
or derive features from the IF Deviation Spectrum [11] in
a standard sliding window procedure, while by reusing the
previous frame’s FFT in calculating the phase change the Delta
Phase Spectrum requires only one.
Rather than calculating phase differences over time, as
above, a similar approach in [8] effectively enforced a common
reference using a particular frequency bin in the Fourier trans-
form. Such a method however requires an arbitrary frequency
bin to be selected (chosen to be π/4 in [8]), which may or
may not provide a robust reference depending on the vocal
characteristics of the speaker and the spectral characteristics
of the noise.
Finally, it is noted that as well as providing a consistent
basis for comparison over different times and sequences, a
representation based on the change in phase over a given
time allows the issue of phase wrapping to be controlled, as
discussed in Section IV.
C. Mel-Frequency Delta-Phase (MFDP) Features
In order to model the speech signal, it is often necessary
to extract a pertinent set of features from the raw spectral
representation. This section presents one such feature set that
may be derived from the Delta Phase Spectrum. The intention
is to be illustrative rather than optimal in any sense: clearly
other feature representations are possible.
The Mel-Frequency Cepstral Coefficients (MFCC) have
proven to be an effective choice of speech features derived
from the magnitude spectrum [21]. The MFCC features are
formed by first extracting filter bank energies using a bank
of band-pass filters on the absolute magnitude spectrum. The
filter bank design is inspired by the critical band filtering
of the human auditory system [38]. Cepstral coefficients are
then derived from these by taking the logarithm of filter
bank energies and performing a Discrete Cosine Transform
(DCT). The cepstral processing implements a homomorphic
transformation, effectively mapping convolutive effects in the
original time domain into additive effects in the cepstral
domain [39], [40].
This paper proposes extracting Mel-Frequency Delta-Phase
Cepstral Coefficients (MFDP) by performing the same opera-
tions on the absolute delta-phase spectrum |φ_m(k)| from (9),
rather than the magnitude spectrum. For these features, the
absolute operator is used to measure the amount of change
in the phase within each frequency bin without concern for
the polarity of this change. It can be seen that the logarithm
following filter bank analysis is not strictly motivated for the
same reason in the phase domain as in the magnitude domain,
as the phase angle is effectively already in the log domain.
The logarithm on the filter bank output does however have a
second effect in practice: being akin to the application of a soft
maximum operation, it effectively emphasises the peak values
within each frequency band. In order to avoid smoothing out
peaks from the delta-phase spectrum following the filter bank
analysis, the logarithm is therefore maintained in the proposed
MFDP feature extraction.
An important practical consideration in developing phase
domain features is selection of the parameters for the short-
time Fourier analysis. Following the rationale presented in
Section II-B, in extracting the MFDP features in this paper,
a rectangular window function is used on frames of 256
ms duration at a rate of 100 frames per second (i.e., a 10
ms step size). Note that such a frame length corresponds to
the analysis interval typically used in other works modelling
temporal dynamics of the speech signal [1], [2], [3], [4], [5].
Further motivation for a longer analysis window is the desire
to detect phase changes in individual harmonic components in
the signal, and thus measure FFT bins that are as narrow as
is practical.
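The following sketch summarises the MFDP chain just described; the mel filter-bank matrix, the helper names and the small flooring constant are illustrative assumptions rather than the authors' implementation.

import numpy as np
from scipy.fft import dct

def mfdp(phi_m, mel_fbank, n_ceps=13):
    # phi_m: delta-phase spectrum of one frame (positive-frequency bins);
    # mel_fbank: (n_filters x n_bins) mel filter-bank matrix matching phi_m.
    fb = mel_fbank @ np.abs(phi_m)      # filter-bank analysis of |phi_m(k)|
    log_fb = np.log(fb + 1e-10)         # log retained to emphasise peak values
    return dct(log_fb, type=2, norm='ortho')[:n_ceps]

First-order regression ("delta") coefficients would then be appended in the same way as for the MFCC features used in the experiments below.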
IV. EFFECT OF TIME INTERVAL ON OBSERVED PHASE
CHANGE
As shown in the previous section, the Delta-Phase extends
on the Instantaneous Frequency Deviation by removing the
constraint of being a strict instantaneous phase derivative and
instead capturing coarser phase changes over longer step-sized
intervals. The main impact of increasing the time interval is to
broaden the distribution of phase change that may be observed.
Two conflicting effects of this are to improve the ability to
detect sudden changes in the phase while also introducing
the possibility of phase wrapping. This section considers the
influence of the step size and FFT length parameters in this
context, and presents histograms and spectrograms produced
on a sample speech sequence using different parameters. As
the closest work from the literature, particular comparison is
made between the settings used for the IF Deviation Spec-
trum [11] and those used for the experiments in the current
article.
A. Ability to Detect Sudden Phase Changes
As the phase measurement from the FFT is in some sense
an average measure over the frame duration T (here and in
the following, assume an FFT length of L = T is used), it
becomes increasingly difficult to detect material event-based
changes in the phase as D decreases. Consider the case when
a substantial change in the signal has occurred in the new D
samples from one frame to the next due to some underlying
physical event, such as a new speech unit being produced.
Following a sinusoidal analysis model for speech [17], [19],
[20], this may give rise to the birth or death of a sinusoidal
component, causing a rapid shift in the phase of a particular
frequency component, rather than a slowly varying modulation
in its IF.
The ability of such a rapid phase shift to influence the phase
change as measured using two FFT frames depends on the
ratio of D to T, as well as the windowing function. For the
IF Deviation Spectrum [11], only one of these T samples
changes between the two FFTs used to calculate the phase
derivative for each frame, and then a D-sized step is taken
before measuring this again. For example, with T = 512, this
means that 99.8% of the two frames being compared in each
IFD measurement are the same samples, and so any sudden
phase change that occurs over a small range of samples in
the physical signal will undergo a significant averaging effect,
hampering the ability to detect such phase discontinuities. Any
sudden phase change will be further smoothed according to the
tapering of the window function in the time domain.
The ability to measure sudden phase transitions in the
physical signal using phase differences between two FFT’s
may therefore be expected to improve as the proportion of
new samples between the two FFT's (that is, the ratio of D to
T) increases, motivating the use of the more general D-sample
time delta used in calculating the Delta-Phase Spectrum in the
current article.
B. Phase Wrapping Considerations
Contrasting with the above motivation for increasing the
interval for calculating phase change is the possibility of intro-
ducing phase wrapping. To understand this, let us commence
by considering the IF Deviation. IF Deviation measures how
the phase-derivative changes relative to the centre frequency
of a given FFT bin. Following the filter-bank analogy of the
FFT, and neglecting for the moment leakage across bins due
to the non-ideal window response, the possible change in the
IF relative to the bin centre frequency (ω_k rad/s) is necessarily
limited by the bin width. For a frame of length T and taking
an FFT length of T, the frequency bandwidth of an ideal
individual FFT bin is Δω = 2πF_s/T rad/s. If the instantaneous
frequency lies outside the range ω_k ± πF_s/T, the component
would instead occur predominantly in a neighbouring FFT bin.
Over a given time interval, say D samples, the phase change
that may be observed in a given frequency bin at the end
of the interval is therefore limited to ±πD/T with respect to the
phase observed at the start of the interval. For the IF Deviation
Spectrum in [11], F_s = 16000, D = 1 and T = 512 were
used. In this case, the limiting phase change is ≈ ±0.002π
radians for the IF to be within ±15.6 Hz of the FFT bin
centre frequency, and thus still fall within that bin. Phase
wrapping therefore will not occur for the IF Deviation, and
this will continue to be the case for the Delta-Phase as long
as the interval D < T. Considering the parameters used in
experiments in the current article, F_s = 16000, D = 160 and
T = 4096, the limiting phase change within a bin is ≈ ±0.04π
radians.
In practice, due to windowing, the above simplified analysis
will not strictly hold: the windowing main lobe width and side-
lobes mean that a given component will have some influence
over a range of frequency bins. Extending the above analysis,
it may be seen that some phase wrapping can theoretically
occur for a given frequency component in distant frequency
bins that are more than B = T/D bins away from the local bin
for that component. The potential influence of such wrapping
will therefore depend on the windowing sidelobe level at this
frequency bin shift, as well as the strength of the local
frequency component in those bins. For a given frequency
component, as long as D < T, phase wrapping will only occur
as noise in distant FFT bins, progressively affecting fewer bins
and attenuated by the sidelobe level of the windowing as D/T
decreases. IF Deviation represents the limiting case of D = 1,
in which phase wrapping will not occur within the T FFT bins,
although general sidelobe leakage may still introduce noise to
the measured phase change in each bin.
C. Empirical Analysis
From the above analysis it is apparent that the ratio of
the step size to frame length provides a design parameter
controlling the observed distribution of delta phase values,
playing off the ability to detect sudden phase shifts with the
possibility of introducing noise from phase wrapping in distant
FFT bins.
1) Phase Change Histograms: To corroborate this analysis,
Figure 2 plots the histogram of delta phase values obtained
[Figure 2: six histogram panels for (a) D/T = 0.0020, (b) D/T = 0.0312,
(c) D/T = 0.0625, (d) D/T = 0.1250, (e) D/T = 0.2500 and (f) D/T = 1.0000.]
Fig. 2. Histogram of Delta-Phase values (normalised by π radians) for a
sample speech sequence (F_s = 16 kHz) corrupted by 10 dB noise for varying
values of D/T, with L = 512 in each case. The speech segment consists of a
male utterance of "The decking is quarter-inch mahogany marine plywood"
from the TIMIT database [41].
from a sample speech sequence for six different values of D/T
with T fixed at 512 samples. The delta phase is approximately
uniformly distributed over ±π at D/T = 1, and as this ratio
decreases the values become more normally distributed with
decreasing variance. Case (a) shows the distribution for D = 1,
as used to calculate the IF Deviation Spectrum in [11]. The
settings used in the experiments in this paper (D/T = 0.039
using a longer frame of T = 4096 samples) correspond to a
setting between cases (b) and (c). This setting was chosen
to improve the ability to detect sudden phase shifts while
minimising the ability of phase wrapping to significantly affect
the measurements.
2) Spectrographic Comparison with IF Deviation: Fig-
ures 3-5 demonstrate the effect of different processing param-
eters on the Delta-Phase Spectrum. In each case the original
signal, standard magnitude spectrum and the IF Deviation
spectrum [11] are shown for comparison. To facilitate inter-
pretation in terms of phase change, a minor difference is that
the absolute IF Deviation is used directly here, rather than its
reciprocal as proposed in [11].
For a direct comparison with the IF Deviation spectrum
presented in the literature, Figure 3 uses processing parameters
taken from the example in [11]. In this case, a Chebyshev 50 dB
window of length T = 512 (32 ms) and a step size of D = 64
(4 ms) are used. These settings are well suited for calculating
the magnitude spectrum, as shown in Figure 3(b). For the IF
Deviation spectrum in Figure 3(c), a single sample interval
is used to calculate the deviation, while for the Delta-Phase
spectrum in Figure 3(d) the step size D= 64 is used. It
is apparent that the magnitude and phase representations are
correlated, with regions of high magnitude often corresponding
to regions where there is little change in phase, and vice-versa.
As might be expected following the histogram analysis in the
preceding section (in effect, D/T = 0.1250 is used here), the
spectrogram for the Delta-Phase shows a distribution of values
with greater variance than the IF Deviation. This appears as a
more noisy spectrographic representation that makes the finer
structures of the speech less evident for the Delta-Phase than
the IF Deviation.

Fig. 3. Sample audio sequence in the (a) Time-domain and spectrographic
representations of the (b) Magnitude spectrum, (c) Instantaneous Frequency
Deviation spectrum and (d) the Delta-Phase spectrum, using parameters
following [11] to facilitate comparison (F_s = 16000, T = 512, D = 64,
Chebyshev 50 dB window). For (b)-(d) the y-axis shows increasing FFT bin
index (i.e., increasing frequency) and the x-axis shows increasing frame index.
The speech segment consists of a male utterance of "The decking is quarter-
inch mahogany marine plywood" from the TIMIT database [41].

Fig. 4. Sample audio sequence in the (a) Time-domain and Mel-scaled
Filter-bank spectrographic representations of the (b) Magnitude spectrum,
(c) Instantaneous Frequency Deviation spectrum and (d) the Delta-Phase
spectrum, using parameters following [11] to facilitate comparison
(F_s = 16000, T = 512, D = 64, Chebyshev 50 dB window). For (b)-(d) the
y-axis shows increasing filter-bank index (i.e., increasing frequency) and the
x-axis shows increasing frame index.
Figure 4 shows the same sequence using the same pro-
cessing parameters as Figure 3, however the output of 24
Mel-scaled filter-banks are shown in place of the raw FFT
bins. This figure serves simply to illustrate that despite the
finer differences between the IF Deviation and Delta-Phase
spectrum in Figure 3, when considering Mel-scaled filter-bank
outputs as used in extracting features, these differences are less
evident.
The processing parameters used in Figures 3-4 are how-
ever not appropriate for the motivations of the Delta-Phase
proposed in the present article, which is to detect significant
event-based shifts in the phase of sinusoidal components.
The short frame length leads to wider FFT bins than those
desired to focus on individual harmonic components, and the
use of a Chebyshev window sacrifices some ability to detect
discontinuous event-based transitions in the signal (albeit,
while offering lower sidelobe levels).

Fig. 5. Sample audio sequence in the (a) Time-domain and Mel-scaled
Filter-bank spectrographic representations of the (b) Magnitude spectrum,
(c) Instantaneous Frequency Deviation spectrum and (d) the Delta-Phase
spectrum, using parameters as used in subsequent experiments in the current
article (F_s = 16000, T = 4096, D = 160, Rectangular window). For (b)-(d) the
y-axis shows increasing filter-bank index (i.e., increasing frequency) and the
x-axis shows increasing frame index.

Figure 5 shows the
same sequence using the processing parameters used in the
experiments in the current article; that is, a rectangular window
of length T = 4096 (256 ms) and a D = 160 step size (10 ms).
As in Figure 4, the output of 24 Mel-scaled filter-banks are
shown. It is apparent from Figure 5(b) that the longer analysis
window is not an appropriate choice for the magnitude domain.
The IF Deviation in Figure 5(c) also shows little information
with these processing parameters, as might be expected given
that only 1 sample of 4096 has changed in the two FFTs
being used to calculate the deviation (that is, 0.024% of the
frame). The larger frame size examined here means that in
practice averaging effects will hinder the ability to detect any
phase change over a single sample interval, particularly at low
frequencies.
In contrast, Figure 5(d) confirms that by using a longer
frame and increasing the ratio of Dto T, the Delta-Phase
Spectrum is capturing regions of both high and low phase
change in the signal over both time and frequency. These
patterns reveal interesting structure in the underlying signal
that appear complementary to the information traditionally
extracted from the magnitude spectrum, as shown in Fig-
ure 4(b). For the Delta-Phase Spectrum in Figure 5(d), 160
of the samples are changing from frame to frame (that is,
3.9% of the frame). This step size (10ms) also has the benefit
of matching that commonly used in magnitude domain feature
extraction, facilitating fusion of magnitude (MFCC) and phase
(MFDP) domain systems in the following VAD and speaker
recognition experiments.
V. APPLICATION TO VOICE ACTIVITY DETECTION (VAD)
In order to validate the proposed Delta-Phase Spectrum and
Mel-Frequency Delta Phase features derived from it, a first
set of experiments was conducted applying the features for
a simple voice activity, or speech/non-speech, detection task.
Note that the goal of these experiments is simply to validate
the proposed phase representation, rather than to achieve state-
of-the-art VAD performance.

TABLE I
DATABASE NOISE TYPES AND SCENARIOS. THESE FORMAL PARTITIONS OF
THE DATABASE ARE REFERRED TO IN THE TEXT USING ITALICISED
LABELS, SUCH AS Street-City

Type     Scenario 1     Scenario 2
Street   City           Suburb
Car      Windows Up     Windows Down
A. Database
In order to evaluate the proposed MFDP speech features
for the purposes of voice activity detection, a database of
240 hours of noisy speech over 9600 individual files was
constructed through a combination of clean speech and real-
world noise recordings. A comprehensive description of the
database is available in [42], with relevant details summarised
here.
In order to construct the voice activity detection database,
two real-world recordings of at least 30 minutes of typical
background noise were made in each of 4 scenarios, covering
two broad noise types, Car and Street, as shown in Table I.
The two recordings for each scenario were captured at similar
times on separate days to ensure adequate temporal difference
in the environments. In addition to the noise recording itself,
6 swept-sine sweeps were recorded in the Car scenarios in
order to allow the reverberant response to be estimated, such
that speech may be inserted as if it were captured in that
scenario.
For each of the two recordings in each scenario, 200 noisy
speech sequences, equally split between lengths of 60 and 120
seconds, were constructed for each of 6 signal-to-noise ratios
(SNRs), being -10 dB, -5 dB, 0 dB, 5 dB, 10 dB and 15 dB.
These noisy speech sequences were constructed by extracting
a random section of the noise recording of the appropriate
length and adding clean speech sequences chosen randomly
from the TIMIT speech database [41] at the desired SNR.
For the Car scenarios, the clean speech sequences were first
transformed to match the estimated reverberant response of
the noise recording. In order to ensure that speech energy is
consistent between files, the SNR mixing was performed by
adding the inserted speech sequences with an active speech
level of -26 dBov (dB overload, following [43]), after first
scaling the background noise to match the desired SNR. As the
database sequences were constructed, the ground-truth timing
for speech events is known precisely for evaluation of VAD
algorithms.
B. System Description
A Gaussian Mixture Model (GMM) based speech detection
system was used to evaluate the MFDP speech features in
comparison to standard MFCC speech features. GMM-based
systems using MFCC features have been shown to provide a
robust baseline solution across a range of speech classification
problems, including speech/non-speech detection [44], [45].
By extracting features from sub-band energies and learning
statistical models over training examples, a GMM-MFCC
system provides a higher performance baseline than more
traditional VAD systems based on thresholding features such
as broadband frame energy.
The MFCC and MFDP features used in the experi-
ments were 13-dimensional cepstral coefficients, including the
zero’th coefficient, and with first-order regression coefficients
appended (i.e., traditional “delta” features such as in [37]),
making a 26-dimension feature vector in each case. The MFCC
features were calculated using a standard 25 ms Hamming
window, with a 10 ms step size (that is, a rate of 100 fps),
while the MFDP features used a 256 ms rectangular window,
also using a 10 ms step size, following the rationale presented
in previous sections. While larger step sizes could be examined
for the MFDP features, to facilitate comparison and fusion with
MFCCs, the 10 ms step size was maintained in all experiments.
These speech detection experiments were operated under
the assumption that the broad SNR of the target environment
is known, but the specific scenario, or type of noise, is not
known. To this end the six noise levels were divided into three
groups covering two SNRs each, designated as the high (-10
dB, -5 dB), medium (0 dB and 5 dB), and low (10 dB and 15
dB) broad noise levels.
To train speech detection modules under the operating
assumption provided, speech and non-speech GMMs were
trained based on the known ground truth on one set of
scenarios across both noise types, and for each of the three
broad noise levels. The other set of scenarios for each of the
three broad noise levels was then used to calculate speech
scores by taking the difference between the log-likelihoods
given for each feature vector by the speech and non-speech
GMMs. To give an example for sake of clarity, the low noise
Street-City data was tested on models trained using low noise
data from Street-Suburb and Car-Windows Down.
The speech scores obtained in this way were then smoothed
by a 1-second median filter centred on each feature vector to
attenuate short-term variation in favour of the longer term.
The MFCC+MFDP results indicate multi-stream fusion of the
MFCC and MFDP, in which the log-likelihoods of each stream
were combined using addition (equally weighted) prior to the
smoothing median filter.
Speech and non-speech segmentation decisions were made
by comparing the smoothed speech scores to a threshold. This
threshold was estimated by minimising the half total error rate,
calculated as the average of the miss and false-alarm rates, on
a held-out tuning data set. These tuning scores were calculated
similarly to the test scores, but were calculated on the same set
as the GMM parameter estimation, to ensure the final testing
set is unseen to both the GMM training and threshold tuning.
To produce unbiased results for each noise type, the complete
results were generated using 2-fold training and testing, split
according to the scenario numbers indicated in Table I.
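As a rough sketch of the detector just described (not the authors' code; the mixture order and library choices are illustrative assumptions):

import numpy as np
from scipy.ndimage import median_filter
from sklearn.mixture import GaussianMixture

def train_models(speech_feats, nonspeech_feats, n_mix=32):
    # One GMM per class, trained on labelled frames for a broad noise level.
    gmm_s = GaussianMixture(n_components=n_mix, covariance_type='diag').fit(speech_feats)
    gmm_n = GaussianMixture(n_components=n_mix, covariance_type='diag').fit(nonspeech_feats)
    return gmm_s, gmm_n

def speech_scores(feats, gmm_s, gmm_n, fps=100):
    # Per-frame log-likelihood ratio, smoothed by a 1-second median filter.
    llr = gmm_s.score_samples(feats) - gmm_n.score_samples(feats)
    return median_filter(llr, size=fps)

# Multi-stream fusion adds the MFCC and MFDP log-likelihood ratios (equally
# weighted) before smoothing; a frame is then labelled speech when the
# smoothed score exceeds a threshold tuned to minimise HTER = (FAR + MR) / 2
# on held-out tuning data.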
C. Results
Results from the voice activity detection experiments are
presented in Tables II and III for the Car and Street noise
types, respectively. Results are presented in terms of percentage
False Alarm Rate (FAR), Miss Rate (MR, equivalent to False
Rejection Rate) and Half-Total Error Rate (HTER).

TABLE II
VAD RESULTS FOR CAR NOISE CONDITION

SNR            Features     FAR    MR     HTER
10 to 15 dB    MFCC         2.3%   1.3%   1.8%
               MFDP         3.4%   1.3%   2.3%
               MFCC+MFDP    2.6%   1.0%   1.8%
0 to 5 dB      MFCC         2.6%   2.7%   2.6%
               MFDP         4.6%   1.6%   3.1%
               MFCC+MFDP    3.5%   1.1%   2.3%
-10 to -5 dB   MFCC         3.8%   8.9%   6.4%
               MFDP         7%     8.7%   7.8%
               MFCC+MFDP    7.4%   2.1%   4.7%

TABLE III
VAD RESULTS FOR STREET NOISE CONDITION

SNR            Features     FAR    MR     HTER
10 to 15 dB    MFCC         2.4%   1.7%   2.0%
               MFDP         3.4%   1.5%   2.5%
               MFCC+MFDP    2.5%   1.3%   1.9%
0 to 5 dB      MFCC         3.0%   6.6%   4.8%
               MFDP         4.2%   4.1%   4.2%
               MFCC+MFDP    3.4%   2.7%   3%
-10 to -5 dB   MFCC         6.6%   23%    14.8%
               MFDP         5.1%   17.8%  11.5%
               MFCC+MFDP    8.6%   8.9%   8.8%
These demonstrate performance at a particular operating point,
selected to optimise HTER on the training data as explained
above. To show performance across a range of operating
points, the Detection Error Trade-off (DET) plot is shown in
Figure 6 [46].
These results show similar performance is achieved using
the MFCC or MFDP features. In Car noise, the MFCC’s
show a marginal improvement over MFDP’s, and this trend is
reversed for the Street noise. Without seeking to over-interpret
these results, it may be that the MFDP's show benefits in
less stationary noise environments due to the longer analysis
window of 256 ms. Further, the DET plot in Figure 6 shows
that the MFCC features perform better in the high Miss Rate
region (high False Rejection), while MFDP features exhibit
better performance in the high False Alarm region.

Fig. 6. DET plot of GMM-based VAD results over all noise conditions and
levels, using MFCC and MFDP features and their multi-stream fusion.
The two important points to garner from these results are:
first, that results using only phase information are comparable
to those using magnitude only; second, that in both noise
types, and at all noise levels, the multi-stream fusion of the
magnitude-domain MFCC’s and phase-domain MFDP’s yields
significant performance benefits over either in isolation, as
measured by the HTER. Note that while the FAR increases
marginally over MFCC in the fused system, the MR is
significantly improved in each case, simply reflecting the fact
that the operating point is chosen based on HTER.
VI. APPLICATION TO SPEAKER RECOGNITION
The effectiveness of the proposed MFDP features for voice
activity detection was demonstrated in the preceding section,
with the fusion results highlighting the complementary infor-
mation they offer to MFCC features. This section seeks to
further validate the proposed phase representation by investi-
gating whether MFDP features are also able to capture speaker
discriminative information from the phase domain through
their application to the task of speaker recognition. As in the
preceding section, the goal of these experiments is to validate
the proposed phase representation, rather than to demonstrate
state-of-the-art performance.
Speaker recognition is commonly performed using cepstral-
based features derived from the magnitude domain. Partic-
ularly successful in this research domain are MFCC fea-
tures. MFCCs provide state-of-the-art speaker recognition
performance when used in conjunction with GMMs adapted
from a Universal Background Model (UBM) and suitable
session variability compensation techniques [47], [48]. Limited
research has focussed on the application of phase-related
features to speaker recognition due to the belief that the phase
component of speech offers little information relative to the
magnitude domain [49], [50], [51].
A. Experimental Configuration
The comparison of MFDP and MFCC features will be con-
ducted following the well-known NIST Speaker Recognition
Evaluation (SRE) series [52] protocols. Since 1996, NIST
have conducted regular evaluations of speaker recognition
technology by specifying an evaluation protocol and corre-
sponding corpus predominantly consisting of conversational
telephony speech from several hundreds of speakers. The NIST
SRE series has driven state-of-the-art in the area of speaker
recognition research. For these experiments, the 2006 and
2008 NIST SRE data and protocols were used, specifically
the evaluation conditions consisting of 5-minute English-only
telephone conversations. This subset of evaluation conditions
was selected to allow for a clear analysis of the proposed
features without the need to consider additional variability
introduced from microphone, interview or cross-channel trials
available in the corpora.
The two feature sets will be examined in the context of a
GMM Supervector SVM system. MFCC variants of this sys-
tem have demonstrated state-of-the-art performance in recent
SRE’s. The GMM Supervector SVM system [53] combines
robust yet straightforward acoustic modelling in the form of
mean-adapted high-order Gaussian mixture models (GMM)
with more recent discriminative machine learning approaches
through Support Vector Machine (SVM) classification.
In this approach, each utterance, in both training and test-
ing, is first used to estimate a mean-adapted GMM through
maximum a posteriori (MAP) adaptation from a universal
background model (UBM). In this work, gender-dependent,
512-component UBM’s are used for this purpose. This form
of MAP adaptation has been a well-established approach in
speaker recognition for over a decade [47]. The component
mean vectors of the adapted GMM are then concatenated
together to form a single large vector, known as a supervector;
the supervector thus provides a convenient, fixed dimension
representation of each utterance for use within an SVM
classifier.
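A rough sketch of this per-utterance processing is given below, using relevance MAP adaptation of the UBM component means; the relevance factor and helper names are illustrative assumptions, and this is not the implementation of [53], [55].

import numpy as np
from sklearn.mixture import GaussianMixture

def mean_supervector(ubm: GaussianMixture, feats, relevance=16.0):
    # Zeroth- and first-order statistics of the utterance against the UBM.
    post = ubm.predict_proba(feats)            # (n_frames, n_components)
    n_c = post.sum(axis=0)                     # occupancy per component
    f_c = post.T @ feats                       # weighted sums of frames
    # Relevance MAP adaptation of the component means only.
    alpha = (n_c / (n_c + relevance))[:, None]
    means = alpha * (f_c / np.maximum(n_c, 1e-8)[:, None]) + (1.0 - alpha) * ubm.means_
    # Concatenate the adapted means into a single supervector for the SVM.
    return means.reshape(-1)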
Speaker SVM training and classification was performed
using the GMM mean supervector kernel [53]. This kernel
performs a weighted dot-product between the GMM mean
supervectors. Support vector machines are discriminative clas-
sifiers and thus are trained on both positive and negative
examples of a speaker. In the context of NIST evaluations,
there is typically only a single (positive) training example of
a speaker while a substantial number of impostor (negative)
examples are drawn from previous NIST evaluation corpora.
Zero and Test-norm score normalisation was applied to
all scores to reduce the statistical variation observed in
scores [54]. Both normalisation techniques utilise a large set of
impostor speech segments to calculate a set of normalisation
statistics. Zero-norm is a speaker-centric technique in which
a speaker’s scores are scaled by the mean µiand standard
deviation σi, obtained when scoring the impostor cohort
against the speaker model, such that,
score =score µi
σi
(13)
Similarly, test-norm calculates µiand σiby trialling a given
test segment against a set of speaker models trained using the
impostor cohort. In this work, the impostor speech segments
were extracted from the NIST 2004 dataset.
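Both normalisations apply Equation (13) with different cohorts; a minimal illustrative sketch (cohort scoring itself is assumed to be done elsewhere):

import numpy as np

def normalise(raw_score, cohort_scores):
    # Equation (13). Z-norm: cohort_scores are impostor segments scored
    # against the speaker model; T-norm: the test segment scored against a
    # set of impostor speaker models.
    return (raw_score - np.mean(cohort_scores)) / np.std(cohort_scores)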
The MFDP and MFCC features for these experiments were
formed from 12 cepstral coefficients with appended deltas. In
contrast to features used in previous sections, the 0th cepstrum
was removed from the MFDP features to match the existing
MFCC configuration. This was empirically found to provide
marginal improvements to MFDP-based speaker recognition.
The reader is referred to [55] for more details of the config-
uration and implementation of the GMM Supervector SVM
system used in these experiments.
Two well-established techniques for robust speaker ver-
ification were progressively incorporated into the baseline
configuration described above in order to observe whether they
could offer similar benefits to the proposed MFDP's as they do to
the magnitude-based features on which they were developed.

Fig. 7. DET plot of 1-sided, English-only trials from the NIST 2006 speaker
recognition evaluation using MFDP and MFCC features.
The first technique, feature-warping [56], applies short-time
Gaussianisation to the feature vector stream extracted from an
utterance using a sliding window to counteract the adverse
effects of channel mismatch and additive noise. A window
of 5 seconds is utilised in this work. The second technique,
Nuisance Attribute Projection (NAP) [57], aims to reduce the
adverse effects of inter-session variation in the SVM kernel
space. Inter-session variation, such as differences in channel
and background noise, is well known as a major source
of error in speaker recognition. NAP addresses this issue
by removing the directions of greatest inter-session variation
from the supervector space. Based on empirical results, forty
directions were removed from the MFCC supervector space
and twenty dimensions in the case of the MFDP configuration.
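As an illustration of the first of these techniques, an unoptimised sketch of short-time Gaussianisation over a 5-second window at 100 frames per second is given below; this is a simplified stand-in, not the implementation of [56].

import numpy as np
from scipy.stats import norm, rankdata

def feature_warp(feats, win=500):
    # Map each coefficient's rank within a sliding window (500 frames = 5 s
    # at 100 frames per second) to the corresponding standard-normal quantile.
    n_frames, n_dims = feats.shape
    warped = np.empty_like(feats, dtype=float)
    half = win // 2
    for t in range(n_frames):
        lo, hi = max(0, t - half), min(n_frames, t + half + 1)
        for j in range(n_dims):
            rank = rankdata(feats[lo:hi, j])[t - lo]
            warped[t, j] = norm.ppf((rank - 0.5) / (hi - lo))
    return warped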
Evaluation of the feature sets was performed using the
English-only trials from the 1-sided training condition of the
NIST 2006 and 2008 speaker recognition evaluation (SRE).
Classification performance was measured in terms of mini-
mum decision cost function (DCF) and equal error rate (EER),
as defined in the NIST SRE protocol [52]. Score-level fusion
was implemented using the FoCal toolkit [58] to optimise
linear logistic regression. The fusion weights for the NIST 2006
SRE trials were learned using scores from the 2008 SRE,
and similarly, 2008 SRE fusion weights were learned from
the 2006 SRE scores. This approach to fusion ensured that
the fused weights were not optimistically biased for a given
corpus.
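This cross-corpus fusion can be sketched as follows; the sketch uses scikit-learn as a simplified stand-in for the FoCal toolkit (it learns the same linear combination of system scores but omits FoCal's explicit calibration), and the score arrays are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fusion(dev_mfcc_scores, dev_mfdp_scores, dev_labels):
    # Learn linear fusion weights on held-out development trials
    # (e.g. 2008 SRE scores when fusing the 2006 SRE, and vice versa).
    X = np.column_stack([dev_mfcc_scores, dev_mfdp_scores])
    return LogisticRegression().fit(X, dev_labels)

def fuse_scores(model, mfcc_scores, mfdp_scores):
    # Fused score: w0 + w1 * s_mfcc + w2 * s_mfdp.
    X = np.column_stack([mfcc_scores, mfdp_scores])
    return model.decision_function(X)
```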
B. Results
Figure 7 depicts the DET curves from the English-only
trials on the NIST 2006 SRE involving MFDP and MFCC
features and their score-level fusion without incorporating
feature-warping or NAP. The performance of the proposed
MFDP features demonstrates their effectiveness in capturing
speaker discriminative information from the phase domain. It
is also clear from the DET plot that the two feature sets are highly complementary when fused.
Table IV details specific operating performance statistics
from trials on the NIST 2006 and 2008 SRE. Several con-
figurations are presented for a thorough analysis of the pro-
posed MFDP features: a Baseline, Feature-Warping, NAP
and Feature-Warping+NAP configuration, the last of which
amounts to a state-of-the-art configuration developed for
MFCC features. The objective of these experiments was to
observe whether techniques developed for magnitude-based
features were also suited to the proposed MFDP feature set.
The baseline results in Table IV indicate that both MFDP
and MFCC features provided broadly comparable performance
on the NIST 2006 SRE (these results correspond to the DET
curve in Figure 7). On the more challenging NIST 2008 SRE,
MFCCs offered a relative gain of 13% and 15% in minimum
DCF and EER, respectively, over the proposed MFDP results.
Score-level fusion of the baseline MFDP and MFCC configu-
rations resulted in a relative improvement of 22% and 11% in
the EER of the 2006 and 2008 corpora, respectively, indicating
that the MFDP features offer considerable complementary information to MFCCs in the baseline configuration.
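For example, using the baseline rows of Table IV, the fused 2006 EER of 8.45% against the MFCC EER of 10.89% corresponds to (10.89 − 8.45)/10.89 ≈ 22%, and the fused 2008 EER of 11.45% against 12.92% to approximately 11%.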
The introduction of feature-warping to the baseline system
provided a relative improvement of 7-17% in MFDP perfor-
mance statistics across the NIST corpora and a significant
relative gain of 50% in the MFCC-based results. Interestingly,
MFDP features provided little complementary information to the feature-warped MFCCs. The large discrepancy in the
gains offered by feature-warping may be explained by the
relatively large window used during MFDP feature extraction.
Because MFDP extraction uses a 256 ms analysis window, compared to 32 ms for the MFCC feature stream, a relatively high correlation between sequential features is expected, potentially reducing the effectiveness or necessity of feature-warping. An alternative explanation can be derived
from the objective of the feature-warping process. Specifically,
MFDP features may be inherently more robust to channel
distortions and additive noise than the MFCC feature set. This
hypothesis is explored through the application of inter-session
variability compensation via NAP.
The application of NAP to the baseline MFDP configuration provided significant relative improvements, in excess of 32%, in performance statistics across the NIST corpora. Similarly,
the MFCC configuration obtained a 50% relative improvement
over baseline results from the application of NAP. As with the
baseline results, the fusion of the NAP systems provided a further relative gain of 23% and 13% in EER over the MFCC configuration alone in the NIST 2006 and 2008 SRE, respectively.
The final configuration employing feature-warping and NAP
represents a state-of-the-art SVM configuration developed on
MFCC features. MFCC-based results in Table IV indicate
that NAP provided an average relative gain of 34% in min-
imum DCF and 27% in EER across the evaluated corpora
over the use of feature-warping alone. Comparably, MFDP
results obtained an average relative improvement of 26% and 21% in minimum DCF and EER, respectively, from the application of NAP.

TABLE IV
Minimum DCF and EER obtained from 1-sided, English-only NIST 2006 and 2008 speaker recognition evaluations when using MFCC and MFDP feature sets for SVM training and classification.

                          NIST 2006            NIST 2008
  Features                Min. DCF   EER       Min. DCF   EER
  Baseline
    MFDP                  .0429      11.10%    .0573      15.24%
    MFCC                  .0400      10.89%    .0498      12.92%
    MFCC+MFDP             .0346       8.45%    .0465      11.45%
  Feature-Warping
    MFDP                  .0398       9.37%    .0508      12.69%
    MFCC                  .0188       4.55%    .0259       6.34%
    MFCC+MFDP             .0179       4.54%    .0258       6.18%
  NAP
    MFDP                  .0273       6.46%    .0387       9.81%
    MFCC                  .0184       4.17%    .0245       6.02%
    MFCC+MFDP             .0180       3.20%    .0232       5.19%
  Feature-Warping + NAP
    MFDP                  .0292       6.28%    .0398       9.29%
    MFCC                  .0130       2.87%    .0190       4.61%
    MFCC+MFDP             .0125       2.72%    .0185       4.37%

Interestingly, the optimised number of
nuisance directions removed via NAP was only twenty in
the case of MFDP features and forty for the MFCC system.
Comparable performance gains from NAP suggest that MFDP
features exhibit less inter-session variation than MFCCs and
allow such variation to be robustly estimated using fewer
directions. Such a trait is highly desired of features for speaker
verification as inter-session variability continues to be a major
cause of classification error. The phase-based MFDP features
provided reasonable classification performance on the SRE
task with the magnitude-based MFCC features offering relative
improvements of more than 50% in the evaluation of both
corpora. Score-level fusion of both configurations provided
the best performance statistics with relative improvements of
up to 5% being obtained over the MFCC configuration in
the 2006 SRE. Similar improvements were observed in the
2008 SRE trials through fusion. This demonstrates that the
MFDP features extract some speaker specific information from
the phase domain that is complementary to magnitude-based
features.
By operating at a level comparable to the MFCC feature set in the baseline configuration, offering robustness to inter-session variation, and providing complementary information to commonly employed MFCC features, the proposed MFDP feature set shows strong potential for further application in
the field of speaker recognition research. Building on these
preliminary experimental results, investigations into SVM ker-
nels tailored to the MFDP feature set and their application
to GMM-based classification are likely to better exploit the
speaker discriminative information found in MFDP features.
VII. CONCLUSION
This paper has revisited the use of the phase domain in
short-time Fourier analysis of the speech signal, highlighting
the factors that must be considered and compensated before
the phase can be meaningfully analysed. The Delta-Phase
Spectrum, computed at a frame advance of D samples, was proposed as a simple phase domain representation that allows consistent comparison over multiple frames and across
sequences, while also minimising practical issues associated
with phase wrapping. The Delta-Phase extends the Instanta-
neous Frequency Deviation, removing the constraint of being
a strict instantaneous phase derivative and instead capturing
coarser changes in the phase structure of the signal from one
frame to the next.
Building upon this representation, it was shown that Mel-
Frequency Delta Phase features extracted purely from the
phase domain could be used to achieve broadly similar per-
formance to the common magnitude-domain Mel-Frequency
Cepstral Coefficients for distinguishing speech from noise,
and also for distinguishing between voices of different people.
Further, it was shown that principled fusion of the magnitude
and phase domain information could achieve performance
improvements over either in isolation.
There remains much scope for research building upon this
work, both in optimising phase domain feature representations
and models, and in understanding whether these findings can
be applied to automatic speech recognition, which needs to
capture shorter-term units than the voice activity and speaker
recognition applications considered here.
ACKNOWLEDGEMENTS
The authors wish to thank the reviewers for their valuable comments, which have enabled us to improve the quality of the manuscript, as well as Dan Ellis of Columbia University for his help in clarifying later revisions of this article. The
research was supported in part by the Australian Research
Council (ARC) Discovery Grant DP0877835.
REFERENCES
[1] H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, pp. 578–589, 1994.
[2] H. Hermansky and S. Sharma, “TRAPS-Classifiers of temporal pat-
terns,” in Fifth International Conference on Spoken Language Process-
ing. ISCA, 1998.
[3] S. Greenberg and T. Arai, “What are the essential cues for understanding
spoken language,” IEICE Transactions on Information and Systems, vol.
E87-D, no. 5, pp. 1059–1070, May 2004.
[4] N. Kanedera, T. Arai, H. Hermansky, and M. Pavel, “On the relative
importance of various components of the modulation spectrum for
automatic speech recognition,” Speech Communication, vol. 28, no. 1,
pp. 43–55, 1999.
[5] S. Greenberg and T. Arai, “The relation between speech intelligibility
and the complex modulation spectrum,” in Seventh European Conference
on Speech Communication and Technology. ISCA, 2001.
[6] R. Schluter and H. Ney, “Using phase spectrum information for
improved speech recognition performance,” in Proceedings of the IEEE
International Conference on Acoustics, Speech, and Signal Processing,
2001, vol. 1.
[7] S. Ikbal, H. Misra, and H. Bourlard, “Phase autocorrelation (PAC) derived robust speech features,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, Proceedings, 2003, vol. 2.
[8] L. Wang, S. Ohtsuka, and S. Nakagawa, “High improvement of
speaker identification and verification by combining MFCC and phase
information,” in Proceedings of the 2009 IEEE International Conference
on Acoustics, Speech and Signal Processing. IEEE Computer Society,
2009, pp. 4529–4532.
[9] H. A. Murthy and V. Gadde, “The modified group delay function and its application to phoneme recognition,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, Proceedings, 2003, vol. 1.
[10] Y. Wang, J. Hansen, G.K. Allu, and R. Kumaresan, “Average instanta-
neous frequency (AIF) and average log-envelopes (ALE) for ASR with
the AURORA 2 Database,” in Eighth European Conference on Speech
Communication and Technology. ISCA, 2003.
[11] A. P. Stark and K. K. Paliwal, “Speech analysis using instantaneous frequency deviation,” in Proceedings Interspeech 2008, pp. 2602–2605, 2008.
[12] L. D. Alsteris and K. K. Paliwal, “Short-time phase spectrum in speech processing: A review and some experimental results,” Digital Signal Processing, vol. 17, no. 3, pp. 578–616, 2007.
[13] D.C. Ghiglia and M.D. Pritt, Two-dimensional phase unwrapping:
theory, algorithms, and software, Wiley New York, 1998.
[14] J. Tribolet, “A new phase unwrapping algorithm,” IEEE Transactions
on Acoustics, Speech and Signal Processing, vol. 25, no. 2, pp. 170–177,
1977.
[15] G. Nico and J. Fortuny, “Using the matrix pencil method to solve phase
unwrapping,” IEEE Transactions on Signal Processing, vol. 51, no. 3,
pp. 886–888, 2003.
[16] L.R. Rabiner and R.W. Schafer, Digital processing of speech signals,
Prentice-hall, 1978.
[17] R. McAulay and T. Quatieri, “Speech analysis/synthesis based on a
sinusoidal representation,” IEEE Transactions on Acoustics, Speech and
Signal Processing, vol. 34, no. 4, pp. 744–754, 1986.
[18] T. Quatieri, Discrete-Time Speech Signal Processing: Principles and
Practice, Prentice Hall, 2002.
[19] T.N. Sainath, Acoustic landmark detection and segmentation using
the Mcaulay-Quatieri sinusoidal model, Ph.D. thesis, Massachusetts
Institute of Technology, 2005.
[20] T.N. Sainath and T.J. Hazen, “A sinusoidal model approach to acoustic
landmark detection and segmentation for robust segment-based speech
recognition,” in Acoustics, Speech and Signal Processing, 2006. ICASSP
2006 Proceedings. 2006 IEEE International Conference on. IEEE, 2006,
vol. 1.
[21] S. Davis and P. Mermelstein, “Comparison of parametric representations
for monosyllabic word recognition in continuously spoken sentences,”
IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28,
no. 4, pp. 357–366, 1980.
[22] J. Allen, “Short term spectral analysis, synthesis, and modification by
discrete Fourier transform,” IEEE Transactions on Acoustics, Speech
and Signal Processing, vol. 25, no. 3, pp. 235–238, 1977.
[23] J. B. Allen and L. R. Rabiner, “A unified approach to short-time Fourier analysis and synthesis,” Proceedings of the IEEE, vol. 65, no. 11, pp. 1558–1564, 1977.
[24] M. Portnoff, “Time-frequency representation of digital signals and
systems based on short-time Fourier analysis,” IEEE Transactions on
Acoustics, Speech and Signal Processing, vol. 28, no. 1, pp. 55–69,
1980.
[25] F. J. Harris, “On the use of windows for harmonic analysis with the discrete Fourier transform,” Proceedings of the IEEE, vol. 66, no. 1, pp. 51–83, 1978.
[26] N. Reddy and M. Swamy, “Derivative of phase spectrum of truncated autoregressive signals,” IEEE Transactions on Circuits and Systems, vol. 32, no. 6, pp. 616–618, 1985.
[27] L.D. Alsteris and K.K. Paliwal, “Importance of window shape for phase-
only reconstruction of speech,” in Proc. International Conf. Acoustics,
Speech, Signal Processing, 2004, pp. 573–576.
[28] L. Liu, J. He, and G. Palm, “Effects of phase on the perception of
intervocalic stop consonants,” Speech Communication, vol. 22, no. 4,
pp. 403–417, 1997.
[29] M. R. Schroeder, “Models of hearing,” Proceedings of the IEEE, vol. 63, no. 9, pp. 1332–1350, 1975.
[30] D. O’Shaughnessy, “Timing patterns in fluent and disfluent spontaneous
speech,” in IEEE International Conference on Acoustics, Speech, and
Signal Processing, 1995, pp. 600–603.
[31] D. Friedman, “Instantaneous-frequency distribution vs. time: An in-
terpretation of the phase structure of speech,” in IEEE International
Conference on Acoustics, Speech, and Signal Processing, 1985, vol. 10.
[32] S.A. Fulop and K. Fitz, “Algorithms for computing the time-corrected
instantaneous frequency (reassigned) spectrogram, with applications,”
The Journal of the Acoustical Society of America, vol. 119, pp. 360–
371, 2006.
[33] T. Abe, T. Kobayashi, and S. Imai, “The IF spectrogram: a new spectral representation,” in Proceedings of ASVA 97, 1997, pp. 423–430.
[34] F. Charpentier, “Pitch detection using the short-term phase spectrum,”
in Acoustics, Speech, and Signal Processing, IEEE International Con-
ference on., 1986, vol. 11.
[35] F. Auger and P. Flandrin, “Improving the readability of time-frequency and time-scale representations by the reassignment method,” IEEE Transactions on Signal Processing, vol. 43, no. 5, pp. 1068–1089, 1995.
[36] S. Kay, “A fast and accurate single frequency estimator,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, no. 12, pp. 1987–1990, Dec. 1989.
[37] S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason,
V. Valtchev, and P. Woodland, “The HTK book,” Cambridge University,
vol. 1996, 1995.
[38] H. Fletcher, “Auditory patterns,” Reviews of Modern Physics, vol. 12,
no. 1, pp. 47–65, 1940.
[39] A. V. Oppenheim and R. W. Schafer, “From frequency to quefrency: A history of the cepstrum,” IEEE Signal Processing Magazine, vol. 21, no. 5, pp. 95–106, 2004.
[40] B.P. Bogert, M.J.R. Healy, and J.W. Tukey, “The quefrency alanysis
of time series for echoes: Cepstrum, pseudo-autocovariance, cross-
cepstrum and saphe cracking,” in Proceedings of the Symposium on
Time Series Analysis, 1963, pp. 209–243.
[41] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, “DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM,” NTIS order number PB91-100354, 1993.
[42] David Dean, Sridha Sridharan, Robert Vogt, and Michael Mason, “The
QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection
algorithms,” in Interspeech, Makuhari, Japan, September 2010.
[43] ITU-T Recommendation P.830, “Subjective performance assessment of telephone-band and wideband digital codecs,” 1996.
[44] I. Shafran and R. Rose, “Robust speech detection and segmentation for
real-time ASR applications,” in Proceedings of International Conference
on Acoustics, Speech, and Signal Processing, 2003, vol. 1, pp. 432–435.
[45] C. Wooters and M. Huijbregts, “The ICSI RT07s speaker diarization
system,” Lecture Notes in Computer Science, vol. 4625, pp. 509–519,
2008.
[46] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki,
“The DET curve in assessment of detection task performance,” in
Fifth European Conference on Speech Communication and Technology.
Citeseer, 1997.
[47] D.A. Reynolds, T.F. Quatieri, and R.B. Dunn, “Speaker verification
using adapted Gaussian mixture models,” Digital Signal Processing,
vol. 10, no. 1-3, pp. 19–41, 2000.
[48] R. Vogt and S. Sridharan, “Explicit modelling of session variability for speaker verification,” Computer Speech & Language, vol. 22, no. 1, pp. 17–38, 2008.
[49] H.A. Murthy and V.R.R. Gadde, “The modified group delay function
and its application to phoneme recognition,” in Proc. IEEE International
Conference on Acoustics, Speech and Signal Processing, 2003, vol. 1,
pp. 68–71.
[50] K.S.R. Murty and B. Yegnanarayana, “Combining evidence from
residual phase and MFCC features for speaker recognition,” IEEE Signal
Processing Letters, vol. 13, no. 1, pp. 52–55, 2006.
[51] T. Kinnunen and H. Li, “An overview of text-independent speaker recognition: From features to supervectors,” Speech Communication, vol. 52, no. 1, pp. 12–40, 2010.
[52] National Institute of Standards and Technology, “The NIST year
2006 speaker recognition evaluation plan,” 2006, Available from:
http://www.itl.nist.gov/iad/mig/tests/sre/2006/sre-06 evalplan-v9.pdf.
[53] W. M. Campbell, D. E. Sturim, D. A. Reynolds, and A. Solomonoff, “SVM based speaker verification using a GMM supervector kernel and NAP variability compensation,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, May 2006, vol. 1, pp. 97–100.
[54] R. Auckenthaler, M. Carey, and H. Lloyd-Thomas, “Score normalization for text-independent speaker verification systems,” Digital Signal Processing, vol. 10, no. 1, pp. 42–54, 2000.
[55] M. McLaren, R. Vogt, B. Baker, and S. Sridharan, “A comparison
of session variability compensation techniques for SVM-based speaker
recognition,” in Proc. Interspeech, 2007, pp. 790–793.
[56] J. Pelecanos and S. Sridharan, “Feature warping for robust speaker
verification,” A Speaker Odyssey, The Speaker Recognition Workshop,
vol. 2001, pp. 213–218, 2001.
[57] A. Solomonoff, C. Quillen, and W. M. Campbell, “Channel compensation for SVM speaker recognition,” in Odyssey: The Speaker and Language Recognition Workshop, 2004, pp. 57–62.
[58] N. Brummer, “FoCal: Tools for fusion and calibration of automatic
speaker detection systems,” July 2005.
Iain McCowan (M’97) received the B.E. and
B.InfoTech. from the Queensland University of
Technology (QUT), Brisbane, in 1996. In 2001, he
completed his PhD with the Research Concentration
in Speech, Audio and Video Technology at QUT,
including a period of research at France Telecom
R&D. In 2001 he joined the IDIAP Research Insti-
tute, Switzerland, progressing to the post of Senior
Researcher in 2003. While at IDIAP, he worked
on a number of applied research projects in the
areas of automatic speech recognition, content-based
multimedia retrieval, multimodal event recognition and modeling of human
interactions. From 2005-2008 he was with the CSIRO eHealth Research
Centre, Brisbane as Project Leader in the area of multimedia content analysis.
In 2008 he founded Dev-Audio Pty Ltd to commercialise microphone array
technology. He holds an adjunct appointment as Associate Professor at QUT
in Brisbane.
David Dean holds Bachelor degrees in Engineering
(with Honours) and Information Technology. He
completed his PhD programme in 2008 in the area
of audio-visual speech technology with his disserta-
tion entitled Synchronous HMMs for Audio-Visual
Speech Processing. As a post-doctoral fellow of the
Speech, Audio, Image and Video Technology pro-
gram at the Queensland University of Technology,
his research has focused on both acoustic and audio-
visual speech processing with a recent focus on
speech detection and speaker verification.
Mitchell McLaren received his PhD with the
Speech, Audio, Image and Video Technologies
(SAIVT) group at the Queensland University of Tech-
nology (QUT), Brisbane, Australia, in 2010. He
received his BCompSysEng also from QUT in 2006.
Mitchell has been with the Centre for Language and
Speech Technology (CLST) at Radboud University
Nijmegen, The Netherlands, since 2010 where he is
currently in a post-doctoral role. In 2007, he was a
visiting intern within the Laboratoire Informatique
D’Avignon in Avignon, France. His PhD research
concentrated on speaker verification using support vector machine techniques.
Mitchell was awarded the ‘Best Student Paper Award’ at Interspeech 2008 and
the ‘IEEE 2009 Spoken Language Processing Student Grant’ at ICASSP 2009.
Robert Vogt received his PhD degree in electrical
engineering at the Queensland University of Tech-
nology (QUT), Brisbane, Australia, in 2006, and BEng/BInfTech degrees, also from QUT, in 2002.
Robert has been with the Speech, Audio, Image
and Video Technologies (SAIVT) group at QUT
since 2002 where he is currently a research fellow.
His research interests include speaker recognition,
speaker diarisation and spoken term detection. Dur-
ing his time with the SAIVT group, Robert has
participated in the successful commercialisation of
speech research outcomes and helped secure several large grants through
competitive funding schemes. In 2008, he was invited to participate in the
robust speaker recognition stream at the CSLP Summer Workshop, hosted at
Johns Hopkins University, Baltimore, MD.
Professor Sridha Sridharan has a BSc (Electrical
Engineering) degree and obtained an MSc (Communication Engineering) degree from the University of Manchester Institute of Science and Technology (UMIST), UK, and a PhD degree in the area of Signal Processing from the University of New South Wales,
Australia. He is a Senior Member of the Institute of
Electrical and Electronic Engineers - IEEE (USA).
He is currently with the Queensland University of
Technology (QUT) where he is a full Professor in the
School of Engineering Systems. Professor Sridharan
is the Deputy Director of the Information Security Institute and the Leader
of the Research Program in Speech, Audio, Image and Video Technologies at
QUT. He has published over 300 papers consisting of publications in journals
and in refereed international conferences in the areas of Image and Speech
technologies during the period 1990–2010. During this period he has also
graduated 24 PhD students as their Principal Supervisor and 15 PhD students
as their Associate Supervisor in the areas of Image and Speech technologies.
Prof Sridharan has also received a number of research grants from various
funding bodies including Commonwealth competitive funding schemes such as the Australian Research Council (ARC) and the National Security Science
and Technology (NSST) unit. Several of his research outcomes have been
commercialised.