QUT Digital Repository:
http://eprints.qut.edu.au/

This is the accepted version of the following journal article:
McCowan, Iain, Dean, David B., McLaren, Mitchell L., Vogt, Robert J., & Sridharan, Sridha (2011) The delta-phase spectrum with application to voice activity detection and speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing.

© Copyright 2011 IEEE
The Delta-Phase Spectrum with Application to
Voice Activity Detection and Speaker Recognition
Iain McCowan, Member, IEEE, David Dean, Member, IEEE, Mitchell McLaren, Student Member, IEEE,
Robert Vogt, Member, IEEE, and Sridha Sridharan, Senior Member, IEEE
Abstract—For several reasons, the Fourier phase domain is
less favoured than the magnitude domain in signal process-
ing and modelling of speech. To correctly analyse the phase,
several factors must be considered and compensated, includ-
ing the effect of the step size, windowing function and other
processing parameters. Building on a review of these factors,
this paper investigates a spectral representation based on the
Instantaneous Frequency Deviation, but in which the step size
between processing frames is used in calculating phase changes,
rather than the traditional single sample interval. Reflecting
these longer intervals, the term Delta-Phase Spectrum is used
to distinguish this from instantaneous derivatives. Experiments
show that mel-frequency cepstral coefficient features derived
from the Delta-Phase Spectrum (termed Mel-Frequency Delta-
Phase features) can produce broadly similar performance to
equivalent magnitude domain features for both voice activity
detection and speaker recognition tasks. Further, it is shown
that the fusion of the magnitude and phase representations yields
performance benefits over either in isolation.
Index Terms—phase, instantaneous frequency, speech analysis,
voice activity detection, speaker recognition.
I. INTRODUCTION
Most speech analysis focuses on features derived from the
signal’s magnitude spectrum, with the phase spectrum dis-
carded. This is due both to the mathematical difficulty of analysing
phase as a function, and to psychoacoustic and signal
processing experimental results that have rarely shown the
phase to provide any empirical benefit over magnitude-only
features. While well motivated, however, this still effectively
discards half of the information present in the original signal.
While this discarded information may mostly be redundant
in low noise conditions, when the noise energy becomes
comparable to the signal energy, sources of discriminative
information that may prove complementary to the magnitude
spectrum are desirable.
Many efforts to improve the robustness and discriminative
ability of speech features have focussed on the importance
of encoding temporal information in the feature extraction
process, such as RASTA filtering of spectral trajectories [1],
temporal pattern (TRAPS) classifiers [2], and the modulation
Copyright (c) 2011 IEEE. Personal use of this material is permitted.
However, permission to use this material for any other purposes must be
obtained from the IEEE by sending a request to pubs-permissions@ieee.org.
The authors are with Speech and Audio Research Laboratory, Queensland
University of Technology, 2 George St. Brisbane, QLD, Australia. Phone: +61
7 3138 1608, Fax: +61 7 3138 1516
Mitchell McLaren is also with Centre for Language and Speech Technology,
Radboud University, PO Box 9102, 6500HC Nijmegen, The Netherlands.
Email: iain@ieee.org, ddean@ieee.org, m.mclaren@let.ru.nl,
r.vogt@qut.edu.au, s.sridharan@qut.edu.au
spectrum [3], [4], [5]. Given the Fourier phase domain en-
codes relative timing information between different spectral
components, interest in its use has increased in recent years.
Different approaches have included estimating phase changes
from an interference model [6], using the phase of the signal
autocorrelation at different lags [7], measuring relative phase
difference between frequencies [8], and deriving features based
on the group delay [9] and instantaneous frequency [10], [11].
A recent review of the use of phase information in speech
processing, however, indicates that broadly effective phase-
domain features remain elusive [12].
The main difficulty associated with extracting speech fea-
tures from the phase spectrum is the ambiguity that exists
between angles separated by multiples of 2π radians. While
the principal phase spectrum can be obtained by choosing the
phase angle to lie between ±π, this choice is arbitrary and
results in regular discontinuities from circular wrapping of
values considered over time or frequency. Phase unwrapping
may be performed to restore a continuous phase spectrum
for analysis, but consistent unwrapping is difficult, relying on
different heuristics in practice [12], [13], [14], [15]. This dif-
ficulty in obtaining the phase as a continuous function causes
both analytical problems as well as modeling difficulties due
to inconsistencies in the representation across frames.
This paper commences with a discussion of practical issues
that must be considered in analysing phase domain information
from the short-time Fourier transform (STFT). While works
can be found in the speech signal processing literature that
discuss individual issues to varying degrees, the literature on
these details is sparse. A first contribution of this paper is
therefore to provide a tutorial introduction and brief literature
review of practicalities in dealing with short-time Fourier
phase in discrete-time frame-based processing algorithms. In
particular, compensation must be made for the inter-frame
time step and the effect of the windowing function before the
phase spectrum can be meaningfully analysed. Another issue
concerns the lack of a common temporal origin when making
comparisons of the phase spectrum over different sequences,
such as when developing statistical models of speech. Finally,
there is a need to select processing parameters, such as frame
size and window function, that are appropriate for analysing
phase information, rather than naively applying parameters
that work well for the magnitude domain.
Following this review, this paper investigates the use of
phase changes between analysis frames as a representation that
can be consistently analysed both within and across sequences.
As a temporal difference in phase values, this representation
is equivalent to the Instantaneous Frequency Deviation (IFD)
spectrum [11], [16]; however rather than estimating the instan-
taneous derivative using successive samples, the phase delta is
analysed over a larger time delta corresponding to the inter-
frame step size. As there is no intention to measure the formal
derivative (that is, the limit as the time interval approaches
zero or equivalently single sample difference in its digital
approximation), in order to distinguish from the Instantaneous
Frequency Deviation as commonly computed and used, the
term Delta-Phase Spectrum is used to describe the quantity
used in this article.
The desire to analyse phase differences over frame step
intervals is motivated by the success of approaches to model
the speech signal as the birth and death of individual sinusoidal
components, each lasting several short-term frames [17], [18],
[19], [20]. Measuring the simple difference in phase in narrow
frequency bins across step-sized intervals may effectively
capture information about timing and transitions across the
spectrum and between speech units, such as phonemes, syl-
lables and words, potentially leading to useful features for
detecting voice activity in noise, or distinguishing voices. In
order to demonstrate use of the Delta-Phase Spectrum in prac-
tice, therefore, these two applications are investigated. First,
a simple Gaussian Mixture Model (GMM) based Voice Ac-
tivity Detection (VAD) system is evaluated in different noise
conditions in Section V. Mel-frequency cepstral coefficient
features derived from the Delta-Phase Spectrum, termed Mel-
Frequency Delta-Phase (MFDP) features, are compared to and
combined with standard Mel-Frequency Cepstral Coefficient
(MFCC) features derived from the magnitude spectrum [21].
Similarly, the effectiveness of MFDP features is evaluated for
application to speaker recognition in Section VI.
II. A REVIEW OF IMPLEMENTATION PRACTICALITIES FOR
THE PHASE DOMAIN
A. Short-Time Fourier Analysis of Speech Signals
While short-time Fourier analysis of speech is a well-known
technique, in established use for over 30 years [22], [23], [24],
a brief review is provided here as the basis for the subsequent
discussion.
The short-time Discrete Fourier Transform (DFT) is defined
as:

X_m(k) = \sum_{n=-\infty}^{\infty} w(n - mD)\, x(n)\, e^{-j\omega_k n} \qquad (1)

where m is the frame index, w(n) is a causal window of length
T (i.e., zero-valued outside the range 0 ≤ n ≤ T − 1), D is
the number of samples between successive analysis frames
(the step size, with D ≤ T), and ω_k = 2πk/L, where L is the
number of analysis frequencies being considered in the DFT
(with L ≥ T).
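For concreteness, a minimal numpy sketch of this frame-based analysis is given below; it is illustrative only and not code from the paper, with the function and argument names assumptions. Note that a direct frame-by-frame implementation of this kind resets each frame's time origin to its first sample, yielding the running form discussed in Section II-D; the compensations of Sections II-C and II-D are then required before the phase can be compared across frames.

import numpy as np

def running_stft(x, window, D, L):
    # Frame-based analysis: hop of D samples, frame length T = len(window),
    # L-point DFT (L >= T, zero-padded). Each frame is indexed from its own
    # first sample, so this computes the "running" form discussed in
    # Section II-D rather than Equation (1) directly.
    T = len(window)
    n_frames = 1 + (len(x) - T) // D
    X = np.empty((n_frames, L), dtype=complex)
    for m in range(n_frames):
        X[m] = np.fft.fft(window * x[m * D : m * D + T], n=L)
    return X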
B. Selection of Processing Parameters
The above equation can be interpreted as shifting a short-
time window function w(n)through progressive D-sized de-
lays over the signal x(n), to obtain successive T-length frames
for analysis. Implementation therefore depends on appropriate
choice of the window function, the step size (that is, the frame
rate) and the frame length.
The window function w(n)is necessary to impose a finite
extent on the signal being analysed. Important considerations
include minimising spectral leakage through effective tapering
(enforcing periodicity in the window length) and the ability to
resolve frequency components (effective window bandwidth
and sidelobe level). Different windowing functions are anal-
ysed in [25]. While windows with smooth tapering, such as
the Hamming window, are commonly used in analysing the
magnitude spectrum over short-time segments, some studies
have shown that the rectangular window is more appropriate
when analysing the phase domain [26], [27], [12].
Speech is generally considered to be approximately
piecewise-stationary over a period of approximately 20 mil-
liseconds; however it is noted that some sounds are stationary
over longer or shorter durations, or may be non-stationary.
The choice of analysis frame (that is, window) length T is a
trade-off between desired frequency resolution and temporal
resolution - a longer frame gives better frequency resolution,
however possibly at the expense of blurring out more rapid
speech events. A window duration of between 16-32 ms is
often used in analysis of the speech magnitude spectrum as
an effective balance between these practical considerations.
Most studies on the relevance of phase domain information
in the speech processing literature have however concluded
that longer analysis frames are required than typically used
in magnitude domain analysis. For instance, perception tests
have repeatedly shown that intelligibility of phase-only stimuli
improves over magnitude-only stimuli as the window duration
extends from 100-1000 ms, while the converse is true for
shorter frames [28], [12], [29]. According to a range of studies
on English speech, phonemes typically vary from 50 to 200
milliseconds, syllables from 50 to 500 milliseconds, and words
from 80 to 850 milliseconds [30], [3]. Further, it has been
shown that inter-word pauses during conversational speech can
vary from 100 to 1000 milliseconds [30]. Work modelling the
speech signal as the birth and death of individual sinusoidal
components, each lasting several short-term frames, suggests
that significant phase variations over time and frequency may
be produced by these underlying speech units [17], [18].
In selecting the step-size D, consideration must first be
given to the bandwidth of the window function and the analysis
frame length. For instance, for a Hamming window of length
T, the step-size should be less than or equal to T/4 to avoid
aliasing [23]. This places an upper bound on the step-size,
or equivalently a lower bound on the frame rate (although
it is common practice to implement a step size of T/2). It
is of course necessary that the frame rate be high enough
to capture the temporal dynamics in the signal. Following
the Nyquist sampling principle, the assumption that speech is
quasi-stationary over 20 ms segments motivates new frames
being taken at 10 ms intervals, that is at 100 frames per
second. This frame rate is commonly used in speech analysis
applications. While the preceding paragraph motivated longer
analysis frames for the phase domain, a similar 100 Hz frame
rate (step size) is still motivated in order to effectively sample
these variations in the speech signal.
Fig. 1. Plot of the Phase Spectrum for successive input audio frames for (a) the
uncompensated case, (b) when compensation has been made for the analysis
window, and (c) when both window and frame-shift compensation using
Equation (3) have been applied to the DFT. The solid blue line shows the
current frame, the dashed black line shows the previous frame, and the dot-
dashed red line shows the phase difference between these (calculated using
the phase of the quotient, to avoid wrapping effects). A window and FFT
length of T = L = 128 and a step size of D = 8 were used on a signal at
F_s = 16 kHz.
C. Compensation for the Analysis Window
It is first necessary to understand the effect of the windowing
function on the phase. A description of this effect and a
compensation method can be found in [18, Section 9.3.3],
summarised here for convenience.
Because the window function is commonly symmetric about
its mid-point, and this is aligned with the mid-point of the
current frame in practical implementation of the analysis
procedure, it has a linear Fourier transform phase of −ω_k T/2.
A simple way to compensate for its effect is to implement
a circular shift of the windowed signal in the time-domain
prior to the Fourier transform. Specifically: take the m-th input
frame x_m(n) of length T, apply the window function w(n),
zero-pad as necessary to the FFT length L, then circularly
shift the frame by T/2 samples such that the latter half of the
frame occupies the range 0 ≤ n < T/2 and the earlier half
occupies the range L − (T/2) ≤ n ≤ L − 1. Following this, the
Fourier transform can be taken and the inter-frame time step
compensated as described in the following subsection. An
alternative implementation may instead compensate the phase
modulation in the frequency domain.
The effect of this window compensation on a single analysis
frame is shown in Figure 1(a)-(b).
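A minimal sketch of this circular-shift compensation is given below, assuming a window that is symmetric about its mid-point; function and variable names are illustrative, and this is not the authors' code.

import numpy as np

def window_compensated_fft(frame, window, L):
    # Window the T-sample frame, zero-pad to the FFT length L, then circularly
    # shift by T/2 so that the latter half of the frame occupies indices
    # [0, T/2) and the earlier half occupies [L - T/2, L), before the FFT.
    T = len(window)
    buf = np.zeros(L)
    buf[:T] = window * frame
    buf = np.roll(buf, -(T // 2))
    return np.fft.fft(buf)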
D. Compensation for Inter-frame Time Step
To implement the short-time DFT in practice, typical speech
signal analysis multiplies the window function by T samples
from the signal to obtain an analysis frame. A length-L DFT
is then taken to obtain the spectrum for analysis. Subsequent
analysis frames are obtained by shifting the input signal by
D samples - that is, discarding the first D samples from the
T-length buffer, and appending D new samples at the end.
This procedure in effect implements:

\tilde{X}_m(k) = \sum_{n=-\infty}^{\infty} w(n)\, x(n + mD)\, e^{-j\omega_k n}, \qquad (2)
sometimes referred to as the Running Short-time Fourier
Transform (RSTFT) [31]. That is, it is the signal that is
effectively being shifted (through progressive advancements)
past the fixed window, rather than the window being shifted
over the signal. This distinction is important when considering
the absolute time origin for each analysis frame. In (1), the
temporal origin of each signal frame remains the origin of the
original signal x(n). In (2), however, the absolute position
of the frame within the original sequence is discarded, by
redefining the temporal origin of the frame as mD.
This has no implications for applications that only consider
the magnitude spectrum. Further, when the short-term Fourier
analysis is being done prior to re-synthesis of the signal,
such as when using the overlap-add method to implement
frequency-domain filtering, this difference is eventually com-
pensated for by time-shifting the synthesised frame to its
correct position in the output.
In applications that seek to analyse the phase spectrum
across multiple frames, however, the above distinction has
important consequences. Direct inter-frame comparisons of
phase values calculated in this way are invalid, due to the
changing reference. By accounting for the effect of the step
size, however, it is possible to restore a common reference
point to the phase values for every frame, allowing meaningful
analysis and modelling of the phase information over time.
If we therefore wish to compare the phase spectrum between
frames, it is straightforward to see that (1) is related to (2) by
X_m(k) = \tilde{X}_m(k)\, e^{-j\omega_k m D} \qquad (3)
Returning to the typical procedure for obtaining the spectrum
of successive analysis frames, applying (3) following the
DFT compensates the phase spectrum to restore a common
reference. The effect of correctly compensating for both the
window function and the frame time step is shown in Fig-
ure 1(c).
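In the frequency domain, the compensation of Equation (3) amounts to a per-bin phase rotation; the following short sketch illustrates it under the same assumptions as the earlier snippet (names are illustrative, not from the paper).

import numpy as np

def compensate_frame_step(X_tilde_m, m, D):
    # Equation (3): multiply bin k of the running STFT of frame m by
    # exp(-j * omega_k * m * D), with omega_k = 2*pi*k/L, restoring a
    # common time origin so that phase can be compared across frames.
    L = len(X_tilde_m)
    omega = 2.0 * np.pi * np.arange(L) / L
    return X_tilde_m * np.exp(-1j * omega * m * D)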
III. THE DELTA-PHASE SPECTRUM
Two problems still remain with the phase spectrum follow-
ing the above compensation: the lack of common temporal
reference between different sequences, and possible ambigui-
ties arising from phase wrapping. Absolute values of the phase
spectrum have little meaning without a common reference
point: phase values are by nature relative. While phase values
within a single sequence can be compared once they have been
compensated back to a virtual zero reference time, comparison
of phase values across sequences is problematic due to the
arbitrary start-point for the windowing. When developing
statistical models of the behaviour of the phase spectrum,
therefore, it is necessary to somehow restore some common
reference to the values.
A. Review of Spectral Representations based on Instantaneous
Frequency
One means of achieving a consistent phase-domain quantity
for analysis and modelling is to calculate the temporal deriva-
tive, commonly referred to as the Instantaneous Frequency
(IF). A number of works in the literature have investigated
spectrographic representations which plot the magnitude spec-
trum with the bin location on the time and frequency axes
reassigned according to the instantaneous frequency and group
delay [32], [31], [33]. A technical and historical review relat-
ing these different approaches is presented in [32], including
algorithms for practical digital estimation of the IF using the
STFT. One common method for computing the IF without
explicit differentiation is to first calculate the IF Deviation
as the imaginary part of the ratio between two STFT’s, one
calculated using the standard window and one calculated by
replacing the window with its derivative [31], [34], [33]. The
IF can then be calculated by compensating this for the centre
frequency of each component. Following [35], this is referred
to as the Auger and Flandrin method in [32].
Recently a spectrographic representation that is instead
based directly on the Instantaneous Frequency Deviation was
proposed in [11]. In that work, the IF was first calculated as
the phase difference between two successive STFT’s calculated
with a single sample increment, following [36] and referred
to as the finite difference approximation for IF in [32].
Adapting [11], [31] to the notation from the preceding section,
let us redefine the short-term Fourier transform in terms of the
starting sample q of the frame m (i.e., q = mD), rather than
implying this from the frame index m:

\tilde{X}(k, q) = \sum_{n=-\infty}^{\infty} w(n)\, x(n + q)\, e^{-j\omega_k n}, \qquad (4)

The Instantaneous Frequency can then be calculated as [11],
[32]:

v(k, q) = \arg\left[ \tilde{X}(k, q)\, \tilde{X}^{*}(k, q - 1) \right], \qquad (5)

where (·)^{*} indicates the complex conjugate. The Instantaneous
Frequency Deviation can be calculated as [16], [31], [11]:

\psi(k, q) = v(k, q) - \omega_k \qquad (6)
           = \arg\left[ \tilde{X}(k, q)\, \tilde{X}^{*}(k, q - 1)\, e^{-j\omega_k} \right], \qquad (7)

Having observed that the IF tracks its harmonic frequency
more accurately as the corresponding spectral magnitude in-
creases (i.e., IF deviation is inversely proportional to magni-
tude), the Instantaneous Frequency Deviation Spectrum α was
then defined as [11]:

\alpha(k, q) = |\psi(k, q)|^{-1} \qquad (8)
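For reference, a short sketch of this finite-difference estimate of Equations (5)-(7) is given below, under the same illustrative conventions (and assumptions) as the earlier snippets.

import numpy as np

def if_deviation(x, q, window, L):
    # Equations (5)-(7): finite-difference Instantaneous Frequency Deviation
    # at starting sample q, from two windowed FFTs taken one sample apart
    # (assumes q >= 1).
    T = len(window)
    X1 = np.fft.fft(window * x[q : q + T], n=L)
    X0 = np.fft.fft(window * x[q - 1 : q - 1 + T], n=L)
    omega = 2.0 * np.pi * np.arange(L) / L
    return np.angle(X1 * np.conj(X0) * np.exp(-1j * omega))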
B. Delta-Phase Spectrum
Instead of analysing the instantaneous phase derivatives over
single sample intervals, this paper proposes a related represen-
tation based on the phase difference between successive frames
separated by a step-size time interval. In a similar manner to
the Instantaneous Frequency Deviation above, it can be simply
calculated as the phase of the ratio of successive complex
spectral values:

\phi_m(k) = \arg\left[ \frac{X_m(k)}{X_{m-1}(k)} \right] \qquad (9)

where the use of X from Equation (1) rather than X̃ reflects
the fact that the spectrum has been compensated for the
inter-frame time step and analysis window to implement the
Fourier Transform with a fixed time basis (as described in
Section II). Because the phase modulation introduced by the
analysis window will be the same for all frames, and will thus
be cancelled out during the division, it can be seen that the
Delta-Phase Spectrum may simply be implemented as:

\phi_m(k) = \arg\left[ \frac{\tilde{X}_m(k)\, e^{-j\omega_k m D}}{\tilde{X}_{m-1}(k)\, e^{-j\omega_k (m-1) D}} \right]
          = \arg\left[ \tilde{X}_m(k)\, \tilde{X}^{*}_{m-1}(k)\, e^{-j\omega_k D} \right] \qquad (10)

where X̃ is the uncompensated short-time Fourier spectrum.
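A short sketch of Equation (10) under the same illustrative conventions is given below; it is not the authors' implementation, and the helper names are assumptions.

import numpy as np

def delta_phase(x, m, D, window, L):
    # Equation (10): the analysis-window phase is common to both frames and
    # cancels in the product, so only the step-size compensation
    # exp(-j * omega_k * D) is needed (assumes m >= 1).
    T = len(window)
    Xm  = np.fft.fft(window * x[m * D : m * D + T], n=L)
    Xm1 = np.fft.fft(window * x[(m - 1) * D : (m - 1) * D + T], n=L)
    omega = 2.0 * np.pi * np.arange(L) / L
    return np.angle(Xm * np.conj(Xm1) * np.exp(-1j * omega * D))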
To facilitate direct comparison, the Instantaneous Frequency
Deviation from the preceding section may be restated with the
frame index m explicit as:

\psi_m(k) = \arg\left[ \tilde{X}(k, mD)\, \tilde{X}^{*}(k, mD - 1)\, e^{-j\omega_k} \right], \qquad (11)

while the Delta-Phase Spectrum implements:

\phi_m(k) = \arg\left[ \tilde{X}(k, mD)\, \tilde{X}^{*}(k, mD - D)\, e^{-j\omega_k D} \right], \qquad (12)
These expressions are equivalent in the limiting case of D = 1
(i.e., a single sample step), however the latter is more general
by using the processing parameter D for the time change
interval. While the mathematical difference is minor, different
information is being captured: rather than estimating the phase
derivative at the start time of each individual frame, the simple
change in phase between frames is measured. Because the
“instantaneous” derivative is not measured, to avoid confusion
with the Instantaneous Frequency as commonly calculated
and used, the less constrained term Delta Phase is adopted
in this article. By adopting a distinct term, it is intended to
encourage new interpretation and insights by moving away
from the constraint of single sample intervals. The term Delta-
Phase draws a clear analogy with delta coefficients commonly
used in speech feature vectors derived from the magnitude
spectrum [37]. A further computational difference is that two
STFTs per frame must be calculated to obtain a spectrogram
or derive features from the IF Deviation Spectrum [11] in
a standard sliding window procedure, while by reusing the
previous frame’s FFT in calculating the phase change the Delta
Phase Spectrum requires only one.
Rather than calculating phase differences over time, as
above, a similar approach in [8] effectively enforced a common
reference using a particular frequency bin in the Fourier trans-
form. Such a method however requires an arbitrary frequency
bin to be selected (chosen to be π/4 in [8]), which may or
may not provide a robust reference depending on the vocal
characteristics of the speaker and the spectral characteristics
of the noise.
Finally, it is noted that as well as providing a consistent
basis for comparison over different times and sequences, a
representation based on the change in phase over a given
time allows the issue of phase wrapping to be controlled, as
discussed in Section IV.
C. Mel-Frequency Delta-Phase (MFDP) Features
In order to model the speech signal, it is often necessary
to extract a pertinent set of features from the raw spectral
representation. This section presents one such feature set that
may be derived from the Delta Phase Spectrum. The intention
is to be illustrative rather than optimal in any sense: clearly
other feature representations are possible.
The Mel-Frequency Cepstral Coefficients (MFCC) have
proven to be an effective choice of speech features derived
from the magnitude spectrum [21]. The MFCC features are
formed by first extracting filter bank energies using a bank
of band-pass filters on the absolute magnitude spectrum. The
filter bank design is inspired by the critical band filtering
of the human auditory system [38]. Cepstral coefficients are
then derived from these by taking the logarithm of filter
bank energies and performing a Discrete Cosine Transform
(DCT). The cepstral processing implements a homomorphic
transformation, effectively mapping convolutive effects in the
original time domain into additive effects in the cepstral
domain [39], [40].
This paper proposes extracting Mel-Frequency Delta-Phase
Cepstral Coefficients (MFDP) by performing the same opera-
tions on the absolute delta-phase spectrum |φ_m(k)| from (9),
rather than the magnitude spectrum. For these features, the
absolute operator is used to measure the amount of change
in the phase within each frequency bin without concern for
the polarity of this change. It can be seen that the logarithm
following filter bank analysis is not strictly motivated for the
same reason in the phase domain as in the magnitude domain,
as the phase angle is effectively already in the log domain.
The logarithm on the filter bank output does however have a
second effect in practice: being akin to the application of a soft
maximum operation, it effectively emphasises the peak values
within each frequency band. In order to avoid smoothing out
peaks from the delta-phase spectrum following the filter bank
analysis, the logarithm is therefore maintained in the proposed
MFDP feature extraction.
An important practical consideration in developing phase
domain features is selection of the parameters for the short-
time Fourier analysis. Following the rationale presented in
Section II-B, in extracting the MFDP features in this paper,
a rectangular window function is used on frames of 256
ms duration at a rate of 100 frames per second (i.e., a 10
ms step size). Note that such a frame length corresponds to
the analysis interval typically used in other works modelling
temporal dynamics of the speech signal [1], [2], [3], [4], [5].
Further motivation for a longer analysis window is the desire
to detect phase changes in individual harmonic components in
the signal, and thus measure FFT bins that are as narrow as
is practical.
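The following sketch summarises the MFDP chain just described; the mel filter-bank matrix, the helper names and the small flooring constant are illustrative assumptions rather than the authors' implementation.

import numpy as np
from scipy.fft import dct

def mfdp(phi_m, mel_fbank, n_ceps=13):
    # phi_m: delta-phase spectrum of one frame (positive-frequency bins);
    # mel_fbank: (n_filters x n_bins) mel filter-bank matrix matching phi_m.
    fb = mel_fbank @ np.abs(phi_m)      # filter-bank analysis of |phi_m(k)|
    log_fb = np.log(fb + 1e-10)         # log retained to emphasise peak values
    return dct(log_fb, type=2, norm='ortho')[:n_ceps]

First-order regression ("delta") coefficients would then be appended in the same way as for the MFCC features used in the experiments below.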
IV. EFFECT OF TIME INTERVAL ON OBSERVED PHASE
CHANGE
As shown in the previous section, the Delta-Phase extends
on the Instantaneous Frequency Deviation by removing the
constraint of being a strict instantaneous phase derivative and
instead capturing coarser phase changes over longer step-sized
intervals. The main impact of increasing the time interval is to
broaden the distribution of phase change that may be observed.
Two conflicting effects of this are to improve the ability to
detect sudden changes in the phase while also introducing
the possibility of phase wrapping. This section considers the
influence of the step size and FFT length parameters in this
context, and presents histograms and spectrograms produced
on a sample speech sequence using different parameters. As
the closest work from the literature, particular comparison is
made between the settings used for the IF Deviation Spec-
trum [11] and those used for the experiments in the current
article.
A. Ability to Detect Sudden Phase Changes
As the phase measurement from the FFT is in some sense
an average measure over the frame duration T (here and in
the following, assume an FFT length of L = T is used), it
becomes increasingly difficult to detect material event-based
changes in the phase as D decreases. Consider the case when
a substantial change in the signal has occurred in the new D
samples from one frame to the next due to some underlying
physical event, such as a new speech unit being produced.
Following a sinusoidal analysis model for speech [17], [19],
[20], this may give rise to the birth or death of a sinusoidal
component, causing a rapid shift in the phase of a particular
frequency component, rather than a slowly varying modulation
in its IF.
The ability of such a rapid phase shift to influence the phase
change as measured using two FFT frames depends on the
ratio of D to T, as well as the windowing function. For the
IF Deviation Spectrum [11], only one of these T samples
changes between the two FFTs used to calculate the phase
derivative for each frame, and then a D-sized step is taken
before measuring this again. For example, with T = 512, this
means that 99.8% of the two frames being compared in each
IFD measurement are the same samples, and so any sudden
phase change that occurs over a small range of samples in
the physical signal will undergo a significant averaging effect,
hampering the ability to detect such phase discontinuities. Any
sudden phase change will be further smoothed according to the
tapering of the window function in the time domain.
The ability to measure sudden phase transitions in the
physical signal using phase differences between two FFT’s
may therefore be expected to improve as the proportion of
new samples between the two FFT's (that is, the ratio of D to
T) increases, motivating the use of the more general D-sample
time delta used in calculating the Delta-Phase Spectrum in the
current article.
B. Phase Wrapping Considerations
Contrasting with the above motivation for increasing the
interval for calculating phase change is the possibility of intro-
ducing phase wrapping. To understand this, let us commence
by considering the IF Deviation. IF Deviation measures how
the phase-derivative changes relative to the centre frequency
of a given FFT bin. Following the filter-bank analogy of the
FFT, and neglecting for the moment leakage across bins due
to the non-ideal window response, the possible change in the
IF relative to the bin centre frequency (ω_k rad/s) is necessarily
limited by the bin width. For a frame of length T and taking
an FFT length of T, the frequency bandwidth of an ideal
individual FFT bin is Δω = 2πF_s/T rad/s. If the instantaneous
frequency lies outside the range ω_k ± πF_s/T, the component
would instead occur predominantly in a neighbouring FFT bin.
Over a given time interval, say D samples, the phase change
that may be observed in a given frequency bin at the end
of the interval is therefore limited to ±πD/T with respect to the
phase observed at the start of the interval. For the IF Deviation
Spectrum in [11], F_s = 16000, D = 1 and T = 512 were
used. In this case, the limiting phase change is ≈ ±0.002π
radians for the IF to be within ±15.6 Hz of the FFT bin
centre frequency, and thus still fall within that bin. Phase
wrapping therefore will not occur for the IF Deviation, and
this will continue to be the case for the Delta-Phase as long
as the interval D < T. Considering the parameters used in
experiments in the current article, F_s = 16000, D = 160 and
T = 4096, the limiting phase change within a bin is ≈ ±0.04π
radians.
In practice, due to windowing, the above simplified analysis
will not strictly hold: the windowing main lobe width and side-
lobes mean that a given component will have some influence
over a range of frequency bins. Extending the above analysis,
it may be seen that some phase wrapping can theoretically
occur for a given frequency component in distant frequency
bins that are more than B = T/D bins away from the local bin
for that component. The potential influence of such wrapping
will therefore depend on the windowing sidelobe level at this
frequency bin shift, as well as the strength of the local
frequency component in those bins. For a given frequency
component, as long as D < T, phase wrapping will only occur
as noise in distant FFT bins, progressively affecting fewer bins
and attenuated by the sidelobe level of the windowing as D/T
decreases. IF Deviation represents the limiting case of D = 1,
in which phase wrapping will not occur within the T FFT bins,
although general sidelobe leakage may still introduce noise to
the measured phase change in each bin.
C. Empirical Analysis
From the above analysis it is apparent that the ratio of
the step size to frame length provides a design parameter
controlling the observed distribution of delta phase values,
playing off the ability to detect sudden phase shifts with the
possibility of introducing noise from phase wrapping in distant
FFT bins.
1) Phase Change Histograms: To corroborate this analysis,
Figure 2 plots the histogram of delta phase values obtained
[Figure 2: six histogram panels for (a) D/T = 0.0020, (b) D/T = 0.0312,
(c) D/T = 0.0625, (d) D/T = 0.1250, (e) D/T = 0.2500 and (f) D/T = 1.0000.]
Fig. 2. Histogram of Delta-Phase values (normalised by π radians) for a
sample speech sequence (F_s = 16 kHz) corrupted by 10 dB noise for varying
values of D/T, with L = 512 in each case. The speech segment consists of a
male utterance of "The decking is quarter-inch mahogany marine plywood"
from the TIMIT database [41].
from a sample speech sequence for six different values of D/T
with T fixed at 512 samples. The delta phase is approximately
uniformly distributed over ±π at D/T = 1, and as this ratio
decreases the values become more normally distributed with
decreasing variance. Case (a) shows the distribution for D = 1,
as used to calculate the IF Deviation Spectrum in [11]. The
settings used in the experiments in this paper (D/T = 0.039
using a longer frame of T = 4096 samples) correspond to a
setting between cases (b) and (c). This setting was chosen
to improve the ability to detect sudden phase shifts while
minimising the ability of phase wrapping to significantly affect
the measurements.
2) Spectrographic Comparison with IF Deviation: Fig-
ures 3-5 demonstrate the effect of different processing param-
eters on the Delta-Phase Spectrum. In each case the original
signal, standard magnitude spectrum and the IF Deviation
spectrum [11] are shown for comparison. To facilitate inter-
pretation in terms of phase change, a minor difference is that
the absolute IF Deviation is used directly here, rather than its
reciprocal as proposed in [11].
For a direct comparison with the IF Deviation spectrum
presented in the literature, Figure 3 uses processing parameters
taken from the example in [11]. In this case, a Chebyshev 50 dB
window of length T = 512 (32 ms) and a step size of D = 64
(4 ms) are used. These settings are well suited for calculating
the magnitude spectrum, as shown in Figure 3(b). For the IF
Deviation spectrum in Figure 3(c), a single sample interval
is used to calculate the deviation, while for the Delta-Phase
spectrum in Figure 3(d) the step size D= 64 is used. It
is apparent that the magnitude and phase representations are
correlated, with regions of high magnitude often corresponding
to regions where there is little change in phase, and vice-versa.
As might be expected following the histogram analysis in the
preceding section (in effect, D/T = 0.1250 is used here), the
spectrogram for the Delta-Phase shows a distribution of values
with greater variance than the IF Deviation. This appears as a
more noisy spectrographic representation that makes the finer
structures of the speech less evident for the Delta-Phase than
the IF Deviation.

Fig. 3. Sample audio sequence in the (a) Time-domain and spectrographic
representations of the (b) Magnitude spectrum, (c) Instantaneous Frequency
Deviation spectrum and (d) the Delta-Phase spectrum, using parameters
following [11] to facilitate comparison (F_s = 16000, T = 512, D = 64,
Chebyshev 50 dB window). For (b)-(d) the y-axis shows increasing FFT bin
index (i.e., increasing frequency) and the x-axis shows increasing frame index.
The speech segment consists of a male utterance of "The decking is quarter-
inch mahogany marine plywood" from the TIMIT database [41].

Fig. 4. Sample audio sequence in the (a) Time-domain and Mel-scaled
Filter-bank spectrographic representations of the (b) Magnitude spectrum,
(c) Instantaneous Frequency Deviation spectrum and (d) the Delta-Phase
spectrum, using parameters following [11] to facilitate comparison
(F_s = 16000, T = 512, D = 64, Chebyshev 50 dB window). For (b)-(d) the
y-axis shows increasing filter-bank index (i.e., increasing frequency) and the
x-axis shows increasing frame index.
Figure 4 shows the same sequence using the same pro-
cessing parameters as Figure 3, however the output of 24
Mel-scaled filter-banks are shown in place of the raw FFT
bins. This figure serves simply to illustrate that despite the
finer differences between the IF Deviation and Delta-Phase
spectrum in Figure 3, when considering Mel-scaled filter-bank
outputs as used in extracting features, these differences are less
evident.
The processing parameters used in Figures 3-4 are how-
ever not appropriate for the motivations of the Delta-Phase
proposed in the present article, which is to detect significant
event-based shifts in the phase of sinusoidal components.
The short frame length leads to wider FFT bins than those
desired to focus on individual harmonic components, and the
use of a Chebyshev window sacrifices some ability to detect
discontinuous event-based transitions in the signal (albeit,
while offering lower sidelobe levels).

Fig. 5. Sample audio sequence in the (a) Time-domain and Mel-scaled
Filter-bank spectrographic representations of the (b) Magnitude spectrum,
(c) Instantaneous Frequency Deviation spectrum and (d) the Delta-Phase
spectrum, using parameters as used in subsequent experiments in the current
article (F_s = 16000, T = 4096, D = 160, Rectangular window). For (b)-(d) the
y-axis shows increasing filter-bank index (i.e., increasing frequency) and the
x-axis shows increasing frame index.

Figure 5 shows the
same sequence using the processing parameters used in the
experiments in the current article; that is, a rectangular window
of length T = 4096 (256 ms) and a D = 160 step size (10 ms).
As in Figure 4, the output of 24 Mel-scaled filter-banks are
shown. It is apparent from Figure 5(b) that the longer analysis
window is not an appropriate choice for the magnitude domain.
The IF Deviation in Figure 5(c) also shows little information
with these processing parameters, as might be expected given
that only 1 sample of 4096 has changed in the two FFTs
being used to calculate the deviation (that is, 0.024% of the
frame). The larger frame size examined here means that in
practice averaging effects will hinder the ability to detect any
phase change over a single sample interval, particularly at low
frequencies.
In contrast, Figure 5(d) confirms that by using a longer
frame and increasing the ratio of Dto T, the Delta-Phase
Spectrum is capturing regions of both high and low phase
change in the signal over both time and frequency. These
patterns reveal interesting structure in the underlying signal
that appear complementary to the information traditionally
extracted from the magnitude spectrum, as shown in Fig-
ure 4(b). For the Delta-Phase Spectrum in Figure 5(d), 160
of the samples are changing from frame to frame (that is,
3.9% of the frame). This step size (10ms) also has the benefit
of matching that commonly used in magnitude domain feature
extraction, facilitating fusion of magnitude (MFCC) and phase
(MFDP) domain systems in the following VAD and speaker
recognition experiments.
V. APPLICATION TO VOICE ACTIVITY DETECTION (VAD)
In order to validate the proposed Delta-Phase Spectrum and
Mel-Frequency Delta Phase features derived from it, a first
set of experiments was conducted applying the features for
a simple voice activity, or speech/non-speech, detection task.
Note that the goal of these experiments is simply to validate
the proposed phase representation, rather than to achieve state-
of-the-art VAD performance.

TABLE I
DATABASE NOISE TYPES AND SCENARIOS. THESE FORMAL PARTITIONS OF
THE DATABASE ARE REFERRED TO IN THE TEXT USING ITALICISED
LABELS, SUCH AS Street-City

Type     Scenario 1     Scenario 2
Street   City           Suburb
Car      Windows Up     Windows Down
A. Database
In order to evaluate the proposed MFDP speech features
for the purposes of voice activity detection, a database of
240 hours of noisy speech over 9600 individual files was
constructed through a combination of clean speech and real-
world noise recordings. A comprehensive description of the
database is available in [42], with relevant details summarised
here.
In order to construct the voice activity detection database,
two real-world recordings of at least 30 minutes of typical
background noise were made in each of 4 scenarios, covering
two broad noise types, Car and Street, as shown in Table I.
The two recordings for each scenario were captured at similar
times on separate days to ensure adequate temporal difference
in the environments. In addition to the noise recording itself,
6 swept-sine sweeps were recorded in the Car scenarios in
order to allow the reverberant response to be estimated, such
that speech may be inserted as if it were captured in that
scenario.
For each of the two recordings in each scenario, 200 noisy
speech sequences, equally split between lengths of 60 and 120
seconds, were constructed for each of 6 signal-to-noise ratios
(SNRs), being -10 dB, -5 dB, 0 dB, 5 dB, 10 dB and 15 dB.
These noisy speech sequences were constructed by extracting
a random section of the noise recording of the appropriate
length and adding clean speech sequences chosen randomly
from the TIMIT speech database [41] at the desired SNR.
For the Car scenarios, the clean speech sequences were first
transformed to match the estimated reverberant response of
the noise recording. In order to ensure that speech energy is
consistent between files, the SNR mixing was performed by
adding the inserted speech sequences with an active speech
level of -26 dBov (dB overload, following [43]), after first
scaling the background noise to match the desired SNR. As the
database sequences were constructed, the ground-truth timing
for speech events is known precisely for evaluation of VAD
algorithms.
B. System Description
A Gaussian Mixture Model (GMM) based speech detection
system was used to evaluate the MFDP speech features in
comparison to standard MFCC speech features. GMM-based
systems using MFCC features have been shown to provide a
robust baseline solution across a range of speech classification
problems, including speech/non-speech detection [44], [45].
By extracting features from sub-band energies and learning
statistical models over training examples, a GMM-MFCC
system provides a higher performance baseline than more
traditional VAD systems based on thresholding features such
as broadband frame energy.
The MFCC and MFDP features used in the experi-
ments were 13-dimensional cepstral coefficients, including the
zero’th coefficient, and with first-order regression coefficients
appended (i.e., traditional “delta” features such as in [37]),
making a 26-dimension feature vector in each case. The MFCC
features were calculated using a standard 25 ms Hamming
window, with a 10 ms step size (that is, a rate of 100 fps),
while the MFDP features used a 256 ms rectangular window,
also using a 10 ms step size, following the rationale presented
in previous sections. While larger step sizes could be examined
for the MFDP features, to facilitate comparison and fusion with
MFCCs, the 10 ms step size was maintained in all experiments.
These speech detection experiments were operated under
the assumption that the broad SNR of the target environment
is known, but the specific scenario, or type of noise, is not
known. To this end the six noise levels were divided into three
groups covering two SNRs each, designated as the high (-10
dB, -5 dB), medium (0 dB and 5 dB), and low (10 dB and 15
dB) broad noise levels.
To train speech detection modules under the operating
assumption provided, speech and non-speech GMMs were
trained based on the known ground truth on one set of
scenarios across both noise types, and for each of the three
broad noise levels. The other set of scenarios for each of the
three broad noise levels was then used to calculate speech
scores by taking the difference between the log-likelihoods
given for each feature vector by the speech and non-speech
GMMs. To give an example for sake of clarity, the low noise
Street-City data was tested on models trained using low noise
data from Street-Suburb and Car-Windows Down.
The speech scores obtained in this way were then smoothed
by a 1-second median filter centred on each feature vector to
attenuate short-term variation in favour of the longer term.
The MFCC+MFDP results indicate multi-stream fusion of the
MFCC and MFDP, in which the log-likelihoods of each stream
were combined using addition (equally weighted) prior to the
smoothing median filter.
Speech and non-speech segmentation decisions were made
by comparing the smoothed speech scores to a threshold. This
threshold was estimated by minimising the half total error rate,
calculated as the average of the miss and false-alarm rates, on
a held-out tuning data set. These tuning scores were calculated
similarly to the test scores, but were calculated on the same set
as the GMM parameter estimation, to ensure the final testing
set is unseen to both the GMM training and threshold tuning.
To produce unbiased results for each noise type, the complete
results were generated using 2-fold training and testing, split
according to the scenario numbers indicated in Table I.
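As a rough sketch of the detector just described (not the authors' code; the mixture order and library choices are illustrative assumptions):

import numpy as np
from scipy.ndimage import median_filter
from sklearn.mixture import GaussianMixture

def train_models(speech_feats, nonspeech_feats, n_mix=32):
    # One GMM per class, trained on labelled frames for a broad noise level.
    gmm_s = GaussianMixture(n_components=n_mix, covariance_type='diag').fit(speech_feats)
    gmm_n = GaussianMixture(n_components=n_mix, covariance_type='diag').fit(nonspeech_feats)
    return gmm_s, gmm_n

def speech_scores(feats, gmm_s, gmm_n, fps=100):
    # Per-frame log-likelihood ratio, smoothed by a 1-second median filter.
    llr = gmm_s.score_samples(feats) - gmm_n.score_samples(feats)
    return median_filter(llr, size=fps)

# Multi-stream fusion adds the MFCC and MFDP log-likelihood ratios (equally
# weighted) before smoothing; a frame is then labelled speech when the
# smoothed score exceeds a threshold tuned to minimise HTER = (FAR + MR) / 2
# on held-out tuning data.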
C. Results
Results from the voice activity detection experiments are
presented in Tables II and III for the Car and Street noise
types, respectively. Results are presented in terms of percentage
False Alarm Rate (FAR), Miss Rate (MR, equivalent to False
Rejection Rate) and Half-Total Error Rate (HTER).

TABLE II
VAD RESULTS FOR CAR NOISE CONDITION

SNR            Features     FAR    MR     HTER
10 to 15 dB    MFCC         2.3%   1.3%   1.8%
               MFDP         3.4%   1.3%   2.3%
               MFCC+MFDP    2.6%   1.0%   1.8%
0 to 5 dB      MFCC         2.6%   2.7%   2.6%
               MFDP         4.6%   1.6%   3.1%
               MFCC+MFDP    3.5%   1.1%   2.3%
-10 to -5 dB   MFCC         3.8%   8.9%   6.4%
               MFDP         7%     8.7%   7.8%
               MFCC+MFDP    7.4%   2.1%   4.7%

TABLE III
VAD RESULTS FOR STREET NOISE CONDITION

SNR            Features     FAR    MR     HTER
10 to 15 dB    MFCC         2.4%   1.7%   2.0%
               MFDP         3.4%   1.5%   2.5%
               MFCC+MFDP    2.5%   1.3%   1.9%
0 to 5 dB      MFCC         3.0%   6.6%   4.8%
               MFDP         4.2%   4.1%   4.2%
               MFCC+MFDP    3.4%   2.7%   3%
-10 to -5 dB   MFCC         6.6%   23%    14.8%
               MFDP         5.1%   17.8%  11.5%
               MFCC+MFDP    8.6%   8.9%   8.8%
These demonstrate performance at a particular operating point,
selected to optimise HTER on the training data as explained
above. To show performance across a range of operating
points, the Detection Error Trade-off (DET) plot is shown in
Figure 6 [46].
These results show similar performance is achieved using
the MFCC or MFDP features. In Car noise, the MFCC’s
show a marginal improvement over MFDP’s, and this trend is
reversed for the Street noise. Without seeking to over-interpret
these results, it may be that the MFDP's show benefits in
less stationary noise environments due to the longer analysis
window of 256 ms. Further, the DET plot in Figure 6 shows
that the MFCC features perform better in the high Miss Rate
region (high False Rejection), while MFDP features exhibit
better performance in the high False Alarm region.

Fig. 6. DET plot of GMM-based VAD results over all noise conditions and
levels, using MFCC and MFDP features and their multi-stream fusion.
The two important points to garner from these results are:
first, that results using only phase information are comparable
to those using magnitude only; second, that in both noise
types, and at all noise levels, the multi-stream fusion of the
magnitude-domain MFCC’s and phase-domain MFDP’s yields
significant performance benefits over either in isolation, as
measured by the HTER. Note that while the FAR increases
marginally over MFCC in the fused system, the MR is
significantly improved in each case, simply reflecting the fact
that the operating point is chosen based on HTER.
VI. APPLICATION TO SPEAKER RECOGNITION
The effectiveness of the proposed MFDP features for voice
activity detection was demonstrated in the preceding section,
with the fusion results highlighting the complementary infor-
mation they offer to MFCC features. This section seeks to
further validate the proposed phase representation by investi-
gating whether MFDP features are also able to capture speaker
discriminative information from the phase domain through
their application to the task of speaker recognition. As in the
preceding section, the goal of these experiments is to validate
the proposed phase representation, rather than to demonstrate
state-of-the-art performance.
Speaker recognition is commonly performed using cepstral-
based features derived from the magnitude domain. Partic-
ularly successful in this research domain are MFCC fea-
tures. MFCCs provide state-of-the-art speaker recognition
performance when used in conjunction with GMMs adapted
from a Universal Background Model (UBM) and suitable
session variability compensation techniques [47], [48]. Limited
research has focussed on the application of phase-related
features to speaker recognition due to the belief that the phase
component of speech offers little information relative to the
magnitude domain [49], [50], [51].
A. Experimental Configuration
The comparison of MFDP and MFCC features will be con-
ducted following the well-known NIST Speaker Recognition
Evaluation (SRE) series [52] protocols. Since 1996, NIST
have conducted regular evaluations of speaker recognition
technology by specifying an evaluation protocol and corre-
sponding corpus predominantly consisting of conversational
telephony speech from several hundreds of speakers. The NIST
SRE series has driven state-of-the-art in the area of speaker
recognition research. For these experiments, the 2006 and
2008 NIST SRE data and protocols were used, specifically
the evaluation conditions consisting of 5-minute English-only
telephone conversations. This subset of evaluation conditions
was selected to allow for a clear analysis of the proposed
features without the need to consider additional variability
introduced from microphone, interview or cross-channel trials
available in the corpora.
The two feature sets will be examined in the context of a
GMM Supervector SVM system. MFCC variants of this sys-
tem have demonstrated state-of-the-art performance in recent
SRE’s. The GMM Supervector SVM system [53] combines
robust yet straightforward acoustic modelling in the form of
mean-adapted high-order Gaussian mixture models (GMM)
with more recent discriminative machine learning approaches
through Support Vector Machine (SVM) classification.
In this approach, each utterance, in both training and test-
ing, is first used to estimate a mean-adapted GMM through
maximum a posteriori (MAP) adaptation from a universal
background model (UBM). In this work, gender-dependent,
512-component UBM’s are used for this purpose. This form
of MAP adaptation has been a well-established approach in
speaker recognition for over a decade [47]. The component
mean vectors of the adapted GMM are then concatenated
together to form a single large vector, known as a supervector;
the supervector thus provides a convenient, fixed dimension
representation of each utterance for use within an SVM
classifier.
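A rough sketch of this per-utterance processing is given below, using relevance MAP adaptation of the UBM component means; the relevance factor and helper names are illustrative assumptions, and this is not the implementation of [53], [55].

import numpy as np
from sklearn.mixture import GaussianMixture

def mean_supervector(ubm: GaussianMixture, feats, relevance=16.0):
    # Zeroth- and first-order statistics of the utterance against the UBM.
    post = ubm.predict_proba(feats)            # (n_frames, n_components)
    n_c = post.sum(axis=0)                     # occupancy per component
    f_c = post.T @ feats                       # weighted sums of frames
    # Relevance MAP adaptation of the component means only.
    alpha = (n_c / (n_c + relevance))[:, None]
    means = alpha * (f_c / np.maximum(n_c, 1e-8)[:, None]) + (1.0 - alpha) * ubm.means_
    # Concatenate the adapted means into a single supervector for the SVM.
    return means.reshape(-1)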
Speaker SVM training and classification was performed
using the GMM mean supervector kernel [53]. This kernel
performs a weighted dot-product between the GMM mean
supervectors. Support vector machines are discriminative clas-
sifiers and thus are trained on both positive and negative
examples of a speaker. In the context of NIST evaluations,
there is typically only a single (positive) training example of
a speaker while a substantial number of impostor (negative)
examples are drawn from previous NIST evaluation corpora.
Zero and Test-norm score normalisation was applied to
all scores to reduce the statistical variation observed in
scores [54]. Both normalisation techniques utilise a large set of
impostor speech segments to calculate a set of normalisation
statistics. Zero-norm is a speaker-centric technique in which
a speaker’s scores are scaled by the mean µiand standard
deviation σi, obtained when scoring the impostor cohort
against the speaker model, such that,
score =score µi
σi
(13)
Similarly, test-norm calculates µiand σiby trialling a given
test segment against a set of speaker models trained using the
impostor cohort. In this work, the impostor speech segments
were extracted from the NIST 2004 dataset.
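Both normalisations apply Equation (13) with different cohorts; a minimal illustrative sketch (cohort scoring itself is assumed to be done elsewhere):

import numpy as np

def normalise(raw_score, cohort_scores):
    # Equation (13). Z-norm: cohort_scores are impostor segments scored
    # against the speaker model; T-norm: the test segment scored against a
    # set of impostor speaker models.
    return (raw_score - np.mean(cohort_scores)) / np.std(cohort_scores)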
The MFDP and MFCC features for these experiments were
formed from 12 cepstral coefficients with appended deltas. In
contrast to features used in previous sections, the 0th cepstrum
was removed from the MFDP features to match the existing
MFCC configuration. This was empirically found to provide
marginal improvements to MFDP-based speaker recognition.
The reader is referred to [55] for more details of the config-
uration and implementation of the GMM Supervector SVM
system used in these experiments.
Two well-established techniques for robust speaker ver-
ification were progressively incorporated into the baseline
configuration described above in order to observe whether they
could offer similar benefits to the proposed MFDP's as they do to
the magnitude-based features on which they were developed.

Fig. 7. DET plot of 1-sided, English-only trials from the NIST 2006 speaker
recognition evaluation using MFDP and MFCC features.
The first technique, feature-warping [56], applies short-time
Gaussianisation to the feature vector stream extracted from an
utterance using a sliding window to counteract the adverse
effects of channel mismatch and additive noise. A window
of 5 seconds is utilised in this work. The second technique,
Nuisance Attribute Projection (NAP) [57], aims to reduce the
adverse effects of inter-session variation in the SVM kernel
space. Inter-session variation, such as differences in channel
and background noise, is well known as a major source
of error in speaker recognition. NAP addresses this issue
by removing the directions of greatest inter-session variation
from the supervector space. Based on empirical results, forty
directions were removed from the MFCC supervector space
and twenty dimensions in the case of the MFDP configuration.
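As an illustration of the first of these techniques, an unoptimised sketch of short-time Gaussianisation over a 5-second window at 100 frames per second is given below; this is a simplified stand-in, not the implementation of [56].

import numpy as np
from scipy.stats import norm, rankdata

def feature_warp(feats, win=500):
    # Map each coefficient's rank within a sliding window (500 frames = 5 s
    # at 100 frames per second) to the corresponding standard-normal quantile.
    n_frames, n_dims = feats.shape
    warped = np.empty_like(feats, dtype=float)
    half = win // 2
    for t in range(n_frames):
        lo, hi = max(0, t - half), min(n_frames, t + half + 1)
        for j in range(n_dims):
            rank = rankdata(feats[lo:hi, j])[t - lo]
            warped[t, j] = norm.ppf((rank - 0.5) / (hi - lo))
    return warped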
Evaluation of the feature sets was performed using the
English-only trials from the 1-sided training condition of the
NIST 2006 and 2008 speaker recognition evaluation (SRE).
Classification performance was measured in terms of mini-
mum decision cost function (DCF) and equal error rate (EER),
as defined in the NIST SRE protocol [52]. Score-level fusion
was implemented using the FoCal toolkit [58] to optimise
linear logistic regression. The fusion weights for the NIST 2006
SRE trials were learned using scores from the 2008 SRE,
and similarly, 2008 SRE fusion weights were learned from
the 2006 SRE scores. This approach to fusion ensured that
the fused weights were not optimistically biased for a given
corpus.
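This cross-corpus fusion can be sketched as follows; the sketch uses scikit-learn as a simplified stand-in for the FoCal toolkit (it learns the same linear combination of system scores but omits FoCal's explicit calibration), and the score arrays are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fusion(dev_mfcc_scores, dev_mfdp_scores, dev_labels):
    # Learn linear fusion weights on held-out development trials
    # (e.g. 2008 SRE scores when fusing the 2006 SRE, and vice versa).
    X = np.column_stack([dev_mfcc_scores, dev_mfdp_scores])
    return LogisticRegression().fit(X, dev_labels)

def fuse_scores(model, mfcc_scores, mfdp_scores):
    # Fused score: w0 + w1 * s_mfcc + w2 * s_mfdp.
    X = np.column_stack([mfcc_scores, mfdp_scores])
    return model.decision_function(X)
```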
B. Results
Figure 7 depicts the DET curves from the English-only
trials on the NIST 2006 SRE involving MFDP and MFCC
features and their score-level fusion without incorporating
feature-warping or NAP. The performance of the proposed
MFDP features demonstrates their effectiveness in capturing
speaker discriminative information from the phase domain. It
is also clear from the DET plot that the two feature sets are highly complementary when fused.
Table IV details specific operating performance statistics
from trials on the NIST 2006 and 2008 SRE. Several con-
figurations are presented for a thorough analysis of the pro-
posed MFDP features: a Baseline, Feature-Warping, NAP
and Feature-Warping+NAP configuration, the last of which
amounts to a state-of-the-art configuration developed for
MFCC features. The objective of these experiments was to
observe whether techniques developed for magnitude-based
features were also suited to the proposed MFDP feature set.
The baseline results in Table IV indicate that both MFDP
and MFCC features provided broadly comparable performance
on the NIST 2006 SRE (these results correspond to the DET
curve in Figure 7). On the more challenging NIST 2008 SRE,
MFCCs offered a relative gain of 13% and 15% in minimum
DCF and EER, respectively, over the proposed MFDP results.
Score-level fusion of the baseline MFDP and MFCC configu-
rations resulted in a relative improvement of 22% and 11% in
the EER of the 2006 and 2008 corpora, respectively, indicating
that the MFDP features offer considerable complementary information to MFCCs in the baseline configuration.
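For example, using the baseline rows of Table IV, the fused 2006 EER of 8.45% against the MFCC EER of 10.89% corresponds to (10.89 − 8.45)/10.89 ≈ 22%, and the fused 2008 EER of 11.45% against 12.92% to approximately 11%.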
The introduction of feature-warping to the baseline system
provided a relative improvement of 7-17% in MFDP perfor-
mance statistics across the NIST corpora and a significant
relative gain of 50% in the MFCC-based results. Interestingly,
MFDP features provided little complementary information to the feature-warped MFCCs. The large discrepancy in the
gains offered by feature-warping may be explained by the
relatively large window used during MFDP feature extraction.
Because MFDP extraction uses a 256 ms analysis window, compared to 32 ms for the MFCC feature stream, a relatively high correlation between sequential features is expected, potentially reducing the effectiveness or necessity of feature-warping. An alternative explanation can be derived
from the objective of the feature-warping process. Specifically,
MFDP features may be inherently more robust to channel
distortions and additive noise than the MFCC feature set. This
hypothesis is explored through the application of inter-session
variability compensation via NAP.
The application of NAP to the baseline MFDP configuration provided significant relative improvements, in excess of 32%, in performance statistics across the NIST corpora. Similarly,
the MFCC configuration obtained a 50% relative improvement
over baseline results from the application of NAP. As with the
baseline results, the fusion of the NAP systems provided a further relative gain of 23% and 13% in EER over the MFCC configuration alone in the NIST 2006 and 2008 SRE, respectively.
The final configuration employing feature-warping and NAP
represents a state-of-the-art SVM configuration developed on
MFCC features. MFCC-based results in Table IV indicate
that NAP provided an average relative gain of 34% in min-
imum DCF and 27% in EER across the evaluated corpora
over the use of feature-warping alone. Comparably, MFDP
results obtained an average relative improvement of 26% and 21% in minimum DCF and EER, respectively, from the application of NAP.

TABLE IV
Minimum DCF and EER obtained from 1-sided, English-only NIST 2006 and 2008 speaker recognition evaluations when using MFCC and MFDP feature sets for SVM training and classification.

                          NIST 2006            NIST 2008
  Features                Min. DCF   EER       Min. DCF   EER
  Baseline
    MFDP                  .0429      11.10%    .0573      15.24%
    MFCC                  .0400      10.89%    .0498      12.92%
    MFCC+MFDP             .0346       8.45%    .0465      11.45%
  Feature-Warping
    MFDP                  .0398       9.37%    .0508      12.69%
    MFCC                  .0188       4.55%    .0259       6.34%
    MFCC+MFDP             .0179       4.54%    .0258       6.18%
  NAP
    MFDP                  .0273       6.46%    .0387       9.81%
    MFCC                  .0184       4.17%    .0245       6.02%
    MFCC+MFDP             .0180       3.20%    .0232       5.19%
  Feature-Warping + NAP
    MFDP                  .0292       6.28%    .0398       9.29%
    MFCC                  .0130       2.87%    .0190       4.61%
    MFCC+MFDP             .0125       2.72%    .0185       4.37%

Interestingly, the optimised number of
nuisance directions removed via NAP was only twenty in
the case of MFDP features and forty for the MFCC system.
Comparable performance gains from NAP suggest that MFDP
features exhibit less inter-session variation than MFCCs and
allow such variation to be robustly estimated using fewer
directions. Such a trait is highly desired of features for speaker
verification as inter-session variability continues to be a major
cause of classification error. The phase-based MFDP features
provided reasonable classification performance on the SRE
task with the magnitude-based MFCC features offering relative
improvements of more than 50% in the evaluation of both
corpora. Score-level fusion of both configurations provided
the best performance statistics with relative improvements of
up to 5% being obtained over the MFCC configuration in
the 2006 SRE. Similar improvements were observed in the
2008 SRE trials through fusion. This demonstrates that the
MFDP features extract some speaker specific information from
the phase domain that is complementary to magnitude-based
features.
By operating at a level comparable to the MFCC feature set in the baseline configuration, offering robustness to inter-session variation, and providing complementary information to commonly employed MFCC features, the proposed MFDP feature set shows strong potential for further application in
the field of speaker recognition research. Building on these
preliminary experimental results, investigations into SVM ker-
nels tailored to the MFDP feature set and their application
to GMM-based classification are likely to better exploit the
speaker discriminative information found in MFDP features.
VII. CONCLUSION
This paper has revisited the use of the phase domain in
short-time Fourier analysis of the speech signal, highlighting
the factors that must be considered and compensated before
the phase can be meaningfully analysed. The Delta-Phase
Spectrum, computed at a frame advance of D samples, was proposed as a simple phase domain representation that allows consistent comparison over multiple frames and across
sequences, while also minimising practical issues associated
with phase wrapping. The Delta-Phase extends the Instanta-
neous Frequency Deviation, removing the constraint of being
a strict instantaneous phase derivative and instead capturing
coarser changes in the phase structure of the signal from one
frame to the next.
Building upon this representation, it was shown that Mel-
Frequency Delta Phase features extracted purely from the
phase domain could be used to achieve broadly similar per-
formance to the common magnitude-domain Mel-Frequency
Cepstral Coefficients for distinguishing speech from noise,
and also for distinguishing between voices of different people.
Further, it was shown that principled fusion of the magnitude
and phase domain information could achieve performance
improvements over either in isolation.
There remains much scope for research building upon this
work, both in optimising phase domain feature representations
and models, and in understanding whether these findings can
be applied to automatic speech recognition, which needs to
capture shorter-term units than the voice activity and speaker
recognition applications considered here.
ACKNOWLEDGEMENTS
The authors wish to thank the reviewers for their valuable comments, which have enabled us to improve the quality of the manuscript, as well as Dan Ellis of Columbia University for his help in clarifying later revisions of this article. The
research was supported in part by the Australian Research
Council (ARC) Discovery Grant DP0877835.
REFERENCES
[1] H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, pp. 578–589, 1994.
[2] H. Hermansky and S. Sharma, “TRAPS-Classifiers of temporal pat-
terns,” in Fifth International Conference on Spoken Language Process-
ing. ISCA, 1998.
[3] S. Greenberg and T. Arai, “What are the essential cues for understanding
spoken language,” IEICE Transactions on Information and Systems, vol.
E87-D, no. 5, pp. 1059–1070, May 2004.
[4] N. Kanedera, T. Arai, H. Hermansky, and M. Pavel, “On the relative
importance of various components of the modulation spectrum for
automatic speech recognition,” Speech Communication, vol. 28, no. 1,
pp. 43–55, 1999.
[5] S. Greenberg and T. Arai, “The relation between speech intelligibility
and the complex modulation spectrum,” in Seventh European Conference
on Speech Communication and Technology. ISCA, 2001.
[6] R. Schluter and H. Ney, “Using phase spectrum information for
improved speech recognition performance,” in Proceedings of the IEEE
International Conference on Acoustics, Speech, and Signal Processing,
2001, vol. 1.
[7] S. Ikbal, H. Misra, and H. Bourlard, “Phase autocorrelation (PAC) derived robust speech features,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, Proceedings, 2003, vol. 2.
[8] L. Wang, S. Ohtsuka, and S. Nakagawa, “High improvement of
speaker identification and verification by combining MFCC and phase
information,” in Proceedings of the 2009 IEEE International Conference
on Acoustics, Speech and Signal Processing. IEEE Computer Society,
2009, pp. 4529–4532.
[9] H. A. Murthy and V. Gadde, “The modified group delay function and its application to phoneme recognition,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, Proceedings, 2003, vol. 1.
[10] Y. Wang, J. Hansen, G.K. Allu, and R. Kumaresan, “Average instanta-
neous frequency (AIF) and average log-envelopes (ALE) for ASR with
the AURORA 2 Database,” in Eighth European Conference on Speech
Communication and Technology. ISCA, 2003.
[11] A. P. Stark and K. K. Paliwal, “Speech analysis using instantaneous frequency deviation,” in Proceedings Interspeech 2008, pp. 2602–2605, 2008.
[12] L. D. Alsteris and K. K. Paliwal, “Short-time phase spectrum in speech processing: A review and some experimental results,” Digital Signal Processing, vol. 17, no. 3, pp. 578–616, 2007.
[13] D.C. Ghiglia and M.D. Pritt, Two-dimensional phase unwrapping:
theory, algorithms, and software, Wiley New York, 1998.
[14] J. Tribolet, “A new phase unwrapping algorithm,” IEEE Transactions
on Acoustics, Speech and Signal Processing, vol. 25, no. 2, pp. 170–177,
1977.
[15] G. Nico and J. Fortuny, “Using the matrix pencil method to solve phase
unwrapping,” IEEE Transactions on Signal Processing, vol. 51, no. 3,
pp. 886–888, 2003.
[16] L.R. Rabiner and R.W. Schafer, Digital processing of speech signals,
Prentice-hall, 1978.
[17] R. McAulay and T. Quatieri, “Speech analysis/synthesis based on a
sinusoidal representation,” IEEE Transactions on Acoustics, Speech and
Signal Processing, vol. 34, no. 4, pp. 744–754, 1986.
[18] T. Quatieri, Discrete-Time Speech Signal Processing: Principles and
Practice, Prentice Hall, 2002.
[19] T.N. Sainath, Acoustic landmark detection and segmentation using
the Mcaulay-Quatieri sinusoidal model, Ph.D. thesis, Massachusetts
Institute of Technology, 2005.
[20] T.N. Sainath and T.J. Hazen, “A sinusoidal model approach to acoustic
landmark detection and segmentation for robust segment-based speech
recognition,” in Acoustics, Speech and Signal Processing, 2006. ICASSP
2006 Proceedings. 2006 IEEE International Conference on. IEEE, 2006,
vol. 1.
[21] S. Davis and P. Mermelstein, “Comparison of parametric representations
for monosyllabic word recognition in continuously spoken sentences,”
IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28,
no. 4, pp. 357–366, 1980.
[22] J. Allen, “Short term spectral analysis, synthesis, and modification by
discrete Fourier transform,” IEEE Transactions on Acoustics, Speech
and Signal Processing, vol. 25, no. 3, pp. 235–238, 1977.
[23] J. B. Allen and L. R. Rabiner, “A unified approach to short-time Fourier analysis and synthesis,” Proceedings of the IEEE, vol. 65, no. 11, pp. 1558–1564, 1977.
[24] M. Portnoff, “Time-frequency representation of digital signals and
systems based on short-time Fourier analysis,” IEEE Transactions on
Acoustics, Speech and Signal Processing, vol. 28, no. 1, pp. 55–69,
1980.
[25] F. J. Harris, “On the use of windows for harmonic analysis with the discrete Fourier transform,” Proceedings of the IEEE, vol. 66, no. 1, pp. 51–83, 1978.
[26] N. Reddy and M. Swamy, “Derivative of phase spectrum of truncated autoregressive signals,” IEEE Transactions on Circuits and Systems, vol. 32, no. 6, pp. 616–618, 1985.
[27] L.D. Alsteris and K.K. Paliwal, “Importance of window shape for phase-
only reconstruction of speech,” in Proc. International Conf. Acoustics,
Speech, Signal Processing, 2004, pp. 573–576.
[28] L. Liu, J. He, and G. Palm, “Effects of phase on the perception of
intervocalic stop consonants,” Speech Communication, vol. 22, no. 4,
pp. 403–417, 1997.
[29] M. R. Schroeder, “Models of hearing,” Proceedings of the IEEE, vol. 63, no. 9, pp. 1332–1350, 1975.
[30] D. O’Shaughnessy, “Timing patterns in fluent and disfluent spontaneous
speech,” in IEEE International Conference on Acoustics, Speech, and
Signal Processing, 1995, pp. 600–603.
[31] D. Friedman, “Instantaneous-frequency distribution vs. time: An in-
terpretation of the phase structure of speech,” in IEEE International
Conference on Acoustics, Speech, and Signal Processing, 1985, vol. 10.
[32] S.A. Fulop and K. Fitz, “Algorithms for computing the time-corrected
instantaneous frequency (reassigned) spectrogram, with applications,”
The Journal of the Acoustical Society of America, vol. 119, pp. 360–
371, 2006.
[33] T. Abe, T. Kobayashi, and S. Imai, “The IF spectrogram: a new spectral representation,” in Proceedings of ASVA 97, 1997, pp. 423–430.
[34] F. Charpentier, “Pitch detection using the short-term phase spectrum,”
in Acoustics, Speech, and Signal Processing, IEEE International Con-
ference on., 1986, vol. 11.
[35] F. Auger and P. Flandrin, “Improving the readability of time-frequency and time-scale representations by the reassignment method,” IEEE Transactions on Signal Processing, vol. 43, no. 5, pp. 1068–1089, 1995.
[36] S. Kay, “A fast and accurate single frequency estimator,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, no. 12, pp. 1987–1990, Dec. 1989.
[37] S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason,
V. Valtchev, and P. Woodland, “The HTK book,” Cambridge University,
vol. 1996, 1995.
[38] H. Fletcher, “Auditory patterns,” Reviews of Modern Physics, vol. 12,
no. 1, pp. 47–65, 1940.
[39] A. V. Oppenheim and R. W. Schafer, “From frequency to quefrency: A history of the cepstrum,” IEEE Signal Processing Magazine, vol. 21, no. 5, pp. 95–106, 2004.
[40] B.P. Bogert, M.J.R. Healy, and J.W. Tukey, “The quefrency alanysis
of time series for echoes: Cepstrum, pseudo-autocovariance, cross-
cepstrum and saphe cracking,” in Proceedings of the Symposium on
Time Series Analysis, 1963, pp. 209–243.
[41] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, “DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM,” NTIS order number PB91-100354, 1993.
[42] David Dean, Sridha Sridharan, Robert Vogt, and Michael Mason, “The
QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection
algorithms,” in Interspeech, Makuhari, Japan, September 2010.
[43] ITU-T Recommendation P.830, “Subjective performance assessment of telephone-band and wideband digital codecs,” 1996.
[44] I. Shafran and R. Rose, “Robust speech detection and segmentation for
real-time ASR applications,” in Proceedings of International Conference
on Acoustics, Speech, and Signal Processing, 2003, vol. 1, pp. 432–435.
[45] C. Wooters and M. Huijbregts, “The ICSI RT07s speaker diarization
system,” Lecture Notes in Computer Science, vol. 4625, pp. 509–519,
2008.
[46] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki,
“The DET curve in assessment of detection task performance,” in
Fifth European Conference on Speech Communication and Technology.
Citeseer, 1997.
[47] D.A. Reynolds, T.F. Quatieri, and R.B. Dunn, “Speaker verification
using adapted Gaussian mixture models,” Digital Signal Processing,
vol. 10, no. 1-3, pp. 19–41, 2000.
[48] R. Vogt and S. Sridharan, “Explicit modelling of session variability for speaker verification,” Computer Speech & Language, vol. 22, no. 1, pp. 17–38, 2008.
[49] H.A. Murthy and V.R.R. Gadde, “The modified group delay function
and its application to phoneme recognition,” in Proc. IEEE International
Conference on Acoustics, Speech and Signal Processing, 2003, vol. 1,
pp. 68–71.
[50] K.S.R. Murty and B. Yegnanarayana, “Combining evidence from
residual phase and MFCC features for speaker recognition,” IEEE Signal
Processing Letters, vol. 13, no. 1, pp. 52–55, 2006.
[51] T. Kinnunen and H. Li, “An overview of text-independent speaker recognition: From features to supervectors,” Speech Communication, vol. 52, no. 1, pp. 12–40, 2010.
[52] National Institute of Standards and Technology, “The NIST year
2006 speaker recognition evaluation plan,” 2006, Available from:
http://www.itl.nist.gov/iad/mig/tests/sre/2006/sre-06 evalplan-v9.pdf.
[53] W. M. Campbell, D. E. Sturim, D. A. Reynolds, and A. Solomonoff, “SVM based speaker verification using a GMM supervector kernel and NAP variability compensation,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, May 2006, vol. 1, pp. 97–100.
[54] R. Auckenthaler, M. Carey, and H. Lloyd-Thomas, “Score normalization for text-independent speaker verification systems,” Digital Signal Processing, vol. 10, no. 1, pp. 42–54, 2000.
[55] M. McLaren, R. Vogt, B. Baker, and S. Sridharan, “A comparison
of session variability compensation techniques for SVM-based speaker
recognition,” in Proc. Interspeech, 2007, pp. 790–793.
[56] J. Pelecanos and S. Sridharan, “Feature warping for robust speaker
verification,” A Speaker Odyssey, The Speaker Recognition Workshop,
vol. 2001, pp. 213–218, 2001.
[57] A. Solomonoff, C. Quillen, and W. M. Campbell, “Channel compensation for SVM speaker recognition,” in Odyssey: The Speaker and Language Recognition Workshop, 2004, pp. 57–62.
[58] N. Brummer, “FoCal: Tools for fusion and calibration of automatic
speaker detection systems,” July 2005.
Iain McCowan (M’97) received the B.E. and
B.InfoTech. from the Queensland University of
Technology (QUT), Brisbane, in 1996. In 2001, he
completed his PhD with the Research Concentration
in Speech, Audio and Video Technology at QUT,
including a period of research at France Telecom
R&D. In 2001 he joined the IDIAP Research Insti-
tute, Switzerland, progressing to the post of Senior
Researcher in 2003. While at IDIAP, he worked
on a number of applied research projects in the
areas of automatic speech recognition, content-based
multimedia retrieval, multimodal event recognition and modeling of human
interactions. From 2005-2008 he was with the CSIRO eHealth Research
Centre, Brisbane as Project Leader in the area of multimedia content analysis.
In 2008 he founded Dev-Audio Pty Ltd to commercialise microphone array
technology. He holds an adjunct appointment as Associate Professor at QUT
in Brisbane.
David Dean holds Bachelor degrees in Engineering
(with Honours) and Information Technology. He
completed his PhD programme in 2008 in the area
of audio-visual speech technology with his disserta-
tion entitled Synchronous HMMs for Audio-Visual
Speech Processing. As a post-doctoral fellow of the
Speech, Audio, Image and Video Technology pro-
gram at the Queensland University of Technology,
his research has focused on both acoustic and audio-
visual speech processing with a recent focus on
speech detection and speaker verification.
Mitchell McLaren received his PhD with the
Speech, Audio, Image and Video Technologies
(SAIVT) group at the Queensland University of Tech-
nology (QUT), Brisbane, Australia, in 2010. He
received his BCompSysEng also from QUT in 2006.
Mitchell has been with the Centre for Language and
Speech Technology (CLST) at Radboud University
Nijmegen, The Netherlands, since 2010 where he is
currently in a post-doctoral role. In 2007, he was a
visiting intern within the Laboratoire Informatique
D’Avignon in Avignon, France. His PhD research
concentrated on speaker verification using support vector machine techniques.
Mitchell was awarded the ‘Best Student Paper Award’ at Interspeech 2008 and
the ‘IEEE 2009 Spoken Language Processing Student Grant’ at ICASSP 2009.
Robert Vogt received his PhD degree in electrical
engineering at the Queensland University of Tech-
nology (QUT), Brisbane, Australia, in 2006, and BEng/BInfTech degrees, also from QUT, in 2002.
Robert has been with the Speech, Audio, Image
and Video Technologies (SAIVT) group at QUT
since 2002 where he is currently a research fellow.
His research interests include speaker recognition,
speaker diarisation and spoken term detection. Dur-
ing his time with the SAIVT group, Robert has
participated in the successful commercialisation of
speech research outcomes and helped secure several large grants through
competitive funding schemes. In 2008, he was invited to participate in the
robust speaker recognition stream at the CSLP Summer Workshop, hosted at
Johns Hopkins University, Baltimore, MD.
Professor Sridha Sridharan has a BSc (Electrical
Engineering) degree and obtained an MSc (Communication Engineering) degree from the University of Manchester Institute of Science and Technology (UMIST), UK, and a PhD degree in the area of Signal Processing from the University of New South Wales,
Australia. He is a Senior Member of the Institute of
Electrical and Electronic Engineers - IEEE (USA).
He is currently with the Queensland University of
Technology (QUT) where he is a full Professor in the
School of Engineering Systems. Professor Sridharan
is the Deputy Director of the Information Security Institute and the Leader
of the Research Program in Speech, Audio, Image and Video Technologies at
QUT. He has published over 300 papers consisting of publications in journals
and in refereed international conferences in the areas of Image and Speech
technologies during the period 1990–2010. During this period he has also
graduated 24 PhD students as their Principal Supervisor and 15 PhD students
as their Associate Supervisor in the areas of Image and Speech technologies.
Prof Sridharan has also received a number of research grants from various
funding bodies including Commonwealth competitive funding schemes such as the Australian Research Council (ARC) and the National Security Science
and Technology (NSST) unit. Several of his research outcomes have been
commercialised.