Bayesian Statistical Methods for Audio and Music Processing
A. Taylan Cemgil, Simon J. Godsill, Paul Peeling, Nick Whiteley
Signal Processing and Comms. Lab, University of Cambridge
Department of Engineering, Trumpington Street, Cambridge, CB2 1PZ, UK
{atc27,sjg}@eng.cam.ac.uk
August 15, 2008
Abstract
Bayesian statistical methods provide a formalism for arriving at solutions to various problems
faced in audio processing. In real environments, acoustical conditions and sound sources are
highly variable, yet audio signals often possess significant statistical structure. There is a great
deal of prior knowledge available about why this statistical structure is present. This includes
knowledge of the physical mechanisms by which sounds are generated, the cognitive processes
by which sounds are perceived and, in the context of music, the abstract mechanisms by which
high-level sound structure is compiled. Bayesian hierarchical techniques provide a natural means
for unification of these bodies of prior knowledge, allowing the formulation of highly-structured
models for observed audio data and latent processes at various levels of abstraction. They also
permit the inclusion of desirable modelling components such as change-point structures and
model-order specifications.
The resulting models exhibit complex statistical structure and in practice, highly adaptive
and powerful computational techniques are needed to perform inference. In this chapter, we
review some of the statistical models and associated inference methods developed recently for
audio and music processing. Our treatment will be biased towards musical signals, yet the mod-
elling strategies and inference techniques are generic and can be applied in a broader context
to nonstationary time series analysis. In the chapter we will review application areas for audio
processing, describe models appropriate for these scenarios and discuss the computational prob-
lems posed by inference in these models. We will describe models in both the time domain and
transform domains, the latter typically offering greater computational tractability and modelling
flexibility at the expense of some accuracy in the models. Inference in the models is performed
using Monte Carlo methods as well as variational approaches originating in statistical physics.
We hope to show that this field, which is still in its infancy compared to topics such as computer
vision and speech recognition, has great potential for advancement in coming years, with the
advent of powerful Bayesian inference methodologies and accompanying computational power
increases.
1 Introduction
In applications that need to deal with acoustical and computational modelling of sound, a
fundamental obstacle is superposition, i.e. concurrent sound events (polyphonic music, speech
or environmental sound) are mixed and altered due to reverberation present in the acoustic
environment. In speech processing, this problem is referred to as the cocktail party problem.
In hearing aids, undesired structured environmental sources, such as wind or machine noises,
contaminate the target sound and need to be filtered out; here the objective is denoising or
perceptual enhancement. A similar situation happens in polyphonic music, where several
instruments play simultaneously and one goal is to separate or identify the individual voices. In all
of these domains, due to superposition, information about individual sources cannot be directly
extracted, and significant focus is given in the literature to source separation, deconvolution
and perceptual organisation of sound (Wang and Brown 2006).
Acoustic processing is a rather broad field and the research is driven by both scientific and
technological motivations – two related but distinct goals. For technological needs, the primary
motivation is to develop practical engineering solutions to enhance recognition, denoising, source
separation or information retrieval. The ultimate goal here is to construct computer systems
that display aspects of human level performance in automated sound understanding. In the
second, the goal is scientific understanding of cognitive processes behind the human auditory
system and the physical sound generation process of musical instruments or voices.
Our starting point in this article is that in both contexts, scientific or technological, Bayesian
statistical methods provide a formalism to make progress. This is achieved via models which
quantify prior knowledge about physical properties and semantics of sound, and powerful
computational techniques. The key equation, then, is Bayes' theorem and in the context of audio
processing it can be stated as
p(Structure|Audio Data) ∝ p(Audio Data|Structure)p(Structure)
Thus inference is drawn from the posterior distribution over hidden structure given observed
audio data. The strength of this simple and abstract view of audio processing is that it allows
a variety of tasks, such as tracking, restoration, transcription, separation, identification and
resynthesis, to be formulated as Bayesian inference problems. The approach also inherits the benefit
common to all applications of Bayesian statistical methods that the problem formulation and
computational solution strategy are well separated. This differs significantly from heuristic and
ad-hoc approaches to audio processing which have been popular historically and which involve
the design of custom-built algorithms for solving specific tasks where problem formulation and
computational solution are mixed, taking account of practical and pragmatic considerations.
1.1 Introduction to Musical Audio
The following discussion gives a basic introduction to some of the properties of musical audio
signals. The discussion follows closely that of (Godsill 2004). Musical audio is highly structured,
both in the time domain and in the frequency domain. In the time domain, tempo and beat
specify the range of likely note transition times. In the frequency domain, two levels of struc-
ture can be considered. First, each note is composed of a fundamental frequency (related to the
‘pitch’ of the note), and partials whose relative amplitudes determine the timbre of the note.
This frequency domain description can be regarded as an empirical approximation to the true
process, which is in reality a complex non-linear time-domain system (McIntyre, Schumacher,
and Woodhouse 1983; Fletcher and Rossing 1998). The frequencies of the partials are
approximately integer multiples of the fundamental frequency, although this clearly doesn't apply for
instruments such as bells and tuned percussion. Second, several notes played at the same time
form chords, or polyphony. The fundamental frequencies of each note comprising a chord are
typically related by simple multiplicative rules. For example, a C major chord may be composed
of the frequencies 523 Hz, 659Hz ≈ 5/4×523 Hz and 785 Hz ≈ 3/2×523 Hz. Figure 2 shows a
time-frequency spectrogram analysis for a simple monophonic (single note) flute recording (this
may be auditioned at www-sigproc.eng.cam.ac.uk/~sjg/haba, where other extracts used in this
paper may also be listened to; web page not yet generated - sorry!), corresponding to the
waveform displayed as Figure 1. In this, both the temporal segmentation and the frequency
domain structure are clearly visible on the plot. Focusing on a single localised time frame, at
around 2s in the same extract, we can clearly see the fundamental frequency component,
labelled ω_0, and the partial structure, at frequencies 2ω_0, 3ω_0, . . ., of a single musical note
in Figure 3. It is clear from spectra such as Figure 3 that it will be possible to estimate the
pitch (we will refer to pitch interchangeably with ω_0, although it should be noted that
perceived pitch is a more complex function of the fundamental and all partials) and partial
information (amplitudes of partials, number of partials, etc.) from single-note data that is well
segmented in time (so that there is not significant overlap between more than one separate
musical note within any single segment). There are many ways to achieve this, based on sample
autocorrelation functions, spectral peak locations, etc. Of course, real musical extracts don't
usually arrive in conveniently segmented single note form, and much more complex structures
need to be considered.

Figure 1: Time-domain waveform for a solo flute extract (amplitude against t/sec).

Figure 2: Time-frequency spectrogram representation for the flute recording (f/Hz against t/sec).

Figure 3: Short-time Fourier analysis of a single frame of data from the flute extract, showing
the fundamental ω and its 'partials' or 'harmonics' at 2ω, 3ω, . . . (amplitude against frequency).
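As a concrete illustration of the simplest of these approaches, the following Python sketch (not
from the original work; it assumes only numpy and an illustrative synthetic test signal)
estimates the fundamental frequency of a well-segmented single-note frame from the peak of its
sample autocorrelation function:

import numpy as np

def estimate_pitch_autocorr(y, fs, fmin=50.0, fmax=2000.0):
    """Crude fundamental frequency estimate (Hz) from the autocorrelation peak."""
    y = y - np.mean(y)
    acf = np.correlate(y, y, mode="full")[len(y) - 1:]  # non-negative lags only
    lag_min = int(fs / fmax)  # shortest plausible period, in samples
    lag_max = int(fs / fmin)  # longest plausible period, in samples
    lag = lag_min + np.argmax(acf[lag_min:lag_max])
    return fs / lag

# Example on a synthetic two-partial tone at 440 Hz:
fs = 44100
t = np.arange(4096) / fs
y = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
print(estimate_pitch_autocorr(y, fs))  # approximately 440 Hz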
1.2 Applications
There are many tasks of interest for musical analysis in which computers can be of assistance,
including (but not limited to):
1. Music-to-score transcription. This involves the analysis of raw audio signals to produce
a musical ‘score’ representation. This is one of the most challenging and comprehensive
tasks facing us in computational music analysis, and one that is certainly ill-defined, since
there are many possible written scores corresponding to one performance. An expert human
listener could transcribe a relatively complex piece of musical audio, but the score
produced would be dissimilar in many respects to that of the composer. However, it would
be reasonable to hope that the transcriber could generate a score having similar pitches
and durations to those of the composer. The sub-task of generating a pitch-and-duration
map of the music is the main aim of many so-called 'transcription' systems. Others have
considered the task of score generation from this point on and software is available
commercially for this highly subjective part of the process - we will not consider it further
here. Applications that require the transcription task include analysis of ethnomusicological
recordings, transcription of jazz and other improvised forms for analysis or publication
of performance versions, and transcriptions of rare or historical pieces which are no longer
available in the form of a printed score. Apart from applications which directly require the
full transcription, there are many applications, for example those below, which are fully
or partially solved as a result of a solution to the transcription problem.
2. Instrument classification is an important component of musical analysis systems, i.e. the
task of recognising which instruments are playing at any given time in a piece.
3. A related concept is timbre determination - extraction of the tonal character of a pitched
musical note (in coarse terms, is it harsh, sweet, bright, etc.).
4. Signal separation - here we attempt to separate out individual instruments or notes from
a polyphonic (many-note) mixture. This finds application in many areas from sound
remastering in the recording studio through to Karaoke (extraction of a principal vocal
line from a source, leaving just the accompaniment). Source separation finds much wider
application of course in non-musical audio, especially in source separation for hearing aids,
see below.
5. Audio restoration and enhancement. In this application the quality of an audio source is
enhanced, for example by reduction of background noise. This task comes as a by-product
of many model-based analysis tasks, such as source separation above, since a noise-reduced
version of the input signal will often be available as one of the possible inferences from the
Bayesian posterior distribution.
The fundamental tasks above will find use in many varied acoustical applications. For example,
with the vast amount of audio data available digitally in on-line repositories, it is not
unreasonable to predict that almost all audio material will be available digitally in the near
future. This has rendered automated processing of audio for sorting and choice of musical
content an important and central information processing task, affecting literally millions of
end users. For flexible interaction, it is essential that systems are able to extract structure and
organize information from the audio signal directly. Our view is that the associated fundamental
computational problems require both a fresh look at existing signal processing techniques and
the development of novel statistical methodology.
In addition, computer based music composition and sound synthesis date back to the first
days of digital computation. However, despite recent technological advances in synthesis,
compression, processing and distribution of digital audio, it has not yet been possible to
construct machines that can simulate the effectiveness of human listening.
Statistical methodologies are now migrating into human computer interaction, computer
games and electronic entertainment computing. Here, one ambitious research goal focuses on
computational techniques to equip computers with musical listening and interaction capabilities.
This is essential in the construction of intelligent music systems and virtual musical instruments
that can listen, imitate and autonomously interact with humans. For flexible interaction, it is
essential that music systems are aware of the actual content of the music, are able to extract
structure and organise information directly from acoustic input. For generating convincing
performances, they need to be able to analyse and mimic master musicians.
Another vitally important application area for millions of people is hearing aids, which directly
benefit from efficient and robust methods for recognition and source separation (Hamacher,
Chalupper, Eggers, Fischer, Kornagel, Puder, and Rass 2005). It is estimated that there are
almost nine million hearing impaired people in the UK alone, a number which is believed to be
increasing with a rapidly aging population. Progress in this field is likely to improve the quality
of life for a sizeable segment of society. Recently, modern hearing aids have evolved into
powerful computational devices, and with advances in wireless communications it is now becoming
feasible to delegate computation to external portable computing devices. This provides
unprecedented possibilities along with interesting computational challenges for online adaptation,
since in the next generation of hearing aids it will be feasible to run sophisticated statistical
signal processing and machine learning algorithms. Finally, computational audio processing
finds application in the areas of monitoring, rescue and surveillance, computer aided music
education, musicology, and music perception and cognition research.
Figure 4: Some acoustical instruments (piano, viola, piccolo, french horn, cymbals, congas),
examples of typical time series (amplitude against time) and corresponding spectrograms (time
varying magnitude spectra - modulus of the short time Fourier transform, frequency index
against frame index) computed with the FFT. (Audio data and images from the RWCP
Instrument samples database.)
Figure 5: Superposition (piano + piccolo + cymbals). The time series and the magnitude
spectrogram of the resulting signal when some of the instruments play concurrently.
2 Fundamental Audio Processing Tasks
From the above discussion of the challenges facing audio processing, some fundamental tasks
can be identified for treatment by Bayesian techniques. Firstly, we can hope to address the
superposition task in a model-based fashion, by posing models that capture the behaviour of
superimposed signals. These are similar in flavour to the latent factors analysed in some
statistical modelling problems. A generic model for observed data Y, under a linear superposition
assumption, will then be

Y = Σ_{i=1}^{I} s_i    (1)

where the s_i represent each of the I individual audio sources present. We pose this very basic
model here as a single-channel observation model, although it is straightforward to extend
the model to the multi-channel case, in which case it will be usual to include also channel-
specific mixing coefficients. The sources and data will typically be audio time series, but can
also represent expansion coefficients of the audio in some other domain such as the Fourier or
wavelet domain, as will be made clear in context later. We may make the model a little more
sophisticated by making the data a stochastic function of the sources, and in this case we will
specify some non-degenerate likelihood function p(Y | Σ_{i=1}^{I} s_i).
We typically assume that the individual sources s_i, one or more of which may be background
noise terms, are independent a priori. They are parameterised by θ_i, which represent
information about the sound generation process for that particular source, including perhaps its pitch
and other characteristics (number of partials, etc.), encoded through a conditional distribution
and prior distribution for each source:

p(s_i, θ_i) = p(s_i | θ_i) p(θ_i)

Dependence between the θ_i, for example to model the harmonic relationships of notes within a
chord, can of course be included as desired when considering the joint distribution of sources
and parameters. To this model we can add unknown hyperparameters Λ with prior p(Λ) in the
usual way, and incorporate model uncertainty through an additional prior distribution on the
number of components I. The specification of suitable source models p(s_i | θ_i) and p(θ_i), as well
as the form of likelihood function p(Y | Σ_{i=1}^{I} s_i), will form a substantial part of the
remainder of the paper.
Several fundamental inference tasks can then be identified from this generic model, including
the source separation and polyphonic music transcription tasks identified above.
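As a purely schematic illustration of this generative structure (not one of the models developed
later in the chapter), the following Python sketch draws parameters θ_i, draws each source
given its parameters, and superimposes them; all distributional choices here are illustrative
placeholders:

import numpy as np

rng = np.random.default_rng(0)
N, I = 1024, 3  # frame length and number of sources

def sample_source(rng, N):
    """Draw theta_i, then s_i | theta_i (a single noiseless partial here)."""
    theta = {"omega0": rng.uniform(0.01, 0.3),   # fundamental, rad/sample
             "gain": rng.gamma(2.0, 0.5),        # overall amplitude
             "phase": rng.uniform(0.0, 2 * np.pi)}
    t = np.arange(N)
    s = theta["gain"] * np.cos(theta["omega0"] * t + theta["phase"])
    return s, theta

sources, thetas = zip(*[sample_source(rng, N) for _ in range(I)])
Y = np.sum(sources, axis=0) + 0.01 * rng.standard_normal(N)  # noisy superposition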
2.1 Source Separation
In source separation, the task is to infer the source signals s_i themselves, given the observed
signal Y. Collecting the sources together as S = {s_i}_{i=1}^{I} and the parameters as
Θ = {θ_i}_{i=1}^{I}, the Bayesian formulation of the problem can be stated, under a fixed number
of sources I, as (see for example (Mohammad-Djafari 1997; Knuth 1998; Rowe 2003; Févotte
and Godsill 2006; Cemgil, Godsill, and Fevotte 2007))

p(S|Y) = (1/p(Y)) ∫ p(Y|S, Λ) p(S|Θ, Λ) p(Λ) p(Θ) dΛ dΘ    (2)

where, under our deterministic model above in Eq. 1, the likelihood function p(Y|S, Λ) will be
degenerate. The marginal likelihood p(Y) plays a key role when model order uncertainty is to
be incorporated into the problem, for example when the number of sources I is unknown and
needs to be estimated (Miskin and Mackay 2001).
Additional considerations which may be included in the above framework are convolutive
(filtered) and non-stationary mixing of the sources - both scenarios are of practical interest and
still pose significant computational challenges. Once the posterior distribution is computed by
evaluating the integral, point estimates of the sources can be obtained using suitable estimation
criteria, such as marginal MAP or posterior mean estimation, although in the latter case one
has to be especially careful with the interpretation of expectations in models where likelihoods
and priors are invariant to source permutations.
2.2 Polyphonic Music Transcription
Music transcription refers to extraction of a human readable and interpretable description from
a recording of a music performance, see Fig. 6. In cases where more than a single musical
note plays at a given time instant, we term this task polyphonic music transcription. Interest
in this problem is largely motivated by a desire to implement a program to infer automatically
a musical notation, such as the traditional western music notation, listing the pitch values
of notes, corresponding timestamps and other expressive information in a given performance.
These quantities will be encoded in the above model through the parameters θ_i of each note
present at a given time. Simple models will encode only the pitch of the note in θ_i, while
more complex models can include expressive information, instrument-specific characteristics
and timbre, etc.
Apart from being an interesting modelling and computational problem in its own right,
automated extraction of a score-like description is potentially very useful in a broad spectrum
of applications such as interactive music performance systems, music information retrieval and
musicological analysis of musical performances, not to mention as an aid to the source separation
task identified above. However, in its most unconstrained form, i.e., when operating on an
arbitrary acoustical input, music transcription remains a very challenging problem, owing to
the wide variation in acoustical conditions and characteristics of musical instruments. In spite
of these difficulties, a practical engineering solution is possible by careful incorporation of prior
knowledge from cognitive science, musicology, musical acoustics, and by use of computational
techniques from statistics and digital signal processing.
Figure 6: Polyphonic Music Transcription. The task is to generate a human readable score as
shown below, given the acoustic input. The computational problem here is to infer pitch, number
of notes, rhythm, tempo, meter, and time signature. The inference can be achieved online
(filtering) or offline (smoothing), depending upon requirements.
2.3 Hierarchical Models for Musical Audio
In a statistical sense, music transcription is an inference problem where, given a signal, we
want to find a score that is consistent with the encoded music. In this context, a score can
be contemplated as a collection of “musical objects” (e.g., note events) that are rendered by a
performer to generate the observed signal. The term “musical object” comes directly from an
analogy to visual scene analysis where a scene is "explained" by a list of objects along with a
description of their intrinsic properties such as shape, color or relative position. We view music
transcription from the same perspective, where we want to "explain" individual samples of a
music signal in terms of a collection of musical objects where each object has a set of intrinsic
properties such as pitch, tempo, loudness, duration or score position. It is in this respect that
a score is a high level description of music.

Figure 7: A hierarchical generative model for music transcription (Score → Expression →
Piano-Roll → Signal). In this model, an unknown score is rendered by a performer into a
piano-roll. The performer introduces expressive timing deviations and tempo fluctuations. The
piano-roll is rendered into audio by a synthesis model. The piano-roll can be viewed as a
symbolic representation, analogous to a sequence of MIDI events. Given the observations,
transcription can be viewed as Bayesian inference of the score. Somewhat simplified, the
techniques described in this article can be viewed as inference techniques applied to subgraphs
of this graphical model.
Musical signals have a very rich temporal structure, and it is natural to think of them as being
organized in a hierarchical way. At the highest level of this organization, which we may call
the cognitive (symbolic) level, we have a score of the piece, as, for instance, intended by a
composer. (In reality the music may be improvised and there may actually not be a written
score. In this case we replace the generative model with the intentions of the performer, which
can still be expressed in our framework as a 'virtual' musical score.) The performers add their
interpretation to music and render the score into a collection of "control signals". Further down,
on the physical level, the control signals trigger various musical instruments that synthesize the
actual sound signal. We illustrate these generative processes using a hierarchical graphical
model (see Figure 7), where the arcs represent generative links. This architecture is of course
anything but new, and in fact underlies any music generating computer program such as a
sequencer. The main difference of our model from a conventional sequencer is that the links
are probabilistic, instead of deterministic. We use the sequencer analogy in describing a
realistic generative process for a large class of music signals.

In describing music, we are usually interested in a symbolic representation and not so much
in the "details" of the actual waveform. To abstract away from the signal details, we define
an intermediate layer that represents the control signals. This layer, which we call a "piano-
roll", forms the interface between a symbolic process and the actual signal process. Roughly,
the symbolic process describes how a piece is composed and performed. Conditioned on the
piano-roll, the signal process describes how the actual waveform is synthesized. Conceptually,
the transcription task is then to "invert" this generative model and recover back the original
score. As an intermediate and less sophisticated task, we may try and invert back only as far
as the piano-roll.
3 Signal Models for Audio
We begin the discussion by describing some basic note and chord models for musical audio, based
in the time or frequency domain. As already discussed, a basic property of most non-percussive
musical sounds is a set of oscillations at frequencies related to the fundamental frequency ω_0.
Consider for the moment a short-time frame of musical audio data, denoted y(τ), in which
note transitions do not occur. This would correspond, for example, to the analysis of a single
musical chord. Throughout, we assume that the continuous time audio waveform y(τ) has been
discretised with a sampling frequency ω_s rad.s⁻¹, so that discrete time observations are obtained
as y_t = y(2πt/ω_s), t = 0, 1, 2, . . . , N − 1. We assume that y(τ) is bandlimited to ω_s/2 rad.s⁻¹,
or equivalently that it has been prefiltered with an ideal low-pass filter having cut-off frequency
ω_s/2 rad.s⁻¹. We will not consider for the moment the time evolution of one chord to the next,
or of note changes in a melody. This critical issue is treated in later sections.
The following model for, say, the ith note out of a chord comprising I notes in total can be
written as

s_{i,t} = Σ_{m=1}^{M_i} [ α_{m,i} cos(m ω_{0,i} t) + β_{m,i} sin(m ω_{0,i} t) ]    (3)

for t ∈ {0, . . . , N − 1}. Here, M_i > 0 is the number of partials present in note i,
√(α²_{m,i} + β²_{m,i}) gives the amplitude of the mth partial and tan⁻¹(β_{m,i}/α_{m,i}) gives
its phase. Note that ω_{0,i} ∈ (0, π) is here scaled for convenience - its actual frequency is
(ω_{0,i}/2π) ω_s. The unknown parameters for each note are thus ω_{0,i}, the fundamental
frequency, M_i, the number of partials, and α_{m,i}, β_{m,i}, which determine the amplitude and
phase of each partial.
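For concreteness, the note model of Eq. (3) can be rendered numerically as in the following
short Python sketch (an illustration only, assuming numpy; the example amplitudes are
arbitrary):

import numpy as np

def render_note(omega0, alpha, beta, N):
    """s_t = sum_m alpha_m cos(m w0 t) + beta_m sin(m w0 t), t = 0..N-1."""
    t = np.arange(N)
    m = np.arange(1, len(alpha) + 1)[:, None]  # partial numbers, as a column
    return (alpha[:, None] * np.cos(m * omega0 * t)
            + beta[:, None] * np.sin(m * omega0 * t)).sum(axis=0)

# Example: a note at omega0 = 0.1 rad/sample with five 1/m-decaying partials.
alpha = 1.0 / np.arange(1, 6)
beta = np.zeros(5)
s = render_note(0.1, alpha, beta, 4096)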
The extension to the multiple note case is then straightforwardly obtained by linear
superposition of a number of notes:

y_t = Σ_{i=1}^{I} s_{i,t} + v_t

where v_t is a random background noise component (compare this with the deterministic mixture
in Eq. 1). In this model v_t will also have to model any residual transient noise from the musical
instruments themselves. We now have in addition an unknown parameter I, the number of
notes present, plus any unknown statistics of the background noise process.

Such a model is a reasonable approximation for many steady musical sounds, and has quite
a lot of analytical tractability, especially if a Gaussian form is assumed for v_t and for the priors
on amplitudes α and β. Nevertheless, the posterior distribution is highly non-Gaussian and
multimodal, and sophisticated computational tools are required to infer accurately from this
model. This was precisely the topic of the work in (Walmsley, Godsill, and Rayner 1998) and
(Walmsley, Godsill, and Rayner 1999), where a reversible jump sampler was developed for such
a model, under the above-mentioned Gaussian prior assumptions.
The basic form above is however over-idealised in a number of ways: principally in the
assumption of constant amplitudes α and β over time, and in the fixed integer relationships
between partials, i.e. partial m in note i lies exactly at frequency m ω_{0,i}. The modification of
the basic model to remove these assumptions was the topic of our later work (Davy and Godsill
2002; Godsill and Davy 2002; Davy, Godsill, and Idier 2006; Godsill and Davy 2005), still within
a reversible jump Monte Carlo framework.

Figure 8: Basis functions ψ_{i,t}, I = 9, 50% overlapped Hanning windows.

In particular, it is fairly straightforward to modify
the model so that the partial amplitudes α and β vary with time,

s_{i,t} = Σ_{m=1}^{M_i} [ α_{m,i,t} cos(m ω_{0,i} t) + β_{m,i,t} sin(m ω_{0,i} t) ]    (4)

and we typically expand α_{m,i,t} and β_{m,i,t} on a finite set of smooth basis functions ψ_{j,t}
with expansion coefficients a_{m,i,j} and b_{m,i,j}:

α_{m,i,t} = Σ_{j=1}^{J} a_{m,i,j} ψ_{j,t},    β_{m,i,t} = Σ_{j=1}^{J} b_{m,i,j} ψ_{j,t}

In our work we have adopted 50%-overlapped Hanning windows for the basis functions, see
Fig. 8, with support either chosen a priori by the user or treated as a Bayesian random variable
(Godsill and Davy 2005).
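A minimal sketch of such a basis, assuming numpy and truncating any partially covered region
at the signal edges, is given below; the window length of 512 samples is an illustrative choice:

import numpy as np

def hanning_basis(N, width):
    """50%-overlapped Hanning windows tiling t = 0..N-1."""
    hop = width // 2
    starts = list(range(0, N - width + 1, hop))
    psi = np.zeros((len(starts), N))
    for j, s0 in enumerate(starts):
        psi[j, s0:s0 + width] = np.hanning(width)
    return psi

# A time-varying amplitude alpha_t = sum_j a_j psi_j(t) for one partial:
psi = hanning_basis(4096, 512)
a = np.random.default_rng(0).standard_normal(psi.shape[0])
alpha_t = a @ psi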
Alternative more general representations allow a fully stochastic variation of α_{m,i,t} in the
state-space formulation, see section 4.

Further idealisations in these models include the assumption of constant fundamental
frequencies with time and the Gaussian prior and noise assumptions, but in principle all can be
addressed in a principled Bayesian fashion.
3.1 A prior distribution for musical notes
Under the above basic time-domain model we need to assign prior distributions over the
unknown parameters for a single note in the mix, currently {ω_{0,i}, M_i, α_i, β_i}, where α_i, β_i
are the vectors of parameters α_{m,i}, β_{m,i}, m = 1, 2, . . . , M_i. Under an assumed note system
such as an equally-tempered Western note system, we can augment this with a note number
index n_i. A suitable scheme is the MIDI note numbering system (see for example
www.harmony-central.com/MIDI/doc/table2), which labels middle C (or 'C4') as note number 60,
and all other notes as integers relative to this - the A below this would
be 57, for example, and the A above middle C (usually at 440 Hz in modern Western tuning
systems) would be note number 69. Other non-Western systems could also be encoded within
variants of such a scheme. The fundamental frequency would then be expected to lie 'close'
to the expected frequency for a particular note number, allowing for performance and tuning
deviations from the ideal. Thus a prior for the observed fundamental frequency ω_{0,i} can be
constructed fairly straightforwardly. We adopt here a truncated log-normal distribution for the
note's fundamental frequency:

p(log(ω_{0,i}) | n_i) ∝ N(µ(n_i), σ²_ω)  if  log(ω_{0,i}) ∈ [ (µ(n_i − 1) + µ(n_i))/2 , (µ(n_i) + µ(n_i + 1))/2 ),  and 0 otherwise

where µ(n) computes the expected log-frequency of note number n, i.e., when we are dealing
with A440 music in the equally tempered western system,

µ(n) = ((n − 69)/12) log(2) + log(440/ω_s)    (5)

where once again ω_s rad.s⁻¹ is the sampling frequency of the data. Assuming p(n) is uniform
for now, the resulting prior p(ω_{0,i}) is plotted in Fig. 9, capturing the expected clustering of
note frequencies at semitone spacings relative to A440.

Figure 9: Prior for fundamental frequency p(ω_{0,i}), shown against log(ω_{0,i}) in semitones
relative to A440 Hz.
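This prior machinery can be sketched directly from Eq. (5); in the Python fragment below the
density is left unnormalised (matching the proportionality above) and the value of σ_ω is an
illustrative placeholder:

import numpy as np

def mu(n, omega_s):
    """Expected log-frequency of MIDI note n, following Eq. (5)."""
    return (n - 69) / 12.0 * np.log(2.0) + np.log(440.0 / omega_s)

def log_prior_log_omega0(log_w0, n, omega_s, sigma_omega=0.01):
    """Truncated Gaussian on log(omega0): zero outside note n's semitone cell."""
    lo = 0.5 * (mu(n - 1, omega_s) + mu(n, omega_s))
    hi = 0.5 * (mu(n, omega_s) + mu(n + 1, omega_s))
    if not (lo <= log_w0 < hi):
        return -np.inf
    return -0.5 * ((log_w0 - mu(n, omega_s)) / sigma_omega) ** 2  # unnormalised

omega_s = 2 * np.pi * 44100.0  # sampling frequency in rad/s
print(mu(69, omega_s))         # expected log-frequency of A440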
The prior model for a note is completed with two components. Firstly a prior for the
number of partials, p(M_i | ω_{0,i}), is specified as uniform over the range {M_min, . . . , M_max},
with limits truncated to prevent partials at frequencies greater than ω_s/2, the Nyquist rate.
Secondly a prior for the amplitude parameters α_i, β_i must be specified. This turns out to be
quite crucial to the modelling performance and here we initially proposed a Gaussian form. It
is expected however that partials at high frequencies will have lower energy than those at low
frequencies, generally following a low-pass filter shape in the frequency domain. Coefficients
α_{m,i} and β_{m,i} are then assigned independent Gaussian prior distributions such that their
amplitudes are assumed
to decay with increasing partial number m. The general form of this is

p(α_{m,i}, β_{m,i}) = N(β_{m,i} | 0, g²_i k_m) N(α_{m,i} | 0, g²_i k_m)

Here g_i is a scaling factor common to all partials in a note and k_m is a frequency-dependent
scaling factor to allow for the expected decay with increasing frequency for partial amplitudes.
Following (Godsill and Davy 2005) the amplitudes are assumed to decay as follows:

k_m = 1/(1 + (T m)^ν)

where ν is a decay constant and T determines the cut-off frequency. Such a model is based
on empirical observations of the partial amplitudes in many real instrument recordings, and
essentially just encodes a low pass filter with unknown cut-off frequency and decay rate. See
for example the family of curves with T = 5, ν = 1, 2, . . . , 10, Fig. 10.

Figure 10: Family of k_m curves (log-log plot), T = 5, ν = 1, . . . , 10.

It is worth pointing out that this model does not impose very stringent constraints on the
precise amplitude of the partials: the Gaussian distribution will allow for significant departures
from the k_m = 1/(1 + (T m)^ν) rule, as dictated by the data, but it does impose a generally
low-pass shape to the harmonics across frequency. It is possible to keep these parameters as
unknowns in the MCMC scheme (see (Godsill and Davy 2005)), although in the examples
presented here we fix these to appropriately chosen values for the sake of computational
simplicity. The parameter g_i, which can be regarded as the overall 'volume' parameter for a note,
is treated as an additional random variable, assigned an inverted Gamma distribution for its
prior. The Gaussian prior structure outlined here for the α and β parameters is readily extended
to the time-varying amplitude case of Eq. (4), in which case similar Gaussian priors are applied
directly to the expansion coefficients a and b, see (Davy, Godsill, and Idier 2006).
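The decay rule is simple to compute; the following fragment evaluates k_m for a few values of
ν, reproducing the qualitative low-pass behaviour of Fig. 10:

import numpy as np

def k(m, T=5.0, nu=2.0):
    """Partial-amplitude prior scale k_m = 1/(1 + (T m)^nu)."""
    return 1.0 / (1.0 + (T * m) ** nu)

m = np.arange(1, 11)
for nu in (1, 2, 5, 10):
    print(nu, np.round(k(m, T=5.0, nu=nu), 6))  # decay steepens as nu grows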
In the simplest case, a polyphonic model is then built by taking an independent prior over
the individual notes and the number of notes present:

p(Θ) = p(I) Π_{i=1}^{I} p(θ_i)

where

θ_i = {n_i, ω_{0,i}, M_i, α_i, β_i, g_i}

This model can be explored using MCMC methods, in particular the reversible jump MCMC
method (Green 1995), and results from this and related models can be found in (Godsill and
Davy 2005; Davy, Godsill, and Idier 2006). In later sections, however, we discuss simple
modifications to the generative model in the frequency domain which render the computations much
more feasible for large polyphonic mixtures of sounds.

The models of this section provide a quite accurate time-domain description of many musical
sounds. The inclusion of additional effects such as inharmonicity and time-varying partial
amplitudes (Godsill and Davy 2005; Davy, Godsill, and Idier 2006) makes for additional realism.
3.2 Example: musical transient analysis with the harmonic model
A useful case in point is the analysis of musical transients, i.e. the start or end of a musical
note, when we can expect rapid variation in partial amplitudes with time. Here we take as an
example a pipe organ transient, analysed under different playing conditions: one involving a
rapid release at the end of the note, and the other involving a slow release, see Fig. 11. There is
some visible (and audible) difference between the two waveforms, and we seek to analyse what
is being changed in the structure of the note by the release mode. Such questions are of interest
to acousticians and instrument builders, for example.

Figure 11: Waveforms for release transient on pipe organ. Top: slow release; bottom: fast release.

We analyse these datasets using the prior distribution of the previous section and the model
of Eq. (4). A fixed length Hanning window of duration 0.093 s was used for the basis functions.
The resulting MCMC output can be used in many ways. For example, examination of the
expansion coefficients a_i and b_i allows an analysis of how the partials vary with time under each
playing condition. In both cases the reversible jump MCMC identifies 9 significant partials in
the data. In Figs. 12 and 13 we plot the first five (m = 1, . . . , 5) partial energies
a²_{m,i} + b²_{m,i} as a function of time.
Figure 12: Magnitudes of partials with time: slow release.
Figure 13: Magnitudes of partials with time: fast release.
Examining the behaviour from the MCMC output we can see that the third partial is
substantially elevated during the slow release mode, between coefficients i = 30 to 40. Also, in the
slow release mode, the fundamental frequency (m = 1) decays at a much later stage relative to,
say, the fifth partial, which itself decays more slowly in that mode. One can also use the model
output to perform signal modification; for example time stretching or pitch shifting of the
transient are readily achieved by reconstructing the signal using the MCMC-estimated parameters
but modifying the Hanning window basis function length (for time-stretching) or reconstructing
with modified fundamental frequency ω_0, see www-sigproc.eng.cam.ac.uk/~sjg/haba. The
details of our reversible jump MCMC scheme are quite complex, involving a combination of
specially designed independence Metropolis-Hastings proposals and random walk-style
proposals for the note frequency variables. In the frequency-domain models described in section 5 we
use essentially the same MCMC scheme, with simpler likelihood functions - some more details
of the proposals used are given there.
4 Dynamical State Space Models
The models introduced in the previous section are generalised linear models, where the
expansion coefficients a_i and b_i are assumed to be a priori independent across frames. While these
models are quite useful, they are not very realistic models of the underlying physics, as audio
is essentially the result of unfolding dynamical processes. It is possible to introduce random
walk dynamics on the expansion coefficients. In contrast, we will describe the evolution of the
expansion coefficients in state space form, via state space models. Such representations can be
derived from well known sinusoidal signal representations such as those described in Appendix A
and are one step towards physical models.

The audio signal can be written as a sum of exponentially decaying or windowed sinusoids
(see Eq. (3) or examples in appendix sections A.2 and A.3). However, richer processes with more
complex behaviour can be described by state space models. This involves defining stochastic
signal representations and higher level, unobserved stochastic elements, such as change point
processes and model-order indicators, which are combined to form hierarchical Bayesian models.
The reader will notice that these models exhibit a variety of structures that arise from the
interaction between the low-level signal models and the high-level latent processes. We also
describe associated inference tasks and efficient methods for their solution which take advantage
of the model structures.
4.1 Conditionally Linear Dynamical Systems with regime switching
We start this section with a general framework to highlight and unify the basic modelling
ideas. Our goal is to construct a model that can mimic the qualitative behaviour of acoustical
systems (such as the musical instruments shown in Figure 4) yet still has some analytical structure
which allows efficient inference. Our starting point is the conditionally-linear state space model.
This is motivated by the fact that the harmonic structure of audio signals, arising from the
physical processes by which they are generated, can conveniently be formulated in state-space form.

Here, we construct a generic model as a cascade of two systems with an excitation (e.g., a
vibrating string) feeding into a resonator (e.g., the body of an acoustic instrument).
Figure 14: A damped oscillator in state space form.

Respectively, s^e and s^f are the states of the excitation system and the resonating system:

s^e_k ∼ N(s^e_k ; A_k s^e_{k−1}, Q_k)
s^f_k = A_ft s^f_{k−1} + B_ft s^e_k
ȳ_k ∼ N(ȳ_k ; C s^f_k, R)    (6)

where C is an observation matrix and R is the observation noise covariance. This model is, of
course, a particular parametrisation of the general linear dynamical system. The main idea, in
the general sense, is to define a nonstationary process with a prior p(Ā, Q̄) over the sequence
of transition matrices Ā ≡ {A_k}_{k≥0} and the transition noise covariances Q̄ ≡ {Q_k}_{k≥0}.
The posterior estimates of these latent parameters, when integrated over the latent states s_k,
will describe the signal in a compact way. There are clearly many possibilities in defining prior
distributions over Ā and Q̄. In the sequel, we will define several realistic models in this framework.
We define a sequence of discrete switch variables r = r_{0:K−1} and draw the state matrices
conditionally. Here the indicators r are abstract, but in practice will correspond to onsets,
offsets or note labels, depending upon the context of the task at hand:

p(r) = p(r_{0:K−1}) = p(r_0) Π_{k=1}^{K−1} p(r_k | r_{k−1})

A_k ∼ p(A_k | r_k),    Q_k ∼ p(Q_k | r_k)

This includes special cases such as when A_k = A(r), i.e. a deterministic function of r (with the
choice p(A_k | r) = δ(A_k − A(r))), and similarly for Q_k.

In principle, one could work in any state space coordinate system by appropriately choosing
the state matrices. However, we prefer to work in a representation that maintains the
interpretability of the parameters. This representation is closely related to the sinusoidal models and
is described in detail in Appendix A.
4.1.1 Dynamic Harmonic model
To highlight our specific construction, we start this section with an example. We consider a
second order oscillator system driving a second order resonator, following Eq. (6). We specify
the model by a specific choice of transition matrices and transition noise covariance:

A_k = Z̃(γ_k, ω_k),    A_ft = Z̃(γ_ft, ω_ft),    Q_k = q_k I

where

Z̃(γ, ω) ≡ e^{−Lγ} [ cos(Lω)  −sin(Lω) ; sin(Lω)  cos(Lω) ]^⊤

We have an L × L observation noise covariance matrix R, and an L × 2K observation matrix
C, where each column is a damped sinusoid (see Eq. (32) in the appendix). For simplicity,
we consider the case when the frame length is L = 1, where we generate the signal sample by
sample. See Fig. 14. Note that in this representation the state vector s is simply the expansion
coefficients (real amplitudes) a and b, introduced in the previous section.

The transition matrix A_k generates a damped sinusoid which is fed into the system with
transition matrix A_ft via the 2 × 2 input matrix B_ft, here taken as B_ft = I. The driving noise
of the excitation has time dependent variance Q_k = q_k I. The observation matrix in this case
is C = (1 0). The discrete variables r_k in this model encode onsets and offsets. We define
a Markov chain r_k for k = 0, 1, . . ., where r_k ∈ {on = 1, off = 0}, with the state transition
distribution parametrised as p(r_k = on | r_{k−1} = on) = π_on and
p(r_k = off | r_{k−1} = off) = π_off.
Conditioned on r_k and r_{k−1}, we let

q_k = q_onset  if r_k = on and r_{k−1} = off
q_k = q_on    if r_k = on and r_{k−1} = on
q_k = 0       otherwise

γ_k = γ_on   if r_k = on
γ_k = γ_off  if r_k = off

The hyperparameters of this model are the prior state probabilities π_on and π_off, the
transition variances q_onset, q_on, the excitation model transition damping constants γ_on, γ_off
and frequency ω_ex, the resonator parameters γ_ft and ω_ft, and the observation noise variance R.
With each new onset event (r_{k−1,k} = (off, on)), the excitation state is reinitialised from a
Gaussian with variance q_onset and driven with noise until the next onset time.
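A minimal forward simulation of this construction for frame length L = 1 is sketched below in
Python; the damping and variance settings follow the 'only periodic excitation' panel of Fig. 15,
while the switch transition probabilities are illustrative placeholders:

import numpy as np

def rot(gamma, omega):
    """Damped planar rotation Z~(gamma, omega) for L = 1."""
    c, s = np.cos(omega), np.sin(omega)
    return np.exp(-gamma) * np.array([[c, -s], [s, c]])

rng = np.random.default_rng(1)
K = 2000
A_ft = rot(0.005, np.pi / 15)   # resonator: gamma_ft, omega_ft as in Fig. 15
omega_ex = np.pi / 16           # excitation frequency
s_e, s_f = np.zeros(2), np.zeros(2)
r = False                       # switch variable: start in the 'off' state
y = np.zeros(K)
for k in range(K):
    stay = rng.random() < 0.99          # pi_on = pi_off = 0.99 (illustrative)
    r_new = r if stay else (not r)
    onset = r_new and not r
    r = r_new
    q_k = 1.0 if onset else 0.0         # q_onset = 1, q_on = 0 (panel (a))
    gamma_k = 5e-4 if r else 0.03       # gamma_on / gamma_off (panel (a))
    s_e = rot(gamma_k, omega_ex) @ s_e + np.sqrt(q_k) * rng.standard_normal(2)
    s_f = A_ft @ s_f + s_e              # B_ft = I
    y[k] = s_f[0] + 0.01 * rng.standard_normal()  # C = (1 0), obs. noise R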
Even this rather simple model displays quite rich behaviour, as shown by typical realisations
in Fig. 15. Given the observed signals, posterior inference for latent variables gives structural
information about the signal. For example, E(ω_k | ȳ_{1:k}) will give an online estimate of the
instantaneous frequency of the excitation, and the posterior switch estimate E(r_k | ȳ_{1:K})
will give an indication of onsets and offsets.
4.2 Dynamic Harmonic model and Changepoint models
As previously discussed, acoustic systems, and in particular pitched musical instruments, tend
to create oscillations with modes at frequencies that are roughly related by ratios of integers
(Fletcher and Rossing 1998). The fundamental frequency, corresponding to the largest common
divisor of the mode frequencies, is strongly correlated with the perceived pitch in music. For
transcription, we need to estimate the fundamental frequency as well as the onsets and offsets
that mark the beginning and the end of each note. This problem can be formalised using the
harmonic models introduced in section 3, coupled to a change-point structure related to the
mixture model of the previous section. We now combine a number of oscillators that are
harmonically related by a fundamental frequency. We define the block diagonal state evolution
matrix

A_k = diag(Z_{0,k}, . . . , Z_{ν,k}, . . . , Z_{W−1,k})

with possible choices

Z_{ν,k} = Z̃(γ_k, ω_k)^ν = Z̃(νγ_k, νω_k)    (7)

Here, the power ν adjusts both the damping and the frequency. This ensures that all oscillators
are tuned to a multiple of a base fundamental frequency.
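Assembling the block diagonal transition matrix of Eq. (7) is mechanical; the sketch below uses
harmonic numbers 1, . . . , W for the blocks (this indexing convention, like the numerical values,
is an illustrative choice):

import numpy as np

def rot(gamma, omega):
    """Damped planar rotation Z~(gamma, omega)."""
    c, s = np.cos(omega), np.sin(omega)
    return np.exp(-gamma) * np.array([[c, -s], [s, c]])

def harmonic_transition(gamma, omega, W):
    """Block diagonal A_k of Eq. (7): the nu-th block is Z~(nu gamma, nu omega)."""
    A = np.zeros((2 * W, 2 * W))
    for j, nu in enumerate(range(1, W + 1)):  # harmonics of the fundamental
        A[2 * j:2 * j + 2, 2 * j:2 * j + 2] = rot(nu * gamma, nu * omega)
    return A

A = harmonic_transition(0.001, 0.05, W=8)  # 16 x 16: 8 harmonics of 0.05 rad/sample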
Both γ and ω can assume positive real values, and the exact posterior has a complicated
form due to the nonlinear relationship with the observed data. One simplification, in contrast
with the models introduced earlier in section 3, is choosing the fundamental frequencies from a
finite set taking values on a prespecified grid, such as the tempered scale or a finer gradation
according to the desired frequency resolution,

ω_k = ω(m_k),    m_k ∈ {1, . . . , M}
Figure 15: Time series y_k generated by a cascade of two phasors in Eq. (6) (middle), conditioned
on the indicator sequence r_{0:K−1} (top - black = on, white = off). The excitation signals
e_k = (1 0) s_k are shown at the bottom. Panels: (a) only periodic excitation (γ_on = 5 · 10⁻⁴,
γ_off = 0.03, q_on = 0); (b) periodic + noise excitation (γ_on = 0.01, γ_off = 0.1, q_on = 1);
(c) only noise excitation (γ_on = 1, γ_off = 1, q_on = 1); (d) only impulsive excitation
(γ_on = 1, γ_off = 1, q_on = 0). The other hyperparameters are fixed at γ_ft = 0.005,
ω_ft = π/15, ω_ex = π/16, q_onset = 1.
Here, m is a discrete index variable and ω(m) is a function giving the associated fundamental
frequency. For example, when m corresponds to the pitch label, ω(m) corresponds to the
fundamental frequency, as in (5). We define the following pair of discrete latent variables

d_k = (r_k, m_k)

and obtain a discrete chain that can visit one of |d| = 2M states at each time slice k. The
prior can be taken as Markovian, p(d_k | d_{k−1}). The most likely onsets, offsets and
fundamental frequency can be inferred by calculating the marginal maximum a-posteriori (MMAP)
trajectory

d*_{0:K−1} = argmax_{d_{0:K−1}} ∫ p(y_{0:K−1} | s_{0:K−1}) p(s_{0:K−1} | d_{0:K−1}) p(d_{0:K−1}) ds_{0:K−1}
Alternatively, when online estimates are required, as is the case in real-time interaction, the
filtering density can be computed recursively:

p(s_k, d_k | y_{0:k}) ∝ p(y_k | s_k) Σ_{d_{k−1}} ∫ p(s_k, d_k | s_{k−1}, d_{k−1}) p(s_{k−1}, d_{k−1} | y_{0:k−1}) ds_{k−1}

p(d_k | y_{0:k}) = ∫ p(s_k, d_k | y_{0:k}) ds_k
For general switching state space models, exact inference of the above quantities is not tractable.
Whilst in principle the filtering distribution can be represented exactly as a Gaussian mixture
and propagated in closed form, we still have to resort to approximations, since the number
of mixture components needed for exact representation of p(s_k, d_k | y_{0:k}) increases
exponentially with increasing k. However, there is an interesting special case: when conditioned on a
particular configuration of d there is a "forgetting" property, i.e., if

p(s_k | d_{k−1:k} = d̄, s_{k−1}) = p(s_k | d_{k−1:k} = d̄)

then the exact MMAP trajectory or the filtering density can be computed in polynomial time
(Fearnhead 2003; Cemgil, Kappen, and Barber 2004). This can be shown by considering all
trajectories d^(j)_{0:k}, j = 1, . . . , |d|^{k+1}, of the discrete states d up to time k. One can show
that trajectories (j′) which are dominated by (j) in terms of conditional marginal likelihood,
z(j′) ≤ z(j) ≡ p(y_{0:k}, r^(j)_{0:k}), can be discarded without destroying optimality. This
greedy pruning strategy is optimal and leaves only a number of trajectories that increases
polynomially with time (Cemgil, Kappen, and Barber 2004).
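The pruning step can be caricatured in a few lines of Python; the sketch below keeps only the
best-scoring trajectory per current discrete state, a deliberate simplification of the
changepoint-based dominance test used in the cited work:

def prune(hypotheses):
    """hypotheses: list of (trajectory_tuple, log_z). Under the forgetting
    property, continuations depend on the past only through the current state,
    so dominated trajectories sharing that state can be discarded."""
    best = {}
    for traj, log_z in hypotheses:
        key = traj[-1]  # current discrete state d_k
        if key not in best or log_z > best[key][1]:
            best[key] = (traj, log_z)
    return list(best.values())

# Example: two trajectories end in state 1; the weaker one is dropped.
hyps = [((0, 1, 1), -12.3), ((1, 0, 1), -10.7), ((0, 0, 0), -11.0)]
print(prune(hyps))  # [((1, 0, 1), -10.7), ((0, 0, 0), -11.0)]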
4.3 Polyphony and factorial models
The models described so far are useful for modelling complex sound sources. Yet, extensions are
required for source separation or polyphonic transcription. This is typically done via factorial
models, i.e., by constructing models over the product spaces of the individual sources, as was
done for the static case in section 3, Eq. (3).

One possible construction assumes that the audio is a superposition of several sources,
indexed by i = 1, . . . , I, with each source i modelled using the same model class, yet with
different hyperparameter instantiations. We first consider a factorial switching state space model
for music transcription, in which each latent process ν = 1, . . . , W corresponds to a "piano key".
The indicators d_{0:W−1,0:K−1} encode a latent piano roll. Given this model, polyphonic
transcription can be obtained as

d*_{0:W−1,0:K−1} = argmax_d ∫ p(y|s) p(s|d) p(d) ds
Figure 16: (Left) Switching state space model with a block diagonal state transition matrix,
where each block is denoted by ν. The dynamic harmonic model corresponds to the case when
the blocks of the transition matrix are chosen as A_k = Z̃(γ_k, ω_k). (Right) Pitch detection
and onset detection for a signal recorded from a bass guitar. The MMAP state trajectory is
shown as r_k = on = black, and the vertical axis denotes the pitch label index m_k.
Figure 17: A factorial switching state space model and typical realisations. The task of
polyphonic transcription is to find the maximum marginal a-posteriori estimate of the latent
switches d* given the observations.
A factorial switching state space model for polyphonic transcription is introduced in (Cemgil,
Kappen, and Barber 2004) and some further strategies have been investigated in (Cemgil 2007).
The model and a typical realisation are shown in Fig. 17. However, efficient inference in factorial
switching state space models is still an open problem. In the presence of a large number of
concurrent sources I, even a single time slice is intractable, as the discrete variables in the
product space have a cardinality that scales exponentially as O(|d|^I). Even conditioned on d,
the latent continuous state dimension |s| can be very large for realistic models, and analytical
integration over s via Kalman filtering techniques is computationally heavy. The main reason
for this bottleneck in inference is the coupling between the states s. When these are integrated
out, all time slices of the discrete indicators become coupled and the exact inference problem
is reduced to an intractable combinatorial optimisation problem. In the sequel, we will discuss
models where the couplings are ignored.
5 Frequency domain models
The previous two sections described various time domain models for musical audio, including
sinusoidal models and state-space models. The models are quite accurate for many examples of
audio, although they show some non-robust properties in the case of signals which are far from
steady-state oscillation and for instruments which do not closely obey the laws described above.
Perhaps more critically, for large polyphonic mixes of many notes, each having potentially many
partials, the computations can become very expensive, in particular the calculation of marginal
likelihood terms in the presence of many Gaussian components α_i and β_i. Computing the
marginal likelihood is costly, as this requires computation of the Kalman filtering equations for a
large state space (that scales with the number of tracked harmonics) and for very long time
series (as typical audio signals are sampled at 44.1 kHz). Hence, either efficient approximations
need to be developed or simplified models need to be constructed.

In this section we at least partially bypass the computational issues by working with
approximate models in the frequency domain. These allow for direct likelihood calculations without
resorting to expensive matrix inversions and determinant calculations. Later in the chapter these
models will be elaborated further to give sophisticated Bayesian non-negative matrix
factorisation algorithms which are capable of learning the structure of audio events in a semi-blind
fashion. Here initially, though, we work with simple model-based structures in the frequency
domain that are analogous to the time domain priors of section 3. There are several routes to
a frequency domain representation, including multi-resolution transforms, wavelets, etc., though
here we use a simple windowed discrete Fourier transform as exemplar. We now propose two
versions of a frequency domain likelihood model, both of which bypass the main computational
burden of the high-dimensional time-domain Gaussian models.
5.0.1 Gaussian frequency-domain model
The first model proposed is once again a Gaussian model. In the frequency domain we will
have typically complex-valued expansion coefficients of the data on a one-dimensional lattice of
frequency values ν ∈ N, i.e. a set of spectrum values y_ν. The assumption is that the contribution
of each musical source term to the expansion coefficients is an independent zero-mean (complex)
Gaussian, with variance determined by the parameters of the musical note:

s_{i,ν} ∼ N_C(0, λ_ν(θ_i))

where θ_i = {n_i, ω_{0,i}, M_i, g_i} has the same interpretation as for the earlier time-domain
model, but now we can neglect the α and β coefficients since the random behaviour is now directly
modelled by s_{i,ν}. This is a very natural formulation for the generation of polyphonic models
since we can add a number of sources together to make a single complex Gaussian data model:

y_ν ∼ N_C(0, S_{v,ν} + Σ_{i=1}^{I} λ_ν(θ_i))

Here, S_{v,ν} > 0 models a Gaussian background noise component in a manner analogous to
the time-domain formulation's v_t, and it then remains to design the positive-valued 'template'
functions λ. Once again, Fig. 3 gives some guidance as to the general characteristics required.
We then mod el the template using a sum of positive valued pulse waveforms φ
ν
, shifted to be
centred at the expected partial position, and whose amplitude decays with increasing partial
number:
  λ_ν(θ_i) = Σ_{m=1}^{M_i} g_i^2 k_m φ_{ν − m ω_{0,i}}   (8)
where k_m, g_i and M_i have exactly the same interpretation as in the time-domain model. An example template construction is shown in Fig. 18, in which a Gaussian pulse shape has been utilised.
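To make the construction concrete, the following sketch evaluates a template of the form of Eq. 8 with a Gaussian pulse shape. The geometric partial decay k_m = ρ^m, the pulse width and the particular parameter values are illustrative assumptions rather than choices made in the text.

    import numpy as np

    def template(nu, omega0, M, g, rho=0.8, width=0.01):
        """Evaluate lambda_nu(theta) of Eq. (8) on a grid of frequencies nu,
        using a Gaussian pulse phi centred at each partial m*omega0."""
        lam = np.zeros_like(nu)
        for m in range(1, M + 1):
            k_m = rho ** m  # assumed geometric decay of partial amplitudes
            lam += g**2 * k_m * np.exp(-0.5 * ((nu - m * omega0) / width) ** 2)
        return lam

    nu = np.linspace(0.0, 1.5, 2000)   # normalised frequency grid
    lam = template(nu, omega0=0.1, M=8, g=1.0)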
5.0.2 Point process frequency-domain model

The Gaussian frequency domain model requires knowledge of the conditional distribution for the whole range of spectrum values. However, the salient features for pitch estimation appear to be the peaks of the spectrum; see Fig. 1.1. Hence a more parsimonious likelihood model might work only with the peaks detected from the Fourier magnitude spectrum. Thus we propose, as an alternative to the Gaussian spectral model, a point process model for the peaks in the spectrum. Specifically, if the peaks in the spectrum of an individual note are assumed to be drawn from a one-dimensional inhomogeneous Poisson point process having intensity function λ_ν(θ_i) (considered as a function of continuous frequency ν), then the sets of peaks from many notes may be combined, under an independence assumption, to give a Poisson point process whose intensity function is the sum of the individual intensities (Grimmett and Stirzaker 2001). Suppose we detect a set of peaks in the magnitude spectrum, {p_j}_{j=1}^J with ν_min < p_j < ν_max.
Figure 19: Audio waveform y_t (against time in samples), single chord data.
Then the likelihood may be readily computed using:

  p({p_j}_{j=1}^J, J | Θ) = Po(J | Z(Θ)) ∏_{j=1}^{J} [ S_{v,p_j} + Σ_{i=1}^{I} λ_{p_j}(θ_i) ] / Z(Θ)
where Z(Θ) = ∫_{ν_min}^{ν_max} ( S_{v,ν} + Σ_{i=1}^{I} λ_ν(θ_i) ) dν is the normalising constant for the overall intensity function. Here, once again, we include a background intensity function S_{v,ν} which models 'false detections', i.e. detected peaks that belong to no existing musical note. The form of the template functions λ can be very similar to that in the Gaussian frequency model, Eq. 8. A modified form of this likelihood function was successfully applied to chord detection problems in (Peeling, Li, and Godsill 2007).
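A minimal sketch of this likelihood computation might look as follows; the `intensity` callable (returning S_{v,ν} + Σ_i λ_ν(θ_i) on an array of frequencies) and the trapezoidal quadrature for Z(Θ) are illustrative assumptions.

    import numpy as np

    def peak_loglik(peaks, intensity, nu_grid):
        """Log-likelihood of detected peaks {p_j} under an inhomogeneous
        Poisson process. After cancellation of the Z(Theta) factors, the
        expression above reduces to -Z - log J! + sum_j log intensity(p_j)."""
        Z = np.trapz(intensity(nu_grid), nu_grid)   # normalising constant Z(Theta)
        J = len(peaks)
        log_fact = np.sum(np.log(np.arange(1, J + 1)))
        return -Z - log_fact + np.sum(np.log(intensity(np.asarray(peaks))))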
5.1 Example: inference in the frequency domain models

The frequency domain models provide a substantially faster likelihood calculation than the earlier time-domain models, allowing for rapid inference in the presence of significantly larger chords and tone complexes. Here we present example results for a tone complex containing many different notes, played on a pipe organ. Analysis is performed on a very short segment of 4096 data points, sampled at a rate of ω_s = 2π × 44100 rad s^{−1}, hence just under 0.1 s of data; see Fig. 19. From the score of the music we know that there are four notes simultaneously playing: C5, F♯5, B5, and D6, or MIDI note numbers 72, 78, 83 and 86. However, the mix is complicated by the addition of pipes one octave below and one or more octaves above the principal pitch, and hence we have at least 12 notes present in the complex: MIDI notes 60, 66, 71, 72, 74, 78, 83, 84, 86, 90, 95, and 98. Since the upper octaves share all of their partials with notes from one or more octaves below, it is not clear whether the models will be able to distinguish all of the sounds as separate notes. We run the frequency-domain models using the prior framework of Section 3.1 and a reversible jump MCMC scheme of the same form as that used in the previous transient analysis example. Firstly, using the Gaussian frequency domain model of section 5.0.1, the MCMC burn-in for the note number vector n = [n_1, n_2, ..., n_I] is shown in Fig. 20. This is a variable dimension vector under the reversible jump MCMC, and we can see notes entering or leaving the vector as iterations proceed. We can also see large moves of an octave (±12 notes) or a fifth (+7 or −5 notes), corresponding to specialised Metropolis-Hastings moves which centre their proposals on the octave or fifth as well as on the locality of the current note.

Figure 20: Evolution of the note number vector with iteration number, single chord data; Gaussian frequency domain model. (Axes: MIDI note number, middle C = 60, against MCMC iteration number.)
As is typical of these models, the MCMC becomes slow moving once converged to a good mode of the distribution, and further large moves occur only occasionally. There is a good case here for using adaptive or population MCMC schemes to improve the properties of the MCMC. Nevertheless, convergence is much faster than for the earlier proposed time domain models, particularly in terms of the model order sampling, which was here initialised at I = 1, i.e. one single note present at the start of the chain. Specialised independence proposals have also been devised, based on simple pitch estimation methods applied to the raw data. These are largely responsible for the initiation of new notes in the MCMC chain. In this instance the MCMC has correctly identified 7 out of the (at least) 12 possible pitches present in the music: 60, 66, 71, 72, 74, 78, 86. The remaining 5 unidentified pitches share all of their partials with lower pitches estimated by the algorithm, and hence it is reasonable that they remain unestimated. Examination of the discrete Fourier magnitude spectrum (Fig. 21) shows that the higher pitches (with the possible exception of n_7 = 83, whose harmonics are modelled by n_3 = 71) are generally buried at very low amplitude in the spectrum and can easily be absorbed into the model for pitches one or more octaves lower in pitch.
We can compare these results with those obtained using the Poisson model of section 5.0.2. The MCMC was run under identical conditions to the Gaussian model, and we plot the equivalent note index output in Fig. 22. Here we see that fewer notes are estimated, since the basic point process model takes no account of the amplitudes of the peaks in the spectrum, and hence is happy to assign all harmonics to the lowest possible fundamental pitch. The four predominant pitches estimated are the four lowest fundamentals: 60, 66, 71 and 74. The sampler is, however, generally more mobile, and we see a better and more rapid exploration of the posterior.
Figure 21: Discrete Fourier magnitude spectrum for the 12-note chord, amplitude against frequency (log scale). True note positions n_1 = 60, n_2 = 66, n_3 = 71, n_4 = 72, n_5 = 74, n_6 = 78, n_7 = 83, n_8 = 84, n_9 = 86, n_10 = 90, n_11 = 95, n_12 = 98 are marked with red pentagrams.
Figure 22: Evolution of the note number vector with iteration number, single chord data; Poisson frequency domain model. (Axes: MIDI note number, middle C = 60, against MCMC iteration number.)
5.2 Further prior structures for transform domain representations

In audio processing, the energy content of a signal is typically time-varying, hence it is natural to model audio as a process with a time-varying power spectral density on a time-frequency plane (Reyes-Gomez, Jojic, and Ellis 2005; Wolfe, Godsill, and Ng 2004; Févotte, Daudet, Godsill, and Torrésani 2006), and several prior structures have been proposed in the literature for modelling the expansion coefficients. The central idea is to choose a latent variance model varying over time and frequency bins:

  s_{ν,k} | Q_{ν,k} ∼ N(s_{ν,k}; 0, Q_{ν,k}),   Q_{ν,k} = q_{ν,k} I
In (Wolfe, Godsill, and Ng 2004), the following structure is proposed under the name Gabor regression:

  q_{ν,k} | r_{ν,k} ∼ [r_{ν,k} = on] IG(q_{ν,k}; a, b/a) + [r_{ν,k} = off] δ(q_{ν,k})

Moreover, the joint distribution over the latent indicators r = r_{0:W−1, 0:K−1} is taken as a pairwise Markov random field, where u denotes a double index u = (ν, k):

  p(r) ∝ ∏_{(u,u′)∈E} φ(r_u, r_{u′})
5.3 Gamma chains and fields

An alternative model is introduced in (Cemgil and Dikmen 2007; Cemgil, Peeling, Dikmen, and Godsill 2007), where a Markov random field is placed directly on the variance terms as

  p(q) = ∫ dλ p(q, λ)

using a so-called gamma field.
To understand the construction of a gamma field, it is instructive to look first at a chain, where we have an alternating sequence of gamma and inverse-gamma random variables:

  q_u | λ_u ∼ IG(q_u; a_q, a_q λ_u),   λ_{u+1} | q_u ∼ G(λ_{u+1}; a_λ, q_u / a_λ)
Note that this construction leads to conditionally conjugate Markov blankets, which are given as

  p(q_u | λ_u, λ_{u+1}) ∝ IG( q_u; a_q + a_λ, a_q λ_u + a_λ λ_{u+1} )
  p(λ_u | q_{u−1}, q_u) ∝ G( λ_u; a_λ + a_q, (a_λ q_{u−1}^{−1} + a_q q_u^{−1})^{−1} )
Moreover, it can be shown that any pair of variables q_i and q_j are positively correlated, while q_i and λ_k are negatively correlated. Note that this is a particular stochastic volatility model, useful for the characterisation of non-stationary behaviour observed in time series (Shepard 2005).
We can represent a chain by a graphical model with edge set E = {(u, u)} ∪ {(u, u+1)}. Considering the Markov structure of the chain, we define a gamma field p(q, λ) as a bipartite undirected graphical model consisting of the vertex set V = V_λ ∪ V_q, where the partitions V_λ and V_q denote the collections of variables λ and q that are conditionally distributed G and IG respectively. We define an edge set E, with an edge (u, u′) ∈ E such that λ_u ∈ V_λ and q_{u′} ∈ V_q, if the joint distribution admits the following factorisation:
  p(λ, q) ∝ ∏_{u∈V_λ} λ_u^{(Σ_{u′} a_{u,u′} − 1)} ∏_{u′∈V_q} q_{u′}^{−(Σ_u a_{u,u′} + 1)} ∏_{(u,u′)∈E} exp( −a_{u,u′} λ_u / q_{u′} )
Here, the shape parameters play the role of coupling strengths: when a_{u,u′} is large, adjacent nodes are strongly correlated. Given this construction, various signal models can be developed (Figure 23).
Figure 23: Possible model topologies for gamma fields. White and gray nodes correspond to V_q and V_λ nodes respectively. The horizontal and vertical axes correspond to frequency ν and frame index k. Each model describes how the prior variances are coupled as a function of time-frequency index. For example, the first model from the left corresponds to a source model with "spectral continuity": the energy content of a given frequency band changes only slowly. The second model is useful for modelling impulsive sources, where energy is concentrated in time but spread across frequencies.
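For intuition, the following sketch forward-simulates a single gamma chain under the parameterisation above; drawing the inverse-gamma variables as reciprocals of gamma draws is a standard identity, and the chosen shape values are illustrative only.

    import numpy as np

    def sample_gamma_chain(K, a_q=5.0, a_lam=5.0, lam0=1.0, seed=0):
        """Simulate q_u | lam_u ~ IG(a_q, a_q*lam_u) and
        lam_{u+1} | q_u ~ G(a_lam, q_u/a_lam). Larger shapes a_q, a_lam
        couple neighbours more strongly, giving slower variation."""
        rng = np.random.default_rng(seed)
        lam, q = lam0, np.empty(K)
        for u in range(K):
            # IG(a, b) draw via 1 / Gamma(shape=a, scale=1/b)
            q[u] = 1.0 / rng.gamma(a_q, 1.0 / (a_q * lam))
            lam = rng.gamma(a_lam, q[u] / a_lam)
        return q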
5.4 Models based on latent variance/intensity factorisation

The various Markov random field priors of the previous section introduced couplings between the latent variances q_{ν,k}. Yet another alternative is to decompose the latent variances as a product. Here, we define the following hierarchical model (see Fig. 25):
  s_{ν,k} ∼ N(s_{ν,k}; 0, q_{ν,k}),   q_{ν,k} = t_ν v_k   (9)
  t_ν ∼ IG(t_ν; a^t_ν, a^t_ν b^t_ν),   v_k ∼ IG(v_k; a^v_k, a^v_k b^v_k)
Such models are particularly useful for modelling acoustic instruments. Here, the t_ν variables can be interpreted as an average expected energy template as a function of frequency bin ν. At each time frame, this template is modulated by v_k to adjust the overall volume. An example representing a piano sound is given in Figure 24. Here, the template captures the harmonic structure of the pitch and the excitation characterises the time-varying energy.
Figure 24: (Left) The spectrogram |s_{ν,k}|² of a piano, shown as MDCT coefficients over frame τ and frequency ν. (Middle) Estimated template t_ν and excitation v_k using the conditionally Gaussian model defined in Eq. 9, where q_{ν,k} is the latent variance. (Right) Estimated template and excitation using the conditionally Poisson model defined in the next section (Eq. 15).
A simple factorial model, which uses the gamma chain prior models introduced in section 5.3, is constructed as follows:

  x_{ν,k} = Σ_i s_{ν,i,k},   s_{ν,i,k} ∼ N(s_{ν,i,k}; 0, q_{ν,i,k}),   Q = {q_{ν,i,k}} ∼ p(Q | Θ_t)   (10)
Figure 25: (Left) Latent variance/intensity models in product form (Eq. 9). Hyperparameters are not shown. (Right) Factorial version of the same model, used for polyphonic estimation as in section 5.5.3.

The computational advantage of this class of models is the conditional independence of the latent sources given the latent variance variables. Given the latent variances and the data, the posterior of the sources is a product of Gaussian distributions. In particular, the individual marginals are given in closed form as
  p(s_{ν,i,k} | X, Q) = N( s_{ν,i,k}; κ_{ν,i,k} x_{ν,k}, q_{ν,i,k}(1 − κ_{ν,i,k}) ),   κ_{ν,i,k} = q_{ν,i,k} / Σ_{i′} q_{ν,i′,k}
This means that if the latent variances can be estimated, source separation can be easily accomplished. The choice of prior structures on the latent variances p(Q|·) is key here.
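Given estimated variances, the separation step itself is a one-line Wiener-style gain. The sketch below assumes the mixture and the variances are stored as NumPy arrays of shapes (W, K) and (I, W, K) respectively.

    import numpy as np

    def separate(x, q):
        """Posterior mean source estimates s_hat[i] = kappa_{nu,i,k} * x_{nu,k},
        with responsibilities kappa = q_i / sum_i' q_i' (cf. the marginals above)."""
        kappa = q / q.sum(axis=0, keepdims=True)
        return kappa * x[None, :, :]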
Below we illustrate this approach in single channel source separation for transient/harmonic decomposition. Here, we assume that there are two sources, i = 1, 2. The prior variances of the first source (i = 1) are tied across time frames using a gamma chain of the form ∏_ν p(q_{ν,i=1,1:K}), which aims to model a source with harmonic continuity: this model simply assumes that for a given source the amount of energy in a frequency band stays roughly constant. The second source (i = 2) is tied across frequency bands and has the form ∏_k p(q_{1:W,i=2,k}); this model tries to capture impulsive/percussive structure (for example, compare the piano and conga examples in Fig. 4). The model aims to separate the sources based on harmonic continuity and impulsive structure.

We illustrate this approach on separating a piano sound into its constituent components and on drum separation. We assume that J = 2 components are generated independently by two gamma chain models with vertical and horizontal topology. In figure 26-(b), we observe that the model is able to separate transients and harmonic components. The sound files of these results can be downloaded and listened to at the following url: http://www-sigproc.eng.cam.ac.uk/~sjg/haba, which is perhaps the best way to assess the sound quality.
The variance/intensity factorisation models described in Eq. 9 also have straightforward factorial extensions:

  x_{ν,k} = Σ_i s_{ν,i,k},   s_{ν,i,k} ∼ N(s_{ν,i,k}; 0, q_{ν,i,k}),   q_{ν,i,k} = t_{ν,i} v_{i,k}   (11)
  T = {t_{ν,i}} ∼ p(T | Θ_t),   V = {v_{i,k}} ∼ p(V | Θ_v)   (12)
29
Time (τ)
Frequency Bin (ν)
X
org
S
hor
S
ver
Figure 26: Single channel Source Separation example, left to right, log-MDCT coefficients o f the
original signal and reconstruction with horizontal and vertical IGMRF models.
If we integrate out the latent sources, the marginal is given as

  x_{ν,k} ∼ N( x_{ν,k}; 0, Σ_i t_{ν,i} v_{i,k} )

Note that, as Σ_i t_{ν,i} v_{i,k} = [T V]_{ν,k}, the variance "field" Q is given compactly by the matrix product Q = T V. This closely resembles a matrix factorisation, which is used extensively in audio modelling. In the next section, we discuss models of this type.
5.5 Non-negative matrix factorisation models

Until now, we have described conditionally Gaussian models. Recently, a popular branch of the source separation and musical audio analysis literature has focused on non-negativity of the magnitude spectrogram X = {x_{ν,τ}}, with x_{ν,τ} ≡ ‖s_{ν,τ}‖_2^{1/2}, where s_{ν,τ} are expansion coefficients obtained from a time-frequency expansion. The basic idea is to represent a spectrogram by enforcing a factorisation X ≈ T V, where both T and V are matrices with positive entries (Smaragdis and Brown 2003; Abdallah and Plumbley 2006; Virtanen 2006; Kameoka 2007; Bertin, Badeau, and Richard 2007; Vincent, Bertin, and Badeau 2008). In music signal analysis, T can be interpreted as a codebook of templates, corresponding to spectral shapes of individual notes, and V is the matrix of activations, somewhat analogous to a musical score. Often, the following objective is minimised:

  (T, V)* = arg min_{T,V} D(X ‖ T V)   (13)
where D is the information (Kullback-Leibler) divergence, given by

  D(X ‖ Λ) = Σ_{ν,τ} ( x_{ν,τ} log(x_{ν,τ}/λ_{ν,τ}) − x_{ν,τ} + λ_{ν,τ} )   (14)
Using Jensen's inequality (Cover and Thomas 1991) and the concavity of log x, it can be shown that D(·) is nonnegative and D(X‖Λ) = 0 if and only if X = Λ. The objective in (13) could be minimised by any suitable optimisation algorithm. Lee and Seung (2000) have proposed an efficient variational bound minimisation algorithm with attractive convergence properties, which has since been successfully applied to various applications in signal analysis and source separation. Although not widely acknowledged, it can be shown that the minimisation algorithm is in fact an EM algorithm with data augmentation (Cemgil 2008). More precisely, it can be shown that minimising D w.r.t. T and V is equivalent to finding the ML solution of the following
hierarchical model:

  x_{ν,k} = Σ_i s_{ν,i,k},   s_{ν,i,k} ∼ PO(s_{ν,i,k}; λ_{ν,i,k}),   λ_{ν,i,k} = t_{ν,i} v_{i,k}   (15)
  t_{ν,i} ∼ G(t_{ν,i}; a^t_{ν,i}, b^t_{ν,i}/a^t_{ν,i}),   v_{i,k} ∼ G(v_{i,k}; a^v_{i,k}, b^v_{i,k}/a^v_{i,k})   (16)
The computational advantage of this model is the conditional independence of the latent sources given the intensity variables. In particular, we have

  p(s_{ν,i,k} | X, T, V) = BI( s_{ν,i,k}; x_{ν,k}, κ_{ν,i,k} ),   κ_{ν,i,k} = λ_{ν,i,k} / Σ_{i′} λ_{ν,i′,k}

This means that if the latent intensities can be estimated somehow, source separation can be easily accomplished, as E(s)_{BI(s;x,κ)} = κx.
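Before turning to the full Bayesian treatment, it may help to see the ML point estimation of Eq. 13 spelled out. The sketch below implements the divergence of Eq. 14 and the multiplicative updates of Lee and Seung (2000); the array shapes, initialisation and the small constants guarding divisions are our own assumptions.

    import numpy as np

    def kl_divergence(X, Lam, eps=1e-12):
        """Information divergence D(X || Lambda) of Eq. (14)."""
        return np.sum(X * np.log((X + eps) / (Lam + eps)) - X + Lam)

    def nmf_kl(X, I, n_iter=200, seed=0):
        """Multiplicative updates minimising D(X || T V); X is (W, K)."""
        rng = np.random.default_rng(seed)
        W, K = X.shape
        T, V = rng.random((W, I)) + 0.1, rng.random((I, K)) + 0.1
        for _ in range(n_iter):
            R = X / (T @ V + 1e-12)
            T *= (R @ V.T) / V.sum(axis=1)              # template update
            R = X / (T @ V + 1e-12)
            V *= (T.T @ R) / T.sum(axis=0)[:, None]     # excitation update
        return T, V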
5.5.1 Variational Bayes

It is also possible to estimate the marginal likelihood p(X) by integrating out all the templates and excitations. This can be done via Gibbs sampling or using a variational approach. The variational approach is very similar to the EM algorithm, with an additional approximation step. We sketch here the Variational Bayes (VB) method (Ghahramani and Beal 2000; Bishop 2006) to bound the marginal loglikelihood as

  L_X(Θ) ≡ log p(X|Θ) ≥ Σ_S ∫ d(T, V) q log [ p(X, S, T, V|Θ) / q ]   (17)
      = E(log p(X, S, V, T|Θ))_q + H[q] ≡ B_VB[q]   (18)
where q = q(S, T, V) is an instrumental distribution and H[q] is its entropy. The bound is tight for the exact posterior q(S, T, V) = p(S, T, V|X, Θ), but as this distribution is complex we assume a factorised form for the instrumental distribution, ignoring some of the couplings present in the exact posterior:

  q(S, T, V) = q(S) q(T) q(V) = ( ∏_{ν,τ} q(s_{ν,1:I,τ}) ) ( ∏_{ν,i} q(t_{ν,i}) ) ( ∏_{i,τ} q(v_{i,τ}) ) ≡ ∏_{α∈C} q_α
where α ∈ C = {{S}, {T}, {V}} denotes the set of disjoint clusters. Hence, we are no longer guaranteed to attain the exact marginal likelihood L_X(Θ). Yet the bound property is preserved, and the strategy of VB is to optimise the bound. Although the best q distribution respecting the factorisation is not available in closed form, it turns out that a local optimum can be attained by the following fixed point iteration:

  q_α^{(n+1)} ∝ exp( E(log p(X, S, T, V|Θ))_{q^{(n)}_{¬α}} )   (19)

where q_{¬α} = q/q_α. This iteration monotonically improves the individual factors of the q distribution, i.e. B[q^{(n)}] ≤ B[q^{(n+1)}] for n = 1, 2, ..., given an initialisation q^{(0)}. The order is not important for convergence; one could visit the blocks in arbitrary order. However, in general, the attained fixed point depends upon the order of the updates as well as the starting point q^{(0)}(·). This approach is computationally rather attractive and very easy to implement (Cemgil 2008).
5.5.2 Variational update equations and sufficient statistics

The expectations in E(log p(X, S, T, V|Θ)) are functions of the sufficient statistics of q. The fixed point iteration for the latent sources S (where m_{ν,τ} = 1) and the excitations V leads to the following:

  q(s_{ν,1:I,τ}) = M(s_{ν,1:I,τ}; x_{ν,τ}, p_{ν,1:I,τ}),   q(v_{i,τ}) = G( v_{i,τ}; α^v_{i,τ}, β^v_{i,τ} )   (20)
  p_{ν,i,τ} = exp( E(log t_{ν,i}) + E(log v_{i,τ}) ) / Σ_{i′} exp( E(log t_{ν,i′}) + E(log v_{i′,τ}) )   (21)
  α^v_{i,τ} = a^v_{i,τ} + Σ_ν m_{ν,τ} E(s_{ν,i,τ}),   β^v_{i,τ} = ( a^v_{i,τ}/b^v_{i,τ} + Σ_ν m_{ν,τ} E(t_{ν,i}) )^{−1}   (22)
The variational parameters of q(t_{ν,i}) = G( t_{ν,i}; α^t_{ν,i}, β^t_{ν,i} ) are found similarly. The hyperparameters can be optimised by maximising the variational bound. While this is not guaranteed to increase the true marginal likelihood, it leads in this application to quite practical and fast algorithms.
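A compact sketch of one sweep of the updates (20)-(22) follows, assuming a fully observed spectrogram (m_{ν,τ} = 1 everywhere). The stored statistics are the posterior means E(·) and the exponentiated expected logarithms exp E(log ·) of the gamma factors; shapes follow the conventions above (templates (W, I), excitations (I, K)).

    import numpy as np
    from scipy.special import digamma

    def vb_sweep(X, a_t, b_t, a_v, b_v, E_t, L_t, E_v, L_v):
        """One fixed-point sweep. E_t, E_v: posterior means; L_t, L_v: exp(E[log .])."""
        # Eq. (21): source responsibilities, then E(s) = p * x (multinomial mean)
        P = L_t[:, :, None] * L_v[None, :, :]
        P /= P.sum(axis=1, keepdims=True)
        E_s = P * X[:, None, :]
        # Eq. (22): q(v) = Gamma(alpha_v, beta_v) (scale parameterisation)
        alpha_v = a_v + E_s.sum(axis=0)
        beta_v = 1.0 / (a_v / b_v + E_t.sum(axis=0)[:, None])
        E_v, L_v = alpha_v * beta_v, np.exp(digamma(alpha_v)) * beta_v
        # symmetric update for the templates q(t)
        alpha_t = a_t + E_s.sum(axis=2)
        beta_t = 1.0 / (a_t / b_t + E_v.sum(axis=1)[None, :])
        E_t, L_t = alpha_t * beta_t, np.exp(digamma(alpha_t)) * beta_t
        return E_t, L_t, E_v, L_v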
5.5.3 Example: Polyphonic pitch estimation

In this section, we illustrate Bayesian NMF for polyphonic pitch detection. The approach consists of two stages:

1. Estimation of hyperparameters given a corpus of piano notes
2. Estimation of templates and excitations given new polyphonic data and fixed hyperparameters

In the first stage, we estimate the hyperparameters a^t_{ν,i} = a^t_i and b^t_{ν,i} (see Eq. 16) via maximisation of the variational bound given in Eq. 18. Here, the observations are matrices X_i, each a spectrogram computed for note i = 1...I. In figure 27, we show the estimated scale parameters b^t_{ν,i} as a function of frequency band ν and note index i. The harmonic structure of each note is clearly visible.
To test the approach, we synthesize a music piece (here, a short segment from the beginning of "Für Elise" by Beethoven), given a MIDI piano roll and recordings of isolated notes from a piano, by simply appropriately shifting each time series and adding. The piano roll and the spectrogram of the synthesized audio are shown in Figure 28. The pitch detection task is inferring the excitations given the hyperparameters and the spectrogram.

The results are shown in Figure 29. The top figure shows the excitations estimated given the prior of Eq. 16. The notes are visible here, but there are some artifacts. The middle figure shows results from a model where excitations are tied across time using a gamma chain as introduced in section 5.3. This prior is highly effective here and we are able to get a much clearer picture. The bottom figure displays results obtained from a real recording of "Für Elise", performed on electric guitar. Interestingly, whilst we are still using the hyperparameters estimated from a piano, the inferred excitations show significant overlap with the original score.
6 Probabilistic Models for Tempo, Rhythm, Meter
An important feature of music information retrieval and interactive performance systems is the capacity to infer attributes related to the temporal structure of audio signals. Detecting the pulse or foot-tapping rate of musical audio signals has been a focus of research in musical signal processing for several years (Cemgil 2004; Klapuri and Davy 2006). However, little attention has been paid to the task of extracting information about the more abstract musical concepts of tempo, rhythm and meter.
32
i/Key index
ν/Frequency index
Estimated Scale Parameters of the template prior
10 20 30 40 50 60 70 80
50
100
150
200
250
300
350
400
450
500
Figure 27: Estimated template hyperparameters b
t
ν,i
.
Figure 28: The ground-truth piano roll (note index against frame) and the log-|MDCT| coefficients (frequency bin ν against frame τ) of the polyphonic data.
Figure 29: Polyphonic pitch detection: estimated expected excitations. (Top) Uncoupled excitations. (Middle) Excitations tied across time using a gamma chain; ground truth shown in white. (Bottom) Excitations estimated from a guitar using the hyperparameters estimated from a piano; ground truth shown in black.
A hierarchical Bayesian approach is ideally suited to this task. In order to quantify the musical concepts of interest, we interpret tempo as being equivalent to the performed rate of quarter notes, and a rhythmic pattern as indicating the regions within a musical bar in which note onsets are likely to occur. In this section we summarise a modelling framework which permits inference of these quantities from observed MIDI onset events or raw audio samples (Whiteley, Cemgil, and Godsill 2006; Whiteley, Cemgil, and Godsill 2007).
6.1 A Bar-pointer Model

Central to the method is a dynamical model of a bar-pointer: a hypothetical, hidden, dynamical object which maps the positive real line to one period of a latent rhythmical pattern, i.e. one musical bar. Different rhythms are represented by rhythmic pattern functions which control the hyperparameters of a conjugate gamma-Poisson observation model. The trajectory of the bar-pointer modulates these functions onto the time line. Conditional upon this trajectory and the hyperparameter values, observed MIDI onset events are modelled as a non-homogeneous Poisson process, or raw audio samples are modelled as a Gaussian process. The inference task is then to estimate the state of the bar-pointer and the values of rhythmic pattern indicator variables. Both filtering and smoothing are of interest in different applications.
For a discrete time index k and a positive constant ∆, at time t_k = k∆ denote by φ_k the position of the bar-pointer, which takes values in [0, 1). Denote by φ̇_k its velocity, which takes values in [φ̇_min, φ̇_max], where φ̇_min > 0 and φ̇_max represents the maximum tempo to be considered. The motion of the bar-pointer is modelled as a piecewise constant velocity process:

  φ_{k+1} = (φ_k + ∆ φ̇_k) mod 1,
  p(φ̇_{k+1} | φ̇_k) ∝ N(φ̇_k, σ²_φ) × I[φ̇_min ≤ φ̇_{k+1} ≤ φ̇_max],

where I[x] takes the value 1 when x is true and zero otherwise. The velocity of the bar-pointer is defined to be proportional to tempo.
At each time index k, a rhythmic pattern indicator R_k takes a value in a finite set, for example S = {0, 1}. The elements of the set S correspond to different rhythmic patterns, which are described in further detail below, with examples given in figure 30. For now we deal with the simple case in which there are only two such patterns, and switching between values of R_k is modelled as occurring only if a bar line is crossed, i.e.:
if φ_k < φ_{k−1},

  Pr(R_k = s | r_{k−1}, φ_k, φ_{k−1}) = p_r for s ≠ r_{k−1}, and 1 − p_r for s = r_{k−1};

otherwise, R_k = r_{k−1}. Here p_r is the probability of a change in rhythmic pattern. In summary,
x_k ≡ [φ_k, φ̇_k, r_k]^⊤ specifies the state of the system at time index k. We remark that the dynamical model can be extended to include the musical notion of meter; see (Whiteley, Cemgil, and Godsill 2006) for details.
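The generative story is easy to simulate; the sketch below draws a trajectory of x_k = [φ_k, φ̇_k, r_k], using rejection sampling for the truncated Gaussian velocity step (all parameter values are arbitrary illustrations).

    import numpy as np

    def simulate_bar_pointer(K, delta=0.02, v_min=0.05, v_max=1.0,
                             sigma=0.01, p_r=0.1, n_patterns=2, seed=1):
        """Forward-simulate the bar-pointer state x_k = [phi_k, phidot_k, r_k]."""
        rng = np.random.default_rng(seed)
        phi, v, r, traj = 0.0, 0.5 * (v_min + v_max), 0, []
        for _ in range(K):
            phi_new = (phi + delta * v) % 1.0
            if phi_new < phi and rng.random() < p_r:   # bar line crossed
                r = (r + rng.integers(1, n_patterns)) % n_patterns
            while True:                                # truncated Gaussian walk
                v_new = v + sigma * rng.standard_normal()
                if v_min <= v_new <= v_max:
                    break
            phi, v = phi_new, v_new
            traj.append((phi, v, r))
        return traj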
6.2 Poisson Observation Model

MIDI onset events are treated as Poisson distributed with an intensity parameter which is conditioned on the position of the bar-pointer and the rhythm indicator variable. Defining the Poisson intensity in this fashion allows quantification of the postulate that, for a given rhythm, there are regions in one bar in which onsets occur with high probability.

Each rhythmic pattern function, µ_r : [0, 1) → R_+, maps the position of the bar-pointer φ_k to the mean of a gamma prior distribution on an intensity parameter λ_k. The value of µ_r(φ_k),
combined with a constant variance Q_λ, determines the shape and rate parameters of the gamma distribution:

  a_r(φ_k) = µ_r(φ_k)² / Q_λ,
  b_r(φ_k) = µ_r(φ_k) / Q_λ.
For brevity, denote a_k ≡ a_r(φ_k) and b_k ≡ b_r(φ_k). Then, conditional on φ_k and r_k, the prior density over λ_k is:

  p(λ_k | φ_k, r_k) = λ_k^{a_k − 1} b_k^{a_k} exp(−b_k λ_k) / Γ(a_k)   for λ_k ≥ 0,

and zero for λ_k < 0. This combination of prior distributions provides robustness against variation in the data.
Let y_k denote the number of onset events observed in the kth non-overlapping frame of length ∆, centred at time t_k. The number y_k is modelled as Poisson distributed:

  p(y_k | λ_k) = (λ_k ∆)^{y_k} exp(−λ_k ∆) / y_k!
Figure 30: Examples of rhythmic pattern functions µ, plotted over one bar (φ ∈ [0, 1)), each corresponding to a different value of r_k. Top: a bar of duplets in 4/4 meter; middle: a bar of triplets in 4/4 meter; bottom: 2 against 3 polyrhythm. The widths of the peaks model arpeggiation of chords and expressive performance. Construction in terms of splines permits flat regions between peaks, corresponding to an onset event 'noise floor'.
Inference for the intensity λ_k is not required, so it is integrated out. This may be done analytically, yielding:

  p(y_k | φ_k, r_k) = ∫_0^∞ p(y_k | λ_k) p(λ_k | φ_k, r_k) dλ_k = ∆^{y_k} b_k^{a_k} Γ(a_k + y_k) / ( y_k! Γ(a_k) (b_k + ∆)^{a_k + y_k} ).
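This marginal is conveniently evaluated in the log domain; a minimal sketch (variable names are our own):

    import numpy as np
    from scipy.special import gammaln

    def log_p_count(y, mu, Q_lam, delta):
        """log p(y_k | phi_k, r_k) with the intensity integrated out;
        a = mu^2/Q_lambda and b = mu/Q_lambda follow the text."""
        a, b = mu**2 / Q_lam, mu / Q_lam
        return (y * np.log(delta) + a * np.log(b) + gammaln(a + y)
                - gammaln(y + 1.0) - gammaln(a) - (a + y) * np.log(b + delta))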
6.3 Gaussian Observation Model

For every time index k, denote by z_k a vector of ν samples constituting the kth non-overlapping frame of a raw audio signal. The time interval ∆ is then given by ∆ = ν/f_s, where f_s is the sampling rate. The samples are modelled as independent with a zero-mean Gaussian distribution:

  p(z_k | σ²_k) = (2π σ²_k)^{−ν/2} exp( −z_k^⊤ z_k / (2σ²_k) ).
An inverse-gamma distribution is placed on the variance σ²_k. The shape and scale parameters of this distribution, denoted by c_k and d_k respectively, are determined by the location of the bar-pointer φ_k and the rhythmic pattern indicator variable r_k, again via a rhythmic pattern function µ_k:

  p(σ²_k | φ_k, r_k) = d_k^{c_k} exp(−d_k/σ²_k) σ_k^{−2(c_k + 1)} / Γ(c_k),
  c_k = µ²_k / Q_s + 2,
  d_k = µ_k ( µ²_k / Q_s + 1 ),

where Q_s is the variance of the inverse-gamma distribution and is chosen to be constant.
The variance of the Gaussian distribution, σ²_k, may be integrated out analytically to yield:

  p(z_k | φ_k, r_k) = [ d_k^{c_k} Γ(c_k + ν/2) / ( (2π)^{ν/2} Γ(c_k) ) ] ( z_k^⊤ z_k / 2 + d_k )^{−(c_k + ν/2)}.
6.4 Results

Of practical interest are the posterior filtering and smoothing distributions, e.g. for some data record of length K frames, p(φ_{1:K}, φ̇_{1:K}, r_{1:K} | y_{1:K}) and its marginals.
The performance of this model is demonstrated on a smoothing task for an excerpt of a MIDI performance of 'Michelle' by the Beatles, using the Poisson observation model (results for the Gaussian observation model are reported in (Whiteley, Cemgil, and Godsill 2006)). This demonstrates the joint tempo-tracking and rhythm recognition capability of the method. The performance, by a professional pianist, was recorded using a Yamaha Disklavier C3 Pro grand piano. The top two rhythmic patterns in figure 30 were employed. For purposes of exposition, the state space of the bar-pointer was uniformly discretised to M = 1000 position and N = 20 velocity points. The discretised position and velocity of the bar-pointer are denoted by m_k and n_k respectively, with the discretised dynamical model being that, with probability 0.99, n_k = n_{k−1}, and that otherwise, within the allowed range, n_k is incremented or decremented with equal probability. Uniform initial prior distributions were set on m_k, n_k and r_k. The time frame length was set to ∆ = 0.02 s, corresponding to the range of tempi 12–240 quarter notes per minute. The probability of a change in rhythmic pattern was set to p_r = 0.1. The variance of the gamma distribution was set to Q_λ = 10.
This section of 'Michelle' presents a significant challenge for tempo tracking because of the triplets, each of which by definition has a duration of 3/2 quarter notes. A performance of this excerpt could be wrongly interpreted as having a local change in tempo in the second bar, when in fact the rate of quarter notes remains constant; the bar of triplets is just a change in rhythm. For the discretised model, exact inference is possible (sequential Monte Carlo methods for the non-discretised model are investigated in (Whiteley, Cemgil, and Godsill 2007)). In figure 31, the strong diagonal stripes in the image of the posterior smoothing distributions for m_k correspond to the maximum a posteriori (MAP) trajectory of the bar-pointer. The change to a triplet rhythm in the second bar and the subsequent reversion to duplet rhythm are correctly identified. The MAP tempo is given by the darkest stripe in the image of the velocity log-smoothing distribution; it is roughly constant throughout.
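For the discretised state space, such exact inference is a standard HMM forward(-backward) recursion. The generic log-domain sketch below assumes the transition matrix (large but sparse in this model; treated densely here for brevity) and the per-frame observation log-likelihoods have been tabulated.

    import numpy as np

    def forward_filter(log_lik, log_trans):
        """HMM forward recursion in the log domain.
        log_lik: (K, S) observation log-probabilities over S discrete states;
        log_trans: (S, S) log transition matrix of the discretised bar-pointer."""
        K, S = log_lik.shape
        log_alpha = np.empty((K, S))
        log_alpha[0] = log_lik[0] - np.log(S)        # uniform initial prior
        for k in range(1, K):
            m = log_alpha[k - 1].max()
            pred = np.exp(log_alpha[k - 1] - m) @ np.exp(log_trans)
            log_alpha[k] = log_lik[k] + np.log(pred) + m
        return log_alpha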
37
0 50 100 150 200 250 300 350 400 450
0
1
2
y
k
Observed Data
m
k
log p(m
k
|y
1:K
)
50 100 150 200 250 300 350 400 450
800
600
400
200
−10
−5
0
Quarter notes per min.
log p(n
k
|y
1:K
)
50 100 150 200 250 300 350 400 450
180
120
60
−10
−5
0
p(r
k
|y
1:K
)
Frame Index, k
50 100 150 200 250 300 350 400 450
Triplets
Duplets
0.2
0.4
0.6
0.8
Figure 31: Results for jo int tempo tracking and rhythmic pattern recognition on a MIDI performance
of ‘Michelle’ by the Beatles. The top figure is the score which the pianist was given to play. Each
image consists of smoothing marginal distributions for each frame index.
7 Discussion and Conclusions

In this review, we have sketched Bayesian methods for the analysis of audio signals. The Bayesian models exhibit complex statistical structure, and in practice highly adaptive and powerful computational techniques are needed to perform inference.

We have reviewed some of these statistical models, including generalised linear models, dynamic systems and other hierarchical models, and described how various problems in audio and music processing can be cast as Bayesian posterior inference. We have also illustrated inference methods based on Monte Carlo simulation and on deterministic techniques (such as mean field and variational Bayes) originating in statistical physics, to tackle the computational problems posed by inference in these models. We described models in both the time domain and transform domains, the latter typically offering greater computational tractability and modelling flexibility at the expense of some accuracy in the models.

The Bayesian approach has two key advantages over more traditional engineering solutions: it provides both a suite of tools for model construction and a framework for algorithm development. Apart from the pedagogical advantages (such as highlighting algorithmic similarities, convergence characteristics and computational requirements), the framework facilitates the development of sophisticated models and the automation of code generation procedures. We believe that the field of computer hearing, which is still in its infancy compared to topics such as computer vision and speech recognition, has great potential for advancement in coming years, with the advent of powerful Bayesian inference methodologies and accompanying computational power increases.
A Review of sinusoidal models

This appendix introduces several low-level, deterministic representations of audio signals and highlights the links between various representations, such as damped sinusoids, state space and Gabor representations. In the main text, these representations are used to formulate statistical models for observed data, given the values of latent parameters. These deterministic representations are highly structured, and the statistical models into which they are assimilated therefore exhibit rich temporal and spectral properties. Furthermore, the specific parameterisations of these signal models allow their statistical counterparts to be coupled to higher-level model components, such as change-point processes and model-order indicators.
A.1 Static Models

Sound signals are emitted by vibrating objects that can be modelled as a cascade of second order systems. Hence, it is convenient to represent them as a sum of sinusoids plus a transient non-periodic component (see e.g. (McAulay and Quatieri 1986; Serra and Smith 1991; Rodet 1998; Irizarry 2002; Parra and Jain 2001; Davy and Godsill 2003)). We will start our exposition with a deterministic and static model, which we later generalise in several directions. For a time series y_n of length N, with n = 0...N−1, the (deterministic) discrete time sinusoidal model can be written as

  y_n = Σ_{ν=0}^{W−1} α_ν e^{−γ_ν n} cos(ω_ν n + φ_ν)   (23)
Here, ν = 0...W−1 denotes the sinusoidal index and W is the total number of sinusoidal components. The parameters of this model are all real numbers: the amplitudes α_ν, the log-damping coefficients γ_ν > 0, the frequencies ω_ν and the phases φ_ν. Using the fact that cos(θ) = (e^{jθ} + e^{−jθ})/2, we can write

  y_n = Σ_{ν=0}^{W−1} ( c_ν z_ν^n + c_ν^* z_ν^{*n} )   (24)
where c_ν is the complex amplitude c_ν = (α_ν/2) e^{jφ_ν} and z_ν is the complex pole z_ν = e^{−γ_ν + jω_ν}.
It is straightforward to write the model in matrix notation as

  ȳ ≡ (y_0, y_1, ..., y_{N−1})^⊤ = F_{0:N−1} c,   c ≡ (c_0, c_0^*, ..., c_{W−1}, c_{W−1}^*)^⊤

where F_{0:N−1} = F_{0:N−1}(γ, ω) is N × 2W, with (n+1)th row F_n = ( z_0^n, z_0^{*n}, ..., z_{W−1}^n, z_{W−1}^{*n} ); each column is a damped complex exponential. It is possible to estimate all parameters using subspace techniques (Laroche 1993; Badeau, Boyer, and David 2002).
Many popular models appear as special cases of the above model, and can be obtained by tying various parameters. The harmonic model, with fundamental frequency ω and log-damping coefficient γ, is defined through (23) by setting

  ω_ν = νω,   γ_ν = νγ   (25)
This model is particularly useful for pitched musical instruments, which tend to create oscillations with modes that are roughly related by integer ratios. It is possible to leave the damping coefficients free, or to tie them as above to model the fact that higher frequencies are damped faster. A related model is the inharmonic model, which assumes

  ω_ν = B(ν, ω)   (26)

where B is a function that "stretches" the harmonic frequencies. Stretching of harmonics is a phenomenon observed especially in piano strings.
It is important to note that the inverse Fourier transform is obtained as a special case of the harmonic model by ignoring the damping (γ_ν = γ = 0) and choosing the frequencies on a uniform grid, ω_ν = 2πν/N for ν = 0...W−1 with W = N/2. We call all these models and their variations static models.
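A few lines suffice to synthesise realisations of the static model; the example below generates a harmonically tied tone in the spirit of Eq. 25 (all parameter values are arbitrary illustrations).

    import numpy as np

    def synth(N, amps, gammas, omegas, phis):
        """Synthesise Eq. (23): y_n = sum_nu alpha_nu e^{-gamma_nu n} cos(omega_nu n + phi_nu)."""
        n = np.arange(N)[:, None]
        return np.sum(amps * np.exp(-gammas * n) * np.cos(omegas * n + phis), axis=1)

    # harmonic tying as in Eq. (25): omega_nu = nu*omega, gamma_nu = nu*gamma
    W, omega, gamma = 5, 0.2, 0.0005
    nu = np.arange(1, W + 1)
    y = synth(4096, amps=1.0 / nu, gammas=gamma * nu,
              omegas=omega * nu, phis=np.zeros(W))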
A.2 State Space Representations and Dynamic Models

An alternative, dynamic representation exploits the rotational invariance property:

  y_{n:n+L−1} ≡ (y_n, y_{n+1}, ..., y_{n+L−1})^⊤ = F_{0:L−1} diag(Z_0, Z_1, ..., Z_{W−1})^n c   (27)
with Z_ν = Z_ν(γ_ν, ω_ν) ≡ diag(z_ν, z_ν^*) and c ≡ (c_0, c_0^*, ..., c_{W−1}, c_{W−1}^*)^⊤. By appropriately rotating the complex amplitudes, the basis matrix stays shift invariant in time. This representation allows us to write a model for arbitrary frame lengths and to "interpolate" between dynamic and static interpretations. For example, given a suitable frame length L, we can reorganise the time series into N/L non-overlapping frames (assuming L divides N). Then, using (27), we can write
  ȳ_k ≡ (y_{kL}, ..., y_{(k+1)L−1})^⊤ = F_{0:L−1} diag(Z_0^L, Z_1^L, ..., Z_{W−1}^L)^k c

Here, k denotes a frame index such that k = 0...N/L − 1. One can easily envision models where subsequent frames are of possibly different lengths L_k such that Σ_k L_k = N. Of course,
mathematically speaking, we have not really gained anything, and in the noiseless case all formulations for any frame length L are equivalent. However, the dynamical system perspective opens up interesting possibilities to extend the basic model in ways that are not apparent in the static formulation. In particular, we can write the following state space parametrisation:
  s^d_0 = (c_0, c_0^*, ..., c_{W−1}, c_{W−1}^*)^⊤ = c
  s^d_{k+1} = A^d s^d_k
  ȳ_k = F_{0:L−1} s^d_k
where A^d ≡ diag(Z_0^L, Z_1^L, ..., Z_{W−1}^L) is a 2W × 2W diagonal matrix with poles appearing in conjugate pairs, and F_{0:L−1} is L × 2W, as defined in (27). This representation is known as the diagonal realisation. It is well known that the representation of ȳ_k is not unique; for any invertible transform matrix T, we could define s_k = T s^d_k:

  T s^d_{k+1} = T A^d T^{−1} T s^d_k  ⟺  s_{k+1} = A s_k   (28)
  ȳ_k = F_{0:L−1} T^{−1} T s^d_k  ⟺  ȳ_k = C_{0:L−1} s_k   (29)
with new state evolution and observation matrices A ≡ T A^d T^{−1} and C_{0:L−1} ≡ F_{0:L−1} T^{−1}. In particular, to avoid complex arithmetic, one can apply the following block diagonal transform:

  T = diag(T, T, ..., T),   T ≡ (1/√2) ( 1  1 ; j  −j ),   T T^H = T^H T = I   (30)
This renders the (n+1)th row of C_{0:L−1} as

  C_n = F_n T^H   (31)
      = √2 ( e^{−nγ_0} cos(nω_0)  e^{−nγ_0} sin(nω_0)  ...  e^{−nγ_{W−1}} cos(nω_{W−1})  e^{−nγ_{W−1}} sin(nω_{W−1}) )   (32)
Further, we obtain

  A = diag(A_0, ..., A_ν, ..., A_{W−1})   (33)
  A_ν = A_ν(γ_ν, ω_ν) = T Z_ν^L T^H = e^{−Lγ_ν} ( cos(Lω_ν)  −sin(Lω_ν) ; sin(Lω_ν)  cos(Lω_ν) )^⊤
  s_0 = ( s_{0,0}^⊤ ... s_{0,ν}^⊤ ... s_{0,W−1}^⊤ )^⊤   (34)
  s_{0,ν} = T ( c_ν  c_ν^* )^⊤ = √2 ( Re{c_ν}  −Im{c_ν} )^⊤
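The real-valued state space matrices are simple to assemble; the following sketch builds one 2×2 rotation block of Eq. 33 and an observation row of Eq. 32.

    import numpy as np

    def A_block(gamma, omega, L):
        """A_nu of Eq. (33): a damped planar rotation advancing L samples."""
        c, s = np.cos(L * omega), np.sin(L * omega)
        return np.exp(-L * gamma) * np.array([[c, -s], [s, c]]).T

    def C_row(gammas, omegas, n):
        """C_n of Eq. (32): real observation row for sample n within a frame."""
        return np.sqrt(2.0) * np.concatenate(
            [np.exp(-n * g) * np.array([np.cos(n * w), np.sin(n * w)])
             for g, w in zip(gammas, omegas)])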
The harmonic and inharmonic models of the previous section can be written in state space form by constraining the poles to have related frequencies. For example, the harmonic model with damping coefficient γ and fundamental frequency ω can be written as

  A_ν = A_ν(γ_ν, ω_ν) = T Z(γ, ω)^{Lν} T^H = e^{−Lγν} ( cos(Lων)  −sin(Lων) ; sin(Lων)  cos(Lων) )^⊤
A.3 Nonstationarity and Time-Frequency Representations

The sinusoidal model, as we have introduced it in the previous section, is rather restrictive for modelling long sequences and realisations of nonstationary processes, as the poles of the model are taken to be fixed. One possibility is to assume that the poles are time varying, taking

  A_{ν,k} = A_ν(γ_{ν,k}, ω_{ν,k}),   C_k = C(γ_{0:W−1,k}, ω_{0:W−1,k})

However, the associated estimation problem is highly nonlinear. The inverse problem is also ill-posed, as one can generate any signal by arbitrarily modulating the pole frequencies and damping coefficients. Hence, an appropriate regulariser, such as a random walk on the pole frequencies, needs to be assumed. In certain cases, such as when the harmonic model is used, the pole frequencies can be discretised, and deterministic grid-based inference techniques (Tabrikian, Dubnov, and Dikalov 2004; Cemgil, Kappen, and Barber 2004) or more powerful sequential Monte Carlo techniques can be employed (Dubois and Davy 2007).
An arguably simpler approach is to use a basis that is localised both in time and frequency. Such representations are known as Gabor representations. In the Gabor representation (Mallat 1999; Wolfe, Godsill, and Ng 2004), a real-valued time series is represented on a W × K time-frequency lattice by

  y_n = Σ_{τ=0}^{K−1} Σ_{ν=0}^{W−1} ( c_{ν,τ} h_{n,τ} z_ν^n + c_{ν,τ}^* h_{n,τ} z_ν^{*n} )
where ν = 0...W−1 is the frequency band index and τ = 0...K−1 is the time frame index, and where the poles are z_ν = z_ν(ω) = e^{jων} with ω = 2π/W. The coefficients h_{n,τ} are fixed and are determined from a prototype window function h(n) as

  h_{n,τ} = h(n − τN/K)

The real-valued and non-negative window function is typically chosen as a symmetric bell-shaped curve with compact support (though the original paper by Gabor assumed Gaussian functions with infinite support), to give a suitable time-frequency localisation. The ratio N/K denotes the effective number of samples by which the window is shifted from frame to frame. The ratio KW/N is the oversampling rate and gives an indication of the redundancy of the representation.
It is informative to write the model in matrix form,

  ȳ = G c

where the expansion coefficients are given as

  c = ( c_{0,0}  c_{0,0}^*  ...  c_{1,τ}  ...  c_{ν,1}  ...  c_{W−1,K−1}^* )^⊤
and G is given as an N × 2WK matrix whose (n+1)th row is

  ( g_{0,0}(n)  g_{0,0}^*(n)  ...  g_{0,τ}(n)  ...  g_{ν,0}(n)  ...  g_{W−1,K−1}^*(n) ),   n = 0, ..., N−1   (35)

Here, each column is a basis vector: a modulated and translated version of the window function, g_{ν,τ}(n) = h_{n,τ} z_ν^n. To avoid complex arithmetic, the block diagonal transformation in (30) can be employed as ȳ = G T^{−1} T c = G̃ s, to obtain real matrices and expansion coefficients, as in (32) and (34). For certain choices of the time-frequency lattice parameters with KW = N, the matrix G can be rendered square and orthonormal, G^H G = I. One such choice is to take the window h rectangular with a support of L = N/K samples, such that subsequent windows do not overlap in time, and to use W = L frequency bins. This is equivalent to modelling the sequence as independent time frames. Many other techniques exist to obtain orthogonality with other types of window and basis functions, such as lapped orthogonal transforms or modified discrete cosine transforms, and it is beyond our purpose to review this rather broad literature.
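As a concrete special case, the critically sampled lattice just described (rectangular non-overlapping windows, W = L = N/K) reduces to independent DFTs of the frames; a minimal sketch, with a 1/√L scaling assumed to make the transform unitary:

    import numpy as np

    def gabor_rectangular(y, K):
        """Expansion coefficients c_{nu,tau} on a W x K lattice for the
        orthonormal special case: rectangular windows of L = N//K samples,
        no overlap, W = L frequency bins."""
        L = len(y) // K
        frames = y[: L * K].reshape(K, L)                  # row tau holds frame tau
        return np.fft.fft(frames, axis=1).T / np.sqrt(L)   # shape (W, K)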
The Gabor representation is in a way a generalisation of the sinusoidal model in Eq. (24), which represents the signal via an exponentially decaying window function e^{−γn}, using a single sinusoid per frequency band. In contrast, the Gabor representation chooses a new expansion coefficient at each time frame and each frequency band; hence it does not enforce phase continuity, and it allows for smooth amplitude variations. One potential shortcoming of the Gabor representation is that it is not very powerful in representing abrupt changes in the signal characteristics, such as changepoints or frequency modulations.
References
Abdallah, S. A. and M. D. Plumbley (2006, January). Unsupervised analysis of polyphonic music using sparse coding. IEEE Transactions on Neural Networks 17(1), 179–196.
Badeau, R., R. Boyer, and B. David (2002, September). EDS parametric modelling and tracking of audio signals. In DAFx-02, Hamburg, Germany.
Bertin, N., R. Badeau, and G. Richard (2007). Blind signal decompositions for automatic transcription of polyphonic music: NMF and K-SVD on the benchmark. In Proc. of International Conference on Acoustics, Speech and Signal Processing (ICASSP), Honolulu.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Cemgil, A. T. (2004). Bayesian Music Transcription. Ph. D. thesis, Radboud University of Nijmegen.
Cemgil, A. T. (2007). Strategies for sequential inference in factorial switching state space models. In Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP 07), Honolulu, Hawaii, pp. 513–516.
Cemgil, A. T. (2008, July). Bayesian inference in non-negative matrix factorisation models. Technical Report CUED/F-INFENG/TR.609, University of Cambridge.
Cemgil, A. T. and O. Dikmen (2007, September). Conjugate gamma Markov random fields for modelling nonstationary sources. In ICA 2007, 7th International Conference on Independent Component Analysis and Signal Separation.
Cemgil, A. T., S. J. Godsill, and C. Févotte (2007). Variational and stochastic inference for Bayesian source separation. Digital Signal Processing 17.
Cemgil, A. T., H. J. Kappen, and D. Barber (2004). A generative model for music transcription. Accepted to IEEE Transactions on Speech and Audio Processing.
Cemgil, A. T., P. Peeling, O. Dikmen, and S. J. Godsill (2007, October). Prior structures for time-frequency energy distributions. In Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics.
Cover, T. M. and J. A. Thomas (1991). Elements of Information Theory. New York: John
Wiley & Sons, Inc.
Davy, M., S. Godsill, and J. Idier (2006, April). Bayesian analysis of polyphonic western tonal
music. Journal of the Acoustical Society of America 119 (4).
Davy, M. and S. J. Godsill (2002). Detection of abrupt spectral changes using support vector machines: an application to audio signal segmentation. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing.
Davy, M. and S. J. Godsill (2003). Bayesian harmonic models for musical signal analysis. In Bayesian Statistics 7.
Dubois, C. and M. Davy (2007, May). Joint detection and tracking of time-varying harmonic components: a flexible Bayesian approach. IEEE Transactions on Audio, Speech and Language Processing 15(4), 1283–1295.
Fearnhead, P. (2003). Exact and efficient Bayesian inference for multiple changepoint problems. Technical report, Dept. of Math. and Stat., Lancaster University.
Févotte, C., L. Daudet, S. J. Godsill, and B. Torrésani (2006, May). Sparse regression with structured priors: Application to audio denoising. In Proc. ICASSP, Toulouse, France.
Févotte, C. and S. Godsill (2006). A Bayesian approach for blind separation of sparse sources. IEEE Trans. on Speech and Audio Processing.
Fletcher, N. H. and T. Rossing (1998). The Physics of Musical Instruments. Springer.
Ghahramani, Z. and M. Beal (2000). Propagation algorithms for variational Bayesian learn-
ing. In Neural Information Processing Systems 13.
Godsill, S. (2004). Computational modeling of musical signals. Chance Magazine (American Statistical Association) 17(4).
Godsill, S. and M. Davy (2005, October). Bayesian computational models for inharmonicity
in musical instruments. In Proc. of IEEE Workshop on Applications of Signal Processing
to Audio and Acoustics, New Paltz, NY.
Godsill, S. J. and M. Davy (2002). Bayesian harmonic models for musical pitch estimation
and analysis. In Proc. IEEE International Conference on Acoustics, Speech and Signal
Processing.
Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82(4), 711–732.
Grimmett, G. and D. Stirzaker (2001). Probability and Random Processes (Third Edition
ed.). Oxford University Press.
Hamacher, V., J. Chalupper, J. Eggers, E. Fischer, U. Kornagel, H. Puder, and U. Rass
(2005). Signal processing in high-end hearing aids: state of the art, challenges, and future
trends. EURASIP J. Appl. Signal Process. 2005(1), 2915–2929.
Irizarry, R. A. (2002). Weighted estimation of harmonic components in a musical sound signal.
Journal of Time Series Analysis 23.
Kameoka, H. (2007). Statistical Approach to Multipitch Analysis. Ph. D. thesis, University of
Tokyo.
Klapuri, A. and M. Davy (Eds.) (2006). Signal Processing Methods for Music Transcription.
New York: Springer.
Knuth, K. H. (1998, July). Bayesian source separation and localization. In SPIE'98: Bayesian Inference for Inverse Problems, San Diego, pp. 147–158.
Laroche, J. (1993). Use of the matrix pencil method for the spectrum analysis of musical
signals. Journal of Acoustical Society of America 94, 1958–1965.
Lee, D. D. and H. S. Seung (2000). Algorithms for non-negative matrix factorization. In
NIPS, pp. 556–562.
Mallat, S. (1999). A Wavelet Tour of Signal Processing. Academic Press.
McAulay, R. J. and T. F. Quatieri (1986). Speech analysis/synthesis based on a sinusoidal representation. IEEE Transactions on Acoustics, Speech, and Signal Processing 34(4), 744–754.
McIntyre, M. E., R. T. Schumacher, and J. Woodhouse (1983). On the oscillations of musical instruments. J. Acoustical Society of America 74, 1325–1345.
Miskin, J. and D. Mackay (2001). Ensemble learning for blind source separation. In S. J. Roberts and R. M. Everson (Eds.), Independent Component Analysis, pp. 209–233. Cambridge University Press.
Mohammad-Djafari, A. (1997, July). A Bayesian estimation method for detection, localisation and estimation of superposed sources in remote sensing. In SPIE'97, San Diego.
Parra, L. and U. Jain (2001). Approximate Kalman filtering for the harmonic plus noise
model. In Proc. of IEEE WASPAA, New Paltz.
Peeling, P. H., C. Li, and S. J. Godsill (2007, April). Poisson point process modeling for
polyphonic music transcription. Journal of the Acoustical Society of America Express
Letters 121(4), EL168–EL175. Reused with permission from Paul Peeling, The Journal
of the Acoustical Society of America, 121, EL168 (2007). Copyright 2007, Acoustical
Society of America.
Reyes-Gomez, M., N. Jojic, and D. Ellis (2005). Deformable spectrograms. In AI and Statistics Conference, Barbados.
Rodet, X. (1998). Musical sound signals analysis/synthesis: Sinusoidal + residual and elementary waveform models. Applied Signal Processing.
Rowe, D. B. (2003). Multivariate Bayesian Statistics: Models for Source Separation and Signal Unmixing. Chapman & Hall/CRC.
Serra, X. and J. O. Smith (1991). Spectral modeling synthesis: A sound analysis/synthesis system based on deterministic plus stochastic decomposition. Computer Music Journal 14(4), 12–24.
Shepard, N. (Ed.) (2005). Stochastic Volatility, Selected Readings. Oxford University Press.
Smaragdis, P. and J. Brown (2003). Non-negative matrix factorization for polyphonic music transcription. In WASPAA, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics.
Tabrikian, J., S. Dubnov, and Y. Dikalov (2004, January). Maximum a-posteriori probability pitch tracking in noisy environments using harmonic model. IEEE Transactions on Speech and Audio Processing 12(1), 76–87.
Vincent, E., N. Bertin, and R. Badeau (2008). Harmonic and inharmonic nonnegative matrix factorization for polyphonic pitch transcription. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Las Vegas. IEEE.
Virtanen, T. (2006, November). Sound Source Separation in Monaural Music Signals. Ph. D. thesis, Tampere University of Technology.
Walmsley, P., S. J. Godsill, and P. J. W. Rayner (1998, September). Multidimensional optimisation of harmonic signals. In Proc. European Conference on Signal Processing.
Walmsley, P. J., S. J. Godsill, and P. J. W. Rayner (1999, October). Polyphonic pitch tracking using joint Bayesian estimation of multiple frame parameters. In Proc. IEEE Workshop on Audio and Acoustics, Mohonk, NY State.
Wang, D. and G. J. Brown (Eds.) (2006). Computational Auditory Scene Analysis: Principles,
Algorithms, Applications. Wiley.
Whiteley, N., A. T. Cemgil, and S. J. Godsill (2006). Bayesian modelling of temporal structure in musical audio. In Proceedings of International Conference on Music Information Retrieval, Victoria, Canada.
Whiteley, N., A. T. Cemgil, and S. J. Godsill (2007, April). Sequential inference of rhythmic structure in musical audio. In Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP 07), pp. 1321–1324. IEEE.
Wolfe, P. J., S. J . Godsill, and W. Ng (2004). Bayesian variable selection and regularisation
for time-frequency surface estimation. Journal of the Royal Statistical Society.