Automated extraction of odontocete whistle contours
Marie A. Roch
San Diego State University, Department of Computer Science, 5500 Campanile Drive, San Diego,
California 92182-7720
T. Scott Brandes
Signal Innovations Group, Incorporated, 4721 Emperor Boulevard, Suite 330, Research Triangle Park,
North Carolina 27703
Bhavesh Patel
San Diego State University, Department of Computer Science, 5500 Campanile Drive, San Diego,
California 92182-7720
Yvonne Barkley
Southwest Fisheries Science Center, National Oceanic and Atmospheric Administration, 3333 North Torrey
Pines Court, La Jolla, California 92037
Simone Baumann-Pickering
Scripps Institution of Oceanography, University of California, San Diego, 9500 Gilman Drive, La Jolla,
California 92093-0205
Melissa S. Soldevilla
Duke University Marine Laboratory, 135 Duke Marine Lab Road, Beaufort, North Carolina 28516
(Received 7 March 2011; revised 11 July 2011; accepted 25 July 2011)
Many odontocetes produce frequency modulated tonal calls known as whistles. The ability to auto-
matically determine time × frequency tracks corresponding to these vocalizations has numerous
applications including species description, identification, and density estimation. This work devel-
ops and compares two algorithms on a common corpus of nearly one hour of data collected in the
Southern California Bight and at Palmyra Atoll. The corpus contains over 3000 whistles from
bottlenose dolphins, long- and short-beaked common dolphins, spinner dolphins, and melon-headed
whales that have been annotated by a human, and released to the Moby Sound archive. Both
algorithms use a common signal processing front end to determine time × frequency peaks from a
spectrogram. In the first method, a particle filter performs Bayesian filtering, estimating the contour
from the noisy spectral peaks. The second method uses an adaptive polynomial prediction to con-
nect peaks into a graph, merging graphs when they cross. Whistle contours are extracted from
graphs using information from both sides of crossings. The particle filter was able to retrieve 71.5%
(recall) of the human annotated tonals with 60.8% of the detections being valid (precision). The
graph algorithm’s recall rate was 80.0% with a precision of 76.9%.
© 2011 Acoustical Society of America. [DOI: 10.1121/1.3624821]
PACS number(s): 43.80.Cs, 43.60.Uv [WA] Pages: 2212–2223
The identification and description of individual marine
mammal tonal calls is a task that has numerous applications.
Contour description is useful for describing species’ vocal
repertoires (i.e., Wang et al., 1995) or can be a preliminary
step in a signal processing chain to determine information
about which species generated a set of whistles (Oswald
et al., 2007). Population densities can be estimated from call
detections when the call production rate is known (Marques
et al., 2009), and finally identifying a call recorded at multi-
ple hydrophones can be used to solve the call correspon-
dence task in localization applications.
Over the last two decades, several groups have worked
on methods to automate the description of whistles. Whistles
are tonal calls produced by many species of odontocetes.
Some of these are semi-automated, such as the approaches
used by Buck and Tyack (1993) and Lammers et al. (2003),
where a user is required to provide information to assist the
algorithm such as starting and ending points of the whistles.
Most fully automated algorithms, including the techniques
described in this work, focus on extracting contour ridges
from time × frequency representations of a signal. The time × frequency representation typically consists of a spectrogram computed from sequences of Fourier transforms of overlapping windowed audio data.
Examples of this type of algorithm include the work of
Datta and Sturtivant (2002) which uses edge-detection tech-
niques from image processing to find and connect areas of
the spectrogram with sharp transitions in intensity. Other
Author to whom correspondence should be addressed. Electronic mail:
2212 J. Acoust. Soc. Am. 130 (4), October 2011  0001-4966/2011/130(4)/2212/12/$30.00  © 2011 Acoustical Society of America
strategies include identifying peaks in the spectra and connecting them in a coherent manner. For algorithms that identify whistles from time × frequency peaks, the challenge lies
in determining when peaks are in close enough proximity to
be part of the same whistle contour and being able to disam-
biguate individual whistles when they cross one another.
Halkias and Ellis (2006) detected short segments of whistles
and then decided whether or not to connect nearby segments
based upon likelihood models trained on the mean frequency
curvature and energy. Pulsed killer whale calls appear to
have a harmonic structure when analyzed with long windows
relative to the interpulse interval [see Watkins (1967) for a
discussion of this relationship]. Brown and Miller (2007) as
well as Shapiro and Wang (2009) have exploited this struc-
ture to determine call tracks.
An alternative approach is to consider whistle discovery
within the context of Bayesian filtering, with the time-varying
whistle frequency serving as a hidden state that is estimated
from the noisy sound field. Mallawaarachchi et al. (2008)
used Kalman filters, a closed-form solution of the Bayesian
filtering problem, to adaptively predict the next time × frequency peak along a whistle path, thus solving the disambiguation problem by retaining state information about
each whistle. Other approaches that use predictions based on
partial detections include the whistle detectors in Ishmael
(Mellinger, 2001) and PamGuard (Gillespie et al., 2008).
Some groups have proposed techniques that are not based
on sequences of short-time Fourier transforms. Adam (2008)
extracted calls of killer whales using the Hilbert-Huang trans-
form. Ioana et al. (2010) proposed a method to extract tonals
when the signal’s phase track can be approximated by a poly-
nomial. The strongest signal is estimated by the product high-
order ambiguity function (Barbarossa et al., 1998), subtracted,
and the process is iterated to find the next strongest signal.
In this work, we consider two methods for the automatic
annotation of whistles. The first uses a particle filter, which
overcomes several shortcomings of Kalman filters, which constrain the probability distributions to Gaussians and only permit linear state update equations. The second method
uses the formalism of a graph to connect spectral peaks and
permits delayed decisions about which paths are associated
with which whistle (if any) until more information than sim-
ply the next detected peak is available.
Kalman filters become limited in real-world scenarios
and have difficulty when distributions are non-Gaussian,
such as when distributions become multimodal at contour
intersections or when background noise increases suddenly.
As a robust alternative to Kalman filters in more complex
environments, particle filters provide a sequential Monte
Carlo solution for Bayesian filtering that works in non-
linear and non-Gaussian settings (Doucet et al., 2001, pp.
3–14; Arulampalam et al., 2002). In previous work, parti-
cle filters have been used in formant tracking for human
speech (Shi and Chang 2003); however, extensions are
needed for detecting odontocete whistles since their
approach requires that formants remain uninterrupted by
other sounds or formants. White and Hadley (2008)
showed that particle filters have the potential for use in
cetacean whistle extraction by showing that a simple parti-
cle filter can extract a single short whistle. In the work
presented here we extend their approach by accommodat-
ing a more sophisticated particle filter specifically designed
for odontocete whistle extraction in a complex acoustic
environment with numerous overlapping whistles from
multiple individuals.
As an alternative approach, we also consider the con-
struction of graph representations of whistle networks.
Graphs are commonly used to represent the interconnections
between nodes and have applications in search, path-finding,
etc. (Nilsson, 1980, Chap. 2). Spectral peaks in the time × frequency space are examined and either appended to existing graphs or form new graphs. Graphs may contain multiple
whistles, and a disambiguation step analyzes each graph to
extract the multiple whistles that may lie within.
This work uses a common signal processing front-end to
compare the results of particle-filter and graph based meth-
ods for detecting whistle contours. Nearly one hour of
recordings has been hand-annotated for five different species
of odontocetes, and metrics have been defined to determine
the efficacy of the algorithms not only for retrieval and false
detections, but also to characterize the quality of the detec-
tions. Both techniques are capable of real-time whistle
extraction on current-generation workstations such as the Phenom II X4 940 (Advanced Micro Devices, Sunnyvale, CA) or the Xeon X3360 (Intel, Santa Clara, CA) with multiple gigabytes of RAM.
A. Data collection
Data sampled at 192 kHz with 16 or 24 bit quantization
were collected for five species of odontocetes. Calls from
short-beaked and long-beaked common dolphins (Delphinus delphis and D. capensis, respectively), as well as bottlenose
dolphins (Tursiops truncatus) were collected in the Southern
California Bight between 2004 and 2006. Additional bottle-
nose dolphin recordings along with recordings from melon-
headed whales (Peponocephala electra) and spinner dolphins
(Stenella longirostris longirostris) were collected during
2006 and 2007 at Palmyra Atoll. Two types of hydrophones
were used, the ITC 1042 (Intl. Transducer Corp., Santa Bar-
bara, CA), and the HS150 (Sonar Research and Development
Ltd., Beverly, UK), both of which have flat frequency responses (±3 dB) between 1 and 100 kHz. Hydrophones were
dipped or towed from small boats, the stationary platform
R/P FLIP (Fisher and Spiess 1963), and the R/V David Starr
Jordan. Hydrophone depths were typically 10 to 30 m.
Trained visual observers confirmed the identity of each
species. Recordings were made only in the presence of sin-
gle-species groups when no other groups were sighted. Limi-
tations of the data collection include differences in the
ability to sight other species due to observation platform
height as well as similarities between long-beaked and short-
beaked common dolphins (Heyning and Perrin, 1994) that
make identification of these species more difficult. The
sightings, recording durations, and specific files of the 56 m
39 s subset of the data used in this study are reported in
Tables I and II.
B. Signal processing front-end
Spectrograms are formed from the log magnitude spectra of Hamming-windowed data frames computed every Δt. In this work we use a frame length of 8 ms and a frame advance of Δt = 2 ms, resulting in a per frequency bin bandwidth of Δf = 125 Hz. Only frequency bins between 5 and 50 kHz are processed as most calls of the species of interest lie within this range. Spectrograms are smoothed by using a median filter over a 3 × 3 time-frequency grid followed by a per frequency bin spectral means subtraction over a 3 s window.
Spectral peaks are identified for each frame by noting all frequency bins whose normalized magnitude exceeds 10 dB relative to the means-subtracted background and that have no higher-energy neighbors within 250 Hz (±2Δf), which is roughly where the first side lobe of a Hamming window occurs for the 8 ms window used by the algorithm. Part of the motivation for suppressing close peaks is to prevent detections of peaks from echoes with very short delay which would have a similar trajectory, such as those that might occur when an animal is near the surface. Regions
of broad band energy, such as those produced by impulsive
sounds such as snapping shrimp or echolocation clicks, are
detected by checking to see if the percentage of frequency
bins identified as peaks has increased dramatically from the
previous frame. When this increases by more than 1%, we
consider it unlikely that the new peaks are attributable to the
start of new whistles, and the frame is not processed. Subse-
quent frames use the number of peaks from the last accepted
frame when determining the percentage increase in peaks,
and the algorithm initializes this value to 5% of the fre-
quency bins at the start of processing. The thresholds for this
ad hoc method will be shown to produce good results for the
species in this study, and would need to be adjusted for any
species that started to chorus in large numbers within Δt s.
The specified analysis and growth rate parameters result in
the admission of up to eighteen calls in the first analysis
frame and up to three new calls within any 2 ms period.
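As a concrete illustration, the front-end chain above (3 × 3 median smoothing, per-bin means subtraction, thresholding, and suppression of close peaks within 250 Hz) might be sketched as follows. This is a hypothetical reconstruction, not the authors' code: the means subtraction here is over the whole block rather than a sliding 3 s window, and the broadband-frame rejection step is omitted.

```python
import numpy as np

def median3x3(a):
    """3x3 median smoothing of a 2-D array with edge replication."""
    pad = np.pad(a, 1, mode="edge")
    shifted = [pad[i:i + a.shape[0], j:j + a.shape[1]]
               for i in range(3) for j in range(3)]
    return np.median(np.stack(shifted), axis=0)

def spectral_peaks(spec_db, df=125.0, thresh_db=10.0, min_sep_hz=250.0):
    """Per-frame peak picking on a (n_bins, n_frames) dB spectrogram.

    Returns one list of peak bin indices per frame: bins whose normalized
    magnitude exceeds thresh_db and that have no higher-energy neighbor
    within min_sep_hz.
    """
    smooth = median3x3(np.asarray(spec_db, float))
    # Per-frequency-bin means subtraction (block-wide simplification of
    # the paper's 3 s sliding window).
    norm = smooth - smooth.mean(axis=1, keepdims=True)
    sep = int(round(min_sep_hz / df))          # 250 Hz -> 2 bins at 125 Hz
    peaks_per_frame = []
    for t in range(norm.shape[1]):
        col, peaks = norm[:, t], []
        for k in range(len(col)):
            lo, hi = max(0, k - sep), min(len(col), k + sep + 1)
            # keep only bins above threshold that dominate their neighborhood
            if col[k] > thresh_db and col[k] >= col[lo:hi].max():
                peaks.append(k)
        peaks_per_frame.append(peaks)
    return peaks_per_frame
```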
C. Whistle extraction
We compare two competing methods of determining
whistle (tonal) contour patterns from the detected peaks. The
first method employs particle filters to model the trajectory
of hypothesized peaks and incrementally builds candidate
tonal contours. The second method assembles peaks that
meet criteria into a graph representation. No attempt is made
TABLE I. Summary of recordings. Abbreviations: CalCOFI—California Cooperative Oceanic Fisheries Investigations oceanographic survey, SCI—San
Clemente Island small boat survey, SOCAL—SOuthern CALifornia Instrumentation cruises on the R/V Sproul, FLIP—R/P FLIP moored recordings, and
Palmyra—Palmyra Atoll small boat recordings.
Species Duration Expedition Duration Expedition Duration Expedition Total duration
Bottlenose dolphin 4 m 13 s SCI 6 m 39 s Palmyra 10 m 52 s
Long-beaked common dolphin 5 m 0 s CalCOFI 3 m 54 s SOCAL 5 m 0 s FLIP 13 m 54 s
Melon-headed whale 6 m 57 s Palmyra 1 m 5 s Palmyra 3 m 18 s Palmyra 11 m 20 s
Short-beaked common dolphin 2 m 30 s SCI 6 m 11 s SCI 4 m 47 s SCI 13 m 28 s
Spinner dolphin 2 m 23 s Palmyra 2 m 5 s Palmyra 2 m 37 s Palmyra 7 m 5 s
grand total 56 m 39 s
TABLE II. Audio files corresponding to the summary data of Table I. Files are publicly available in the Moby
Sound archive as part of the 2011 Detection, Classification, and Localization of Marine Mammals Using Pas-
sive Acoustic Monitoring conference dataset.
Species Sighting File(s)
Bottlenose dolphin 1 Qx-Tt-SCI0608-N1-060814-121518.wav
2 palmyra092007FS192-070924-205305.wav and
Long-beaked common dolphin 1 Qx-Dc-CC0411-TAT11-CH2-041114-154040-s.wav
2 Qx-Dc-SC03-TAT09-060516-171606.wav
3 QX-Dc-FLIP0610-VLA-061015-165000.wav
Melon-headed whale 1 palmyra092007FS192-070925-023000.wav
2 palmyra092007FS192-071004-032342.wav
3 palmyra102006-061020-204327_4.wav
Short-beaked common dolphin 1 Qx-Dd-SCI0608-N1-060815-100318.wav
2 Qx-Dd-SCI0608-Ziph-060817-100219.wav
3 Qx-Dd-SCI0608-Ziph-060817-125009.wav
Spinner dolphin 1 palmyra092007FS192-070927-224737.wav
2 palmyra092007FS192-071011-232000.wav
3 palmyra102006-061103-213127_4.wav
to disambiguate crossing tonals until after a graph has been
completed. This allows information from both sides of a
crossing to be considered when disambiguating multiple
whistles that cross.
For both methods, false detections are reduced by dis-
carding detections of less than 150 ms duration that are fre-
quently due to noise. While some whistles may be shorter
than 150 ms, results reported by Oswald et al. (2003) for
nine species of odontocetes in the eastern tropical Pacific
(four of which are covered in this study) had mean durations
from 0.3 to 1.4 s, with the species producing the shortest du-
ration whistles having a standard deviation of 0.3.
1. Particle filter
If we have a sequence of detected spectral peaks s_{1:t} from time index 1 to t that can be used to model a sequence of contour estimates c_{0:t} as a general Markovian process, Bayes' theorem describes the posterior distribution (that of the estimated contour given the spectral peaks) at any time t as

p(c_{0:t} | s_{1:t}) = p(s_{1:t} | c_{0:t}) p(c_{0:t}) / p(s_{1:t}),   (1)

where the initial contour estimate, c_0, is set to the first spectral peak encountered that is not associated with another whistle. This joint distribution of the posterior can be written as a recursive pair of prediction and updating equations using the Chapman-Kolmogorov equation (Papoulis, 1991, p. 193) and Bayes' theorem:

Prediction: p(c_t | s_{1:t-1}) = ∫ p(c_t | c_{t-1}) p(c_{t-1} | s_{1:t-1}) dc_{t-1},   (2)

Updating: p(c_t | s_{1:t}) = p(s_t | c_t) p(c_t | s_{1:t-1}) / p(s_t | s_{1:t-1}).   (3)
This recursion describes a Bayesian filtering process, where
the posterior that is estimated in one time step is used as the
prior distribution (our belief about the previous frequency of
the whistle) in the subsequent time step. If all of these distri-
butions are Gaussian and the state updates are linear, then
Kalman filtering provides a closed-form solution to this
recursion. When this constraint does not hold, as in many
systems of interest, sequential Monte Carlo methods such as particle filtering can be used to find estimates of this posterior.
The particle filter estimates the posterior update with a weighted collection of N point samples or particles c^i_t, where the weights w^i_0 are each initialized as 1/N. Here, the continuous posterior is approximated as a discrete distribution using the Dirac delta function δ(·) over each of the i particles, and the particle weights w^i_t are normalized. Since the shape and peak of the posterior are unknown, we generate point samples by using a distribution we define. This distribution is referred to as an importance density, q(c_t | s_{1:t}), and the particle weights are set proportionally, w^i_t ∝ p(c^i_t | s_{1:t}) / q(c^i_t | s_{1:t}). This can be written recursively using Bayes' theorem as

w^i_t ∝ w^i_{t-1} p(s_t | c^i_t) p(c^i_t | c^i_{t-1}) / q(c^i_t | c^i_{t-1}, s_t).   (4)

By setting the importance density as the product of particle weight in the previous time step and the state update prior, p(c^i_t | c^i_{t-1}), the particle weight becomes

w^i_t ∝ w^i_{t-1} p(s_t | c^i_t).   (5)

In this way, the particle weights are resampled at each time step and normalized to sum to one in a process referred to as sampling importance resampling (Gordon et al., 1993). In the work presented here, the likelihood function p(s_t | c^i_t) takes the form of a normal distribution.
To improve performance, systematic resampling (Kita-
gawa, 1996) is implemented with particle replacement at each
time step. During each recursion, particles with a low weight
are extinguished and replacement particles are regenerated
near particles with a large weight. Particles far from the peak
of the posterior are removed and more particles are added
near the peak so that particles have a better chance of being
distributed within the informative parts of the posterior. This
is done within a continuous resampling space and works
much like a regularized particle filter (Musso et al., 2001).
In the predictive step, the particle locations are updated
according to the motion model for the whistle update. When
the whistle estimate is at least seven samples long, it is used
to approximate the first and second order time derivatives of
the whistle frequency at time t1. These rates of change are
used with a standard second order equation of motion (6) to
estimate the new location for each particle at time t.Asa
way to further increase the chances that the particles will be
well distributed throughout the measurable space spanned
by the posterior and to avoid being gradually herded off
course by perpetuating state estimate errors, a small random
noise is included in the prediction step model. This adjust-
ment is accounted for as a Gaussian random walk and
applied within the motion model. The random adjustment
is drawn from a Gaussian distribution such that 95% of the
draws are within about 40 Hz of the prediction (zero mean,
variance of 375 Hz, or a few frequency bins) and added into
the particle motion model to update particle positions ci
fðtt0Þ2þ: (6)
Here, the estimated first and second derivatives of the whis-
tle contour at an initial time step are _
f, respectively.
Prediction is typically for a single time step ðt0t1Þ;
however, larger time steps can occur when trying to reac-
quire whistle contours that have been lost due to brief signal
masking such as echolocation clicks.
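A minimal sampling-importance-resampling loop in the spirit of the recursion above can be sketched as follows. This is an illustrative toy, not the authors' implementation: it tracks a single tonal from one peak measurement per frame, uses a scalar frequency state instead of the three-component contour state, and replaces the second-order motion model with a pure random walk (hence a larger jitter than the 375 Hz² quoted in the text). All constants are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sir_track(obs, n_particles=500, q_std=60.0, r_std=125.0):
    """Track a tonal's frequency through noisy peak observations (Hz).

    Returns the per-frame posterior-mean frequency estimates.
    """
    particles = obs[0] + r_std * rng.standard_normal(n_particles)
    weights = np.full(n_particles, 1.0 / n_particles)
    estimates = []
    for s_t in obs:
        # Prediction: Gaussian random-walk stand-in for the motion model.
        particles = particles + q_std * rng.standard_normal(n_particles)
        # Update: Gaussian likelihood of the observed spectral peak.
        weights = weights * np.exp(-0.5 * ((s_t - particles) / r_std) ** 2)
        weights /= weights.sum()
        # Posterior estimate: center of mass of the weighted particles.
        estimates.append(float(np.sum(weights * particles)))
        # Systematic resampling: regenerate particles near heavy ones.
        positions = (rng.random() + np.arange(n_particles)) / n_particles
        idx = np.minimum(np.searchsorted(np.cumsum(weights), positions),
                         n_particles - 1)
        particles = particles[idx]
        weights[:] = 1.0 / n_particles
    return estimates
```

Tracking a noiseless linear chirp with this sketch stays within a few bins of the true frequency; the lag grows as the jitter shrinks, which is why the paper's derivative-based motion model can afford a much smaller random walk.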
The likelihood function p(s_t | c^i_t) describes how likely the updated estimates are given the observations, and can take the form of a Gaussian distribution without restricting the ability of the particle filter to estimate non-Gaussian posteriors. In cases where the particles span multiple sound intensity peaks, treating the likelihood as a sum of Gaussians using each observation instead of taking the "best" data point increases the power of the particle filter to navigate a whistle contour when the observations present a more complicated scene. Once the weights are determined by calculating the likelihood for each particle update, the center of mass of the particles represents the best estimate for the peak of the posterior, and is used as the estimated frequency of the whistle contour at time t. To improve performance, the whistle contour is modeled as a three dimensional feature, including not only frequency, but also the first and second order derivatives of the contour, c = [f, ḟ, f̈]^T. The added shift in frequency described in Eq. (6) is applied directly to the first component of c, and the first and second derivatives of the contour are estimated using the last five samples of the contour estimate. The likelihood function is chosen as a multivariate normal distribution with the mean defined by the location of each particle, zero covariance between components, and the diagonal variance Σ = Δf [3, 1.5, 1]. This can be written as N(s_t; c^i_t, Σ), where s_t = [s_{1,t}, s_{2,t}, s_{3,t}]^T is generated from a spectral peak at time t. The first component s_{1,t} is the frequency of the spectral peak, and the second s_{2,t} and third s_{3,t} components represent the rates of change ḟ and f̈, respectively, as determined by treating s_{1,t} as the subsequent contour update. Early in contour finding, before enough contour updates have been found to approximate ḟ and f̈, a lower dimensional likelihood function is used that is scaled to the number of features available. In this way, the best matches of both the particles and the spectral peaks are found based on the available information.
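Read as a product of independent Gaussians over whichever features are available, this likelihood can be sketched as below. The helper is hypothetical, not the paper's code; the variance scaling follows the Δf [3, 1.5, 1] diagonal in the text, and the feature ordering [f, ḟ, f̈] is assumed.

```python
import numpy as np

DF = 125.0  # per-bin bandwidth, Hz

def peak_likelihood(particle, peak, var_scale=(3.0, 1.5, 1.0)):
    """Likelihood of an observed peak feature vector given a particle.

    particle, peak: up to three features [f, df/dt, d2f/dt2]. When fewer
    features are available (early in a contour), only the leading ones
    are compared, mirroring the lower dimensional likelihood in the text.
    """
    particle = np.asarray(particle, float)
    peak = np.asarray(peak, float)
    n = min(len(particle), len(peak))
    var = DF * np.asarray(var_scale[:n])        # diagonal variances
    diff = peak[:n] - particle[:n]
    # product of independent 1-D normal densities
    return float(np.prod(np.exp(-0.5 * diff ** 2 / var)
                         / np.sqrt(2 * np.pi * var)))
```

Note that the frequency component's variance, 3 × 125 = 375 Hz², matches the random walk variance quoted for the motion model above.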
A set of particles is used to estimate a single whistle
contour, as shown in Fig. 1. When there are multiple whistle
contours present, each contour is estimated with its own set
of particles in each time step. New whistles are initiated
each time a sound level peak occurs that cannot be associ-
ated with an existing set of particles describing current whis-
tle contours. This can happen due to spectral separation, on
the order of 500 Hz difference between a sound level peak
and a current whistle contour using the given likelihood
function, or it can occur due to having more numerous sound
level peaks than current whistle contours. Groups of particles
are not considered to be a whistle contour until a minimum
number of time updates occur. The threshold is based on a
user specified minimum whistle duration.
2. Graph detection of tonals
The graph search algorithm maintains two sets of graphs
to organize candidate detections. The first set is the fragment
set and in general contains small fragments of whistles that
are identified. As these fragments grow, they are migrated to
the active set which consists of longer sets of time × frequency peaks without any attempt to disambiguate tonal crossings. Each graph has a set of endpoints that may be extended as new time × frequency peaks are discovered. After a period of inactivity where no new elements are added
to an active graph, it is removed from the active set and a
disambiguation algorithm extracts individual whistles.
The two major operations for each frame of the spectro-
gram consist of graph extension and graph pruning. Exten-
sion consists of examining the peaks and determining if they
are appropriate for extending an existing graph in the frag-
ment or active sets. The pruning step identifies graphs whose
endpoints are too far away from time t to be extended, and identifies the whistles contained therein. Throughout this section, t denotes start times of spectrogram frames.
a. Graph extension. Criteria for graph extension are based on an adaptive polynomial fit of a recent portion (25 ms) of the path to be extended. When multiple paths are possible due to a recent whistle crossing, each possible path is fit. The fit uses an ordinary least squares criterion (Press, 1992, Chap. 15.4). The goodness of the fit is measured by an adjusted R² coefficient (Dillon and Goldstein, 1984, Chap. 6.3.2), which penalizes the fitness measure by a function of the number of parameters and data points:

R²_adj = 1 − [Σ_t (s_t − p̂_f(t))² / (N − degree(p̂_f) − 1)] / [Σ_t (s_t − μ_s)² / (N − 1)],

where t varies over the N regression samples, μ_s is the mean of the regression sample frequencies, and p̂_f(·) is a prediction polynomial of order degree(p̂_f). The fit is initially tried with a first order polynomial. A heuristic that accounts for
FIG. 1. (Color online) Particle filter performance in whistle discovery as
shown with a spectrogram. Approximate boundary of an odontocete whistle
is marked by the solid lines. Detected peaks of whistle are shown as squares.
Particles in each time step are shown as ×'s, and the center of mass of the
particles is depicted as circles. These are the effective measures of the whis-
tle contour in each time step. Tracking continues even through time frames
without nearby whistle peaks allowing whistle contour detection to resume
once peaks are detected again. Locations where the whistle contour detec-
tion is resumed are denoted with darker circles.
the sensitivity of polynomial prediction to quantization noise along with a check for goodness of fit and quantity of estimation data is used to determine whether or not a higher order polynomial should be applied. Letting σ_{p̂_f} denote the standard deviation of the squared residuals, poor fits are re-estimated with the next higher order polynomial when the criterion

σ_{p̂_f} > 2Δf and N > 3 degree(p̂_f)

is satisfied.
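The order-selection heuristic might be sketched as below. This is an assumption-laden reading, not the published code: it uses the standard deviation of the residuals (rather than of the squared residuals), applies the sample-count check to the candidate order, and relies on numpy's Polynomial.fit.

```python
import numpy as np

def adaptive_fit(t, f, df=125.0):
    """Fit the lowest-order polynomial that models a path segment well.

    t, f: times (s) and frequencies (Hz) of recent peaks on the path.
    Starts first order; moves to the next order while the residual
    spread exceeds 2*df and enough samples remain.
    """
    t, f = np.asarray(t, float), np.asarray(f, float)
    order = 1
    while True:
        p = np.polynomial.Polynomial.fit(t, f, order)
        resid = f - p(t)
        # stop when the fit is good enough, or when data is too scarce
        # to justify the next-order polynomial
        if resid.std() <= 2 * df or len(t) <= 3 * (order + 1):
            return p
        order += 1

def adjusted_r2(t, f, p, order):
    """Adjusted R^2 of polynomial p on samples (t, f)."""
    t, f = np.asarray(t, float), np.asarray(f, float)
    n = len(t)
    ss_res = float(np.sum((f - p(t)) ** 2))
    ss_tot = float(np.sum((f - f.mean()) ** 2))
    return 1.0 - (ss_res / (n - order - 1)) / (ss_tot / (n - 1))
```

On a strongly curved 25 ms segment the first-order fit leaves residuals well beyond 2Δf, so the sketch escalates to a quadratic; on a linear segment it stops at first order.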
A new peak, s_t, extends one or more paths in an existing graph (or graphs) if it is within 50 ms of the path's endpoint and within 1000 Hz of the frequency predicted by the path's polynomial fit (Fig. 2). Connections are first tried in the
active set to favor well established paths. If no match is
found, the fragment set is searched. Should an appropriate
path from a fragment set graph be found, the duration of the
newly extended path is examined and the graph is moved to
the active set if the longest possible path exceeds 50 ms.
When no viable extensions of existing paths are feasible, a
new graph consisting of the detected peak is added to the
fragment set.
There are two special cases that merit discussion. It is
possible for a peak to be added to more than one graph. An
example of this occurs when tonal contours cross. In this
case, the graphs are merged using the union-find algorithm
(Cormen et al., 1990, Chap. 21) which permits efficient
merging of sets with near constant-time performance. By
merging the graphs, we delay the decision about which path
should be taken on the other side of the crossing until the
graph has been completed, allowing information from both
sides of the crossing to be used. The second case arises when
two tonals are in close proximity to one another and share a
similar slope. Such spectral peaks are typically within the
tolerance range of the predicted path, and the ability to con-
nect multiple peaks to the same graph endpoint can result in
a lattice structure where two roughly parallel segments are
bridged many times. To prevent this, the same peak is not
permitted to be joined to two endpoints that are part of the
same graph.
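The union-find structure used to merge graphs can be sketched minimally as follows (path compression plus union by size, giving the near constant-time amortized behavior cited from Cormen et al.). Treating graph identifiers as arbitrary hashable keys is an assumption about how graphs are labeled.

```python
class DisjointSet:
    """Union-find with path compression and union by size."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        # register unseen graphs lazily as their own singleton sets
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:        # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        if self.size[ra] < self.size[rb]:    # attach smaller under larger
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return ra
```

When a peak extends endpoints belonging to two graphs, a single union call merges them, and all later membership queries go through find.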
b. Graph pruning. After each graph extension, graphs
are pruned. When a graph has no end points that are within
50 ms of the current frame, it is no longer possible to extend
the graph. Consequently, the graph is removed from the
active or fragment set. When the time difference between the first and last nodes of the graph is less than 150 ms, the graph is discarded. An example of graphs produced by analyzing common dolphin whistles can be seen in the third
panel of Fig. 3.
Graphs that are retained are subjected to a disambigua-
tion step. Conceptually, graph paths are reduced to a set of
nodes that are either start/termination points for a candidate
whistle or intersection points. Each intersection is resolved
into one or more contour segments by examining each possi-
ble pairing between arcs leading into and out of an intersec-
tion node. Nodes with longer paths associated with them are
more likely to be important and are processed first. Ordering
is established by multiplying the length of the longest input
and output paths associated with each node.
FIG. 2. (Color online) Graph extension. Dashed curves depict an active
graph. A peak is depicted by an asterisk and ordinary least squares regres-
sion curves are fit along the closest 25 ms of paths near the peak as indicated
by the change in shade and dash pattern. Peaks that are within 1 kHz of the
path predicted by the polynomial fits will be added to the graph.
FIG. 3. Whistle detection algorithm performance amid the interference of
odontocete echolocation clicks. The uppermost panel shows a spectrogram
of 5 s of long-beaked common dolphin call data (analysis bandwidth 125
Hz) with relative dBs of signal to noise ratio encoded by gray levels. The
second panel shows the whistles detected by the particle filter algorithm.
The last two panels show the whistle graphs and extracted whistles as
detected by the graph search algorithm.
Input and output path pairs are assigned scores based on
a heuristic derived from the adaptive polynomial fit used in
the graph extension step. Forward and backward average
squared prediction errors from up to 300 ms of the incoming
and outgoing paths are summed to determine the feasibility
of each pairing:
penalty(in_{i,k}, out_{j,k}) = (1 / |path_{in_i,k}|) Σ_{(t,f) ∈ path_{in_i,k}} [f̂_{out_j}(t) − f]²
                             + (1 / |path_{out_j,k}|) Σ_{(t,f) ∈ path_{out_j,k}} [f̂_{in_i}(t) − f]²,    (9)

where (t, f) are the time in s and frequency of a node along a path, in_i
and out_j represent the ith input and jth output edges, respectively, from
intersection node k, and path_{p,k} = {all nodes along path p ≤ 0.3 s from
intersection node k}. The prediction polynomials f̂_{in_i} and f̂_{out_j}
are estimated from 0.3 s of data or to the nearest intersection node along
the input and output paths. An example can be seen in the crossing
whistles of Fig. 4. A total of four penalties will be computed, the first
two of which are between one of the incoming edges and the two outgoing
ones (highlighted). The first of these penalties is formed by estimating
predictor polynomials f̂_{DA} and f̂_{AB} for edges DA→ and AB→. The
average squared errors of the predictions of f̂_{DA} onto the closest
0.3 s of AB→ and vice versa are summed. This is repeated for the other three
possible combinations, and a greedy algorithm connects the
paths with the lowest penalties. When no more pairs can be
processed, the next intersection node is examined. The proc-
essing of intersection nodes is ordered by the lengths of the
longest possible pair of paths to favor longer whistles. As
will be shown empirically, spurious detections tend to have
higher false positive rates, and the rationale is to build on
what are likely to be better detections first.
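A minimal sketch of the penalty and greedy pairing described above (assuming second-order prediction polynomials; the path representation and the example whistle shapes are hypothetical):

```python
import numpy as np

def pair_penalty(in_path, out_path, degree=2):
    """Sum of forward and backward mean squared prediction errors between an
    incoming and an outgoing path at a junction [cf. Eq. (9)]. Paths are
    lists of (time_s, freq_hz) nodes within 0.3 s of the intersection."""
    ti, fi = map(np.array, zip(*in_path))
    to, fo = map(np.array, zip(*out_path))
    f_in = np.polyfit(ti, fi, min(degree, len(ti) - 1))   # incoming predictor
    f_out = np.polyfit(to, fo, min(degree, len(to) - 1))  # outgoing predictor
    fwd = np.mean((np.polyval(f_in, to) - fo) ** 2)   # project forward
    bwd = np.mean((np.polyval(f_out, ti) - fi) ** 2)  # project backward
    return float(fwd + bwd)

def greedy_join(in_paths, out_paths):
    """Connect input/output edge pairs with the lowest penalties first."""
    ranked = sorted((pair_penalty(ip, op), i, j)
                    for i, ip in enumerate(in_paths)
                    for j, op in enumerate(out_paths))
    used_in, used_out, joins = set(), set(), []
    for _, i, j in ranked:
        if i not in used_in and j not in used_out:
            joins.append((i, j))
            used_in.add(i)
            used_out.add(j)
    return joins

# Two whistles crossing at (t = 0 s, 10 kHz): one rising, one falling.
t_in = np.linspace(-0.3, -0.01, 10)
t_out = np.linspace(0.01, 0.3, 10)
rise_in = list(zip(t_in, 10000 + 20000 * t_in))
fall_in = list(zip(t_in, 10000 - 20000 * t_in))
rise_out = list(zip(t_out, 10000 + 20000 * t_out))
fall_out = list(zip(t_out, 10000 - 20000 * t_out))
print(sorted(greedy_join([rise_in, fall_in], [rise_out, fall_out])))
# [(0, 0), (1, 1)]: rising joins rising and falling joins falling.
```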
Due to the frequency quantization in discrete Fourier
transforms, an optimization was added to address whistles
with similar slopes. When this occurs, the whistles’ paths
may fall in the same time × frequency bin for multiple
frames. This results in two intersection nodes that may have
a single output and input path between them that corresponds
to both whistles. When two intersection nodes share a single
path with multiple inputs on one side and outputs on another,
we permit the path that bridges the two nodes to belong to
multiple whistles (Fig. 5).
The disambiguation algorithm results in a new set of
graphs where each candidate tonal has no crossings, but may
have one or more extraneous edges. These are removed by a
final pass that filters out short edges of less than 5 ms, which
are not part of an interior path.
D. Ground truth and metrics
An important component of an automated detection sys-
tem is the ability to measure its performance. This must be
done by comparing the system output with a set of known
detections, referred to as ground truth information. Although
spectrograms of whistles can be subjective, humans typically
perform well on visual separation tasks and a trained analyst
(author Y.B.) used custom software that permitted the user
to interactively specify tonal contours. The analyst placed
points along a whistle through which cubic B-spline curves
were fit. B-splines consist of multiple piecewise Bezier
curves that are constrained to have smooth transitions
between certain points through which the B-spline must pass
(Dierckx, 1993, Chap. 1). It is important to recognize that
while efforts were made to carefully record accurate ground
truth information (including inspection of randomly sampled
FIG. 4. Graph disambiguation. When deciding whether the incoming arc DA→
should be joined with the outgoing arc AB→ or AC→, polynomials f̂ are
estimated for all three arcs. The sum of the squared prediction errors of
pairs of incoming and outgoing edges [Eq. (9)] is used to determine which
pairs should be joined. In this example, DA→ is joined to AC→.
FIG. 5. (Color online) Common subpaths. The graph for these common dolphin
whistles shows a dashed segment that is shared between two whistles. The
intersection nodes are characterized by having multiple inputs on one side
and multiple outputs on the other, joined by a single segment. When this
occurs, the disambiguation algorithm permits the segment to be used in more
than one whistle, permitting both whistles in the figure to be extracted.
segments for quality control), some decisions are subjective
and some amount of error is nearly inevitable.
In general, the analyst worked on short segments of 3–5 s
of recording and would adjust the spectrogram contrast and
brightness to most favorably display the tonal contours.
Complete tonals as well as fragments were noted, regardless
of their length or signal-to-noise ratio (SNR). Stepped whis-
tles were recorded as single tonals, while harmonics were
recorded separately. When echoes could be clearly distin-
guished they were not recorded.
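The analyst's point-to-curve step could be reproduced with SciPy's B-spline routines (an assumption — the custom annotation software is not described; the clicked points below are hypothetical):

```python
import numpy as np
from scipy.interpolate import splrep, splev

# Hypothetical analyst clicks along one whistle: (time s, frequency Hz).
knots_t = np.array([0.00, 0.10, 0.25, 0.40, 0.55])
knots_f = np.array([8000.0, 9500.0, 12000.0, 11000.0, 9000.0])

# s=0 gives an interpolating cubic B-spline: the curve passes exactly
# through every clicked point and is smooth in between.
tck = splrep(knots_t, knots_f, k=3, s=0)

# Sample the contour at, e.g., 2 ms steps for comparison with detections.
frame_t = np.arange(0.0, 0.55, 0.002)
contour = splev(frame_t, tck)
print(round(float(splev(0.10, tck))))  # 9500: the curve hits the clicked point
```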
The analyst-specified ground truth information was
compared to the detected whistles using a series of metrics
and selection criteria. The metrics are designed to measure
the correctness and quality of detections. The selection crite-
ria are used to determine which tonals were expected to be
detected, and are based on SNR and length metrics. As the
SNR of tonal calls can vary depending upon the part of the
call, tonals are only expected to be detected when a certain
percentage of the contour exceeds a specified SNR. A second
criterion rejects tonals that are less than a minimum duration.
We set the selection criteria to be appropriate for the types
of signals that could possibly be detected based on the
thresholds used in our algorithms: whistles of 150 ms or longer with a
third of the whistle having an SNR ≥ 10 dB.
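The selection criteria can be sketched as follows (thresholds from the text; representing a tonal as parallel lists of node times and SNRs is an assumption):

```python
def meets_selection_criteria(times_s, snr_db, min_dur_s=0.150,
                             min_frac=1.0 / 3.0, snr_thresh_db=10.0):
    """True when the tonal lasts at least 150 ms and at least a third of
    its nodes have SNR >= 10 dB. times_s/snr_db give one time and one SNR
    value per time-frequency node."""
    duration = times_s[-1] - times_s[0]
    frac_above = sum(s >= snr_thresh_db for s in snr_db) / len(snr_db)
    return duration >= min_dur_s and frac_above >= min_frac

# A 200 ms tonal with roughly half its nodes at 12 dB qualifies.
t = [i * 0.01 for i in range(21)]
snr = [12.0] * 10 + [5.0] * 11
print(meets_selection_criteria(t, snr))          # True
print(meets_selection_criteria(t[:6], snr[:6]))  # only 50 ms long -> False
```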
For each tonal in the ground truth tonal list, we examine
the set of detected tonals that overlap the start and end time
of the ground truth tonal. This is done regardless of whether or
not the tonal meets the selection criteria. All ground truth
tonals are processed so that it can be determined whether or
not a detection matches some ground truth tonal that failed
the selection criteria. In such cases, the matched tonal will
not be included in the metrics that describe the quality and
quantity of matches, but neither will it be considered to be a
false positive (bad match).
As the cubic spline interpolations may have minor deviations from the
actual tonal path, the recorded frequencies are quantized to the nearest
125 Hz (based on an 8 ms analysis window) and a search is conducted within
±500 Hz (±4 bins) for the frequency bin with maximal energy. For each
overlapping point between a detected tonal and a specific current ground
truth tonal, the absolute frequency difference between the detection and
ground truth peak is computed. If the mean difference is ≥ 350 Hz (a few
frequency bins away), the detected tonal is rejected as a false positive.
Otherwise, it is marked as a valid detection.
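A simplified sketch of this validity test (the ±500 Hz maximal-energy search is omitted, and frame-aligned frequency lists are assumed):

```python
import numpy as np

BIN_HZ = 125.0  # 8 ms analysis window -> 125 Hz frequency bins

def is_valid_detection(det_freqs, gt_freqs, max_mean_dev_hz=350.0):
    """Compare a detection with a ground truth tonal over their overlapping
    frames (one frequency per frame for both). Ground truth frequencies are
    quantized to the nearest 125 Hz bin; a detection whose mean absolute
    deviation reaches 350 Hz is rejected as a false positive."""
    gt_binned = np.round(np.asarray(gt_freqs) / BIN_HZ) * BIN_HZ
    mean_dev = np.mean(np.abs(np.asarray(det_freqs) - gt_binned))
    return bool(mean_dev < max_mean_dev_hz)

print(is_valid_detection([10125.0, 10250.0, 10375.0],
                         [10100.0, 10240.0, 10390.0]))  # within a bin -> True
print(is_valid_detection([14000.0, 14125.0],
                         [10100.0, 10240.0]))           # ~4 kHz off -> False
```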
Measurements of system performance describe the sys-
tem’s ability to retrieve tonals as well as the quality of the
retrieved matches. The primary system metrics are recall and
precision. Recall measures the percentage of the expected
detections that were retrieved,

    recall = ( Σ_{t ∈ ground_c} match(detections, t) / |ground_c| ) × 100,    (10)

where ground_c is the set of ground truth tonals subject to the
aforementioned selection criteria, and match(t_1, t_2) is an indicator
function that returns one if tonal t_2 has one or more valid detections
in t_1, and zero otherwise. Precision is a metric that measures the
percentage of detections that are valid,

    precision = ( Σ_{d ∈ detections} match(ground_c, d) / |detections| ) × 100,    (11)

and the false positive rate is simply 100 − precision.
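Reduced to counts, the two metrics can be sketched as follows (the counts below are hypothetical values chosen only so the ratios reproduce the reported graph search percentages):

```python
def recall_precision(gt_matched, n_gt, det_valid, n_det):
    """gt_matched: qualifying ground truth tonals with at least one valid
    detection, out of n_gt; det_valid: detections that validly matched some
    ground truth tonal, out of n_det reported. Returns recall %, precision %,
    and the false positive % (100 - precision)."""
    recall = 100.0 * gt_matched / n_gt
    precision = 100.0 * det_valid / n_det
    return recall, precision, 100.0 - precision

# Hypothetical counts reproducing the reported graph search figures.
r, p, fp = recall_precision(gt_matched=2698, n_gt=3372,
                            det_valid=769, n_det=1000)
print(round(r, 1), round(p, 1), round(fp, 1))  # 80.0 76.9 23.1
```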
Several other metrics are defined to assess the quality of
matches. Coverage is an indication of the average percentage
of a ground truth tonal that is matched and is truncated at
100% to prevent artificial inflation of the coverage statistic
should a detection be slightly longer than a ground truth
tonal. As multiple detections may cover a single ground truth
tonal, fragmentation is a measure of the average number of
detections per ground truth tonal. Deviation is a measure of
the average frequency deviation between the path of ground
truth tonal and its corresponding detection(s). Metrics are
summarized in Fig. 6.
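The coverage and fragmentation metrics can be sketched as follows (the interval endpoints are hypothetical; deviation is omitted for brevity):

```python
def coverage_and_fragments(gt_start, gt_end, fragments):
    """fragments: (start_s, end_s) detections matched to one ground truth
    tonal. Coverage is the matched percentage, truncated at 100% so a
    detection slightly longer than the tonal cannot inflate the statistic;
    fragmentation is the number of matched detections."""
    covered = sum(end - start for start, end in fragments)
    coverage = min(100.0, 100.0 * covered / (gt_end - gt_start))
    return coverage, len(fragments)

# A 1 s tonal detected in two pieces, as in the Fig. 6 caricature:
# coverage = [(t1 - t0) + (t3 - t2)] / (t4 - t0) * 100.
cov, frags = coverage_and_fragments(0.0, 1.0, [(0.0, 0.5), (0.75, 1.0)])
print(cov, frags)  # 75.0 2
```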
Over three thousand ground truth whistles met the selec-
tion criteria that tonals must be at least 150 ms in duration
and that at least a third of the tonal had to have an SNR
≥ 10 dB. The number of whistles meeting these criteria
and the metrics associated with their detections are summar-
ized by sighting and species in Table III. The particle filter
was able to retrieve 71.5% (recall) of the 3372 ground truth
tonals with a precision of 60.8%. The graph algorithm
showed a recall rate of 80.0% with a precision of 76.9%.
The average deviation from the ground truth frequency was
low for both algorithms (particle filter 161 Hz, σ = 51; graph
search 70 Hz, σ = 76). Both algorithms performed reasonably
well on the coverage (particle filter 79.7%, σ = 23.2; graph
search 86.0%, σ = 20.5) and fragmentation (1.2 detections
per tonal for both algorithms) metrics. Sample
FIG. 6. (Color online) Metrics used to characterize detections. The Venn
diagram on the left shows the overlap between the detected tonals and
ground truth data. Recall computes the percentage of correct detections
relative to the ground truth, while precision is the percentage of
detections that were correct. The exaggerated caricatures of a call and
associated detections on the right illustrate the quality metrics. Average
deviation is the mean frequency deviation between the tonal call and
detection(s). As systems may detect a call in multiple pieces, or
fragments, the number of fragments per call is recorded. Coverage is an
indication of the percentage of the tonal that was detected and in this
case would be [(t_1 − t_0) + (t_3 − t_2)] / (t_4 − t_0) × 100. Call and
detection data are caricatures with exaggerated frequency deviation.
detections for various levels of acoustic clutter can be seen
in Figs. 3and 7.
Both algorithms demonstrate the ability to extract whis-
tles from very complex auditory scenes with many animals
vocalizing simultaneously. The precision associated with
both algorithms deserves further analysis, as the values indi-
cate that both algorithms produce a fair number of false posi-
tives. The majority of these false positives are quite short, as
seen in the cumulative distribution function for false positives
with respect to length (Fig. 8). They occur most often in
regions with strong noise and in areas where the noise floor
rises suddenly. Examples of phenomena that can give rise to
this include increases in wind velocity, rainfall, and anthropo-
genic sources. The large number of false positives in the
second melon-headed whale recording is directly attributable to
TABLE III. Performance comparison of graph and particle filter algorithms for the detection of odontocete whistle contours. Summary statistics are
computed across all ground truth tonals meeting SNR and duration selection criteria (see text) and are not averages of sighting statistics. When given,
± indicates standard deviation. For each algorithm, the columns are precision (%), recall (%), mean frequency deviation (Hz), coverage (%), and
fragments per tonal.

                                               Particle filter                                        Graph search
Species                 Sighting  Tonals   Prec.  Recall  Dev. (Hz)  Coverage (%)  Frag.    Prec.  Recall  Dev. (Hz)  Coverage (%)  Frag.
Bottlenose dolphin      1             89    69.9    79.8  170 ± 53    84.1 ± 21.2   1.3      67.6    84.3   44 ± 59    83.1 ± 21.6   1.3
                        2            265    95.9    82.6  141 ± 51    76.4 ± 22.8   1.2      95.5    82.6  128 ± 51    77.0 ± 22.3   1.3
                        all          354    87.2    81.9  148 ± 53    78.3 ± 22.7   1.2      86.4    83.1  106 ± 65    78.5 ± 22.2   1.3
Long-beaked             1            300    11.4    26.3  173 ± 64    57.5 ± 32.0   1.5      18.0    20.3  148 ± 71    71.0 ± 25.0   1.2
common dolphin          2             10    84.6    90.0  138 ± 27    70.3 ± 24.7   1.2     100.0    80.0   94 ± 15    78.1 ± 24.7   1.5
                        3            247    92.5    86.6  148 ± 52    84.3 ± 21.2   1.3      93.6    86.6   44 ± 64    88.1 ± 18.3   1.2
                        all          557    29.9    54.2  154 ± 56    76.9 ± 27.2   1.3      49.4    50.8   68 ± 78    84.1 ± 21.2   1.2
Melon-headed whale      1             90    78.5    67.8  140 ± 52    74.8 ± 22.0   1.0      81.2    71.1   40 ± 51    79.0 ± 23.3   1.1
                        2             78    21.8    69.2  166 ± 46    78.4 ± 18.7   1.1      17.6    64.1  100 ± 35    80.8 ± 17.4   1.1
                        3            170    86.5    74.1  151 ± 51    78.6 ± 23.6   1.2      88.2    72.9  108 ± 54    81.0 ± 20.0   1.2
                        all          338    52.7    71.3  151 ± 50    77.6 ± 22.2   1.1      48.5    70.4   88 ± 58    80.4 ± 20.4   1.1
Short-beaked            1             92    73.5    78.3  155 ± 52    79.6 ± 20.0   1.2      66.9    83.7  137 ± 71    73.6 ± 23.5   1.1
common dolphin          2           1112    66.8    64.4  166 ± 64    81.9 ± 23.5   1.1      96.7    90.5   18 ± 51    95.0 ± 15.0   1.1
                        3            233    76.3    86.3  146 ± 42    83.2 ± 20.5   1.2      79.2    89.7   46 ± 63    85.8 ± 20.5   1.3
                        all         1437    69.1    68.8  161 ± 60    82.0 ± 22.7   1.1      90.7    89.9   30 ± 61    92.2 ± 17.6   1.1
Spinner dolphin         1            357    85.4    88.2  177 ± 50    76.0 ± 22.7   1.2      88.8    89.1  130 ± 59    77.6 ± 21.8   1.4
                        2            146    87.9    81.5  162 ± 45    76.6 ± 22.0   1.1      86.4    82.9  127 ± 56    77.2 ± 18.5   1.2
                        3            183    86.2    84.2  175 ± 53    86.6 ± 19.2   1.3      83.3    82.0  141 ± 59    83.4 ± 21.1   1.5
                        all          686    86.1    85.7  173 ± 50    78.9 ± 22.1   1.2      86.8    85.9  132 ± 58    79.0 ± 21.1   1.4
Overall                             3372    60.8    71.5  161 ± 56    79.7 ± 23.2   1.2      76.9    80.0   70 ± 76    86.0 ± 20.5   1.2
FIG. 7. Sample detections of acoustic scenes with differing degrees of clutter.
broadband hydrophone tow noise between 5 and 25 kHz that
occurred when the tow vessel executed tight turns. A third
contributor to erroneous detections is echo sounder pings that
produce chains of peaks that the algorithms organize into
tonals (Fig. 9). This is the major cause for the poor precision
observed in the first long-beaked common dolphin sighting.
Just as transitions into higher noise regions can cause false
positives, transitions into lower noise regions can result in
misses due to low signal-to-noise estimates in the peak
detection algorithm. Both types of errors suggest that improvements
to the noise estimation and removal portion of the common
signal processing chain could be a productive area for future
performance gains. Finally, missed detections also occur in
regions of very high impulsive noise density, such as occur in
strong burst pulsed calls, which are series of echolocation
clicks produced with a very short interclick interval.
There are a number of situations where a single whistle
will commonly result in multiple detections. As the system
does not track harmonics or associate echoes with the first
arrival, these are seen as separate events. Similarly, stepped
whistles are tracked as separate entities when the step size is
large. Some of these events have the potential to be associ-
ated during post-processing analysis; however, this will
require non-trivial effort due to phenomena such as incom-
plete detections and propagation loss at higher frequencies.
A final type of duplicate detection occurs in the particle filter
detector only. Occasionally, when spectral peaks are in close
proximity, one of the peaks will be used to form a new hy-
pothesis instead of updating an existing hypothesis. Subse-
quent peaks alternate between the two hypotheses,
leapfrogging one another and forming two tonal paths
instead of one. Improving the rules governing the update
of whistle paths might alleviate this problem, particularly
since whistle paths that missed an update have first priority
when new peaks are presented.
For the whistles that are correctly detected, performance is
overall quite good. Detected tonals follow the human analyst’s
ground truth track closely, and typically cover 80–85% of the
whistle as recorded by the analyst. The majority of times, whis-
tles are detected as single contours, although the fragmentation
rate of 1.2 indicates that this is not always the case.
Data from other common and bottlenose dolphin sight-
ings collected using the same equipment and methods were
used in the development of the algorithms, and there was no
significant tuning of algorithm parameters for the data
reported in these experiments. As developed, these algo-
rithms are quite effective for determining presence/absence
of animals and should be able to provide reasonable esti-
mates of contour statistics for longer calls. When visual
observations are available, the extracted contours are suita-
ble for development of species recognition algorithms as
well as the exploration of associations between behavioral
state and whistle content. The minimum length threshold of
150 ms along with the propensity for false detections in
FIG. 8. (Color online) Cumulative distribution function for incorrect
detections whose duration is less than or equal to the duration indicated
on the False Positive Duration axis. Both algorithms require that a
hypothesized tonal have a duration ≥ 150 ms to be reported as a detection.
The vast majority of false positive detections for both algorithms have
short duration.
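The curve in Fig. 8 is an empirical CDF, which can be computed as follows (the durations below are illustrative, not the measured data):

```python
import numpy as np

def fp_duration_cdf(durations_s, grid_s):
    """Fraction of false positives with duration <= each grid value,
    i.e., the empirical CDF plotted against false positive duration."""
    d = np.sort(np.asarray(durations_s))
    return np.searchsorted(d, grid_s, side="right") / d.size

# Illustrative durations: mostly short detections, one long outlier.
fps = [0.16, 0.18, 0.20, 0.30, 0.90]
print(fp_duration_cdf(fps, [0.2, 0.5, 1.0]).tolist())  # [0.6, 0.8, 1.0]
```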
FIG. 9. Example of false positive detections caused by echosounders in both algorithms.
shorter whistles could impact behavioral studies, and future
work should investigate additional noise reduction techni-
ques to more reliably extract shorter whistles.
Both the particle filter and the graph search algorithms
show the ability to extract whistles from complex auditory
scenes containing multiple overlapping simultaneous whistles.
This is demonstrated on a diverse dataset of five species
consisting of nearly one hour of
recorded data with 3372 ground truthed (analyst detected)
calls meeting retrieval criteria of having a relative SNR ≥ 10
dB for at least one third of the call and a duration ≥ 150 ms.
The algorithms are capable of retrieving tonal contours
at a speed of several times real-time on modern computer
architectures. The graph search algorithm outperformed the
particle filter, retrieving 80.0% of the whistles versus 71.5%
by the particle filter. A higher percentage of the detections
from the graph search algorithm (76.9%) matched ground
truth calls than those produced by the particle filter (60.8%),
and in both cases the false positives were dominated by short
duration detections. Correct matches were typically within
one to two frequency bins of the ground truth tonal. Approxi-
mately 80% or more of each tonal was detected (79.7% parti-
cle filter, 86.0% graph search) and whistles were on average
split into 1.2 detections indicating that most tonals were not
split. The most challenging environments for either algorithm
include those with echo sounders, heavy burst pulse call ac-
tivity, and regions of noise state transition, all of which are
areas for further development of the spectral peak detector.
Direct comparisons with other algorithms are difficult
due to differences in data sets, and we avoid making any
claims about our algorithms versus others for this reason. In
an effort to encourage such comparisons, the audio data from
these experiments have been made available to the bioacous-
tics community in the Moby Sound archive (Heimlich et al.,
2011) as part of the Fifth International Detection, Classifica-
tion, and Localization Workshop dataset. The ground truth
information will be released to the Moby Sound archive after
the workshop (August 22–25, 2011 in Portland, OR).
We would like to thank the anonymous reviewers for
their helpful comments on an earlier version of this manu-
script. Numerous people contributed to the collection of the
data used in this work. We would like to thank our col-
leagues at Cascadia Research Collective, the Scripps Whale
Acoustics Lab, and The National University of Singapore’s
Marine Mammal Research Laboratory who provided visual
confirmations on our sightings, especially John Calamboki-
dis, Dominique Camacho, Greg Campbell, Stephen Claus-
sen, Annie Douglas, Erin Falcone, Greg Falxa, Andrea
Havron, Allan Ligon, Megan McKenna, Yeo Kian Peen, Jen
Quan, Nadia Rubio, Greg Schorr, Charles Speed, and Mi-
chael Smith, also the crews of Cal-COFI, the R/V Sproul,
the R/P Flip, and the R/V Zenobia. We also thank Greg
Campbell and Liz Henderson for their help with sighting
data and array configurations, and Chris Garsha, Brent
Hurley, and Sean Wiggins for hardware support. Data collec-
tion was conducted with assistance from John Hildebrand
and was supported by the U.S. Navy Environmental Readi-
ness Division, Frank Stone and Ernie Young, and analysis
and algorithm development was supported by the Office of
Naval Research, Mike Weise and Jim Eckman.
Adam, O. (2008). “Segmentation of killer whale vocalizations using the Hilbert-Huang transform,” EURASIP J. Adv. Signal Process. doi: 10.1155/
Arulampalam, M. S., Maskell, S., Gordon, N., and Clapp, T. (2002). “A tu-
torial on particle filters for online nonlinear/non-Gaussian Bayesian
tracking,” IEEE Trans. Signal Process. 50(2), 174–188.
Barbarossa, S., Scaglione, A., and Giannakis, G. B. (1998). “Product high-
order ambiguity function for multicomponent polynomial-phase signal
modeling,” IEEE Trans. Signal Process. 46(3), 691–708.
Brown, J. C., and Miller, P. J. O. (2007). “Automatic classification of killer
whale vocalizations using dynamic time warping,” J. Acoust. Soc. Am.
122(2), 1201–1207.
Buck, J. R., and Tyack, P. L. (1993). “A quantitative measure of similarity for
Tursiops truncatus signature whistles,” J. Acoust. Soc. Am. 94(5), 2497–2506.
Cormen, T. H., Leiserson, C. E., and Rivest, R. L. (1990). Introduction to
Algorithms (MIT Press, Cambridge, MA), p. 1028.
Datta, S., and Sturtivant, C. (2002). “Dolphin whistle classification for deter-
mining group identities,” Signal Processing 82(2), 127–327.
Dierckx, P. (1993). Curve and Surface Fitting with Splines (Oxford Science
Publications, Oxford), p. 285.
Dillon, W. R., and Goldstein, M. (1984). Multivariate Analysis, Methods
and Applications (Wiley, New York), p. 587.
Doucet, A., de Freitas, N., and Gordon, N. (2001). “An introduction to se-
quential Monte Carlo Methods,” in Sequential Monte Carlo Methods in
Practice, edited by A. Doucet, N. De Freitas, and N. Gordon (Springer,
New York), p. 581.
Fisher, F. H., and Spiess, F. N. (1963). “FLIP-FLoating Instrument
Platform,” J. Acoust. Soc. Am. 35(10), 1633–1644.
Gillespie, D., Gordon, J., McHugh, R., McLaren, D., Mellinger, D. K., Red-
mond, P., Thode, A., Trinder, P., and Deng, X.-Y. (2008). “PAMGUARD:
Semiautomated, open source software for real-time acoustic detection and
localisation of cetaceans,” Proc. Inst. Acoustics.
Gordon, N. J., Salmond, D. J., and Smith, A. F. M. (1993). “Novel approach
to nonlinear/non-Gaussian Bayesian state estimation,” IEE Proc. F 140(2),
Halkias, X. C., and Ellis, D. P. W. (2006). “Call detection and extraction
using Bayesian inference,” Appl. Acoust. 67(11-12), 1164–1174.
Heimlich, S., Klinck, H., and Mellinger, D. K. (2011). The Moby Sound
Database for Research in the Automatic Recognition of Marine Mammal
Calls, (Last viewed on April 1, 2011).
Heyning, J. E., and Perrin, W. F. (1994). “Evidence for two species of com-
mon dolphins (genus Delphinus) from the eastern North Pacific,” Contr.
Sci (Los Angeles) 442, 1–35.
Ioana, C., Gervaise, C., Stéphan, Y., and Mars, J. I. (2010). “Analysis of
underwater mammal vocalizations using time-frequency-phase tracker,”
Appl. Acoust. 71(11), 1070–1080.
Kitagawa, G. (1996). “Monte Carlo filter and smoother for non-Gaussian
nonlinear state space models,” J. Comput. Graph. Stat. 5(1), 1–25.
Lammers, M. O., Au, W. W. L., and Herzing, D. L. (2003). “The broadband
social acoustic signaling behavior of spinner and spotted dolphins,” J.
Acoust. Soc. Am. 114(3), 1629–1639.
Mallawaarachchi, A., Ong, S. H., Chitre, M., and Taylor, E. (2008). “Spec-
trogram denoising and automated extraction of the fundamental frequency
variation of dolphin whistles,” J. Acoust. Soc. Am. 124(2), 1159–1170.
Marques, T. A., Thomas, L., Ward, J., DiMarzio, N., and Tyack P. L.
(2009). “Estimating cetacean population density using fixed passive acous-
tic sensors: An example with Blainville’s beaked whales,” J. Acoust. Soc.
Am. 125(4), 1982–1994.
Mellinger, D. K. (2001). Ishmael 1.0 User’s Guide. NOAA PMEL, Seattle,
OAR-PMEL-120, p. 30.
Musso, C., Oudjane, N., and LeGland, F. (2001). “Improving regularized
particle filters,” in Sequential Monte Carlo Methods in Practice, edited by A.
Doucet, N. De Freitas, and N. Gordon (Springer, New York), pp. 247–271.
Nilsson, N. J. (1980). Principles of Artificial Intelligence (Tioga, Palo Alto,
CA), p. 476.
Oswald, J. N., Barlow, J., and Norris, T. F. (2003). “Acoustic identification
of nine delphinid species in the eastern tropical Pacific ocean,” Mar. Mam-
mal Sci. 19(1), 20–37.
Oswald, J. N., Rankin, S., Barlow, J., and Lammers, M. O. (2007). “A tool
for real-time acoustic species identification of delphinid whistles,” J.
Acoust. Soc. Am. 122(1), 587–595.
Papoulis, A. (1991). Probability, Random Variables, and Stochastic Processes (McGraw-Hill, New York), p. 666.
Press, W. H. (1992). Numerical Recipes in C: the Art of Scientific Comput-
ing (Cambridge University Press, Cambridge), p. 994.
Shapiro, A. D., and Wang, C. (2009). “A versatile pitch tracking algorithm:
From human speech to killer whale vocalizations,” J. Acoust. Soc. Am.
126(1), 451–459.
Shi, Y., and Chang, E. (2003). “Spectrogram-based formant tracking via
particle filters,” Intl. Conf. Acoust., Speech, Signal Proc. (ICASSP), Hong
Kong, China, pp. I-168–I-171.
Wang, D., Würsig, B., and Evans, W. (1995). “Comparisons of whistles
among seven odontocete species,” in Sensory Systems of Aquatic Mam-
mals, edited by R. A. Kastelein, J. A. Thomas, P. E. Nachtigall, (De Spil,
Woerden, NL), pp. 299–323.
Watkins, W. A. (1967). “The harmonic interval: Fact or artifact in spectral
analysis of pulse trains,” in Symposium on Marine Bio-Acoustics, edited
by W. N. Tavolga (Pergamon Press, New York), pp. 15–43.
White, P. R., and Hadley, M. L. (2008). “Introduction to particle filters for
tracking applications in the passive acoustic monitoring of cetaceans,”
Can. Acoust. 36(1), 146–152.
... Thus, to detect and identify whistles, a vocalisation tracking method is necessary. Selecting vocalisations in an audio track is a complex task, and several techniques have already been developed to achieve it: statistical modelling of whistles [69][70][71][72][73][74], tracking algorithms based on hand-picked parameters [75][76][77], image processing approaches [78][79][80][81][82][83] or deep learning models associated with clustering methods [84,85]. For our dataset, we chose to adapt a tracking algorithm developed during the DECAV project [86]. ...
... This would involve an accurate identification of each whistle. However, this task is complex, and detection techniques developed so far have not been able to solve this problem without errors [69][70][71][72][73][74][75][76][77][78][79][80][81][82]84,85]. If whistles are first identified, then a classification of whistles using unsupervised classification techniques such as UMAP [67] could be applied, also using a metric to determine the optimal number of clusters [95]. ...
Full-text available
By-catch is the most direct threat to marine mammals globally. Acoustic repellent devices (pingers) have been developed to reduce dolphin by-catch. However, mixed results regarding their efficiency have been reported. Here, we present a new bio-inspired acoustic beacon, emitting returning echoes from the echolocation clicks of a common dolphin ‘Delphinus delphis’ from a fishing net, to inform dolphins of its presence. Using surface visual observations and the automatic detection of echolocation clicks, buzzes, burst-pulses and whistles, we assessed wild dolphins’ behavioural responses during sequential experiments (i.e., before, during and after the beacon’s emission), with or without setting a net. When the device was activated, the mean number of echolocation clicks and whistling time of dolphins significantly increased by a factor of 2.46 and 3.38, respectively (p < 0.01). Visual surface observations showed attentive behaviours of dolphins, which kept a distance of several metres away from the emission source before calmly leaving. No differences were observed among sequences for buzzes/burst-pulses. Our results highlight that this prototype led common dolphins to echolocate more and communicate differently, and it would favour net detection. Complementary tests of the device during the fishing activities of professional fishermen should further contribute to assessment of its efficiency.
... Thus, whistle detection becomes an image processing task. A threshold-based edge detection (Gillespie, 2004), a particle filter performing Bayesian filtering on detected spectral peaks (Roch et al., 2011), and Bin-wise detection using Statistical Contrast filtering (Dadouchi et al., 2013) were utilized to detect and extract the whistle profile from the spectrogram image. However, the underwater acoustic signals used in these studies were measured in a quieter environment absence of snapping shrimp sound. ...
... The underwater acoustic signal used in earlier studies was acquired at the habitats with a relatively quiet ambient environment (Dadouchi et al., 2013;Gillespie, 2004;Roch et al., 2011;Siddagangaiah et al., 2020). The detection methods developed for these habitats could be inefficient when unexpected acoustic sources such as shrimp snapping or anthropogenic sounds were presented. ...
Passive acoustic monitoring (PAM) is commonly utilized to monitor cetacean species’ distribution, abundance, and behavior. The demand for automated methods to detect and extract cetacean vocalizations from acoustic data has increased in the last few decades. Automatic whistle extraction of Indo-Pacific Bottlenose Dolphin (IPBD) and other whistle-producing delphinids habitating in the coastal areas represents a challenging problem due to the high ambient noise, including ship noise snapping shrimp at the same habitat. The acoustic signal containing snapping shrimp sound was usually excluded during the development of the detection method. A robust tool of bioacoustics for snapping shrimp sound is still lacking. This study trained a convolutional neural network (CNN) designed for semantic segmentation, initially developed for autonomous driving, to extract the whistle contour from spectrogram at a pixel level. A total of 1600 datasets was annotated for training and testing. As a result, the semantic segmentation classified the whistle with an overall mean Precision of 0.96, Accuracy of 0.89, and F-score of 0.86. In particular, the semantic segmentation extracted the whistle even if associated with a rich snapping shrimp sound, which the conventional method is incapable of. The advancement of metrics presented in this paper will enable long-term assessment of the IPBD population and individual or group tracking.
... The DCLDE 2011 Oregon dataset contains calls from shortbeaked and long-beaked common dolphins (Delphinus delphis and D. capensis), bottlenose dolphins (Tursiop truncatus) and spinner dolphins (Stenella longirostris), which were used to develop the training set with ground-truthed delphinid signals. Recorded in the Southern California Bight, the dataset encompasses both echolocations click and tonal calls (Roch et al., 2011b). ...
Full-text available
The effective analysis of Passive Acoustic Monitoring (PAM) data has the potential to determine spatial and temporal variations in ecosystem health and species presence if automated detection and classification algorithms are capable of discrimination between marine species and the presence of anthropogenic and environmental noise. Extracting more than a single sound source or call type will enrich our understanding of the interaction between biological, anthropogenic and geophonic soundscape components in the marine environment. Advances in extracting ecologically valuable cues from the marine environment, embedded within the soundscape, are limited by the time required for manual analyses and the accuracy of existing algorithms when applied to large PAM datasets. In this work, a deep learning model is trained for multi-class marine sound source detection using cloud computing to explore its utility for extracting sound sources for use in marine mammal conservation and ecosystem monitoring. A training set is developed comprising existing datasets amalgamated across geographic, temporal and spatial scales, collected across a range of acoustic platforms. Transfer learning is used to fine-tune an open-source state-of-the-art 'small-scale' convolutional neural network (CNN) to detect odontocete tonal and broadband call types and vessel noise (from 0 to 48 kHz). The developed CNN architecture uses a custom image input to exploit the differences in temporal and frequency characteristics between each sound source. Each sound source is identified with high accuracy across various test conditions, including variable signal-to-noise ratio. We evaluate the effect of ambient noise on detector performance, outlining the importance of understanding the variability of the regional soundscape for which it will be deployed. Our work provides a computationally low-cost, efficient framework for mining big marine acoustic data, for information on temporal scales relevant to the management of marine protected areas and the conservation of vulnerable species.
... 24 The contour-based features were extracted from the time-frequency ridge (TFR) of each vocalization. The TFR extraction is often used in bioacoustics studies, [25][26][27][28] since it allows the extraction of several characteristics related to the TF representation of the sounds. Indeed, the extraction of TF domain features from USVs presents a substantial challenge, not only due to the significant amount of noise present in audio signals, but also due to the typical irregularities and discontinuities in the USVs' TF spectrum. ...
This paper addresses the development of a system for classifying mouse ultrasonic vocalizations (USVs) present in audio recordings. The automatic labeling process for USVs is usually divided into two main steps: USV segmentation followed by the matching classification. Three main contributions can be highlighted: (i) a new segmentation algorithm, (ii) a new set of features, and (iii) the discrimination of a higher number of classes when compared to similar studies. The developed segmentation algorithm is based on spectral entropy analysis. This novel segmentation approach can detect USVs with 94% and 74% recall and precision, respectively. When compared to other methods/software, our segmentation algorithm achieves a higher recall. Regarding the classification phase, besides the traditional features from time, frequency, and time-frequency domains, a new set of contour-based features were extracted and used as inputs of shallow machine learning classification models. The contour-based features were obtained from the time-frequency ridge representation of USVs. The classification methods can differentiate among ten different syllable types with 81.1% accuracy and 80.5% weighted F1-score. The algorithms were developed and evaluated based on a large dataset, acquired on diverse social interaction conditions between the animals, to stimulate a varied vocal repertoire.
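The spectral-entropy segmentation idea described above can be sketched as follows. This is a minimal illustration under assumed conventions, not the paper's algorithm; the threshold value and function names are invented for the example. The intuition: a frame containing a tonal call concentrates energy in few frequency bins (low entropy), while broadband noise spreads it evenly (high entropy).

```python
import numpy as np

def spectral_entropy(frame_power, eps=1e-12):
    """Normalized spectral entropy of one spectrogram frame:
    ~0 for a pure tone, ~1 for spectrally flat noise."""
    p = frame_power / (frame_power.sum() + eps)
    h = -np.sum(p * np.log2(p + eps))
    return h / np.log2(len(frame_power))

def segment_by_entropy(spectrogram, threshold=0.8):
    """Flag frames whose entropy drops below threshold (tonal energy present).
    spectrogram: (n_freq_bins, n_frames) array of power values."""
    entropies = np.array([spectral_entropy(spectrogram[:, t])
                          for t in range(spectrogram.shape[1])])
    return entropies < threshold
```

A frame with all energy in a single bin is flagged as a call candidate, while a flat-spectrum frame is not; in practice the per-frame flags would be smoothed and merged into call segments.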
... A process of call contour extraction and processing was performed to assist in visual validation of subjective call type categories. For calls that were determined to be of sufficient quality, the fundamental frequency and harmonics were manually traced on the computer using the custom software Silbido (Roch et al., 2011). Calls were plotted for visual inspection in 5 s time windows with a 10 dB signal-to-noise ratio (SNR) threshold. ...
Killer whales (Orcinus orca) produce a variety of acoustic signal types used for communication: clicks, whistles, and pulsed calls. Discrete pulsed calls are highly stereotyped, repetitive, and unique to individual pods found around the world. Discriminating amongst pod specific calls can help determine population structure in killer whales and is used to track pod movements around oceans. Killer whale presence in the Canadian Arctic has increased substantially, but we have limited understanding of their ecology, movements, and stock identity. Two autonomous passive acoustic monitoring (PAM) hydrophones were deployed in the waters of Eclipse Sound and Milne Inlet, in northern Baffin Island, Nunavut, Canada, in August and September 2017. Eleven killer whale pulsed call types, three multiphonic and eight monophonic, are proposed and described using manual whistle contour extraction and feature normalization. Automated detection of echolocation clicks between 20 and 48 kHz demonstrated little to no overlap between killer whale calls and echolocation presumed to be narwhal, which suggests that narwhal remain audibly inconspicuous when killer whales are present. Describing the acoustic repertoire of killer whales seasonally present in the Canadian Arctic will aid in understanding their acoustic behaviour, seasonal movements, and ecological impacts. The calls described here provide a basis for future acoustic comparisons across the North Atlantic and aid in characterizing killer whale demographics and ecology, particularly for pods making seasonal incursions into Arctic waters.
... Over the past decade the development of machine learning tools, and their applications to ecological data, has resulted in a proliferation of automated methods for analyzing large marine acoustic datasets. Both unsupervised and supervised learning frameworks, most notably clustering and deep learning algorithms, have become standard tools in the analysis of marine acoustic data [14][15][16][17][18][19][20][21][22][23][24]. These approaches require initial time investment to develop the models, and for some applications this investment may be substantial. ...
A combination of machine learning and expert analyst review was used to detect odontocete echolocation clicks, identify dominant click types, and classify clicks in 32 years of acoustic data collected at 11 autonomous monitoring sites in the western North Atlantic between 2016 and 2019. Previously-described click types for eight known odontocete species or genera were identified in this data set: Blainville’s beaked whales (Mesoplodon densirostris), Cuvier’s beaked whales (Ziphius cavirostris), Gervais’ beaked whales (Mesoplodon europaeus), Sowerby’s beaked whales (Mesoplodon bidens), and True’s beaked whales (Mesoplodon mirus), Kogia spp., Risso’s dolphin (Grampus griseus), and sperm whales (Physeter macrocephalus). Six novel delphinid echolocation click types were identified and named according to their median peak frequencies. Consideration of the spatiotemporal distribution of these unidentified click types, and comparison to historical sighting data, enabled assignment of the probable species identity to three of the six types, and group identity to a fourth type. UD36, UD26, and UD28 were attributed to Risso’s dolphin (G. griseus), short-finned pilot whale (G. macrorhynchus), and short-beaked common dolphin (D. delphis), respectively, based on similar regional distributions and seasonal presence patterns. UD19 was attributed to one or more species in the subfamily Globicephalinae based on spectral content and signal timing. UD47 and UD38 represent distinct types for which no clear spatiotemporal match was apparent. This approach leveraged the power of big acoustic and big visual data to add to the catalog of known species-specific acoustic signals and yield new inferences about odontocete spatiotemporal distribution patterns. The tools and call types described here can be used for efficient analysis of other existing and future passive acoustic data sets from this region.
Classification of the acoustic repertoires of animals into sound types is a useful tool for taxonomic studies, behavioral studies, and for documenting the occurrence of animals. Classification of acoustic repertoires enables the identification of species, age, gender, and individual identity, correlations between sound types and behavior, the identification of changes in vocal behavior over time or in response to anthropogenic noise, comparisons between the repertoires of populations living in different geographic regions and environments, and the development of software tools for automated signal processing. Techniques for classification have evolved over time as technical capabilities have expanded. Initially, researchers applied qualitative methods, such as listening and visually discerning sounds in spectrograms. Advances in computer technology and the development of software for the automatic detection and classification of sounds have allowed bioacousticians to quickly find sounds in recordings, thus significantly reducing analysis time and enabling the analysis of larger datasets. In this chapter, we present software algorithms for automated signal detection (based on energy, Teager–Kaiser energy, spectral entropy, matched filtering, and spectrogram cross-correlation) as well as for signal classification (e.g., parametric clustering, principal component analysis, discriminant function analysis, classification trees, artificial neural networks, random forests, Gaussian mixture models, support vector machines, dynamic time-warping, and hidden Markov models). Methods for evaluating the performance of automated tools are presented (i.e., receiver operating characteristics and precision-recall) and challenges with classifying animal sounds are discussed.
Detecting whistle events is essential when studying the population density and behavior of cetaceans. After eight months of passive acoustic monitoring in Xiamen, we obtained long calls from two Tursiops aduncus individuals. In this paper, we propose an algorithm with an unbiased gammatone multi-channel Savitzky–Golay for smoothing dynamic continuous background noise and interference from long click trains. The algorithm uses the method of least squares to perform a local polynomial regression on the time–frequency representation of multi-frequency resolution call measurements, which can effectively retain the whistle profiles while filtering out noise and interference. We prove that it is better at separating out whistles and has lower computational complexity than other smoothing methods. In order to further extract whistle features in enhanced spectrograms, we also propose a set of multi-scale and multi-directional moving filter banks for various whistle durations and contour shapes. The final binary adaptive decisions at frame level for whistle events are obtained from the histograms of multi-scale and multi-directional spectrograms. Finally, we explore the entire data set and find that the proposed scheme achieves the highest frame-level F1-scores when detecting T. aduncus whistles than the baseline schemes, with an improvement of more than 6%.
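The local least-squares polynomial regression that underlies Savitzky–Golay smoothing can be sketched directly with NumPy. This is a generic sliding-window illustration, not the unbiased gammatone multi-channel variant the paper proposes; the window length and polynomial order are arbitrary choices for the example.

```python
import numpy as np

def local_poly_smooth(y, window=11, order=3):
    """Smooth a 1-D sequence by fitting a least-squares polynomial
    in a sliding window and evaluating it at the center point --
    the core idea of Savitzky-Golay filtering."""
    half = window // 2
    y = np.asarray(y, dtype=float)
    out = np.empty_like(y)
    for i in range(len(y)):
        lo, hi = max(0, i - half), min(len(y), i + half + 1)
        x = np.arange(lo, hi)
        # fit a polynomial to the windowed samples by least squares
        coeffs = np.polyfit(x, y[lo:hi], min(order, len(x) - 1))
        out[i] = np.polyval(coeffs, i)
    return out
```

Applied row-wise (or along ridge tracks) in a spectrogram, this kind of smoother suppresses broadband noise while preserving the slowly varying shape of a whistle contour.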
Automatic algorithms for the detection and classification of sound are essential to the analysis of acoustic datasets with long duration. Metrics are needed to assess the performance characteristics of these algorithms. Four metrics for performance evaluation are discussed here: receiver-operating-characteristic (ROC) curves, detection-error-trade-off (DET) curves, precision-recall (PR) curves, and cost curves. These metrics were applied to the generalized power law detector for blue whale D calls [Helble, Ierley, D'Spain, Roch, and Hildebrand (2012). J. Acoust. Soc. Am. 131(4), 2682-2699] and the click-clustering neural-net algorithm for Cuvier's beaked whale echolocation click detection [Frasier, Roch, Soldevilla, Wiggins, Garrison, and Hildebrand (2017). PLoS Comp. Biol. 13(12), e1005823] using data prepared for the 2015 Detection, Classification, Localization and Density Estimation Workshop. Detection class imbalance, particularly the situation of rare occurrence, is common for long-term passive acoustic monitoring datasets and is a factor in the performance of ROC and DET curves with regard to the impact of false positive detections. PR curves overcome this shortcoming when calculated for individual detections and do not rely on the reporting of true negatives. Cost curves provide additional insight on the effective operating range for the detector based on the a priori probability of occurrence. Use of more than a single metric is helpful in understanding the performance of a detection algorithm.
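Precision and recall, unlike ROC curves, require no true-negative count, which is why they suit long passive acoustic recordings where "no call present" dominates. A minimal sketch of the computation follows; the detection counts in the usage note are invented for illustration and happen to reproduce the 80.0% recall / 76.9% precision reported for the graph algorithm in the abstract above.

```python
def precision_recall(n_true_pos, n_false_pos, n_false_neg):
    """Precision and recall from detection counts.
    True negatives are never needed, so the metrics remain meaningful
    even when non-call frames vastly outnumber call frames."""
    precision = n_true_pos / (n_true_pos + n_false_pos)  # valid detections / all detections
    recall = n_true_pos / (n_true_pos + n_false_neg)     # valid detections / all true calls
    return precision, recall
```

For example, 80 correctly extracted tonals, 24 spurious extractions, and 20 missed tonals give precision ≈ 0.769 and recall = 0.800.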
The application of particle filters to two tracking problems in passive acoustic monitoring are discussed. Specifically we describe algorithms for extracting the contours of delphinid whistles and the localization of vocalizing animals in three dimensions using a distributed sensor array. The work is focused on highlighting the potential of particle filters in the analysis of bioacoustic signals. The discussion is based on one particular form of particle filter: the sequential importance resampling filter.
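A sequential importance resampling (SIR) filter of the kind described can be sketched as follows for the whistle-contour case. The random-walk state model, Gaussian observation likelihood, and all parameter values are illustrative assumptions, not the authors' implementation; a real tracker would also model chirp rate and handle missing or multiple spectral peaks per frame.

```python
import numpy as np

def sir_track(observations, n_particles=500, step_std=50.0, obs_std=100.0, seed=0):
    """Track a whistle's frequency (Hz) from noisy spectral-peak
    measurements with a sequential importance resampling filter.
    State model: random-walk frequency; likelihood: Gaussian about each peak."""
    rng = np.random.default_rng(seed)
    # initialize particles around the first measurement
    particles = observations[0] + rng.normal(0, obs_std, n_particles)
    track = []
    for z in observations:
        # predict: propagate particles through the random-walk motion model
        particles = particles + rng.normal(0, step_std, n_particles)
        # update: weight each particle by the Gaussian observation likelihood
        w = np.exp(-0.5 * ((z - particles) / obs_std) ** 2)
        w /= w.sum()
        track.append(np.sum(w * particles))  # posterior-mean estimate
        # resample: draw particles in proportion to weight (the "R" in SIR)
        particles = rng.choice(particles, size=n_particles, p=w)
    return np.array(track)
```

Run on spectral peaks corrupted by measurement noise, the posterior-mean track lies closer to the true contour than the raw peaks do, which is the property exploited when extracting contours from noisy spectrograms.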
In its current stage of development PAMGUARD provides a powerful, flexible and easy to use program for real time acoustic detection and localisation of cetacean vocalisations that combines the functionality of several previous software products and, in many cases, extends them. Thus PAMGUARD is well positioned to provide the standard tool for PAM during mitigation operations and towed hydrophone surveys. The emphasis of development so far has been mainly on cetacean detection but the software is sufficiently flexible to be used for many other acoustic detection and localisation tasks. Perhaps of most fundamental importance is the programming environment that PAMGUARD offers to developers of new algorithms. The PAMGUARD API largely insulates algorithm developers from data handling tasks, making PAMGUARD an efficient development platform. It is this that promises to ensure PAMGUARD's future as a viable and evolving product as programmers choose it as an efficient environment in which to develop new PAM functionality. To date, PAMGUARD has primarily been developed to handle acoustic data. Many mitigation and survey applications combine both visual and acoustic data. The PAMGUARD API has been designed in such a way that it can be easily extended to handle visual data in the future. Clearly having both visual and acoustic data together within the same piece of software should greatly assist in the smooth running of both mitigation and survey applications. Results from field trials indicate that PAMGUARD can provide useful real time information on the locations of whales in the vicinity of a moving vessel. However, not all whales vocalise all of the time, so PAM cannot be considered as a 100% effective method for detecting cetaceans.
A stable platform from which to perform experiments at sea has been a long-sought goal of sea-going scientists. In order to perform fine-scale experiments on fluctuations of acoustic signals in the ocean the manned buoy FLIP has been built to provide the needed stable platform for such work. The 355-ft-long, 600-ton craft is towed to station in the horizontal position, and operates in the vertical position at a draft of 300 ft with its electronics laboratory about 30 ft above the water line. A description is given of its development, stability, and operating characteristics, including the unique flipping operation.
A new algorithm for the prediction, filtering, and smoothing of non-Gaussian nonlinear state space models is shown. The algorithm is based on a Monte Carlo method in which successive prediction, filtering (and subsequently smoothing), conditional probability density functions are approximated by many of their realizations. The particular contribution of this algorithm is that it can be applied to a broad class of nonlinear non-Gaussian higher dimensional state space models on the provision that the dimensions of the system noise and the observation noise are relatively low. Several numerical examples are shown.
Marine mammal vocalizations have always presented an intriguing topic for researchers not only because they provide an insight on their interaction, but also because they are a way for scientists to extract information on their location, number and various other parameters needed for their monitoring and tracking. In the past years field researchers have used submersible microphones to record underwater sounds in the hopes of being able to understand and label marine life. One of the emerging problems for both on site and off site researchers is the ability to detect and extract marine mammal vocalizations automatically and in real time given the copious amounts of existing recordings. In this paper, we focus on signal types that have a well-defined single frequency maximum and offer a method based on sine wave modeling and Bayesian inference that will automatically detect and extract such possible vocalizations belonging to marine mammals while minimizing human intervention. The procedure presented in this paper is based on global characteristics of these calls thus rendering it a species independent call detector/extractor.
Many real-world data analysis tasks involve estimating unknown quantities from some given observations. In most of these applications, prior knowledge about the phenomenon being modelled is available. This knowledge allows us to formulate Bayesian models, that is prior distributions for the unknown quantities and likelihood functions relating these quantities to the observations. Within this setting, all inference on the unknown quantities is based on the posterior distribution obtained from Bayes’ theorem. Often, the observations arrive sequentially in time and one is interested in performing inference on-line. It is therefore necessary to update the posterior distribution as data become available. Examples include tracking an aircraft using radar measurements, estimating a digital communications signal using noisy measurements, or estimating the volatility of financial instruments using stock market data. Computational simplicity in the form of not having to store all the data might also be an additional motivating factor for sequential methods.
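The on-line inference described here is the standard two-step Bayesian filtering recursion, which particle filters approximate with weighted samples: a prediction step propagates the posterior through the state-transition model, and a measurement update applies Bayes' theorem with the new observation.

```latex
\text{Prediction:}\quad
p(x_t \mid y_{1:t-1}) \;=\; \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid y_{1:t-1})\, dx_{t-1}

\text{Update:}\quad
p(x_t \mid y_{1:t}) \;=\;
\frac{p(y_t \mid x_t)\, p(x_t \mid y_{1:t-1})}
     {\int p(y_t \mid x_t)\, p(x_t \mid y_{1:t-1})\, dx_t}
```

When these integrals have no closed form (non-Gaussian, nonlinear models), sequential Monte Carlo methods replace the densities with particle approximations updated as each observation arrives.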