Bird Species Identification from Audio Data
Ching Seh Wu
Department of Computer Science
San Jose State University
San Jose, CA USA
ching-seh.wu@sjsu.edu
Sasanka Kosuru
Department of Computer Science
San Jose State University
San Jose, CA USA
sasanka.kosuru@sjsu.edu
Samaikya Tippareddy
Department of Computer Science
San Jose State University
San Jose, CA USA
samaikya.tippareddy@sjsu.edu
Abstract—Birds are an important indicator species for environmental changes, and identifying bird species can provide valuable insights into changes in their populations and habitats. This research focuses on identifying bird species using audio recordings from the BirdCLEF dataset on Kaggle. From these recordings, we extracted 26 acoustic features such as MFCCs, spectral centroid, and spectral bandwidth. These features were then used to train supervised machine learning models: decision trees, random forests, the Naive Bayes classifier, Support Vector Classifier (SVC), k-nearest neighbors (k-NN), and Stochastic Gradient Descent (SGD). Finally, we trained the models on the combined features from the original and noise-cleaned audio files, with feature selection performed via recursive feature elimination (RFE).
Index Terms—Audio classification, Machine learning, Supervised classification, Audio signal processing, Bird song.
I. INTRODUCTION
Human activity and climate change have significantly, and often negatively, transformed many natural habitats, leading to environmental turmoil. This transformation has created a series of challenges for different species, which require careful monitoring to assess the impacts and take necessary actions. In this context, birds are considered crucial indicators due to their sensitivity to environmental changes and the ease with which their populations can be monitored [1]. Changes in bird populations provide important insights into the condition of the environment and the problems that can arise. As a result, environmental scientists often use birds as essential instruments to monitor changes in natural habitats and predict future problems. Identifying bird species can further provide valuable insights into their migratory patterns and other behaviors, making it possible to anticipate the detrimental effects of environmental changes and develop mitigation strategies to minimize their impact on ecosystems.
The advantage of using bird audio rather than images for classification is that high-quality audio is easier and less expensive to capture. In contrast, capturing quality images or video can be challenging and expensive, requiring advanced equipment and technical expertise. However, manually identifying birds from their audio signals is a daunting task that demands significant resources and time. Traditional methods of bird classification have often relied on manual observation and identification, which can be time-consuming and labor-intensive. With the rapid advancement of machine learning algorithms, however, the classification of bird audio has become more accurate and efficient [2]. These algorithms can analyze large datasets and perform complex computations, which helps identify and track bird populations. The use of machine learning algorithms in bird classification allows for more automated and objective identification of species, without the need for manual intervention.
In this paper, we utilized a large dataset [3] of bird sounds obtained from the BirdCLEF competition on Kaggle and employed various Python libraries, including Librosa [4], SciPy [5], and TorchAudio [6], to preprocess the data. Our preprocessing steps involved cleaning the data to obtain mono-channel audio files and reducing noise with the Python Noisereduce library [7], which uses a spectral gating technique. We also applied a high-pass filter to remove unwanted background noise and obtain clean audio data. We then performed feature extraction to obtain features relevant for training our models, including Mel-frequency cepstral coefficients (MFCCs), amplitude envelope, energy, spectral centroid, spectral flux, and zero-crossing rate. These features were chosen for their potential to contribute to accurate classification.
To ensure that our models were robust and could handle a variety of data, we employed several supervised machine learning algorithms for audio classification: decision tree, random forest, Naive Bayes classifier, Support Vector Classifier (SVC), k-nearest neighbors (k-NN), and Stochastic Gradient Descent (SGD). Finally, we used recursive feature elimination (RFE) to select the most relevant features for our models and improve classification accuracy. By selecting only the most important features, we improved the accuracy of our models and reduced the risk of overfitting. Overall, our study aimed to develop an effective method for bird sound classification using machine learning techniques, and we believe our approach could be useful for researchers interested in analyzing bird sounds.
II. RELATED WORKS
The authors of [8] employ a noise separation and classification filter to obtain the required data from the sound. The Mel-frequency cepstral coefficient (MFCC), the most common feature in speech recognition systems, is extracted algorithmically. Different algorithms, namely Naïve Bayes, J4.8, and the multilayer perceptron, were then used to classify the species. In such approaches, a model is trained to predict the component sources from synthetic mixtures or aggregations created by adding up ground-truth sources. As the performance of the model depends on the degree of match between the training data and real-world audio, relying on this type of synthetic or aggregated training data is problematic, especially since accurately simulating the acoustic conditions and the distribution of sources is very challenging.
To overcome these shortcomings, [9] develops a completely unsupervised method called mixture invariant training (MixIT). In MixIT, existing mixtures are combined to construct training examples, and the model separates these examples into a variable number of latent sources such that the original mixtures can be approximated by remixing the separated sources. However, this approach was not specific to bird species classification. [10], on the other hand, applies the MixIT model to birdsong data. Precision improvements, along with a downstream multi-species bird classifier, were demonstrated across three independent datasets. Taking the maximum model activations across the separated channels and the original audio yielded the best classifier performance on these datasets.
[11] is a working note by Piczak et al. from BirdCLEF 2016. The focus of that paper was on evaluating single-label classifiers suitable for recognizing the main (foreground) species present in a recording. The audio files were converted to mono-channel format during preprocessing, from which mel-scaled power spectrograms were generated using the Librosa library. An ensemble model with three different network architectures was proposed, built with the Keras deep learning library. Each network converts the input into spectrogram segments and predicts which species is dominant; averaging the decisions across all segments of the input file yields the final prediction. This submission achieved a mean average precision of 41.2 percent for background species and 52.9 percent for foreground species. It did not, however, handle noise reduction during preprocessing, which was addressed in [12].
[12] proposed a solution that uses a visual representation of the audio as input to a CNN. The audio files are first converted to WAV format, then split into chunks and normalized. Only relevant information was extracted from these chunks by discarding any chunk that was not loud enough (below a threshold). Finally, spectrograms are created by converting the STFT output to images using a color map. In that paper, the CNN is trained in two stages: the first with colored spectrograms and the second with black-and-white spectrograms. The method employs a pre-trained MobileNet network designed for mobile devices; as a result, it has a small architecture and is quick to evaluate. However, when the CNN was trained with ten classes, the accuracy dropped by more than 40 percent. A more powerful pre-trained convolutional neural network, such as ResNet, may help achieve higher accuracies.
The authors of [13] used the sequence of syllables in bird sounds and compressed the variable-length sequence into a fixed-dimensional feature vector. Syllable pairs were used instead of single syllables to capture the temporal structure of the bird sounds, represented using Gaussian syllable prototypes. They applied nearest-neighbor classifiers to three different histogram representations, built from 10, 30, and 50 Gaussian syllable prototypes, achieving accuracies of 76, 79, and 80 percent respectively. Their approach handles the sparseness of the histograms and facilitates their comparison. The problem with the syllable approach, however, is that robust segmentation is hard to obtain at low signal-to-noise ratios, a limitation addressed in [14].
In [14], a probabilistic approach using a statistical manifold is proposed. First, meaningful features are extracted from the audio by treating the audio signals as time series of samples, converting them into spectrograms, and applying the Fourier transform to distinguish between the sounds in a frame based on frequency. Unlike [13], this paper uses histograms of frequencies instead of bird song syllables, since syllables depend heavily on accurate segmentation and are unsuitable for audio with a low signal-to-noise ratio. Also, rather than averaging all the frame-level features into a single fixed-length vector, the authors aggregated the frame-level features by representing the feature distribution as a histogram, observing that multi-modality is common in bird sounds and that averaging loses significant information. As a result, a d-dimensional feature vector is generated, and for classification a feature vector of frequencies for each histogram bin is given as input. The authors also used a 'codebook' approach to generate these histogram features for high-dimensional vectors such as MFCCs. Nearest-neighbor classifiers were used with statistical divergence measures, namely L1, Hellinger, and KL. The proposed model provides a simpler approach than using bird syllables and achieves competitive or better accuracy (85-90 percent) compared with benchmark SVMs.
III. DATASET DESCRIPTION
The dataset [3] used in this research was downloaded from Xeno-canto, an open-source website that hosts and shares a vast collection of wildlife sounds from around the world. The original dataset consisted of 153 different bird species with names ranging from A to M. However, we selected only certain species to ensure that our dataset is of higher quality and suitable for our research purposes.
First, we considered only audio files with a rating greater than three, which indicates acceptable audio quality. Second, we selected only species with more than a hundred audio samples, as needed for accurate classification. Finally, we included only audio files under twenty seconds in duration, as we observed that in longer files the vast majority of the content was extraneous, with only small fragments containing the relevant bird sounds. After applying these filters, we were left with a total of 30 bird species that met all the criteria. These selected species were used for further analysis and classification.
IV. METHODOLOGY
Fig. 1: Model training workflow
A. Data Preparation
The audio files in the dataset consisted of both mono and
stereo channel. The number of channels used to capture and
playback audio signals is the primary distinction between
stereo and mono. While stereo signals are recorded and played
back on two audio channels (the left and right channels), mono
signals are recorded and played back on a single audio channel
[15]. Thus, in order to make audio analysis easier, we use
mono, so that we would only have one channel to process.
Also, all the files were in mp3 format. Librosa library was used
to convert mp3 into WAV files. This is because, unlike MP3
file format which is lossy, WAV files are lossless, meaning
that WAV audio is a high-quality uncompressed file. These
are best suited to our scenario, given that audio files have a
lot of noise like traffic, human sounds, or other low-frequency
sounds. Librosa library also converts stereo into mono-channel
audio files. The dataset has been split into the train, and test
sets in a ratio of 80:20 with each of these sets having an equal
ratio of target classes.
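The following is a minimal sketch of this preparation step, assuming Librosa with an mp3-capable backend and the soundfile package for writing WAV output; the file paths and the placeholder feature matrix are illustrative, not values from the paper.

import numpy as np
import librosa
import soundfile as sf
from sklearn.model_selection import train_test_split

# Decode an mp3 to a waveform; mono=True collapses stereo to one
# channel, and sr=None keeps the file's native sampling rate.
signal, sr = librosa.load("recordings/XC12345.mp3", sr=None, mono=True)

# Save the decoded signal as a lossless WAV file.
sf.write("wav/XC12345.wav", signal, sr)

# Stratified 80:20 split: stratify=y keeps the ratio of target classes
# equal in both sets. X and y stand in for the extracted feature matrix
# and species labels produced later in the pipeline.
X = np.random.rand(300, 26)        # placeholder feature matrix
y = np.repeat(np.arange(30), 10)   # placeholder labels, 30 classes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)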
B. Noise reduction
We used a python Noise Reduce library [7] that uses
spectral gating to reduce the background noise. In spectral
gating, a spectrogram of an audio signal is computed, noise
threshold is estimated to compute a mask, which is later
used for noise gating. After applying this, we also applied
another high-pass filter [5] to the data for further reduction in
noise. The assumption here to apply a high-pass filter in this
context is based on the fact that bird sounds are predominantly
high-frequency in nature. By employing this filter, we can
effectively isolate and extract these sounds from the low-
frequency ambient noise that may be present, such as traffic
and human sounds.
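A sketch of this two-stage cleaning is shown below: spectral gating via the noisereduce library, followed by a SciPy Butterworth high-pass filter. The 2 kHz cutoff and filter order are illustrative assumptions; the paper does not state these values.

import librosa
import noisereduce as nr
from scipy.signal import butter, filtfilt

signal, sr = librosa.load("wav/XC12345.wav", sr=None)

# Stage 1: spectral gating -- noisereduce estimates a per-band noise
# threshold from the signal's spectrogram and masks energy below it.
cleaned = nr.reduce_noise(y=signal, sr=sr)

# Stage 2: high-pass filter, since bird vocalizations are mostly
# high-frequency while traffic/human noise sits in the low bands.
b, a = butter(N=5, Wn=2000, btype="highpass", fs=sr)
cleaned = filtfilt(b, a, cleaned)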
Fig. 2 shows the audio signals plotted in blue, orange, green, and red. Fig. 2a shows the original signal, which contains noise, in blue. After applying a high-pass filter to the original audio, we obtained the signal shown in orange in Fig. 2b. As the high-pass filter alone did not reduce the noise significantly, we applied the Noisereduce Python library instead; Fig. 2c (green) shows that the noise was reduced substantially. On top of the Noisereduce output, we again applied a high-pass filter to eliminate remaining noise; the final signal is shown in red in Fig. 2d.
C. Feature Extraction
For numeric data, a total of twenty-six acoustic features
were extracted. These features are as follows: Short Term
Fourier Transform (STFT), Mel Frequency Cepstral Coeffi-
cients (MFCC), Root Mean Square (RMS), Spectral centroid,
Spectral bandwidth, Spectral roll-off, and Zero crossing rate.
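One plausible way to assemble such a 26-dimensional vector with Librosa is sketched below (20 MFCC means plus six spectral summaries); the exact composition of the paper's 26 features is an assumption, and chroma_stft is used here as a scalar stand-in for the listed "STFT" feature.

import numpy as np
import librosa

def extract_features(signal, sr, n_mfcc=20):
    # Collapse each time-varying feature to its mean over frames.
    features = [
        np.mean(librosa.feature.chroma_stft(y=signal, sr=sr)),  # "STFT"
        np.mean(librosa.feature.rms(y=signal)),
        np.mean(librosa.feature.spectral_centroid(y=signal, sr=sr)),
        np.mean(librosa.feature.spectral_bandwidth(y=signal, sr=sr)),
        np.mean(librosa.feature.spectral_rolloff(y=signal, sr=sr)),
        np.mean(librosa.feature.zero_crossing_rate(signal)),
    ]
    # 20 MFCCs, each averaged over time, bring the total to 26.
    mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    features.extend(np.mean(mfccs, axis=1))
    return np.array(features)  # shape: (26,)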
D. Normalization
The extracted features we wanted to use for modeling had
different scales. Uneven feature scales can cause issues when
training machine learning models as dominant features with
much larger values may lead to biased models that perform
poorly on other features. So, we used the StandardScaler
normalization to bring the features to a similar scale and
improve the accuracy and convergence of the models.
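A minimal sketch of this step, continuing from the split above: the scaler is fit on the training data only and the same transform is applied to the test set, to avoid leaking test statistics into training.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()                      # zero mean, unit variance
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)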
E. Classification models
In order to perform classification on the extracted dataset,
we wrote a pipeline method that runs all the following super-
vised models used on the training dataset of 30 classes: K-
nearest neighbor (KNN), Stochastic Gradient Descent (SGD),
Support Vector Classifier (SVC), Decision Trees, Random
Forest, Gaussian Naive Bayes Classifier.
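A minimal version of such a pipeline is sketched below: each scikit-learn classifier is trained on the same scaled features and scored on the held-out test set. The hyperparameters shown are library defaults, not values stated in the paper.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, f1_score

models = {
    "KNN": KNeighborsClassifier(),
    "SGD": SGDClassifier(random_state=42),
    "SVC": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gaussian NB": GaussianNB(),
}

# Fit each model and report accuracy and macro F1 (the paper's metrics).
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    preds = model.predict(X_test_scaled)
    print(f"{name}: acc={accuracy_score(y_test, preds):.2f}, "
          f"macro F1={f1_score(y_test, preds, average='macro'):.2f}")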
F. Feature Selection
During the analysis of the feature extraction, we were
interested in finding which set of features would result in
the highest level of classification accuracy. This is impor-
tant because feature selection can help improve classification
performance by selecting the most relevant features, and also
reduce classification time by eliminating unnecessary features.
Fig. 2: Comparison of various noise reduction techniques: (a) original audio; (b) cleaned with high-pass filter; (c) cleaned with Noisereduce library (spectral gating); (d) cleaned with Noisereduce library (spectral gating) and high-pass filter.
We used recursive feature elimination (RFE), a feature selection algorithm that reduces model complexity by eliminating less significant features. RFE works by iteratively removing the weakest feature(s) until the desired number of features is reached.
To determine the optimal number of features, we performed cross-validation along with RFE to score different subsets of features and select the subset with the highest score. This process identified the feature set yielding the highest classification accuracy. After performing RFE, we reduced the number of features from 52 (26 from the original audio plus 26 from the cleaned audio) to 41 for 3 classes and 37 for 5 classes. By reducing the number of features, we simplified the model and improved its performance.
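A sketch of RFE with cross-validation using scikit-learn's RFECV is shown below; it both ranks features and picks the subset size with the best cross-validated score. The scoring metric and estimator settings are assumptions, though SGD matches the model used in our later experiments.

from sklearn.feature_selection import RFECV
from sklearn.linear_model import SGDClassifier

# SGDClassifier exposes coef_, which RFE uses to rank feature weights;
# cv=5 matches the 5-fold evaluation used in Section V.
selector = RFECV(SGDClassifier(random_state=42), step=1, cv=5,
                 scoring="f1_macro")
selector.fit(X_train_scaled, y_train)

print("optimal number of features:", selector.n_features_)
X_train_sel = selector.transform(X_train_scaled)
X_test_sel = selector.transform(X_test_scaled)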
V. RESULTS, ANALYSIS, AND COMPARISONS
Three experiments were conducted in this research. The first used the basic machine learning models to classify the birds, considering three different sets of bird classes: 3, 5, and 30 classes. The 30 classes were selected based on the constraints described in Section III, and from these 30 classes we randomly selected subsets of 3 and 5 classes. The results of this experiment are given in Table I. We observed that accuracy and F1 scores decreased as the number of bird classes increased. This is because different species had audio samples of varying quality and a limited number of audio files, resulting in imbalanced data; in machine learning algorithms, prediction accuracy generally decreases as the degree of imbalance increases. The best results in this experiment were obtained with Stochastic Gradient Descent (SGD) trained on 3 bird species.
Fig. 3: Comparison of (a) accuracy and (b) F1 scores of models trained without feature selection.
In the second experiment, we compared our Stochastic Gradient Descent model trained on the combined features of both the original and the noise-reduced audio against models that did not use this combination. The model trained on the combined data outperformed the models trained on only the original features or only the noise-reduced features. Table II gives the detailed metrics for this experiment: the macro F1 score with combined features was 0.87, while the model trained on the noisy audio scored 0.79 and the noise-reduced-only model scored 0.72.
TABLE I: Macro F1 and accuracy for 3, 5, and 30 classes without feature selection

             SGD    SVM    KNN    Decision Tree
3 Class
  Macro F1   0.87   0.86   0.65   0.64
  Accuracy   0.88   0.86   0.71   0.66
5 Class
  Macro F1   0.70   0.73   0.58   0.49
  Accuracy   0.70   0.74   0.61   0.51
30 Class
  Macro F1   0.21   0.18   0.13   0.11
  Accuracy   0.30   0.36   0.28   0.17
The third experiment trained a Stochastic Gradient Descent model with feature selection on 3 and 5 bird classes. We observed a significant increase in accuracy and F1 scores after feature selection with the recursive feature elimination technique: RFE selected 41 features for 3 classes and 37 features for 5 classes. Table III shows the accuracies and F1 scores obtained for Stochastic Gradient Descent trained on 3 and 5 bird species. All models were evaluated using 5-fold cross-validation.
TABLE II: Comparison of Stochastic Gradient Descent models trained on features with noise, without noise, and both

             With & without noise   With noise   Without noise
Macro F1     0.87                   0.79         0.72
Accuracy     0.88                   0.79         0.76
TABLE III: Macro F1 and accuracy for 3 and 5 classes using RFE with Stochastic Gradient Descent

                     3 Class   5 Class
Features selected    41        37
Macro F1             0.89      0.88
Accuracy             0.90      0.89
VI. CONCLUSION AND FUTURE WORK
In conclusion, accuracies decreased as the number of classes increased, because different species had audio data of varying quality and limited samples. Our research focused solely on extracted numerical features such as the spectral centroid, zero-crossing rate, and MFCCs. The models were best suited to a balanced dataset in which each class had a similar number of audio files.
In future work, the Mel-spectrogram features of the original and cleaned audio (using the high-pass filter and Noisereduce library) could be plotted as images and used for image classification with Convolutional Neural Networks (CNNs) or other deep learning models. Furthermore, for classification using numeric data, the four important components of bird audio, namely notes, syllables, phrases, and songs, could be experimented with individually or in combination to see whether they yield better results. To handle the data imbalance across species, techniques beyond undersampling and oversampling need to be explored to improve accuracy with a higher number of classes.
REFERENCES
[1] S. Mekonen, "Birds as biodiversity and environmental indicator." [Online]. Available: https://core.ac.uk/reader/234657570
[2] D. Stowell, M. Wood, Y. Stylianou, and H. Glotin, "Bird detection in audio: a survey and a challenge," Aug 2016.
[3] Vopani, "Xeno-canto bird recordings extended (A-M)," Sep 2020. [Online]. Available: https://www.kaggle.com/datasets/rohanrao/xeno-canto-bird-recordings-extended-a-m
[4] B. McFee, C. Raffel, D. Liang, D. P. W. Ellis, M. McVicar, E. Battenberg, and O. Nieto, "librosa: Audio and music signal analysis in Python," in Proceedings of the 14th Python in Science Conference (SciPy), 2015.
[5] P. Virtanen et al., "SciPy 1.0: Fundamental algorithms for scientific computing in Python," Nature Methods, vol. 17, no. 3, pp. 261–272, 2020.
[6] Y.-Y. Yang, M. Hira, Z. Ni, A. Astafurov, C. Chen, C. Puhrsch, D. Pollack, D. Genzel, D. Greenberg, E. Z. Yang, J. Lian, J. Hwang, J. Chen, P. Goldsborough, S. Narenthiran, S. Watanabe, S. Chintala, and V. Quenneville-Bélair, "TorchAudio: Building blocks for audio and speech processing," in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 6982–6986.
[7] T. Sainburg, "timsainb/noisereduce: v1.0," Jun 2019. [Online]. Available: https://zenodo.org/record/3243139
[8] A. E. Mehyadin, A. M. Abdulazeez, D. A. Hasan, and J. N. Saeed, "Birds sound classification based on machine learning algorithms," Asian Journal of Research in Computer Science, pp. 1–11, 2021.
[9] S. Wisdom, E. Tzinis, H. Erdogan, R. J. Weiss, K. Wilson, and J. R. Hershey, "Unsupervised sound separation using mixture invariant training," Oct 2020. [Online]. Available: https://doi.org/10.48550/arXiv.2006.12701
[10] T. Denton, S. Wisdom, and J. R. Hershey, "Improving bird classification with unsupervised sound separation," Oct 2021. [Online]. Available: https://doi.org/10.48550/arXiv.2110.03209
[11] K. J. Piczak, "Recognizing bird species in audio recordings using deep convolutional neural networks," 2016. [Online]. Available: http://ceur-ws.org/Vol-1609/16090534.pdf
[12] A. Incze, H.-B. Jancso, Z. Szilagyi, A. Farkas, and C. Sulyok, "Bird sound recognition using a convolutional neural network," in 2018 IEEE 16th International Symposium on Intelligent Systems and Informatics (SISY), 2018.
[13] P. Somervuo and A. Harma, "Bird song recognition based on syllable pair histograms," in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2004.
[14] F. Briggs, R. Raich, and X. Z. Fern, "Audio classification of bird species: A statistical manifold approach," in 2009 Ninth IEEE International Conference on Data Mining, 2009.
[15] Arthur, "Is stereo or mono audio better? (applications for both)," Jan 2022. [Online]. Available: https://mynewmicrophone.com/is-stereo-or-mono-audio-better-applications-for-both/