Bird Species Identification from Audio Data
Ching Seh Wu
Department of Computer Science
San Jose State University
San Jose, CA USA
ching-seh.wu@sjsu.edu
Sasanka Kosuru
Department of Computer Science
San Jose State University
San Jose, CA USA
sasanka.kosuru@sjsu.edu
Samaikya Tippareddy
Department of Computer Science
San Jose State University
San Jose, CA USA
samaikya.tippareddy@sjsu.edu
Abstract—Birds are an important indicator species for environmental change, and identifying bird species can provide valuable insights into changes in their populations and habitats. This research focuses on identifying bird species from audio recordings in the BirdCLEF dataset on Kaggle. From these recordings, we extracted 26 acoustic features, including MFCCs, spectral centroid, and spectral bandwidth. These features were then used to train supervised machine learning models: Decision Tree, Random Forest, Naive Bayes, Support Vector Classifier (SVC), k-Nearest Neighbor (k-NN), and Stochastic Gradient Descent (SGD). Finally, we trained the models on the combined features of the original and noise-reduced audio and applied recursive feature elimination (RFE) for feature selection.
Index Terms—Audio classification, Machine learning, Super-
vised classification, Audio signal processing, Bird Song.
I. INTRODUCTION
Human activity and climate change have contributed to the
significant and often negative transformation of many natural
habitats, leading to environmental turmoil. This transformation
has resulted in a series of challenges for different species,
which require careful monitoring to assess the impacts and
take necessary actions. In this context, birds are considered
crucial indicators due to their sensitivity to environmental
changes and the ease with which their populations can be
monitored [1]. Changes in bird populations provide important insights into the condition of the environment and the potential problems that may arise. Environmental scientists therefore often use birds as indicators to monitor changes in natural habitats and predict future problems. Identifying bird species can further provide valuable insights into their migratory patterns and other behaviors, making it possible to anticipate the detrimental effects of environmental changes and develop mitigation strategies to minimize their impact on ecosystems.
The advantage of using bird audio instead of images for
classification is that it is easier and less expensive to capture
high-quality audio. In contrast, capturing quality images or
video can be challenging and expensive, requiring advanced
equipment and technical expertise. However, manually iden-
tifying birds based on their audio signals can be a daunting
task, requiring significant resources and time. The traditional
methods of bird classification have often relied on manual
observation and identification, which can be time-consuming
and labor-intensive. With the rapid advancements in machine
learning algorithms, however, the classification of bird audio
has become more accurate and efficient [2]. These algorithms
can analyze large datasets and perform complex computations,
which can help identify and track bird populations. The use
of machine learning algorithms in bird classification allows
for more automated and objective identification of species,
without the need for manual intervention.
In this paper, we utilized a large dataset [3] of bird
sounds obtained from the BirdCLEF competition on Kaggle,
and employed various Python libraries, including Librosa
[4], Scipy [5], and TorchAudio [6], to preprocess the data.
Our preprocessing steps involved cleaning the data to obtain
mono-channel audio files by reducing noise using a Python
Noisereduce [7] library which uses a spectral gating technique.
We also applied a high-pass filter to remove any unwanted
background noise and obtain clean audio data. We then per-
formed feature extraction to obtain relevant features that could
be used for training our models. Some of the features we ex-
tracted include Mel-frequency cepstral coefficients (MFCCs),
amplitude envelope, energy, spectral centroid, spectral flux,
and zero-crossing rate. These features were chosen based on
their potential to contribute to accurate classification.
To ensure that our models were robust and could handle
a variety of data, we employed various supervised machine
learning algorithms, including decision tree, random forest,
Naive Bayes classifier, Support Vector Classifier (SVC), k-
nearest neighbor (K-NN), and Stochastic Gradient Descent
(SGD), for audio classification. Finally, we used recursive feature elimination (RFE) to select only the most relevant features, which improved the accuracy of our models and reduced the risk of overfitting. Overall, our study aimed to develop an effective
method for bird sound classification using machine learning
techniques, and we believe that our approach could be useful
for researchers interested in analyzing bird sounds.
II. RELATED WORKS
[8] employs a noise separation and classification filter to extract the required data from the sound. Mel-frequency cepstral coefficients (MFCCs), the most common features in speech recognition systems, are then computed. Different algorithms, namely Naive Bayes, J4.8, and Multilayer Perceptron, were then used to classify the species. In these
approaches, a model is trained to predict the component
sources from synthetic mixtures or aggregations created by
adding up ground-truth sources. As the performance of the
model depends on the degree of match between the training
data and real-world audio, relying on this type of synthetic or
aggregated training data is problematic, especially since the
accurate simulation of the acoustic conditions and classifying
source distribution is very challenging.
To combat the above shortcomings, [9] focuses on develop-
ing a completely unsupervised method called mixture invariant
training (MixIT). In MixIT, existing mixtures are combined
to construct training examples, and the model separates these
examples into a variable number of latent sources, such that
original mixtures can be approximated by remixing these
separated sources. However, this approach was not specific to bird species classification. [10], on the other hand, focuses on
using the MixIT model for birdsong data. Precision improvements with a downstream multi-species bird classifier were demonstrated across three independent datasets. Taking the
maximum model activations across the separated channels and
original audio yielded the best classifier performance for these
datasets.
[11] is a working note by Piczak from BirdCLEF 2016. The focus of this paper was on evaluating single-
label classifiers suitable for recognizing the main (foreground)
species present in the recording. The audio files were con-
verted to mono-channel format during preprocessing, from
which mel-scaled power spectrograms were generated using
the Librosa library. An ensemble model with three different
network architectures was proposed in this work. The Keras
Deep Learning library was used to build the three networks.
Each of these networks converts the input into spectrogram
segments and predicts which species will be dominant. Av-
eraging the decisions made across all segments of the input
file yields the final prediction. This submission had a mean
average precision of 41.2 percent for background species and
52.9 percent for foreground species. It did not, however, handle
noise reduction during preprocessing, which was addressed in
[12].
[12] proposed a solution that uses a visual representation
of the audio as an input to the CNN. The audio files are first
converted to WAV format, which is then split into chunks
and normalized. Only relevant information was retained by discarding any chunk that was not loud enough (below a threshold). Finally, spectrograms are created by converting the STFT output to an image using a color map. In this work, the CNN is trained in two stages: first with colored spectrograms and then with black-and-white spectrograms. The method described in this paper
employs a pre-trained MobileNet network designed for mobile
devices. As a result, it has a small architecture and is quick to
evaluate. However, when the CNN was trained with ten classes, the accuracy dropped by more than 40 percent. A more reliable
pre-trained convolutional neural network, such as ResNet, may
aid in achieving higher accuracies.
The authors in [13] used the sequence of syllables in bird sounds and compressed the variable-length sequence into a fixed-dimensional feature vector. Syllable pairs were used instead of
single syllables to understand the temporal structure of the bird
sounds and represented using Gaussian syllable prototypes.
They used nearest-neighbor classifiers on three different histogram representations, built from 10, 30, and 50 Gaussian syllable prototypes, achieving accuracies of 76, 79, and 80 percent, respectively. This approach handles the sparseness of the histograms and facilitates their comparison. However, a problem with the syllable approach is that robust segmentation is hard to obtain at a low signal-to-noise ratio, an issue addressed in [14].
In [14], a probabilistic approach based on statistical manifolds is proposed. First, meaningful features are extracted by treating the audio signals as time series of samples, converting them into spectrograms, and applying the Fourier transform to distinguish the sounds in a frame by frequency. Unlike [13], this paper uses histograms of frequencies rather than bird-song syllables, since syllables depend heavily on accurate segmentation and are not suitable for audio with a low signal-to-noise ratio. Also, rather than averaging all frame-level features into a single fixed-length vector, the authors aggregate them by representing the feature distribution as a histogram, observing that multi-modality is common in bird sounds and that averaging loses significant information. The result is a d-dimensional feature vector, and for classification a vector of frequencies for each histogram bin is given as input. The authors also used a 'codebook' approach to generate these histogram features for high-dimensional representations such as MFCCs. Nearest-neighbor classifiers were used with statistical divergence measures, namely L1, Hellinger, and KL. The proposed model is simpler than the syllable-based approach and achieves competitive or better accuracy (85-90 percent) compared with the benchmark SVMs.
III. DATASET DESCRIPTION
The dataset [3] used in this research is downloaded from
Xeno-canto, an open-source website that hosts and shares a
vast collection of wildlife sounds from around the world. The
original dataset consisted of 153 different bird species, with names ranging from A to M. However, we only selected certain species to ensure that our dataset was of higher quality and suitable for our research purposes.
Firstly, we only considered audio files with a rating greater than three, which indicates acceptable audio quality. Secondly, we only kept species with more than one hundred audio samples, so that there was enough data for accurate classification. Finally, we included only audio files shorter than twenty seconds, as we observed that, in longer recordings, the vast majority of the content was extraneous, with only small fragments containing the relevant bird sounds. After applying these filters, we were left with a total of 30 bird species that met all the criteria. These selected species were considered for further analysis and classification.
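As a rough illustration, this filtering can be expressed with pandas over the dataset's metadata file; the file name (train_extended.csv) and column names (rating, duration, ebird_code) follow the layout of the Kaggle metadata and are assumptions here, not code from our pipeline.

```python
import pandas as pd

# Hypothetical metadata file and column names; adjust to the actual CSV layout.
meta = pd.read_csv("train_extended.csv")

# Keep recordings rated above 3 and shorter than 20 seconds.
meta = meta[(meta["rating"] > 3) & (meta["duration"] < 20)]

# Keep only species that still have more than 100 recordings.
counts = meta["ebird_code"].value_counts()
meta = meta[meta["ebird_code"].isin(counts[counts > 100].index)]

print(meta["ebird_code"].nunique(), "species retained")  # 30 in our case
```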
IV. METHODOLOGY
Fig. 1: Model training workflow
A. Data Preparation
The audio files in the dataset consisted of both mono and stereo channels. The number of channels used to capture and play back audio signals is the primary distinction between stereo and mono: stereo signals are recorded and played back on two audio channels (left and right), while mono signals use a single audio channel [15]. To simplify the audio analysis, we used mono, so that only one channel had to be processed. All the files were originally in MP3 format, and the Librosa library was used to convert them into WAV files. Unlike the lossy MP3 format, WAV files are lossless and uncompressed, which suits our scenario, given that the recordings contain a lot of noise such as traffic, human sounds, and other low-frequency sounds. Librosa also converts stereo files into mono-channel audio. The dataset was split into train and test sets in an 80:20 ratio, with each set having an equal ratio of target classes.
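A minimal sketch of this preparation step is shown below, assuming the recordings live in local directories; librosa decodes each MP3 to a mono floating-point signal, soundfile writes the WAV file, and scikit-learn's stratified split produces the 80:20 partition.

```python
import librosa
import soundfile as sf
from sklearn.model_selection import train_test_split

def mp3_to_mono_wav(mp3_path, wav_path, sr=22050):
    # librosa decodes the MP3 as a single-channel float signal at rate `sr`
    y, _ = librosa.load(mp3_path, sr=sr, mono=True)
    sf.write(wav_path, y, sr)  # lossless, uncompressed WAV

# Example (hypothetical paths):
# mp3_to_mono_wav("mp3/amecro/XC101.mp3", "wav/amecro/XC101.wav")

# Once per-recording feature rows X and species labels y exist, a stratified
# split keeps the class ratio equal in the train and test sets:
# X_train, X_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, stratify=y, random_state=42)
```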
B. Noise reduction
We used the Python noisereduce library [7], which applies spectral gating to reduce background noise. In spectral gating, a spectrogram of the audio signal is computed, a noise threshold is estimated for each frequency band, and a mask derived from this threshold gates out the noise. After this step, we applied a high-pass filter [5] to further reduce noise. The rationale for a high-pass filter in this context is that bird sounds are predominantly high-frequency in nature, so the filter effectively isolates them from low-frequency ambient noise such as traffic and human sounds.
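The following sketch combines the two steps, assuming a 2 kHz cutoff for the high-pass filter (an illustrative value, not one reported in this paper); the noisereduce keyword arguments follow the library's current API, which differs slightly from the v1.0 release cited in [7].

```python
import librosa
import noisereduce as nr
from scipy.signal import butter, filtfilt

def clean_audio(wav_path, cutoff_hz=2000, sr=22050):
    """Spectral gating followed by a high-pass filter (illustrative sketch)."""
    y, sr = librosa.load(wav_path, sr=sr, mono=True)

    # 1) Spectral gating: estimate a per-band noise threshold and mask
    #    energy that falls below it.
    y_denoised = nr.reduce_noise(y=y, sr=sr)

    # 2) High-pass Butterworth filter: bird calls are mostly high-frequency,
    #    so low-frequency ambient noise (traffic, speech) is attenuated.
    b, a = butter(N=4, Wn=cutoff_hz, btype="highpass", fs=sr)
    return filtfilt(b, a, y_denoised), sr
```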
Fig. 2 shows different audio signals plotted in blue, orange,
green, and red. Fig. 2a shows the original signal that contains
noise and is represented in blue. After applying a high-pass
filter to the original audio, we obtained the audio signal in
orange as shown in Fig. 2b. As the noise was not reduced
significantly using only the high-pass filter, we tried using the
Noise Reduce Python library instead. Fig. 2c shows that the
noise reduced significantly, and the audio signal is represented
in green. On top of the Noise Reduce library, we again used a
high-pass filter to further eliminate noise, and the final audio
signal is represented in red in Fig. 2d.
C. Feature Extraction
For the numeric data, a total of twenty-six acoustic features were extracted. These features include the Short-Time Fourier Transform (STFT), Mel-Frequency Cepstral Coefficients (MFCCs), Root Mean Square (RMS) energy, spectral centroid, spectral bandwidth, spectral roll-off, and zero-crossing rate.
D. Normalization
The extracted features we wanted to use for modeling had
different scales. Uneven feature scales can cause issues when
training machine learning models as dominant features with
much larger values may lead to biased models that perform
poorly on other features. So, we applied StandardScaler normalization to bring all the features to a similar scale and improve the accuracy and convergence of the models.
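A minimal sketch, assuming X_train and X_test are the feature matrices produced by the 80:20 split described above:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics
```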
E. Classification models
In order to perform classification on the extracted dataset, we wrote a pipeline method that runs all of the following supervised models on the training dataset of 30 classes: k-nearest neighbor (KNN), Stochastic Gradient Descent (SGD), Support Vector Classifier (SVC), Decision Tree, Random Forest, and Gaussian Naive Bayes.
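A sketch of such a pipeline method is given below; the hyperparameters are scikit-learn defaults, which is an assumption on our part rather than the exact configuration used in the experiments.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, f1_score

MODELS = {
    "KNN": KNeighborsClassifier(),
    "SGD": SGDClassifier(random_state=42),
    "SVC": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
}

def run_pipeline(X_train, y_train, X_test, y_test):
    """Train every model and report accuracy and macro F1."""
    results = {}
    for name, model in MODELS.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        results[name] = (accuracy_score(y_test, pred),
                         f1_score(y_test, pred, average="macro"))
    return results
```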
F. Feature Selection
During the analysis of the feature extraction, we were
interested in finding which set of features would result in
the highest level of classification accuracy. This is impor-
tant because feature selection can help improve classification
performance by selecting the most relevant features, and also
reduce classification time by eliminating unnecessary features.
Fig. 2: Comparison of various noise reduction techniques: (a) original audio, (b) cleaned with a high-pass filter, (c) cleaned with the noisereduce library (spectral gating), (d) cleaned with the noisereduce library (spectral gating) and a high-pass filter.
We used recursive feature elimination (RFE), a feature selec-
tion algorithm that reduces model complexity by eliminating
less significant features. RFE works by iteratively removing
the weakest feature(s) until the desired number of features is
reached.
To determine the optimum number of features needed, we
performed cross-validation along with RFE to score different
subsets of features and select the subset with the highest score.
This process helped us identify the feature set that would result
in the highest classification accuracy. After performing RFE,
we were able to reduce the number of features from 52 to 41
for 3 classes and 37 for 5 classes. By reducing the number of
features, we were able to simplify the model and improve its
performance.
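One way to realize this with scikit-learn is RFECV, which wraps RFE in cross-validation and keeps the subset size that scores best; the sketch below uses SGDClassifier as the ranking estimator because it exposes coefficient weights, an assumption about the setup rather than the exact code used.

```python
from sklearn.feature_selection import RFECV
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold

selector = RFECV(
    estimator=SGDClassifier(random_state=42),  # coef_ is used to rank features
    step=1,                                    # drop one feature per iteration
    cv=StratifiedKFold(n_splits=5),
    scoring="f1_macro",
)
# X_train_scaled / y_train: the normalized features and labels from above.
selector.fit(X_train_scaled, y_train)
print("Optimal number of features:", selector.n_features_)
X_train_sel = selector.transform(X_train_scaled)
X_test_sel = selector.transform(X_test_scaled)
```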
V. RESULTS, ANALYSIS, AND COMPARISONS
Three experiments were conducted in this research. The first experiment used basic machine learning models to classify the birds. Here, we considered three different sets of bird classes: 3, 5, and 30 classes. The 30 classes were selected based on the constraints described in Section III, and from these 30 classes we randomly selected subsets of 3 and 5 classes. The results of this experiment are given in Table I. We observed that the accuracy and F1 scores decreased as the number of bird classes increased. This is because different species had audio samples of varying quality and a limited number of audio files, resulting in data imbalance. In machine learning algorithms, prediction accuracy generally decreases as the degree of imbalance increases. The best results in this experiment were obtained with Stochastic Gradient Descent (SGD) when trained on 3 bird species.
(a) Accuracy scores comparison
(b) F1 scores comparison
Fig. 3: Comparison of Accuracy and F1 scores of models
trained without feature selection
In the second experiment, we compared our Stochastic Gradient Descent model trained on the combined features of the original and noise-reduced audio against models that did not use this combination. We observed that the model trained with the combined features performed better than the models trained on only the original features or only the noise-reduced features. Table II gives detailed metrics for this experiment. The macro F1 score with the combined features was 0.87, while the model trained on the original (noisy) audio scored 0.79 and the model trained on the noise-reduced audio scored 0.72.
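Combining the two feature sets amounts to a simple per-recording concatenation; in the sketch below, feats_original and feats_cleaned are hypothetical names for the 26-dimensional vectors extracted from the raw and the noise-reduced audio (Section IV-C), giving the 52-dimensional combined vector.

```python
import numpy as np

# 26 features from the raw audio + 26 from the cleaned audio = 52 features.
combined = np.hstack([feats_original, feats_cleaned])
```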
TABLE I: Macro F1 and Accuracy for 3, 5, and 30 classes without feature selection

              SGD    SVM    KNN    Decision Tree
3 Class
  Macro F1    0.87   0.86   0.65   0.64
  Accuracy    0.88   0.86   0.71   0.66
5 Class
  Macro F1    0.70   0.73   0.58   0.49
  Accuracy    0.70   0.74   0.61   0.51
30 Class
  Macro F1    0.21   0.18   0.13   0.11
  Accuracy    0.30   0.36   0.28   0.17
The third experiment was conducted by training a Stochastic Gradient Descent model with feature selection on 3 and 5 bird classes. We observed a significant increase in accuracy and F1 scores after feature selection using the recursive feature elimination technique. Using RFE, a total of 41 features were selected for 3 classes and 37 for 5 classes. Table III shows the accuracies and F1 scores obtained for Stochastic Gradient Descent using 3 and 5 bird species for training. All the models were evaluated using 5-fold cross-validation.
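The evaluation step can be sketched as follows, where X_sel and y stand for the RFE-reduced feature matrix and the species labels (names chosen here for illustration):

```python
from sklearn.model_selection import cross_validate
from sklearn.linear_model import SGDClassifier

scores = cross_validate(SGDClassifier(random_state=42), X_sel, y,
                        cv=5, scoring=["accuracy", "f1_macro"])
print("Accuracy:", scores["test_accuracy"].mean())
print("Macro F1:", scores["test_f1_macro"].mean())
```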
TABLE II: Comparison of Stochastic Gradient Descent trained on features with noise, without noise, and both

              with & without noise   with noise   without noise
Macro F1      0.87                   0.79         0.72
Accuracy      0.88                   0.79         0.76
TABLE III: Macro F1 and Accuracy for 3 and 5 classes using RFE on Stochastic Gradient Descent

                     3 Class   5 Class
Features selected    41        37
Macro F1             0.89      0.88
Accuracy             0.90      0.89
VI. CONCLUSION AND FUTURE WORK
In conclusion, accuracies decreased as the number of classes increased. This is because different species had audio data of varying quality and a limited number of samples. Our research focused only on numeric features extracted from the audio, such as the spectral centroid, zero-crossing rate, and MFCCs. The models were best suited to a balanced dataset, in which each class has a similar number of audio files.
In future work, the mel-spectrogram features of the original and cleaned audio (using the high-pass filter and the noisereduce library) can be plotted as images and used for image classification with Convolutional Neural Networks (CNNs) or other deep learning models. Furthermore, for classification using numeric data, the four important structural elements of bird audio, i.e., notes, syllables, phrases, and songs, can be experimented with individually or in combination to see whether better results can be obtained. To handle the data imbalance across species, better techniques than simple undersampling or oversampling need to be explored to improve accuracy when using a higher number of classes.
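As a starting point for that direction, a mel-spectrogram image could be generated per recording along the following lines (a sketch of the idea only, not part of the experiments reported here):

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

def save_mel_spectrogram(wav_path, out_png, sr=22050):
    """Render a mel-spectrogram as an image for a future CNN pipeline."""
    y, sr = librosa.load(wav_path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    fig, ax = plt.subplots(figsize=(4, 4))
    librosa.display.specshow(mel_db, sr=sr, ax=ax)
    ax.set_axis_off()
    fig.savefig(out_png, bbox_inches="tight", pad_inches=0)
    plt.close(fig)
```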
REFERENCES
[1] S. Mekonen, “Birds as biodiversity and environmental indicator.”
[Online]. Available: https://core.ac.uk/reader/234657570
[2] D. Stowell, M. Wood, Y. Stylianou, and H. Glotin, “Bird detection in audio: a survey and a challenge,” Aug. 2016.
[3] Vopani, “Xeno-canto bird recordings extended (a-m),” Sep 2020.
[Online]. Available: https://www.kaggle.com/datasets/rohanrao/xeno-
canto-bird-recordings-extended-a-m
[4] B. McFee et al., “librosa: Audio and music signal analysis in Python,” in Proceedings of the 14th Python in Science Conference (SciPy), 2015.
[5] P. Virtanen and et al., “Scipy 1.0: Fundamental algorithms for scientific
computing in python,” Nature Methods, vol. 17, no. 3, p. 261–272, 2020.
[6] Y.-Y. Yang, M. Hira, Z. Ni, A. Astafurov, C. Chen, C. Puhrsch,
D. Pollack, D. Genzel, D. Greenberg, E. Z. Yang, J. Lian, J. Hwang,
J. Chen, P. Goldsborough, S. Narenthiran, S. Watanabe, S. Chintala,
and V. Quenneville-Bélair, “Torchaudio: Building blocks for audio
and speech processing,” in ICASSP 2022 - 2022 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022,
pp. 6982–6986.
[7] T. Sainburg, “Timsainb/noisereduce: V1.0,” Jun 2019. [Online].
Available: https://zenodo.org/record/3243139
[8] A. E. Mehyadin, A. M. Abdulazeez, D. A. Hasan, and J. N. Saeed,
“Birds sound classification based on machine learning algorithms,” Asian
Journal of Research in Computer Science, p. 1–11, 2021.
[9] S. Wisdom, E. Tzinis, H. Erdogan, R. J. Weiss, K. Wilson,
and J. R. Hershey, “Unsupervised sound separation using
mixture invariant training,” Oct 2020. [Online]. Available:
https://doi.org/10.48550/arXiv.2006.12701
[10] T. Denton, S. Wisdom, and J. R. Hershey, “Improving bird classification
with unsupervised sound separation,” Oct 2021. [Online]. Available:
https://doi.org/10.48550/arXiv.2110.03209
[11] K. J. Piczak, “Recognizing bird species in audio recordings using deep convolutional neural networks,” 2016. [Online].
Available: http://ceur-ws.org/Vol-1609/16090534.pdf
[12] A. Incze, H.-B. Jancso, Z. Szilagyi, A. Farkas, and C. Sulyok, “Bird
sound recognition using a convolutional neural network,” 2018 IEEE
16th International Symposium on Intelligent Systems and Informatics
(SISY), 2018.
[13] P. Somervuo and A. Harma, “Bird song recognition based on syllable
pair histograms,” 2004 IEEE International Conference on Acoustics,
Speech, and Signal Processing.
[14] F. Briggs, R. Raich, and X. Z. Fern, “Audio classification of bird
species: A statistical manifold approach,” 2009 Ninth IEEE International
Conference on Data Mining, 2009.
[15] Arthur, “Is stereo or mono audio better? (applications for both),” Jan
2022. [Online]. Available: https://mynewmicrophone.com/is-stereo-or-
mono-audio-better-applications-for-both/