Citation: Zhou, X.; Wang, N.; Hu, K.; Wang, L.; Yu, C.; Guan, Z.; Hu, R.; Li, Q.; Ye, L. Recognition of Western Black-Crested Gibbon Call Signatures Based on SA_DenseNet-LSTM-Attention Network. Sustainability 2024, 16, 7536. https://doi.org/10.3390/su16177536
Academic Editor: Roberto Benocci
Received: 31 May 2024
Revised: 19 August 2024
Accepted: 27 August 2024
Published: 30 August 2024
Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Article
Recognition of Western Black-Crested Gibbon Call Signatures
Based on SA_DenseNet-LSTM-Attention Network
Xiaotao Zhou 1,†, Ning Wang 1,†, Kunrong Hu 1,*, Leiguang Wang 2,3, Chunjiang Yu 1, Zhenhua Guan 1,4,
Ruiqi Hu 1, Qiumei Li 5 and Longjia Ye 1
1 School of Big Data and Intelligent Engineering, Southwest Forestry University, Kunming 650224, China; zxt@swfu.edu.cn (X.Z.); wn19990808@163.com (N.W.); ycj@swfu.edu.cn (C.Y.); zhenhuaguan@hotmail.com (Z.G.); hrq3254055328@163.com (R.H.); 15187421206@163.com (L.Y.)
2 Institute of Big Data and Artificial Intelligence, Southwest Forestry University, Kunming 650024, China; leiguangwang@swfu.edu.cn
3 Key Laboratory of National Forestry and Grassland Administration on Forestry and Ecological Big Data, Southwest Forestry University, Kunming 650024, China
4 Yunnan Academy of Biodiversity, Southwest Forestry University, Kunming 650024, China
5 College of Humanities and Law, Southwest Forestry University, Kunming 650024, China; 15288016214@163.com
* Correspondence: hukunrong@swfu.edu.cn
† These authors contributed equally to this work.
Abstract: As part of the ecosystem, the western black-crested gibbon (Nomascus concolor) is important for ecological sustainability. Calls are an important means of communication for gibbons, so accurately recognizing and categorizing gibbon calls is important for their population monitoring and conservation. Because acoustic monitoring generates a large amount of sound data and manually recognizing gibbon calls in it is very time-consuming, this paper proposes a western black-crested gibbon call recognition network based on SA_DenseNet-LSTM-Attention. First, to address the lack of datasets, this paper explores 10 different data augmentation methods to process all the datasets and then converts all the sound data into Mel spectrograms for model input. Testing showed that the WaveGAN audio data augmentation method produced the largest improvement in classification accuracy across all the models in this paper. Then, to address the low accuracy of call recognition, we propose fusing the features extracted by DenseNet with the temporal features extracted by LSTM using principal component analysis (PCA), and finally, the SA_DenseNet-LSTM-Attention western black-crested gibbon call recognition network proposed in this paper is used for recognition training. To verify the effectiveness of the proposed feature fusion method, we classified 13 different types of sounds and compared several different networks: the accuracy of the VGG16 model improved by 2.0%, the accuracy of the Xception model improved by 1.8%, the accuracy of the MobileNet model improved by 2.5%, and the accuracy of the DenseNet network model improved by 2.3%. Compared to other classical call recognition networks, our proposed network obtained the highest accuracy of 98.2%, and its convergence is better than that of all the compared models. Our experiments demonstrate that the deep learning-based call recognition method can provide better technical support for monitoring western black-crested gibbon populations.
Keywords: western black-crested gibbon call recognition; data augmentation; bioacoustics; convolutional neural network; attention mechanism
1. Introduction
Western black-crested gibbons are essential to their habitats because they disperse seeds that keep forests healthy. Understanding these animals' calls and social structures can help safeguard the ecological balance by improving our comprehension of their effects on
the ecosystem. Because gibbons are highly sensitive to environmental changes, alterations
in the forest and climate change are reflected in their calls. Scientists can identify potential
threats to the ecosystem early on and implement necessary conservation measures by
monitoring gibbon calls. Finally, by identifying the calls of the western black-
crested gibbon, researchers have been able to gain a better understanding of both the
species and the environments in which they live. This knowledge has also helped to
develop tools and vital information for the development of more successful conservation
strategies. These studies support the preservation of the ecosystem’s overall health and
equilibrium, in addition to aiding in the conservation of the apes themselves.
The continued monitoring and conservation of the western black-crested gibbon group
will be an important element of ecologically sustainable development. The western black-
crested gibbon has the largest group size, about 270 groups, distributed within China [1].
Gibbon calls are distinctive and travel long distances [2]; people often use gibbon calls to
estimate where wild gibbon groups are and how many there are, so that appropriate protection measures can be taken [3]. Western black-crested gibbons are already monitored daily through sentinel statistical surveys, in which staff listen to, record, and examine their calls at some higher-elevation sites [4,5]. Naturally, this manual monitoring strategy
is labor-intensive and time-consuming, so it cannot consistently meet the demands of online and timely monitoring [6]. Since the advent of passive acoustic monitoring (PAM), the western black-crested gibbon has also been used to test such systems [7].
Moreover, social animals such as the western black-crested gibbon use a variety of sounds within their communities to convey messages and maintain social bonds. Thus,
recognizing their calls can help scientists understand their social structure, communication
methods, and social dynamics. This is crucial for the in-depth study and conservation of
these species. By analyzing the calls of these gibbons, researchers can identify differences
between individuals and groups. This helps researchers track the numbers and distribution
of specific groups for more precise ecological monitoring and conservation efforts. However,
there are several challenges in recognizing the calls of western black-crested gibbons,
as follows:
1. Although passive acoustics can meet the requirements of western black-crested gibbon group monitoring, it would be labor-intensive to manually identify gibbon sounds from a large number of acoustic recordings;
2. Deep learning models can be used to effectively recognize the calls of the western black-crested gibbon. However, the western black-crested gibbon is located in remote areas, so it is difficult to obtain a large dataset of calls;
3. The task of training a deep learning network to distinguish between various call durations is challenging;
4. How to effectively extract features from the input sound data and perform feature fusion.
Deep learning has recently begun to take center stage in passive acoustic monitoring
as a result of its ongoing development. It has been shown that the later manual processing of acoustic data can be alleviated by using convolutional neural networks (CNNs) (LeCun et al. [8,9]) and that specific species in acoustic recordings can be identified by trained classifiers [10–17]. However, a good classifier for these particular species needs a lot of reliable data, because an inadequate dataset reduces a deep learning model's ability to categorize data [18]. Additionally,
gathering these data takes a lot of time, especially for remote and rare species.
The best solution adopted in this paper to deal with the lack of a dataset is data augmentation [19]. Data augmentation creates new, additional training data by applying one or more deformations to a set of annotated training samples, for example by time-shifting the input speech data or adding natural background noise [16,20,21]. When data are scarce, the augmented data can effectively prevent overfitting during classification training. The effectiveness of data augmentation in enhancing model performance has been demonstrated in a number of studies. Salamon et al. [22] used four data augmentation methods, Time Stretching (TS), Pitch Shift (PS), Dynamic Range Compression (DRC), and Background Noise (BG), to improve the accuracy of ambient sound classification; Davis et al. [23] used the four methods proposed by Salamon et al. [22] plus five Linear Prediction Cepstral Coefficient (LPCC)-based data expansion methods to further improve the accuracy of ambient sound classification. Therefore, reasonable
data expansion can effectively improve the accuracy and generalization performance of the
model. However, most sounds are altered in acoustic properties during the data augmen-
tation process by standard techniques like pitch scaling and temporal stretching; hence,
current techniques fail to sufficiently account for the unique characteristics of each sound.
As a result, only when knowledge of the target sound is present can the augmentation
procedure be chosen in a reasonable manner. The unique vocalizations of many animals
may be lost or changed if the augmentation process is carried out blindly.
Currently, the most used solution in dealing with the problem of missing data is the
Generative Adversarial Network (GAN) [24]. The network is frequently used for improving speech and image data, because it can produce realistic virtual data without changing the characteristics of the original sound data [25,26]. Even though GAN networks produce better
results for data enhancement, they still have some flaws: the first is that the effectiveness of
the generated speech data will be extremely low in the absence of training data; the second
is that it ignores the frequency and time features present in the sound, because traditional
GAN data enhancement methods are used to generate speech data from waveforms and
spectrograms; the third is that most of the sound data comes from the outdoors, so the
background noise contained therein can cause traditional GAN networks to fail to learn
effective animal sound features. To address the limitations of traditional GAN networks,
in this paper, we choose the state-of-the-art WaveGAN network [27], which can effectively compensate for these shortcomings.
Still, there is room for improvement in the field of bioacoustic classification with re-
spect to the current approaches. Convolutional neural networks usually require uniformly
long sound inputs, yet sound signals might differ greatly in length. Sound clips must
be trimmed or processed separately before being transferred to the network, because the
time-frequency properties in each sound clip vary in length. It is likely that this approach
will introduce some noise or lose some crucial audio data. Recurrent neural networks,
on the other hand, are made to process sequential input. They do this by using iterative
units, which continuously change their internal state in order to retrieve pertinent infor-
mation from lengthier sound sequences. Convolutional neural nets (CNNs) are highly
effective in classification tasks, giving spatial structure computation priority. They are
therefore especially suitable for processing spatial data, such as photographs. However,
recurrent neural networks (RNNs) are better suited for time-series analysis because they
can extract temporal properties from sequential data. The processing of audio depends
heavily on frequency information. Nevertheless, RNNs frequently struggle to simulate
local frequency features during audio processing, a crucial aspect of audio classification
applications. As a unique type of RNN, long short-term memory (LSTM) [28] has demon-
strated strong performance in long sequence applications. The LSTM network contains a
significant quantity of learnable parameters, primarily utilized to mitigate the challenges of
gradient vanishing and explosion that commonly arise when training lengthy sequences.
Due to these benefits, LSTM networks are frequently combined with CNN networks for
speech recognition [29–31]. Although most CNN-LSTM networks can obtain better results in classification, most CNN-LSTM-based models take highly contextual fully connected features as inputs to the LSTMs, which lack fine-detailed representations of the important features and often produce unsatisfactory recognition results [32]. Additionally,
the CNN-LSTM combination is not used as much, which underutilizes the deeper features
extracted from the various layers. More seriously, the CNN-LSTM-based network model
has a relatively weak attentional ability, which may lead to the interference of noise during
the training process and may even lead to the degradation of the model’s classification
performance [33]. Although different attention mechanisms have been studied to try to
solve this problem, there is still room for improvement regarding how to rationalize the
design of attention modules corresponding to different levels of CNN features.
The remainder of the paper is organized as follows: Section 2 reviews related work; in Section 3, we describe the data sources and data augmentation; Section 4 describes the methodology, network model, and attention module used in this investigation; Section 5 focuses on the comparative results of classification accuracy on the validation set for all the models in this paper; in Section 6, a related discussion on the identification of western black-crested gibbon calls is given; finally, Section 7 details the conclusions on western black-crested gibbon call recognition.
2. Related Work
Numerous techniques have been put forth in the field of call recognition, including network architectures such as CNN, RNN, and LSTM. These techniques can be broadly divided into two categories: deep learning-based techniques and conventional feature engineering-based techniques. Conventional techniques for call recognition often depend on acoustic feature extraction, such as the Mel spectrogram and MFCC. These features are fed into a machine learning classifier for classification. For example, Zhou et al. [34] used extracted Mel spectrograms to recognize gibbon calls with a maximum accuracy of 99.8%; Rafael et al. [35] used spectrograms to recognize 915 bird species with 71% accuracy; and Roop et al. [36]
used spectrograms to recognize bird calls with 96.1% accuracy. However, these methods
have limited performance in dealing with complex audio signals and have difficulty in
capturing the temporal characteristics of gibbon calls.
In 1997, Hochreiter et al. proposed the long short-term memory (LSTM) network [28], which is now extensively used in speech processing. The problem of stan-
dard recurrent neural networks (RNNs) not being able to capture long-term dependencies
can be solved by LSTM networks. Additionally, gating mechanisms, including input, output, and forget gates, are used by LSTM networks to enhance network performance and better
manage information flow. Because it can maintain and update state information at each
time step, LSTM is especially well-suited for jobs involving multi-time-step predictions.
This is why it is so popular in speech recognition. Geng et al. [37] designed an LSTM-based speech recognition network, mainly used to recognize speech produced in English teaching; Ahmed et al. [38] proposed a speech emotion recognition model combining CNN and LSTM networks, in which the inclusion of the LSTM network effectively improves overall performance. Abdelhamid et al. [39] presents a network architecture that integrates CNN
and LSTM networks, thereby significantly enhancing the model’s efficacy in emotional
speech recognition.
With the advancement of deep learning, RNN networks are increasingly employed
in the speech recognition field [40,41]. Furthermore, the incorporation of the temporal
attention mechanism significantly enhances the model’s accuracy. An example of this
situation can be seen in some previous studies [42–44], where temporal attention mech-
anisms have been integrated into a variety of network models to enhance the precision
of speech emotion recognition. The model can learn how to adapt to changes at different
time steps during the training process, because the temporal attention mechanism enables
the model to dynamically assign weights to information at different time steps and to
concentrate more on task-relevant time steps when processing time-series data. It enhances
the network’s capacity for generalization by enabling it to adaptively simulate sequences
of varying lengths.
However, as the number of network layers continues to grow, the problem of vanishing
gradients occurs. In order to solve this problem, many people have conducted a lot
of research; for example, He et al. [45] proposed a deep residual structure to solve the problem of gradient vanishing in deep networks; Gustav et al. [46] proposed a self-similarity-based neural network macro-architecture design strategy to solve the problem of gradient vanishing; and Rupesh et al. [47] proposed a highway network that allows information to
flow unimpeded on a multilayer information highway.
Although existing studies have made some progress in gibbon call recognition, there
are still some shortcomings. Our proposed SA_DenseNet-LSTM-Attention model achieves
innovation by (1) adopting the DenseNet network as the backbone network to enhance
the feature extraction capability; (2) incorporating a self-attention mechanism after each
DenseNet convolutional layer to dynamically adjust the feature weights; (3) combining the
LSTM and temporal attention mechanism to capture the time-dependence of audio signals
and effectively extract sequence features with different call lengths; (4) fusing features using principal component analysis (PCA) to further improve model performance. The
experimental results show that our method significantly outperforms existing methods in
terms of accuracy, precision, recall and F1 score.
3. Materials
3.1. Data Sources
We chose two parts of data from Zhou et al. [48] and the dataset used in the paper published by Zhou et al. [49], but these experimental data were obtained from Zhong Enzhu et al. [7] and Zhou et al. [48], from a call monitoring system in the Chuxiong area of the Ailao Mountains National Nature Reserve in Chuxiong City, Yunnan Province, China (23°36′–24°44′ N, 100°54′–101°30′ E) (Figure 1f). The data are all acquired by the
self-developed pickup array (Patent No.:ZL 2018 2 2264510.6) (Figure 1c). AAC is utilized
to store the final audio formats, which have a sampling rate of 32 KHz and a duration of
approximately 30 min per file. Then, the acquired sensor data are transmitted through the
5.8 GHz wireless network of the main link, and in order to ensure the effective transmission
of the transmitted data, we use the LoRa network as an auxiliary link (the auxiliary link is
responsible for locating the faults of the main link and, at the same time, taking part in the
monitoring data transmission task) (Figure 1d).
Dataset types included wind, rain, bird calls (Actinodura strigula, Fulvetta cinereiceps, Psilopogon virens, Pomatorhinus ruficollis, Parus monticolus, and Pterorhinus sannio), gibbon calls, and cicada calls. However, in this article, we delineate the gibbon calls
in more detail. Based on Fan et al.'s [50] method of classifying western black-crested
gibbon calls, this paper also classifies the call of the western black-crested gibbon into
four different types: the simple repetitive syllable calls of the male gibbon, the weakly
modulated syllable calls of the male gibbon, the strongly modulated syllable calls of the
male gibbon, and the agonistic calls of the female gibbon, with the specific structure of the
call as shown in Figure 2.
Figure 3 shows the differences between call types in terms of frequency and amplitude. From the figure, it can be concluded that the "aa notes" call type has the highest average frequency, around 45.5 Hz; the "great call" type has a lower average frequency, around 44.5 Hz; the "weakly modulated" type has an average frequency close to 44 Hz; and the "modulated" type has the lowest average frequency, close to 42.5 Hz. The highest and lowest average frequencies may be related to specific behavioral patterns or environmental adaptations. The "aa notes" call type also had the highest amplitude, of nearly −72 dB, suggesting that this type of call is louder and may be advantageous in specific social interactions. In contrast, the "modulated" call type had the lowest amplitude, at about −78 dB, indicating a more subdued acoustic signature.
Figure 1. Overall network monitoring routes and network data transmission maps. (a) Environmental
factor acquisition sensors; (b) video capture camera; (c) audio capture pickup arrays; (d) wireless
network data transmission, primary link: 5.8 GHz wireless network transmission, secondary link:
LoRa wireless network transmission; (e) the local authority, where the data are transmitted back to
the authority’s servers for storage via the wireless network; (f) the location of the passive acoustic
monitoring system in the Ailao Mountains, Chuxiong City, Yunnan Province.
Figure 2. Spectrograms of four different calls of the western black-crested gibbon. (a) simple repetitive
syllable calls of the male gibbon; (b) agonistic calls of the female gibbon; (c) weakly modulated syllable
calls of the male gibbon; (d) strongly modulated syllable calls of the male gibbon.
Figure 3. Comparison of frequency and amplitude differences in calls of different gibbons.
3.2. Data Augmentation
Data augmentation is a method for growing the dataset to prevent overfitting during
training, whereas data balancing can effectively increase the accuracy of model training [51].
Numerous experimental findings demonstrate that in deep learning-based classification
tasks, we can obtain better-trained classifiers by using more samples from the dataset.
Additionally, the data augmentation can successfully increase the model’s capacity for
generalization in the scene [52]. The four calls made by the western black-crested gibbon
account for the majority of the dataset imbalance in this study. The size of the original
dataset containing the four gibbon calls is depicted in Table 1.
According to the data in the table, male gibbon calls dominate the four sound types; however, there are fewer samples of the male gibbons' weakly modulated syllable calls, the female gibbons' agonistic calls, and the male gibbons' simple repetitive syllable calls. The sample imbalance in the dataset may therefore have an impact on the automatic
recognition algorithm’s ability to identify particular patterns. In order to ensure that the
number of samples under each category is relatively balanced, this paper chooses two deep
learning methods and some commonly used methods for the data enhancement of the
four calls of gibbons [22,23,27,52–55]. After experimental selection, our goal is to determine
the most effective technique for gibbon call data augmentation. The specific data
enhancement methods are shown in Table 2.
Table 1. Call size of the original four western black-crested gibbon species. The table includes four
different gibbon call types, sex, four dataset sizes, average length of calls, and labels for the four calls.
Call Type   Genders   Sample Size   Average Length (s)   Labeling Tags
1. aa notes males 497 4.6 ga
2. Weakly modulated figure males 958 5.8 gw
3. Modulated figure males 1066 7.2 gm
4. Great call females 213 8.4 gf
Note: 1. aa notes (simple repetitive syllable calls of the male gibbon), 2. Weakly modulated figure (weakly
modulated syllable calls of the male gibbon), 3. Modulated figure (strongly modulated syllable calls of the male
gibbon), 4. Great call (agonistic calls of the female gibbon).
For the ten audio data augmentation methods in Table 2, the relevant transformation
factors are randomly generated using a random number generator to ensure the diversity of
the data transformations when expanding the original data using the seven methods based
on data transformations. In order to demonstrate the effectiveness of the method of adding
noise to augment the dataset used in this paper, the classification accuracy of the model is
explored under different noise scenarios to find the optimal range of adding noise. This
paper tested several noise levels, 0 dB, −5 dB, −10 dB, −15 dB, −20 dB, −25 dB, −30 dB, −35 dB, −40 dB, and −45 dB; the specific experimental results are shown in Table 3 and Figure 4. According to the contents of the table, the noise limit for the model proposed in this paper is −20 dB. When using noise enhancement, we eventually increased the dataset to 2000 entries per class, choosing noise enhancement levels of 0 dB, −5 dB, −10 dB, and −15 dB. The details are shown in Figure 4.
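As an illustration of the noise augmentation just described, the sketch below mixes a background-noise recording into a call clip at a chosen signal-to-noise ratio in dB, using NumPy. The function name and the peak-normalization step are our own assumptions; the paper does not give its exact mixing procedure.

import numpy as np

def add_noise_at_snr(clip: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `clip` so the result has the requested signal-to-noise ratio (dB)."""
    # Tile or trim the noise so it matches the clip length.
    if len(noise) < len(clip):
        noise = np.tile(noise, int(np.ceil(len(clip) / len(noise))))
    noise = noise[: len(clip)]

    # Scale the noise power to reach the target SNR.
    clip_power = np.mean(clip ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = clip_power / (10 ** (snr_db / 10.0))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)

    mixed = clip + scaled_noise
    # Normalize the peak so the mixture does not clip when written back to audio.
    return mixed / max(1.0, np.max(np.abs(mixed)))

Negative SNR values (e.g., −15 dB, as used above) simply make the injected noise louder than the call.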
Table 2. The audio data augmentation methods used in this paper.
Augmentation Methods Classification of Methods
Translation of Sample Rate (TSR) Data Transformation
Same Class Augmentation (SC) Data Transformation
Time Shifting Augmentation (TS1) Data Transformation
Pitch Shift Augmentation (PS) Data Transformation
Time Stretching Augmentation (TS2) Data Transformation
Speed Tuning (ST) Data Mixing
Volume Tuning (VT) Data Mixing
Noise Augmentation (N) Data Mixing
WaveGAN Data Generation
Fre-GAN Data Generation
Table 3. Comparison of model classification results at different noise levels.
Noise Levels (dB) Accuracy Precision Recall F1-Score
0 0.95 0.94 0.93 0.94
−5 0.92 0.91 0.90 0.90
−10 0.88 0.87 0.85 0.86
−15 0.85 0.84 0.82 0.83
−20 0.80 0.78 0.76 0.77
−25 0.69 0.68 0.65 0.66
−30 0.65 0.63 0.60 0.61
−35 0.60 0.58 0.55 0.56
−40 0.55 0.53 0.50 0.51
−45 0.50 0.48 0.45 0.46
Figure 4. Examples of spectrograms according to different noise levels of 0 dB, −5 dB, −10 dB, and −15 dB.
4. Methods
4.1. Overall Identification Model
Figure 5 shows the overall system model of the proposed method. The model consists of six main parts: (1) Data augmentation: due to the imbalance among the datasets of the four gibbon call types, we expanded the data for the four gibbon calls; (2) Audio preprocessing: extracting the Mel spectrogram of the input audio; (3) CNN: we use DenseNet as the backbone network to learn the features of the spectral image of each input audio clip; (4) RNN: learning temporal features from image sequences using LSTM networks; (5) Attention: assigning weights to each LSTM network output and transmitting the outputs of the LSTM network in (4) to the temporal attention module; (6) Output: the output of the attention network in (5) is sent to the Softmax classifier, which then finalizes and outputs the classification results.
Figure 5. Diagram of the overall network structure. The overall network structure of this paper (from
top to bottom) includes audio data enhancement, DenseNet network and LSTM network based on
attention mechanism and feature fusion using PCA.
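As a concrete illustration of step (2), the sketch below computes a log-scaled Mel spectrogram with the librosa library. The 32 kHz sampling rate matches the recordings described in Section 3.1, while the FFT size, hop length, and number of Mel bands are assumed values, not parameters reported in the paper.

import librosa
import numpy as np

def mel_spectrogram(path: str, sr: int = 32000, n_mels: int = 128) -> np.ndarray:
    """Load an audio file and return a log-scaled Mel spectrogram for model input."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                         hop_length=512, n_mels=n_mels)
    # Convert power to dB; the log scale is easier for the CNN to learn from.
    return librosa.power_to_db(mel, ref=np.max)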
Hence, to efficiently identify extended sequences of vocalizations produced by western
black-crested gibbons, we propose the utilization of a hybrid neural network known as
DenseNet+Self-Attention-LSTM-Attention (SA_DenseNet-LSTM-Attention). The reason
we chose the DenseNet network [56] for our backbone network is because the dense convo-
lutional network (DenseNet) is an architecture that can effectively address problems such
as gradient vanishing and gradient explosion as the number of network layers continues
to increase. With limited Mel spectrum data, the DenseNet network performs better than
CNN at obtaining training data. We attach a fully connected LSTM (FC-LSTM) to the
completely connected layer of DenseNet in order to effectively represent and learn the
time-frequency properties of the input audio. More importantly, the mutual combination
with the LSTM network aids in the extraction of spatio-temporal features by the overall
network [57]. The temporal attention module was then used to find key time-frequency
feature information in successive gibbon calls. This is shown in Figure 5.
In addition, we feature-fused the channel, spatial, and time-frequency features extracted
from the entire network, which effectively extracted gibbon call features with different call
durations and also reduced the dimensionality and weight of the final output of the atten-
tion module. In conclusion, a substantial quantity of ablation experiments and comparative
tests were undertaken to assess the performance of our proposed model in comparison to a
number of the more sophisticated classical network models.
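The feature fusion described above combines the DenseNet features with the attention-weighted LSTM features using PCA. The exact feature dimensions and number of retained components are not reported, so the following is only a minimal sketch using scikit-learn, with hypothetical array shapes.

import numpy as np
from sklearn.decomposition import PCA

def fuse_features(cnn_feats: np.ndarray, lstm_feats: np.ndarray,
                  n_components: int = 256) -> np.ndarray:
    """Concatenate DenseNet and LSTM-attention features, then reduce them with PCA."""
    # cnn_feats: (num_samples, d_cnn); lstm_feats: (num_samples, d_lstm)
    fused = np.concatenate([cnn_feats, lstm_feats], axis=1)
    # PCA cannot keep more components than samples or dimensions.
    n_components = min(n_components, fused.shape[0], fused.shape[1])
    return PCA(n_components=n_components).fit_transform(fused)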
The problem of vanishing gradients occurs as the number of network layers continues
to grow. In order to solve this problem, a lot of research has been conducted by many
people [45–47,58], and Huang et al. [56] proposed a new convolutional neural network
(DenseNet) in 2017. The suggested network is based on ResNet’s concept of layer-to-
layer connectivity of feature maps. When it comes to vanishing gradient, less processing,
and fewer parameters, this network is far superior to traditional convolutional neural
networks. As a result of the gradient vanishing being reduced, DenseNet avoids many
redundant features and accelerates convergence. Dense connectivity, as illustrated in
Figure 6, is a technique utilized to establish a continuous connection between the inputs
of a subsequent layer and the feature maps of an earlier layer, thereby augmenting the
information flow between the layers.
Figure 6. DenseNet network architecture based on self-attention mechanism. Conv in the figure denotes
the convolutional layer; BN denotes Batch Normalization; ReLU denotes the ReLU activation function.
The dense network consists of dense blocks and transition layers. Due to the channel
cascade used to connect the feature maps of the various DenseNet layers, the network may
be over-parameterized, thereby reducing its computational efficiency. In order to avoid
the problem of too large parameters, the DenseNet authors added a bottleneck layer to the
network, as shown in Figure 6. The bottleneck layer comprises Rectified Linear Unit (ReLU), Convolution (3 × 3), and Batch Normalization (BN). By adding this layer, the number of
input audio feature spectrograms is effectively reduced, thus improving the computational
efficiency of the model. Furthermore, the implementation of the transition layer results
in a reduction of both the quantity and dimensions of the feature maps. The transition
layer, depicted in Figure 6, is a distinct component that is linked behind the dense block. It comprises BN, ReLU, Convolution (1 × 1), and AvgPooling (2 × 2) as its primary elements.
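To make the dense block and transition layer concrete, the PyTorch sketch below follows the standard DenseNet-BC design (a BN-ReLU-Conv bottleneck whose output is concatenated with its input, plus a BN-ReLU-Conv(1 × 1)-AvgPool(2 × 2) transition). The growth rate and bottleneck width are assumed values rather than the configuration used in the paper.

import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """Bottleneck layer; its output is concatenated with its input (dense connectivity)."""
    def __init__(self, in_channels: int, growth_rate: int, bottleneck_width: int = 4):
        super().__init__()
        inter = bottleneck_width * growth_rate
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, inter, kernel_size=1, bias=False),  # shrink channels
            nn.BatchNorm2d(inter), nn.ReLU(inplace=True),
            nn.Conv2d(inter, growth_rate, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([x, self.body(x)], dim=1)  # dense connection

class Transition(nn.Module):
    """Transition layer that reduces both the number and the size of the feature maps."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.AvgPool2d(kernel_size=2, stride=2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)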
4.1.1. LSTM
Conventional recurrent neural networks (RNN) demonstrate exceptional proficiency
in sequential data processing. However, as the size of RNN networks increases, some critical information may become inaccessible due to insufficient connectivity among all the nodes. Although RNNs can handle problems that are strongly correlated with time series, such as speech processing, they lose the long-range dependencies carried by long-term information. LSTM is a special kind of RNN that can effectively solve these problems of RNN networks. This simply indicates that LSTM outperforms standard
RNNs when confronted with longer sequences [59,60]. The specific LSTM structure is
shown in Figure 7a. Compared with the RNN, the LSTM adds input gates, forget gates, output gates, and a hidden state, and it contains memory cells that store information for longer periods of time and selectively memorize the network error return parameters [61].
The relevant calculations are shown below.
f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)   (1)
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)   (2)
\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)   (3)
C_t = f_t \times C_{t-1} + i_t \times \tilde{C}_t   (4)
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)   (5)
h_t = o_t \times \tanh(C_t)   (6)
In the above equations, x_t denotes the input vector of the LSTM unit; h_{t-1} denotes the previous hidden state vector, which can be regarded as the output vector of the previous LSTM unit; W_f and b_f denote the weight matrix and bias vector that the forget gate needs to optimize during model training; \sigma denotes the activation function; C_t denotes the current temporary state of the neuron; C_{t-1} denotes the state of the previous neuron; and \tilde{C}_t denotes the unit input activation vector.
(a) The structure of the LSTM network; (b) temporal attention module.
Figure 7. LSTM network architecture module and time attention mechanism module.
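For reference, the sketch below implements one time step of Equations (1)–(6) directly in PyTorch. In practice the paper's FC-LSTM would be built with torch.nn.LSTM, so this is only an illustration of the gate arithmetic, with the weight matrices and biases passed in explicitly as assumed tensors.

import torch

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM time step following Equations (1)-(6); weights act on [h_{t-1}, x_t]."""
    z = torch.cat([h_prev, x_t], dim=-1)      # concatenated [h_{t-1}, x_t]
    f_t = torch.sigmoid(z @ W_f.T + b_f)      # forget gate, Eq. (1)
    i_t = torch.sigmoid(z @ W_i.T + b_i)      # input gate, Eq. (2)
    c_tilde = torch.tanh(z @ W_c.T + b_c)     # candidate state, Eq. (3)
    c_t = f_t * c_prev + i_t * c_tilde        # cell state update, Eq. (4)
    o_t = torch.sigmoid(z @ W_o.T + b_o)      # output gate, Eq. (5)
    h_t = o_t * torch.tanh(c_t)               # hidden state, Eq. (6)
    return h_t, c_t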
4.1.2. Self-Attention Module
This study adds the self-attention module based on the DenseNet network structure to
screen the channel features of the feature map output from the first two layers. This is used
to enhance the expression of the important features of a specific channel and to focus on the
network’s region of interest. Because the dataset used in this paper comes from the Ailao Mountains acoustic monitoring system, noise interference is inevitable.
The upper part, as shown in Figure 6, is the self-attention module, where first, the chan-
nel dimension of the input feature map is compressed by 1 × 1 convolution to reduce
redundant information and improve computational speed. Then, the transpose operation
is performed on the output feature maps of the F(X) branch and matrix multiplication
is performed with the output feature maps of the g(X) branch to obtain the similarity
between feature maps. Next, the similarity is normalized using Softmax function to obtain
the attention matrix. Finally, the feature maps of the h(X) branch are multiplied with
the feature maps of the other branch to obtain the final feature maps, which realizes the
weight reassignment of the features. Subsequently, the outcome is supplied to the Softmax
function, and a 1 × 1 convolution kernel is employed in the output processing to maintain
channel consistency with the input feature map’s channels.
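A minimal PyTorch sketch of such a 1 × 1-convolution self-attention block is given below. The branch names f, g, and h mirror F(X), g(X), and h(X) in the description above; the channel-reduction factor and the learnable residual weight gamma are assumptions, since the paper does not report these details.

import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """1x1-conv self-attention over a feature map, reweighting channel/spatial features."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.f = nn.Conv2d(channels, channels // reduction, 1)  # query branch F(X)
        self.g = nn.Conv2d(channels, channels // reduction, 1)  # key branch g(X)
        self.h = nn.Conv2d(channels, channels, 1)               # value branch h(X)
        self.out = nn.Conv2d(channels, channels, 1)             # keep channel count consistent
        self.gamma = nn.Parameter(torch.zeros(1))               # learnable residual weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, hgt, wdt = x.shape
        q = self.f(x).flatten(2).transpose(1, 2)                # B x HW x C'
        k = self.g(x).flatten(2)                                # B x C' x HW
        attn = torch.softmax(q @ k, dim=-1)                     # similarity between positions
        v = self.h(x).flatten(2)                                # B x C x HW
        y = (v @ attn.transpose(1, 2)).view(b, c, hgt, wdt)     # reweighted features
        return self.gamma * self.out(y) + x                     # residual connection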
4.1.3. Time Attention Mechanism
Since the temporal attention mechanism can effectively extract temporal features
with different description lengths, this paper also incorporates the temporal attention
mechanism in the LSTM network of the gibbon recognition network. The specific structure
is shown in Figure 7b. In Figure 7b, the control unit modules x_1–x_n comprise the gate loop of the LSTM network. The contents of H_1–H_n represent the outputs of the gate-loop
control unit modules in the LSTM network. These contents are subsequently inputted into
our model of the temporal attention mechanism, with the particular output of the hidden
layer illustrated as follows:
u_t = \tanh(w_w H_t + b),   (7)

where \tanh is the activation function and w_w denotes the weights. The weights of each LSTM network output result a_t are as follows:

a_t = \mathrm{softmax}(u_t^{T} w_i).   (8)

v_t represents the final result of the attention mechanism layer. The expression is shown below:

v_t = \sum_t a_t H_t.   (9)
As a result, the network can now provide more focus to the important data that is
present in the audio feature sequence. This data can be given more weight, which will
increase the accuracy and speed of the identification process.
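The module below is a minimal PyTorch sketch of Equations (7)–(9): a learned projection followed by tanh produces u_t, a softmax over the time axis yields the weights a_t, and the weighted sum gives v_t. The hidden size and the use of a linear scoring layer for w_i are assumptions.

import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Temporal attention over LSTM outputs, following Equations (7)-(9)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)        # w_w and b in Eq. (7)
        self.score = nn.Linear(hidden_size, 1, bias=False)     # scoring vector in Eq. (8)

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (batch, time, hidden), the outputs H_1..H_n of the LSTM gate-loop units
        u = torch.tanh(self.proj(H))                           # Eq. (7)
        a = torch.softmax(self.score(u), dim=1)                # Eq. (8): one weight per time step
        v = (a * H).sum(dim=1)                                 # Eq. (9): weighted sum over time
        return v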
4.2. Model Evaluation and Dataset Segmentation
The performance of the network model proposed in this paper was evaluated using
accuracy (the percentage of correct predictions), precision (the proportion of samples predicted as positive that are actually positive), recall (the proportion of actually positive samples that are predicted as positive), and the F1-score (which balances the precision and recall of the classification model). True positives (TP) denote the number of samples that are actually positive and are correctly classified as positive; true negatives (TN) denote the number of samples that are actually negative and are correctly predicted as negative; false positives (FP) denote the number of samples that are actually negative but are incorrectly predicted as positive; and false negatives (FN) denote the number of samples that are actually positive but are incorrectly predicted as negative. The specific formulas are shown below:
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}   (10)

\text{Precision} = \frac{TP}{TP + FP}   (11)

\text{Recall} = \frac{TP}{TP + FN}   (12)

\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}   (13)
According to the recording date, we divide the dataset into three parts: 60% of the dataset is used as the training set, 20% is used as the test set, and the remaining 20% is used as the validation set. The training set consists of two parts, the unenhanced dataset and the dataset enhanced by WaveGAN, as shown in Table 4; the test set also has two parts, one drawn from the unenhanced data in the training set and the other from new data that was not used in training (and not enhanced); the validation set is independent of both the training set and the test set and contains no data involved in model training (and no augmented data).
5. Results
5.1. Experimental Environment Settings
All our network comparisons were carried out in the same experimental environment
with Ubuntu 20.04, CPU model-Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz, and GPU
model NVIDIA Tesla P100 (24 G Memory). The Python 3.9 and PyTorch 1.12.1 deep learning
framework were used for model training. The parameter settings for all our networks are
as follows: learning rate is set to 0.001, batch size is 32, and the optimizer is set to Adam.
The dataset settings are all of the same size, with 60% of the dataset used as the training
dataset, 20% as the test set, and 20% as the validation sample set. There are 13 categories for
this classification task, with 2000 audio samples per category, for a total of 26,000 samples and a size of about 8 GB; all 26,000 samples were labeled. Detailed dataset information is shown in Table 4.
Table 4. Detailed information content of different sound types. The training set for each category is
1200 entries, the test set is 400 entries, and the validation set is 400 entries.
Call Type   Genus   Family   Original Size   WaveGAN Augmentation   Labeling Tags   Size
wind - - 2000 2000 wind 656 MB
rain - - 2000 2000 rain 565 MB
Actinodura strigula Actinodura Leiothrichidae 1298 2000 BTM 553 MB
Fulvetta cinereiceps Fulvetta Paradoxornithidae 987 2000 MF 523 MB
Psilopogon virens Psilopogon Megalaimidae 1500 2000 GB 635 MB
Pomatorhinus ruficollis Pomatorhinus Timaliidae 2000 2000 SSB 589 MB
Parus monticolus Parus Paridae 1356 2000 GT 643 MB
Pterorhinus sannio Pterorhinus Leiothrichidae 1293 2000 WL 528 MB
aa notes Nomascus Hylobatidae 497 2000 ga 564 MB
weakly modulated figure Nomascus Hylobatidae 958 2000 gw 894 MB
modulated figure Nomascus Hylobatidae 1066 2000 gm 783 MB
great call Nomascus Hylobatidae 213 2000 gf 679 MB
cicada Cryptotympana Cicadidae 2000 2000 cicada 586 MB
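For orientation, the sketch below wires the settings listed in Section 5.1 (Adam optimizer, learning rate 0.001, batch size 32) into a standard PyTorch training loop. The model, the dataset objects, and the number of epochs are placeholders rather than the authors' actual training script.

import torch
from torch import nn, optim
from torch.utils.data import DataLoader

def train(model: nn.Module, train_set, epochs: int = 100, device: str = "cuda"):
    """Minimal training loop using the hyperparameters reported in Section 5.1."""
    loader = DataLoader(train_set, batch_size=32, shuffle=True, num_workers=4)
    optimizer = optim.Adam(model.parameters(), lr=1e-3)  # learning rate 0.001
    criterion = nn.CrossEntropyLoss()                    # 13-way sound classification
    model.to(device)
    for _ in range(epochs):
        model.train()
        for spectrograms, labels in loader:
            spectrograms, labels = spectrograms.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(spectrograms), labels)
            loss.backward()
            optimizer.step()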
5.2. Classification Results before Data Augmentation
In this paper, we first compared the models without data augmentation, using the same number of iterations and the same dataset; we compared the recognition accuracy of the western black-crested gibbon's calls across the DenseNet model proposed by
Huang et al. [56], the VGG16 model proposed by Simonyan et al. [62], the Xception model proposed by Chollet et al. [63], and the MobileNet model proposed by Howard et al. [64].
In the end, the DenseNet+Self-Attention(SA_DenseNet) model obtained 88.1% accuracy,
the VGG16 model obtained 86.8% accuracy, the Xception model obtained 85.6% accu-
racy, and the MobileNet model obtained 58.9% accuracy. The recognition accuracy of the
SA_DenseNet model used in this paper outperforms the other three models by 29.2% over
the MobileNet model, 1.3% over the VGG16 model, and 2.5% over the Xception model.
The details are shown in Table 5 and Figure 8a.
Table 5. Classification results of different models. For classification comparisons of four different
models before data augmentation, we compared accuracy, precision, recall, and F1-score.
Model Accuracy (%) Precision (%) Recall (%) F1-Score (%)
VGG16 86.8% 86.2% 85.6% 85.8%
Xception 85.6% 84.7% 83.7% 83.9%
MobileNet 72.1% 71.7% 70.4% 70.7%
SA_DenseNet 88.1% 87.8% 86.4% 86.9%
(a) Accuracy comparison of various models prior to data augmentation. (b) Comparison of accuracy of different models after data augmentation. (c) Comparison of accuracy of different models after adding attention. (d) Comparison of different model loss functions after adding attention.
Figure 8. A comparison of the accuracy and loss of our model in relation to alternative models.
(a) represents the model accuracy comparison in the absence of data augmentation; (b) represents
the model accuracy comparison in the accuracy of models following data augmentation; (c) repre-
sents the accuracy comparison between the addition of the LSTM module and data augmentation
using an attention mechanism; (d) represents a comparison between the loss before and after data
augmentation and the addition of the LSTM module using an attention mechanism.
5.3. Classification Results after Data Augmentation
In this study, we investigate 10 distinct data augmentation techniques, then classify and contrast the datasets augmented by each technique. We chose the SA_DenseNet, VGG16, Xception, and MobileNet models for comparison; ultimately, the WaveGAN-augmented dataset gave the models the highest accuracy, followed by Fre-GAN. Traditional dataset enhancement methods (Speed Tuning, Transla-
tion of Sample Rate, and Volume Tuning) have all improved the accuracy of the model, but
not by much, and there are also data augmentation methods that degrade the accuracy of
some models (Time Stretching and Pitch Shifting Augmentation). The specific experimental
comparison results are shown in Figure 9.
We then concurrently compared the accuracy, precision, recall, and F1-score of four
distinct models following WaveGAN data augmentation to show the efficacy of our sug-
gested data augmentation. After the experiments, it is proved that the accuracies of the
four models are improved after further data augmentation, in which the DenseNet model is
improved by 7.8%, the VGG16 model is improved by 7.1%, the Xception model is improved
by 8.2%, and the MobileNet model is improved by 7.0%, as shown in Table 6 and Figure 8b. Finally, the model chosen in this paper, the SA_DenseNet model, is better and more stable than the other models.
Figure 9. Comparison of accuracy after augmentation by different speech data augmentation methods.
The augmentation methods include Translation of Sample Rate (TSR), Same Class Augmentation (SC),
Time Shifting Augmentation (TS1), Pitch Shift Augmentation (PS), Time Stretching Augmentation
(TS2), Speed Tuning (ST), Volume Tuning (VT), Noise Augmentation (N), Fre-GAN and WaveGAN.
The models compared include VGG16, Xception, MobileNet, and DenseNet. (a) Comparison of
MobileNet classification accuracy after different data augmentation methods. (b) Comparison of
Xception classification accuracy after different data augmentation methods. (c) Comparison of VGG16
classification accuracy after different data augmentation methods. (d) Comparison of SA_DenseNet
classification accuracy after different data augmentation methods.
Table 6. Classification results of different models. After enhancement with WaveGAN audio data, we
compared the classification performance of four different network models.
Model Accuracy (%) Precision (%) Recall (%) F1-Score (%)
VGG16 93.9% 93.5% 92.7% 92.9%
Xception 93.8% 93.2% 92.5% 92.7%
MobileNet 78.9% 78.2% 77.4% 77.7%
SA_DenseNet 95.9% 95.6% 94.8% 95.1%
5.4. Classification Results after Adding LSTM-Attention
In the FC-LSTM network module, which was constructed upon the DenseNet back-
bone network, we concurrently integrated the temporal attention mechanism in order to
improve the detection of the four distinct call types shown by the western black-crested
gibbon. We compared each network using datasets that had been improved using the same
data augmentation technique in order to evaluate the experiment’s fairness. Concurrently
with this, we implemented the FC-LSTM module utilizing the attention mechanism in the
remaining three networks before comparing their accuracy. It is demonstrated that the
DenseNet+Self-Attention-LSTM-Attention(SA_DenseNet-LSTM-Attention) model’s accu-
racy increases by 2.3%, while the VGG16-LSTM-Attention model’s accuracy increases by
2.0%, the Xception-LSTM-Attention model’s accuracy increases by 1.8%, and the MobileNet-
LSTM-Attention model’s accuracy increases by 0.5%. The experimental results demonstrate
that the SA_DenseNet-LSTM-Attention architecture we proposed achieves the highest accu-
racy of 98.2%. As shown in Figure 8c,d and Table 7, the classification method (SA_DenseNet-
LSTM-Attention) introduced in this article exhibits superior performance compared to
alternative models in both recognition rate and convergence speed.
Table 7. Classification results of different models. After augmenting the WaveGAN audio data
and incorporating the LSTM network, we compared the classification performance of four different
network models.
Model Accuracy (%) Precision (%) Recall (%) F1-Score (%)
VGG16-LSTM-Attention 95.9% 95.4% 94.6% 94.8%
Xception-LSTM-Attention 95.6% 95.1% 94.2% 94.5%
MobileNet-LSTM-Attention 81.4% 80.8% 79.3% 79.5%
SA_DenseNet-LSTM-Attention 98.2% 96.7% 96.1% 96.5%
ECAPA-TDNN [65] 92.8% 92.4% 90.5% 91.8%
PANNS [51] 96.1% 95.5% 93.7% 94.1%
TDNN [66] 93.4% 92.7% 91.8% 92.4%
Res2Net [67] 96.9% 95.4% 94.5% 94.8%
ResNetSE [68] 97.1% 95.8% 94.9% 95.3%
CAMPPlus [69] 96.8% 96.1% 95.1% 95.7%
ERes2Net [70] 96.6% 95.3% 94.1% 94.7%
ERes2NetV2 [70] 91.5% 90.8% 89.7% 90.2%
The model proposed in this paper has a training time of 8 h and an inference time of
0.018 s/sample, which is significantly better than other benchmark models. The training
time of the comparison models is 10 h (VGG16-LSTM-Attention), 12 h (Xception-LSTM-
Attention), and 9 h (MobileNet-LSTM-Attention), respectively; and the inference time is
0.025 s/sample, 0.022 s/sample, and 0.020 s/sample. This is shown in Table 8.
In Figure 10a, we show a comparison of the training, testing, and validation accuracies
of our proposed model (SA_DenseNet-LSTM-Attention) across time. It is evident that
our model performs at its peak after the 100th epoch on both test and validation data.
Furthermore, our model’s accuracy in the training, test, and validation sets is rapidly
becoming close to 1, suggesting that it has high generalization capabilities. In real-world
training and validation, our model demonstrates robust expressive and learning capabilities
after the WaveGAN network augments the input and integrates a powerful attention
mechanism. In order to prove the effectiveness of our proposed model, we have compared
all the trained models, and the specific accuracy comparison is shown in Figure 10b.
From the figure, it can be seen that our model is higher than the other three models in
terms of accuracy on the validation set; therefore, this also shows that the performance and
generalization ability of our proposed model is higher than the other models.
Table 8. Comparison of training time and inference time for different models.
Model Training Time (Hours) Inference Time (Seconds/Sample)
VGG16-LSTM-Attention 10 0.025
Xception-LSTM-Attention 12 0.022
MobileNet-LSTM-Attention 9 0.020
SA_DenseNet-LSTM-Attention 8 0.018
Figure 10. Model evaluation. First, we show how the accuracy of our proposed network compares on
the training and validation sets; second, we put all the trained models to real-world tests and compare
the test accuracies of each comparison model. (a) Comparison of the accuracy of the DenseNet+Self-
Attention-LSTM-Attention-based model in the training set and validation set after data augmentation.
(b) Comparing the accuracy of different model validation sets.
5.5. Different Calling Recognition Results
To test the effectiveness of our training model (SA_DenseNet-LSTM-Attention),
we selected one month (April) of acoustic monitoring data from 2021, segmented by Zhou et al. [34], for testing, and we counted western black-crested gibbons and several bird species with a high number of calls, as shown in Figure 11a. In Figure 11a we counted the number of days monitored and the number of calls made by the western black-crested gibbon and 6 different bird species in April. During our testing, we found that the model
could effectively recognize most of the calls, with most of the errors concentrated in cases
where multiple calls were mixed, in which case there is no guarantee that our model can
effectively recognize all call types. In Figure 11b it can be seen that most of the calls of these
species are concentrated before 12 o’clock. Some birds (Pomatorhinus ruficollis) had a wider
range of calling time periods, concentrated in the morning through the afternoon.
In this study, we verified the recognition of 13 different sound types, including
four gibbon sounds and nine other sound types. Through the training and testing of
the SA_DenseNet-LSTM-Attention model, we obtained satisfactory classification results.
Specifically, the classification accuracies of all sound types are above 98.0%, and the preci-
sion, recall and F1-scores are above 96.0%. These results validate the excellent classification
performance of our proposed model on a wide range of sound types, further demonstrating
its potential in practical applications. This is shown in Table 9.
Table 9. Detailed recognition results of final different categories of sounds using the models in
this paper.
Call Type Accuracy (%) Precision (%) Recall (%) F1-Score (%)
wind 98.3% 96.7% 96.0% 96.3%
rain 98.4% 96.9% 96.4% 96.6%
Actinodura strigula 98.1% 96.4% 95.8% 96.1%
Fulvetta cinereiceps 98.3% 96.8% 96.1% 96.4%
Psilopogon virens 98.2% 96.6% 96.0% 96.3%
Pomatorhinus ruficollis 98.0% 96.3% 95.7% 96.0%
Parus monticolus 98.1% 96.5% 95.9% 96.2%
Pterorhinus sannio 98.2% 96.7% 96.1% 96.4%
aa notes 98.4% 97.1% 96.7% 96.9%
weakly modulated figure 98.1% 96.5% 96.0% 96.2%
modulated figure 98.5% 97.0% 96.8% 96.9%
great call 98.3% 96.8% 96.2% 96.5%
cicada 98.2% 96.6% 95.9% 96.2%
Figure 11. Statistics on the number of calls and time period of calls in April 2021 for different species.
(1) The bars in the figure indicate the number of days each species was monitored in April; (2) the
line graphs indicate the number of calls for each species during the month of April; (3) the right-most
graph indicates the distribution of calling time periods for each species during the month of April;
(4) in the figure, Gibbon denotes the western black-crested gibbon Nomascus concolor, WL denotes
White-browed Laughingthrush Pterorhinus sannio, GT denotes Green-backed Tit Parus monticolus, SSB
denotes Streak-breasted Scimitar Babbler Pomatorhinus ruficollis, GB denotes Great Barbet Psilopogon
virens, MF denotes Manipur Fulvetta Fulvetta manipurensis, and BTM denotes Bar-throated Minla
Actinodura strigula, and Time denotes a different time period. (a) Recognition results of different
species’ calls; (b) distribution of call time periods for different species.
6. Discussion
In this paper, we present a new recognition network that uses a PCA method for
feature fusion to fuse the features extracted by the LSTM network based on a temporal
attention mechanism and those extracted by the DenseNet network. While the LSTM net-
work can extract temporal information from a time series, CNN focuses more on computing
spatial structure, despite its superior classification capabilities. To validate the performance
of our proposed model, we compared three different models, VGG16, MobileNet and
Xception, and our proposed model (SA_DenseNet-LSTM-Attention) obtained the highest
accuracy (Figure 8and Table 7). The classification performance of the models VGG16, Mo-
bileNet and Xception also improved after adding the LSTM network based on the temporal
attention mechanism (Table 7); this result further illustrates that the LSTM network can
effectively extract the sequence features into the audio. Our experiments demonstrate
that our proposed SA_DenseNet-LSTM-Attention model can effectively recognize gibbon
calls, which greatly reduces the time required for manual recognition, and also proves
that passive acoustic monitoring combined with deep learning will be an effective tool for
monitoring the western black-crested gibbon population.
Currently, the acoustic monitoring system built by project team member Zhong et al. [7] in the Ailao Mountains allows for the 24 h monitoring of the western black-crested
gibbon group. Long-term acoustic monitoring will yield hundreds of thousands of hours
of recordings annually, and the manual labeling of these recordings is sometimes not
feasible due to post-processing data. Our results suggest that deep learning combined
with passive acoustic monitoring could be a useful approach for late-stage species-specific
identification. This has the ability to reduce costs, save time and labor, and streamline
monitoring processes. Our method works well for call recognition in western black-crested
gibbons and may be easily extended to other call species. Accurately identifying individual
vocalizing species is important for ecological sustainability.
Although our model, with a 98.2% recognition rate, is not sufficient to achieve the
full recognition of western black-crested gibbon calls, the model can recognize most of the
gibbon calls in the recorded files, greatly improving the efficiency of manual annotation
and manual recognition of these data. With or without data augmentation, our model
outperforms the other three models in terms of accuracy and generalization performance
(shown in Figure 8). This is mostly due to two factors: first, our network has a greater depth
at which the gradient vanishing problem can be solved effectively; second, our network has
fewer parameters yet performs better. The only negative aspect is how memory-intensive
model training is. The precision of upcoming models will be enhanced by expanding the
sample size of labeled data.
To achieve high recognition accuracy, our model requires a large labeled dataset, and
imbalances in the dataset can restrict the model's learning ability. Expanding the dataset
through "data augmentation" is therefore a common deep learning training practice.
For example, Lasseck implemented data augmentation by adding noise, filtering, or
mixing audio clips [54], and this approach has been shown to effectively improve
classification accuracy. Our study evaluates the impact of various audio data augmenta-
tion methods on model classification performance, finding that most traditional audio
data augmentation methods do not improve, and may even decrease, model performance.
Compared to traditional methods, datasets obtained using deep learning approaches can
significantly enhance each model's classification performance (see Figure 9). WaveGAN
uses a generative adversarial network (GAN) to synthesize audio samples that are more
realistic and closer in quality to the original recordings. It can generate diverse samples
that simulate different noise conditions, vocal styles, and tempos, thus providing richer
data augmentation and improving the model's generalization ability. WaveGAN can also
be tailored to various speech recognition or audio processing tasks by adjusting the
training procedure, giving it broad application potential [71,72].
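For reference, the traditional waveform-level transforms compared above (noise injection, time stretching, and clip mixing) can be sketched as follows; file paths, parameter values, and the Mel-spectrogram settings are illustrative placeholders rather than the paper's exact configuration, and a WaveGAN-based pipeline would instead draw synthetic waveforms from a trained generator:

```python
# Illustrative waveform-level augmentations of the traditional kind compared
# in this study (noise injection, time stretching, clip mixing). Parameter
# values and file names are placeholders, not the paper's exact settings.
import numpy as np
import librosa

def add_noise(y, snr_db=20.0):
    """Add white noise at a given signal-to-noise ratio."""
    signal_power = np.mean(y ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=y.shape)
    return y + noise

def time_stretch(y, rate=1.1):
    """Speed the clip up or slow it down without changing pitch."""
    return librosa.effects.time_stretch(y, rate=rate)

def mix_clips(y1, y2, weight=0.3):
    """Mix a second clip (e.g., background sound) into the first."""
    n = min(len(y1), len(y2))
    return (1 - weight) * y1[:n] + weight * y2[:n]

y, sr = librosa.load("gibbon_call.wav", sr=22050)        # placeholder path
bg, _ = librosa.load("forest_background.wav", sr=22050)  # placeholder path
augmented = [add_noise(y), time_stretch(y, 0.9), mix_clips(y, bg)]

# Each augmented waveform is then converted to a Mel spectrogram for model input.
mels = [librosa.power_to_db(librosa.feature.melspectrogram(y=a, sr=sr)) for a in augmented]
```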
We have incorporated knowledge from several important studies that demonstrate
the complexity and diversity of animal communication into our classification model for
vocalizations. Following the guidance of Payne and McVay [73], our model accommodates
the repetitive patterns found in calls, which are characteristic of structured vocal
sequences. Additionally, Whaling et al. [74] provide a foundation for understanding the
frequent repetitions in these sequences, which our model seeks to identify and classify.
The distinctions between simple calls and more complex composites, as discussed by Behr
and von Helversen [75] and Bohn et al. [76], have been critical in refining our approach to
differentiating between monosyllabic calls and more structured composites. These studies
collectively form the theoretical backbone of our research, supporting our methodology
and ensuring that our classification scheme aligns with established understanding in
the field of bioacoustics. Both of the deep learning augmentation methods used in this
paper, Fre-GAN and WaveGAN, can effectively improve the classification accuracy of the
model; however, WaveGAN achieves better classification performance, and Fre-GAN
requires a larger training dataset to perform well.
Previous studies have also demonstrated that WaveGAN can effectively improve model
classification accuracy [77–79].
To enhance the classifier's performance in this study, we used data augmentation.
Nevertheless, data augmentation has certain limitations. First, the synthetic variants it
introduces might not accurately reflect the complexity of gibbon calls in nature, so the
classifier's performance in real applications may not match its performance on the
augmented dataset. In addition, the effectiveness of augmentation depends on the choice
of method and parameter settings; if these are poorly chosen, they may introduce noise or
irrelevant information and thus reduce the classifier's accuracy. To evaluate the performance of
the classifier in real-world environments, we tested it on one month of independent, non-
enhanced data (validation set). According to the test results, the classifier’s performance
on unenhanced data offers a more accurate assessment of how well it works in actual
passive acoustic monitoring activities. We are able to obtain a more precise image of the
classifier’s performance in practical applications, including its precision, recall, and general
robustness, by examining the data collected during this time. The test results also show
that the model in this paper outperforms all other models on the validation set (shown in Table 6).
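As an illustration, per-class precision, recall, and F1 on such an independent validation month could be computed as below; the class list, labels, and predictions are hypothetical stand-ins for the real annotations and model outputs:

```python
# Sketch of how the held-out, non-augmented month of recordings could be scored.
# `y_true` and `y_pred` stand in for the manually labeled classes and the
# classifier's predictions; the class list is abridged and illustrative.
from sklearn.metrics import classification_report, confusion_matrix

classes = ["gibbon_call", "bird_call", "rain", "wind", "insect"]  # subset, hypothetical
y_true = ["gibbon_call", "bird_call", "gibbon_call", "rain", "gibbon_call", "wind"]
y_pred = ["gibbon_call", "gibbon_call", "gibbon_call", "rain", "gibbon_call", "wind"]

# Per-class precision, recall and F1 give a realistic picture of performance
# in actual passive acoustic monitoring, independent of any augmentation.
print(classification_report(y_true, y_pred, labels=classes, zero_division=0))
print(confusion_matrix(y_true, y_pred, labels=classes))
```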
Research has shown that, in sound classification tasks, deep convolutional neural
networks outperform shallow ones [80]. Nevertheless, as network depth increases further,
vanishing gradients lead to degradation, and this degradation is not caused by overfitting:
adding more layers to a network of suitable depth results in higher training error [81].
To address this degradation problem, Huang et al. [56] proposed the densely connected
convolutional network (DenseNet). Because of its excellent performance, DenseNet serves
as the main recognition network in this manuscript. On top of DenseNet, we add an LSTM
network based on a temporal attention mechanism. The long short-term memory (LSTM)
model, a variant of the recurrent neural network (RNN), handles vanishing and exploding
gradients better than standard RNNs and has shown very good performance in sound
classification tasks [82,83]. The LSTM network can recognize longer call sequences more
effectively because it alleviates the long-term dependency problem that RNNs face with
lengthy sequences. The temporal attention mechanism, in turn, allows the network to
dynamically allocate attention to information at different time steps in the sequence, so it
can concentrate on time steps that are important to the current task and down-weight
those that are secondary or irrelevant. Since the sound data used in this article have
variable duration, the attention-based LSTM module helps the network generalize to long
unseen sequences and handle input sequences of varying lengths flexibly. Incorporating
temporal attention also improves the retention of crucial information at key time steps
and allows the network to focus selectively on task-relevant time steps, reducing
computational cost.
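A minimal sketch of this mechanism, assuming Mel-spectrogram inputs and layer sizes that are illustrative rather than the paper's exact architecture, is an LSTM whose outputs are pooled with learned temporal attention weights:

```python
# Minimal PyTorch sketch of an LSTM with temporal attention pooling, the kind
# of mechanism described above. Layer sizes are assumed and do not reproduce
# the paper's exact architecture.
import torch
import torch.nn as nn

class AttentiveLSTM(nn.Module):
    def __init__(self, n_mels=128, hidden=256, n_classes=13):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden, 1)       # scores each time step
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                           # x: (batch, time, n_mels)
        h, _ = self.lstm(x)                         # h: (batch, time, 2*hidden)
        weights = torch.softmax(self.score(h), dim=1)   # attention over time steps
        context = (weights * h).sum(dim=1)          # weighted sum -> (batch, 2*hidden)
        return self.classifier(context), weights

model = AttentiveLSTM()
mel_batch = torch.randn(4, 200, 128)                # 4 clips, 200 frames, 128 Mel bands
logits, attn = model(mel_batch)
print(logits.shape, attn.shape)                     # (4, 13), (4, 200, 1)
```

The attention weights provide an interpretable view of which frames the network relies on, which is useful when inspecting why a clip was classified as a gibbon call.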
Our model can classify new sound recordings with high accuracy. The predictions are
almost perfect when the training data are augmented with WaveGAN-generated audio
and when the environmental conditions are similar to those used to train the model. On
our test dataset, the SA_DenseNet-LSTM-Attention model proposed in this paper detected
all gibbon calls at the cost of ten false positives. The primary cause of the false positives was that certain
bird calls were misidentified as the gibbons' weakly modulated figures. Additionally, most
of these errors were concentrated in recordings with more mixed background noise.
7. Conclusions
In this paper, a new deep learning hybrid model (SA_DenseNet-LSTM-Attention)
is proposed for recognizing the calls of western black-crested gibbons. To address the
limited dataset, we tested 10 different data augmentation methods; the experimental
comparison shows that WaveGAN audio data augmentation achieves the highest accuracy
for all models in this article, and that every model's accuracy increases with data
augmentation. To increase the model's recognition rate across sound types, we added an
LSTM network based on a temporal attention mechanism to the DenseNet network and
fused the spatial and channel features extracted by DenseNet with the temporal features
extracted by the LSTM using PCA. The experimental results demonstrate that the
proposed SA_DenseNet-LSTM-Attention recognition model achieves an accuracy of up to
98.2% in recognizing 13 different sound classes, surpassing all other network models in
terms of both accuracy and generalization performance.
Author Contributions: Conceptualization, X.Z. and K.H.; methodology, Z.G. and L.W.; software, X.Z.
and N.W.; validation, X.Z. and K.H.; formal analysis, X.Z. and Z.G.; investigation, X.Z.; resources,
K.H.; data curation, K.H.; writing—original draft preparation, X.Z.; writing—review and editing,
K.H. and X.Z.; visualization, X.Z.; supervision, C.Y., Q.L., L.Y. and R.H.; project administration, K.H.;
funding acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.
Funding: We are grateful to the Chuxiong Management and Protection Branch of the Ailao Mountains
National Nature Reserve in Yunnan Province and the builders of the passive acoustic monitoring
system in the early stages of this project. We thank the Major Science and Technology Project of
Yunnan Province (202202AD080010) for support. We thank the National Natural Science Foundation
of China for grant Nos. 32160369 and 31860182.
Data Availability Statement: The original contributions presented in this study are included in the
article; further inquiries can be directed to the corresponding author.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1.
Guan, Z.; Yan, L.; Huang, B. Analysis of the current status of gibbon family population monitoring in China. Sichuan Anim.
2017,36, 7.
2.
Fan, P.; Jiang, X.; Liu, C.; Luo, W. Sonogram structure and timing of duets of western black crested gibbon in Wuliang Mountain.
Zool. Res. 2010, 31(3), 293–302.
3.
Brockelman, W.; Srikosamatara, S. Estimation of density of gibbon groups by use of loud songs. Am. J. Primatol. 1993,29, 93–108.
[CrossRef]
4.
Jiang, X.L.; Luo, Z.; Zhao, S.; Li, R.; Liu, C. Status and distribution pattern of black crested gibbon (Nomascus concolor
jingdongensis) in Wuliang Mountains, Yunnan, China: Implication for conservation. Primates J. Primatol. 2006,47, 264–271.
[CrossRef] [PubMed]
5.
Dat, L.T.; Phong, L.M. 2010 Census of Western Black Crested Gibbon Nomascus Concolor in mu Cang Chai Species/Habitat Conservation
Area (Yen Bai Province) and Adjacent Forests in Muong la District (Son la Province); Fauna & Flora International Vietnam Programme:
Hanoi, Vietnam, 2010.
6.
Li, X.; Zhong, E.; Cui, C.; Zhou, J.; Li, X.; Guan, Z. Monitoring the calling behavior of the western Yunnan subspecies of the
western black crested gibbon (Hylobatidae). J. Guangxi Norm. Univ. Nat. Sci. Ed. 2021,39, 29–37.
7.
Zhong, E.; Guan, Z.; Zhou, X.; Zhao, Y.; Hu, K. Application of passive acoustic monitoring techniques to the monitoring of the
western black-crested gibbon. Biodiversity 2021,29, 9.
8.
LeCun, Y.; Boser, B.E.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.E.; Jackel, L.D. Handwritten Digit Recognition
with a Back-Propagation Network. In Proceedings of the Neural Information Processing Systems (NIPS), Denver, CO, USA,
27–30 November 1989.
9.
Haykin, S.; Kosko, B. Gradient-Based Learning Applied to Document Recognition. In Intelligent Signal Processing; Wiley-IEEE
Press: Hoboken, NJ, USA, 2001; pp. 306–351. [CrossRef]
10.
Fan, J.; Liu, X.; Wang, X.; Deyi, W.; Han, M. Multi-Background Island Bird Detection Based on Faster R-CNN. Cybern. Syst. 2021,
52, 26–35. [CrossRef]
11.
Grill, T.; Schlüter, J. Two convolutional neural networks for bird detection in audio signals. In Proceedings of the 2017 25th
European Signal Processing Conference (EUSIPCO), Kos, Greece, 28 August–2 September 2017; pp. 1764–1768. [CrossRef]
12.
Stowell, D.; Wood, M.; Pamuła, H.; Stylianou, Y.; Glotin, H. Automatic acoustic detection of birds through deep learning: The
first Bird Audio Detection challenge. Methods Ecol. Evol. 2018,10, 368–380. [CrossRef]
13.
Dufourq, E.; Durbach, I.N.; Hansford, J.P.; Hoepfner, A.; Ma, H.; Bryant, J.V.; Stender, C.S.; Li, W.; Liu, Z.; Chen, Q.; et al.
Automated detection of Hainan gibbon calls for passive acoustic monitoring. Remote. Sens. Ecol. Conserv. 2020,7, 475–487.
[CrossRef]
14.
Ruan, W.; Wu, K.; Chen, Q.; Zhang, C. ResNet-based bio-acoustics presence detection technology of Hainan gibbon calls. Appl.
Acoust. 2022,198, 108939. [CrossRef]
15.
Jiang, J.; Bu, L.; Duan, F.; Wang, X.; Liu, W.; Sun, Z.; Li, C. Whistle detection and classification for whales based on convolutional
neural networks. Appl. Acoust. 2019,150, 169–178. [CrossRef]
16.
Bergler, C.; Schröter, H.; Cheng, R.X.; Barth, V.; Weber, M.; Noeth, E.; Hofer, H.; Maier, A. ORCA-SPOT: An Automatic Killer
Whale Sound Detection Toolkit Using Deep Learning. Sci. Rep. 2019,9, 10997. [CrossRef] [PubMed]
17.
Bermant, P.C.; Bronstein, M.M.; Wood, R.J.; Gero, S.; Gruber, D.F. Deep Machine Learning Techniques for the Detection and
Classification of Sperm Whale Bioacoustics. Sci. Rep. 2019,9, 12588. [CrossRef]
18.
Moon, J.; Jung, S.; Park, S.; Hwang, E. Conditional Tabular GAN-Based Two-Stage Data Generation Scheme for Short-Term Load
Forecasting. IEEE Access 2020,8, 205327–205339. [CrossRef]
19.
Nanni, L.; Maguolo, G.; Paci, M. Data augmentation approaches for improving animal audio classification. arXiv 2019,
arXiv:1912.07756. [CrossRef]
20.
Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of
the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Volume 25.
21.
McFee, B.; Humphrey, E.J.; Bello, J.P. A Software Framework for Musical Data Augmentation. In Proceedings of the International
Society for Music Information Retrieval Conference, Málaga, Spain, 26–30 October 2015.
22.
Salamon, J.; Bello, J.P. Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification.
IEEE Signal Process. Lett. 2017,24, 279–283. [CrossRef]
23.
Davis, N.; Suresh, K. Environmental Sound Classification Using Deep Convolutional Neural Networks and Data Augmentation.
In Proceedings of the 2018 IEEE Recent Advances in Intelligent Computational Systems (RAICS), Trivandrum, Kerala, 6–8
December 2018; pp. 41–45. [CrossRef]
24.
Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial
Networks. arXiv 2014, arXiv:1406.2661. [CrossRef]
25.
Shorten, C.; Khoshgoftaar, T.M. A survey on Image Data Augmentation for Deep Learning. J. Big Data 2019,6, 1–48. [CrossRef]
26.
Pascual, S.; Bonafonte, A.; Serrà, J. SEGAN: Speech Enhancement Generative Adversarial Network. arXiv 2017, arXiv:1703.09452.
27.
Donahue, C.; McAuley, J.; Puckette, M. Adversarial Audio Synthesis. In Proceedings of the International Conference on Learning
Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
28. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997,9, 1735–1780. [CrossRef]
29.
Petmezas, G.; Cheimariotis, G.A.; Stefanopoulos, L.; Rocha, B.M.M.; Paiva, R.P.; Katsaggelos, A.K.; Maglaveras, N. Automated
Lung Sound Classification Using a Hybrid CNN-LSTM Network and Focal Loss Function. Sensors 2022,22, 1232. [CrossRef]
30.
Atila, O.; ¸Sengür, A. Attention guided 3D CNN-LSTM model for accurate speech based emotion recognition. Appl. Acoust. 2021,
182, 108260. [CrossRef]
31.
Alsayadi, H.A.; Abdelhamid, A.A.; Hegazy, I.; Fayed, Z.T. Non-diacritized Arabic speech recognition based on CNN-LSTM and
attention-based models. J. Intell. Fuzzy Syst. 2021,41, 6207–6219. [CrossRef]
32.
Zhang, Z.; Lv, Z.; Gan, C.; Zhu, Q. Human action recognition using convolutional LSTM and fully-connected LSTM with different
attentions. Neurocomputing 2020,410, 304–316. [CrossRef]
33.
Liu, J.; Wang, G.; Hu, P.; Duan, L.Y.; Kot, A.C. Global Context-Aware Attention LSTM Networks for 3D Action Recognition. In
Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July
2017; pp. 3671–3680. [CrossRef]
34.
Zhou, X.; Hu, K.; Guan, Z.; Yu, C.; Wang, S.; Fan, M.; Sun, Y.; Cao, Y.; Wang, Y.; Miao, G. Methods for processing and analyzing
passive acoustic monitoring data: An example of song recognition in western black-crested gibbons. Ecol. Indic. 2023,155, 110908.
[CrossRef]
35.
Zottesso, R.H.D.; Costa, Y.M.G.; Bertolini, D.; Oliveira, L. Bird species identification using spectrogram and dissimilarity approach.
Ecol. Inform. 2018,48, 187–197. [CrossRef]
36.
Pahuja, R.; Kumar, A. Sound-spectrogram based automatic bird species recognition using MLP classifier. Appl. Acoust. 2021,
180, 108077. [CrossRef]
37.
Geng, Y. Design of English teaching speech recognition system based on LSTM network and feature extraction. Soft Comput.
2023, 1–11. [CrossRef]
38.
Ahmed, M.R.; Islam, S.; Islam, A.K.M.M.; Shatabda, S. An Ensemble 1D-CNN-LSTM-GRU Model with Data Augmentation for
Speech Emotion Recognition. arXiv 2021, arXiv:2112.05666.
39.
Abdelhamid, A.A.; El-Kenawy, E.S.M.; Alotaibi, B.; Amer, G.M.; Abdelkader, M.Y.; Ibrahim, A.; Eid, M.M. Robust Speech Emotion
Recognition Using CNN+LSTM Based on Stochastic Fractal Search Optimization Algorithm. IEEE Access 2022,10, 49265–49284.
[CrossRef]
40.
El-Moneim, S.; Nassar, M.; Dessouky, M.; Ismail, N.; El-Fishawy, A.; Abd El-Samie, F. Text-independent speaker recognition
using LSTM-RNN and speech enhancement. Multimed. Tools Appl. 2020,79, 24013–24028. [CrossRef]
41.
Yi, J.; Ni, H.; Wen, Z.; Liu, B.; Tao, J. CTC regularized model adaptation for improving LSTM RNN based multi-accent Mandarin
speech recognition. In Proceedings of the 2016 10th International Symposium on Chinese Spoken Language Processing (ISCSLP),
Tianjin, China, 17–20 October 2016; pp. 1–5. [CrossRef]
42.
Tang, Y.; Hu, Y.; He, L.; Huang, H. A bimodal network based on Audio-Text-Interactional-Attention with ArcFace loss for speech
emotion recognition. Speech Commun. 2022,143, 21–32. [CrossRef]
43.
Yu, Y.; Kim, Y. Attention-LSTM-Attention Model for Speech Emotion Recognition and Analysis of IEMOCAP Database. Electronics
2020,9, 713. [CrossRef]
44.
Hu, Z.; Linghu, K.; Yu, H.; Liao, C. Speech Emotion Recognition Based on Attention MCNN Combined With Gender Information.
IEEE Access 2023,11, 50285–50294. [CrossRef]
45.
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [CrossRef]
46.
Larsson, G.; Maire, M.; Shakhnarovich, G. FractalNet: Ultra-Deep Neural Networks without Residuals. arXiv 2016,
arXiv:1605.07648.
47.
Srivastava, R.K.; Greff, K.; Schmidhuber, J. Training Very Deep Networks. In Proceedings of the Annual Conference on Neural
Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015.
48.
Zhou, X.; Guan, Z.; Zhong, E.; Dong, Y.; Li, H.; Hu, K. Automated Monitoring of Western Black Crested Gibbon Population
Based on Voice Characteristics. In Proceedings of the 2019 IEEE 5th International Conference on Computer and Communications
(ICCC), Chengdu, China, 6–9 December 2019; pp. 1383–1387.
49.
Zhou, X.; Hu, K.; Guan, Z. Environmental sound classification of western black-crowned gibbon habitat based on spectral
subtraction and VGG16. In Proceedings of the 2022 IEEE 5th Advanced Information Management, Communicates, Electronic
and Automation Control Conference (IMCEC), Chongqing, China, 16–18 December 2022; Volume 5, pp. 578–582. [CrossRef]
50.
Fan, P.; Jiang, X.; Liu, C.; Luo, W. The Acoustic Structure and Time Characteristics of Wuliangshan West black crested gibbon
Duet. Zool. Res. 2010,31, 10.
51.
Kong, Q.; Cao, Y.; Iqbal, T.; Wang, Y.; Wang, W.; Plumbley, M.D. PANNs: Large-Scale Pretrained Audio Neural Networks for
Audio Pattern Recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2020,28, 2880–2894. [CrossRef]
52.
Stowell, D.; Petruskova, T.; Linhart, P. Automatic acoustic identification of individuals in multiple species: improving identifica-
tion across recording conditions. J. R. Soc. Interface 2019,16, 20180940. [CrossRef]
53.
Bahmei, B.; Birmingham, E.; Arzanpour, S. CNN-RNN and Data Augmentation Using Deep Convolutional Generative Adversarial
Network for Environmental Sound Classification. IEEE Signal Process. Lett. 2022,29, 682–686. [CrossRef]
54.
Lasseck, M. Audio-based Bird Species Identification with Deep Convolutional Neural Networks. In Proceedings of the Conference
and Labs of the Evaluation Forum, Avignon, France, 10–14 September 2018.
55.
Kim, J.H.; Lee, S.H.; Lee, J.H.; Lee, S.W. Fre-GAN: Adversarial Frequency-consistent Audio Synthesis. arXiv 2021,
arXiv:2106.02297.
56.
Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269.
[CrossRef]
57.
Ng, J.Y.H.; Hausknecht, M.J.; Vijayanarasimhan, S.; Vinyals, O.; Monga, R.; Toderici, G. Beyond short snippets: Deep networks for
video classification. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston,
MA, USA, 7–12 June 2015; pp. 4694–4702.
58. Huang, G.; Sun, Y.; Liu, Z.; Sedra, D.; Weinberger, K. Deep Networks with Stochastic Depth. arXiv 2016, arXiv:1603.09382.
59.
Simpson, T.; Dervilis, N.; Chatzi, E.N. Machine Learning Approach to Model Order Reduction of Nonlinear Systems via
Autoencoder and LSTM Networks. arXiv 2021, arXiv:2109.11213. [CrossRef]
60.
Burgess, J.; O’Kane, P.; Sezer, S.; Carlin, D. LSTM RNN: Detecting exploit kits using redirection chain sequences. Cybersecurity
2021,4, 1–15. [CrossRef]
61.
Zhao, S.; Dong, X. A study on speech recognition based on improved LSTM deep neural network. J. Zhengzhou Univ. Eng. Ed.
2018,39, 5.
62.
Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.
63.
Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807. [CrossRef]
64.
Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient
Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
65.
Desplanques, B.; Thienpondt, J.; Demuynck, K. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in
TDNN Based Speaker Verification. arXiv 2020, arXiv:2005.07143.
66.
Martinez, A.M.C.; Spille, C.; Rossbach, J.I.; Kollmeier, B.; Meyer, B.T. Prediction of speech intelligibility with DNN-based
performance measures. arXiv 2021, arXiv:2203.09148.
67.
Gao, S.; Cheng, M.M.; Zhao, K.; Zhang, X.; Yang, M.H.; Torr, P.H.S. Res2Net: A New Multi-Scale Backbone Architecture. IEEE
Trans. Pattern Anal. Mach. Intell. 2019,43, 652–662. [CrossRef]
68.
Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer
Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [CrossRef]
69.
Wang, H.; Zheng, S.; Chen, Y.; Cheng, L.; Chen, Q. CAM++: A Fast and Efficient Network for Speaker Verification Using
Context-Aware Masking. In Proceedings of the Interspeech, Dublin, Ireland, 20–24 August 2023.
70.
Chen, Y.; Zheng, S.; Wang, H.; Cheng, L.; Chen, Q.; Qi, J. An Enhanced Res2Net with Local and Global Feature Fusion for Speaker
Verification. arXiv 2023, arXiv:2305.12838.
71.
Yang, M.; Wang, Z.; Chi, Z.; Feng, W. WaveGAN: Frequency-aware GAN for High-Fidelity Few-shot Image Generation. arXiv
2022, arXiv:2207.07288.
72.
Yamamoto, R.; Song, E.; Kim, J.M. Parallel WaveGAN: A fast waveform generation model based on generative adversarial
networks with multi-resolution spectrogram. arXiv 2020, arXiv:1910.11480.
73. Payne, R.; McVay, S. Songs of Humpback Whales. Science 1971,173, 585–597. [CrossRef]
74.
Whaling, C.S.; Solis, M.M.; Doupe, A.J.; Soha, J.A.; Marler, P.R. Acoustic and neural bases for innate recognition of song. Proc.
Natl. Acad. Sci. USA 1997, 94(23), 12694–12698. [CrossRef]
75.
Behr, O.; von Helversen, O. Bat serenades—Complex courtship songs of the sac-winged bat (Saccopteryx bilineata). Behav. Ecol.
Sociobiol. 2004,56, 106–115. [CrossRef]
76.
Bohn, K.M.; Schmidt-French, B.A.; Schwartz, C.; Smotherman, M.S.; Pollak, G.D. Versatility and Stereotypy of Free-Tailed Bat
Songs. PLoS ONE 2009,4, e6746. [CrossRef]
77.
Madhu, A.; Kumaraswamy, S. Data Augmentation Using Generative Adversarial Network for Environmental Sound Classification.
In Proceedings of the 2019 27th European Signal Processing Conference (EUSIPCO), A Coruña, Spain, 2–6 September 2019;
pp. 1–5. [CrossRef]
78.
Yang, J.H.; Kim, N.K.; Kim, H.K. Se-Resnet with Gan-Based Data Augmentation Applied to Acoustic Scene Classification
Technical Report. In Proceedings of the DCASE, Surrey, UK, 19–20 November 2018.
79.
Kim, E.; Moon, J.; Shim, J.C.; Hwang, E. DualDiscWaveGAN-Based Data Augmentation Scheme for Animal Sound Classification.
Sensors 2023,23, 2024. [CrossRef]
80.
Dai, W.; Dai, C.; Qu, S.; Li, J.; Das, S. Very deep convolutional neural networks for raw waveforms. In Proceedings of the 2017
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017;
pp. 421–425. [CrossRef]
81.
He, K.; Sun, J. Convolutional neural networks at constrained time cost. In Proceedings of the 2015 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 5353–5360.
82.
Abdullah, K.H.; Bilal Er, M. Lung sound signal classification by using Cosine Similarity-based Multilevel Discrete Wavelet
Transform Decomposition with CNN-LSTM Hybrid model. In Proceedings of the 2022 4th International Conference on Artificial
Intelligence and Speech Technology (AIST), Delhi, India, 9–10 December 2022; pp. 1–4. [CrossRef]
83.
Pradeep, R.; Rao, K.S. Incorporation of Manner of Articulation Constraint in LSTM for Speech Recognition. Circuits Syst. Signal
Process. 2019,38, 3482–3500. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.