Cough Recognition Based on Mel-Spectrogram and Convolutional Neural Network
Quan Zhou, Jianhua Shan, Wenlong Ding, Chengyin Wang, Shi Yuan, Fuchun Sun, Haiyuan Li and Bin Fang
Anhui Province Key Laboratory of Special Heavy Load Robot, Anhui University of Technology, Maanshan, China,
National Research Center for Information Science and Technology, Department of Computer Science and Technology, Tsinghua
University, Beijing, China,
Robotics Institute, School of Automation, Beijing University of Posts and Telecommunications, Beijing,
In daily life, there are a variety of complex sound sources, and in some situations it is important to detect certain sounds effectively. With the outbreak of COVID-19, it has become necessary to distinguish the sound of coughing in order to help identify suspected patients in the population. In this paper, we propose a method for cough recognition based on a Mel-spectrogram and a Convolutional Neural Network called the Cough Recognition Network (CRN), which can effectively distinguish cough sounds.
Keywords: cough recognition, mel-spectrogram, CNN, deep learning, audio, COVID-19
As a disease with a long incubation period and a high infection rate, COVID-19 has infected millions of people and caused hundreds of thousands of deaths. How to avoid the rapid spread of the epidemic and effectively control the number of infected people has become an urgent issue. Hashmi and Asif found that, among 10,172 laboratory-confirmed COVID-19 cases, coughing was reported in 54.08% (Sattar Hashmi and Asif, 2020). Therefore, if coughing, a typical symptom of pneumonia, can be monitored quickly and accurately in the population, it is of great significance for controlling potential sources of infection.
Many scholars have studied how to extract features from sound and how to recognize it. The Mel Frequency Cepstrum Coefficient (MFCC), a method of extracting audio features (Shintri and Bhatia, 2015), is widely used in various audio recognition tasks. Xie et al. used MFCC to recognize abnormal voice (Xie et al., 2012). Wang and Hu proposed recognizing speech emotion based on an improved MFCC (Wang and Hu, 2018). Ittichaichareon et al. described a method that uses MFCC extracted from the speech signals of spoken words for speech recognition (Ittichaichareon et al., 2012). The Fourier transform (FT) is also widely used in audio processing. Pucik et al. presented a new procedure for the frequency analysis of audio signals (Pucik et al., 2014).
Although these traditional methods are very effective for extracting audio features, deep learning may achieve better results given the complexity of real scenes. With the development of deep learning, neural networks have played an important role in audio recognition. Rippel et al. proposed spectral representations for convolutional neural networks (Rippel et al., 2015). Some LSTM-based networks for speech recognition have also been presented (Pundak and Sainath, 2017; Trianto et al., 2018). Compared with traditional methods, deep learning can extract more complex and robust features.
Edited by: Shalabh Gupta, University of Connecticut, United States
Reviewed by: Jing Yang, University of Connecticut, United States; Shugong Xu, Shanghai University, China
Bin Fang
Specialty section: This article was submitted to Smart Sensor Networks and Autonomy, a section of the journal Frontiers in Robotics and AI
Received: 04 July 2020
Accepted: 09 April 2021
Published: 07 May 2021
Zhou Q, Shan J, Ding W, Wang C, Yuan S, Sun F, Li H and Fang B (2021) Cough Recognition Based on Mel-Spectrogram and Convolutional Neural Network. Front. Robot. AI 8:580080. doi: 10.3389/frobt.2021.580080
Frontiers in Robotics and AI | May 2021 | Volume 8 | Article 580080
For cough recognition, various methods have been proposed. Cough signals are usually obtained by audio or inertial sensors, which can detect the vibration caused by coughing. These sensors include a microphone that can be worn or placed near the user, or a piezoelectric transducer and a high-sensitivity accelerometer that can be placed in the throat or chest area (Drugman et al., 2013; Amoh and Odame, 2016; Elfaramawy et al., 2018).
Infante et al. used a machine learning method to recognize dry/wet cough (Infante et al., 2017). A semi-supervised Tree Support Vector Machine has been proposed for cough recognition and detection (Hoa et al., 2011). K-NN is also an efficient tool that is often used for cough recognition (Hoyos Barcelo et al., 2017; Vhaduri et al., 2019).
In addition, the Artificial Neural Network (ANN), Gaussian Mixture Model (GMM), Support Vector Machine (SVM), and other methods are also used for cough recognition (Drugman et al., 2011).
The difficulty of cough recognition mainly lies in separating coughs from background noise. Many kinds of sound are mixed together in daily scenes, so effectively distinguishing between coughing and other sounds remains a difficult problem.
In this paper, we propose a cough recognition method based on a Mel-spectrogram and a Convolutional Neural Network (CNN). First, we augment the audio data by mixing in sounds from various complex scenes. Then, we preprocess the data to ensure a consistent length and convert it into a Mel-spectrogram. Finally, we build a CNN-based model to classify the cough using the Mel-spectrogram. We also compare the model with several other common methods. The experimental results show that the cough recognition model based on a Mel-spectrogram and a CNN can effectively identify and detect coughing in complex scenes.
Figure 1 shows the workflow of our cough classification model.
Data Augmentation
In natural environments, sound is rarely produced by a single source, and the received sound is often a mix of multiple sounds. To improve recognition performance and robustness, we augment the data by mixing the cough data with noise and human voices.
We selected several audio datasets for data augmentation, such as the ESC-50 dataset (Piczak, 2015) and the Speech Commands Data Set (Warden, 2018). All cough data come from the ESC-50 dataset.
Positive Samples 1: Cough. After audio segmentation, we select all cough audio samples as positive samples. We also obtain more cough audio samples by increasing and decreasing the volume.
Positive Samples 2 and 3: Cough + Human Sound and Cough + Natural Sound. To enhance the robustness of the model, we also mix cough audio with human sound (mainly commonly spoken words such as "go," "up," "right," and so on) and with natural sound (wind, rain, door-clock, footsteps, and other common noises) as positive samples 2 and 3, respectively.
In all the mixed audio, the volume of the coughing sound is adjusted to produce more combinations of different cough sounds and other sounds.
All of the original and processed cough audio data are labeled as cough.
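The mixing step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `mix_with_gain` and `augment`, the gain values, and the peak rescaling are all hypothetical choices.

```python
def mix_with_gain(cough, background, cough_gain=1.0):
    """Overlay a cough clip on a background clip, sample by sample.

    Both inputs are lists of floats in [-1.0, 1.0] at the same sampling rate.
    `cough_gain` scales the cough volume before mixing (hypothetical parameter).
    """
    length = min(len(cough), len(background))
    mixed = [cough_gain * cough[i] + background[i] for i in range(length)]
    # Rescale only if the sum leaves the valid range, to avoid clipping.
    peak = max((abs(s) for s in mixed), default=0.0)
    if peak > 1.0:
        mixed = [s / peak for s in mixed]
    return mixed


def augment(cough, background, gains=(0.5, 1.0, 1.5)):
    """Produce one mixed positive sample per cough gain (assumed values)."""
    return [mix_with_gain(cough, background, g) for g in gains]
```

Calling `augment` with one cough clip and one background clip returns several positive samples that differ only in how loud the cough is relative to the background.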
Negative Samples 1: Human Sound. We choose human sounds (mainly "go," "up," and some other common human noises, all taken from samples that are not used for cough augmentation) from the datasets above as one of the negative samples, so that the model can learn to distinguish between cough sounds and human sounds. All human sounds were mixed with white noise, pink noise, and so on.
Negative Samples 2: Natural Sound. We choose natural noise (wind, rain, pouring-water, footsteps, and other common sounds, all taken from samples that are not used for cough augmentation) from the datasets above as the other negative samples.
All human sound and natural sound data form the negative class.
In the end, we have cough sounds, cough audio mixed with natural noise, and cough audio mixed with human sounds as positive samples, while human sounds and natural sounds are taken as negative samples.
Data Preprocess
Audio that is too short may be difficult to recognize, while audio that is too long may contain a superposition of several unrelated sounds, so we choose a length of 1 s as the input. Because the cough recordings in the original dataset vary in duration, we select the audio containing coughing and divide it into 1 s segments.
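As a rough sketch of this segmentation, the function below cuts a waveform into consecutive 1 s pieces. The name `split_into_seconds` is hypothetical, and zero-padding the trailing partial segment is an assumption, since the text does not say how leftovers are handled.

```python
def split_into_seconds(samples, sample_rate=8000, pad_value=0.0):
    """Cut a waveform (list of floats) into consecutive 1 s segments.

    The last segment is zero-padded to full length. This padding is an
    assumption; dropping the remainder would be an equally valid choice.
    """
    segments = []
    for start in range(0, len(samples), sample_rate):
        segment = list(samples[start:start + sample_rate])
        segment += [pad_value] * (sample_rate - len(segment))
        segments.append(segment)
    return segments
```

With a 2.5 s recording at 8 kHz, this yields three segments of 8,000 samples each, the last one half-filled with zeros.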
FIGURE 1 | The workflow diagram.
The Mel spectrum is obtained by applying a short-time Fourier transform (STFT) to each frame to get its energy/amplitude spectrum, mapping the linear frequency scale to the logarithmic Mel scale, and passing the result through a filter bank to obtain a feature vector. The values of this vector roughly describe the distribution of signal energy over Mel-scale frequencies.
After the audio data are processed into 1 s-long clips, we transform all the data into Mel-spectrograms so that we can train the convolutional neural network for recognition.
Audio data usually have complex features, so it is necessary to extract useful features to recognize the audio. The Mel-spectrogram is an efficient representation for audio processing, and each audio sample is sampled at 8 kHz. In the experiment, we use the Python package librosa for data processing with the following parameters: n_fft = 1024, hop_length = 512, n_mels = 128. We then call the power_to_db function to convert the power spectrum (squared amplitude) to decibel (dB) units.
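A minimal sketch of this step with librosa, using the parameters stated above, could look like the following. The function names, the padding of short clips with `librosa.util.fix_length`, and the choice `ref=np.max` for the dB reference are assumptions; the text only names the package, the three parameters, and `power_to_db`.

```python
def expected_frames(num_samples, hop_length=512):
    """Number of STFT frames produced by librosa's default centered framing."""
    return 1 + num_samples // hop_length


def mel_spectrogram_db(path, sr=8000, n_fft=1024, hop_length=512, n_mels=128):
    """Load 1 s of audio and return its Mel-spectrogram in dB."""
    # Imported here so the helper above works without librosa installed.
    import numpy as np
    import librosa

    y, sr = librosa.load(path, sr=sr, duration=1.0)
    y = librosa.util.fix_length(y, size=sr)  # pad short clips to exactly 1 s (assumption)
    power = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    # ref=np.max is an assumed reference level; the paper does not state one.
    return librosa.power_to_db(power, ref=np.max)
```

Under these settings a 1 s clip at 8 kHz gives `expected_frames(8000)` = 16 frames, so the resulting array has 128 Mel bands by 16 frames before it is rendered as an image.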
In Figure 2, we show some examples of Mel-spectrograms. As can be seen from the figure, different types of sound produce visibly different spectrograms. After noise is mixed in, however, some details are masked, which helps us test the cough recognition performance of the model in realistic scenes. We extract the features of the audio and render them as feature images, which have three channels like conventional color images.
For image input, we normalize the data to make the model converge faster. For the Mel-spectrogram, we calculate the mean and standard deviation of each of the three channels and then normalize each channel. The normalization formula is as follows:
$$x_{norm} = \frac{x - \mathrm{mean}(x)}{\mathrm{std}(x)} \quad (1)$$
where $x$ denotes the values in a given channel, $\mathrm{mean}(x)$ and $\mathrm{std}(x)$ denote the mean and standard deviation of that channel, and $x_{norm}$ denotes the normalized values.
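The per-channel normalization can be illustrated with plain Python. The function names are hypothetical, the population standard deviation is an assumed convention, and in practice the statistics would be computed over the training set rather than a single image.

```python
import math


def normalize_channel(values):
    """Standardize one channel: subtract the mean, divide by the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)  # population std (assumption)
    if std == 0.0:
        return [0.0 for _ in values]  # constant channel: nothing to scale
    return [(v - mean) / std for v in values]


def normalize_image(channels):
    """Normalize each channel (a flat list of pixel values) independently."""
    return [normalize_channel(c) for c in channels]
```

After this step each channel has zero mean and unit variance, which is the property that helps gradient-based training converge faster.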
Loss Function
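The recognition loss function of the model, $L_{rec}$, is the cross-entropy loss:
$$L_{rec} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right] \quad (2)$$
where $\hat{y}$ is the model output, $y$ is the true label, and $n$ is the number of samples.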
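For a two-class problem such as cough versus other sounds, the cross-entropy loss is commonly written in its binary form. The sketch below assumes that form, with labels in {0, 1} and model outputs interpreted as cough probabilities; the function name and the clamping constant `eps` are hypothetical.

```python
import math


def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy between labels (0 or 1) and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)
```

Confident correct predictions drive the loss toward zero, while a prediction of 0.5 for every sample gives a loss of ln 2.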
Convolutional Neural Network
With the development of deep learning, more and more deep learning methods are applied to various scenarios, such as image recognition, image classification, speech recognition, machine translation, etc. As a kind of deep learning method, Convolutional Neural Networks (CNN) are widely used in the field of computer vision. In this section, we introduce the components of the proposed CNN-based network.
The convolutional layer is the key component of a CNN model. It effectively reduces the number of model parameters, which makes the model feasible to optimize. The calculation formula for the convolutional layer is as follows:
$$x_j^n = f\left(\sum_{i \in M_j} x_i^{n-1} * k_{ij}^n + b^n\right) \quad (3)$$
where $x_j^n$ is the output feature map, $x_i^{n-1}$ is the input feature map, $M_j$ is the selected area in the $(n-1)$-th layer, $k_{ij}^n$ is the weight parameter, $b^n$ is the bias, and $f$ is the activation function.
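To make the convolution formula concrete, here is a naive sketch for a single input feature map and a single kernel. It assumes no padding, a stride of 1, and ReLU as the activation; the function names are hypothetical, and a real layer would also sum over several input maps.

```python
def relu(v):
    """Rectified linear unit."""
    return v if v > 0.0 else 0.0


def conv2d_single(feature_map, kernel, bias=0.0, activation=relu):
    """Slide `kernel` over `feature_map` (no padding, stride 1), add bias, apply f."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(feature_map) - kh + 1
    out_w = len(feature_map[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = bias
            # Weighted sum over the local area covered by the kernel.
            for i in range(kh):
                for j in range(kw):
                    acc += feature_map[r + i][c + j] * kernel[i][j]
            row.append(activation(acc))
        output.append(row)
    return output
```

Because the same small kernel is reused at every position, the layer needs far fewer weights than a fully connected layer over the same input.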
After each convolutional layer, we apply batch normalization to keep the outputs of the convolutional layer identically distributed, which can improve the performance of the model. The batch normalization formula is as follows:
$$y_i = \gamma \frac{x_i - u}{\sqrt{\sigma^2 + \epsilon}} + \beta \quad (4)$$
where $x_i$ is the output of the convolutional layer before activation, $u$ is the mean of $x$, $\sigma^2$ is the variance of $x$, $\epsilon$ is a small constant that prevents division by zero, and $\gamma$ and $\beta$ are learnable parameters.
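Batch normalization in its standard form can be sketched for a one-dimensional batch of activations as below. The function name, the default values of `gamma` and `beta`, and the stabilizing constant `eps` are assumptions for illustration.

```python
import math


def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch to zero mean and unit variance, then scale and shift."""
    n = len(xs)
    u = sum(xs) / n                          # batch mean
    var = sum((x - u) ** 2 for x in xs) / n  # batch variance
    return [gamma * (x - u) / math.sqrt(var + eps) + beta for x in xs]
```

With `gamma = 1` and `beta = 0` the output has zero mean; the learned parameters let the network restore any scale and offset it finds useful.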
After feature extraction by the convolutional layer, the number of connections between layers is significantly reduced, but the number of neurons in the feature maps is not. Therefore, like other common models, we add max pooling layers to reduce it.
In the end, we use the fully connected layer as the output layer of the model. The calculation for the fully connected layer is:
$$y_j = f\left(\sum_{i=1}^{N} x_i * w_{ij} + b_j\right) \quad (5)$$
where $x$ is the input layer, $N$ is the number of input layer nodes, $w_{ij}$ is the weight of the link between $x_i$ and $y_j$, $b_j$ is the bias, and $f$ is the activation function.
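The fully connected computation can be written directly from its definition. In this sketch the function name is hypothetical, `weights[i][j]` is assumed to hold the weight linking input `i` to output `j`, and the identity is used as the default activation.

```python
def fully_connected(x, weights, biases, activation=lambda v: v):
    """Compute y_j = f(sum_i x_i * w_ij + b_j) for every output node j."""
    return [
        activation(sum(x[i] * weights[i][j] for i in range(len(x))) + biases[j])
        for j in range(len(biases))
    ]
```

For a two-class output layer, `biases` has two entries and the two resulting values are the scores for cough and for other sounds.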
FIGURE 2 | Mel-spectrograms of different voices.
Experiment Approach
The CRN was trained with the Adam optimizer at a learning rate of 0.0001. The maximum number of epochs and the batch size were 20 and 64, respectively. The CRN was implemented in PyTorch and trained and tested on a computer with an Intel Core i7-8750H, two 8 GB memory chips (DDR4), and a GPU (Nvidia GeForce GTX 1060 6G).
Dataset Description
Before training, we need to preprocess the audio data. As mentioned above, we obtained 34,320 augmented cough samples (17,160 cough + human sound samples and 17,160 cough + natural sound samples), together with 17,050 human sound samples and 17,919 samples of different noises. Figure 3 shows the data components. In order to evaluate the model better, we divide the processed dataset in two ways.
Random Division Dataset
After all data are processed, 80% are randomly selected as the training set, 10% as the validation set, and 10% as the test set. However, because of data augmentation, samples derived from the same original cough can appear in both the training and test sets, so some test data may leak cough features seen during training.
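A random 80/10/10 division can be sketched with the standard library alone. The function name and the fixed seed are hypothetical; any shuffling utility would serve the same purpose.

```python
import random


def random_split(samples, train=0.8, val=0.1, seed=0):
    """Shuffle the samples and cut them into training, validation, and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # the remainder becomes the test set
    )
```

Since the shuffle ignores where each augmented clip came from, two clips made from the same cough recording can land on opposite sides of the split, which is the leakage described above.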
No-Leakage Division Dataset
After all data are processed, we use about 80% of the data, augmented from one set of cough recordings, as the training set, and 10%, augmented from completely different cough audio, as the test set. In this way, the cough sounds of the training and test sets come from different original data, so that we can evaluate the generalization ability of the model.
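One way to obtain such a division is to split by original recording rather than by clip. The sketch below assumes that each augmented sample carries the identifier of the cough recording it was derived from; the `(source_id, clip)` layout and the function name are hypothetical.

```python
import random


def no_leakage_split(samples, test_fraction=0.1, seed=0):
    """Split (source_id, clip) pairs so that no source appears in both sets."""
    source_ids = sorted({sid for sid, _ in samples})
    random.Random(seed).shuffle(source_ids)
    n_test = max(1, int(len(source_ids) * test_fraction))
    test_ids = set(source_ids[:n_test])
    train = [s for s in samples if s[0] not in test_ids]
    test = [s for s in samples if s[0] in test_ids]
    return train, test
```

Every clip augmented from a held-out recording goes to the test set, so test accuracy reflects coughs the model has never heard in any form.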
After all data are split, the mean and variance of each channel are calculated, and the data are normalized to make the model converge faster.
Performance Measurements
In order to better evaluate the performance of the model, we use the following indicators.
Accuracy
The proportion of all samples that are classified correctly.
Precision
The ratio of the number of samples correctly recognized as coughing to the total number of samples recognized as coughing.
Recall
The ratio of the number of samples correctly recognized as coughing to the number of samples that should be recognized as coughing.
F1 Score
An index that combines precision and recall to measure the accuracy of the binary classification model.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (6)$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (7)$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (8)$$
$$\mathrm{F1\ Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Recall} + \mathrm{Precision}} \quad (9)$$
where TP (True Positive) denotes samples of coughing that are correctly recognized by the model, FP (False Positive) denotes samples of other sounds that are incorrectly recognized as coughing, TN (True Negative) denotes samples of other sounds that are correctly recognized by the model, and FN (False Negative) denotes samples of coughing that are incorrectly recognized as other sounds.
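These four indicators follow directly from the confusion-matrix counts. The helper below is an illustrative sketch with a hypothetical name; it returns 0.0 where a ratio would otherwise divide by zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, recall, precision, and F1 score from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}
```

For example, 8 true positives, 9 true negatives, 1 false positive, and 2 false negatives give an accuracy of 0.85 and a recall of 0.8.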
Experiment Based on Mel-Spectrogram
The Mel-spectrogram is an effective tool to extract hidden features from audio and visualize them as an image. A CNN model can effectively extract features from images and then complete tasks such as classification and recognition. Therefore, we use the CNN model to classify the audio and to realize accurate recognition and detection of coughing. Figure 4 shows the architecture of this model.
Because coughing can occur at different positions within an audio clip, its relative position in the spectrogram image also varies. Before we feed the image into the network, we first resize it to 256 × 256 and then randomly crop a 224 × 224 region, so that the model learns to recognize coughs at different positions.
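The random cropping can be illustrated on a single-channel image stored as a list of rows. The function name and the `rng` argument are hypothetical; resizing to 256 × 256 is assumed to have been done beforehand.

```python
import random


def random_crop(image, crop_size=224, rng=random):
    """Return a randomly positioned crop_size x crop_size window of a 2D image."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_size)   # randint is inclusive on both ends
    left = rng.randint(0, width - crop_size)
    return [row[left:left + crop_size] for row in image[top:top + crop_size]]
```

Each call shifts the cough pattern by up to 32 pixels in either direction, which exposes the network to coughs at slightly different positions in time and frequency.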
After dividing the dataset in the two ways described above and training on each, we obtain the performance of the cough recognition task.
Experiment on Random Division Dataset
As shown in Table 1, Mel-spectrogram + CNN achieves the best overall performance in cough recognition among the compared methods. For the randomly divided dataset, the recognition accuracy is 98.18%. It can be seen that the model still achieves good recognition performance even if a variety of different sounds are mixed. The train/test loss curves are presented in Figure 5.
FIGURE 3 | Data components.
Experiment on No-Leakage Division
Considering that the model needs to cope with the cough sounds of different people, we add an experiment to estimate the generalization ability of the model. In this experiment, all the cough data are augmented, but the cough sounds in the training set and the test set come from entirely different original recordings. In this way, we can test whether the model is able to recognize cough sounds produced by unfamiliar sound sources. The train/test loss curves of the no-leakage experiment are presented in Figure 6, and the experiment result is shown in Table 1. The no-leakage recognition accuracy is 95.18% and the F1 score is the highest of all methods. It can be seen that the model generalizes well in cough recognition tasks.
FIGURE 4 | The architecture of the Mel-spectrogram and CNN model.
TABLE 1 | The comparison results of different methods (all values in %).
                        | Random division recognition task      | No-leakage division recognition task
Methods                 | Accuracy  Recall  Precision  F1 Score | Accuracy  Recall  Precision  F1 Score
Mel-spectrogram + CNN   | 98.18     99.18   99.28      99.23    | 95.18     93.33   100        96.55
Mel-spectrogram + BP    | 94.34     87.50   100        93.33    | 91.44     93.75   93.75      93.75
MFCC + CNN              | 97.43     88.88   100        94.12    | 94.04     100     88.88      94.11
MFCC + BP               | 96.12     97.19   93.87      97.19    | 93.45     90.91   100        95.23
MFCC + SVM              | 95.76     96.99   94.57      95.77    | 93.29     93.56   91.79      92.67
MFCC + K-means          | 52.93     42.86   53.09      47.43    | 50.34     42.44   44.96      43.66
MFCC + Naive-bayes      | 88.57     95.31   83.83      89.20    | 78.81     82.43   73.87      77.92
MFCC + LightGBM         | 95.73     98.46   93.29      95.80    | 89.89     88.17   89.38      88.77
FIGURE 5 | The loss of the random division experiment.
FIGURE 6 | The loss of the no-leakage division experiment.
Experiment Based on Other Traditional Methods
In order to demonstrate the effectiveness of this method, we compare it with several other methods.
MFCC is an effective method for extracting audio features. We use it to preprocess the original audio data and then pass the result to the different models. To make the features suitable for the linear models, we take the average value over each dimension in the experiment.
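The averaging step can be sketched as follows, assuming the MFCC matrix is laid out as coefficients by frames (the layout librosa uses) and that "each dimension" means each coefficient. Both the layout and the function name are assumptions.

```python
def mean_over_frames(mfcc):
    """mfcc[k][t] is coefficient k at frame t; return one mean per coefficient."""
    return [sum(row) / len(row) for row in mfcc]
```

This turns a variable-length sequence of frames into a fixed-length feature vector, which is the input shape that classifiers such as an SVM or Naive Bayes expect.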
Back Propagation Network
BP is a multilayer feedforward network which has a strong
nonlinear mapping ability. In our experiment, we build a four-
layer BP neural network and the activation is ReLU.
Support Vector Machine
A Support Vector Machine (SVM) is a kind of generalized linear classifier that classifies data by supervised learning.
The K-means algorithm is an iterative clustering algorithm. First, it randomly selects K objects as the initial cluster centers. Then it calculates the distance between each object and each cluster center and assigns each object to the nearest cluster center.
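A compact sketch of K-means is given below. It follows the initialization and assignment steps described above and adds the standard update step, in which each center moves to the mean of its assigned points; the function name, the iteration count, and the seed are hypothetical.

```python
import random


def kmeans(points, k, iterations=10, seed=0):
    """Cluster points (tuples of floats) into k groups; return labels and centers."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # random initial centers
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        for idx, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            labels[idx] = dists.index(min(dists))
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centers
```

Because K-means never sees the labels, its clusters need not align with the cough and non-cough classes, which is consistent with its near-chance accuracy in Table 1.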
Naive Bayes is a classification method based on Bayes' theorem and the assumption that features are conditionally independent.
LightGBM is one of the boosting ensemble models. Like XGBoost, it is an efficient implementation of the Gradient Boosting Decision Tree (GBDT), and in principle it is similar to GBDT and XGBoost. It uses the negative gradient of the loss function as the residual approximation of the current decision tree to fit the new decision tree.
All results based on these methods are shown in Table 1, and we can find that the CNN model is better than these methods in recognition accuracy and other indicators.
In this work, we proposed a cough recognition network (CRN) based on a CNN model and a Mel-spectrogram. The experimental results on the random division and no-leakage division datasets show that the proposed CRN achieves excellent performance in cough recognition. Compared with the other methods, the CRN has the highest accuracy and the best values for most of the other indicators. To estimate the generalization ability of the model, we collected some cough sounds that were not included in training and found that the CRN can also recognize them efficiently. The experiments show that the model can recognize coughing effectively in complex scenes and correctly when it is mixed with various other sounds, which is useful for cough monitoring in daily life. Cough recognition is a potential solution for disease management during the COVID-19 pandemic and reduces epidemic prevention workers' exposure.
Although the model has achieved good recognition results, some problems remain to be solved. For example, the audio length is currently limited to 1 s, so when a segment is cut at an unsuitable position, the sound may be misclassified.
Publicly available datasets were analyzed in this study. These data can be found here: ESC-50 Dataset; commands_v0.02.tar.gz (Speech Commands Dataset).
BF proposed the idea of the paper. QZ and JS designed the network
and wrote the manuscript. WD, CW, and SY wrote the code and
analyzed the results. FS and HL helped improve the paper.
This work was supported by the National Key Research and Development Program of China (2017YFE0113200), the National Natural Science Foundation of China (Grant No. 91848206), the Beijing Science & Technology Project (Grant No. Z191100008019008), and the Natural Science Foundation of university in Anhui Province (No. KJ 2019A0086).
Amoh, J., and Odame, K. (2016). Deep Neural Networks for Identifying Cough Sounds. IEEE Trans. Biomed. Circuits Syst. 10, 1003-1011. doi:10.1109/TBCAS.
Drugman, T., Urbain, J., and Dutoit, T. (2011). "Assessment of Audio Features for Automatic Cough Detection," in 2011 19th European Signal Processing Conference, Barcelona, Spain, 29 Aug.-2 Sept. 2011 (IEEE), 1289-1293.
Drugman, T., Urbain, J., Bauwens, N., Chessini, R., Valderrama, C., Lebecque, P., et al. (2013). Objective Study of Sensor Relevance for Automatic Cough Detection. IEEE J. Biomed. Health Inform. 17, 699-707. doi:10.1109/jbhi.
Elfaramawy, T., Fall, C. L., Arab, S., Morissette, M., Lellouche, F., and Gosselin, B. (2019). A Wireless Respiratory Monitoring System Using a Wearable Patch Sensor Network. IEEE Sensors J. 19, 650-657. doi:10.1109/JSEN.2018.2877617
Hoa, H., Tran, A., and Dat, T. (2011). "Semi-supervised Tree Support Vector Machine for Online Cough Recognition," in 12th Annual Conference of the International Speech Communication Association (Florence, Italy: ISCA), 1637-1640.
Hoyos-Barcelo, C., Monge-Alvarez, J., Zeeshan Shakir, M., Alcaraz-Calero, J.-M., and Casaseca-de-la-Higuera, P. (2018). Efficient K-NN Implementation for Real-Time Detection of Cough Events in Smartphones. IEEE J. Biomed. Health Inform. 22, 1662-1671. doi:10.1109/JBHI.2017.2768162
Infante, C., Chamberlain, D. B., Kodgule, R., and Fletcher, R. R. (2017). Classification of Voluntary Coughs Applied to the Screening of Respiratory Disease. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. 2017, 1413-1416. doi:10.
Ittichaichareon, C., Suksri, S., and Yingthawornsuk, T. (2012). Speech Recognition Using Mfcc. Int. Conf. Comp. Grap. Simula. Model., 135-138. doi:10.13140/RG.
Piczak, K. J. (2015). Esc: Dataset for Environmental Sound Classification, 1015-1018. doi:10.1145/2733373.2806390
Pucik, J., Kubinec, P., and Ondracek, O. (2014). "Fft with Modified Frequency Scale for Audio Signal Analysis," in International Conference Radioelektronika, Bratislava, Slovakia, 15-16 April 2014 (IEEE), 1-4.
Pundak, G., and Sainath, T. (2017). "Highway-LSTM and Recurrent Highway Networks for Speech Recognition," in Proceedings of Interspeech 2017, 1303-1307. doi:10.21437/Interspeech.2017-429
Rippel, O., Snoek, J., and Adams, R. P. (2015). Spectral Representations for Convolutional Neural Networks. arXiv.
Sattar Hashmi, H. A., and Asif, H. M. (2020). Early Detection and Assessment of Covid-19. Front. Med. 131, 311. doi:10.3389/fmed.2020.00311
Shintri, R. G., and Bhatia, S. K. (2015). Analysis of Mfcc and Multitaper Mfcc Feature Extraction Methods. Int. J. Comput. Appl. 131, 7-10. doi:10.5120/ijca2015906883
Trianto, R., Tai, T.-C., and Wang, J.-C. (2018). Fast-lstm Acoustic Model for Distant Speech Recognition. IEEE Inter. Confer. Consu. Electro. (ICCE) 2018, 1-4. doi:10.1109/ICCE.2018.8326195
Vhaduri, S., Kessel, T. V., Ko, B., Wood, D., Wang, S., and Brunschwiler, T. (2019). Nocturnal Cough and Snore Detection in Noisy Environments Using Smartphone-Microphones. IEEE Inter. Conf. Health. Infor. (ICHI) 2019, 1-7. doi:10.1109/ICHI.2019.8904563
Wang, Y., and Hu, W. (2018). Speech Emotion Recognition Based on Improved Mfcc. Inter. Confe. Compu. Sci. Appli. Engin 88, 1-7. doi:10.1145/3207677.
Warden, P. (2018). Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv. doi:10.2172/1635786
Xie, C., Cao, X., and He, L. (2012). Algorithm of Abnormal Audio Recognition Based on Improved Mfcc. Proced. Eng. 29, 731-737. doi:10.1016/j.proeng.2012.
Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Copyright © 2021 Zhou, Shan, Ding, Wang, Yuan, Sun, Li and Fang. This is an
open-access article distributed under the terms of the Creative Commons Attribution
License (CC BY). The use, distribution or reproduction in other forums is permitted,
provided the original author(s) and the copyright owner(s) are credited and that the
original publication in this journal is cited, in accordance with accepted academic
practice. No use, distribution or reproduction is permitted which does not comply
with these terms.
Frontiers in Robotics and AI | May 2021 | Volume 8 | Article 5800807
Zhou et al. Cough Recognition
... A spectrogram represents short time periods of a signal and the power spectrum for different frequency ranges, and it can be visualized through an image for easy interpretation (Sharma et al., 2022). In our study, we utilized the Mel spectrogram to represent the COVID-19 respiratory sounds (Zhou et al., 2021). ...
... As the human ear does not perceive frequencies on a linear scale (lower frequencies Proposed scheme for COVID-19 screening using cough, speech, and breath sounds. are better to discriminate than higher frequencies), the main idea of the Mel scale is to mimic the non-linear human ear perception (Nanni et al., 2021;Zhou et al., 2021). Each frame of the spectrum is passed through a Mel filter bank, and the conversion between Hertz (f) and Mel (m) can be calculated using Eq. 1. m 2595 log 10 1 + 700f . ...
... Since the COVID-19 outbreak, several research studies have been conducted to infer infection by COVID-19 (Brown et al., 2020;Schuller et al., 2021;Zhou et al., 2021;Pahar et al., 2022;Pleva et al., 2022;Sharma et al., 2022;Villa-Parra et al., 2022). From the experiments, we can find that the proposed audio texture feature extraction can achieve a good performance in COVID-19 screening. ...
Full-text available
Since the COVID-19 outbreak, a major scientific effort has been made by researchers and companies worldwide to develop a digital diagnostic tool to screen this disease through some biomedical signals, such as cough, and speech. Joint time–frequency feature extraction techniques and machine learning (ML)-based models have been widely explored in respiratory diseases such as influenza, pertussis, and COVID-19 to find biomarkers from human respiratory system-generated acoustic sounds. In recent years, a variety of techniques for discriminating textures and computationally efficient local texture descriptors have been introduced, such as local binary patterns and local ternary patterns, among others. In this work, we propose an audio texture analysis of sounds emitted by subjects in suspicion of COVID-19 infection using time–frequency spectrograms. This approach of the feature extraction method has not been widely used for biomedical sounds, particularly for COVID-19 or respiratory diseases. We hypothesize that this textural sound analysis based on local binary patterns and local ternary patterns enables us to obtain a better classification model by discriminating both people with COVID-19 and healthy subjects. Cough, speech, and breath sounds from the INTERSPEECH 2021 ComParE and Cambridge KDD databases have been processed and analyzed to evaluate our proposed feature extraction method with ML techniques in order to distinguish between positive or negative for COVID-19 sounds. The results have been evaluated in terms of an unweighted average recall (UAR). The results show that the proposed method has performed well for cough, speech, and breath sound classification, with a UAR up to 100.00%, 60.67%, and 95.00%, respectively, to infer COVID-19 infection, which serves as an effective tool to perform a preliminary screening of COVID-19.
... The mixing process changes the sound of the data based on the used noise. These changes provide wider data coverage, which were benefit the real scene [22]. This process gives a total of 5352 audio data divided into eight labels. ...
Full-text available
The development of transportation technology is increasing every day; it impacts the number of transportation and their users. The increase positively impacts the economy's growth but also has a negative impact, such as accidents and crime on the highway. In 2018, the number of accidents in Indonesia reached 109,215 cases, with a death rate of 29,472 people, which was mostly caused by the late treatment of the casualties. On the other hand, in the same year, there were 8,423 mugs, and 90,757 snitches cases in Indonesia, with only 23.99% of cases reported. This low reporting rate is mostly caused by the lack of awareness and knowledge about where to report. Therefore, a quick response surveillance system is needed. In this study, an audio-based accident and crime detection system was built using a neural network. To improve the system's robustness, we enhance our dataset by mixing it with certain noises which likely to occur on the road. The system was tested with several parameters of segment duration, bandpass filter cut-off frequency, feature extraction, architecture, and threshold values to obtain optimal accuracy and performance. Based on the test, the best accuracy was obtained by convolutional neural network architecture using 200ms segment duration, 0.5 overlap ratio, 100Hz and 12000Hz as bandpass cut-off frequency, and a threshold value of 0.9. By using mentioned parameters, our system gives 93.337% accuracy. In the future, we hope to implement this system in a real environment.
... For pre-processing the CXR images, all the images were converted into greyscale and resized to 150 × 150 pixels due to the GPU memory limitation. Meanwhile, for the cough audio files were converted from .wav to Mel-spectrogram using the short-time Fourier transform [15]. Tables 1-3 shows the dataset split for each model at the final stage of pre-processing. ...
Full-text available
The present work relates to the implementation of core parallel architecture in a deep learning algorithm. At present, deep learning technology forms the main interdisciplinary basis of healthcare, hospital hygiene, biological and medicine. This work establishes a baseline range by training hyperparameter space, which could be support images, and sound with further develop a parallel architectural model using multiple inputs with and without the patient’s involvement. The chest X-ray images input could form the model architecture include variables for the number of nodes in each layer and dropout rate. Fourier transformation Mel-spectrogram images with the correct pixel range use to covert sound acceptance at the convolutional neural network in embarrassingly parallel sequences. COVIDNet the end user tool has to input a chest X-ray image and a cough audio file which could be a natural cough or a forced cough. Three binary classification models (COVID-19 CXR, non-COVID-19 CXR, COVID-19 cough) were trained. The COVID-19 CXR model classifies between healthy lungs and the COVID-19 model meanwhile the non-COVID-19 CXR model classifies between non-COVID-19 pneumonia and healthy lungs. The COVID-19 CXR model has an accuracy of 95% which was trained using 1681 COVID-19 positive images and 10,895 healthy lungs images, meanwhile, the non-COVID-19 CXR model has an accuracy of 91% which was trained using 7478 non-COVID-19 pneumonia positive images and 10,895 healthy lungs. The reason why all the models are binary classification is due to the lack of available data since medical image datasets are usually highly imbalanced and the cost of obtaining them are very pricey and time-consuming. Therefore, data augmentation was performed on the medical images datasets that were used. Effects of parallel architecture and optimization to improve on design were investigated.
... 48 The simple CNN that we use is similar in structure to AlexNet. 49 Although CNNs were developed for image recognition, they have also been used to analyse speech and other sounds either by treating the spectrogram as an image [50][51][52][53] or applying convolutional filters to the raw signal. 35,36 For readers unfamiliar with CNNs, we briefly explain the architecture. ...
A method is presented for combining the feature extraction power of neural networks with model based dimensionality reduction to produce linguistically motivated low dimensional measurements of sounds. This method works by first training a convolutional neural network (CNN) to predict linguistically relevant category labels from the spectrograms of sounds. Then, idealized models of these categories are defined as probability distributions in a low dimensional measurement space with locations chosen to reproduce, as far as possible, the perceptual characteristics of the CNN. To measure a sound, the point is found in the measurement space for which the posterior probability distribution over categories in the idealized model most closely matches the category probabilities output by the CNN for that sound. In this way, the feature learning power of the CNN is used to produce low dimensional measurements. This method is demonstrated using monophthongal vowel categories to train this CNN and produce measurements in two dimensions. It is also shown that the perceptual characteristics of this CNN are similar to those of human listeners.
... AI-guided tools can detect coughs and diagnose Covid-19 patients based on features extracted from audio recordings, such as MFCCs [6][7][8]. In one study, Zhou et al. (2021) demonstrated that the Cough Recognition Network (CRN), which they propose as a solution for cough recognition based on a Mel-spectrogram and a deep learning-based Convolutional Neural Network, is capable of effectively differentiating cough sounds [9]. ...
Early detection of infectious disease is a must to prevent multiple infections, and Covid-19 is an example. In the Covid-19 pandemic, cough is still ubiquitously present as one of the key symptoms in both severe and non-severe Covid-19 infections, even though symptoms appear differently in different sociodemographic categories. Given the importance of clinical studies, analyzing cough sounds using AI-driven tools could add value to decision-making. Moreover, for mass screening and for serving resource-constrained regions, AI-driven tools are a must. In this thesis, Convolutional Neural Network (CNN) tailored deep learning models are studied to analyze cough sounds to detect possible evidence of Covid-19. In addition to a custom CNN, pre-trained deep learning models (e.g., Vgg-16, Resnet-50, MobileNetV1, and DenseNet121) are employed on a publicly available dataset. In our findings, the custom CNN performed comparatively better than the pre-trained deep learning models.
... Moreover, as stated in the methodology section, 4 classification algorithms (MLP, SVM, Gradient Boosting and Random Forest) are analyzed and compared with a MFCC-CNN traditional approach. The choice to use the MFCC-CNN approach as a comparative basis against the other models comes from the fact that it is a well established method in the literature for vibration, acoustic and sound signals classification, as can be seen in [1,5,22-28], among many other works. After some preliminary experiments, optimal parameters for each model under investigation were chosen, as follows: ...
... Given the mammoth size of the dataset and the image processing task, it is not feasible to train a model of this size on even an above-average spec computer. The nature of the computation is convolutional, which allows computing on the graphics processing unit (GPU), which accelerates the whole process by up to 10 times, depending on the GPU being used [38][39][40]. For these computing resources, we decided to rely on Google Colaboratory. ...
Automatic chord recognition has always been approached as a broad music audition task. The desired output is a succession of time-aligned discrete chord symbols, such as GMaj and Asus2. Automatic music transcription is the process of converting a musical recording into a human-readable and interpretable representation. When dealing with polyphonic sounds or removing certain limits, automatic music transcription remains a difficult undertaking. A guitar, for example, presents a greater challenge, as guitarists can play the same note in a variety of places. The study makes use of CNN functionality to generate the guitar tab; initially, the constant-Q transform was used to turn the input audio file into short time spectrograms that the CNN model utilises to analyse the chord. The paper developed a method for extracting chord sequences and notes from audio recordings of solo guitar performances. For intervals in the supplied audio, the proposed approach outputs chord names and fret-board notes. The model described here has been refined to achieve an accuracy of 88.7%. The model’s ability to properly tag audio clips is an incredible advancement.
... Data augmentation can increase the data size for training. Widely used audio data augmentation methods include time stretch, pitch shift, perturbation, and noise injection on raw signals 38,39 and masking or mix-up augmentation on spectrograms. 24,40,41 Besides, the collected audio databases are class-imbalanced with a skewed distribution of the associated respiratory conditions. ...
Auscultation plays an important role in the clinic, and the research community has been exploring machine learning (ML) to enable remote and automatic auscultation for respiratory condition screening via sounds. To give the big picture of what is going on in this field, in this narrative review, we describe publicly available audio databases that can be used for experiments, illustrate the developed ML methods proposed to date, and flag some under-considered issues which still need attention. Compared to existing surveys on the topic, we cover the latest literature, especially those audio-based COVID-19 detection studies which have gained extensive attention in the last two years. This work can help to facilitate the application of artificial intelligence in the respiratory auscultation field.
Currently, a subjective method is used to diagnose cough sounds, particularly wet and dry coughs, which can lead to incorrect diagnoses. In this study, novel emergent features were extracted using spectrogram methods and a parallel-stream one-dimensional (1D) deep convolutional neural network (DCNN) to classify cough sounds. The data of this study were obtained from two datasets. We employed the Mel spectrogram, chromagram constant- $Q$ transform, Mel-frequency cepstral coefficient, constant- $Q$ cepstral coefficient, and linear predictive code coefficient to conduct features analysis. The maximum, mean, variance, and standard deviation values of the original spectrogram as well as the maximum first and second derivatives of this spectrogram were extracted and fused to create a single-feature vector. We adopted two types of features: single features and combined features. Each design was restructured according to the magnitude of features with high discrimination power. A parallel-stream 1D-DCNN was developed for classifying cough sounds accurately. We compared the results obtained using the aforementioned network with those obtained using a single-stream 1D-DCNN. We found that the parallel-stream network outperformed the single-stream network for some feature sets. The developed network achieved $F1$ scores of 98.61% and 82.96% for the first and second datasets, respectively. The concatenation of layers at the flattening level resulted in an $F1$ score of 99.30% in dataset one. Moreover, layer merging strategies exhibited a better performance at the second convolutional layer level than at the flattening layer level in many cases.
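The statistical fusion step described in this abstract (maximum, mean, variance, and standard deviation of a spectrogram, plus the maxima of its first and second derivatives, fused into one vector) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `finite_difference` and `summary_features` are hypothetical, the derivatives are assumed to be finite differences along the time axis, and population variance is assumed.

```python
from statistics import fmean, pstdev, pvariance


def finite_difference(values):
    """First-order finite difference of a sequence (assumed derivative estimate)."""
    return [b - a for a, b in zip(values, values[1:])]


def summary_features(spectrogram):
    """Fuse summary statistics of a spectrogram into a single feature vector.

    `spectrogram` is a list of rows (one per frequency bin), each a list of
    values over time. Each row needs at least 3 frames so that the second
    difference is non-empty.
    """
    flat = [v for row in spectrogram for v in row]
    # Assumption: derivatives are taken along the time axis of each frequency row.
    first = [d for row in spectrogram for d in finite_difference(row)]
    second = [
        d for row in spectrogram
        for d in finite_difference(finite_difference(row))
    ]
    return [
        max(flat),        # maximum of the spectrogram
        fmean(flat),      # mean
        pvariance(flat),  # variance (population variance is an assumption)
        pstdev(flat),     # standard deviation
        max(first),       # maximum first derivative
        max(second),      # maximum second derivative
    ]
```

A vector of this kind is one-dimensional, which is consistent with feeding it to a 1D convolutional network as the abstract describes; how the authors ordered or scaled the entries is not stated here.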
Background: Since the Covid-19 global pandemic emerged, developing countries have been facing multiple challenges over its diagnosis. We aimed to establish a relationship between the signs and symptoms of COVID-19 for early detection and assessment to reduce the transmission rate of SARS-Cov-2. Methods: We collected published data on the clinical features of Covid-19 retrospectively and categorized them into physical and blood biomarkers. Common features were assigned scores by the Borg scoring method with slight modifications and were incorporated into a newly-developed Hashmi-Asif Covid-19 assessment Chart. Correlations between signs and symptoms and the development of Covid-19 were assessed by the Pearson correlation and the Spearman correlation coefficient (rho). Linear regression analysis was employed to assess the highest correlating features. The frequency of signs and symptoms in developing Covid-19 was assessed through a two-tailed Chi-square test with Cramer's V strength. Changes in signs and symptoms were incorporated into a chart that consisted of four tiers representing disease stages. Results: Data from 10,172 Covid-19 laboratory confirmed cases showed a correlation with Fever in 43.9% (P = 0.000) cases, cough 54.08% and dry mucus 25.68% equally significant (P = 0.000), Hyperemic pharyngeal mucus membrane 17.92% (P = 0.005), leukopenia 28.11% (P = 0.000), lymphopenia 64.35% (P = 0.000), thrombopenia 35.49% (P = 0.000), elevated Alanine aminotransferase 50.02% (P = 0.000), and Aspartate aminotransferase 34.49% (P = 0.000). The chart exhibited a maximum scoring of 39. Normal tier scoring was ≤ 12/39, mild state scoring was 13–22/39, and star values scoring was ≥7/15; this latter category on the chart means Covid-19 is progressing and quarantine should be adopted. Moderate stage scored 23–33 and severe scored 34–39 in the chart.
Conclusion: The Hashmi-Asif Covid-19 Chart is significant in assessing subclinical and clinical stages of Covid-19 to reduce the transmission rate.
Describes an audio dataset of spoken words designed to help train and evaluate keyword spotting systems. Discusses why this task is an interesting challenge, and why it requires a specialized dataset that is different from conventional datasets used for automatic speech recognition of full sentences. Suggests a methodology for reproducible and comparable accuracy metrics for this task. Describes how the data was collected and verified, what it contains, previous versions and properties. Concludes by reporting baseline results of models trained on this dataset.
The potential of telemedicine in respiratory health care has not been completely unveiled in part due to the inexistence of reliable objective measurements of symptoms such as cough. Currently available cough detectors are uncomfortable and expensive at a time when generic smartphones can perform this task. However, two major challenges preclude smartphone-based cough detectors from effective deployment namely, the need to deal with noisy environments and computational cost. This paper focuses on the latter, since complex machine learning algorithms are too slow for real-time use and kill the battery in a few hours unless specific actions are taken. In this paper, we present a robust and efficient implementation of a smartphone-based cough detector. The audio signal acquired from the device's microphone is processed by computing local Hu moments as a robust feature set in the presence of background noise. We previously demonstrated that pairing Hu moments and a standard k-NN classifier achieved accurate cough detection at the expense of computation time. To speed-up k-NN search, many tree structures have been proposed. Our cough detector uses an improved vp-tree with optimized construction methods and a distance function that results in faster searches. We achieve 18x speed-up over classic vp-trees, and 560x over standard implementations of k-NN in state-of-the-art machine learning libraries, with classification accuracies over 93%, enabling real-time performance on low-end smartphones.
Wireless body sensors are increasingly used by clinicians and researchers in a wide range of applications, such as sports, space engineering, and medicine. Monitoring vital signs in real time can dramatically increase diagnosis accuracy and enable automatic curing procedures, e.g., detect and stop epilepsy or narcolepsy seizures. Breathing parameters are critical in oxygen therapy, hospital, and ambulatory monitoring, while the assessment of cough severity is essential when dealing with several diseases, such as chronic obstructive pulmonary disease. In this paper, a low-power wireless respiratory monitoring system with cough detection is proposed to measure the breathing rate and the frequency of coughing. This system uses wearable wireless multimodal patch sensors, designed using off-the-shelf components. These wearable sensors use a low-power nine-axis inertial measurement unit to quantify the respiratory movement and a MEMS microphone to record audio signals. Data processing and fusion algorithms are used to calculate the respiratory frequency and the coughing events. The architecture of each wireless patch-sensor is presented. In fact, the results show that the small 26.67 × 65.53 mm² patch-sensor consumes around 12-16.2 mA and can last at least 6 h with a miniature 100-mA lithium ion battery. The data processing algorithms, the acquisition, and wireless communication units are described. The proposed network performance is presented for experimental tests with a freely behaving user in parallel with the gold standard respiratory inductance plethysmography.
Speech Emotion Recognition (SER) is performed on the Berlin EMO-DB database, which covers seven emotions. Traditional emotional features and their statistics are used in SER. Two improved Mel Frequency Cepstrum Coefficient (MFCC) features are added in this experiment, extracting MFCC parameters from the energy curve and the fundamental frequency curve: energy MFCC (EEMFCC) and fundamental frequency MFCC (F0MFCC). Using Support Vector Machines (SVM) as the recognition machine, we obtained the highest average recognition rate of 85.37% for the seven categories and 100% for sad.
Pulmonary and respiratory diseases (e.g. asthma, COPD, allergies, pneumonia, tuberculosis, etc.) represent a large proportion of the global disease burden, mortality, and disability. In this context of creating automated diagnostic tools, we explore how the analysis of voluntary cough sounds may be used to screen for pulmonary disease. As a clinical study, voluntary coughs were recorded using a custom mobile phone stethoscope from 54 patients, of which 7 had COPD, 15 had asthma, 11 had allergic rhinitis, 17 had both asthma and allergic rhinitis, and four had both COPD and allergic rhinitis. Data were also collected from 33 healthy subjects. These patients also received full auscultation at 11 sites, were given a clinical questionnaire, and underwent full pulmonary function testing (spirometer, body plethysmograph, DLCO), which culminated in a diagnosis provided by an experienced pulmonologist. From machine learning analysis of these data, we show that it is possible to achieve good classification of cough sounds in terms of Wet vs Dry, yielding an ROC curve with AUC of 0.94, and show that voluntary coughs can serve as an effective test for determining Healthy vs Unhealthy (sensitivity=35.7% specificity=100%). We also show that the use of cough sounds can enhance the performance of other diagnostic tools such as a patient questionnaire and peak flow meter; however, voluntary coughs alone provide relatively little value in determining specific disease diagnosis.
In this paper, we consider two different approaches to using deep neural networks for cough detection. The cough detection task is cast as a visual recognition problem and as a sequence-to-sequence labeling problem. A convolutional neural network and a recurrent neural network are implemented to address these problems, respectively. We evaluate the performance of the two networks and compare them to other conventional approaches for identifying cough sounds. In addition, we also explore the effect of the network size parameters and the impact of long-term signal dependencies on cough classifier performance. Experimental results show that both network architectures outperform traditional methods. Between the two, our convolutional network yields a higher specificity of 92.7%, whereas the recurrent network attains a higher sensitivity of 87.7%.
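Several of the abstracts above report sensitivity and specificity for cough detectors. As a reminder of how these two figures are derived from binary predictions, here is a minimal sketch using the standard definitions; the function name `sensitivity_specificity` and the label convention (1 = cough, 0 = non-cough) are assumptions made for illustration and are not taken from any of the cited papers.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels.

    Assumed convention: 1 marks a cough (positive), 0 marks a non-cough.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # coughs detected
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # coughs missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-coughs rejected
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity, specificity
```

A detector can trade one figure against the other by moving its decision threshold, which is why the abstracts report both rather than accuracy alone.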