ArticlePDF Available

Evaluation of Classical Machine Learning Techniques towards Urban Sound Recognition on Embedded Systems


Abstract and Figures

Automatic urban sound classification is a desirable capability for urban monitoring systems, allowing real-time monitoring of urban environments and recognition of events. Current embedded systems provide enough computational power to perform real-time urban audio recognition. Using such devices for the edge computation when acting as nodes of Wireless Sensor Networks (WSN) drastically alleviates the required bandwidth consumption. In this paper, we evaluate classical Machine Learning (ML) techniques for urban sound classification on embedded devices with respect to accuracy and execution time. This evaluation provides a real estimation of what can be expected when performing urban sound classification on such constrained devices. In addition, a cascade approach is also proposed to combine ML techniques by exploiting embedded characteristics such as pipeline or multi-thread execution present in current embedded devices. The accuracy of this approach is similar to the traditional solutions, but provides in addition more flexibility to prioritize accuracy or timing.
Content may be subject to copyright.
Evaluation of Classical Machine Learning Techniques
towards Urban Sound Recognition
on Embedded Systems
Bruno da Silva 1,2,, Axel W. Happi 2, An Braeken 1and Abdellah Touhafi 1,2
1Department of Engineering Technology (INDI), Vrije Universiteit Brussel (VUB), 1050 Brussels, Belgium
Department of Electronics and Informatics (ETRO), Vrije Universiteit Brussel (VUB), 1050 Brussels, Belgium
*Correspondence:; Tel.: +32 2 6293768
Received: 16 July 2019; Accepted: 11 September 2019; Published: 16 September 2019
Automatic urban sound classification is a desirable capability for urban monitoring systems,
allowing real-time monitoring of urban environments and recognition of events. Current embedded
systems provide enough computational power to perform real-time urban audio recognition. Using
such devices for the edge computation when acting as nodes of Wireless Sensor Networks (WSN)
drastically alleviates the required bandwidth consumption. In this paper, we evaluate classical
Machine Learning (ML) techniques for urban sound classification on embedded devices with respect
to accuracy and execution time. This evaluation provides a real estimation of what can be expected
when performing urban sound classification on such constrained devices. In addition, a cascade
approach is also proposed to combine ML techniques by exploiting embedded characteristics such
as pipeline or multi-thread execution present in current embedded devices. The accuracy of this
approach is similar to the traditional solutions, but provides in addition more flexibility to prioritize
accuracy or timing.
urban sound classification; machine learning; embedded system; environment sound
recognition; audio feature extraction; edge computing.
1. Introduction
Despite the potential application of urban sound recognition in Wireless Acoustic Sensor Networks
(WASN), there exists a lack of evaluation of existing solutions. The achievable classifier’s accuracy
or the demanded time to perform the sound classification are just a couple of parameters, which
must be taken into account when considering WASN applications demanding sound recognition.
Current embedded systems provide enough computational power to perform urban noise classification,
enabling edge-computing solutions. Nevertheless, we believe that such embedded devices are currently
underused for such type of sound classification.
The presented work evaluates the most popular Machine Learning (ML) techniques for
Environment Sound Recognition (ESR), in particular for urban sound classification. Existing open
source libraries for audio analysis and datasets for ESR are used to evaluate the achievable accuracy of
classical sound classifiers. The best performing classifiers are evaluated on an embedded system in
order to determine the achievable accuracy and the execution time that could be expected on such
constrained devices. Several datasets for ESR, which partially include urban sounds are combined
here to create larger datasets targeting urban sound classification. In addition, a scalable approach
is proposed to not only exploit some characteristics of embedded devices but also to enable the
combination of different sound classifiers to prioritize accuracy or response time. The steps followed
to construct a hierarchical dataset, the selection of the audio features extracted to fetch the sound
classifiers and the parameter’s exploration for the optimization of the classifiers are presented in detail.
Appl. Sci. 2019,9, 3885; doi:10.3390/app9183885
Appl. Sci. 2019,9, 3885 2 of 27
Contributions towards urban sound recognition on embedded devices:
A cascade approach for scalable dataset customization.
Evaluation of classical ML techniques for ESR on an embedded system.
This paper is organized as follows. Section 2presents related work. The methodology used for the
evaluation of the ML techniques is detailed in Section 3. In Section 4, the proposed cascade approach
is presented. The experimental results are detailed in Section 5. Finally, our conclusions are presented
in Section 6.
2. Related Work
Several authors have compared classical ML techniques for sound classification for the evaluation
of their ESR datasets [
]. However, a discussion about the convenience of embedding ML techniques
for sound classification is missing.
The work presented in [
] is one of the few describing an environmental sound classifier fully
embedded on a wearable device. Their embedded system uses a 16-bit microcontroller (MCU) to
perform a 1D Haar-like filtering for the feature extraction combined with a Hidden Markov Model
(HMM) for the sound classification. Instead, this paper presents a more general evaluation of the most
popular sound classifiers in terms of accuracy and response time.
A Convolutional Neural Network (CNN) is proposed as a sound classifier in [
] for Wireless
Sensor Networks (WSN) composed of Raspberry Pi acting as a node. The operations on the Raspberry
Pi involve the encoding of the audio and its transmission to a computer, where the sound classification
is performed. We believe that the feature extraction and the sound classification can be executed in
such a relatively powerful embedded device. Our tests include the execution on a Raspberry Pi of all
the operations required for the sound recognition.
A distributed sensor network capable of urban sound recognition is presented in [
]. The
authors propose a CNN-based classifier able to recognize 14 different types of urban sounds, which
is embedded on a Raspberry Pi 3 acting as node of a distributed sensor network. Although all the
operations seem to be embedded, no profiling of the execution time nor discussion about the achieved
accuracy are provided.
A centralized WSN for ESR is proposed in [
]. Their approach uses distributed nodes with
minimal compute power for the audio data acquisition, compression and transmission to a central
processing node (Raspberry Pi 3), where a Gaussian Mixture Model (GMM) performs the sound
classification. An accuracy of 72% is achieved for only four categories of urban sounds. Unfortunately,
the authors do not profile the embedded executions for the sound recognition or provide details of the
overall latency of their centralized approach.
The authors in [
] discuss the achievable accuracy and the required modifications to embed on a
WSN node their Audio-Visual Correspondence (AVC) based architecture for urban sound recognition.
Despite discussing the required modifications on their architecture, only the memory requirements are
considered as the constraint parameter of the WSN node. The computational demand of the operations
required for the urban sound recognition is not discussed for such contrained devices. Our evaluation
covers not only the achievable accuracy, but also the time of response for each classifier.
A system for monitoring, analysis and mitigation of urban noise pollution is proposed in [
] as
part of the SOunds of New York City (SONYC) project. A Raspberry Pi-based system acting as WSN
nodes performs continuous Sound Pressure Level (SPL) measurements while acquiring, compressing
and transmitting 10-second audio snippets at random intervals. Although the final objective is to fully
embed the urban sound classification in the WSN node, the acquired audio is currently recognized on
a laptop machine.
To summarize, some authors have already evaluated different sound classifiers [
], but very
few have effectively executed them on an embedded system [
]. Our evaluation not only covers
Appl. Sci. 2019,9, 3885 3 of 27
a larger amount of sound classifiers but also profiles their execution on a state-of-the-art embedded
3. Methodology
3.1. Datasets
The automatic urban sound recognition uses ML techniques for the sound classification. Such
classifiers need a training period before being able to recognize a particular type of sound. Audio
datasets containing labelled urban sounds are required for such training.
3.1.1. Open Source Environmental Sound Datasets
Table 1summarizes the most popular open source datasets for urban sound and ESR, which are
used in our methodology to identify the most suitable sound recognition systems to be embedded.
Although most of the categories in these datasets are not directly urban related, they present enough
representative urban sounds to be considered for our purpose.
The authors in [
] created an audio dataset composed of 12 categories representing real-life
situations. The collected audio is from sources such as BBC Complete Sound Effects Library,Sony
Pictures Sound Effects Series and the online sound repository called Freesound [
]. Each category
has 10 audio files of ten seconds each, with great variation between them.
UrbanSound is a large audio dataset of 12 GB with around 18.5 hours of labelled
audio [
]. This dataset is exclusively composed of urban sound grouped in 10 categories. Similarly
to the BDLib, this dataset is created by manually filtering and labelling every recording from
the online sound repository called Freesound [
]. Although a larger subset of 4-second audio
clips called UrbanSound8K has been also created based on the UrbanSound dataset that is not
considered here.
An annotated collection of more than 2000 short audio clips belonging to 50 different
categories is presented in [
]. This dataset (ESC-50) comprises not only urban sounds but
also present a large number of categories including environmental recordings such as animal
sounds, domestic sounds or natural soundscapes. A shorter dataset version recognizing 10 audio
categories (ESC-10) is used in this work.
Table 1. Details of the most popular open source datasets for urban sound and ESR.
BDLib Dataset ESC-10 Dataset UrbanSound Dataset
Categories Total Time (s) Categories Total time (s) Categories Total time (s)
Airplane 100 Dog barking 200 Air conditioner 6577
Alarms 100 Baby crying 200 Car horn 4819
Applause 100 Clock tick 200 Children playing 13454
Birds 100 Person sneezing 200 Dog bark 8399
Dogs 100 Helicopter 200 Drilling 4864
Footsteps 100 Chainsaw 200 Engine idling 3604
Motorcycles 100 Rooster 200 Gun shot 7865
Rain 100 Fire cracking 200 Jackhammer 4328
Sea waves 100 Sea waves 200 Siren 4477
Wind 100 Rain 200 Street music 6453
Rivers 100
Thunderstorm 100
3.1.2. Limitations
The referred datasets are lacking in the type and number of categories for urban sounds.
Appl. Sci. 2019,9, 3885 4 of 27
Large number of categories per dataset
: Whereas the UrbanSound dataset only contains 10 urban
sound categories, the ESC-50 dataset contains 50 categories, but many are not audio recordings
from the urban environment. A larger number of categories would enable the recognition of more
urban sounds.
Small number of audio samples per categories
: The datasets do not contain a large number of
audio samples, which would benefit the training of the classifiers.
Irrelevant categories
: Sound categories such as sea waves or applause available in the BDLib dataset
are not of interest for urban sound recognition.
These limitations can be alleviated by rearranging or combining datasets. Sounds belonging to
the same category can be grouped and similar categories can be fused to create new categories. The
creation of a new dataset from the existing datasets is detailed in Section 5.4.1.
3.2. Audio Segmentation and Feature Extraction
In our experiments, we were not looking for the best feature/classifier, which led to the best
accuracy, but instead were interested in what is the achievable accuracy and the execution time of the
classic ML techniques reported in the literature.
Frame 9
Figure 1.
A 5-second input audio signal is split in frames of one second with a 50% overlap. Each
frame is sampled at 44.1 kHz and a set of audio features are extracted.
Figure 1details the audio segmentation and the feature extraction. A HANN window [
] with a
length of 44,100 samples and 50% overlap is used to divide every 5-second segment of the audio files
into frames. Since the sampling frequency is 44,100 Hz, the duration of each audio frame is one second.
The choice of the particular window is based on the fact that the sounds included in the data sets,
although non-stationary, do not contain abrupt changes in time. The choice of the sampling frequency
is based on [1].
The audio features are extracted per audio frame. The selection of the audio features can be
done using software environment tools such as Weka [
]. However, an analysis about the impact in
accuracy based on the selection of the features is discussed in Section 5.4.2. The first 12 Mel-Frequency
Cepstral Coefficients (MFCC) are extracted per audio frame. The per-frame values of each coefficient
are summarized across time using statistics. As a result, each audio frame contains a vector containing:
Appl. Sci. 2019,9, 3885 5 of 27
mean, median, minimum, maximum, variance, skewness and kurtosis. Other features, which are
extracted per frame are the spectral contrast, the spectral centroid, the Root-Mean Square (RMS), the
spectral bandwidth, the spectral roll off and the zero crossing. This results in a feature vector of 90
elements per frame. All of these feature vectors per frame conform the feature set, which is used by
the ML techniques to perform the sound classification.
3.3. Classifiers
There exist several ML techniques, which enable performing sound classifiers. For our evaluation,
we have selected supervised ML, which use correct labelled data during their training. The selected
supervised ML techniques are:
k-Nearest Neighbour (k-NN)
Naive Bayes
Artificial Neural Network (ANN)
Support Vector Machine (SVM)
Decision Tree
These ML techniques are known to deliver good performance when classifying acoustic events.
Although several classic ML techniques have been compared in [
] and in [
], there is no evaluation
discussing timing results. This is an important aspect because, due to the lack of computational
power that embedded devices suffer, some of the evaluated classifiers might be discarded due to
their execution time. Our analysis uses a random 80% of the labelled sounds of one dataset for the
training and the remaining 20% is used for the validation. While the ratio of correct recognition of that
20% of audio files provide us with the accuracy of the classifier, the time needed by the classifier to
individually recognize each sound determines its execution time.
The selected ML techniques have already been evaluated for the sound datasets described in
Section 3.1.1 like in [
] and in [
]. Although it leaves other ML techniques such as HMMs or Logistic
Regression out of our evaluation, we considered that the selected classic ML techniques provide enough
diversity for our analysis. CNNs are not considered for a different reason. CNNs have become very
popular in the last few years thanks to their use in Deep Learning architectures. Relevant publications
evaluating CNNs for sound recognition are [
]. For instance, the authors in [
] demonstrate
the potential of CNN-based sound classifiers using a limited number of labelled sounds in existing
datasets. The discussion about the computational needs of such architecture is expected to be satisfied
by general-purpose GPUs. Some authors have already considered constrained embedded devices [
for CNN-based sound classifiers. The authors in [
] discuss the feasibility of embedding a CNN on a
WSN device, but they only evaluate the required memory consumption. Many other computational
operations, such as the feature extraction or even the execution of the CNN on such a compute-limited
device, are not evaluated. Moreover, despite the use of techniques for data augmentation, the lack
of annotated audio data for supervised learning of CNN classifiers remains a bottleneck [
]. We
believe that such CNN-based sound recognizers have the potential to overcome other traditional ML
techniques for certain sound recognition, but their execution time in current embedded systems is not
(yet) faster than classical ML techniques. Although our approach is evaluated in a relatively powerful
embedded system such as a Raspberry Pi, capable enough to perform CNN-based classifiers, our
ultimate goal is to evaluate the achievable accuracy and time to response that such ML techniques can
achieve in general-purpose CPUs similar to ARM-based microprocessors. The proper evaluation of
CNNs for sound classification on embedded devices requires an extensive analysis, which is out of the
scope of this work.
3.4. Libraries
Operations such as the feature extraction or the classifiers are already supported by many existing
libraries. Table 2presents a list of related audio analysis libraries. Our tests on an embedded system
Appl. Sci. 2019,9, 3885 6 of 27
require the use of compatible libraries, which strongly restrict the options detailed in Table 2. For
instance, the Essentia library [
] enables audio analysis and feature extraction, and can be combined
with a C++ library called Gaia to implement similarity measures and classification. However, despite
being cross-platform, its implementation on ARM-based devices is not yet fully supported since only
ARM-based devices running Android O.S. are supported.
Table 2. List of some libraries for audio analysis. The selected library is highlighted in bold.
Name Language Description
pyAudioAnalysis [19,20] Python Audio Feature Extraction and Classification
Yaafe [21] Python Audio Feature Extraction
Essentia [18,22] C++ Audio Feature Extraction and Classification
aubio [23] C Audio Feature Extraction
CLAM [24] C++ Audio Feature Extraction
LibROSA, [25], [26]Python Audio Feature Extraction
Matlab Audio Analysis Library [27] Matlab Audio Feature Extraction and Classification
PyCASP [28,29] Python
Audio-Specialized library for automatic
mapping of computations onto
GPUs or multicore CPUs.
seewave [30] R Audio Feature Extraction
bob [31,32] C++/Python Audio Feature Extraction and Classification
Despite the high execution overhead (related to C for example) introduced by Python, we preferred
the use of Python packages due to the existence of an independent Python package scikit-learn
0.21.2 [
] specialized in ML techniques. Although several libraries include their own classifiers,
we use thePython package scikit-learn since our evaluation covers several types of sound classifiers.
In fact, some libraries such as pyAudioAnalysis base their support in ML techniques by using the
scikit-learn package. Regarding the feature extraction, the LibROSA Python package [
] is selected
due to supporting ARM-based processors.
4. A Cascade Approach
Urban environments present a rich acoustic context, where many types of sounds can coexist.
Traffic noise or the sounds generated by other human activities present a large variety, which might not
be properly described in a standalone dataset. Datasets such as ESC-50 [
] present up to 50 different
categories of urban noises. Trained classifiers recognizing such number of categories have been shown
to achieve relatively low accuracy in recognizing sounds [
]. The proposed cascade approach
intends to solve such limitation. It consists of a multi-stage classifier such as the one depicted in
Figure 2. The audio features are extracted and processed in a two-stages classifier. The first stage
classifies the sound in one of the supercategories, which correspond to one particular dataset. The
second stage uses the same features to determine the category of the sound in the dataset selected in
the first stage. This cascade approach enables the use of different types of classifiers at each stage and
for each supercategory or dataset. Therefore, while the first stage is composed of one single classifier,
the second stage is composed of multiple and potentially different types of classifiers trained for each
supercategory or dataset.
The main advantages of this approach are:
Different classifiers can coexist in the same system. The optimal classifier is selected based on the
recognized sound.
A multi-stage approach enables the parallel execution of the classifiers. This would lead to a faster
sound recognition when exploiting multi-thread or pipeline capabilities of the system.
Moreover, factors such as their execution time or even power consumption, both critical
parameters on embedded systems, can be also considered in the selection of the active classifiers.
Appl. Sci. 2019,9, 3885 7 of 27
FeatureExtraction SoundClassification
Figure 2.
The multi-stage sound classification is composed of two stages, an initial stage performs the
broad classification, while a bank of classifiers are used to determine the sound’s category.
The cascade approach demands the creation of hierarchical datasets, which can be composed
of tens of categories. Figure 3depicts one example of such hierarchical dataset. The training for
the first-stage classifier is done by treating each supercategory or dataset as a single dataset category.
For instance, categories 1 to 4 of the Dataset 1 in Figure 3would be considered as one category
when training the first-stage classifier. The same approach is applied for the other supercategories or
datasets. In order to preserve the accuracy, each supercategory or dataset must be specialized, containing
categories belonging to specific types of sounds. More details about how to build such hierarchical
dataset is provided in Section 5.4.1.
Root Dataset
Supercategory 1
or Dataset 1
Category 1 Category 2 Category 3 Category 4
Supercategory 2
or Dataset 2
Category 5 Category 6
Supercategory 3
or Dataset 3
Category 7 Category 8 Category 9
Figure 3. Example of a hierarchical dataset to be used by our cascade approach.
A similar approach has been presented as a part of the challenge proposed for the Workshop
on Detection and Classification of Acoustic Scenes and Events (DCASE) [
]. The competitors have
to perform a course-grain and/or a fine-grain sound recognition of a two-stage hierarchical dataset
composed of seven datasets and a total of 23 categories of urban sounds recorded by the SONYC
project [
]. Instead, our proposed cascade approach is a general solution that can be extended by
increasing the number of stages and benefits from the combination of different sound classifiers.
Although a multi-stages cascade approach increases the operations for the sound classification, it can
be exploited by embedded systems. Many ARM-based embedded systems incorporate multi-core
processors capable of concurrently executing the classifiers in different threads. The full potential can
be achieved in FPGA-based embedded systems since both operations can be executed in the pipeline,
resulting in an overlapped execution. The present work evaluates this cascade approach in a multi-core
processor. For the sake of simplicity, we only consider a two-stage approach for our experiments.
5. Experimental Results
The experimental results presented in this section are grouped in three main blocks. Firstly, the
evaluation of the classifiers for the existing datasets is presented. Secondly, the proposed cascade
approach is evaluated. Finally, the best classifiers are evaluated in an embedded system.
Appl. Sci. 2019,9, 3885 8 of 27
The results presented in this section are:
Evaluation of the classifiers for the existing datasets
: The LibROSA Python package [
] is
used for the audio feature extraction for each dataset discussed in Section 3.4. The evaluation
of each selected classifier (k-NN, ANN, Naive Bayes, Decision Tree and SVM) uses 80% of the
dataset samples for the training and the remaining 20% for the validation. The performance of the
classifiers is compared.
Optimization of the classifiers
: The classifiers are optimized per dataset. This optimization is
performed by looping the classifiers’ parameters on different data configurations and checking for
the set of parameters resulting in the smallest mean error. The evaluation of the classifiers on the
hierarchical dataset is also done. This evaluation involves different feature sets in order to identify
the impact of the feature selection on the recognition’s accuracy and the computational speed.
Dataset for the cascade approach
: Existing datasets are combined in a hierarchical dataset to be
used by the cascade approach.
Evaluation of the cascade approach
: The cascade approach is compared to the traditional solution
using an equivalent dataset where all the categories of the hierarchical dataset are different
categories in the same dataset.
Evaluation on an embedded system
: According to the results, the optimal classifier is embedded
on a Raspberry Pi. The performance and computational speed of the classifier are then analysed.
5.1. Experimental Setup
Our tests evaluate the existing datasets, the audio features and the classifiers. All the experiments
have been firstly executed on an Apple MacBook Pro (2.3 GHz Intel Core i5), in order to evaluate the
impact of the features’ selection and to compare the achievable performance of the classifiers. The
experimental measurements on the embedded system have been obtained after 10 executions on a
Raspberry Pi 3B+.
Firstly, the feature extraction is performed by using the Python 2.7 package LibROSA 0.6.3 [
The features’ selection can be done by importing Weka [
] on Python after converting the features’
files in an .arff format. However, in most of our experiments, all the features mentioned in Section 3.2
are used. Secondly, the classifiers are implemented using the Python library scikit-learn 0.21.2 [
The classification is performed by splitting the data into 20% testing and 80% training. Each execution
of the Python scripts is iterated 10 times to obtain the average of the time extraction per feature. Notice
that the feature extraction and the sound classification are operations independent of the approach
since both use the same extracted features for the sound classification at all stages. Therefore, our
experiments target the profiling of the sound classifiers in terms of accuracy and timing.
5.2. Experimental Timing Profile
The total execution time (
) is defined as the sum of the times demanded by the individual
operations. The time required for the feature extraction is defined as
textr action
and represents the time
needed to extract the feature set (Set 3) defined in Table 7. The average time required by the classifier
to perform the sound classification using the feature vectors from all frames is defined as tclas si f icati on.
This value corresponds to the sum of the time required by each stage
, in case of the cascade approach
cla ssi f ica tion
), to process the feature vectors from all frames, or, alternatively, it represents the time for
the sound classification in case of a single classifier. Therefore, texec can be defined as follows:
texec =textr action +tcl assi f i catio n, (1)
which becomes
texec =textr action +tstage1
cla ssi f ica tion +tstage2
cla ssi f ica tion , (2)
in case of a two-level cascade approach.
Appl. Sci. 2019,9, 3885 9 of 27
Although the input audio is read from WAV files for our experiments, the time to read these files
is not considered since the sound classification is assumed to be performed from the audio directly
coming from a microphone.
5.3. Evaluation of Classifiers per Dataset
5.3.1. Default Classifiers
The reported accuracy of several classifiers is summarized in Table 3. The paper associated with
each dataset is selected for this comparison and is to be used as a reference of the expected performance.
Nonetheless, not all classifiers are evaluated for each dataset.
Our measurements when evaluating multiple classifiers using the existing datasets are
summarized in Table 4. Although only the top 10 features with the highest entropy for each dataset
reported by the Weka audio analysis tool could be selected, all the features available in LibROSA
have been used as input vectors for the classifiers. Whereas our measurements show that most of
the classifiers improve the literature’s results (Table 3), our SVM classifier presents a significantly
lower accuracy than expected. This lack of accuracy can be associated with the configuration of the
SVM classifier since we used the default configuration in our tests and its configuration is not usually
The relation between the number of categories per dataset and the classifier accuracy is exposed in
Table 4. The dataset ESC-50 composed of 50 categories [
] results in the lowest classifiers’ performance
compared to the others. This low performance can also be seen in the number of features whose
entropy is null, and, therefore, discarded. As a result, a dataset is recommended to have a limited
number of categories. In fact, the number of categories should be as small as possible according
to [1,36].
Table 4also depicts the average execution time of each classifier to process the feature vectors from
all frames. The average time computation and accuracy of each classifier is done using 10 iterations.
There exists a significant difference between the computational time demanded by each classifier.
Whereas ANN only demands few microseconds, k-NN presents the highest demand achieving around
two milliseconds. Such disparity must be considered when selecting the classifiers to be evaluated on
an embedded system taking into account the constrained nature of these devices.
Appl. Sci. 2019,9, 3885 11 of 27
5.3.2. Optimized Classifiers
Each classifier supports multiple configurations, which are set by the value of specific parameters.
The selection of these parameters which potentially optimizes the classifier’s performance is explored
here for the k-NN and the ANN classifiers. The parameters under evaluation are:
The number of neighbors is the core deciding factor for the k-NN classifier. The parameter
n_neighbors determines the number of nearest neighbors.
The parameter
used by the ANN classifier corresponds to the number of
neurons in a layer.
Figure 4illustrates the optimization process for each configuration. The search for the optimal
configuration is performed by empirically exploring the set of parameters giving the smallest mean
error. For instance, Figure 5a depicts how the smallest error is obtained when K is equal to 1 for the
k-NN classifier. This configuration leads to the highest performance for the classifier k-NN using
the dataset BDLib. Similar methodology is applied for the ANN classifier. Figure 5b also depicts
the exploration of the size of the hidden layer parameter of the ANN classifier for the dataset BDLib.
This exploration enables the selection of the optimal configuration for the classifiers based on the
dataset. As a result, the empirical values of the classifiers’ parameters which lead to higher accuracy
are summarized in Table 5. Notice the dependency between the parameters and the dataset. Not
only the audio recordings but also the number of categories determine the optimal configuration of
the classifiers.
Figure 4. The configuration of the sound classifier is selected based on the achieved accuracy.
Table 5. Optimal values of the classifiers’ parameters which minimize the accuracy error per dataset.
Classifier Parameter Default BDLib ESC-10 ESC-50 UrbanSound
k-NN n_neighbours 5 1 1 1 1
ANN hidden layer 10 39 23 87 88
(a) K-Nearest Neighbours. (b) Artificial Neural Networks.
Figure 5.
Exploration of the classifiers’ parameters for the BDLib dataset. The k-NN and ANN
classifiers are optimized by selecting the value of the parameter leading to the minimal mean error.
Appl. Sci. 2019,9, 3885 12 of 27
Figure 6shows the achieved accuracy of the optimized classifiers. The features used as input
vector for the classifiers are the same as the non-optimized case. As expected, the optimized classifiers
significantly increase their accuracy, reaching higher accuracy than detailed in Table 3for similar
comparisons. The evolution of the execution time of the classifiers when optimized is depicted in
Figure 7. The ANN classifier demonstrates being more sensitive to optimizations, since it not only
increases its accuracy but also significantly increments its execution time for the ESC-50 and the
UrbanSound datasets. Nonetheless, this classifier remains significantly faster than the k-NN classifier,
as shown in the sections below.
Default Optimized Default Optimized Default Optimized Default Optimized
BDLib ESCͲ10 ESCͲ50 UrbanSound
KNN 57.92% 59.58% 66.50% 74.00% 33.40% 50.40% 51.14% 54.32%
ANN 51.25% 65.00% 68.50% 78.00% 25.60% 46.70% 49.57% 56.43%
Figure 6. Achieved accuracy of the classifiers with their default and optimized configuration.
BDLib ESCͲ10 ESCͲ50 UrbanSound
Figure 7. Relative improvement of optimised parameters over default parameters of the classifiers.
5.4. Cascade Approach
In this section, the cascade approach proposed in Section 4is evaluated. Firstly, a hierarchical
dataset is elaborated from existing datasets. Secondly, the best performing audio features are
selected. Finally, the classifiers are evaluated for the cascade approach and for an equivalent
non-hierarchical dataset.
5.4.1. Hierarchical Dataset
The existing datasets do not necessarily contain the desired categories for a particular application.
The generation of new datasets takes effort and is a time-demanding task. Instead, existing datasets
can be combined or rearranged in order to satisfy the application’s demands.
In order to solve the limitations present on the existing datasets, a semantic hierarchical dataset
has been generated (Table 6). The datasets described in Section 3.1.1 are used here to create the new
dataset, which is used to evaluate our approach in Section 5.4.4 and in Section 5.5.3. The new dataset is
built under certain conditions in order to have higher performance:
The dataset must contain relevant categories
: This work intends to evaluate ML techniques in
recognizing urban sounds around the campus of the Vrije University Brussels (VUB). Therefore,
categories such as Rivers or Sea waves are discarded. Other similar categories are merged into the
same category. For example, the existing categories Cars and Motorcycles in datasets are merged
into a new category Vehicles.
Small number of categories per data set
: According to [
], there is an inverse relation
between the number of categories contained in a dataset and the classifier’s performance. The
Appl. Sci. 2019,9, 3885 13 of 27
more categories, the lower the classifier performance that is obtained. Although the cascade
approach intends to minimize this degradation, the number of supercategories should be up to
5 each [
]. Notice that, in spite of up to 25 categories able to be recognized thanks to using the
cascade approach, the hierarchical dataset only presents 12 categories of interest.
Balanced categories
: All supercategories contained in the hierarchical dataset should be equally
balanced. This condition is satisfied by forcing each supercategory to present an equal number of
audio samples per category.
Sufficient number of audio samples per category
: The data set should have a sufficient number
of audio samples per category in order to train the classifier. For the generation of the hierarchical
dataset, some online tools such as Freesound [
] and Freesoundeffects [
] have been used to
increase the number of audio samples per category.
Table 6. Summary of the categories used in the hierarchical dataset.
Datasets Categories Audio data (s)
Root Dataset
Airport 475
Traffic 475
Construction 475
Residence 475
Airport Airplane 395
Helicopter 395
Construction Drilling 170
Jackhammer 170
Vehicles Cars 455
Motorcycles 455
Warning Cars horn 345
Siren 345
Cats 505
Human 505
Children 505
Dogs 505
Figures 8and 9illustrate the semantic hierarchical dataset for the cascade approach. This approach
uses multiple datasets as categories for the initial stage. The hierarchical dataset is composed of five
supercategories, which are in fact standalone datasets: Vehicles,Warning,Residence,Construction and
Airport. The hierarchical dataset can be considered as composed of six datasets; these five supercategories
and the complete dataset without categories (Root dataset) used at the initial stage (Figure 8). All are
created by combining the datasets described in Section 3.1.1.
Root dataset
(Airplane, Helicopter)
(Drilling, Jackhammer)
(Car, Motorcycle)
(Car horn, Siren)
(Cat, Dog, Human,Kids)
Figure 8.
Training dataset for the first stage of the cascade approach. Notice that different sound
categories (in brackets) are grouped in the same category.
Based on the prediction of the first stage, a second stage recognizes, the categories of the
recognized supercategory to which the sound belongs (Figure 9). As a result, the cascade approach
supports an overall large number of categories while presenting a small number of categories
per dataset.
Appl. Sci. 2019,9, 3885 14 of 27
Root dataset
Airplane Helicopter
Drilling Jackhammer
Car Motorcycle
Car horn Siren
Cat Dog Human Kids
Figure 9. Detailed hierarchical dataset for the cascade approach.
5.4.2. Feature Selection
The selection of the audio features to be used by the classifiers can be done using the Weka audio
analysis tool. The Weka’s features’ selection uses the ranker algorithm InformationGain, which gives
the set of most relevant features per dataset. This information can be used to select the best feature set
leading to the highest classifier performance [
]. The main concern is to have the highest performance
while using the smallest number of features as possible. Similar to [
], three sets of features have been
created and used to identify the feature set leading to a higher classifier’s performance. Each set has
been created based on:
Set 1 : Features that are common to all datasets.
Set 2 : Features that are common to at least three datasets.
Set 3 : All features provided by LibROSA.
Table 7summarizes the features composing the sets under evaluation. The common features to
all datasets of the hierarchical dataset are expected to be the most useful. Nevertheless, it resulted in a
very short set composed of only two features. Our analysis shows that more features are in common
between several datasets, and, in order to increase the number of features to be evaluated, Set 2 and
Set 3 are proposed. Whereas Set 2 is composed of features that are also common to at least three
datasets, Set 3 is expected to deliver the best results since all supported features are used. Although a
more fine-tuning selection of the features can be done by evaluating the linear correlation among the
features, as done in [
], we have decided to only evaluate the feature sets depicted in Table 7for the
sake of simplicity.
Table 7.
Considered feature sets and average time required for the features’ extraction. The values in
brackets are the Standard Deviation (SD).
Set 1 Set 2 Set 3
Features Time [ms] Features Time [ms] Features Time [ms]
Mfcc 1, 6 17.30 (1.08) Zero Crossing 4.17 (0.37) Mfcc 0 - Mfcc 12 17.30 (1.08)
Total time ( textr actio n )17.30 (1.08) Sp. Contrast 0, 2 24.58 (1.92) Sp. Contrast 0 - Sp. Contrast 5 24.58 (1.92)
Mfcc 0, 2, 4, 5, 7 17.30 (1.08) Sp. Centroid 12.75 (0.54)
Total time ( textr actio n )46.06 (0.62) Sp. Roll off 12.93 (0.55)
Sp. Bandwidth 15.84 (0.59)
Rms 63.95 (3.70)
Zero Crossing 4.17 (0.37)
Total time ( textr actio n )151.55 (1.67)
Table 8summarizes the classifiers’ performance (with default configuration) for each feature set.
The values in bold correspond to the highest performance reached for one feature set. The classifiers
k-NN, Naive Bayes and ANN achieve higher performance for most of the datasets. Most of the
classifiers increase their performance when increasing the number of features. Notice that it is not
always the case. For instance, the k-NN and the Decision Tree classifiers do not present significant
difference on its performance for the Construction dataset. Nonetheless, for most of the cases, Set 3
achieves the highest performance. As a result, this feature set is used hereby for our evaluations.
Appl. Sci. 2019,9, 3885 15 of 27
Table 8. Average accuracy in percentage per data set. The best classifier per data set is highlighted in bold.
Datasets Sets k-NN ANN Decision tree SVM Naive Bayes
Root dataset
Set1 35.61 52.84 39.93 25.54 49.19
Set2 55.47 51.76 44.46 33.04 47.03
Set3 62.16 62.09 48.92 45.68 54.46
Set1 73.44 72.81 70.31 61.88 73.13
Set2 74.38 76.25 77.50 65.31 78.13
Set3 81.56 81.25 70.94 74.69 69.38
Set1 70.00 72.78 72.78 66.67 81.11
Set2 71.11 85.56 64.44 74.44 87.22
Set3 70.56 82.78 62.22 68.89 89.44
Set1 44.69 69.38 59.38 37.38 62.88
Set2 65.56 68.25 64.00 51.38 68.63
Set3 78.52 78.88 66.38 71.00 76.25
Set1 66.22 64.32 67.03 55.14 66.76
Set2 72.16 78.65 69.19 59.46 72.43
Set3 77.30 80.27 68.92 73.24 77.84
Set1 60.36 64.64 60.36 59.64 64.64
Set2 66.07 70.00 58.93 65.00 63.21
Set3 65.71 72.86 65.00 65.36 65.36
The average execution time of each classifier for each features set is depicted in Figure 10. On the
one hand, the classifier k-NN demands a computational time significantly higher than other classifiers,
while the ANN classifier demonstrates being extremely fast independently of the number of features.
On the other hand, not all classifiers present the same sensitivity to the feature’s set.
Set1 Set2 Set3 Set1 Set2 Set3 Set1 Set2 Set3 Set1 Set2 Set3 Set1 Set2 Set3 Set1 Set2 Set3
RootDataset Airport Construction Residence Vehicles Warning
ANN DecisionTree SVM NaiveBayes
Set1 Set2 Set3 Set1 Set2 Set3 Set1 Set2 Set3 Set1 Set2 Set3 Set1 Set2 Set3 Set1 Set2 Set3
RootDataset Airport Construction Residence Vehicles Warning
Figure 10. Average timing of the classifiers running with their default configuration.
Appl. Sci. 2019,9, 3885 16 of 27
One would expect that, for most of the classifiers, the increment of the number of features used
by the classifiers would lead to more accurate results, but also to slower classifiers. The results
summarized in Table 8and Figure 10 confirm this assumption. Few exceptions exist. For instance, the
classifier Naive Bayes reaches a higher performance using Set 1 instead of Set 2 for the Root and the
Warning datasets. Moreover, the execution time of the ANN classifier slightly varies in most of the
cases when enlarging the number of features.
5.4.3. Classifier Selection
The evaluated classifiers support multiple parameters, which directly affects their performance.
The strategies applied in Section 5.3.2 to optimize the classifiers are applied here to select the best
performing classifiers for the cascade approach. As shown in Table 8, the k-NN and the ANN classifiers
offer the higher accuracy for most of the datasets of the cascade approach. Although the Naive Bayes
classifier could be also considered, this classifier presents a relatively low accuracy for the Root dataset,
for which accuracy is critical for the cascade approach as discussed in the section below. For these
reasons, only the K-NN and the ANN classifiers are selected to be optimized. Table 9summarizes the
default and optimized parameters for the k-NN and ANN classifiers.
Table 9.
The parameters for the optimization of the k-NN and ANN classifiers for the datasets of the
cascade approach and for the non-cascade approach.
Optimized Parameters
Datasets n_neighbors hidden_layer_sizes
Root Dataset 1 96
Airport 1 14
Construction 2 5
Residence 1 48
Vehicles 1 60
Warning 3 48
Non-cascade 1 74
Figure 11 depicts the achieved accuracy when using the default and the optimized configuration
of the classifiers using Set 3. As summarized in Table 9, each classifier is optimized to achieve the
highest accuracy per data set. The optimization is evaluated using all the available features (Set 3).
The classifier ANN presents a higher increment when optimized, offering the highest performance for
most of the datasets.
Figure 11. Achieved accuracy of the classifiers with their default and optimized configuration.
Appl. Sci. 2019,9, 3885 17 of 27
RootDataset Airport Construction Residence Vehicles Warning
Figure 12. Average execution time of the classifiers based on their configuration.
Figure 12 depicts the ratio between the average execution time when using the default and the
optimized configuration of the classifiers. Again, an ANN classifier increments their execution time
when optimized for the Root dataset. In general, the execution time does not present any significant
variation when optimizing.
5.4.4. Comparison
The proposed approach is compared to a conventional non-cascade approach, which uses
a non-hierarchical dataset containing the same categories of each individual dataset of the
cascade approach.
The global accuracy of the cascade approach (
) is obtained by multiplying the
achieved accuracy at each stage, as defined in Equation (3).
AccuracyOverall =AccuracyStage1×AccuracyStage2, (3)
where AccuracyStageiis the accuracy achieved at stage i{1, 2}.
The measured accuracy and timing per classifier for the cascade approach is summarized in
Table 10. As expected, the most accurate classifiers are the k-NN and the ANN classifiers. Notice that
the Naive Bayes classifier outperforms these classifiers for the Construction dataset but offers lower
accuracy for most of the datasets. Moreover, due to its relatively low accuracy for the Root dataset, this
classifier is not a candidate for the initial stage. The overall accuracy of the cascade approach for the
k-NN and ANN classifiers obtained by applying Equation (3) is summarized in Table 11. The results
follow the achieved accuracy depicted in Figure 11. While the k-NN classifier achieves a slightly lower
accuracy than the ANN classifier, it is approximately 30 times slower. Notice how
AccuracySta ge 1
impacts the overall accuracy. The ANN classifier remains the most accurate one. However, classifiers
can be combined to increase the overall accuracy or for speed. The last columns in Table 11 depict the
accuracy and the timing when combining the most accurate and the fastest classifiers. For instance,
to increase accuracy, the ANN can be used for most of the datasets while using k-NN for the Airport
and Vehicles datasets. Such a combination achieves around 58% accuracy but doubles the timing of
the ANN classifier standalone. On the other side, a faster sound classification can be obtained when
combining the faster classifiers such as using the Decision Tree classifier for the Root dataset while
using the ANN classifier for the rest of the datasets. The execution time decreases to around 1 ms
while the accuracy drastically decreases to around 39%. Both are good examples of one of the benefits
of the cascade approach, which enables the combination of classifiers based on the priorities of the
application performing the sound recognition.
Appl. Sci. 2019,9, 3885 20 of 27
Table 12 shows the results of the classifiers’ accuracy when all the categories are regrouped in one
non-hierarchical dataset. The achieved accuracy of the cascade approach is similar to the traditional
approach in all categories. The similarity in the accuracy of both approaches is due to the fact that the
non-hierarchical dataset only considers 12 categories. The main advantage of the cascade approach is
the additional flexibility due to combining different types of classifiers.
The disadvantage of the cascade approach is the execution time when only using the same type
of classifier for all stages. Although the feature extraction is common for both approaches, with
values rounding 150 ms for Set 3 (Table 7), the cascade approach requires the computation of two
classifiers. Based on the classifier, the timing reported in Table 11 ranges from around 1.2 ms to more
than 40 ms for the ANN (Root Dataset => Airport) and for the k-NN (Root Dataset => Residence)
classifier, respectively. Due to the limited computational resources, the overall execution on the
embedded system is expected to be significantly higher. There is a trade-off between time and accuracy.
Additional stages increase the supported number of categories but also the overall execution time.
Nonetheless, the ultimate objective is to embed this cascade approach in an FPGA, where it can be
computed in pipeline, or in an embedded device with a multi-core processor, where each stage is
executed concurrently in a different thread.
5.5. Evaluation on an Embedded System
The evaluated classifiers and the proposed cascade approach have been implemented on
Raspberry Pi 3B+ running a Raspbian 4.14 environment. Firstly, the accuracy and timing measurements
done on the Raspberry Pi of the five classifiers are discussed. Secondly, the proposed approach is
evaluated and compared to a traditional solution running on this embedded system. The libraries and
the source code running on the embedded system is the same as the one used for the experiments
on the computer. The execution time of each operation related to the feature extraction and to the
sound classification has been individually measured. Our experiments are repeated 10 times in order
to obtain the average timing and the standard deviation (SD) of the operations.
5.5.1. Feature Extraction on an Embedded System
The measurement of
textr action
has been individually obtained. The value of
textr action
can be added
to the timing of the sound classifiers to obtain the
when acquiring the audio data directly from a
microphone (Equation (1)). Table 13 summarizes the timing for the different types of audio features
running on the embedded system. The computational effort to extract Set 3 requires a lot of time,
becoming around 20 times slower than on the laptop (Table 7).
Table 13.
Average time required for the feature extraction on the Raspberry Pi. The values in brackets
are the Standard Deviation (SD).
Features Time [ms]
Mfcc 0 - Mfcc 12 622.22 (163.71)
Sp. Contrast 0 - Sp. Contrast 5 336.39 (64.87)
Sp. Centroid 300.24 (52.81)
Sp. Roll off 315.79 (61.37)
Sp. Bandwidth 349.83 (63.33)
Rms 1089.10 (260.62)
Zero Crossing 64.22 (15.19)
Total time ( textr actio n )3077.79 (125.21)
5.5.2. Evaluation of Classifiers per Dataset Running on an Embedded System
Table 14 summarizes the measured accuracy and timing of the five classifiers for the BDLib,
ESC-10, ESC-50 and the UrbanSound datasets running on the Raspberry Pi. The performance of the
classifiers is not affected by the platform, achieving similar accuracy to the one depicted in Table 4. As
expected, the execution time needed for the sound classification is significantly affected, becoming
around 10 times higher in many cases.
Appl. Sci. 2019,9, 3885 22 of 27
5.5.3. Cascade Approach on an Embedded System
Like the experiments on the computer, the embedded implementations use the Python 2.7 libraries
LibROSA and scikit-learn for the feature extraction and the sound classification, respectively. The
training of all six datasets (Root dataset plus the 5 datasets) required for the cascade approach required
around 1.5 hours on the Raspberry Pi. This training time is possible by specifying the audio length
when loading the audio with LibROSA, which for our case is 5s per audio file. Both the features
extraction and classification are also performed on the Raspberry Pi. The ANN and k-NN classifier has
been evaluated due to their relatively high accuracy. They are configured with the parameters present
on Table 9.
Table 15 summarizes the measured accuracy and timing of the classifiers when using the cascade
approach on the Raspberry Pi. The results are the average of 10 executions. Despite the relatively high
accuracy achieved by both classifiers for each dataset (or supercategory), the global accuracy when
considering the combined accuracy of both classifiers (Table 16) decreases similarly to the accuracy
depicted in Table 11. The achievable accuracy obtained by Equation (3) rounds to 54.4% for the ANN
classifier and to 50.3% for the k-NN classifier. Both classifiers slightly decrease their accuracy in their
embedded version. Some improvements in the dataset (or supercategory) Warning would certainly
lead to a higher accuracy since both classifiers significantly decrease their accuracy for this dataset.
Although the execution time of both classifiers remains in the order of milliseconds, the ANN classifier
has increased its timing significantly, becoming only around 3.5 times faster than the k-NN classifier.
Notice that the k-NN and ANN classifiers are significantly slower than the other classifiers for the Root
dataset. There are several reasons for this:
The k-NN classifier is the most time demanding classifier, not only when embedded (Tables 4,
11 and 12 and in Figure 10). Similarly to other classifiers, the k-NN classifier becomes around
10 times slower when embedded.
The exception is the ANN classifier. The increment of the execution time of the ANN classifier
when embedded, reflected in Table 15, becomes around 100 times slower when compared to
Table 11.
The k-NN and the ANN classifiers are evaluated with their optimal configuration. This fact
increments the ANN’s execution time like that shown in Figure 7and in Figure 12.
The main benefit of the cascade approach is the combination of classifiers. Whereas the ANN
classifier outperforms the k-NN classifier for the Root dataset, the accuracy of the k-NN remains higher
for the Airport and the Vehicles datasets. Thus, the ANN classifier can be used for all the datasets except
the Airport and the Vehicles datasets, where the k-NN classifier could be applied due to its higher
accuracy. Such combination reaches an accuracy of 55.7% but demands 95.7 ms. Note that the overall
execution time of the cascade approach decreases if the Decision Tree classifier is used in stage 1 and
datasets Residence,Vehicles and Warning in the second stage while the ANN classifier is used for the
Airport and Construction datasets in the second stage. Such a solution would only require 8.6 ms, but
the accuracy decreases to 35.7%. Both are examples of how the cascade approach enables multiple
combinations, either targeting accuracy or a fast response.
The evaluation of the ANN and k-NN classifiers for the non-cascade approach is summarized in
Table 17. The achieved performance of both classifiers is similar to the cascade approach, while the
execution time of both classifiers is slightly lower than the two-stages approach. The timing increases
from a factor of 8.8 to 52.2 times for the k-NN and for the ANN classifiers respectively compared to the
timing when both classifiers are not embedded (Table 12).
Appl. Sci. 2019,9, 3885 25 of 27
5.6. Discussion
The measured accuracy decreases for all classifiers when they are embedded. This fact can be
related to the limitation of the embedded version of the LibROSA library, on which the audio feature
extraction perform differently. Based on our analysis, the expected accuracy of both approaches ranges
around 50%–60% for the recognition of different sound categories. It represents a lower range than
the CNN-based solutions proposed in [
], where accuracy ranges around 50%–75% for the fine-grain
sound recognition of a hierarchical dataset. Although the limited computational power available on
embedded devices challenges the use of such CNN-based solutions, recent new commercial devices
such as Googles’ Edge Tensor Processing Unit (TPU) [
] facilitate the use of this technique for
edge computing.
Our measurements show that the most time-demanding operation is the audio features extraction,
an operation which requires several seconds on an embedded system. Although this time can be
reduced by lowering the number of extracted features, our experiments demonstrate that it certainly
affects the accuracy of the sound classifier. Further analysis is needed when selecting the audio features
for a particular dataset since not all sound classifiers are equally affected when reducing the features
set. Nonetheless, such analysis is beyond the scope of this paper.
Our analysis provides valuable information about the achievable accuracy and the required
execution time on embedded devices. Some discussion is needed when targeting real-time sound
classification. The frame size of 1 second must be reduced to a few milliseconds and the sound
classification might be performed per frame. Our results provide timing information, which can be
used to select frame size, determining the time slot to perform the feature extraction and the sound
classification. For instance, a frame size of around 100 ms would only provide enough time for the
ANN classifier in case of the cascade approach (Table 16), but not for the feature extraction, which
for Set 3 rounds to 3 s (Table 13). A frame size of around 700 ms could be enough by only using
the features from the MFCC and the ANN classifier in case of the non-cascade approach (Table 17).
Although we do not propose a solution for real-time sound recognition on an embedded device, our
evaluation can be used for the selection of the frame size, the feature extraction or the classifiers.
Embedded devices are often power-constrained devices. A power analysis of the different
approaches would certainly help to not only select the most power-efficient technique but also to
help in the selection of the embedded device. One would expect that the cascade approach presents
a higher power consumption due to using two sound classifiers. However, the flexibility of this
approach enables the combination of power-efficient classifiers, which can lead to low power solutions.
A detailed power analysis of the feature extraction and the sound classification is one of the priorities
for future work.
6. Conclusions
Our experimental results demonstrate that classical sound classifiers for urban sound recognition
achieve a slightly lower accuracy when embedded on an Raspberry Pi 3 at the cost of an increment
of the timing. Although the ANN classifier remains faster than the k-NN classifier, its execution
time increases by a factor of 100 when embedded, while other ML classifiers increase their execution
time by 10 times. The proposed cascade approach obtains similar accuracy to traditional solutions,
while providing additional flexibility to prioritize accuracy or timing. Our analysis provides valuable
information about the achievable accuracy and the required execution time on embedded devices,
which can be used to decide parameters such as the frame size in order to achieve real time. Embedded
devices such as the evaluated Raspberry Pi 3B+ demonstrate being powerful enough to achieve urban
sound recognition in a reasonable time.
Author Contributions:
Conceptualization, B.d.S.; Formal analysis, B.d.S. and A.W.H.; Investigation, A.W.H.;
Methodology, B.d.S. ; Software, A.W.H.; Supervision, A.B and A.T.; Validation, B.d.S.; Writing—original draft,
B.d.S.; Review and editing, A.B; Funding Acquisition, A.B and A.T.
Appl. Sci. 2019,9, 3885 26 of 27
Funding: This research received no external funding.
Conflicts of Interest: The authors declare no conflict of interest.
Bountourakis, V.; Vrysis, L.; Papanikolaou, G. Machine learning algorithms for environmental sound
recognition: Towards soundscape semantics. In Proceedings of the Audio Mostly 2015 on Interaction With
Sound, Thessaloniki, Greece, 7–9 October 2015 ; p. 5.
Piczak, K.J. ESC: Dataset for environmental sound classification. In Proceedings of the 23rd ACM
international conference on Multimedia, Brisbane, Australia, 26–30 October 2015 ; pp. 1015–1018.
Salamon, J.; Jacoby, C.; Bello, J.P. A dataset and taxonomy for urban sound research. In Proceedings of the
22nd ACM international conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; pp. 1041–1044.
Zhan, Y.; Kuroda, T. Wearable sensor-based human activity recognition from environmental background
sounds. J. Ambient. Intell. Humaniz. Comput. 2014,5, 77–89.
Mendoza, J.M.; Tan, V.; Fuentes, V.; Perez, G.; Tiglao, N.M. Audio Event Detection Using Wireless Sensor
Networks Based on Deep Learning. In International Wireless Internet Conference; Taipei, Taiwan 15–16 October
2018; Springer: Berlin, Germany; pp. 105–115.
Jakob, A.; Marco, G.; Stephanie, K.; Robert, G.; Christian, K.; Tobias, C.; Hanna, L. A Distributed Sensor
Network for Monitoring Noise Level and Noise Sources in Urban Environments. In Proceedings of the
2018 IEEE 6th International Conference on Future Internet of Things and Cloud (FiCloud), Barcelona, Spain,
6–8 August 2018; pp. 318–324.
Siamwala, A.; Lochhead, Z.; Abdulla, W. Environmental Noise Monitoring Using Distributed IoT Sensor
Nodes. In Proceedings of the IEEE 2019 International Conference on Electronics, Information, and
Communication (ICEIC), Beijing, China, 12–14 July 2019; pp. 1–10.
Kumari, S.; Roy, D.; Cartwright, M.; Bello, J.P.; Arora, A. EdgeLˆ 3: Compressing Lˆ 3-Net for Mote Scale
Urban Noise Monitoring. In Proceedings of the 2019 IEEE International Parallel and Distributed Processing
Symposium Workshops (IPDPSW), Rio de Janeiro, Brazil, 20–24 May 2019; pp. 877–884.
Bello, J.P.; Silva, C.; Nov, O.; DuBois, R.L.; Arora, A.; Salamon, J.; Mydlarz, C.; Doraiswamy, H. SONYC: A
system for the monitoring, analysis and mitigation of urban noise pollution. arXiv 2018, arXiv:1805.00889
Mydlarz, C.; Sharma, M.; Lockerman, Y.; Steers, B.; Silva, C.; Bello, J.P. The life of a New York City noise
sensor network. Sensors 2019,19, 1415.
Font, F.; Roma, G.; Serra, X. Freesound technical demo. In Proceedings of the 21st ACM international
conference on Multimedia, Barcelona, Spain, 21–25 October 2013; pp. 411–412.
Blackman, R.B.; Tukey, J.W. The measurement of power spectra from the point of view of communications
engineering—Part I. Bell Syst. Tech. J. 1958,37, 185–282.
Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; Witten, I.H. The WEKA data mining software:
an update. Acm Sigkdd Explor. Newsl. 2009,11, 10–18.
Boddapati, V.; Petef, A.; Rasmusson, J.; Lundberg, L. Classifying environmental sounds using image
recognition networks. Procedia Comput. Sci. 2017,112, 2048–2056.
Cao, J.; Cao, M.; Wang, J.; Yin, C.; Wang, D.; Vidal, P.P. Urban noise recognition with convolutional neural
network. Multimed. Tools Appl. 2018,78, 29021–29041.
Su, Y.; Zhang, K.; Wang, J.; Madani, K. Environment Sound Classification Using a Two-Stream CNN Based
on Decision-Level Fusion. Sensors 2019,19, 1733.
Piczak, K.J. Environmental sound classification with convolutional neural networks. In Proceedings of the
2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), Boston, MA,
USA, 17–20 September 2015; pp. 1–6.
Bogdanov, D.; Wack, N.; Gómez Gutiérrez, E.; Gulati, S.; Herrera Boyer, P.; Mayor, O.; Roma Trepat, G.;
Salamon, J.; Zapata González, J.R.; Serra, X. Essentia: An audio analysis library for music information
retrieval. In Proceedings of the 14th Conference of the International Society for Music Information Retrieval
(ISMIR), Curitiba, Brazil, 4–8 November 2013; pp. 493–498.
pyAudioAnalysis. Available online: (accessed on 1 June
Appl. Sci. 2019,9, 3885 27 of 27
Giannakopoulos, T. pyaudioanalysis: An open-source Python library for audio signal analysis. PLoS ONE
2015,10, e0144610.
21. Yaafe 0.64. Available online: (accessed on 1 June 2019).
Essentia 2.1. Available online: (accessed on 1 June 2019).
23. aubio 0.4.0. Available online: (accessed on 1 June 2019).
24. CLAM 1.4.0. Available online: (accessed on 1 June 2019).
25. LibROSA 0.6.3. Available online: (accessed on 1 June 2019).
McFee, B.; Raffel, C.; Liang, D.; Ellis, D.P.; McVicar, M.; Battenberg, E.; Nieto, O. librosa: Audio and music
signal analysis in Python. In Proceedings of the 14th Python in Science Conference, Austin, TX, USA,
6–12 July 2015; pp. 18–25.
Matlab Audio Analysis Library. Available online:
fileexchange/45831-matlab-audio-analysis- library (accessed on 1 June 2019).
28. PyCASP. Available online: (accessed on 1 June 2019).
Gonina, E.I. A Framework for Productive, Efficient and Portable Parallel Computing. PhD thesis, EECS
Department, University of California, Berkeley, 2013.
Seewave 2.1.3. Available online:
(accessed on 1 June 2019).
31. Bob 2.1.3. Available online: (accessed on 1 June 2019).
Anjos, A.; Shafey, L.E.; Wallace, R.; Günther, M.; McCool, C.; Marcel, S. Bob: A free signal processing
and machine learning toolbox for researchers. In Proceedings of the 20th ACM Conference on Multimedia
Systems (ACMMM), Nara, Japan, 29 October–2 November 2012.
33. Scikit-learn 0.21.2. Available online: (accessed on 1 June 2019).
Cowling, M.; Sitte, R. Comparison of techniques for environmental sound recognition. Pattern Recognit. Lett.
2003,24, 2895–2907.
Detection and Classification of Acoustic Scenes and Events (DCASE), Challenge 2019, Task 5: Urban Sound
Tagging. Available online: (accessed on
7 August 2019).
Jekic, N.; Pester, A. Environmental Sound Recognition with Classical Machine Learning Algorithms. In
International Conference on Remote Engineering and Virtual Instrumentation; Duesseldorf, Germany, 21–23 March
2018l; Springer: Berlin, Germany; pp. 14–21.
FreeSound: collaborative database of creative-commons licensed sound for musicians and sound lovers.
Available online: (accessed on 1 September 2019).
FreeSoundEffects Available online:
(accessed on 1 September 2019).
Cass, S. Taking AI to the edge: Google’s TPU now comes in a maker-friendly package. IEEE Spectr.
2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (
... Lately, there has been an increasing attention towards the development of embedded systems dedicated to digital sound processing, where a variety of audio software runs on single board computers such as the Raspberry Pi or the Beaglebone [174]. This endeavour is usually referred to as "embedded audio". ...
... Regarding non-musical contexts, as described in Section III-B1 several efforts have been conducted to extract meaning from soundscapes (i.e., sonic environments), particularly in urban areas [174] and wildlife monitoring. While those two application areas are now well established, one recent and promising field of application for ASNs is the surveillance of industrial processes for security and quality control. ...
Full-text available
Current sound-based practices and systems developed in both academia and industry point to convergent research trends that bring together the field of Sound and Music Computing with that of the Internet of Things. This paper proposes a vision for the emerging field of the Internet of Sounds (IoS), which stems from such disciplines. The IoS relates to the network of Sound Things, i.e., devices capable of sensing, acquiring, processing, actuating, and exchanging data serving the purpose of communicating sound-related information. In the IoS paradigm, which merges under a unique umbrella the emerging fields of the Internet of Musical Things and the Internet of Audio Things, heterogeneous devices dedicated to musical and non-musical tasks can interact and cooperate with one another and with other things connected to the Internet to facilitate sound-based services and applications that are globally available to the users. We survey the state of the art in this space, discuss the technological and non-technological challenges ahead of us and propose a comprehensive research agenda for the field.
... Various methods, including Machine Learning algorithms and signal processing techniques, have been investigated to achieve the accurate classification of urban sounds with a limited number of embedded devices. Several studies have employed the classical ML algorithms, such as k-Nearest Neighbor (k-NN), Support Vector Machine (SVM), Naive Bayes, and Decision Trees, which have been previously evaluated [4]. Sound recognition has seen significant progress in ML, as evidenced by the recent advancements. ...
Full-text available
Environmental Sound Recognition (ESR) plays a crucial role in smart cities by accurately categorizing audio using well-trained Machine Learning (ML) classifiers. This application is particularly valuable for cities that analyzed environmental sounds to gain insight and data. However, deploying deep learning (DL) models on resource-constrained embedded devices, such as Raspberry Pi (RPi) or Tensor Processing Units (TPUs), poses challenges. In this work, an evaluation of an existing pre-trained model for deployment on Raspberry Pi (RPi) and TPU platforms other than a laptop is proposed. We explored the impact of the retraining parameters and compared the sound classification performance across three datasets: ESC-10, BDLib, and Urban Sound. Our results demonstrate the effectiveness of the pre-trained model for transfer learning in embedded systems. On laptops, the accuracy rates reached 96.6% for ESC-10, 100% for BDLib, and 99% for Urban Sound. On RPi, the accuracy rates were 96.4% for ESC-10, 100% for BDLib, and 95.3% for Urban Sound, while on RPi with Coral TPU, the rates were 95.7% for ESC-10, 100% for BDLib and 95.4% for the Urban Sound. Utilizing pre-trained models reduces the computational requirements, enabling faster inference. Leveraging pre-trained models in embedded systems accelerates the development, deployment, and performance of various real-time applications.
... Enhanced computational speed: Quantum machine learning (QML) algorithms can potentially perform specific tasks at a faster rate compared to their classical counterparts, owing to quantum computing's inherent parallelism and the exploitation of quantum phenomena such as superposition and entanglement (Danielson et al., 2022). Scalability: QML can offer more efficient and scalable solutions to large-scale problems, where the volume of data and computational requirements can be overwhelming for classical machine learning techniques (da Silva et al., 2019). Improved optimization: Quantuminspired optimization techniques, such as quantum annealing, quantum genetic algorithms, and quantum swarm optimization, can enhance the training of machine learning models by effectively tackling combinatorial optimization problems (Karimi-Mamaghan et al., 2022). ...
Conference Paper
Full-text available
It is becoming more obvious that quantum computing might have groundbreaking implications in a variety of different industries as the globe continues to explore further into the realms of quantum computing. The reader will be given an introduction to the exciting world of quantum machine learning (QML) in this essay. QML is a rapidly developing multidisciplinary topic that integrates the fundamentals of quantum computing with the more conventional machine learning approaches. We want to make our offering of an accessible starting point for novices clear and understandable by giving a condensed review of the principles of quantum computing, classical machine learning techniques, and the connection between the two. In this article, the benefits and difficulties of using QML are discussed, including the language's potential for accelerated calculations as well as the constraints provided by existing technology. In addition, we investigate the potential uses of QML in the real world, particularly in the areas of drug development, finance, and cryptography. In conclusion, we provide a view on the future of QML, focusing on continuing research, emerging trends, and the possible influence it might have on a variety of different businesses as well as society in general.
... In the same way, [32,33] put forth tools for remote detection of COVID-19 infections by classifying coughing and breathing. Audio-based embedded ML models have found application in different domains as well, including urban sound identification [34], detecting mosquito populations [35], and anomaly detection purposes [36]. ...
Full-text available
Forest fires are the main cause of desertification, and they have a disastrous impact on agricultural and forest ecosystems. Modern fire detection and warning systems rely on several techniques: satellite monitoring, sensor networks, image processing, data fusion, etc. Recently, Artificial Intelligence (AI) algorithms have been applied to fire recognition systems, enhancing their efficiency and reliability. However, these devices usually need constant data transmission along with a proper amount of computing power, entailing high costs and energy consumption. This paper presents the prototype of a Video Surveillance Unit (VSU) for recognising and signalling the presence of forest fires by exploiting two embedded Machine Learning (ML) algorithms running on a low power device. The ML models take audio samples and images as their respective inputs, allowing for timely fire detection. The main result is that while the performances of the two models are comparable when they work independently, their joint usage according to the proposed methodology provides a higher accuracy, precision, recall and F 1 score (96.15%, 92.30%, 100.00%, and 96.00%, respectively). Eventually, each event is remotely signalled by making use of the Long Range Wide Area Network (LoRaWAN) protocol to ensure that the personnel in charge are able to operate promptly.
Full-text available
Environmental sound classification(ESC) is one of the challenging issues in the area of audio recognition. Classifying the sounds such as glass breaking, gunshots, dog barking can aid in a variety of applications like audio surveillance, smart homes, robotic navigation, and other crime investigation systems. Environmental sounds are more complicated as compared to speech and music due to their unstructured nature. Environ- mental sounds are classified using machine learning algorithms such as Support Vector Machines, K-Nearest Neighbour, and various deep learn- ing algorithms such as Convolutional Neural Networks, Recurrent Neural Networks, Long Short Term Memory Neural Networks. In this paper, a robust approach is proposed for ESC on the UrbanSound8k dataset. Cep- stral features Mel Frequency Cepstral Coefficients are extracted from the audios and fed to the Convolutional Recurrent Neural Network (CRNN). The effect of varying various hyperparameters such as the number of layers, batch size, and filter count are analyzed to determine the ade- quate combination of parameters that can give considerable accuracy for ESC. The proposed approach is compared to the baseline models and it attains the best accuracy of 93.58% which is higher than the previ- ous research results of ESC. The proposed model gives the best results using one LSTM layer, 0.3 momentum value, 512 number of filters, 512 neurons in LSTM layer and 256 batch size. The proposed framework can be ameliorated by using novel neural networks and combinations.
Conference Paper
Although sound is the maximum critical speaking device for all dwelling organisms, people’s day-by-day lives have brought extra forms of sounds into the herbal environment, which can also additionally or might not be useful. As a result, separating the regular communicating noises and identifying and interpreting them is a pressing necessity of the day. This study is about the different types of sounds that can be heard in cities. As a result, sound classification may aid the machine in determining the type of sound. This study examines the numerous strategies that are used to classify sounds and train machines to learn and analyze data in order to produce appropriate output. These kind of investigations can also aid in the detection of criminal activity. In addition, the numerous types of input and other parameters that can be employed for categorization were examined in this research. The methods’ benefits and drawbacks were also addressed.
Conference Paper
Este artigo apresenta uma introdução de conceitos e práticas para aprender o inicial de como trabalhar com Reconhecimento de Sons. Será apresentado o embasamento teórico necessário, os conceitos e as ferramentas para trabalhar na prática em um projeto prático de Reconhecimento de Padrões em áudio utilizando Inteligência Artificial com a linguagem de programação Python.
Classifying environmental sounds such as gunshots and dog barking are gaining popularity. Environmental sound classification(ESC) helps in developing context-aware applications such as security systems and criminal investigation systems. Research in speech and music has been done but environmental sounds are different because of their unstructured nature and attracts extensive attention in the field of research. Researchers have explored various preprocessing techniques, feature extraction and feature selection methods, and classification algorithms for ESC. In this paper, the ensemble technique—Adaboost algorithm— is applied to classify environmental sounds. The accuracy of different base estimators is evaluated on the publicly available dataset UrbanSound8K, and the highest accuracy is obtained in the case of the base estimator as random forest. The results of the Adaboost algorithm are also compared with the benchmark results reported using other machine learning classification algorithms such as support vector machines(SVM), IBK5, random forest 500, J48, and ZeroR.
Full-text available
Noise pollution is one of the topmost quality of life issues for urban residents in the United States. Continued exposure to high levels of noise has proven effects on health, including acute effects such as sleep disruption, and long-term effects such as hypertension, heart disease, and hearing loss. To investigate and ultimately aid in the mitigation of urban noise, a network of 55 sensor nodes has been deployed across New York City for over two years, collecting sound pressure level (SPL) and audio data. This network has cumulatively amassed over 75 years of calibrated, high-resolution SPL measurements and 35 years of audio data. In addition, high frequency telemetry data have been collected that provides an indication of a sensors’ health. These telemetry data were analyzed over an 18-month period across 31 of the sensors. It has been used to develop a prototype model for pre-failure detection which has the ability to identify sensors in a prefail state 69.1% of the time. The entire network infrastructure is outlined, including the operation of the sensors, followed by an analysis of its data yield and the development of the fault detection approach and the future system integration plans for this.
Full-text available
With the popularity of using deep learning-based models in various categorization problems and their proven robustness compared to conventional methods, a growing number of researchers have exploited such methods in environment sound classification tasks in recent years. However, the performances of existing models use auditory features like log-mel spectrogram (LM) and mel frequency cepstral coefficient (MFCC), or raw waveform to train deep neural networks for environment sound classification (ESC) are unsatisfactory. In this paper, we first propose two combined features to give a more comprehensive representation of environment sounds Then, a fourfour-layer convolutional neural network (CNN) is presented to improve the performance of ESC with the proposed aggregated features. Finally, the CNN trained with different features are fused using the Dempster–Shafer evidence theory to compose TSCNN-DS model. The experiment results indicate that our combined features with the four-layer CNN are appropriate for environment sound taxonomic problems and dramatically outperform other conventional methods. The proposed TSCNN-DS model achieves a classification accuracy of 97.2%, which is the highest taxonomic accuracy on UrbanSound8K datasets compared to existing models.
Full-text available
Urban noise recognition play a vital role in city management and safety operation, especially in the recent smart city engineering. Exiting studies on urban noise recognition are mostly based on conventional acoustic features, such as Mel-Frequency Cepstral Coefficients (MFCC) and Linear Prediction Cepstral Coefficients (LPCC), and the shallow structure based classifiers, such as support vector machine (SVM). However, the urban acoustic environment is complicated and changeable. Conventional acoustic representation and recognition methods may be insufficient in characterizing urban noises, and generally suffer from a degraded performance. In this paper, we study the recent deep neural network based urban noise recognition. The log-Mel-spectrogram, namely, the FBank feature is first derived for acoustic representation. Then, the FBank spectrum constructed with a set of FBank feature vectors from multiple acoustic signal frames is fed to a convolutional neural network (CNN) for urban noise recognition. Comprehensive studies on the dimension of FBank spectrums and the parameters in CNN, including the size of learnable kernels, the dropout rate, and the activation function, etc., are presented in the paper. An acoustic database collected in real environment covering 11 most common urban noises with more than 56,000 samples is constructed for model verification and performance evaluation. In addition, the traditional LPCC and MFCC acoustic feature combining with two popular machine learning algorithms, extreme learning machine (ELM) and support vector machine (SVM), and the FBank image feature combining with extreme learning machine (ELM), hierarchical extreme learning machine (H-ELM) and multilayer extreme learning machine (ML-ELM), have also been presented for discussions. Experimental results show that the proposed method generally outperforms conventional shallow structure based classifiers. © 2018 Springer Science+Business Media, LLC, part of Springer Nature
Full-text available
Automatic classification of environmental sounds, such as dog barking and glass breaking, is becoming increasingly interesting, especially for mobile devices. Most mobile devices contain both cameras and microphones, and companies that develop mobile devices would like to provide functionality for classifying both videos/images and sounds. In order to reduce the development costs one would like to use the same technology for both of these classification tasks. One way of achieving this is to represent environmental sounds as images, and use an image classification neural network when classifying images as well as sounds. In this paper we consider the classification accuracy for different image representations (Spectrogram, MFCC, and CRP) of environmental sounds. We evaluate the accuracy for environmental sounds in three publicly available datasets, using two well-known convolutional deep neural networks for image recognition (AlexNet and GoogLeNet). Our experiments show that we obtain good classification accuracy for the three datasets.
There's a steady drumbeat about how deep learning is going to touch nearly every area of technology, not least from us here at IEEE Spectrum. So, like a lot of folks, I've been interested in getting my hands dirty and learning some of the tools and techniques. But, like a lot of folks, I've been stymied by the difficulty of getting up and running.
We present the Sounds of New York City (SONYC) project, a smart cities initiative focused on developing a cyber-physical system for the monitoring, analysis and mitigation of urban noise pollution. Noise pollution is one of the topmost quality of life issues for urban residents in the U.S. with proven effects on health, education, the economy, and the environment. Yet, most cities lack the resources to continuously monitor noise and understand the contribution of individual sources, the tools to analyze patterns of noise pollution at city-scale, and the means to empower city agencies to take effective, data-driven action for noise mitigation. The SONYC project advances novel technological and socio-technical solutions that help address these needs. SONYC includes a distributed network of both sensors and people for large-scale noise monitoring. The sensors use low-cost, low-power technology, and cutting-edge machine listening techniques, to produce calibrated acoustic measurements and recognize individual sound sources in real time. Citizen science methods are used to help urban residents connect to city agencies and each other, understand their noise footprint, and facilitate reporting and self-regulation. Crucially, SONYC utilizes big data solutions to analyze, retrieve and visualize information from sensors and citizens, creating a comprehensive acoustic model of the city that can be used to identify significant patterns of noise pollution. These data can be used to drive the strategic application of noise code enforcement by city agencies to optimize the reduction of noise pollution. The entire system, integrating cyber, physical and social infrastructure, forms a closed loop of continuous sensing, analysis and actuation on the environment. SONYC provides a blueprint for the mitigation of noise pollution that can potentially be applied to other cities in the US and abroad.