Conference Paper

Baby cry recognition in real-world conditions

Authors:
Ioana-Alina Bănică, Horia Cucu*, Andi Buzo, Dragoş Burileanu, and Corneliu Burileanu
Speech and Dialogue Research Laboratory, University Politehnica of Bucharest, Romania
*Corresponding author (E-mail: horia.cucu@upb.ro)
Abstract—Studies have shown that there are different types of
cries depending on the newborn's need, such as hunger, tiredness,
discomfort, and so on. Neonatologists or pediatricians can
distinguish between different types of cries and can find a pattern
in each type of cry. Unfortunately, this is a real problem for
parents, who want to act as fast as possible to comfort the
newborn. In this paper, we propose a fully automatic system that
attempts to discriminate between different types of cries. The
baby cry classification system is based on Gaussian Mixture
Models and i-vectors. The evaluation is performed on an audio
database comprising six types of cries (colic, eructation,
discomfort, hunger, pain, tiredness) from 127 babies. The
experiments show promising results despite the difficulty of the
task.
Keywords— baby cry; automatic cry recognition; GMM-UBM;
i-vectors
I. INTRODUCTION
Newborns communicate by crying. This is how a newborn
expresses his/her physical and emotional state and his/her
needs. The reasons for a baby to cry include hunger, tiredness,
pain, discomfort, colic, eructation, flatulence, the need for
attention, and so on. Parents want to act as fast as possible
when their child is crying, but they do not always understand
why their baby is crying and they become frustrated.
Neonatologists can distinguish between different types of
cries, although it has been noticed that the interpretation of
cries can be subjective and that the experience level of the
listener is essential. It has also been shown that cries differ
according to gender, age, weight, natural conditions, and
socio-cultural development, which in principle requires
complex analysis. Studies have shown [1] that even skilled
persons are not always able to explain the basis of such cries,
which gives rise to the need for an automatic infant cry
recognition and analysis system.
From the way a newborn cries it can be determined
whether he/she suffers from a pathological disease, which can
be vital for the newborn if the disease is discovered early. The
sound, duration, intensity, or pitch of cries can show that the
newborn is suffering from a specific disease [2]. For example,
the cries of a newborn who suffers from Down syndrome are
very specific, so they can be easily distinguished by skilled
persons such as neonatologists or pediatricians.
A newborn cry consists of four main parts: a sound coming
from the expiration phase, a brief pause, a sound coming from
the inspiration phase, followed by another pause. The sound
and pause durations, as well as the repetitions, may differ from
child to child [3].
A newborn's cry is a special case of human speech and,
like speech, is a short-term stationary signal. The cry signal is
considered more stationary than the speech signal because
newborns do not have full control of the vocal tract. Compared
with the generation of adult speech, a cry is in turn a complex
neurophysiological act. It is the result of an intense expulsion
of air pressure from the lungs, causing oscillations of the vocal
cords and leading to the formation of sound waves [3].
Kaoru Arakawa invented a system for analyzing baby cries
capable of diagnosing why the baby is crying [4]. The system
performs waveform analysis (envelope shape analysis of the
waveform) and frequency analysis. The system can classify
the pain, hunger and tiredness cries.
Abdulaziz and Mumtazah [5] classified pain and non-pain
cries (two cry classes) using a neural network architecture
trained with the scaled conjugate gradient algorithm. The
reported accuracy varies from 57% up to 76.2% under
different parameter settings.
In this paper we approach the baby cry classification task
using methods that were successful in speaker and language
recognition, such as the Gaussian Mixture Model – Universal
Background Model (GMM-UBM) and i-vectors methods. We
performed our experiments on a database of real-life
recordings from 127 babies, crying due to six physiological
needs: hunger, discomfort, tiredness, pain, colic, and
eructation. The database was collected and labeled within the
SPLANN research project1, which aims to design and develop
an automatic recognition system for newborn cries.
II. BABY CRY DATABASE
Collecting and labeling a database of baby cries is a
difficult task, but such a database is required for any
experiment regarding automatic classification of baby cries.
First of all, it is hard to find enough babies to record and
experienced persons who can differentiate between different
types of cries. Another problem is collecting baby cries in a
controlled environment. Maternity wards are places where one
can find both babies and experienced persons such as
neonatologists, but creating and using a standard audio
acquisition framework in a hospital is itself a complex task.
Also, the acquisition and usage of baby cries requires written
consent from the parents.
1 SPLANN research project: softwinresearch.ro/index.php/ro/proiecte/splann
The SPLANN infant crying database, on which we
performed all the experiments presented in this paper, was
collected and labeled within the SPLANN research project [6].
The recordings were performed both in the maternity of the
“Sf. Pantelimon” Emergency Hospital in Bucharest and at
home. In our experiments we used only the recordings which
were collected in the hospital.
The recordings were collected using an iRig Mic
unidirectional microphone (frequency range 100 Hz – 15 kHz),
connected to a Samsung Galaxy S4 Mini smartphone. The
sampling frequency was 44.1 kHz, the bit resolution was 16
bits per sample, and the file format was WAV [6]. An example
of the wave signal for a typical baby cry (colic in this case) is
illustrated in Fig. 1.
For the recordings collected in the hospital, the data were
labeled by neonatal experts on site, immediately after the
acquisition. The doctors recorded the newborn's cry using an
online application and then comforted the need that they
thought the newborn had. The data labeling was validated by
comforting the newborn.
The database contains six types of cries (colic, discomfort,
eructation, hunger, pain, tiredness) from 127 infants aged
between 0 and 3 months. In total, the database comprises
about 750 audio recordings, amounting to 12914 cries. The
total duration of the cries is 205 minutes. More details
regarding each need are given in Table I.
III. BABY CRY CLASSIFICATION METHODS
For the baby cry recognition problem we applied two
methods that were successful in speaker recognition [7, 8],
namely the GMM-UBM and i-vectors frameworks.
A. GMM-UBM Framework
The GMM-UBM framework has two stages: a training
stage and an evaluation stage.
In the training stage, a Gaussian Mixture Model (GMM) is
built for every type of cry. A GMM is a statistical model
consisting of a weighted mixture of Gaussian densities, each
characterized by a mean vector and a variance vector. The
GMMs are created by adapting a Universal Background
Model (UBM), which represents crying in general, using
recordings of a specific type of cry and the Maximum
A-Posteriori (MAP) algorithm. The UBM itself is built from a
large number of recordings of all types of crying using the
Expectation Maximization (EM) algorithm.
During the evaluation, each recording is scored against
each of the GMMs. The cry class whose GMM yields the
highest likelihood score is chosen as the label for the given
recording.
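The training and evaluation stages above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses scikit-learn's GaussianMixture and trains one GMM per class directly with EM, rather than MAP-adapting each class model from a UBM as the paper does. All variable names and the toy 2-D "features" are hypothetical.

```python
# Sketch of GMM-based cry classification (simplified: per-class GMMs
# trained directly with EM instead of MAP adaptation from a UBM).
# Feature matrices are assumed to be (n_frames, n_features) arrays.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_class_gmms(features_by_class, n_components=2, seed=0):
    """Fit one GMM per cry class on its pooled frame-level features."""
    gmms = {}
    for label, frames in features_by_class.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type='diag', random_state=seed)
        gmm.fit(frames)
        gmms[label] = gmm
    return gmms

def classify(gmms, frames):
    """Label a recording with the class whose GMM gives the highest
    average frame log-likelihood."""
    scores = {label: gmm.score(frames) for label, gmm in gmms.items()}
    return max(scores, key=scores.get)

# Toy demo with synthetic 2-D "features" drawn from separated Gaussians
rng = np.random.default_rng(0)
train = {'hunger': rng.normal(0.0, 1.0, (200, 2)),
         'pain':   rng.normal(5.0, 1.0, (200, 2))}
gmms = train_class_gmms(train)
print(classify(gmms, rng.normal(5.0, 1.0, (50, 2))))  # 'pain'
```

In the paper's setup the per-class models share the UBM's structure and differ only through MAP-adapted means, which makes them comparable even for classes with little training data.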
B. The I-vectors Framework
The i-vectors framework was also first introduced in the
context of speaker identification [9]. The method is also
known as the total variability modeling method.
Fig. 1. Example of colic baby cry
TABLE I. DETAILS FOR SPLANN INFANT CRYING DATABASE
Cry type Code # babies # cries Duration
Colic C1 7 225 3 min 26 s
Discomfort C2 77 2210 37 min 13 s
Eructation C3 11 505 5 min 36 s
Hunger C4 92 5536 86 min 42 s
Pain C5 104 4404 72 min 9 s
Tiredness C6 1 34 24 s
Total 12914 205 min 30 s
The total variability matrix (TVM) contains, as its
columns, the mean vectors of the GMMs. Each column is a
GMM supervector created from one audio file in the training
stage, so initially the number of columns in the TVM equals
the number of audio files used in the training stage.
The i-vectors framework also involves two stages: a
training stage and an evaluation stage. In the training stage,
the UBM is estimated, the TVM is constructed, i-vectors are
extracted for each type of cry, and a Probabilistic Linear
Discriminant Analysis (PLDA) Gaussian model is used to
reduce the dimensionality of the i-vectors.
In the evaluation stage, Cosine Scoring with Within Class
Covariance Normalization (WCCN) is used to classify each
cry recording into one of the cry classes.
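The cosine scoring step can be sketched as below. This is a bare illustration under simplifying assumptions: a test i-vector is compared against a hypothetical mean i-vector per class, and the WCCN/PLDA transforms that the paper applies before scoring are omitted.

```python
# Sketch of evaluation-stage cosine scoring for i-vectors. Each class
# is represented by a (hypothetical) mean i-vector; the class with the
# highest cosine similarity to the test i-vector wins. WCCN/PLDA,
# applied before scoring in the paper, are omitted for brevity.
import numpy as np

def cosine_score(w_test, w_class):
    return float(w_test @ w_class /
                 (np.linalg.norm(w_test) * np.linalg.norm(w_class)))

def classify_ivector(w_test, class_means):
    scores = {c: cosine_score(w_test, m) for c, m in class_means.items()}
    return max(scores, key=scores.get)

# Toy 3-dimensional i-vectors for three cry classes
class_means = {'C2': np.array([1.0, 0.0, 0.0]),
               'C4': np.array([0.0, 1.0, 0.0]),
               'C5': np.array([0.0, 0.0, 1.0])}
print(classify_ivector(np.array([0.1, 0.9, 0.2]), class_means))  # 'C4'
```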
IV. BABY CRY RECOGNITION EXPERIMENTS
A. Experimental Setup
The experiments were performed using the Microsoft
Speaker Recognition (MSR) toolkit, which contains the
implementations for both GMM-UBM and i-vectors
techniques as well as for feature extraction [10].
To characterize the type of cry we used 13 Mel Frequency
Cepstral Coefficients (MFCCs) extracted from overlapped
windows (50% overlap, a setup similar to that used for speech
and speaker recognition). For the extraction of the MFCCs
with the MSR toolkit we used a Hamming window of 20 ms.
B. Experimental Results
1) Experiments using the GMM-UBM framework
The database was divided such that the subjects used in the
training stage were not used in the evaluation stage. The
subjects were chosen so that the evaluation recordings
represent 10% of the total.
Even though the tiredness class contains just one subject,
we decided to keep this class in our experiments. For this class
only, we randomly chose 10% of the files for the evaluation
stage, whereas the remaining 90% were used in the training
stage.
We ran the experiment varying the number of Gaussian
probability densities per GMM from 16 to 16384. The highest
overall accuracy (35%) was obtained by the models created
with 2048 and 4096 Gaussian probability densities. A higher
number of Gaussian densities is not required, as the accuracy
starts to drop. Although the overall accuracy appears low, it
should be noted that random performance would correspond
to an accuracy of 16.6%.
Table II illustrates the confusion matrix for the case where
the maximum accuracy was achieved (the model created with
2048 Gaussian probability densities per GMM). The rows
represent the actual cry classes, while the columns represent
the recognized cry classes. The last column shows the per-
class accuracy, which varies from 0% to 54%.
From the confusion matrix (Table II) we can observe that
the discomfort (C2), hunger (C4), and pain (C5) cries are
confused with each other in large proportion. Moreover, these
classes are the ones trained with the greatest amount of data.
This motivated us to run experiments using only these three
classes; the results are shown further in this section.
2) Experiments using i-vectors
In this experiment we tested on the same data set and used
the same features as in the previous experiment, but used the
i-vector method instead of GMM-UBM. We varied the
number of densities in a GMM from 2 to 256, the size of the
TVM (50, 100, 200, and 250), and the number of TVM
training iterations (5, 10, 15, and 20).
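A sweep over these three hyper-parameters amounts to a grid search, which can be organized as below. Here `train_and_score` is a hypothetical stand-in: a real implementation would train the UBM and TVM with each configuration and return accuracy on the held-out evaluation set; the toy scoring function merely peaks, for illustration, at the configuration reported as best in this experiment.

```python
# Sketch of the hyper-parameter grid search over the i-vector settings.
from itertools import product

densities  = [2, 4, 8, 16, 32, 64, 128, 256]
tvm_sizes  = [50, 100, 200, 250]
iterations = [5, 10, 15, 20]

def train_and_score(n_gauss, tvm_size, n_iter):
    # Placeholder: real code would train the system with these settings
    # and return held-out accuracy. This toy score peaks at the
    # configuration this paper reports as best.
    return -abs(n_gauss - 32) - abs(tvm_size - 200) - abs(n_iter - 20)

best = max(product(densities, tvm_sizes, iterations),
           key=lambda cfg: train_and_score(*cfg))
print(best)  # (32, 200, 20)
```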
The highest accuracy, 39%, was obtained for the model
created with a UBM of 32 Gaussian probability densities, a
TVM of size 200, and 20 TVM training iterations. The
confusion matrix for this model configuration is presented in
Table III. It shows that the discomfort (C2), hunger (C4), and
pain (C5) cries are still confused with each other in large
proportion.
3) Experiments with three classes using the GMM-UBM
framework
In this experiment we used only the discomfort, hunger,
and pain cries, which are confused with each other in large
proportion (as shown in the previous experiments). We ran the
experiment varying the number of Gaussian probability
densities per GMM from 2 to 8192. The highest accuracy
(50.6%) was obtained with the GMM with 4 probability
densities. The detailed results are depicted in Fig. 2.
TABLE II. CONFUSION MATRIX FOR GMM-UBM EXPERIMENT
# cries C1 C2 C3 C4 C5 C6 Acc [%]
C1 7 13 0 1 2 2 28
C2 18 46 9 54 70 12 22
C3 2 9 3 22 9 5 6
C4 36 171 39 156 121 38 27
C5 11 115 12 47 242 15 54
C6 0 1 0 2 0 0 0
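The per-class accuracies in the last column of Table II can be reproduced directly from the confusion-matrix counts: each diagonal entry divided by its row sum (truncated to whole percent), with the overall accuracy given by the trace over the total count.

```python
# Reproducing the per-class and overall accuracies of Table II.
import numpy as np

cm = np.array([[  7,  13,  0,   1,   2,  2],   # C1 colic
               [ 18,  46,  9,  54,  70, 12],   # C2 discomfort
               [  2,   9,  3,  22,   9,  5],   # C3 eructation
               [ 36, 171, 39, 156, 121, 38],   # C4 hunger
               [ 11, 115, 12,  47, 242, 15],   # C5 pain
               [  0,   1,  0,   2,   0,  0]])  # C6 tiredness

# integer floor division keeps the truncated whole-percent values
per_class = np.diag(cm) * 100 // cm.sum(axis=1)
overall = 100 * np.trace(cm) / cm.sum()
print(per_class)            # [28 22  6 27 54  0]
print(round(overall, 1))    # ~35%, as reported
```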
TABLE III. CONFUSION MATRIX FOR I-VECTORS EXPERIMENT
# cries C1 C2 C3 C4 C5 C6 Acc [%]
C1 11 7 2 1 3 1 44
C2 49 7 37 52 58 6 3
C3 10 1 10 18 8 3 6
C4 148 25 74 182 112 20 32
C5 58 27 23 27 295 12 66
C6 0 1 0 1 0 1 0
Fig. 2. Detection accuracy using GMM-UBM method (for 3 cry classes)
4) Experiments with three classes using the i-vectors
framework
In this experiment we used the same setup as in the
previous experiment, but with the i-vectors framework instead
of GMM-UBM.
We ran the experiments varying the number of Gaussian
probability densities from 2 up to 512; we also varied the size
of the TVM (5, 10, 50, 100, 200, 250, 300, 400, 500, and 600)
and the number of TVM training iterations (5, 10, 15, and 20).
We obtained the highest accuracy (58%) for the model
trained with a TVM of size 100, 20 TVM training iterations,
and a UBM created with 256 Gaussian probability densities.
The confusion matrix, in counts and in percentages, for the
best configuration is given in Table IV and Table V,
respectively.
As seen in Table V, only 5.7% of the discomfort cries are
correctly classified; the rest are confused with hunger (51.2%)
and pain (43.1%). Hunger cries are confused in a large
proportion (26.6%) with pain cries. We also noticed from this
experiment that pain cries are generally classified correctly, in
approximately 70% of cases.
TABLE IV. CONFUSION MATRIX FOR I-VECTORS EXPERIMENT USING ONLY THREE CLASSES
# cries Discomfort (C2) Hunger (C4) Pain (C5)
Discomfort (C2) 12 107 90
Hunger (C4) 31 381 149
Pain (C5) 15 116 311
TABLE V. CONFUSION MATRIX IN PERCENTAGE FOR I-VECTORS EXPERIMENT USING ONLY THREE CLASSES
% cries Discomfort (C2) Hunger (C4) Pain (C5)
Discomfort (C2) 5.7 51.2 43.1
Hunger (C4) 5.5 67.9 26.6
Pain (C5) 3.4 26.2 70.4
V. DISCUSSION AND CONCLUSIONS
Motivated by the fact that skilled persons such as
neonatologists can distinguish between different types of baby
cries, finding some patterns in each type, we developed a fully
automatic system that attempts to discriminate between
different types of cries. The system uses the GMM-UBM and
i-vectors frameworks for classification and was evaluated on a
real-world audio database comprising about 13000 cries of six
types uttered by more than 120 babies.
The results obtained using six types of cries (colic,
discomfort, eructation, hunger, pain, tiredness) with the
GMM-UBM and i-vectors frameworks are comparable. The
accuracy obtained with the i-vectors framework (39%) is
slightly higher than with the GMM-UBM framework (35%).
These results show that the classes are not disjoint. One
possible reason is the choice of features, i.e., the MFCCs; this
suggests exploring other features in the future. Another reason
can be the inherent overlap of the classes, given that different
needs may lead to similar contractions and thus similar sounds
produced by the babies. Yet another reason can be the
multitude of information contained in a cry. Besides
information about the need, a cry also contains information
about the identity of the baby and even some background
sounds. The last two act as noise and contribute to increasing
the confusion.
We decided to run experiments with only the discomfort,
hunger, and pain cries because we noticed that these types of
cries are confused with each other in large proportion and
seem to dominate the recognition of the other classes as well.
The experimental results showed that recognition with only
these three classes achieves greater accuracy (50.6% for the
GMM-UBM method and 58.0% for the i-vectors method).
Even though this was expected, because the uncertainty is
reduced (3 classes instead of 6), the large increase in accuracy
also suggests that a greater amount of data is needed to
properly train the remaining classes.
We observed that only 5.7% of the discomfort cries are
correctly classified; the rest are confused with hunger (51.2%)
and pain (43.1%). Hunger cries are confused in a large
proportion (26.6%) with pain cries, which are themselves
correctly classified in approximately 70% of cases. One
possible explanation is that discomfort is a vaguer class and
may contain cries produced by needs similar (in terms of
muscle contraction) to those of hunger and pain.
The results obtained on the SPLANN database are lower
than those obtained on the Dunstan Baby Language (DBL)
database (81.8% overall accuracy) [11]. A possible
explanation is that the DBL database is much smaller: it
contains only 83 cries from 5 classes (discomfort, eructation,
hunger, flatulence, tiredness) uttered by only 39 babies, with a
total duration of 430 seconds. A second explanation could be
that the DBL recordings were carefully selected to be
representative of the different types of cry (they were
extracted from the DBL video tutorial).
Future work involves an in-depth analysis of the errors
obtained in these experiments and the evaluation of other
classification methods and audio features (including time-
domain features such as the duration of the expiratory
component, the time between two cries, etc.).
ACKNOWLEDGMENT
This work was partly supported by the PN II Programme
“Partnerships in priority areas” of MEN - UEFISCDI, through
project no. 25/2014.
REFERENCES
[1] E. Drumond and M.L. McBride, "The development of mothers'
understanding of infant crying," in Clinical Nursing Research, vol. 2, pp.
396-441, 1993.
[2] O.F. Reyes-Galaviz and C.A. Reyes-Garcia, “A system for the
processing of infant cry to recognize pathologies in recently born babies
with neural networks,” Proc. of the 9th Conference Speech and
Computer SPECOM’2004, St. Petersburg, Russia, pp. 552-557, 2004.
[3] P.S. Zeskind and B.M. Lester, “Analysis of infant crying,” Chapter 8 in
Biobehavioral Assessment of the Infant, L.T. Singer and P.S. Zeskind,
Eds. New York: Guilford Publications Inc., 2001, pp. 149-166.
[4] K. Arakawa, “System and method for analyzing baby cries,” US Patent
6,496,115 B2, Google Patents, 2002.
[5] Y. Abdulaziz and S.M.S. Ahmad, “Infant cry recognition system: a
comparison of system performance based on Mel frequency and linear
prediction cepstral coefficients," Proc of Int. Conf. on Information
Retrieval and Knowledge Management, Selangor, pp. 260-263, 2010.
[6] M.S. Rusu, S.S. Diaconescu, G. Sardescu, and E. Bratila, “Database and
system design for data collection of crying related to infant’s needs and
diseases,” Proc. of the 8th Int. Conf. on Speech Technology and Human-
Computer Dialogue SpeD 2015, Bucharest, Romania, pp. 1-6, 2015.
[7] D.A. Reynolds, T.F. Quatieri, and R.B. Dunn, “Speaker verification
using adapted Gaussian mixture models,” in Digital Signal Processing,
vol. 10, nos. 1-3. Academic Press, 2000, pp. 19-41.
[8] N. Dehak, R. Dehak, P. Kenny, N. Brümmer, and P. Ouellet, "Support
vector machines versus fast scoring in the low-dimensional total
variability space for speaker verification," Proc. of INTERSPEECH
2009, Brighton, United Kingdom, pp. 1559-1562, 2009.
[9] N. Dehak, P.J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-
end factor analysis for speaker verification,” in IEEE Trans. on Audio,
Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.
[10] S.O. Sadjadi, M. Slaney, and L. Heck, “MSR identity toolbox v1.0: a
MATLAB toolbox for speaker-recognition research,” in Speech and
Language Processing Technical Committee Newsletter, IEEE, 2013.
[11] I.-A. Bănică, H. Cucu, A. Buzo, D. Burileanu, and C. Burileanu,
"Automatic methods for infant cry classification," Proc. of Int. Conf. on
Communications COMM 2016, Bucharest, Romania, 2016.
... If the newborn is suffering from a specific disease, it can be identified considering the sound, duration, height, or intensity of the cries. Examples of such diseases that can be early discovered include autism, deafness, asphyxia, respiratory diseases, heart diseases, blood diseases, or neurological diseases [6,7,8]. ...
... The cry of a baby is a special case of human speech and is a short-term stationary signal as the speech is. The cry signal is evaluated to be more stationary than the speech signal because newborns do not have full control of their vocal tract [6,8]. ...
... GMM is a statistical model whose data points are formed from more Gaussian distributions, characterized by some mean and variance. Then the test data can be classified by the trained model based on the Expectation Maximization (EM) algorithm used for finding the maximum match of the parameters [6,8]. ...
... On the classifier side, baby cry studies (including our own previous research [17,18]) use only a handful of classifiers at a time [14,19]. ...
... In our previous work, which is part of the same research project aiming to develop an automatic recognition system for newborn cries, we used the classical Mel Frequency Cepstral Coefficients (MFCCs) and classifiers such as Gaussian Mixture Model -Universal Background Model (GMM-UBM) and i-vectors, we performed experiments on 2 baby cry datasets: Dunstan [17] and SPLANN [18]. ...
... In this paper, we explore baby cry characteristics by looking at various paralinguistic feature sets, proven to be successful in tasks such as emotion recognition [9], deception, sincerity and native language from speech recognition [20]: the emobase2010 feature set [24] and the INTERSPEECH Computational Paralinguistics Challenge (ComParE) feature set [17]. Going further, we extended our approach by analyzing these features and using them to train the various classifiers available in the Waikato Environment for Knowledge Analysis (WEKA) [22]. ...
... The work is based on research performed a priori, representing a continuity of the following papers: development of a fully automatic system that attempts to discriminate between different types of cries using a database created and labeled based on the Dunstan Baby Language (DBL) baby-cry classification video tutorial [5] and using the SPLANN infant crying database [6]. ...
... Both databases are labeled at expiration level, therefore the previous detection [5,6] was in terms of expiration. The novelty of this paper emerges from a new approach -the detection for SPLANN database is performed also at cry level, namely a succession of expirations (variable) , instead of a single one. ...
... The first one -Dunstan Baby Language Database [5] has been used to prove that information about need is found in the spectrum of the signal; it contains cries for five needs: : "EAIRH" -flatulence, "EH" -eructation, "HEH" -discomfort, "NEH" -hunger, "OWH"tiredness. The second database used was SPLANN Infant Crying Database [6], seven times bigger than the Dunstan database; the recordings of the cries were collected in the hospital with the help of a mobile phone application. ...
Chapter
Full-text available
Studies have shown that newborns are crying differently depending on their need. Medical experts state that the need behind a newborn cry can be identified by listening to the cry – an easy task for specialists but extremely hard for unskilled parents, who want to act as fast as possible to comfort their baby. In this paper, we propose various experiments on a previously developed fully automatic system that attempts to discriminate between different types of cries, based on Gaussian Mixture Models. The experiments show promising results despite the difficulty of the task.
... Young parents get baffled and experience difficulty calming down their infants since all cry signals sound very similar to them. The major problem faced by many new parents is that they hardly understand the reason for the infant's cry (1)(2)(3)(4)(5)(6)(7)(8). It is not possible to identify the reason just by looking at the face or analyzing the emotions of the infant (9)(10)(11)(12). ...
... The work in (7) portraying the advancement of significant information innovation, anticipating clients' buying goals through precise information of their buying practices has turned into a fundamental system for organizations to perform accuracy promotion and increase deal volume. The information of clients' buying behavior is described by an enormous sum, significant changeability, and long haul reliance. ...
Article
Full-text available
Understanding the reason for an infant's cry is the most difficult thing for parents. There might be various reasons behind the baby's cry. It may be due to hunger, pain, sleep, or diaper-related problems. The key concept behind identifying the reason behind the infant's cry is mainly based on the varying patterns of the crying audio. The audio file comprises many features, which are highly important in classifying the results. It is important to convert the audio signals into the required spectrograms. In this article, we are trying to find efficient solutions to the problem of predicting the reason behind an infant's cry. In this article, we have used the Mel-frequency cepstral coefficients algorithm to generate the spectrograms and analyzed the varying feature vectors. We then came up with two approaches to obtain the experimental results. In the first approach, we used the Convolution Neural network (CNN) variants like VGG16 and YOLOv4 to classify the infant cry signals. In the second approach, a multistage heterogeneous stacking ensemble model was used for infant cry classification. Its major advantage was the inclusion of various advanced boosting algorithms at various levels. The proposed multistage heterogeneous stacking ensemble model had the edge over the other neural network models, especially in terms of overall performance and computing power. Finally, after many comparisons, the proposed model revealed the virtuoso performance and a mean classification accuracy of up to 93.7%.
Article
Full-text available
This paper reviews recent research works in infant cry signal analysis and classification tasks. A broad range of literatures are reviewed mainly from the aspects of data acquisition, cross domain signal processing techniques, and machine learning classification methods. We introduce pre-processing approaches and describe a diversity of features such as MFCC, spectrogram, and fundamental frequency, etc. Both acoustic features and prosodic features extracted from different domains can discriminate frame-based signals from one another and can be used to train machine learning classifiers. Together with traditional machine learning classifiers such as KNN, SVM, and GMM, newly developed neural network architectures such as CNN and RNN are applied in infant cry research. We present some significant experimental results on pathological cry identification, cry reason classification, and cry sound detection with some typical databases. This survey systematically studies the previous research in all relevant areas of infant cry and provides an insight on the current cutting-edge works in infant cry signal analysis and classification. We also propose future research directions in data processing, feature extraction, and neural network classification fields to better understand, interpret, and process infant cry signals.
Conference Paper
Baby cry sound detection allows parents to be automatically alerted when their baby is crying. Current solutions in home environment ask for a client-server architecture where an end-node device streams the audio to a centralized server in charge of the detection. Even providing the best performances, these solutions raise power consumption and privacy issues. For these reasons, interest has recently grown in the community for methods which can run locally on battery-powered devices. This work presents a new set of features tailored to baby cry sound recognition, called hand crafted baby cry (HCBC) features. The proposed method is compared with a baseline using mel-frequency cepstrum coefficients (MFCCs) and a state-of-the-art convolutional neural network (CNN) system. HCBC features result to be on par with CNN, while requiring less computation effort and memory space at the cost of being application specific.
Article
Full-text available
In this work we present the design of an automatic infant cry recognition system that classifies three different kinds of cries, which come from normal, deaf and asphyxiating infants, of ages from one day up to nine months old. The classification is done through a pattern classifier, where the crying waves are taken as the input patterns. We have experimented with patterns formed by vectors of Mel Frequency Cepstral Coefficients and Linear Prediction Coefficients. The acoustic feature vectors are then processed, to be classified in their corresponding type of cry, through an Input Delay Neural Network, trained by gradient descent with adaptive learning rate back propagation algorithm. To perform the experiments and to test the recognition system, we train the neural network with cries from randomly selected babies, and test it with a separate set of cries from babies selected only for testing. Here, we present the design and implementation of the complete system, as well as the results from some experiments, which in the presented case are up to 86 %.
Article
Full-text available
This paper presents an extension of our previous work which proposes a new speaker representation for speaker verification. In this modeling, a new low-dimensional speaker- and channel-dependent space is defined using a simple factor analysis. This space is named the total variability space because it models both speaker and channel variabilities. Two speaker verification systems are proposed which use this new representation. The first system is a support vector machine-based system that uses the cosine kernel to estimate the similarity between the input data. The second system directly uses the cosine similarity as the final decision score. We tested three channel compensation techniques in the total variability space, which are within-class covariance normalization (WCCN), linear discriminate analysis (LDA), and nuisance attribute projection (NAP). We found that the best results are obtained when LDA is followed by WCCN. We achieved an equal error rate (EER) of 1.12% and MinDCF of 0.0094 using the cosine distance scoring on the male English trials of the core condition of the NIST 2008 Speaker Recognition Evaluation dataset. We also obtained 4% absolute EER improvement for both-gender trials on the 10 s-10 s condition compared to the classical joint factor analysis scoring.
Conference Paper
Full-text available
This paper presents a new speaker verification system architecture based on Joint Factor Analysis (JFA) as feature extractor. In this modeling, the JFA is used to define a new low-dimensional space named the total variability factor space, instead of both channel and speaker variability spaces for the classical JFA. The main contribution in this approach, is the use of the cosine kernel in the new total factor space to design two different systems: the first system is Support Vector Machines based, and the second one uses directly this kernel as a decision score. This last scoring method makes the process faster and less computation complex compared to others classical methods. We tested several intersession compensation methods in total factors, and we found that the combination of Linear Discriminate Analysis and Within Class Covariance Normalization achieved the best performance. We achieved a remarkable results using fast scoring method based only on cosine kernel especially for male trials, we yield an EER of 1.12% and MinDCF of 0.0094 on the English trials of the NIST 2008 SRE dataset.
Conference Paper
In this paper we focus on designing and collecting an infant crying database. The data have been collected using an in-house developed acquisition system. The study involved 136 infants aged between 0 and 3 months. Seven types of cries have been collected: hunger, pain, eructation, tiredness, discomfort, colic and pathology. Part of the data (about 750 audio/video recordings, comprising about 13,000 cries) was collected in the maternity ward of the “Sf. Pantelimon” Emergency Hospital, and the rest (over 340 audio/video recordings, comprising about 5,100 cries) was collected at the newborns' homes. All the data have been labeled by neonatal experts. This infant crying database will be used to develop new methods for infant cry recognition.
Conference Paper
This paper describes the architecture of an automatic infant cry recognition system whose main task is to identify and differentiate between pain and non-pain cries of infants. The recognition system is based on a feed-forward neural network trained with the scaled conjugate gradient algorithm. The paper presents an in-depth comparison of system performance with two different feature sets, Mel-Frequency Cepstral Coefficients (MFCC) and Linear Prediction Cepstral Coefficients (LPCC), extracted from the audio samples of infants' cries and fed into the recognition module. The system accuracy reported in this study varies from 57% up to 76.2% under different parameter settings. The results demonstrate that, in general, the infant cry recognition system performs better with the MFCC feature sets.
Article
Reynolds, Douglas A., Quatieri, Thomas F., and Dunn, Robert B., Speaker Verification Using Adapted Gaussian Mixture Models, Digital Signal Processing 10 (2000), 19–41. In this paper we describe the major elements of MIT Lincoln Laboratory's Gaussian mixture model (GMM)-based speaker verification system used successfully in several NIST Speaker Recognition Evaluations (SREs). The system is built around the likelihood ratio test for verification, using simple but effective GMMs for likelihood functions, a universal background model (UBM) for alternative speaker representation, and a form of Bayesian adaptation to derive speaker models from the UBM. The development and use of a handset detector and score normalization to greatly improve verification performance are also described and discussed. Finally, representative performance benchmarks and system behavior experiments on NIST SRE corpora are presented.
Article
This study examines the development of mothers' understanding of their infants' crying. Semistructured tape-recorded interviews were conducted with 17 mothers at 6 weeks, 10 weeks, and 16 weeks postpartum. The mothers (9 primiparous, 8 multiparous) were chosen for their good health status and for their immediate support system. Two major themes were identified from the interviews. In general, it was found that as the mothers became more experienced, the understanding of the cry situation became more complete and soothing was more effective. The relation between crying and soothing became more differentiated, more cohesive, and more complete. The effect of experience on understanding was particularly dramatic in the case of multiparous mothers. Both health promotional and illness prevention programming are proposed as nursing care measures for mothers of crying infants. The important assumptions underlying each approach are delineated.
S. O. Sadjadi, M. Slaney, and L. Heck, "MSR Identity Toolbox v1.0: a MATLAB toolbox for speaker-recognition research," in Speech and Language Processing Technical Committee Newsletter, IEEE, 2013.
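Several of the entries above (the total variability papers and the GMM-UBM article) rely on cosine distance scoring between low-dimensional speaker representations. As a minimal illustration of that decision score, the following sketch computes the cosine similarity between two hypothetical enrollment and test vectors; real i-vectors are a few hundred dimensions and would first be channel-compensated (e.g. LDA followed by WCCN), which is omitted here.

```python
import math

def cosine_score(w_target, w_test):
    """Cosine distance scoring between two speaker vectors.

    In the i-vector systems cited above, both vectors would first be
    channel-compensated (e.g. LDA followed by WCCN); the vectors used
    below are hypothetical low-dimensional stand-ins.
    """
    dot = sum(a * b for a, b in zip(w_target, w_test))
    norm_t = math.sqrt(sum(a * a for a in w_target))
    norm_e = math.sqrt(sum(b * b for b in w_test))
    return dot / (norm_t * norm_e)

# Toy enrollment/test vectors (illustrative only)
w_enroll = [0.5, -1.2, 0.3, 0.8]
w_test = [0.4, -1.0, 0.2, 0.9]
score = cosine_score(w_enroll, w_test)
# The verification decision compares `score` against a threshold tuned
# on a development set (e.g. for EER or MinDCF).
```

Because the score is just a dot product of normalized vectors, no target-model training is needed at test time, which is what makes this scoring faster than classical JFA scoring.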