Baby Cry Recognition in Real-World Conditions
Ioana-Alina Bănică, Horia Cucu*, Andi Buzo, Dragoş Burileanu, and Corneliu Burileanu
Speech and Dialogue Research Laboratory, University Politehnica of Bucharest, Romania
*Corresponding author (E-mail: email@example.com)
Abstract—Studies have shown that there are different types of
cries depending on the newborn's need, such as hunger, tiredness,
discomfort, and so on. Neonatologists and pediatricians can
distinguish between different types of cries and can find a pattern
in each type of cry. Unfortunately, this is a real problem for
parents, who want to act as fast as possible to comfort the
newborn. In this paper, we propose a fully automatic system that
attempts to discriminate between different types of cries. The
baby cry classification system is based on Gaussian Mixture
Models and i-vectors. The evaluation is performed on an audio
database comprising six types of cries (colic, eructation,
discomfort, hunger, pain, tiredness) from 127 babies. The
experiments show promising results despite the difficulty of the task.
Keywords—baby cry; automatic cry recognition; GMM-UBM; i-vectors
I. INTRODUCTION
Newborns communicate by crying. This is how a
newborn expresses his/her physical and emotional state and
his/her needs. The reasons for a baby to cry include hunger,
tiredness, pain, discomfort, colic, eructation, flatulence,
the need for attention, and so on. Parents want to act
as fast as possible when their child is crying, but they do
not always understand why the baby is crying.
Neonatologists can distinguish between different types of
cries, although it has been noticed that the interpretation of cries
can be subjective and that the experience level of the listener is
essential. It has also been shown that cries differ according to
gender, age, weight, natural conditions, and social-cultural
development, which in principle requires complex analysis.
Studies have shown that even skilled persons
are not always able to explain the basis of such cries, which
gives rise to the need for an automatic infant cry
recognition and analysis system.
The way a newborn cries can indicate
whether he/she suffers from a pathological disease, and early
discovery of such a disease can be vital for the newborn. The
sound, duration, intensity, or pitch of the cries can show that the
newborn is suffering from a specific disease. For example,
the cries of a newborn who suffers from Down syndrome are
very specific, so they can be easily distinguished by skilled
persons such as neonatologists or pediatricians.
A newborn cry consists of four main parts: a sound produced
during the expiration phase, a brief pause, a sound produced
during the inspiration phase, followed by another pause. The sound
and pause durations, as well as the number of repetitions, may differ
from child to child.
The newborn cry is a special case of human speech and,
like speech, is a short-term stationary signal. The cry
signal is considered more stationary than the speech signal
because newborns do not have full control of the vocal tract.
Compared with the generation of adult speech,
a cry is nevertheless a complex neurophysiological act: it is the
result of an intense expulsion of air under pressure from the lungs,
causing oscillations of the vocal cords and leading to the
formation of sound waves.
Kaoru Arakawa invented a system for analyzing baby cries
capable of diagnosing why the baby is crying [4]. The system
performs waveform analysis (envelope shape analysis of the
waveform) and frequency analysis, and can classify
pain, hunger, and tiredness cries.
Abdulaziz and Ahmad [5] classified pain and non-pain
cries (two cry classes) using a neural network architecture
trained with the scaled conjugate gradient algorithm. The
reported accuracy varies from 57% up to
76.2% under different parameter settings.
In this paper we approach the baby cry classification
task using methods that have been successful in speaker and
language recognition, namely the Gaussian Mixture Model –
Universal Background Model (GMM-UBM) and i-vectors
methods. We performed our experiments on a database of
real-life recordings from 127 babies, crying due to
six physiological needs: hunger, discomfort, tiredness, pain,
colic, and eructation. The database was collected and labeled
within the SPLANN research project1, which aims to design
and develop an automatic recognition system for newborn cries.
II. BABY CRY DATABASE
Collecting and labeling a database of baby cries is a
difficult task, but such a database is required for any
experiment regarding automatic classification of baby cries.
First of all, it is hard to find enough babies to record, as well as
experienced persons who can differentiate between the
different types of cries. Another problem is collecting baby
cries in a controlled environment. Places where one can find
both babies and experienced persons such as neonatologists are
maternity wards, but creating and using a standard audio
acquisition framework in a hospital is itself a complex task.
Also, a written consent from the parents is needed for the
acquisition and usage of baby cries.
1 SPLANN research project: softwinresearch.ro/index.php/ro/proiecte/splann
The SPLANN infant crying database, on which we
performed all the experiments presented in this paper, was
collected and labeled within the SPLANN research project [6].
The recordings were performed both in the maternity of the
“Sf. Pantelimon” Emergency Hospital in Bucharest and at
home. In our experiments we used only the recordings which
were collected in the hospital.
The recordings were collected using an iRig Mic
unidirectional microphone (frequency range from 100 Hz)
connected to a Samsung Galaxy S4 Mini smartphone. The
sampling frequency was 44.1 kHz, the bit resolution was 16
bits per sample, and the file format was WAV. An example
of wave signal for a typical baby cry (colic in this case) is
illustrated in Fig. 1.
For the recordings collected in the hospital, the data was
labeled by neonatal experts, on site, immediately after the
acquisition. The doctors recorded the newborn's cry using an
online application and then addressed the need that they
thought the newborn had. The labeling was validated by
whether this actually comforted the newborn.
This database contains six types of cries (colic, discomfort,
eructation, hunger, pain, tiredness) from 127 infants aged
between 0 and 3 months. In total the database comprises about
750 audio recordings meaning 12914 cries. The total duration
of the cries is 205 minutes. More details regarding each need
are given in Table I.
III. CLASSIFICATION METHODS
For the baby cry recognition problem we applied two methods
that have been successful in speaker recognition [7, 8], namely
the GMM-UBM and i-vectors frameworks.
A. GMM-UBM Framework
In GMM-UBM framework we have two stages: the
training stage and the evaluation stage.
In the training stage, a Gaussian Mixture Model (GMM) is built
for every type of cry. A GMM is a statistical model
characterized by a set of mixture weights, mean vectors, and
covariance matrices. First, a Universal Background
Model (UBM), which models crying in general, is built from a
large number of recordings of all types of crying using the
Expectation-Maximization (EM) algorithm. The class GMMs are
then created by adapting the UBM to the recordings of each
specific type of cry using the Maximum A-Posteriori (MAP)
adaptation algorithm.
During the evaluation, each recording is scored with each
of the GMMs. The cry class whose GMM reaches the highest
likelihood score is chosen as the label for the given recording.
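The two stages above can be sketched as follows, with synthetic feature frames standing in for real MFCCs. The class names, component count, and relevance factor are illustrative assumptions, not values from the paper (which uses the MATLAB MSR toolkit), and only the means are MAP-adapted, a common simplification:

```python
# Sketch of the GMM-UBM pipeline: train a UBM on pooled data,
# MAP-adapt its means per class, score a test recording.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic "MFCC" frames for two hypothetical cry classes.
train = {
    "hunger": rng.normal(0.0, 1.0, size=(500, 13)),
    "pain":   rng.normal(1.5, 1.0, size=(500, 13)),
}

# 1) Train the UBM on pooled data from all classes (EM algorithm).
ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
ubm.fit(np.vstack(list(train.values())))

# 2) Mean-only MAP adaptation with relevance factor r; the weights
#    and covariances are kept at their UBM values.
def map_adapt_means(ubm, X, r=16.0):
    post = ubm.predict_proba(X)                       # frame posteriors (T, C)
    n = post.sum(axis=0)                              # soft counts per component
    ex = post.T @ X / np.maximum(n, 1e-10)[:, None]   # per-component data means
    alpha = (n / (n + r))[:, None]                    # adaptation coefficients
    return alpha * ex + (1 - alpha) * ubm.means_

class_models = {}
for cry_type, X in train.items():
    gmm = GaussianMixture(n_components=8, covariance_type="diag")
    # Reuse the UBM parameters, replacing only the means.
    gmm.weights_ = ubm.weights_
    gmm.covariances_ = ubm.covariances_
    gmm.precisions_cholesky_ = ubm.precisions_cholesky_
    gmm.means_ = map_adapt_means(ubm, X)
    class_models[cry_type] = gmm

# 3) Evaluation: the class whose GMM scores highest labels the recording.
test_recording = rng.normal(1.5, 1.0, size=(200, 13))  # "pain"-like frames
scores = {c: m.score(test_recording) for c, m in class_models.items()}
print(max(scores, key=scores.get))
```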
B. The I-vectors Framework
The i-vectors framework was also first introduced
in the context of speaker identification [9]. The method is
also known as the total variability modeling method.
Fig. 1. Example of colic baby cry
TABLE I. DATABASE STATISTICS PER CRY TYPE
Cry type Code # babies # cries Duration
Colic C1 7 225 3 min 26 s
Discomfort C2 77 2210 37 min 13 s
Eructation C3 11 505 5 min 36 s
Hunger C4 92 5536 86 min 42 s
Pain C5 104 4404 72 min 9 s
Tiredness C6 1 34 24 s
Total 12914 205 min 30 s
The total variability matrix (TVM) contains on its columns the
mean vectors of the GMMs. Each column is a GMM supervector
created from one audio file in the training set, so initially the
number of columns in the TVM is identical to the number of
audio files used in training.
The i-vectors framework also involves two stages: a
training stage and an evaluation stage. In the training stage, the
UBM is estimated, the TVM is constructed, the i-vectors are
extracted for each type of cry, and a Probabilistic Linear
Discriminant Analysis (PLDA) Gaussian model is used to
reduce the dimensionality of the i-vectors.
In the evaluation stage, Cosine Scoring with Within-Class
Covariance Normalization (WCCN) is used to assign
each cry recording to one of the cry classes.
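The scoring step can be sketched on toy vectors as follows. The dimensionality, class names, and data are synthetic assumptions; real i-vectors would first be extracted through the total variability model described above:

```python
# Sketch of cosine scoring with WCCN: whiten by the inverse Cholesky
# factor of the average within-class covariance, then compare by cosine.
import numpy as np

rng = np.random.default_rng(1)
dim = 10

# Toy "i-vectors" for two cry classes (30 training vectors each).
classes = {
    "hunger": rng.normal(0, 1, (30, dim)) + 2.0,
    "pain":   rng.normal(0, 1, (30, dim)) - 2.0,
}

# WCCN projection: de-emphasize directions of within-class variability.
W = np.mean([np.cov(X, rowvar=False) for X in classes.values()], axis=0)
B = np.linalg.cholesky(np.linalg.inv(W))       # WCCN projection matrix

def wccn_cosine(a, b):
    a, b = B.T @ a, B.T @ b
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Score a test i-vector against each class mean; the highest cosine wins.
test_iv = rng.normal(0, 1, dim) - 2.0          # "pain"-like vector
scores = {c: wccn_cosine(test_iv, X.mean(axis=0)) for c, X in classes.items()}
print(max(scores, key=scores.get))
```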
IV. EXPERIMENTS
A. Experimental Setup
The experiments were performed using the Microsoft
Speaker Recognition (MSR) toolkit, which contains
implementations of both the GMM-UBM and i-vectors
techniques, as well as of the feature extraction [10].
For characterizing the type of cry we used 13 Mel-Frequency
Cepstral Coefficients (MFCCs) extracted from
overlapping windows (50% overlap, a setup similar to the one
used for speech and speaker recognition). For the extraction of
the MFCCs with the MSR toolkit we used a Hamming window
of 20 ms.
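This front-end can be sketched in plain NumPy (rather than the MSR toolkit actually used); the number of mel filters (26) and the FFT size are assumptions not stated in the paper:

```python
# Minimal MFCC front-end: 13 coefficients, 20 ms Hamming windows,
# 50% overlap, triangular mel filterbank, log, DCT-II.
import numpy as np

def mfcc(signal, sr=44100, n_mfcc=13, win_ms=20, n_mels=26):
    win = int(sr * win_ms / 1000)          # 20 ms window -> 882 samples
    hop = win // 2                         # 50% overlap
    nfft = 1 << (win - 1).bit_length()     # next power of two
    # Frame the signal and apply a Hamming window.
    n_frames = 1 + (len(signal) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(win)
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft
    # Triangular mel filterbank between 0 Hz and Nyquist.
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((nfft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, nfft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    # DCT-II to decorrelate, keeping the first n_mfcc coefficients.
    n = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (n[None, :] + 0.5) * np.arange(n_mfcc)[:, None])
    return logmel @ dct.T                  # shape (n_frames, n_mfcc)

# One second of synthetic audio in place of a real cry recording.
feats = mfcc(np.random.default_rng(0).normal(size=44100))
print(feats.shape)
```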
B. Experimental Results
1) Experiments using GMM-UBM framework
The database was divided such that the subjects used in the
training stage were not also used in the evaluation stage. The
subjects were chosen so that the number of evaluation
recordings represents 10% of the total.
Even though the tiredness class contains just one subject, we
decided to keep this class in our experiments. For this class
only, we randomly chose 10% of the total number of files
for the evaluation stage, while the remaining 90% were used in
the training stage.
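A subject-disjoint split of this kind can be sketched with scikit-learn's GroupShuffleSplit, so that no baby appears in both partitions; the recording counts, IDs, and labels below are synthetic:

```python
# Subject-disjoint train/evaluation split: roughly 10% of the
# recordings go to evaluation, whole babies at a time.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_recordings = 200
baby_id = rng.integers(0, 40, n_recordings)   # which baby each recording is from
labels = rng.integers(0, 6, n_recordings)     # cry class per recording

splitter = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
train_idx, eval_idx = next(splitter.split(np.zeros(n_recordings), labels,
                                          groups=baby_id))

# No baby is shared between the two partitions.
assert set(baby_id[train_idx]).isdisjoint(baby_id[eval_idx])
print(len(train_idx), len(eval_idx))
```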
We ran the experiment varying the number of Gaussian
probability densities per GMM from 16 to 16384. The highest
overall accuracy (35%) was obtained by the models created
with 2048 and 4096 Gaussian probability densities. A higher
number of Gaussian densities is not useful, as the accuracy
starts to drop. Although the overall accuracy appears to be
low, it should be noted that random guessing would yield
an accuracy of only 16.6%.
Table II illustrates the confusion matrix for the case where
the maximum accuracy was achieved (the model created
with 2048 Gaussian probability densities per GMM). The rows
represent the actual cry classes, while the columns represent
the recognized cry classes. The last column shows the per-class
accuracy, which varies from 0% to 54%.
From the confusion matrix (Table II) we can observe that
the discomfort (C2), hunger (C4), and pain (C5) cries are
confused with each other in large proportion. Moreover, these
classes are the ones trained with the greatest amount of data.
This motivated us to run experiments using only these
three classes; the results are shown further in this section.
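As a sanity check, the per-class and overall accuracies can be recomputed from the raw counts of Table II; the last column of the table appears to be truncated (not rounded) to integers:

```python
# Recompute Table II's per-class accuracy column and the 35% overall
# accuracy (rows: true class, columns: predicted class).
import numpy as np

cm = np.array([
    [ 7,  13,  0,   1,   2,  2],   # C1 colic
    [18,  46,  9,  54,  70, 12],   # C2 discomfort
    [ 2,   9,  3,  22,   9,  5],   # C3 eructation
    [36, 171, 39, 156, 121, 38],   # C4 hunger
    [11, 115, 12,  47, 242, 15],   # C5 pain
    [ 0,   1,  0,   2,   0,  0],   # C6 tiredness
])
per_class = 100 * np.diag(cm) / cm.sum(axis=1)   # per-class accuracy, %
overall = 100 * np.trace(cm) / cm.sum()          # overall accuracy, %
print(per_class.astype(int))   # truncated, as in the table's last column
print(int(overall))
```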
2) Experiments using i-vectors
In this experiment we tested on the same data set as in the
previous experiment, with the same features, but using the
i-vectors method instead of GMM-UBM. We varied the
number of densities in the GMM from 2 to 256, the size of the
TVM (50, 100, 200, and 250), and the number of TVM
training iterations (5, 10, 15, and 20).
The highest accuracy, 39%, was obtained for the model
created with a UBM of 32 Gaussian probability densities, a
TVM of size 200, and 20 TVM training iterations. The
confusion matrix for this model configuration is presented in
Table III. It shows that the discomfort (C2), hunger (C4), and
pain (C5) cries are still confused with each other in large
proportion.
3) Experiments with 3 classes using GMM-UBM
In this experiment we used only the discomfort, hunger,
and pain cries, which are confused with each other in large
proportion (as shown in the previous experiments). We ran the
experiment varying the number of Gaussian probability
densities per GMM from 2 to 8192. We observe that the
highest accuracy (50.6%) is obtained with the GMM with 4
probability densities. The detailed results are depicted in Fig. 2.
TABLE II. CONFUSION MATRIX FOR THE GMM-UBM EXPERIMENT (SIX CRY CLASSES)
# cries C1 C2 C3 C4 C5 C6 Acc [%]
C1 7 13 0 1 2 2 28
C2 18 46 9 54 70 12 22
C3 2 9 3 22 9 5 6
C4 36 171 39 156 121 38 27
C5 11 115 12 47 242 15 54
C6 0 1 0 2 0 0 0
TABLE III. CONFUSION MATRIX FOR THE I-VECTORS EXPERIMENT (SIX CRY CLASSES)
# cries C1 C2 C3 C4 C5 C6 Acc [%]
C1 11 7 2 1 3 1 44
C2 49 7 37 52 58 6 3
C3 10 1 10 18 8 3 6
C4 148 25 74 182 112 20 32
C5 58 27 23 27 295 12 66
C6 0 1 0 1 0 1 0
Fig. 2. Detection accuracy using GMM-UBM method (for 3 cry classes)
4) Experiments with three classes using i-vectors
In this experiment we used the same setup as in the
previous experiment, but with the i-vectors framework instead
of GMM-UBM. We ran the experiments varying the number of
Gaussian probability densities from 2 up to 512; we also varied
the size of the TVM (5, 10, 50, 100, 200, 250, 300, 400, 500,
and 600) and the number of TVM training iterations (5, 10, 15,
and 20).
We obtained the highest accuracy (58%) for the model
trained with a TVM of size 100, 20 TVM training iterations,
and a UBM created with 256 Gaussian probability densities.
The confusion matrix and the confusion matrix in percentages
for this best-performing configuration are shown in Table IV
and Table V, respectively.
As seen in Table V, only 5.7% of the discomfort cries are
correctly classified; the rest are confused with hunger (51.2%)
and pain (43.1%) in roughly equal proportion. Hunger cries are
confused in a large proportion (26.6%) with pain cries. We also
noticed from this experiment that the pain cries are generally
correctly classified, in a proportion of approximately 70%.
TABLE IV. CONFUSION MATRIX FOR THE I-VECTORS EXPERIMENT (THREE CRY CLASSES)
# cries Discomfort (C2) Hunger (C4) Pain (C5)
Discomfort (C2) 12 107 90
Hunger (C4) 31 381 149
Pain (C5) 15 116 311
TABLE V. CONFUSION MATRIX IN PERCENTAGES FOR THE I-VECTORS EXPERIMENT (THREE CRY CLASSES)
% cries Discomfort (C2) Hunger (C4) Pain (C5)
Discomfort (C2) 5.7 51.2 43.1
Hunger (C4) 5.5 67.9 26.6
Pain (C5) 3.4 26.2 70.4
V. CONCLUSIONS
Motivated by the fact that skilled persons such as
neonatologists can distinguish between different types of baby
cries, finding patterns in each type, we developed a fully
automatic system that attempts to discriminate between
different types of cries. The system uses the GMM-UBM and
i-vectors frameworks for classification and was evaluated on a
real-world audio database comprising about 13000 cries of six
types uttered by more than 120 babies.
The results obtained using six types of cries (colic,
discomfort, eructation, hunger, pain, tiredness) with the
GMM-UBM and i-vectors frameworks are comparable: the
accuracy obtained with the i-vectors framework (39%) is
slightly higher than with the GMM-UBM framework (35%).
These results show that the classes are not disjoint. One
possible reason is the choice of features, i.e. the MFCCs,
which suggests trying other features in the future. Another
reason can be the inherent overlap of the classes, given that
different needs may lead to similar muscle contractions and
thus similar sounds produced by the babies. Yet another
reason can be the multitude of information contained in a
cry: besides the information regarding the need, a cry also
carries information about the identity of the baby and even
some background sounds. The last two act as noise and
contribute to increasing the confusion.
We decided to run some experiments with only the
discomfort, hunger, and pain cries, because we noticed that
these types of cries are confused with each other in large
proportion and seem to dominate the recognition of the other
classes too. The experimental results showed that recognition
with only these three classes reaches a higher accuracy
(50.6% for the GMM-UBM method and 58.0% for the i-vectors
method). Even though this was expected, because the
uncertainty is reduced (3 classes instead of 6), the large
increase in accuracy also suggests that a greater amount of
data is needed to properly train the remaining classes.
We observed that only 5.7% of the discomfort cries are
correctly classified; the rest are confused in roughly equal
proportion with hunger (51.2%) and pain (43.1%). Hunger
cries are confused in a large proportion (26.6%) with pain
cries, which are themselves correctly classified in a proportion
of approximately 70%. One possible explanation is that
discomfort is a vaguer class and may contain cries produced by
needs similar (in terms of muscle contraction) to those of
hunger and pain.
The results obtained on the SPLANN database are lower
than those obtained on the Dunstan Baby Language (DBL)
database (81.8% overall accuracy) [11]. A possible explanation
is that the DBL database is much smaller: it contains only 83
cries from 5 classes (discomfort, eructation, hunger,
flatulence, tiredness) uttered by only 39 babies, and the total
duration of the cries is 430 seconds. A second explanation
could be that the DBL recordings were carefully selected to be
representative for the different types of cry (they were
extracted from the DBL video tutorial).
Future work involves an in-depth analysis of the errors
obtained in these experiments and the evaluation of other
classification methods and audio features (including time-domain
features, such as the duration of the expiratory component,
the time between two cries, etc.).
ACKNOWLEDGMENT
This work was partly supported by the PN II Programme
“Partnerships in priority areas” of MEN - UEFISCDI, through
project no. 25/2014.
REFERENCES
[1] E. Drumond and M.L. McBride, “The development of mother's
understanding of infant crying,” in Clinical Nursing Research, vol. 2.
[2] O.F. Reyes-Galaviz and C.A. Reyes-Garcia, “A system for the
processing of infant cry to recognize pathologies in recently born babies
with neural networks,” Proc. of the 9th Conference Speech and
Computer SPECOM 2004, St. Petersburg, Russia, pp. 552-557, 2004.
[3] P.S. Zeskind and B.M. Lester, “Analysis of infant crying,” Chapter 8 in
Biobehavioral Assessment of the Infant, L.T. Singer and P.S. Zeskind,
Eds. New York: Guilford Publications Inc., 2001, pp. 149-166.
[4] K. Arakawa, “System and method for analyzing baby cries,” US Patent
6,496,115 B2, Google Patents, 2002.
[5] Y. Abdulaziz and S.M.S. Ahmad, “Infant cry recognition system: a
comparison of system performance based on Mel frequency and linear
prediction cepstral coefficients,” Proc. of Int. Conf. on Information
Retrieval and Knowledge Management, Selangor, pp. 260-263, 2010.
[6] M.S. Rusu, S.S. Diaconescu, G. Sardescu, and E. Bratila, “Database and
system design for data collection of crying related to infant's needs and
diseases,” Proc. of the 8th Int. Conf. on Speech Technology and Human-
Computer Dialogue SpeD 2015, Bucharest, Romania, pp. 1-6, 2015.
[7] D.A. Reynolds, T.F. Quatieri, and R.B. Dunn, “Speaker verification
using adapted Gaussian mixture models,” in Digital Signal Processing,
vol. 10, nos. 1-3. Academic Press, 2000, pp. 19-41.
[8] N. Dehak, R. Dehak, P. Kenny, N. Brümmer, and P. Ouellet, “Support
vector machines versus fast scoring in the low-dimensional total
variability space for speaker verification,” Proc. of INTERSPEECH
2009, Brighton, United Kingdom, pp. 1559-1562, 2009.
[9] N. Dehak, P.J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-
end factor analysis for speaker verification,” in IEEE Trans. on Audio,
Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.
[10] S.O. Sadjadi, M. Slaney, and L. Heck, “MSR identity toolbox v1.0: a
MATLAB toolbox for speaker-recognition research,” in Speech and
Language Processing Technical Committee Newsletter, IEEE, 2013.
[11] I.-A. Bănică, H. Cucu, A. Buzo, D. Burileanu, and C. Burileanu,
“Automatic methods for infant cry classification,” Proc. of Int. Conf. on
Communications COMM 2016, Bucharest, Romania, 2016.