Baby Cry Detection: Deep Learning and Classical Approaches

In this chapter, we compare deep learning and classical approaches for detection of baby cry sounds in various domestic environments under challenging signal-to-noise ratio conditions. Automatic cry detection has applications in commercial products (such as baby remote monitors) as well as in medical and psycho-social research. We design and evaluate several convolutional neural network (CNN) architectures for baby cry detection, and compare their performance to that of classical machine-learning approaches, such as logistic regression and support vector machines. In addition to feed-forward CNNs, we analyze the performance of recurrent neural network (RNN) architectures, which are able to capture temporal behavior of acoustic events. We show that by carefully designing CNN architectures with specialized non-symmetric kernels, better results are obtained compared to common CNN architectures.
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/311011518
Baby cry detection in domestic environment using deep learning
Conference Paper · November 2016
DOI: 10.1109/ICSEE.2016.7806117
Electronic copy available at: https://ssrn.com/abstract=2877132
2016 ICSEE International Conference on the Science of Electrical Engineering
Baby Cry Detection in Domestic Environment using
Deep Learning
Yizhar Lavner, Rami Cohen, Dima Ruinskiy∗‡ and Hans IJzerman§
Dept. of Computer Science, Tel-Hai College, Upper Galilee, Israel
Andrew and Erna Viterbi Faculty of Electrical Engineering, Technion – Israel Institute of Technology, Haifa, Israel
Intel Corporation, Haifa, Israel
§Dept. of Clinical Psychology, Vrije Universiteit, Amsterdam, the Netherlands
Email: yizhar.lavner@gmail.com, rc@campus.technion.ac.il, dima.ruinskiy@intel.com, h.ijzerman@gmail.com
Abstract—Automatic detection of a baby cry in audio sig-
nals is an essential step in applications such as remote baby
monitoring. It is also important for researchers, who study
the relation between baby cry patterns and various health
or developmental parameters. In this paper, we propose two
machine-learning algorithms for automatic detection of baby
cry in audio recordings. The first algorithm is a low-complexity
logistic regression classifier, used as a reference. To train this
classifier, we extract features such as Mel-frequency cepstrum
coefficients, pitch and formants from the recordings. The second
algorithm uses a dedicated convolutional neural network (CNN),
operating on log Mel-filter bank representation of the recordings.
Performance evaluation of the algorithms is carried out using an
annotated database containing recordings of babies (0-6 months
old) in domestic environments. In addition to baby cry, these
recordings contain various types of domestic sounds, such as
parents talking and door opening. The CNN classifier is shown
to yield considerably better results compared to the logistic
regression classifier, demonstrating the power of deep learning
when applied to audio processing.
I. INTRODUCTION
Automatic detection and classification of acoustic events in
audio signals is a challenging research area in auditory ma-
chine perception [1], related to computational auditory scene
analysis [2]. Due to the vast amount of acoustic data collected
and accumulated in recent years, manual annotation of the data
is impractical. This raises the need for developing reliable and
efficient algorithms for automatic detection and classification
of acoustic events. Such algorithms are a pre-requisite for
automatic recognition and labeling of audio content.
In this study, we focus on the detection and classification of
baby cry sounds in various domestic environments. Crying is
one of the major means of infants to communicate distress
and attachment needs (such as being hungry or cold) to
their caregivers [3]. One common application of automatic
cry detection is a baby remote monitor, where parents are
alerted if their baby is crying. Another important application is
enablement of non-intrusive psychological research of infants
and their caregivers in the earliest days of life. The bond
between caregiver and infant is formed through physiological
co-regulation processes that take place throughout the day [4].
Thus, monitoring often needs to be conducted over many hours
or days to collect meaningful data. The sheer volume of data
and the difficulty to decide a priori which variables to target
make precise measurements and classification of acoustic
events necessary to understand the co-regulation patterns.
A baby cry is elicited during rhythmical transitions between
inhalation and exhalation, through a vibration of the vocal cords
that produces periodic air pulses. The repetition rate of these pulses
is called the fundamental frequency (pitch), and its typical values
in healthy babies are 250–600 Hz. The cry signal is shaped
by the vocal tract, leading to resonant frequencies termed
formants. The first two formants typically occur around 1100
Hz and 3300 Hz, respectively [3]. The detection of cry signals
is usually carried out by extracting distinguishing features
from segments of the audio signal. Apart from pitch and
formants, these include temporal and spectral features such as
short-time energy, Mel-frequency cepstrum (MFC) coefficients
and others [5]–[7].
In this work, we implement two methods for the detection
of cry signals in audio recordings: a low-complexity classi-
fier based on logistic regression and a convolutional neural
network (CNN) classifier, and compare their performance.
II. METHODS
A. Database
The database for this study contains several tens of hours of
audio recordings made by parents of babies in the Netherlands.
The babies were in their first 6 months of life, and were
recorded 24/7 in a domestic environment. The recordings contain
various types of sounds, such as crying, parents talking, door
opening, etc. The database was collected as part of a pilot study
aimed at investigating “attachment formation” (forming the bond
between caregiver and child) [8]. Three hours of the recordings
were fully annotated, down to the millisecond level, with about
50 different event types. The sampling frequency of the
recordings is Fs = 44100 Hz.
B. Preprocessing and feature extraction
The audio recordings are divided into consecutive overlapping
segments of 4096 samples (about 93 ms) with an overlap of 50%.
These segments are further divided into frames of 16 ms with a
step size of 8 ms. A pitch detector [9] is applied to each frame,
using peaks in the cepstral domain for rough pitch value
estimation, and cross-correlation in the time domain for
refinement of the initial pitch value. Possible pitch period
durations are restricted to the range of 1.6–3.3 ms, in line with
the expected baby cry pitch period.
978-1-5090-2152-9/16/$31.00 © 2016 IEEE
Fig. 1: An example of a baby cry signal. Top: the signal
waveform. Bottom: the signal spectrogram, demonstrating the
harmonic structure of the cry signal.
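The rough cepstral stage of the pitch detector described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `cepstral_pitch`, the FFT length of 2048 and the absence of a window function are assumptions made for brevity, and the time-domain cross-correlation refinement is omitted. Only the sampling rate and the 1.6–3.3 ms search range are taken from the text.

```python
import numpy as np

def cepstral_pitch(frame, fs=44100, min_period_s=1.6e-3,
                   max_period_s=3.3e-3, n_fft=2048):
    """Rough pitch estimate (Hz) from the peak of the real cepstrum.

    Hypothetical helper: only the search range and sampling rate
    follow the text; the remaining parameters are assumptions.
    """
    frame = np.asarray(frame, dtype=float)
    # Log-magnitude spectrum (small offset avoids log(0)).
    log_mag = np.log(np.abs(np.fft.rfft(frame, n=n_fft)) + 1e-10)
    # Real cepstrum: inverse transform of the log-magnitude spectrum.
    cepstrum = np.fft.irfft(log_mag, n=n_fft)
    # Restrict the search to the allowed pitch periods, in samples.
    q_min = int(np.ceil(min_period_s * fs))
    q_max = int(np.floor(max_period_s * fs))
    peak = q_min + int(np.argmax(cepstrum[q_min:q_max + 1]))
    return fs / peak

# Usage on a synthetic 16 ms frame (705 samples) with a 110-sample period.
n = np.arange(705)
synthetic_frame = 0.9 ** (n % 110)
print(cepstral_pitch(synthetic_frame))
```

A frame without a clear cepstral peak would still return some value in the allowed range, so a practical detector would also need a voicing decision, which this sketch does not attempt.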
The following features are computed for each audio seg-
ment. The reader is referred to [7] and [10] for details.
1) 38 Mel-Frequency Cepstrum coefficients (MFCC).
2) Short-time energy (STE).
3) Zero-crossing rate (ZCR).
4) Pitch median value within a segment.
5) Run-length of pitch, defined as the number of consec-
utive voiced frames within a segment where pitch was
detected.
6) Harmonicity factor (HF).
7) Harmonic-to-average power ratio (HAPR).
8) First formant, based on the line-spectral pair represen-
tation.
9) Band energy ratio, defined as the ratio (in dB) between
the spectral energy in the frequency bands [0, 3.5] kHz
and [3.5, 22.5] kHz.
10) Spectral rolloff point: the frequency below which 75%
of the spectral energy is concentrated.
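A few of the simpler features in the list above can be sketched as follows. This is an illustrative example rather than the feature extractor used in the paper: the function name `segment_features`, the use of the mean (rather than the sum) for short-time energy, and the treatment of the upper band as extending to the Nyquist frequency are assumptions. The 3.5 kHz band split and the 75% rolloff fraction come from the text.

```python
import numpy as np

def segment_features(x, fs=44100, split_hz=3500.0, rolloff_fraction=0.75):
    """Compute STE, ZCR, band energy ratio and spectral rolloff.

    Hypothetical helper; exact definitions in the paper may differ.
    """
    x = np.asarray(x, dtype=float)
    # Short-time energy: mean squared amplitude of the segment.
    ste = np.mean(x ** 2)
    # Zero-crossing rate: fraction of adjacent sample pairs with a sign change.
    zcr = np.mean(x[:-1] * x[1:] < 0)
    # Power spectrum of the segment and its frequency axis.
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    # Band energy ratio (dB): energy below the split vs. energy above it.
    low = power[freqs < split_hz].sum()
    high = power[freqs >= split_hz].sum()
    band_ratio_db = 10.0 * np.log10((low + 1e-12) / (high + 1e-12))
    # Spectral rolloff: lowest frequency holding the given energy fraction.
    cumulative = np.cumsum(power)
    idx = np.searchsorted(cumulative, rolloff_fraction * cumulative[-1])
    return {"ste": ste, "zcr": zcr,
            "band_ratio_db": band_ratio_db, "rolloff_hz": freqs[idx]}

# Usage: a 1 kHz tone over one 4096-sample segment.
t = np.arange(4096) / 44100
print(segment_features(np.sin(2 * np.pi * 1000 * t)))
```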
Figure 2 shows an example of the distribution of the 5th
MFC coefficient among baby cry sections (red) vs. all other
sound events (blue) in the training set (about 320 seconds).
The discriminating potential of this feature is evident, although
there is a wide overlapping area.
Fig. 2: A histogram of the 5th MFC coefficient. Red: cry
events, blue: other events.
III. LOGISTIC REGRESSION
The logistic regression classifier [11] is a simple supervised
algorithm, with the advantage of low computational complexity.
The logistic regression is a non-linear hypothesis function
of the form:
$$h_\theta(x) = \frac{1}{1 + \exp(-\theta^T x)}, \tag{1}$$
where $x$ is a $d$-dimensional feature vector and $\theta$ is a weight
vector. In our case, $h_\theta(x) \in (0,1)$ predicts the likelihood of
a segment to be a cry sound (values close to 1), or a different
sound (values close to 0). The decision is made by comparing
$h_\theta(x) \in (0,1)$ to a threshold value, to obtain a final binary
classification $y \in \{0,1\}$, where 1 denotes a cry event. In the
training phase of the classifier, a gradient descent algorithm is
used to find $\theta$ that minimizes the cost function
$$E(\theta) = -\frac{1}{n}\sum_{j=1}^{n} y^{(j)} \log\frac{1}{1 + \exp(-\theta^T x^{(j)})} - \frac{1}{n}\sum_{j=1}^{n} \left(1 - y^{(j)}\right) \log\frac{\exp(-\theta^T x^{(j)})}{1 + \exp(-\theta^T x^{(j)})} + \frac{\lambda}{2n}\sum_{k=1}^{d} \theta_k^2, \tag{2}$$
given a dataset of $n$ labeled samples $\{(x^{(j)}, y^{(j)})\}_{j=1}^{n}$, where $\lambda$
is a regularization parameter. The $\theta$-minimizer found by the
gradient descent algorithm is then assigned to (1) to classify
new unlabeled samples.
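As a worked illustration of the regularized cost function and its minimization by gradient descent, the following minimal sketch trains the classifier on synthetic two-dimensional data. The function names, learning rate, iteration count, regularization value and the synthetic data are all assumptions chosen for the example. No bias term is included, matching the form $\theta^T x$ used in the text.

```python
import numpy as np

def hypothesis(theta, X):
    """Logistic hypothesis 1 / (1 + exp(-theta^T x)) for each row of X."""
    return 1.0 / (1.0 + np.exp(-(X @ theta)))

def cost(theta, X, y, lam):
    """Regularized cross-entropy cost, written in a numerically stable form.

    -log(h) equals log(1 + exp(-z)) and -log(1 - h) equals log(1 + exp(z)).
    """
    z = X @ theta
    data_term = np.mean(y * np.logaddexp(0.0, -z)
                        + (1.0 - y) * np.logaddexp(0.0, z))
    return data_term + lam / (2.0 * len(y)) * np.sum(theta ** 2)

def fit(X, y, lam=1.0, lr=0.5, n_iters=200):
    """Batch gradient descent on the regularized cost (assumed settings)."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        grad = X.T @ (hypothesis(theta, X) - y) / n + lam / n * theta
        theta -= lr * grad
    return theta

# Usage on synthetic data: two Gaussian clusters standing in for cry / non-cry.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 1.0, (100, 2)), rng.normal(-2.0, 1.0, (100, 2))])
y = np.concatenate([np.ones(100), np.zeros(100)])
theta = fit(X, y)
print(cost(theta, X, y, 1.0), np.mean((hypothesis(theta, X) > 0.5) == (y == 1)))
```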
A. Detection procedure
A schematic block diagram of the logistic-regression-based
algorithm is depicted in Figure 3. The input data is divided
into consecutive segments of 4096 samples. For each segment
a 50-dimensional feature vector is computed. The trained
regularized logistic regression is then applied to each feature
vector, and the hypothesis function $h_\theta(x)$ is obtained,
representing an estimate of the posterior probability $p(y|x)$,
where $y \in \{0,1\}$ is the sound event to be classified as cry or
non-cry and $x$ is the feature vector. Using a threshold value
$Th_1$, an initial decision value for each segment is set according
to the following rule:
$$d(n) = \begin{cases} 1, & \text{if } h_\theta(x) > Th_1 \\ 0, & \text{otherwise.} \end{cases} \tag{3}$$
Fig. 3: A schematic block diagram of the logistic regression
algorithm
The duration of a single segment is about 93 ms, while most
cry events are at least several hundred milliseconds long.
In order to avoid erroneous detections of sections that are
too short to be a likely cry event, a smoothing operation
is applied to the sequence of initial decisions as follows: a
sliding window of length $L$ is applied to the initial sequence
of decisions and the smoothed decision $d_s(n)$ for the central
segment is updated according to the following rule:
$$d_s(n) = \begin{cases} 1, & \text{if } \sum_{k=-M}^{M} d(n-k) > Th_2 \\ 0, & \text{otherwise,} \end{cases} \tag{4}$$
where $L$ is odd, $M = (L-1)/2$ and $Th_2 \in [1, L]$ is a
predefined threshold value.
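The two decision rules can be sketched in a few lines. The function name `smooth_decisions`, the example scores and the chosen values of the window length and thresholds are hypothetical. Segments beyond the ends of the sequence are treated as non-cry (zero padding), which is an assumption the text does not address.

```python
import numpy as np

def smooth_decisions(h, th1, window_len, th2):
    """Apply the initial threshold rule, then sliding-window smoothing.

    h          : per-segment classifier outputs in (0, 1)
    th1        : threshold for the initial decisions
    window_len : odd window length L
    th2        : minimum count (exclusive) of positive decisions in the window
    """
    if window_len % 2 == 0:
        raise ValueError("window_len must be odd")
    d = (np.asarray(h) > th1).astype(int)
    # Sum of d(n - k) for k = -M..M, with zeros assumed outside the sequence.
    counts = np.convolve(d, np.ones(window_len, dtype=int), mode="same")
    return d, (counts > th2).astype(int)

# Usage: the isolated detection in the first segment is removed.
scores = [0.9, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.1, 0.1]
initial, smoothed = smooth_decisions(scores, th1=0.5, window_len=3, th2=1)
print(initial)   # [1 0 0 0 1 1 1 0 0]
print(smoothed)  # [0 0 0 0 1 1 1 0 0]
```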
IV. CONVOLUTIONAL NEURAL NETWORKS
Convolutional neural networks (CNN) [11] have wide applications
in the fields of computer vision, natural language processing
and many others, especially where huge amounts of data have to
be processed and classified. Like ordinary neural networks, they
consist of several layers of neurons connected by learnable
weights. Each CNN layer is composed of several filters, applied
to the outputs of the previous layer using the convolution
operation. CNNs learn the filters during the training process,
which can be thought of as a way to generate important features
out of the data. Thus, in contrast to traditional classification
algorithms, CNNs do not depend on prior knowledge for feature
design, which is a major advantage.
Fig. 4: An LMFB representation of a cry frame. Note that
the non-uniform gaps in the frequency axis are due to the
logarithmic Mel scale
To work with a CNN classifier, the audio signal is divided
into consecutive segments of 4096 samples. Each segment is
further divided into frames of 512 samples, with a step size of
128 samples. As the contribution of high frequency bands to
the detection of cry signals is limited, a low-pass filter at 11025
Hz is applied. A log Mel-filter bank (LMFB) representation
is then produced for each frame, using 40 filters distributed
according to the Mel scale in the frequency range [0,11025]
Hz. Given segments of 4096 samples, frames of 512 samples and a
step size of 128 samples, each segment yields
(4096 − 512)/128 + 1 = 29 frames, which leads to a 40×29 “image”
representation of each segment. An example is shown in Figure 4.
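The framing and filter-bank steps can be sketched with NumPy alone. This is an assumption-laden illustration, not the authors' code: the function names, the Hann window, the triangular filter shape and the common 2595·log10(1 + f/700) Mel formula are choices made for the example. The low-pass step is represented only implicitly, by restricting the filters to [0, 11025] Hz.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs, f_max):
    """Triangular filters with centers equally spaced on the Mel scale."""
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(f_max), n_filters + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)[None, :]
    lo, ce, hi = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (freqs - lo) / (ce - lo)
    falling = (hi - freqs) / (hi - ce)
    # Each row is one filter evaluated at the FFT bin frequencies.
    return np.clip(np.minimum(rising, falling), 0.0, None)

def lmfb_image(segment, fs=44100, frame_len=512, hop=128,
               n_filters=40, f_max=11025.0):
    """Return the (n_filters x n_frames) log Mel-filter bank 'image'."""
    segment = np.asarray(segment, dtype=float)
    n_frames = 1 + (len(segment) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = segment[idx] * np.hanning(frame_len)   # assumed window
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    fb = mel_filterbank(n_filters, frame_len, fs, f_max)
    return np.log(fb @ power.T + 1e-10)

# Usage: one 4096-sample segment gives a 40 x 29 representation.
rng = np.random.default_rng(0)
print(lmfb_image(rng.normal(size=4096)).shape)
```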
The main difference between LMFB and MFCC is that the
discrete cosine transform (DCT) of the log-power spectrum is
skipped in LMFB representation. This is mainly due to the
tendency of the DCT to decorrelate the data, whereas spatial
correlation of the input is actually advantageous for a CNN.
The distinctive features of a signal within a frame are mostly
due to frequency changes. Thus, our CNN uses convolution
layers with “tall” filters of sizes 10×2, 6×2 and 3×2, which
extend further along the frequency axis than along the time
axis and therefore emphasize frequency structure over temporal
structure. Due to the Mel scale, each “pixel” in the LMFB
represents a frequency range. To better capture the frequency
behavior, small stride values are used. Similarly, the
max-pooling layers consist of small blocks, to emphasize the
content of correlated frequency bands. The activation function
for the CNN is the standard rectifier, corresponding to rectified
linear unit (ReLU) layers. The entire CNN architecture is shown
in Figure 5.
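To make the effect of a “tall” filter concrete, the sketch below applies a single 10×2 filter followed by a ReLU to a 40×29 input and reports the output size. It is a toy illustration only: the function name, the stride of 1, the absence of padding and the random filter weights are assumptions, and the actual layer configuration is the one given in Figure 5.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_valid_relu(image, kernel, stride=(1, 1)):
    """Single-channel 'valid' cross-correlation followed by a ReLU."""
    # All kernel-sized patches, subsampled according to the stride.
    windows = sliding_window_view(image, kernel.shape)[::stride[0], ::stride[1]]
    out = np.einsum("ijkl,kl->ij", windows, kernel)
    return np.maximum(out, 0.0)

# Usage: a 10x2 filter on a 40x29 LMFB-sized input.
rng = np.random.default_rng(0)
image = rng.normal(size=(40, 29))    # stand-in for an LMFB "image"
kernel = rng.normal(size=(10, 2))    # random weights, for illustration only
print(conv2d_valid_relu(image, kernel).shape)  # (31, 28)
```

With unit stride the filter covers 10 of the 40 frequency bands at once but only 2 of the 29 frames, so the output keeps nearly the full time axis (28 positions) while mixing information across neighboring bands.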
Fig. 5: The CNN architecture.
To train the network, we employed a stochastic gradient
descent algorithm with momentum [12]. The gradient in each
iteration was evaluated using a mini-batch of 256 frames, over
50 training iterations. A visualization of the filters obtained for
the first convolution layer after the training phase is provided
in Figure 6. It is evident that at this initial layer the filters
capture mostly basic image features such as edges, which
correspond to fast transitions in LMFB values.
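The text does not spell out the update rule, so the following is only a sketch of mini-batch stochastic gradient descent in its classical momentum form, applied to a toy objective. The function name `sgd_momentum`, the learning rate, the momentum coefficient and the toy gradient are assumptions. The mini-batch size of 256 and the default of 50 iterations mirror the values quoted above.

```python
import numpy as np

def sgd_momentum(grad_fn, theta0, data, lr=0.05, momentum=0.9,
                 batch_size=256, n_iters=50, seed=0):
    """Mini-batch SGD with classical momentum (assumed form).

    grad_fn(theta, batch) must return the gradient on the mini-batch.
    """
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    velocity = np.zeros_like(theta)
    for _ in range(n_iters):
        size = min(batch_size, len(data))
        batch = data[rng.choice(len(data), size=size, replace=False)]
        # The velocity accumulates a decaying sum of past gradients.
        velocity = momentum * velocity - lr * grad_fn(theta, batch)
        theta = theta + velocity
    return theta

# Toy usage: minimize the mean squared distance to a cloud of points,
# whose minimizer is the sample mean.
rng = np.random.default_rng(1)
points = rng.normal(loc=[3.0, -1.0], scale=1.0, size=(1000, 2))
grad = lambda theta, batch: 2.0 * (theta - batch.mean(axis=0))
print(sgd_momentum(grad, np.zeros(2), points, n_iters=200))
```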
V. PERFORMANCE EVALUATION
Two important measures for the evaluation are the detection
rate and the false-positive rate. The detection rate (also known
as sensitivity or recall) is defined as the ratio between the
number of true-positive events, i.e. the number of cry events
correctly identified, and the total number of cry events in
the recording set (true positives and false negatives). The
false-positive (or false-alarm) rate is defined as the ratio
between the number of false positives (non-cry events identified
erroneously as cry events) and the total number of non-cry
events in the recording set (including true negatives).
Thus, if TP, TN, FP and FN denote true positives, true
negatives, false positives and false negatives, respectively, then
the detection rate is TP/(TP + FN), and the false-positive rate
is FP/(FP + TN).
Fig. 6: First convolutional layer filter weights (120 filters, each
of dimensions 10×2).
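The two measures translate directly into code. The function name and the example labels below are hypothetical; the formulas are the ones given above.

```python
import numpy as np

def detection_and_false_positive_rate(y_true, y_pred):
    """Return (TP / (TP + FN), FP / (FP + TN)) for binary labels (1 = cry)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    tn = np.sum(~y_true & ~y_pred)
    return tp / (tp + fn), fp / (fp + tn)

# Usage: 3 of 4 cry events detected, 1 of 4 non-cry events flagged.
truth = [1, 1, 1, 0, 0, 0, 0, 1]
preds = [1, 0, 1, 0, 1, 0, 0, 1]
print(detection_and_false_positive_rate(truth, preds))  # (0.75, 0.25)
```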
One of the goals of the current study is to construct a plat-
form for conducting psychological research on co-regulatory
patterns between a baby and its caregiver, with cry events
being a primary variable as a predictor of attachment. Thus,
the importance of obtaining a high detection rate is obvious.
However, a low false-positive rate is perhaps even more im-
portant, in order to avoid the contamination of data with non-
related events, which may prevent meaningful conclusions.
Therefore, in the analysis of the cry-detection performance
of the logistic regression and the CNN classifiers we focus on
the trade-off between the false-positive rate and the detection
rate. The performance evaluation was carried out using a
receiver operating characteristic (ROC) curve, as shown in
Figure 7. Both classifiers were trained using a similar training
set of 18000 frames (about 30 minutes), and the ROC curves
were obtained using a validation set of two hours. For
false-positive rates lower than 5%, the CNN classifier evidently
outperforms the logistic regression classifier.
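An ROC curve of this kind is obtained by sweeping the decision threshold over the classifier scores. The sketch below shows one way to do this and to read off the detection rate at a fixed false-positive rate. The function names and example scores are hypothetical, and tied scores are not treated specially.

```python
import numpy as np

def roc_curve_points(scores, labels):
    """Return (fpr, tpr) arrays obtained by lowering the threshold step by step."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores, kind="stable")   # highest score first
    sorted_labels = labels[order]
    tp = np.cumsum(sorted_labels)                # true positives so far
    fp = np.cumsum(1 - sorted_labels)            # false positives so far
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    tpr = np.concatenate(([0.0], tp / n_pos))
    fpr = np.concatenate(([0.0], fp / n_neg))
    return fpr, tpr

def detection_rate_at_fpr(scores, labels, max_fpr):
    """Best detection rate among operating points with fpr <= max_fpr."""
    fpr, tpr = roc_curve_points(scores, labels)
    return tpr[fpr <= max_fpr].max()

# Usage with made-up scores (1 = cry).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
print(detection_rate_at_fpr(scores, labels, max_fpr=0.0))
```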
The evaluation results for both classifiers are summarized in
Table I. For detection rates of 80%, 85% and 90%, the
false-positive rates of the CNN classifier are lower than the
corresponding rates of the logistic regression classifier,
although the difference at 90% is marginal (4.2% vs. 4.3%).
At a detection rate of 95% the order is reversed, with the
logistic regression classifier giving the lower false-positive
rate (9.0% vs. 12.0%). Keeping the false-positive rate at a fixed
value of 1.0%, the CNN classifier yields a detection rate of
82.5%, versus 81.0% and 65.0% for the logistic regression
classifier with and without the smoothing procedure, respectively.
Fig. 7: ROC curves for the logistic regression and the CNN
classifiers.
Classifier             Detection rate
                       80%     85%     90%     95%
Logistic regression    0.9%    2.1%    4.3%    9.0%
CNN                    0.7%    1.6%    4.2%    12.0%
TABLE I: A summary of the false-positive rates for a given
detection rate among the two classifiers.
VI. CONCLUSIONS
In this work, two machine-learning algorithms were pro-
posed for the detection of baby cry in audio recordings:
a logistic regression classifier and a more complex CNN
classifier. The results show an advantage of the CNN classifier
over the logistic regression classifier, most clearly at
false-positive rates below 5%.
As CNNs are naturally suited for large training datasets and
for multi-class classification, we plan to train a CNN classifier
to detect various types of domestic sounds in addition to cry
signals.
ACKNOWLEDGMENT
The authors would like to thank the staff of the Signal and
Image Processing Lab (SIPL), Technion, and the staff of the
Vision and Image Sciences Lab (VISL), Technion, for their
support.
REFERENCES
[1] D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D.
Plumbley, “Detection and classification of acoustic scenes and events,”
IEEE Transactions on Multimedia, vol. 17, no. 10, pp. 1733–1746, Oct
2015.
[2] D. Barchiesi, D. Giannoulis, D. Stowell, and M. D. Plumbley, “Acoustic
scene classification: Classifying environments from the sounds they
produce,” IEEE Signal Processing Magazine, vol. 32, no. 3, pp. 16–
34, May 2015.
[3] L. L. LaGasse, A. R. Neal, and B. M. Lester, “Assessment of infant cry:
Acoustic cry analysis and parental perception,” Mental Retardation and
Developmental Disabilities Research Reviews, vol. 11, no. 1, pp. 83–93,
2005.
[4] L. Beckes, H. IJzerman, and M. Tops, “Toward a radically embodied
neuroscience of attachment and relationships,” Frontiers in Human
Neuroscience, vol. 9, p. 266, 2015.
[5] J. Saraswathy, M. Hariharan, S. Yaacob, and W. Khairunizam, “Au-
tomatic classification of infant cry: A review,” in 2012 International
Conference on Biomedical Engineering (ICoBE), Feb 2012, pp. 543–
548.
[6] G. Varallyay, “The melody of crying,” International Journal of Pediatric
Otorhinolaryngology, vol. 71, no. 11, pp. 1699–1708, Nov. 2007.
[7] R. Cohen and Y. Lavner, “Infant cry analysis and detection,” 2012 IEEE
27th Convention of Electrical and Electronics Engineers in Israel (IEEEI
2012), pp. 2–6, 2012.
[8] H. IJzerman et al., “A theory of social thermoregulation in human
primates,” Frontiers in Psychology, vol. 6, no. 464, 2015.
[9] A. M. Noll, “Cepstrum pitch determination,” The Journal of the Acous-
tical Society of America, vol. 41, no. 2, pp. 293–309, 1967.
[10] X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing: A
Guide to Theory, Algorithm, and System Development. Prentice Hall
PTR, 2001.
[11] C. M. Bishop, Pattern Recognition and Machine Learning. Springer
Science+Business Media, 2006.
[12] K. P. Murphy, Machine Learning: A Probabilistic Perspective. MIT
Press, 2012.
View publication statsView publication stats
... Typical examples include Mel frequency cepstral coefficients (MFCCs) and their varieties, pitch-related features, harmonic features, energy, zero crossing rate, etc. For a review of these approaches, the readers are referred to references [1][2][3]. ...
... In reference [9], it trains an LSTM-with-self-attention model on infant-cry samples automatically detected from the recorded audio through cluster analysis and Hidden Markov Model classification. It has been shown in references [1,3] that the performance of CNN is much better than traditional machine learning. On the other hand, the complexity of the CNN may be high, preventing it be used in common embedded devices, like low-cost IP cameras or tablets. ...
... However, in their comparison, a common recommendation is that CNN outperforms the traditional machine learning methods. For a complete review, please refer to references [1][2][3]. ...
Article
Full-text available
Detection of baby cry is an important part of baby monitoring. Almost all existing methods use supervised SVM, CNN, or their varieties. In this work, we propose to use weakly supervised anomaly detection to detect baby cry. In this weak supervision framework, we only need weak annotation if there is a cry in an audiofile. We design a data mining technique using the pre-trained VGGish feature extractor and an anomaly detection network on long untrimmed audiofiles. The obtained datasets are used to train a delicately designed super lightweight CNN for cry/non-cry classification. This CNN is then used as a feature extractor in an anomaly detection framework to achieve better cry detection performance. Received: 27 November 2023 | Revised: 18 February 2024 | Accepted: 10 May 2024 Conflicts of Interest The authors declare that they have no conflicts of interest to this work. Data Availability Statement The data that support the findings of this study are openly available at 1) https://github.com/giulbia/baby cry detection, reference number [11]; 2) audioSet at https://research.google.com/audioset/dataset/index.html, reference [10]; 3) ESC-50 at https://github.com/karolpiczak/ESC-50, reference [23]; 4) https://github.com/gveres/donateacry-corpus, reference [27]. Author Contribution Statement Weijun Tan: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing - original draft, Writing - review & editing, Visualization, Supervision, Project administration. Qi Yao: Resources, Data curation. Jingfeng Liu: Resources, Data curation, Project administration.
... In 2020 Cohen et al. [12], developed a cry recognition system in commercial products that was comparable to psycho-social and medical research. Different CNNs are more effective than traditional machine-learning classifiers including SVMs and logistic regressions at detecting newborn cry signals. ...
Article
Full-text available
Premature babies scream to make contact with their mothers or other people. Infants communicate via their screams in different ways based on the motivation behind their cries. A considerable amount of work and focus is required these days to preprocess, extract features, and classify audio signals. This research aims to propose a novel Elephant Herding Optimized Deep Convolutional Gated Recurrent Neural Network (EHO-DCGR net) for classifying cry signals from premature babies. Cry signals are first preprocessed to remove distortion caused by short sample times. MFCC (Mel-frequency cepstral coefficient), Power Normalized Cepstral Coefficients (PNCC), BFCC (Bark-frequency cepstral coefficient), and LPCC (Linear Prediction cepstral coefficient) are used to identify abnormal weeping through their prosodic aspects. The Elephant Herding optimization (EHO) algorithm is utilized for choosing the best features from the extracted set to form a fused feature matrix. These characteristics are then used to categorize premature baby cry sounds using the DCGR net. The proposed EHO-DCGR net effectiveness is measured by precision, specificity, recall, and F1-score, accuracy. According to experimental fallouts, the proposed EHO-DCGR net detects baby cry signals with an astounding 98.45% classification accuracy. From the experimental analysis, the EHO-DCGR Net increases the overall accuracy by 12.64%, 3.18%, 9.71% and 3.50% better than MFCC-SVM, DFFNN, SVM-RBF and SGDM respectively.
... Many researches focus on extracting cepstral domain features such as Mel Frequency Cepstral Coefficients (MFCCs) [15,17,18], Linear Frequency Cepstral Coefficients (LFCCs) Short Time Cepstral Coefficients (STCCs) [19], and Bark Frequency Cepstral Coefficients (BFCCs) [20] in audio signal feature extraction and feeding them into machine learning (ML) and deep learning models; for example, in [21], they used MFCC in classifying infant crying causes with 76% accuracy by using ML learning algorithms such as K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and Naive Bayes Classifier (NBC); however, in the study [17], they built binary classification for healthy and pathological infants, achieving 91% by combining the MFCC feature extracted with the Hidden Markov Model (HMM). ...
Article
Full-text available
Early detection of infant pathologies by non-invasive means is a critical aspect of pediatric healthcare. Audio analysis of infant crying has emerged as a promising method to identify various health conditions without direct medical intervention. In this study, we present a cutting-edge machine learning model that employs audio spectrograms and transformer-based algorithms to classify infant crying into distinct pathological categories. Our innovative model bypasses the extensive preprocessing typically associated with audio data by exploiting the self-attention mechanisms of the transformer, thereby preserving the integrity of the audio’s diagnostic features. When benchmarked against established machine learning and deep learning models, our approach demonstrated a remarkable 98.69% accuracy, 98.73% precision, 98.71% recall, and an F1 score of 98.71%, surpassing the performance of both traditional machine learning and convolutional neural network models. This research not only provides a novel diagnostic tool that is scalable and efficient but also opens avenues for improving pediatric care through early and accurate detection of pathologies.
Chapter
This research presents a inclusive study into the growth of a deep education model handling Convolutional Neural Networks (CNN) for the purpose of discriminating differing causes behind baby crying. The study includes the accumulation and study of baby cry visual and audio entertainment transmitted via radio waves samples, including an far-reaching array of visual and audio entertainment transmitted via radio waves limits in the way that Short-Time Fourier Transform (STFT) Mean, Root Mean Square (RMS) Mean, Spectral Centroid (SC) Mean, Spectral Bandwidth (SBAN) Mean, Zero-Crossing Rate (ZCR) Mean, Mel-repetitiveness Cepstral Coefficients (MFCCs) including MFCCs1 to MFCCs13, alongside accumulation of solid and opening-delta MFCCs13. These diverse visual and audio entertainment transmitted via radio waves appearance are working to train the CNN construction, permissive the model to correctly categorize baby cries established different creative determinants.
Article
Infant cry is a crucial indicator that offers valuable insights into their physical and mental conditions, such as hunger and pain. However, the scarcity of infant cry datasets hinders the model's generalization in real-life scenarios. The varying voiceprint characteristics among infants further exacerbate this challenge, deteriorating the model's performance on unseen infants. To this end, we propose a multi-task model for Infant Cry Detection and Reasoning (ICDR). It leverages datasets from two tasks to enrich data diversity and introduces an efficient attention module to achieve inter-task feature supplementarity. To mitigate the impact of subject differences, ICDR introduces an intra-task contrastive mixture of experts (CMoE) module that adaptively allocates experts to reduce subject variance and applies contrastive learning to enhance the representation consistency of samples from different infants in the same state. Extensive cross-subject experiments show that ICDR outperforms the state-of-the-art models in infant cry detection and reasoning, with an improvement of 2-9% in the F1-score. This demonstrates that multi-task learning effectively enhances the model's generalization ability by inter-task attention and intra-task CMoE.
Article
Full-text available
This paper presents a novel sound event detection (SED) system for rare events occurring in an open environment. Wavelet multiresolution analysis (MRA) is used to decompose the input audio clip of 30 seconds into five levels. Wavelet denoising is then applied on the third and fifth levels of MRA to filter out the background. Significant transitions, which may represent the onset of a rare event, are then estimated in these two levels by combining the peak-finding algorithm with the K-medoids clustering algorithm. The small portions of one-second duration, called ‘chunks’ are cropped from the input audio signal corresponding to the estimated locations of the significant transitions. Features from these chunks are extracted by the wavelet scattering network (WSN) and are given as input to a support vector machine (SVM) classifier, which classifies them. The proposed SED framework produces an error rate comparable to the SED systems based on convolutional neural network (CNN) architecture. Also, the proposed algorithm is computationally efficient and lightweight as compared to deep learning models, as it has no learnable parameter. It requires only a single epoch of training, which is 5, 10, 200, and 600 times lesser than the models based on CNNs and deep neural networks (DNNs), CNN with long short-term memory (LSTM) network, convolutional recurrent neural network (CRNN), and CNN respectively. The proposed model neither requires concatenation with previous frames for anomaly detection nor any additional training data creation needed for other comparative deep learning models. It needs to check almost 360 times fewer chunks for the presence of rare events than the other baseline systems used for comparison in this paper. All these characteristics make the proposed system suitable for real-time applications on resource-limited devices.
Article
Full-text available
The aim of this research is to develop a portable, efficient and cost effective automatic infant's cry detector and self-soother with real time monitoring system for employed parents. The cry detection algorithm has developed according to the crying signals and it is segmented using the short time energy function which is used as a voice activity detector to disable the operation of the algorithm when voice activity is not present. The features are extracted using MFCC (Mel Frequency Cepstrum Coefficients) and pitch frequency. Statistical properties are calculated for the extracted features of MFCC and pitch frequency. K-NN (K-Nearest Neighbour) algorithm classifier is used to classify the cry signal. The system can easily identify the infant cry and it is verified using K-NN with accurate results by proposed detection algorithm. The combination of Pitch and MFCC gives more promising approach to cry detection than using only MFCC. The total average accuracy of MATLAB simulation is 80.8335% and on the device accuracy was77.5% for cry detection. Immediate cry detection and self-soothing system helps to increase baby's cognitive development process. This all in one module approach gives great benefits to the first-time parents, adoptive parents, caretakers, researchers or physicians by both economically and scientifically.
Article
The Recurrent Neural Network (RNN) exploits dynamically changing temporal information through its recurrent cycles, so it is well suited to tasks with time-sequence characteristics. However, as the number of layers increases, the vanishing gradient problem arises in the RNN. The Grid Long Short-Term Memory (GridLSTM) recurrent neural network can alleviate this problem by performing its computation along two dimensions, time and depth. In addition, a time-sequence task depends on information both before and after the current moment. In this paper, we propose a method that takes into account context-sensitivity and gradient problems, namely the Bidirectional Grid Long Short-Term Memory (BiGridLSTM) recurrent neural network. This model not only takes advantage of the grid architecture, but also captures information around the current moment. A large number of experiments on the LibriSpeech dataset show that BiGridLSTM is superior to other deep LSTM models and unidirectional LSTM models, and, when compared with GridLSTM, it achieves an improvement of about 26 percent.
Preprint
Psychological science is in a transitional period: Many findings do not replicate and theories appear not as robust as previously presumed. We suspect that a main reason for theories not appearing as robust is because they are too simple. In this paper, we provide an important step towards this transition in the field of interpersonal relationship research by 1) providing an overarching theoretical framework grounded in existing relationship science, and 2) outlining a novel approach - mobile social physiology – that relies on intelligent technologies like wearable sensors, actuators, and modern analytical methods. At the core of our theoretical principles is co-regulation (one partner’s [statistical] co-dependency on the other partner). Co-regulation has long existed in the literature, but has to date been largely untested. To test the outlined principles, we 3) present a newly programmed app – the Bio-App for Bonding (available on GitHub: https://github.com/co-relab/bioapp). By providing a paradigm shift for relationship research, the field can not only increase the accuracy of measurement and the generalizability of findings, it also allows for moving from the lab to real life situations. We discuss how the mobile social physiology approach is rooted in existing theoretical principles (e.g., Social Baseline and Attachment Theory), extends the concept of co-regulation to allow for specific measurements, and provides a research agenda to develop a model of interpersonal relationships that we hope will stand the test of time.
Chapter
Laying foundations for an interdisciplinary approach to the study of evolution in communication systems with tools from evolutionary biology, linguistics, animal behavior, developmental psychology, philosophy, cognitive sciences, robotics, and neural network modeling. The search for origins of communication in a wide variety of species including humans is rapidly becoming a thoroughly interdisciplinary enterprise. In this volume, scientists engaged in the fields of evolutionary biology, linguistics, animal behavior, developmental psychology, philosophy, the cognitive sciences, robotics, and neural network modeling come together to explore a comparative approach to the evolution of communication systems. The comparisons range from parrot talk to squid skin displays, from human language to Aibo the robot dog's language learning, and from monkey babbling to the newborn human infant cry. The authors explore the mysterious circumstances surrounding the emergence of human language, which they propose to be intricately connected with drastic changes in human lifestyle. While it is not yet clear what the physical environmental circumstances were that fostered social changes in the hominid line, the volume offers converging evidence and theory from several lines of research suggesting that language depended upon the restructuring of ancient human social groups. The volume also offers new theoretical treatments of both primitive communication systems and human language, providing new perspectives on how to recognize both their similarities and their differences. Explorations of new technologies in robotics, neural network modeling and pattern recognition offer many opportunities to simulate and evaluate theoretical proposals. 
The North American and European scientists who have contributed to this volume represent a vanguard of thinking about how humanity came to have the capacity for language and how nonhumans provide a background of remarkable capabilities that help clarify the foundations of speech. Bradford Books imprint
Conference Paper
The amount of time an infant cries in a day helps the medical staff in the evaluation of his/her health conditions. Extracting this information requires a cry detection algorithm able to operate in environments with challenging acoustic conditions, since multiple noise sources, such as interfering cries, medical equipment, and people, may be present. This paper proposes an algorithm for detecting infant cries in such environments. The proposed solution is a multi-stage detection algorithm: the first stage is composed of an eight-channel filter-and-sum beamformer, followed by an Optimally Modified Log-Spectral Amplitude estimator (OMLSA) post-filter for reducing the effect of interferences. The second stage is the Deep Neural Network (DNN) based cry detector, which takes audio Log-Mel features as input. A synthetic dataset mimicking a real neonatal hospital scenario has been created for training the network and evaluating the performance. Additionally, a dataset containing cries acquired in a real neonatology department has been used for assessing the performance in a real scenario. The algorithm has been compared to a popular approach for voice activity detection based on Long-Term Spectral Divergence, and the results show that the proposed solution achieves superior detection performance both on synthetic data and on real data.
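The idea behind the first stage can be illustrated with a delay-and-sum beamformer, the simplest special case of filter-and-sum in which each channel's filter is a pure delay. The sketch below is not the authors' beamformer: the integer delays, noise level, and test signal are hypothetical, and the OMLSA post-filter and DNN detector are not shown.

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Align each channel by its integer arrival delay (in samples) and average.

    channels: array-like of shape (n_mics, n_samples)
    delays:   non-negative integer delay per microphone
    """
    channels = np.asarray(channels, dtype=float)
    n_mics, n_samples = channels.shape
    out = np.zeros(n_samples)
    for ch, d in zip(channels, delays):
        # Advance the channel by d samples, zero-padding the tail.
        aligned = np.zeros(n_samples)
        aligned[: n_samples - d] = ch[d:]
        out += aligned
    return out / n_mics

# Synthetic eight-microphone example with hypothetical delays and noise level.
rng = np.random.default_rng(0)
n = 4000
source = np.sin(2 * np.pi * 0.01 * np.arange(n))
delays = [0, 1, 2, 3, 4, 5, 6, 7]
mics = np.array([
    np.concatenate([np.zeros(d), source[: n - d]]) + 0.5 * rng.standard_normal(n)
    for d in delays
])
beam = delay_and_sum(mics, delays)

# Compare residual noise on the samples where every channel is aligned.
valid = n - max(delays)
rms_single = np.sqrt(np.mean((mics[0][:valid] - source[:valid]) ** 2))
rms_beam = np.sqrt(np.mean((beam[:valid] - source[:valid]) ** 2))
```

With independent noise on each microphone, averaging the aligned channels leaves the source intact while reducing the residual noise, so `rms_beam` comes out well below `rms_single`.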
Article
Automatic infant cry classification is an important area of study within biomedical engineering, applying medical and engineering techniques to classify the diverse physical and physiological conditions of infants from their cry signals. Many studies have been carried out and published, broadening the potential applications of cry analysis. As yet, however, no definitive literature review based on a longitudinal study has documented the trend in automatic classification of infant cry. A literature search was performed using the keywords “infant cry” AND “automatic classification” across different online resources, regardless of the year of publication, in order to produce a comprehensive review. Review papers were excluded. The search returned more than 300 papers, of which 101 were selected after exclusions. This review gives an overview of recent advances and developments in the field of automated infant cry classification, focusing specifically on the infant cry databases that have been developed and on the approaches used in the signal processing and recognition phases. The article concludes with some possible implications that may lead to the development of advanced automated cry-based classification systems for real-time applications.