Automatic Motherese Detection for Face-to-Face
Interaction Analysis
A. Mahdhaoui¹, M. Chetouani¹, C. Zong²,³, C. Saint-Georges²,³, M-C. Laznik⁴, S. Maestro⁵, F. Muratori⁵, and D. Cohen²,³

¹ Institut des Systèmes Intelligents et de Robotique, CNRS FRE 2507, Université Pierre et Marie Curie, Paris, France
² Department of Child and Adolescent Psychiatry, AP-HP, Groupe Hospitalier Pitié-Salpêtrière, Université Pierre et Marie Curie, Paris, France
³ Laboratoire Psychologie et Neurosciences Cognitives, CNRS UMR 8189, Paris, France
⁴ Department of Child and Adolescent Psychiatry, Association Santé Mentale du 13ème, Paris, France
⁵ Scientific Institute Stella Maris, University of Pisa, Italy
Abstract. This paper deals with emotional speech detection in home movies. In this study, we focus on infant-directed speech, also called "motherese", which is characterized by higher pitch, slower tempo, and exaggerated intonation. In this work, we show the robustness of approaches to automatic discrimination between infant-directed speech and normal directed speech. Specifically, we estimate the generalization capability of two feature extraction schemes extracted from supra-segmental and segmental information. In addition, two machine learning approaches are considered: k-nearest neighbors (k-NN) and Gaussian mixture models (GMM). Evaluations are carried out on real-life databases: home movies of the first year of an infant.

Keywords: motherese detection, feature and classifier fusion.
1 Introduction
For more than 30 years, interest has been growing in family home movies of infants who will become autistic. Typically developing infants gaze at people, turn toward voices and express interest in communication. In contrast, infants who become autistic are characterized by abnormalities in reciprocal social interactions and in patterns of communication [1]. In this paper, we focus on a type of verbal information which has recently been shown to be crucial for engaging interaction between parent and infant. This verbal information is called "motherese" (also termed infant-directed speech); it is a simplified language/dialect/register [2] that parents use spontaneously when speaking to their young baby. From an acoustic point of view, motherese has a clear signature (high
A. Esposito et al. (Eds.): Multimodal Signals, LNAI 5398, pp. 248–255, 2009.
Springer-Verlag Berlin Heidelberg 2009
pitch, exaggerated intonation contours). The phonemes, and particularly the vowels, are more clearly articulated. Motherese has been shown to be preferred by infants over adult-directed speech and might assist infants during the language acquisition process [3]. The exaggerated patterns facilitate the discrimination between phonemes or sounds. Motherese also plays a major role during social interactions. However, even though motherese is clearly defined in terms of acoustic properties, its modeling and detection are expected to be difficult, as is the case for most emotional speech. Indeed, the characterization of spontaneous and affective speech in terms of features is still an open question, and several parameters have been proposed in the literature [4].
As a starting point, and following the definition of motherese [2], we characterized the verbal interactions by extracting supra-segmental features (prosody). However, segmental features are also often used in speech segmentation. Consequently, the utterances are characterized by both segmental (short-time spectrum) and supra-segmental (statistics of fundamental frequency, energy and duration) features. These features aim at representing the verbal information for the subsequent classification stage based on machine learning techniques.
This paper presents a framework for the study of parent-infant interaction during the infant's first year, focusing on the engagement produced by motherese in typically developing infants and in infants who will become autistic. To this purpose, we use a longitudinal case-study methodology based on the analysis of home movies. We focus on a basic and crucial task, namely the classification of verbal information as motherese or adult-directed speech, which requires the design of a robust motherese detector. Section 2 presents the longitudinal corpora used. The proposed method is described in section 3, with specific attention to the different stages: feature extraction, classification and decision fusion. The experiments performed are discussed in section 4. Finally, conclusions and suggestions for further work are detailed in section 5.
2 Database
The speech corpus used in this experiment is a collection of natural and spontaneous interactions of the kind usually used in child development research (home movies). The corpus consists of recordings in Italian of a mother and father as they addressed their infants. The analysis of home movies makes it possible to set up a longitudinal study (months or years) and gives information about the early behaviors of autistic infants, long before the diagnosis is made by clinicians. However, the size of such a corpus makes it inconvenient to review manually. Moreover, the recordings are not made by professionals, resulting in adverse conditions (noise, camera, microphones). We focus on one home video totaling 3 hours of data covering the first year of an infant. The verbal interactions of the mother have been carefully annotated by a psycholinguist into two categories: motherese and normal directed speech. From this manual annotation, we extracted 100 utterances for each class. The utterances are typically between 0.5 s and 4 s in length.
Fig. 1. Motherese classification system
3 System Description
This work aims at designing an automatic detection system for the analysis of
parent-infant interaction. This system will provide an independent classification
of the utterances. In order to improve the system, two different approaches have
been investigated individually: segmental and supra-segmental. Figure 1 shows
a schematic overview of the final system, which is described in more detail in the following paragraphs.
3.1 Feature Extraction
In this paper, we evaluate two feature sets, respectively termed segmental and supra-segmental. The former are characterized by Mel-Frequency Cepstrum Coefficients (MFCC), while the latter are characterized by statistical measures of both the fundamental frequency (F0) and the short-time energy; these statistics are computed over the voiced segments. For the computation of the segmental features, a 20 ms window is used, with an overlap of 1/2 between adjacent frames, and a parameterized vector of order 16 is computed. The supra-segmental features consist of 3 statistics (mean, variance and range) of both F0 and short-time energy, resulting in a 6-dimensional vector. Note that the duration of the acoustic events is not directly included as a feature but is taken into account during the classification process by a weighting factor (eq. 5). The feature vectors are normalized (zero mean, unit standard deviation).
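As an illustration, the supra-segmental vector described above (mean, variance and range of F0 and short-time energy over voiced frames) can be sketched with numpy. The autocorrelation-based F0 estimator and the 0.3 voicing threshold below are our own simplifications for the sketch, not the authors' implementation:

```python
import numpy as np

def voiced_f0_energy(signal, sr=16000, frame_len=0.02, f0_min=80.0, f0_max=400.0):
    """Frame the signal (20 ms window, 1/2 overlap as in the paper), estimate
    F0 per frame by autocorrelation, and keep only frames judged voiced."""
    win = int(frame_len * sr)
    hop = win // 2
    f0s, energies = [], []
    for start in range(0, len(signal) - win, hop):
        frame = signal[start:start + win]
        energy = float(np.sum(frame ** 2))
        ac = np.correlate(frame, frame, mode="full")[win - 1:]
        lo, hi = int(sr / f0_max), int(sr / f0_min)
        if hi >= len(ac):
            continue
        lag = lo + int(np.argmax(ac[lo:hi]))
        # crude voicing decision: a strong periodic autocorrelation peak
        if ac[0] > 0 and ac[lag] > 0.3 * ac[0]:
            f0s.append(sr / lag)
            energies.append(energy)
    return np.array(f0s), np.array(energies)

def supra_segmental_vector(f0s, energies):
    """6-dimensional vector: mean, variance and range of F0 and energy."""
    stats = lambda v: [v.mean(), v.var(), v.max() - v.min()]
    return np.array(stats(f0s) + stats(energies))

# toy "utterance": a pure 200 Hz tone, so the mean F0 should sit near 200 Hz
sr = 16000
t = np.arange(sr) / sr
f0s, energies = voiced_f0_energy(np.sin(2 * np.pi * 200 * t), sr)
vec = supra_segmental_vector(f0s, energies)
```

In the full system the resulting vectors would additionally be z-score normalized across the training set, as stated above.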
3.2 Classification
In this study, two different classifiers were investigated: k-NN and GMM. The k-NN classifier [6] is a distance-based method while the GMM classifier [5] is a statistical model. For the fusion process, we adopted a common statistical framework for both the k-NN and the GMM, based on the estimation of a posteriori probabilities.

A Posteriori Probabilities Estimation. The Gaussian mixture model (GMM) [5] is adopted to represent the distribution of the features. Under the assumption
[5] is adopted to represent the distribution of the features. Under the assumption
that the feature vector sequence x = {x_1, x_2, ..., x_n} is independent and identically distributed, the estimated distribution of the d-dimensional feature vector x is a weighted sum of M component Gaussian densities g(\mu_i, \Sigma_i), each parameterized by a mean vector \mu_i and a covariance matrix \Sigma_i; the mixture density for the model C_m is defined as:

    p(x | C_m) = \sum_{i=1}^{M} \omega_i \, g(\mu_i, \Sigma_i)(x)    (1)

Each component density is a d-variate Gaussian function:

    g(\mu, \Sigma)(x) = \frac{1}{(2\pi)^{d/2} \det(\Sigma)^{1/2}} \, e^{-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)}    (2)

The mixture weights \omega_i satisfy the constraint \sum_{i=1}^{M} \omega_i = 1. The feature vector x is then modeled by the following posterior probability:

    p(C_m | x) = \frac{p(x | C_m) P(C_m)}{\sum_{j} p(x | C_j) P(C_j)}    (3)

where P(C_m) is the prior probability of class C_m; we assume equal prior probabilities. We use the expectation-maximization (EM) algorithm to obtain maximum-likelihood estimates of the mixture parameters.
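A minimal numpy sketch of this GMM posterior computation follows. The single-component "mixtures" and all parameter values are illustrative stand-ins for models that would normally be fitted by EM on real features:

```python
import numpy as np

def gaussian_density(x, mu, cov):
    """d-variate Gaussian density g(mu, Sigma)(x)."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm)

def mixture_likelihood(x, weights, mus, covs):
    """p(x | C_m): weighted sum of M Gaussian component densities."""
    return sum(w * gaussian_density(x, mu, cov)
               for w, mu, cov in zip(weights, mus, covs))

def gmm_posteriors(x, class_models, priors):
    """p(C_m | x) by Bayes' rule; the paper assumes equal priors."""
    likes = np.array([mixture_likelihood(x, *m) for m in class_models])
    joint = likes * np.array(priors)
    return joint / joint.sum()

# two toy one-component classes in 2-D (parameters illustrative only)
eye = np.eye(2)
motherese = ([1.0], [np.array([0.0, 0.0])], [eye])
normal    = ([1.0], [np.array([3.0, 3.0])], [eye])
post = gmm_posteriors(np.array([0.2, 0.1]), [motherese, normal], [0.5, 0.5])
```

A point near the first class mean yields a posterior heavily favoring that class; in practice the mixture parameters would come from an EM fit (e.g. scikit-learn's GaussianMixture).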
The k-NN classifier [6] is a non-parametric technique which assigns to the input vector the label of the majority of its k nearest neighbors (prototypes). In order to keep a common framework with the statistical classifier (GMM), we estimate the posterior probability that a given feature vector x belongs to class C_m using the k-NN estimate [6]:

    p_{knn}(C_m | x) = \frac{k_m}{k}    (4)

where k_m denotes the number of prototypes among the k nearest neighbors that belong to class C_m.
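The k-NN posterior estimate k_m / k reduces to a few lines; the toy prototypes and Euclidean distance below are illustrative:

```python
import numpy as np

def knn_posterior(x, prototypes, labels, k, n_classes=2):
    """p_knn(C_m | x) = k_m / k: the fraction of the k nearest
    prototypes (by Euclidean distance) carrying label m."""
    dists = np.linalg.norm(prototypes - x, axis=1)
    nearest = labels[np.argsort(dists)[:k]]
    return np.array([np.sum(nearest == m) for m in range(n_classes)]) / k

# toy prototypes: class 0 clustered at the origin, class 1 near (5, 5)
protos = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0], [5.1, 4.9]])
labels = np.array([0, 0, 0, 1, 1])
post = knn_posterior(np.array([0.05, 0.05]), protos, labels, k=3)
```

Note that this estimate can be exactly 0 for a class, which is why the fusion stage in section 3.3 avoids taking its logarithm directly.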
Segmental and Supra-Segmental Characterizations. Segmental features (i.e. MFCC) are extracted from all the frames of an utterance U_x, independently of the voiced or unvoiced parts. Posterior probabilities are then estimated by both the GMM and k-NN classifiers, and are respectively termed P_{gmm,seg}(C_m | U_x) and P_{knn,seg}(C_m | U_x).
The classification of supra-segmental features follows the segment-based approach (SBA) [7]. An utterance U_x is segmented into N voiced segments F_{x_i} obtained by F0 extraction (cf. section 3.1). A local posterior estimate is computed for each segment, and the utterance-level classification combines the N local estimates, with the duration of each segment introduced as a weight reflecting the importance of the voiced segment (length(F_{x_i})):

    P(C_m | U_x) = \frac{\sum_{i=1}^{N} \mathrm{length}(F_{x_i}) \, P(C_m | F_{x_i})}{\sum_{i=1}^{N} \mathrm{length}(F_{x_i})}    (5)

The estimation is again carried out by the two classifiers, resulting in the supra-segmental characterizations P_{gmm,supra}(C_m | U_x) and P_{knn,supra}(C_m | U_x).
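A sketch of this duration-weighted combination, under the assumption that each segment's weight is simply its length normalized so the weights sum to one (the segment durations and local posteriors below are made up for illustration):

```python
import numpy as np

def utterance_posterior(seg_posteriors, seg_lengths):
    """Combine N local voiced-segment posteriors into one utterance-level
    posterior, weighting each segment by its duration."""
    w = np.asarray(seg_lengths, dtype=float)
    p = np.asarray(seg_posteriors, dtype=float)   # shape (N, n_classes)
    return (w[:, None] * p).sum(axis=0) / w.sum()

# three voiced segments of 0.3 s, 0.8 s and 0.2 s with toy local posteriors
post = utterance_posterior([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]],
                           [0.3, 0.8, 0.2])
```

Because each local posterior sums to one, the weighted average is again a valid posterior; long voiced segments dominate the utterance-level decision.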
3.3 Fusion
The segmental and supra-segmental characterizations provide different temporal
information and a combination of them should improve the accuracy of the
detector. Many decision techniques can be employed [9] but we investigated a
simple weighted sum of the log-likelihoods from the different classifiers:

    C_l = \lambda \log(P_{seg}(C_m | U_x)) + (1 - \lambda) \log(P_{supra}(C_m | U_x))    (6)

with l = 1 (motherese) or l = 2 (normal directed speech); \lambda denotes the weighting coefficient. For the GMM classifier, the likelihoods can be computed directly from the posterior probabilities (P_{gmm,seg}(C_m | U_x), P_{gmm,supra}(C_m | U_x)) [5]. However, the k-NN estimate can produce a null posterior probability (eq. 4), which is incompatible with taking the logarithm. We used a solution recently tested by Kim et al. [8], which consists in using the posterior probability itself instead of its logarithm for the k-NN term:

    C_l = \lambda \log(e^{P_{knn,seg}(C_m | U_x)}) + (1 - \lambda) \log(P_{gmm,supra}(C_m | U_x))    (7)
Table 1. Table of combinations

Comb1: P_knn,seg + P_knn,supra
Comb2: P_gmm,seg + P_gmm,supra
Comb3: P_knn,seg + P_gmm,supra
Comb4: P_gmm,seg + P_knn,supra
Comb5: P_gmm,seg + P_knn,seg
Comb6: P_gmm,supra + P_knn,supra
Consequently, for the k-NN classifier we used equation 7 while for the GMM the
likelihood is conventionally computed. We investigated cross combinations listed
in table 1.
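The fusion rules of equations (6) and (7) can be sketched as follows; the toy posterior values and the 0/1 class encoding are illustrative:

```python
import numpy as np

def fuse(p_seg, p_supra, lam, knn_seg=False):
    """Weighted sum of log scores. When the segmental score comes from a
    k-NN posterior (which can be exactly 0), substitute log(e^p) = p so
    the fused score stays finite, as in the Kim et al. workaround."""
    seg_term = p_seg if knn_seg else np.log(p_seg)
    return lam * seg_term + (1 - lam) * np.log(p_supra)

def decide(p_seg, p_supra, lam, knn_seg=False):
    """Pick the class (0 = motherese, 1 = normal speech) with the
    larger fused score C_l."""
    scores = [fuse(p_seg[m], p_supra[m], lam, knn_seg) for m in (0, 1)]
    return int(np.argmax(scores))

# segmental evidence favors motherese while supra-segmental is ambiguous
label = decide(p_seg=[0.8, 0.2], p_supra=[0.55, 0.45], lam=0.6)
```

With λ = 0.6 the segmental term dominates and the motherese class wins; a null k-NN posterior no longer produces -inf because its term enters linearly.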
4 Experimental Results
4.1 Classifier Configuration
To find the optimal structure of our classifiers, we adjusted the different parameters: the number of Gaussians (M) for the GMM classifier and the number of neighbors (k) for the k-NN classifier. Table 2 shows the best configuration for both the GMM and k-NN classifiers with segmental and supra-segmental features; the
Table 2. Accuracy of optimal configurations

           segmental        supra-segmental
k-NN       72.5% (k=11)     61% (k=7)
GMM        79.5% (M=15)     82% (M=16)
same table shows that the GMM classifier trained with prosodic features outperforms the other classifiers, consistent with the acoustic definition of motherese [2].
4.2 Fusion of Best Systems
To evaluate system performance we used the receiver operating characteristic (ROC) methodology [6]. A ROC curve (Fig. 2) represents the tradeoff between the true positive rate (TPR) and the false positive rate (FPR) as the classifier output threshold is varied. Two quantitative measures of verification performance, the equal error rate (EER) and the area under the ROC curve (AUC), were calculated. It should be noted that while the EER represents the performance of a classifier at only one operating threshold, the AUC represents the overall performance of the classifier over the entire range of thresholds. Hence, we employed the AUC rather than the EER for comparing the verification performance of two classifiers and their combination. The result obtained in table 2 motivates an investigation of the fusion of both features and classifiers following the statistical approach described in section 3.2. The combination of features and classifiers is known to be efficient [9]. However, one
Fig. 2. ROC curves for Comb1 (kNN(seg, supra)) and Comb2 (GMM(seg, supra))
Table 3. Optimal cross-combinations (two best settings per combination)

Fusion    Comb1 (K=1, K=11)  Comb2 (M=12, M=15)  Comb3 (M=12, K=11)  Comb4 (K=1, M=15)  Comb5 (K=11, M=12)  Comb6 (K=5, M=16)
λ         0.8 / 0.9          0.6 / 0.8           0.9 / 0.7           0.5 / 0.6          0.9 / 0.7           0.5 / 0.4
AUC       0.813 / 0.812      0.932 / 0.922       0.913 / 0.905       0.846 / 0.845      0.845 / 0.849       0.897 / 0.895
EER (%)   26 / 29            12 / 19             13 / 16             22 / 22            25 / 24             16 / 14
should be careful, because fusing the best individual configurations does not always give the best results, since the efficiency depends on the errors produced by the classifiers (independent vs. dependent) [9]. Table 1 and section 3.3 show that 6 different fusion schemes can be investigated (Comb1 to Comb6), and for each of them we optimized the classifier parameters (k, M) and the weighting λ (eq. 6). In table 3 and figure 2 (Fig. 2) we can see that for the k-NN classifier, the best scores (AUC = 0.813/0.812) are obtained with an important contribution of the segmental features (λ = 0.8), in agreement with the results obtained without fusion (table 2). The best GMM results (AUC = 0.932/0.922) are obtained with a weighting factor of 0.6, revealing a balance between the two feature sets. Although motherese is characterized by prosody, we showed that it is worthwhile to combine acoustic and prosodic features; the fusion of the two classifiers also increased the performance of the detector.
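A self-contained sketch of the ROC/AUC/EER evaluation used above (a threshold sweep, trapezoidal AUC, and a nearest-point EER approximation; not the authors' exact tooling, and the toy scores are illustrative):

```python
import numpy as np

def roc_points(scores, labels):
    """Sweep the decision threshold over all observed scores and collect
    (FPR, TPR) pairs; labels are 1 = positive (motherese)."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pts = [(0.0, 0.0)]  # threshold above every score
    for thr in np.sort(np.unique(scores))[::-1]:
        pred = scores >= thr
        tpr = (pred & (labels == 1)).sum() / (labels == 1).sum()
        fpr = (pred & (labels == 0)).sum() / (labels == 0).sum()
        pts.append((float(fpr), float(tpr)))
    return pts

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    pts = sorted(points)
    return float(sum((f2 - f1) * (t1 + t2) / 2.0
                     for (f1, t1), (f2, t2) in zip(pts, pts[1:])))

def eer(points):
    """Approximate equal error rate: the ROC point where FPR is closest
    to the miss rate 1 - TPR."""
    fpr, tpr = min(points, key=lambda p: abs(p[0] - (1 - p[1])))
    return (fpr + (1 - tpr)) / 2.0

# a perfectly separating scorer should reach AUC = 1.0 and EER = 0
pts = roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
area = auc(pts)
```

Because the AUC integrates over all thresholds, it is the measure used above to compare classifiers, while the EER summarizes a single operating point.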
5 Conclusions
We have developed a first motherese detection system, tested in a speaker-dependent mode, using classification techniques (GMM and k-NN) that are also commonly used in speech and speaker recognition. The fusion of features and classifiers was also investigated. We obtained results from which we can draw several conclusions. Firstly, our results show that segmental features alone contain much useful information for discriminating between motherese and adult-directed speech, since they outperform the supra-segmental features; thus, segmental features can be used on their own. However, according to our detection results, prosodic features are also very promising. Based on these two conclusions, we combined classifiers that use segmental features with classifiers that use supra-segmental features and found that this combination improves the performance of the motherese detector considerably. For our motherese classification experiments, we used only utterances that were already segmented (based on human transcription). In other words, the detection of the onset and offset of motherese was not investigated in this study but can be addressed in a follow-up study. Detection of the onset and offset of motherese (motherese segmentation) can be seen as a separate problem that raises further interesting questions, such as how to define the beginning and end of motherese, and what kind of evaluation measures to use.
References
1. Muratori, F., Maestro, S.: Autism as a downstream effect of primary difficulties
in intersubjectivity interacting with abnormal development of brain connectivity.
International Journal for Dialogical Science Fall 2(1), 93–118 (2007)
2. Fernald, A., Kuhl, P.: Acoustic determinants of infant preference for Motherese
speech. Infant Behavior and Development 10, 279–293 (1987)
3. Kuhl, P.K.: Early language acquisition: Cracking the speech code. Nature Reviews
Neuroscience 5, 831–843 (2004)
4. Schuller, B., Batliner, A., Seppi, D., Steidl, S., Vogt, T., Wagner, J., Devillers, L.,
Vidrascu, L., Amir, N., Kessous, L., Aharonson, V.: The relevance of feature type
for the automatic classification of emotional user states: low level descriptors and
functionals. In: Proceedings of Interspeech, pp. 2253–2256 (2007)
5. Reynolds, D.: Speaker identification and verification using Gaussian mixture
speaker models. Speech Communication 17, 91–108 (1995)
6. Duda, R., Hart, P., Stork, D.: Pattern Classification, 2nd edn. Wiley-Interscience, New York (2000)
7. Shami, M., Verhelst, W.: An Evaluation of the Robustness of Existing Supervised Machine Learning Approaches to the Classification of Emotions in Speech. Speech Communication 49(3), 201–212 (2007)
8. Kim, S., Georgiou, P., Lee, S., Narayanan, S.: Real-time emotion detection system
using speech: Multi-modal fusion of different timescale features. In: IEEE Interna-
tional Workshop on Multimedia Signal Processing (October 2007)
9. Kuncheva, L.I.: Combining Pattern Classifiers: Methods and Algorithms. Wiley, Chichester (2004)
... Recent research has sought to address the automatic classification of CDS [3,4,5,6]. Child-directed speech is typically characterized by elongated vowels, highly varying pitch contours [7,8], and generally clearer speech [9]. Previous work uses binary classifiers [4,5] or Hidden Markov Models [6] on pre-segmented audio, which is therefore not immediately applicable to a raw audio recording. ...
... Child-directed speech is typically characterized by elongated vowels, highly varying pitch contours [7,8], and generally clearer speech [9]. Previous work uses binary classifiers [4,5] or Hidden Markov Models [6] on pre-segmented audio, which is therefore not immediately applicable to a raw audio recording. Further, Mahdhaoui et al. [4] trained and tested on the same two speakers. ...
... Previous work uses binary classifiers [4,5] or Hidden Markov Models [6] on pre-segmented audio, which is therefore not immediately applicable to a raw audio recording. Further, Mahdhaoui et al. [4] trained and tested on the same two speakers. Vosoughi et al. [3] perform their automatic detection on naturalistic recording but they also rely on the fact that only three speakers are present in the data, and can therefore identify the speakers and use speaker characteristics in some of their features. ...
Conference Paper
Full-text available
Identifying the distinct register that adults use when speaking to children is an important task for child development research. We present a fully automatic, speaker-independent system that detects child-directed speech. The two-stage system uses diarization-style voice activation techniques to extract speech segments followed by a supervised ν-SVM classifier trained on 1582 prosodic and log Mel energy features. The system significantly improves the state of the art, detecting child-directed speech with F1 of.66 (exact boundary) and.83 (within 1 second). A feature analysis confirms the importance of F0 features (especially 3rd quartile and range) as well as new features like the variance, kurtosis, and min of log Mel energy within a frequency band.
... figure 2). 4 Introduction générale et extra-lingustiques). Parmi ces éléments, on retrouve la qualité vocale ainsi que d'autres caractéristiques prosodiques comme le rythme de la parole ou l'intonation. ...
... 18 1. 4.1 Signaux verbaux de communication . . . . . . . . . . . 18 1.4.2 ...
... Le 3. Un canal où circule les signaux codés. 4. Un récepteur qui décode le message à partir des signaux. 5. Une destination ou un destinataire qui reçoit le message. ...
... Another reason of the algorithm's accuracy may be the use of MFCC, which is a robust technique for feature extraction and tolerates a certain degree of variability in the signal. Moreover, differently from previous algorithms (Mahdhaoui et al., 2009;Mahdhaoui and Chetouani, 2011;Slaney and McRoberts, 1998), together with MFCC, the present algorithm also uses HMM, which can model the statistical variation in spectral features and speech rates, increasing the algorithm's performance. ...
... That is, once the algorithm is implemented, it does not need to be trained for every person whose speech is to be eval- uated. This is a clear advantage over past approaches that opted for a speaker-dependent classification method (e.g., Mahdhaoui et al., 2009;Slaney and McRoberts, 1998). The need to train for specific speakers reduces the generalizability of the algorithm and, more importantly, increases the effort required by participants. ...
Full-text available
The aim of the present work was a cross-linguistic generalization of Inoue et al.'s (2011) algorithm for discriminating infant- (IDS) vs. adult-directed speech (ADS). IDS is the way in which mothers communicate with infants; it is a universal communicative property, with some cross-linguistic differences. Inoue et al. (2011) implemented a machine algorithm that, by using a mel-frequency cepstral coefficient and a hidden Markov model, discriminated IDS from ADS in Japanese. We applied the original algorithm to two other languages that are very different from Japanese - Italian and German - and then tested the algorithm on Italian and German databases of IDS and ADS. Our results showed that: First, in accord with the extant literature, IDS is realized in a similar way across languages; second, the algorithm performed well in both languages and close to that reported for Japanese. The implications for the algorithm are discussed.
... Automatic identification of emotional content in speech has also been applied to categorize different communicative intentions in infant-directed speech. For this task, supra-segmental features are examined such as statistical measures of fundamental frequency and properties of the fundamental frequency contour shape (Mahdhaoui, et al., 2009;Katz, Cohn, & Moore, 2008) In the present study, we will make use of computational techniques, linguistic and psychology knowledge with the purpose of understanding music and speech categorization by infants. Methods used to carry out this study will be described in the next section. ...
... Previous research has shown that the shape of infant-directed speech melodic contours can be categorized into contour prototypes, according to their communicative intent (Fernald, 1989). Automatic characterization of emotional content in speech regarding motherese has been implemented and features concerning the melodic contour of speech have shown satisfactory results for the task (Mahdhaoui, et al., 2009). Do melodic contour related features show the best performance when discriminating interaction classes such as affection, disapproval and question? ...
Full-text available
In the present study, we aim to capture rhythmic and melodic patterning in speech and singing directed to infants. We address this issue by exploring the acoustic features that best predict different classification problems. We built a database composed by infant-directed speech from two Portuguese variants (European vs Brazilian Portuguese) and infant-directed singing from the two cultures, comprising 977 tokens. Machine learning experiments were conducted in order to automatically discriminate between language variants for speech, vocal songs and between interaction contexts. Descriptors related with rhythm exhibited strong predictive ability for both speech and singing language variants' discrimination tasks, presenting different rhythmic patterning for each variant. Moreover, common features could be used by a classifier to discriminate speech and singing tasks, indicating that the processing of speech and singing might share the analysis of the same properties of the stimuli. With respect to discrimination between different interaction contexts, pitch-related descriptors showed better performance. Therefore, we conclude that prosodic cues present in the surrounding sonic environment of an infant are sources of rich information not only to make distinction between different communicative contexts through melodic cues, but also to provide specific cues about the rhythmic identity of their mother tongue. These prosodic differences may lead to further research on their influence in infant's development of musical representations.
... Researchers in autism pathology and parent-infant interaction highlighted the importance of infant-directed speech for infants who will become autistic [14] [16]. Given that home movies offer a unique opportunity to follow infant development and ParentlInfant interactions. ...
... If we look to the definition of infant-directed speech and the acoustic characteristics of this kind of speech, hight pitch/dialect/register [3], the problem of classification seems to be not very complicated, a simple method based only on prosodic feature should immediately discriminate motherese from normal speech. However, in [13] [14] shown that the processing of natural and authentic interactions requires the development of methods beyond the strict definition: prosodic features alone were not sufficient to resolve this problem. We tested many machine learning techniques, statistical and parametric, with different feature extraction methods (time/frequency domains). ...
Conference Paper
Full-text available
Authentic and natural infant-parent interactions analysis requires the development of efficient detectors such as the discrimination between infant and adult-directed speech. Supervised methods have been found to be efficient for labeled data. The annotation process is time-consuming and the eventual divergence between annotators increases the difficulty. Semi-supervised approaches such as co-training offers a framework allowing to take advantage of supervised classifiers trained by different features. The proposed motherese detector system combined various features and classifiers used in emotion recognition in a co-training framework. The results show the relevance of this approach for real-life corpora such as home movies.
... Additionally, two databases were screened and 144 relevant studies were retained. Areas where significant similarities have been found in terms of this research include the studies concerning the differences between CDS and adult directed speech (Mahdhaoui et al., 2009). Data analysis provided suggests a strong link between the register of the mother and the differences that were found in the studies mentioned. ...
Conference Paper
Full-text available
Témou príspevku je expresívna lexika v reči matky orientovanej na dieťa. Predmetom jazykovej analýzy bolo skúmanie, identifikácia, rozbor a charakteristika lexikálnych znakov v komunikačnom registri matky, ktorej komunikácia bola skúmaná. Materiálové východisko tvorí 5 hodín spontánnej komunikácie zaznamenávanej v prirodzenom domácom prostredí. Hlavným cieľom príspevku je charakterizovať komunikačný register matky z hľadiska uplatňovania expresívnej lexiky, ale zároveň sa pokúšame aj o charakteristiku jej individuálneho personálneho štýlu. Analýza bola realizovaná prostredníctvom pretranskribovaných audio- a videonahrávok, ktoré predstavujú náš výskumný materiál.
... Specifically in relation to mothers and infants, singing has been discussed as an evolutionary adaptation designed to support mother-infant bonding. Falk (2004) has proposed that singing developed for this purpose directly out of motherese; a style of infant-directed speech consisting of exaggerations, elevated pitch, slow repetitions, and melodic elaborations of ordinary vocal communication ( Dissanayake, 2004;Mahdhaoui et al., 2009;Saint-Georges et al., 2013). Motherese has been found to occur in cultures globally ( Gogate, Maganti, & Bahrick, 2015;Grieser & Kuhl, 1988;Papoušek, Papoušek, & Symmes, 1991;Trehub, Unyk, & Trainor, 1993a, 1993b). ...
Full-text available
Among mammals who invest in the production of a relatively small number of offspring, bonding is a critical strategy for survival. Mother–infant bonding among humans is not only linked with the infant’s survival but also with a range of protective psychological, biological, and behavioral responses in both mothers and infants in the post-birth period and across the life span. Anthropological theories suggest that one behavior that may have evolved with the aim of enhancing mother–infant bonding is infant-directed singing. However, to date, despite mother–infant singing being practiced across cultures, there remains little quantitative demonstration of any effects on mothers or their perceived closeness to their infants. This within-subjects study, comparing the effects of mother–infant singing with other mother–infant interactions among 43 mothers and their infants, shows that singing is associated with greater increases in maternal perceptions of emotional closeness in comparison to social interactions. Mother–infant singing is also associated with greater increases in positive affect and greater decreases in negative affect as well as greater decreases in both psychological and biological markers of anxiety. This supports previous findings about the effects of singing on closeness and social bonding in other populations. Furthermore, associations between changes in closeness and both affect and anxiety support previous research suggesting associations between closeness, bonding, and wider mental health.
This chapter presents a detailed conceptual analysis of the nature of the object of autistic foreclosure. More specifically, it provides three possible frameworks through which the object of autistic foreclosure can be addressed. In the first section, the object of autistic foreclosure is associated with the “unary trait” based on Lacan’s Seminar IX: Identification (1961–1962). The following two sections go on to explore two prominent contemporary approaches to autistic foreclosure. The second section presents Éric Laurent’s notion of the hole as the object of autistic foreclosure. Based on Lacan’s topological account of the function of the hole in the figure of the torus in Seminar IX, it develops Laurent’s notion of the “foreclosure of the hole.” The third section presents Jean-Claude Maleval’s account of the autistic retention of the object of the invocatory drive. Based on Lacan’s Seminar X: Anxiety (1962–1963) and Seminar XI: The Four Fundamental Concepts of Psychoanalysis (1964), this section reformulates Maleval’s account of the retention of the voice in terms of the functioning of the mechanism of autistic foreclosure.
This thesis is situated at the border of two research domains: emotional speech recognition and affective interaction analysis. It revolves around the automatic detection of infant-directed speech in home movies and the analysis of child-parent interaction in the same data. In order to analyse a special kind of speech called motherese, or infant-directed speech, we first developed an automatic infant-directed speech detection system based on a supervised learning approach, with the aim of analysing parent-infant interaction. However, supervised methods have some significant limitations, one of which is that a large amount of labelled data is needed for training. To overcome this problem, we implemented a new semi-supervised approach for infant-directed speech detection based on an extension of the standard co-training algorithm of Blum and Mitchell. The second part of this thesis consists of a study of non-verbal communication. We focus on the analysis and interpretation of the different social signals exchanged in parent-infant interaction. There are, however, considerable differences among infants in the quality of their interaction with their parents. These differences depend especially on the infant's development, which affects the parents' behaviour. In our study we are interested in three groups of infants: infants with typical development (TD), infants with autism (AD) and infants with mental retardation (MR). In order to identify the groups of signs/behaviours that differentiate the development of the three groups of children, we investigated a clustering method, NMF (non-negative matrix factorization), an algorithm based on decomposition by parts that can reduce the dimension of interaction signs to a small number of groups of interaction behaviours, coupled with a statistical data representation usually used for document clustering and adapted to our work.
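The semi-supervised co-training idea mentioned in this abstract can be sketched as follows. This is a deliberately simplified illustration, not the thesis's implementation: the nearest-centroid scorer, synthetic data, and the two feature "views" (standing in for, e.g., segmental and supra-segmental features) are all assumptions made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_centroids(X, y):
    """Nearest-centroid scorer: score > 0 means 'closer to class 1'."""
    c0, c1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    return lambda Z: (np.linalg.norm(Z - c0, axis=1)
                      - np.linalg.norm(Z - c1, axis=1))

def co_train(X1, X2, y, labeled, unlabeled):
    """Two feature views take turns pseudo-labelling the unlabeled example
    they are most confident about (co-training in the spirit of Blum and
    Mitchell); the labeled pool is shared between the views."""
    labeled, unlabeled, y = list(labeled), list(unlabeled), y.copy()
    while unlabeled:
        for X in (X1, X2):
            if not unlabeled:
                break
            s = fit_centroids(X[labeled], y[labeled])(X[unlabeled])
            k = int(np.argmax(np.abs(s)))   # most confident example
            i = unlabeled.pop(k)
            y[i] = int(s[k] > 0)            # pseudo-label from this view
            labeled.append(i)
    return y

# synthetic demo: 40 utterances, two redundant feature views
n = 40
true = np.repeat([0, 1], n // 2)            # 0 = adult-directed, 1 = motherese
shift = np.where(true[:, None] == 0, -2.0, 2.0)
X1 = rng.normal(0, 0.5, (n, 2)) + shift     # view 1 of each utterance
X2 = rng.normal(0, 0.5, (n, 2)) + shift     # view 2 of each utterance
labeled = [0, 1, n - 2, n - 1]              # only four hand-labelled clips
unlabeled = [i for i in range(n) if i not in labeled]
y0 = np.where(np.isin(np.arange(n), labeled), true, -1)
y_hat = co_train(X1, X2, y0, labeled, unlabeled)
accuracy = (y_hat == true).mean()
```

With well-separated classes, four hand-labelled examples are enough to pseudo-label the remaining 36 correctly; the point of the scheme is precisely that labelled data stays scarce while unlabelled home-movie segments are plentiful.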
In this paper, we report on classification results for emotional user states (4 classes, German database of children interacting with a pet robot). Six sites computed acoustic and linguistic features independently from each other, following in part different strategies. A total of 4244 features were pooled together and grouped into 12 low level descriptor types and 6 functional types. For each of these groups, classification results using Support Vector Machines and Random Forests are reported for the full set of features, and for 150 features each with the highest individual Information Gain Ratio. The performance for the different groups varies mostly between ≈ 50% and ≈ 60%. Index Terms: emotional user states, automatic classification, feature types, functionals
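The Information Gain Ratio criterion used above for feature ranking can be sketched for discrete features (a minimal illustration; the paper's 4244 acoustic and linguistic features are continuous descriptors and would need discretisation first):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy in bits of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gain_ratio(feature, labels):
    """Information Gain Ratio of one discrete feature w.r.t. the classes:
    information gain normalised by the feature's own split entropy."""
    H = entropy(labels)
    vals, counts = np.unique(feature, return_counts=True)
    cond = sum(c / len(labels) * entropy(labels[feature == v])
               for v, c in zip(vals, counts))
    split = entropy(feature)        # penalises many-valued features
    return (H - cond) / split if split > 0 else 0.0

# a perfectly informative feature vs. one independent of the class
y      = np.array([0, 0, 0, 0, 1, 1, 1, 1])
f_good = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # mirrors the class exactly
f_bad  = np.array([0, 1, 0, 1, 0, 1, 0, 1])   # carries no class information
```

Ranking all features by `gain_ratio` and keeping the top 150, as the paper does per feature group, is then a one-line `sorted(..., reverse=True)` over the scores.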
The goal of this work is to build a real-time emotion detection system which utilizes multi-modal fusion of different timescale features of speech. Conventional spectral and prosody features are used for intra-frame and supra-frame features respectively, and a new information fusion algorithm which takes care of the characteristics of each machine learning algorithm is introduced. In this framework, the proposed system can be associated with additional features, such as lexical or discourse information, in later steps. To verify real-time system performance, binary decision tasks on angry and neutral emotion are performed using concatenated speech signals simulating real-time conditions.
This paper presents high performance speaker identification and verification systems based on Gaussian mixture speaker models: robust, statistically based representations of speaker identity. The identification system is a maximum likelihood classifier and the verification system is a likelihood ratio hypothesis tester using background speaker normalization. The systems are evaluated on four publicly available speech databases: TIMIT, NTIMIT, Switchboard and YOHO. The different levels of degradations and variabilities found in these databases allow the examination of system performance for different task domains. Constraints on the speech range from vocabulary-dependent to extemporaneous, and speech quality varies from near-ideal, clean speech to noisy, telephone speech. Closed set identification accuracies on the 630-speaker TIMIT and NTIMIT databases were 99.5% and 60.7%, respectively. On a 113-speaker population from the Switchboard database the identification accuracy was 82.8%. Global threshold equal error rates of 0.24%, 7.19%, 5.15% and 0.51% were obtained in verification experiments on the TIMIT, NTIMIT, Switchboard and YOHO databases, respectively.
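The likelihood-ratio verification test described here can be sketched with single-component diagonal Gaussians standing in for the full GMMs. This is an assumption made for brevity: the paper's systems use multi-component mixtures and background-speaker normalization, and the synthetic "cepstral" frames below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_gaussian(X):
    """Diagonal-covariance Gaussian; returns a per-frame log-likelihood
    function. (A one-component stand-in for a full GMM.)"""
    mu, var = X.mean(axis=0), X.var(axis=0) + 1e-6
    return lambda Z: (-0.5 * (np.log(2 * np.pi * var)
                              + (Z - mu) ** 2 / var)).sum(axis=1)

def verify(frames, speaker_ll, background_ll, threshold=0.0):
    """Accept the claimed identity if the mean per-frame log-likelihood
    ratio against the background model exceeds the threshold."""
    ratio = float((speaker_ll(frames) - background_ll(frames)).mean())
    return ratio > threshold, ratio

# synthetic 12-dimensional 'cepstral' frames
d = 12
speaker_train    = rng.normal(1.0, 1.0, (500, d))
background_train = rng.normal(0.0, 1.0, (500, d))
spk_ll, bg_ll = fit_gaussian(speaker_train), fit_gaussian(background_train)

genuine,  r_gen = verify(rng.normal(1.0, 1.0, (200, d)), spk_ll, bg_ll)
impostor, r_imp = verify(rng.normal(0.0, 1.0, (200, d)), spk_ll, bg_ll)
```

Sweeping `threshold` over the ratio scores of genuine and impostor trials is what produces the equal error rates the paper reports.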
Three experiments investigated possible acoustic determinants of the infant listening preference for motherese speech found by Fernald (1985). To test the hypothesis that the intonation of motherese speech was sufficient to elicit this preference, it was necessary to eliminate lexical content and to isolate the three major acoustic correlates of intonation: (1) fundamental frequency (F0), or pitch; (2) amplitude, correlated with loudness; and (3) duration, related to speech rhythm. Three sets of auditory reinforcers were computer-synthesized, derived from the F0 (Experiment 1), amplitude (Experiment 2), and duration (Experiment 3) characteristics of the infant- and adult-directed natural speech samples used by Fernald (1985). Thus, each of these experiments focused on particular prosodic variables in the absence of segmental variation. Twenty 4-month-old infants were tested in an operant auditory preference procedure in each experiment. Infants showed a significant preference for the F0 patterns of motherese speech, but not for the amplitude or duration patterns of motherese.
Chapter 4 discusses the fusion of label outputs. Four types of classifier outputs are listed: class labels (abstract level), ranked class labels (rank level), degree of support for the classes (measurement level) and correct/incorrect decision (oracle level). Combination methods for class label outputs are presented, including majority vote, plurality vote, weighted majority vote, naive Bayes, multinomial combiners (Behavior Knowledge Space (BKS) and Wernecke's methods), probabilistic combination and singular value decomposition (SVD) combination.
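The simplest of these class-label combiners, plurality voting with optional weights, can be sketched as follows (the classifier outputs and weights are illustrative values, not taken from the chapter):

```python
from collections import Counter

def majority_vote(labels, weights=None):
    """Combine class-label outputs by (optionally weighted) plurality vote."""
    weights = weights if weights is not None else [1.0] * len(labels)
    tally = Counter()
    for lab, w in zip(labels, weights):
        tally[lab] += w
    return tally.most_common(1)[0][0]

votes = ["motherese", "adult-directed", "motherese"]       # three classifiers
plain    = majority_vote(votes)                            # unweighted vote
weighted = majority_vote(votes, weights=[0.2, 0.9, 0.3])   # trust classifier 2
```

With equal weights the two "motherese" votes win; with weights reflecting per-classifier reliability, the single high-weight classifier can override the majority, which is the essence of the weighted variant.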
In this study, the robustness of approaches to the automatic classification of emotions in speech is addressed. Among the many types of emotions that exist, two groups of emotions are considered, adult-to-adult acted vocal expressions of common types of emotions like happiness, sadness, and anger and adult-to-infant vocal expressions of affective intents also known as “motherese”. Specifically, we estimate the generalization capability of two feature extraction approaches, the approach developed for Sony’s robotic dog AIBO (AIBO) and the segment-based approach (SBA) of [Shami, M., Kamel, M., 2005. Segment-based approach to the recognition of emotions in speech. In: IEEE Conf. on Multimedia and Expo (ICME05), Amsterdam, The Netherlands]. Three machine learning approaches are considered, K-nearest neighbors (KNN), Support vector machines (SVM) and Ada-boosted decision trees and four emotional speech databases are employed, Kismet, BabyEars, Danish, and Berlin databases.
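Of the three learners compared, k-nearest neighbors is the simplest; a minimal sketch with Euclidean distance (the toy 2-D feature space and class labels below are illustrative, not data from the study):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by a plain k-nearest-neighbor vote (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(dists)[:k]]       # labels of the k closest
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[np.argmax(counts)]                 # majority label

# toy 2-D prosodic feature space: class 1 ('motherese') sits around (2, 2)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [2., 2.], [2., 3.], [3., 2.]])
y = np.array([0, 0, 0, 1, 1, 1])
pred = knn_predict(X, y, np.array([2.5, 2.5]), k=3)
```

In practice the feature vectors would be the AIBO or segment-based descriptors the study extracts per utterance, and `k` is a hyperparameter tuned on held-out data.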
Infants learn language with remarkable speed, but how they do it remains a mystery. New data show that infants use computational strategies to detect the statistical and prosodic patterns in language input, and that this leads to the discovery of phonemes and words. Social interaction with another human being affects speech learning in a way that resembles communicative learning in songbirds. The brain's commitment to the statistical and prosodic patterns that are experienced early in life might help to explain the long-standing puzzle of why infants are better language learners than adults. Successful learning by infants, as well as constraints on that learning, are changing theories of language acquisition.