openSMILE – The Munich Versatile and Fast Open-Source Audio Feature Extractor

Florian Eyben, Martin Wöllmer, Björn Schuller
Institute for Human-Machine Communication, Technische Universität München, 80290 München, Germany
eyben@tum.de, woellmer@tum.de, schuller@tum.de
ABSTRACT
We introduce the openSMILE feature extraction toolkit, which unites feature extraction algorithms from the speech processing and the Music Information Retrieval communities. Audio low-level descriptors such as CHROMA and CENS features, loudness, Mel-frequency cepstral coefficients, perceptual linear predictive cepstral coefficients, linear predictive coefficients, line spectral frequencies, fundamental frequency, and formant frequencies are supported. Delta regression and various statistical functionals can be applied to the low-level descriptors. openSMILE is implemented in C++ with no third-party dependencies for the core functionality. It is fast, runs on Unix and Windows platforms, and has a modular, component-based architecture which makes extensions via plug-ins easy. It supports on-line incremental processing for all implemented features as well as off-line and batch processing. Numeric compatibility with future versions is ensured by means of unit tests. openSMILE can be downloaded from http://opensmile.sourceforge.net/.
Categories and Subject Descriptors
H.5.5 [Information Systems Applications]: Sound and
Music Computing
General Terms
Design, Performance
Keywords
audio feature extraction, statistical functionals, signal processing, music, speech, emotion
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
MM’10, October 25–29, 2010, Firenze, Italy.
Copyright 2010 ACM 978-1-60558-933-6/10/10 ...$10.00.
1. INTRODUCTION
Feature extraction is an essential part of many audio anal-
ysis tasks, e.g. Automatic Speech Recognition (ASR), anal-
ysis of paralinguistics in speech, and Music Information Re-
trieval (MIR). There are a few freely available feature ex-
traction utilities which, however, are mostly designed for a
special domain, such as ASR or MIR (see section 2). More-
over, they are either targeted at off-line data processing or
are libraries, which do not offer a ready-to-use, yet flexible,
feature extractor. Tools for off-line feature extraction are
useful for research tasks, but when it comes to building a
live demonstrator system (e.g. the SEMAINE system1) or
a commercial application where one wants to use the exact
same features as used in research work, something different
is needed.
We thus introduce openSMILE2, a novel open-source fea-
ture extractor for incremental processing. SMILE is an
acronym for Speech and Music Interpretation by Large-space
Extraction. Its aim is to unite features from two worlds,
speech processing and Music Information Retrieval, enabling
researchers in either domain to benefit from features from
the other domain. A strong focus is put on fully supporting
real-time, incremental processing. openSMILE provides a
simple, scriptable console application where modular fea-
ture extraction components can be freely configured and
connected via a single configuration file. No feature has to
be computed twice, since output from any feature extractor
can be used as input to all other feature extractors inter-
nally. Unit tests are provided for developers to ensure exact
numeric compatibility with future versions.
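To give a flavour of this mechanism, the sketch below shows the style of such a configuration file, wiring a wave source, a framer, and a pitch extractor via named data-memory levels (schematic and abbreviated; the exact component options are described in the openSMILE documentation and may differ from this illustration):

[componentInstances:cComponentManager]
instance[waveSource].type = cWaveSource
instance[framer].type = cFramer
instance[pitch].type = cPitch

[waveSource:cWaveSource]
writer.dmLevel = wave
filename = input.wav

[framer:cFramer]
reader.dmLevel = wave
writer.dmLevel = frames
frameSize = 0.025
frameStep = 0.010

[pitch:cPitch]
reader.dmLevel = frames
writer.dmLevel = pitch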
Even though openSMILE's primarily intended area of use is audio feature extraction, it is in principle modality independent; for example, physiological signals such as heart rate, EEG, or EMG can be analysed with openSMILE using audio processing algorithms. Moreover, an easy plug-in interface provides the ability to extend openSMILE with one's own components, so that virtually any feature extraction task can be solved while reusing existing components as building blocks.
In the following, we give an overview of related tools in section 2, describe openSMILE's design principles in section 3 and the implemented features and descriptors in section 4, provide computation time benchmarks in section 5, and summarise this overview paper in section 6.
1http://www.semaine-project.eu/
2Available at: http://opensmile.sourceforge.net/
2. RELATED TOOLKITS
Related feature extraction tools used for speech research include, e.g., the Hidden Markov Model Toolkit (HTK) [15], the PRAAT software [3], the Speech Filing System (SFS, http://www.phon.ucl.ac.uk/resource/sfs/), the Auditory Toolbox (http://cobweb.ecn.purdue.edu/~malcolm/interval/1998-010/), a Matlab toolbox by Raul Fernandez [6] (http://affect.media.mit.edu/publications.php), the Tracter framework [7], and the SNACK package (http://www.speech.kth.se/snack/) for the Tcl scripting language. However, not all of these tools are distributed under a permissive open-source license, e.g. HTK and SFS, and the SNACK package has not been maintained since 2004.
For Music Information Retrieval, many feature extraction programs under a permissive open-source license exist, e.g. the lightweight ANSI C library libXtract (http://libxtract.sourceforge.net/), the Java-based jAudio extractor [9], the Music Analysis, Retrieval and Synthesis software Marsyas (http://marsyas.sness.net/), the FEAPI framework [8], the MIRtoolbox (https://www.jyu.fi/hum/laitokset/musiikki/en/research/coe/materials/mirtoolbox), and the CLAM framework [1].
However, very few feature extraction utilities exist that unite features from both the speech and music domains. While many features are common to both, MFCC or LPC, for example, are used primarily for speech, whereas CHROMA features or algorithms for estimating multiple fundamental frequencies are found mostly in music applications. Next to low-level audio features, functionals that map time series of variable length to static values are common, e.g. for emotion recognition or music genre discrimination. Such mappings are also referred to as aggregate features or feature summaries. Further, delta coefficients, moving averages, or various filter types are commonly applied to feature contours. Hierarchies of such post-processing steps have proven to lead to more robust features; e.g. in [13], hierarchical functionals, i.e. 'functionals of functionals', are used for robust speech emotion recognition.
For an application or demonstrator system which is a re-
sult of research work, it is convenient to use the same feature
extraction code in the live system as used to produce pub-
lished results. To achieve this goal, we require incremental
processing of the input with a delay as small as possible.
3. OPENSMILE’S ARCHITECTURE
This section addresses the problems that were consid-
ered during planning of openSMILE’s architecture and sum-
marises the resulting architecture. A more detailed descrip-
tion can be found in the openSMILE documentation.
In order to address deficiencies in existing software pack-
ages, such as the lack of a comprehensive cross-domain fea-
ture set, the lack of flexibility and extensibility, and the
lack of incremental processing support, the following require-
ments had to be – and were – met:
- Incremental processing, where data from an arbitrary input stream (file, sound card, etc.) is pushed through the processing chain sample by sample and frame by frame (see figure 1).
- Ring-buffer memory for features requiring temporal context and/or buffering, and for reusability of data, i.e. to avoid duplicate computation of data used by multiple feature extractors such as FFT spectra (see figure 1, right).
- Fast and lightweight algorithms carefully implemented in C/C++, with no third-party dependencies for the core functionality.
- Modular architecture which allows for arbitrary feature combination and easy addition of new feature extractor components by the community via a well structured API and a run-time plug-in interface.
- Configuration of feature extractor parameters and component connections in a single configuration file.
Moreover, the extractor is easy to compile on many com-
monly used platforms, such as Windows, Unix, and Mac.
Figure 1 (left) shows the overall data-flow architecture of
openSMILE, where the Data Memory is the central link be-
tween all Data Sources (components that write data from ex-
ternal sources to the data memory), Data Processors (com-
ponents which read data from the data memory, modify it,
and write it back to the data memory), and Data Sinks
(components that read data from the data memory and write
it to external places such as files).
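In code, these roles can be pictured roughly as follows (a conceptual C++ sketch only; the actual openSMILE API and class names differ):

// Conceptual sketch of the data-flow roles described above (illustrative only,
// not the real openSMILE API).
#include <vector>

struct DataMemory {                         // central link between all components
    std::vector<std::vector<float>> levels; // one (ring-buffer) level per named stream, simplified
};

struct DataSource {                         // writes data from external sources to the data memory
    virtual void tick(DataMemory& dm) = 0;  // e.g. read samples from a sound card or file
    virtual ~DataSource() = default;
};

struct DataProcessor {                      // reads from the data memory, modifies, writes back
    virtual void tick(DataMemory& dm) = 0;  // e.g. windowing, FFT, Mel filterbank, functionals
    virtual ~DataProcessor() = default;
};

struct DataSink {                           // reads from the data memory and exports externally
    virtual void tick(DataMemory& dm) = 0;  // e.g. CSV/ARFF export or a classifier
    virtual ~DataSink() = default;
};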
The ring-buffer based incremental processing is illustrated
in figure 1 (mid). Three levels are present in this setup:
wave, frames, and pitch. A cWaveSource component writes
samples to the ‘wave’ level. The write positions in the levels
are indicated by the vertical arrows. A cFramer produces
frames of size 3 from the wave samples (non-overlapping),
and writes these frames to the ‘frames’ level. A cPitch component (simplified for the purpose of illustration) extracts pitch features from the frames and writes them to the ‘pitch’ level. Since all boxes in the plot contain values (i.e. data), the buffers have been filled and the write pointers have wrapped around.
for higher order features. Functionals (max and min) over
two frames (overlapping) of the pitch features are extracted
and saved to the level ‘func’.
The size of the buffers must be adjusted to the size of
the block a reader or writer reads/writes from/to the data
memory at once. In the above example the read blocksize
of the functionals component would be 2 because it reads 2
pitch frames at once. The input level buffer of ‘pitch’ must
be at least 2 frames long, otherwise the functionals compo-
nent will not be able to read a complete window from this
level. openSMILE handles this adjustment of the buffersize
automatically.
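A minimal C++ sketch of this ring-buffer logic (illustrative only, not the openSMILE implementation) is given below; the canRead() check corresponds to the block-size condition just described:

// Illustrative ring-buffer level: a reader with block size B may only proceed
// once B unread frames are available, and B must not exceed the buffer length.
#include <cstddef>
#include <vector>

class RingLevel {
public:
    RingLevel(std::size_t nFrames, std::size_t frameDim)
        : buf_(nFrames, std::vector<float>(frameDim)), size_(nFrames) {}

    void write(const std::vector<float>& frame) {
        buf_[writePos_ % size_] = frame;   // the write pointer wraps around
        ++writePos_;
    }

    bool canRead(std::size_t readPos, std::size_t blockSize) const {
        // e.g. the functionals component above has blockSize = 2 pitch frames
        return blockSize <= size_ && writePos_ >= readPos + blockSize;
    }

    const std::vector<float>& frame(std::size_t idx) const { return buf_[idx % size_]; }

private:
    std::vector<std::vector<float>> buf_;
    std::size_t size_;
    std::size_t writePos_ = 0;
};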
To speed up computation, openSMILE supports multi-
threading. Each component in openSMILE can be run in
a separate thread. This enables parallelisation of the fea-
ture extraction process on multi-core machines and reduces
computation time when processing large files.
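Conceptually, this corresponds to running each component's processing loop in its own thread, e.g. (an illustrative C++ sketch, not openSMILE code):

// Thread-per-component sketch: every component 'ticks' independently until a
// shared flag requests shutdown; synchronisation on the data memory is omitted.
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

int main() {
    std::atomic<bool> running{true};
    std::vector<std::function<void()>> componentTicks = {
        [] { /* e.g. read audio frames      */ },
        [] { /* e.g. compute FFT / MFCC     */ },
        [] { /* e.g. write features to CSV  */ },
    };

    std::vector<std::thread> workers;
    for (auto& tick : componentTicks)
        workers.emplace_back([&running, &tick] { while (running.load()) tick(); });

    running.store(false);                  // request shutdown (immediately in this sketch)
    for (auto& w : workers) w.join();
    return 0;
}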
4. AVAILABLE FEATURE EXTRACTORS
openSMILE is capable of extracting Low-Level Descrip-
tors (LLD) and applying various filters, functionals, and
transformations to these. The LLD currently implemented
are listed in table 1.
[Figure 1 (placeholder): the left panel shows the central Data Memory connected to a Data Source (e.g. sound card), Data Processors (e.g. windowing, FFT, Mel-Filterbank, Delta Coefficients, Functionals), and Data Sinks (e.g. LibSVM classifier, CSV file export); the centre and right panels illustrate the incremental filling of the ring-buffer levels described in the text.]
Figure 1: Sketch of openSMILE’s architecture (left) and incremental data-flow in ring-buffer memories (centre
and right); the (red) arrow (pointing in between the columns) indicates the current write pointer.
Feature Group: Description
Waveform: Zero-crossings, extremes, DC
Signal energy: Root mean-square & logarithmic
Loudness: Intensity & approx. loudness
FFT spectrum: Phase, magnitude (lin, dB, dBA)
ACF, Cepstrum: Autocorrelation and cepstrum
Mel/Bark spectrum: Bands 0 to N_mel
Semitone spectrum: FFT based and filter based
Cepstral: Cepstral features, e.g. MFCC, PLP-CC
Pitch: F0 via ACF and SHS methods, probability of voicing
Voice Quality: HNR, jitter, shimmer
LPC: LPC coefficients, reflection coefficients, residual, line spectral pairs (LSP)
Auditory: Auditory spectra and PLP coefficients
Formants: Centre frequencies and bandwidths
Spectral: Energy in N user-defined bands, multiple roll-off points, centroid, entropy, flux, and relative position of max./min.
Tonal: CHROMA, CENS, CHROMA-based features

Table 1: openSMILE's low-level descriptors.
The Mel-frequency features, Mel-Spectrum and Mel-Fre-
quency Cepstral Coefficients (MFCC), as well as the Percep-
tual Linear Predictive Coefficients (PLP) can be computed
exactly as described in [15], thus providing compatibility
with the popular Hidden-Markov Toolkit (HTK).
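In this HTK-compatible mode the cepstral coefficients c_i are obtained from the N log Mel filterbank outputs m_j via the discrete cosine transform given in the HTK book [15]:

c_i = \sqrt{\frac{2}{N}} \sum_{j=1}^{N} m_j \cos\!\left( \frac{\pi i}{N} \, (j - 0.5) \right).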
Delta regression coefficients can be computed from the
low-level descriptors, and a moving average filter can be ap-
plied to smooth the feature contours. All data vectors can
be processed with elementary operations such as add, mul-
tiply, and power, which enables the user to create custom
features by combining existing operations.
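The delta regression coefficients d_t of a feature contour c_t follow the standard regression formula also used by HTK [15], with \Theta denoting the context width (typically 2 frames):

d_t = \frac{\sum_{\theta=1}^{\Theta} \theta \, (c_{t+\theta} - c_{t-\theta})}{2 \sum_{\theta=1}^{\Theta} \theta^2}.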
Next, the functionals (statistical, polynomial regression coefficients, and transformations) listed in table 2 can be applied, e.g. to low-level features. This is a common technique, e.g. in emotion recognition ([11, 4]) and Music Information Retrieval [12]. The list of functionals is based on CEICES, where seven sites combined their features and established a feature coding standard that, among others, aims at a broad coverage of functional types [2]. Functionals can be applied multiple times in a hierarchical structure as described in [13].

Category: Description
Extremes: Extreme values, positions, and ranges
Means: Arithmetic, quadratic, geometric
Moments: Std. deviation, variance, kurtosis, skewness
Percentiles: Percentiles and percentile ranges
Regression: Linear and quadratic approximation coefficients, regression error, and centroid
Peaks: Number of peaks, mean peak distance, mean peak amplitude
Segments: Number of segments based on delta thresholding, mean segment length
Sample values: Values of the contour at configurable relative positions
Times/durations: Up- and down-level times, rise/fall times, duration
Onsets: Number of onsets, relative position of first/last onset/offset
DCT: Coefficients of the Discrete Cosine Transformation (DCT)
Zero-Crossings: Zero-crossing rate, mean-crossing rate

Table 2: Functionals (statistical, polynomial regression, and transformations) available in openSMILE.
Due to the modular architecture, it is possible to apply
any implemented processing algorithm to any time series,
i.e. the Mel-band filter-bank could be applied as a functional
to any feature contour. This gives researchers an efficient
and customisable tool to generate millions of novel features
without adding a single line of C++ code.
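As a minimal sketch of this principle (illustrative only, not openSMILE code), a few of the statistics from table 2 applied to a single low-level descriptor contour could be computed as follows; applying the same mapping again to windows of such outputs yields the hierarchical 'functionals of functionals' of [13]:

// Map a variable-length feature contour onto a fixed set of statistics.
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

struct BasicFunctionals { float mean = 0, stddev = 0, min = 0, max = 0, range = 0; };

BasicFunctionals computeFunctionals(const std::vector<float>& contour) {
    BasicFunctionals f;
    if (contour.empty()) return f;
    const float n = static_cast<float>(contour.size());
    f.mean = std::accumulate(contour.begin(), contour.end(), 0.0f) / n;
    float var = 0.0f;
    for (float x : contour) var += (x - f.mean) * (x - f.mean);
    f.stddev = std::sqrt(var / n);
    f.min = *std::min_element(contour.begin(), contour.end());
    f.max = *std::max_element(contour.begin(), contour.end());
    f.range = f.max - f.min;
    return f;
}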
To facilitate interoperability, feature data can be loaded
from and saved to popular file formats such as WEKA ARFF
[14], LibSVM format, Comma Separated Value (CSV) File,
HTK [15] parameter files, and raw binary files (which can
be read, e.g., in Matlab or GNU Octave).
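For instance, the HTK parameter files written by a feature extractor can be read with a few lines of C++ following the header layout documented in the HTK book [15] (a sketch with minimal error handling; the file name is hypothetical, and 4-byte float parameters on a little-endian host are assumed):

// Read an HTK parameter file: 12-byte big-endian header followed by
// nSamples * (sampSize/4) big-endian 32-bit floats.
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

static uint32_t swap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0xFF00u) | ((v << 8) & 0xFF0000u) | (v << 24);
}
static uint16_t swap16(uint16_t v) { return static_cast<uint16_t>((v >> 8) | (v << 8)); }

int main() {
    std::FILE* f = std::fopen("features.htk", "rb");   // hypothetical output file
    if (!f) return 1;
    uint32_t nSamples = 0, sampPeriod = 0; uint16_t sampSize = 0, parmKind = 0;
    std::fread(&nSamples, 4, 1, f); std::fread(&sampPeriod, 4, 1, f);
    std::fread(&sampSize, 2, 1, f); std::fread(&parmKind, 2, 1, f);
    nSamples = swap32(nSamples); sampPeriod = swap32(sampPeriod);
    sampSize = swap16(sampSize); parmKind = swap16(parmKind);

    const uint32_t dim = sampSize / 4;                  // float coefficients per frame
    std::vector<std::vector<float>> frames(nSamples, std::vector<float>(dim));
    for (auto& frame : frames)
        for (auto& x : frame) {
            uint32_t raw = 0;
            std::fread(&raw, 4, 1, f);
            raw = swap32(raw);
            std::memcpy(&x, &raw, 4);
        }
    std::fclose(f);
    std::printf("%u frames, %u coefficients, frame period %u x 100 ns, parm kind %u\n",
                (unsigned)nSamples, (unsigned)dim, (unsigned)sampPeriod, (unsigned)parmKind);
    return 0;
}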
Live recording of audio and subsequent incremental ex-
traction of features in real-time is also supported. A built-
in voice activity detection can be used to pre-segment the
recorded audio stream in real-time, and on-line mean and
variance normalisation as well as on-line histogram equal-
isation can be applied. Features extracted on-line can be directly visualised via gnuplot, which is particularly useful for demonstration and teaching tasks.
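On-line mean and variance normalisation can, for instance, be realised with exponentially decaying running estimates; the following is one typical incremental formulation (the exact scheme implemented in openSMILE may differ):

\mu_t = \alpha\,\mu_{t-1} + (1-\alpha)\,x_t, \qquad
\sigma_t^2 = \alpha\,\sigma_{t-1}^2 + (1-\alpha)\,(x_t - \mu_t)^2, \qquad
x_t' = \frac{x_t - \mu_t}{\sigma_t + \epsilon},

where x_t is a feature value at frame t, 0 < \alpha < 1 controls the decay, and \epsilon is a small constant that avoids division by zero.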
5. PERFORMANCE
Since the main objective of openSMILE is real-time op-
eration, run-time benchmarks for various feature sets are
provided. Evaluation was done on an Ubuntu Linux machine with kernel 2.6 and an AMD Phenom 64-bit CPU at 2.2 GHz (only one core was used) with 4 GB of DDR2-800 RAM. All real-time factors (rtf) were computed by timing the CPU time required for extracting features from 10 minutes of monaural 16 kHz PCM (uncompressed) audio data.
Extraction of standard PLP and MFCC frame-based fea-
tures with log-energy and 1st and 2nd order delta coefficients
can be done with an rtf of 0.012. 250 k features (hierarchical
functionals (2 levels) of 56 LLD (pitch, MFCC, LSP, etc.))
can be computed with an rtf of 0.044. Prosodic low-level
features (pitch contour and loudness) can be extracted with
an rtf of 0.026. This shows the high efficiency of the functionals code; most computation time is spent on tasks such as the FFT or filtering during low-level descriptor extraction.
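For clarity, the real-time factor is defined as the ratio of the required processing time to the duration of the processed audio,

\text{rtf} = \frac{T_{\text{CPU}}}{T_{\text{audio}}},

so an rtf of 0.012 corresponds to roughly 600 s \times 0.012 \approx 7.2 s of CPU time for the 10-minute test signal.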
6. CONCLUSION AND OUTLOOK
We introduced openSMILE, an efficient, on-line (and also
batch scriptable), open-source, cross platform, and exten-
sible feature extractor implemented in C++. A well struc-
tured API and example components make integration of new
feature extraction and I/O components easy. openSMILE is
compatible with research toolkits such as HTK, WEKA, and LibSVM by supporting their data formats. Although openSMILE is very new, it is already successfully used by researchers around the world. The openEAR project [5] builds on openSMILE features for emotion recognition. openSMILE was the official feature extractor for the INTERSPEECH 2009 Emotion Challenge [11] and the ongoing INTERSPEECH 2010 Paralinguistic Challenge. It has also been used for problems as exotic as the estimation of speaker height from voice characteristics [10].
Development of openSMILE is still active, and further features such as Teager energy, ToBI pitch descriptors, and psychoacoustic measures such as sharpness and roughness are considered for integration. Moreover, openSMILE will soon support MPEG-7 LLD XML output. In the near future we aim to link to OpenCV (http://opencv.willowgarage.com/) in order to be able to fuse visual and acoustic features. Due to openSMILE's modular architecture and the public source code, rapid addition of new and diverse features by the community is encouraged. Future work will focus on improved multithreading support and cooperation with related projects to ensure coverage of a broad variety of typically employed features in one piece of fast, lightweight, flexible open-source software.
Acknowledgment
The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement No. 211486 (SEMAINE).
7. REFERENCES
[1] X. Amatriain, P. Arumi, and D. Garcia. A framework
for efficient and rapid development of cross-platform
audio applications. Multimedia Systems, 14(1):15–32,
June 2008.
[2] A. Batliner, S. Steidl, B. Schuller, D. Seppi,
K. Laskowski, T. Vogt, L. Devillers, L. Vidrascu,
N. Amir, L. Kessous, and V. Aharonson. Combining
efforts for improving automatic classification of
emotional user states. In T. Erjavec and J. Gros,
editors, Language Technologies, IS-LTC 2006, pages
240–245. Informacijska Druzba, 2006.
[3] P. Boersma and D. Weenink. Praat: doing phonetics
by computer (v. 4.3.14). http://www.praat.org/, 2005.
[4] F. Eyben, M. Wöllmer, A. Graves, B. Schuller,
E. Douglas-Cowie, and R. Cowie. On-line emotion
recognition in a 3-d activation-valence-time continuum
using acoustic and linguistic cues. Journal on
Multimodal User Interfaces, 3(1-2):7–19, Mar. 2010.
[5] F. Eyben, M. Wöllmer, and B. Schuller. openEAR -
introducing the Munich open-source emotion and
affect recognition toolkit. In Proc. of ACII 2009,
volume I, pages 576–581. IEEE, 2009.
[6] R. Fernandez. A Computational Model for the
Automatic Recognition of Affect in Speech. PhD thesis,
MIT Media Arts and Science, Feb. 2004.
[7] P. N. Garner, J. Dines, T. Hain, A. El Hannani,
M. Karafiat, D. Korchagin, M. Lincoln, V. Wan, and
L. Zhang. Real-time ASR from meetings. In Proc. of
INTERSPEECH 2009, Brighton, UK. ISCA, 2009.
[8] A. Lerch and G. Eisenberg. FEAPI: a low-level feature
extraction plug-in API. In Proc. of the 8th
International Conference on Digital Audio Effects
(DAFx), Madrid, Spain, 2005.
[9] D. McEnnis, C. McKay, I. Fujinaga, and P. Depalle.
jAudio: a feature extraction library. In Proc. of ISMIR
2005, pages 600–603, 2005.
[10] I. Mporas and T. Ganchev. Estimation of unknown
speaker’s height from speech. International Journal of
Speech Technology, 12(4):149–160, Dec. 2009.
[11] B. Schuller, S. Steidl, and A. Batliner. The
INTERSPEECH 2009 emotion challenge. In Proc.
Interspeech (2009), Brighton, UK, 2009. ISCA.
[12] B. Schuller, F. Wallhoff, D. Arsic, and G. Rigoll.
Musical signal type discrimination based on large open
feature sets. In Proc. of the International Conference
on Multimedia and Expo ICME 2006. IEEE, 2006.
[13] B. Schuller, M. Wimmer, L. Mösenlechner, C. Kern,
D. Arsic, and G. Rigoll. Brute-forcing hierarchical
functionals for paralinguistics: A waste of feature
space? In Proc. of ICASSP 2008, April 2008.
[14] I. H. Witten and E. Frank. Data Mining: Practical
machine learning tools and techniques. Morgan
Kaufmann, San Francisco, 2nd edition, 2005.
[15] S. Young, G. Evermann, M. Gales, T. Hain,
D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason,
D. Povey, V. Valtchev, and P. Woodland. The HTK
book (v3.4). Cambridge University Press, Cambridge,
UK, December 2006.