MEMORABLE SPOKEN QUOTE CORPORA OF TED PUBLIC SPEAKING
Fajri Koto, Sakriani Sakti, Graham Neubig, Tomoki Toda, Mirna Adriani, Satoshi Nakamura
Nara Institute of Science and Technology, Japan
Faculty of Computer Science, Universitas Indonesia, Indonesia
fajri91@ui.ac.id, mirna@cs.ui.ac.id, {ssakti,neubig,tomoki,s-nakamura}@is.naist.jp
ABSTRACT
In this paper we present the construction and analysis of memorable spoken quote corpora from TED public speaking. Memorable quotes are interesting and useful words that usually contain generic pearls of wisdom, which can achieve public awareness and be retained in people's consciousness. Our study aims to reveal why some public speeches are retained in people's minds and move audiences to like and share them, while others are not. To achieve this purpose, relevant corpora are required for quantitative system evaluation. In this study, we start with the collection of a corpus from TED public speaking. Specifically, we utilize 899 video files of TED Talks and more than 2000 speech quotes annotated by the TED team. We then complement the data with non-memorable quotes. Based on the number of shares of each quote, which is provided by TED, we also annotate memorable quotes with a popularity factor. Analysis of the memorable spoken quotes is done based on speech duration, F0, and popularity.
Index Terms: memorable quote, public speaking, spontaneous speech, corpora, favorable
1. INTRODUCTION
Research on speech processing has been actively conducted for decades. Specifically, studies related to dialog systems [1], speech recognition [2], speech summarization [3], and speech synthesis [4, 5] across languages have been expanding along with the relevant corpora that have been successfully collected. One of the goals is to build spoken dialog systems that enable machines to interact with humans naturally. Consequently, understanding the human conversational expressiveness and social presence that may gain a partner's acceptance becomes important. The term expressiveness as used in this work does not refer specifically to emotional expressiveness, but describes the skill of communicating genuine involvement in the conversation, including the choice of words and the way they are phrased (i.e., loudness and intonation). Enhanced expressiveness may contribute to dramatic effect, making the message easier to listen to. Here, we focus on studying human expressiveness during public speeches, in particular how important messages are conveyed so that they may be retained in the audience's consciousness.
Memorable quotes are defined as interesting and useful words that usually contain generic pearls of wisdom expressed with unusual combinations of words in ordinary sentences [6]. Throughout history, the best speeches of all time have normally featured memorable quotes that genuinely inspire the audience. For instance, consider the most famous quote of John F. Kennedy: "Ask not what your country can do for you, ask what you can do for your country". History has proven the power of this memorable quote, which has inspired many generations since John F. Kennedy gave the speech in January 1961¹.
Nowadays, one popular site for public speeches is TED². TED started out in 1984 as a conference bringing together people from three worlds: Technology, Entertainment, and Design. TED Talks bring together the world's most fascinating thinkers and doers, who are challenged to give the talk of their lives in about 5-25 minutes. Many famous people have given speeches at TED and inspired people with their memorable words. Recently, TED started "TED Quotes", which collects memorable quotes from TED Talks, annotates them manually, groups them by category, and provides an easy way for people to share their favorite quotes. The most popular quotes can have more than a thousand shares.
We initiate our study by collecting a corpus of memorable quotes from TED public speaking. Specifically, in this study we collect more than 2000 spoken quotes and the 899 corresponding TED Talk videos that have been manually annotated. We build the segmented corresponding audio files and complement the corpus with randomly generated non-quote spoken data. We also manually checked all 899 subtitle/transcription files. Based on the number of shares of each quote, which is provided by TED, we also annotate memorable quotes with a favorableness factor. Analysis of the memorable spoken quotes is done based on speech duration, F0, and popularity.
The rest of this paper is structured as follows. Section 2 summarizes related work. Section 3 describes the procedure of data construction. The analysis of the data with regard to duration, F0, and favorableness is given in Section 4. Finally, conclusions are drawn in Section 5.

¹ http://ushistory.com
² http://www.ted.com/
2. RELATED WORK
Natural expressive speech has been studied by several researchers. Bulut et al. worked on synthesizing four emotional states (anger, happiness, sadness, and neutral) using a concatenative speech synthesizer [4]. Eide et al. added five speaking styles (neutral declarative, conveying good news, conveying bad news, asking a question, and showing contrastive emphasis) to synthesized speech [5]. Theune et al. also generated expressive speech for storytelling applications; they designed and implemented a set of prosodic rules for converting neutral speech into storytelling speech [7]. However, most of these works are related to synthesizing emotional expressiveness. Here, we study memorable spoken quotes: the skill of communicating genuine involvement in the conversation, including the choice of words and the way they are phrased (i.e., loudness and intonation), so that they may be retained in the audience's consciousness.
Research related to memorable quotes is still very limited. Only one published study discusses memorable quotes in text documents: Bendersky et al. extracted several text features in order to analyze what can make a phrase in a book memorable [8]. This research stems from the fact that there are close to 130 million unique book records in the world's libraries today, and many of these are being digitized [9]. Moreover, many annotated text quotes are spread across the Internet today; for instance, BrainyQuote³ and WikiQuote⁴ have been developed to provide several inspirational groups of quotes from many resources.
Another study, by Danescu-Niculescu-Mizil et al. [10], investigated the effect of phrasing on a quote's memorability using movie transcripts. They argue that quotes differ not only in how they are worded, but also in who said them and under what circumstances. Although this study focused on spoken words, the work is limited to the textual data of movie transcripts. While most techniques developed so far for memorable quote detection have focused primarily on the processing of text, we are interested in discovering memorable spoken quotes in real public speeches.
3. DATA CONSTRUCTION
As described in Section 1, the memorable spoken quote corpora were built by utilizing the TED speeches, the manually annotated quotes, and the transcription file of each corresponding video. In total, there were 2152 annotated quotes on the TED website as of July 2013, requiring 914 speeches with their corresponding transcriptions to be processed. Because 15 of the needed transcription files were unavailable, we removed 34 quotes, leaving 2118 memorable spoken quotes and 899 required audio files.

³ http://www.brainyquote.com
⁴ http://www.wikiquote.com
Fig. 1 presents the stages of corpora construction. First, we downloaded all required files: the 2118 memorable spoken quotes, and the 899 TED speeches with their transcriptions. We then manually checked all 899 transcription files and found that some of them had time mismatches of up to 10 seconds. The details of these transcription mismatches are summarized in Table 1.
Table 1. Statistics of time mismatch across all transcription files.

  Time mismatch   Count
  -4 seconds          1
  -3 seconds        202
   0 seconds        660
   2 seconds          1
   3 seconds         31
   6 seconds          2
  10 seconds          2
  Total             899
After manually checking and updating the transcriptions, we located the segment timing of every memorable spoken quote in its transcription file. Non-quote data were then randomly generated to complement the corpora, as follows: 1) the length of each non-quote segment was randomly chosen in the range of 1-3 transcription passages, which matches the lengths of the existing quote data, and 2) for each speech, we generated as many non-memorable quotes as there are existing quotes in that speech (see the sketch below). After complementing the data, segmentation was applied, and the data were ready to be extracted and analyzed.
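To make this sampling procedure concrete, below is a minimal Python sketch of how such non-quote segments could be drawn. The transcript representation (a list of timed passages), the overlap check against annotated quotes, and the function name are our own illustrative assumptions, not the exact tooling used to build the corpus.

```python
import random

def sample_non_quotes(passages, quote_spans, num_quotes, seed=0):
    """Draw as many non-quote segments as there are annotated quotes
    in a talk. Each candidate spans 1-3 consecutive transcript
    passages, mirroring the lengths of the annotated quotes.

    passages    -- list of (start_sec, end_sec) pairs, one per passage
    quote_spans -- list of (start_sec, end_sec) pairs for annotated quotes
    num_quotes  -- number of non-quote segments to generate
    """
    rng = random.Random(seed)
    non_quotes, attempts = [], 0
    while len(non_quotes) < num_quotes and attempts < 1000 * num_quotes:
        attempts += 1
        i = rng.randrange(len(passages))
        j = min(i + rng.randint(1, 3), len(passages))  # 1-3 passages
        start, end = passages[i][0], passages[j - 1][1]
        # Reject candidates that overlap an annotated quote (our
        # assumption: non-quote data should not coincide with quotes).
        if any(start < qe and end > qs for qs, qe in quote_spans):
            continue
        non_quotes.append((start, end))
    return non_quotes
```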
4. ANALYSIS
4.1. Duration Analysis
Table 2 provides statistics of the memorable quote corpora. We divide the speeches by Speech-Duration Interval (SDI), measured in minutes. It shows that 42.37% of the speeches in our corpora (386 speeches) have durations in the 15-20 minute interval, while 32.7% lie in the 20+ minute interval and the rest are in the 0-15 minute range. The number of quote utterances in each interval is also provided. Fig. 2 shows how the average number of quote utterances changes across intervals. The number of quote utterances normally increases from the lower intervals and reaches its maximum in the 15-20 minute interval, but then suddenly decreases for speeches with an SDI greater than 20 minutes. On average, a speech contains two memorable segments that are recognized as spoken quotes by the public consciousness.
Fig. 1. The construction of memorable spoken quote corpora.
Table 2. Statistics of quote utterances in the speech corpora.

  SDI     # TED    # Quotes in   Quotes Avg    Quotes Position
  (min)   Talks    TED Talks     Dur (sec)     q1     q2     q3
  0-5        71       123          11.028       36     40     47
  5-10      147       295          11.224       80     88    127
  10-15     187       443          10.673      141    139    163
  15-20     386       963          11.308      366    288    309
  20+       294       294          11.976      110     92     92
  Total     899      2118          11.236      733    647    738
Fig. 2. The average number of quotes in each TED Talk interval.
The "Quotes Avg Dur (sec)" column in Table 2 gives the average duration, in seconds, of the quote utterances in each interval. Our data reveal that quote utterances have similar durations across all SDIs, about 10-11 seconds. The starting positions of these spoken quote utterances are also of interest in this section. In Table 2 we divide the starting positions of quote utterances into three segments: the first one-third of a speech is denoted q1, while q2 and q3 denote the second and third one-third segments. We then count the spoken quote utterances in each segment; the totals give the ratio q1 : q2 : q3 = 1.13 : 1 : 1.14. This reveals that memorable quotes can be uttered at any point in a speech and cannot be easily identified based only on their starting position; a sketch of the binning is given below.
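As an illustration of how this ratio can be computed, the following is a minimal Python sketch of the q1/q2/q3 binning; the input format (quote start time and total talk duration, both in seconds) is an assumption made for illustration.

```python
from collections import Counter

def position_bin(quote_start, talk_duration):
    """Assign a quote to q1, q2, or q3 according to which one-third
    of the talk its utterance starts in."""
    third = talk_duration / 3.0
    if quote_start < third:
        return "q1"
    if quote_start < 2.0 * third:
        return "q2"
    return "q3"

def position_ratio(quotes):
    """quotes: list of (start_sec, talk_duration_sec) pairs.
    Returns counts normalized by q2, as reported in Table 2."""
    counts = Counter(position_bin(s, d) for s, d in quotes)
    base = counts["q2"] or 1
    return {k: round(counts[k] / base, 2) for k in ("q1", "q2", "q3")}

# With the totals in Table 2 (q1=733, q2=647, q3=738), this
# normalization yields the reported ratio 1.13 : 1 : 1.14.
```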
Fig. 3. Memorable quote data distribution according to the number of shares.
4.2. Popularity analysis
TED provide an easy way for user to share their favorite
quotes. The shares number of every quotes are publicly pro-
vided by TED. From the total of 2118 memorable quotes, the
popular quotes can have more than a thousand shares, while
the non-popular quotes have zero shares. For example, the
most popular quote in our corpus is ”If you hire people just
because they can do a job, they will work for your money.
But if you hire people who believe what you believe, they will
work for you with blood and sweat and tears” given by Simon
Sinek. This quote was shared by 4788 people. However, only
very few quotes are shared by more than thousand people,
while a large number of memorable quotes are shared by
around 1-50 people (See the distribution in Fig. 3).
In this preliminary study, we focus only on the extreme cases and construct a corpus containing memorable quotes with zero shares (labeled as non-popular quotes) and memorable quotes with more than 50 shares (labeled as popular quotes). Newly published quotes still have zero shares, so they are excluded from the data, since it may be inappropriate to annotate them as non-popular. In total, the corpus consists of 262 non-popular quotes and 179 popular quotes; a sketch of this labeling rule is given below.
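The following minimal Python sketch expresses the labeling rule, under the assumption that each quote record carries its share count and publication date; the cutoff date shown is illustrative, matching the July 2013 collection time mentioned in Section 3.

```python
from datetime import date

def popularity_label(shares, published, cutoff=date(2013, 7, 1)):
    """Label a memorable quote by popularity.

    - more than 50 shares                  -> "popular"
    - zero shares and not newly published  -> "non-popular"
    - otherwise                            -> None (excluded)

    Newly published quotes are excluded even with zero shares, since
    they may simply not have had time to accumulate shares yet.
    """
    if shares > 50:
        return "popular"
    if shares == 0 and published < cutoff:
        return "non-popular"
    return None
```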
4.3. F0 Analysis
Danescu-Niculescu-Mizil et al. argue that there may be several factors that make information retained in people's consciousness, one of which may be the way it is expressed. In emotional speech, F0 has also been investigated and identified as an important feature [11]. The study by Liscombe et al. found that higher F0 may correlate with positive-action emotions [12].
Table 3. F0 comparison between the two corpora (values in Hz).

  F0          Quote    Non-Quote
  F0-Max     343.39       323.81
  F0-Min      49.83        52.47
  F0-Range   293.57       271.34
  F0-Mean    169.61       168.42
In this preliminary study, we compare F0 features between memorable and non-memorable spoken quotes. Table 3 presents the F0 analysis for both. Following the INTERSPEECH 2009 paralinguistic challenge configuration (IS09 paralinguistic features) [13], we extract F0-Max, F0-Min, F0-Range, and F0-Mean. This is done using openSMILE⁵, a feature extraction toolkit that unites feature extraction algorithms from the speech processing and Music Information Retrieval communities [14]. The results show that the F0-Mean of memorable quotes is apparently higher than that of non-memorable quotes. This may indicate that people tend to act with positive-action emotion when emphasizing important content during public speeches. Furthermore, as the F0-Range (F0-Min to F0-Max) of memorable quotes is larger than that of non-memorable quotes, it may also reveal a tendency for memorable quotes to be spoken with more varied intonation.
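The paper's features come from openSMILE's IS09 configuration; as a rough substitute for readers without that toolchain, the following Python sketch computes the same four statistics from an F0 contour estimated with librosa's pYIN tracker. This is an illustrative stand-in, not the authors' pipeline, and the pitch bounds are assumptions.

```python
import numpy as np
import librosa

def f0_stats(wav_path, fmin=40.0, fmax=400.0):
    """Compute F0-Max, F0-Min, F0-Range, and F0-Mean (in Hz) over the
    voiced frames of a recording. Uses librosa's pYIN tracker as a
    stand-in for the openSMILE IS09 paralinguistic configuration."""
    y, sr = librosa.load(wav_path, sr=None)
    f0, voiced, _ = librosa.pyin(y, fmin=fmin, fmax=fmax, sr=sr)
    f0 = f0[voiced]  # unvoiced frames are NaN; keep voiced ones only
    return {
        "F0-Max": float(np.max(f0)),
        "F0-Min": float(np.min(f0)),
        "F0-Range": float(np.max(f0) - np.min(f0)),
        "F0-Mean": float(np.mean(f0)),
    }
```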
5. CONCLUSION AND FUTURE DIRECTION
In this paper we present our first step in collecting and analyzing memorable spoken quotes. We collected a corpus of memorable quotes from TED public speaking and performed several preprocessing steps, including: 1) matching the speeches with their transcription files, 2) randomly generating the non-memorable corpora, and 3) annotating the quotes with a popularity factor. The completed corpora consist of memorable and non-memorable quotes in both speech and textual form. The analysis of memorable spoken quotes was done based on speech duration, F0, and popularity. The results reveal that the number of memorable quotes reaches its maximum in the 15-20 minute speech duration interval. The F0 analysis also shows that the F0 scores of the memorable quote corpus are apparently higher than those of the non-memorable quotes. This indicates that acoustics may be one of the factors that differentiate memorable from non-memorable quotes. As a future direction, we will build automatic detection of memorable and popular quotes, which may be learned to enhance spoken dialog systems.

⁵ Available: http://opensmile.sourceforge.net/
6. ACKNOWLEDGEMENT
Part of this work was supported by JSPS KAKENHI Grant
Number 26870371.
7. REFERENCES
[1] R. W. Smith, "Performance measures for the next generation of spoken natural language dialog systems," in ISDS, 1997, pp. 37-40.
[2] M. Cavazza, "An empirical study of speech recognition errors in a task-oriented dialogue system," in SIGDIAL, 2001, vol. 16, pp. 1-8.
[3] S. Furui, "Recent advances in automatic speech summarization," in RIAO, 2007, pp. 90-101.
[4] M. Bulut, S. S. Narayanan, and A. K. Syrdal, "Expressive speech synthesis using a concatenative synthesizer," in INTERSPEECH, 2002.
[5] E. Eide, A. Aaron, R. Bakis, W. Hamza, M. Picheny, and J. Pitrelli, "A corpus-based approach to expressive speech synthesis," in ISCA Workshop on Speech Synthesis, 2004.
[6] E.T.F. arXiv, "The secret science of memorable quotes," MIT Technology Review, 2012.
[7] M. Theune, K. Meijs, D. Heylen, and R. Ordelman, "Generating expressive speech for storytelling applications," IEEE Transactions on Audio, Speech, and Language Processing, 2006.
[8] M. Bendersky and D. A. Smith, "A dictionary of wisdom and wit: Learning to extract quotable phrases," in NAACL-HLT, 2012, pp. 69-77.
[9] L. Taycher, "Books of the world, stand up and be counted! All 129,864,880 of you," Inside Google Books blog, 2010.
[10] C. Danescu-Niculescu-Mizil, J. Cheng, J. Kleinberg, and L. Lee, "You had me at hello: How phrasing affects memorability," in ACL, 2012, vol. 1, pp. 892-901.
[11] M. Drolet, R. I. Schubotz, and J. Fischer, "Recognizing the authenticity of emotional expressions: F0 contour matters when you need to know," Frontiers in Human Neuroscience, vol. 8, 2014.
[12] J. Liscombe, J. Venditti, and J. B. Hirschberg, "Classifying subject ratings of emotional speech using acoustic features," in EUROSPEECH, 2003.
[13] B. Schuller, S. Steidl, and A. Batliner, "The INTERSPEECH 2009 emotion challenge," in INTERSPEECH, 2009.
[14] F. Eyben, M. Woellmer, and B. Schuller, "openSMILE: the Munich open Speech and Music Interpretation by Large space Extraction toolkit," Institute for Human-Machine Communication, version 1.0.1, 2010.
This paper describes an experiment in synthesizing four emotional states - anger, happiness, sadness and neutral - using a concatenative speech synthesizer. To achieve this, five emotionally (i.e., semantically) unbiased target sentences were prepared. Then, separate speech inventories, comprising the target diphones for each of the above emotions, were recorded. Using the 16 different combinations of prosody and inventory during synthesis resulted in 80 synthetic test sentences. The results were evaluated by conducting listening tests with 33 naïve listeners. Synthesized anger was recognized with 86.1% accuracy, sadness with 89.1%, happiness with 44.2%, and neutral emotion with 81.8% accuracy. According to our results, anger was classified as inventory dominant and sadness and neutral as prosody dominant. Results were not sufficient to make similar conclusions regarding happiness. The highest recognition accuracies were achieved for sentences synthesized by using prosody and diphone inventory belonging to the same emotion.