The Use of Semantic and Acoustic Features for
Open-Domain TED Talk Summarization
Fajri Koto∗†, Sakriani Sakti, Graham Neubig, Tomoki Toda, Mirna Adriani, and Satoshi Nakamura
∗ Graduate School of Information Science, Nara Institute of Science and Technology, Japan
E-mail: {ssakti,neubig,tomoki,s-nakamura}@is.naist.jp Tel: +81-743-725264
† Faculty of Computer Science, University of Indonesia, Indonesia
E-mail: fajri91@ui.ac.id, mirna@cs.ui.ac.id Tel: +62-21-7863419
Abstract—In this paper, we address the problem of automatic speech summarization on open-domain TED talks. The large vocabulary and the diversity of topics from speaker to speaker present significant difficulties. The challenges include not only how to handle disfluencies and fillers, but also how to extract topic-related, meaningful messages from these free-form talks. Here, we propose to incorporate semantic and acoustic features within the speech summarization technique. In addition, we also propose a new evaluation method for speech summarization that checks the semantic similarity between system and human summaries. Experimental results reveal that the proposed methods are effective for spontaneous speech summarization.
I. INTRODUCTION
Recently, information on the Internet has become available in various forms, such as text, images, sound, and video. Consequently, many researchers have begun to study how to retrieve information from these diverse data. Automatic speech summarization has also been actively investigated. Using audio and video of speech data, many researchers have investigated summarization based on the output of automatic speech recognition (ASR) [2][3][5]. Here, the summarization process is performed over the text output of the ASR system without involving audio feature information. For example, Hori et al. extract and score word significance and linguistic likelihood from the ASR output [2]. Furthermore, other techniques such as random walks, word and sentence extraction, weighted finite-state transducers, and hidden Markov models have also been studied to improve speech summarization [3][14][15][16].
However, despite much progress in speech summarization, most works have focused primarily on news content, broadcast news, and other non-spontaneous speech data. On the other hand, a great deal of spontaneous speech exists for which people would like summaries that are difficult to obtain. One such case is open-domain talks like TED talks¹, which are still rarely used. TED is a nonprofit devoted to "Ideas Worth Spreading". It started out in 1984 as a conference bringing together people from three worlds: Technology, Entertainment, and Design. TED talks bring together the world's most fascinating thinkers and doers, who are challenged to give the talk of their lives in 18 minutes or less. Here, we address the problem of automatic speech summarization on open-domain TED talks.
¹ http://www.ted.com/
It is obvious that spontaneous speech in TED talks is very different from speech in broadcast news, in that speakers do not have any text guidance in their hands. As a result, the ASR output has a higher error rate than broadcast news speech recognition. Furthermore, the large vocabulary and the diversity of topics from speaker to speaker present significant difficulties. The challenges include not only how to handle disfluencies and fillers, but also how to extract topic-related, meaningful messages from the free-form talks. In this study, we propose to incorporate semantic features into automatic speech summarization, so that topic-related sentences are scored higher than unrelated sentences. As a preliminary study, we start by incorporating the proposed methods within the widely used MMR summarization technique [1].
In addition, we also include acoustic features in the summarization framework, since acoustic features are a significant factor in speech summarization [17]. The MMR technique processes the ASR output using term frequency (TF) and term frequency-inverse document frequency (TF-IDF) models. We then investigate various combinations with acoustic features, with semantic features, and with both together.
We also propose a new evaluation method for speech summarization that checks the semantic similarity between system and human summaries. We argue that common evaluation metrics like Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and Longest Common Subsequence (LCS) have limitations for spontaneous speech, because they are based only on the number of overlapping units such as n-grams and word sequences [13]. In contrast, public speeches like TED talks are more unstructured and lexically rich. Therefore, evaluation based on semantic similarity is more promising.
II. OVERVIEW OF MMR-BASED SUMMARIZATION TECHNIQUES
MMR has been widely used for text summarization. MMR is a measure where the retrieval status value (RSV) of a document is influenced by other already retrieved documents: documents similar to retrieved documents have their RSV lowered, thus boosting dissimilar documents [1]. Carbonell and Goldstein proposed the following formula:

$$\mathrm{MMR}(S_i) = \alpha\,\mathrm{Sim}_1(S_i, D) - (1 - \alpha)\,\mathrm{Sim}_2(S_i, \mathrm{Summ}) \quad (1)$$

where S_i is the i-th sentence in document D, Summ is the summary being built by taking the highest MMR score at every iteration, and Sim_1 and Sim_2 are similarity measures, which can be the same or can be set to different similarity metrics. In this study, we use cosine similarity for both Sim_1 and Sim_2:

$$\mathrm{sim}(D_1, D_2) = \frac{\sum_i t_{1i}\,t_{2i}}{\sqrt{\sum_i t_{1i}^2}\,\sqrt{\sum_i t_{2i}^2}} \quad (2)$$

The value of α allows readjusting the behavior of MMR to control the diversity ranking between unselected sentences and already selected summary sentences. Here, MMR is performed with both the TF and TF-IDF models.
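To make the selection procedure concrete, the following is a minimal sketch of MMR sentence selection over TF-IDF vectors using scikit-learn. The 30% summary-length cutoff follows the experimental setting in Section VI; representing the document as the mean of its sentence vectors, and all function and variable names, are our illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mmr_summarize(sentences, alpha=0.7, ratio=0.3):
    vec = TfidfVectorizer(lowercase=True)
    S = vec.fit_transform(sentences)                 # one row per sentence
    doc = np.asarray(S.mean(axis=0))                 # whole-document vector (assumption)
    sim_to_doc = cosine_similarity(S, doc).ravel()   # Sim1(Si, D)

    selected = []
    candidates = list(range(len(sentences)))
    target = max(1, int(ratio * len(sentences)))     # top 30% cutoff (Sec. VI)
    while len(selected) < target and candidates:
        def mmr_score(i):
            # Redundancy term: max similarity to already selected sentences
            red = max((cosine_similarity(S[i], S[j])[0, 0] for j in selected),
                      default=0.0)
            return alpha * sim_to_doc[i] - (1 - alpha) * red
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in sorted(selected)]
```

For the TF variant of the experiments, CountVectorizer can be substituted for TfidfVectorizer without changing the selection loop.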
III. THE PROPOSED SUMMARIZATION TECHNIQUES
A. Acoustic and Semantic Features
1) Acoustic Feature: The acoustic features used in this study are based on the INTERSPEECH 2010 paralinguistic challenge configuration (IS10 Paraling features) [18]. The set consists of 1582 features, obtained in three steps: (1) 38 low-level descriptors are extracted and smoothed by simple moving-average low-pass filtering; (2) their first-order regression coefficients are added in full HTK compliance; (3) 21 functionals are applied. However, 16 zero-information features (e.g., minimum F0, which is always zero) are discarded. Finally, two single features, F0 number of onsets and turn duration, are added. More details on each feature can be found in [18].
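For illustration, this feature set is typically extracted with the openSMILE toolkit that we use later in Section V-B. Below is a hedged sketch of such a call; the SMILExtract binary location and the IS10 configuration file path are assumptions that depend on the local openSMILE installation.

```python
import subprocess

def extract_is10(wav_path, out_path,
                 smilextract="SMILExtract",
                 config="config/IS10_paraling.conf"):
    # Run openSMILE on one segmented audio file; the output file holds one
    # 1582-dimensional feature vector (format depends on the config's sink).
    subprocess.run([smilextract, "-C", config, "-I", wav_path, "-O", out_path],
                   check=True)
```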
2) Semantic Feature: The semantic similarity feature is a score that describes the similarity between a sentence and a document. We propose the following formula to re-rank the sentences according to the similarity score between each sentence and the whole document:

$$\mathrm{Sim}_{sem}(s_i, D) = \frac{\sum_{j=1,\, j \neq i}^{|S|} \mathrm{Sim}_{sem}(s_i, s_j)}{|S| - 1} \quad (3)$$

where s_i is the i-th sentence in document D, and |S| represents the number of sentences in document D. This formula calculates the semantic similarity scores between one sentence and all other sentences, and we take the mean score as the final score for ranking the sentences. Sim_sem is calculated according to [6], in which each sentence is divided into a noun set and a verb set, and the similarity score between two sentences is then calculated based on the similarity of those noun and verb sets, as described in Eq. 6.
$$S_1 = \{V_1, N_1\} \quad \text{and} \quad S_2 = \{V_2, N_2\} \quad (4)$$

$$N_b = N_1 \cup N_2 \quad \text{and} \quad V_b = V_1 \cup V_2 \quad (5)$$

$$\mathrm{Sim}_{Sem}(S_1, S_2) = \beta_1\,\mathrm{Sim}_v(v_1, v_2) + \beta_2\,\mathrm{Sim}_n(n_1, n_2) \quad (6)$$

Eq. 6 above uses two kinds of vectors, noun and verb, described as follows:

$$Vv_{1k} = \max_{i=1}^{|V_1|} \big(\mathrm{Sim}(Vv_{1i}, Vb_k)\big) \quad (7)$$

$$Vv_{2k} = \max_{i=1}^{|V_2|} \big(\mathrm{Sim}(Vv_{2i}, Vb_k)\big) \quad (8)$$

$$Nn_{1k} = \max_{i=1}^{|N_1|} \big(\mathrm{Sim}(Nn_{1i}, Nb_k)\big) \quad (9)$$

$$Nn_{2k} = \max_{i=1}^{|N_2|} \big(\mathrm{Sim}(Nn_{2i}, Nb_k)\big) \quad (10)$$
To calculate the semantic similarity score between two words, we use a word-relationship tree based on Wu and Palmer's algorithm [7], which utilizes the online lexical database WordNet [8][9]. The similarity scores (Sim_v and Sim_n) between the two vectors can then be calculated easily using the cosine similarity formula.
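A minimal sketch of this sentence-level similarity (Eqs. 4-10) using NLTK and WordNet is shown below. Taking the first synset per word, the equal weights β1 = β2 = 0.5, and all helper names are our assumptions; the paper does not specify a word-sense selection strategy.

```python
# Requires: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger"),
# nltk.download("wordnet")
from math import sqrt

import nltk
from nltk.corpus import wordnet as wn

def word_sim(w1, w2, pos):
    # Wu-Palmer similarity between the first WordNet synsets of two words
    s1, s2 = wn.synsets(w1, pos=pos), wn.synsets(w2, pos=pos)
    if not s1 or not s2:
        return 0.0
    return s1[0].wup_similarity(s2[0]) or 0.0

def pos_sets(sentence):
    # Split a sentence into its noun set and verb set (Eq. 4)
    tags = nltk.pos_tag(nltk.word_tokenize(sentence))
    nouns = {w.lower() for w, t in tags if t.startswith("NN")}
    verbs = {w.lower() for w, t in tags if t.startswith("VB")}
    return nouns, verbs

def set_vector(words, joint, pos):
    # Eqs. 7-10: each entry is the best match of a joint-set word in this set
    return [max((word_sim(w, jw, pos) for w in words), default=0.0)
            for jw in joint]

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def sim_sem(sent1, sent2, beta1=0.5, beta2=0.5):
    # Eq. 6: weighted combination of verb-vector and noun-vector similarity
    n1, v1 = pos_sets(sent1)
    n2, v2 = pos_sets(sent2)
    nb, vb = sorted(n1 | n2), sorted(v1 | v2)        # joint sets (Eq. 5)
    sim_n = cosine(set_vector(n1, nb, wn.NOUN), set_vector(n2, nb, wn.NOUN))
    sim_v = cosine(set_vector(v1, vb, wn.VERB), set_vector(v2, vb, wn.VERB))
    return beta1 * sim_v + beta2 * sim_n
```

As stated above, comparisons are made only between words of the same part of speech, which is why the noun and verb vectors are built and compared separately.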
B. Summarization Method
1) Incorporating Semantic Features: In the summarization process, MMR re-ranks every sentence by calculating the cosine similarity between two term vectors: the sentence and the document. To boost accuracy, we propose a modified MMR that replaces Sim_1(S_i, D) in Eq. 1 with Eq. 11. Our modified MMR incorporates semantic similarity into the similarity calculation while still considering the cosine similarity. In this study, we set β = 0.5.

$$\beta\,\mathrm{Sim}_{Sem}(S_i, D) + (1 - \beta)\,\mathrm{Sim}_1(S_i, D) \quad (11)$$
2) Incorporating Acoustic Features: The motivation for incorporating acoustic features into the summarization framework is to give a higher score to sentences that are considered summary candidates based on their acoustic characteristics. This is done with a naive Bayes (NB) classifier that outputs 1 if sentence S_i is considered a candidate summary sentence for the document, and 0 otherwise:

$$\mathrm{MMR}_{speech}(S_i) = 0.5\,\mathrm{MMR}(S_i) + 0.5\,\mathrm{NBModel}(S_i) \quad (12)$$
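As a sketch of Eq. 12, an NB model can be trained on the IS10 vectors and fused with the MMR score as below. scikit-learn's GaussianNB stands in for the paper's unspecified NB variant, and the names are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def train_acoustic_nb(features, labels):
    # features: (n_sentences, 1582) IS10 vectors; labels: 1 = summary, 0 = not
    return GaussianNB().fit(features, labels)

def mmr_speech(mmr_score, acoustic_vec, nb_model):
    # Eq. 12: equal-weight fusion of the MMR score and the NB decision
    nb_out = int(nb_model.predict(np.asarray(acoustic_vec).reshape(1, -1))[0])
    return 0.5 * mmr_score + 0.5 * nb_out
```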
We train the classifier on the acoustic features of each sentence, as described in Fig. 1. In total, we used 712 TED talks as training data. Labels were created by calculating semantic similarity scores between the sentences of every document and the existing summary provided by the TED website. For each document, we picked the ten sentences with the highest semantic similarity scores as summary sentences, and then complemented the dataset by randomly picking ten other sentences as non-summary sentences.
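This label-construction step might look like the following sketch, reusing the sim_sem function from the earlier semantic-similarity sketch; treating the TED-provided summary as a single reference text is our simplification.

```python
import random

def make_labels(sentences, ted_summary, k=10):
    # Rank sentences by semantic similarity to the TED-provided summary text
    ranked = sorted(range(len(sentences)),
                    key=lambda i: sim_sem(sentences[i], ted_summary),
                    reverse=True)
    positives = set(ranked[:k])                               # top-10: summary
    rest = ranked[k:]
    negatives = set(random.sample(rest, min(k, len(rest))))   # random non-summary
    return positives, negatives
```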
3) Incorporating Acoustic and Semantic Features: To elaborate further, we also apply the naive-Bayes-based acoustic classifier to our semantic summarization method. We formulate the new similarity score by adding the NBModel score, analogously to Eq. 12. In this study, we set γ = 0.1.

$$\mathrm{Sim}'_{sem}(s_i, D) = (1 - \gamma)\,\mathrm{Sim}_{sem}(s_i, D) + \gamma\,\mathrm{NBModel}(s_i) \quad (13)$$
Fig. 1. Building summary labels by semantic similarity scoring.
Fig. 2. Stages of automatic speech summarization.
C. Evaluation Metric: Semantic Similarity Checking (SSC)
The well-known automatic evaluation methods for summarizers, ROUGE and LCS, were introduced by Lin [13]. Formally, ROUGE-N is an n-gram recall between a candidate summary and a set of reference summaries, whereas LCS does not require consecutive matches but in-sequence matches that reflect sentence-level word order [13]. Since we focus on unstructured documents like spontaneous speech, in this study we propose semantic similarity checking (SSC), Eq. 14, as a new automatic evaluation method for summarizers that utilizes semantic similarity calculation. Intuitively, this method should be more powerful than ROUGE and LCS because the resulting score is not built merely by counting matched words or sequences.

$$\mathrm{SSC}(D_1, D_2) = \frac{\sum_{i=1}^{|D_1|} \mathrm{Sim}_{sem}(s_i, D_2)}{|D_1|} \quad (14)$$
In Eq. 14, D_1 and D_2 represent the resulting summary and the reference summary, respectively. The equation simply averages the semantic similarity scores between every sentence s_i in D_1 and the document D_2. The similarity score Sim_sem(s_i, D_2) is itself calculated by averaging the semantic similarity scores between s_i and all sentences in D_2.
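A minimal sketch of SSC, again reusing the pairwise sim_sem function sketched in Section III-A; here D_1 and D_2 are lists of sentence strings.

```python
def ssc(system_summary, reference_summary):
    # Eq. 14: average, over system sentences, of their mean semantic
    # similarity to all reference sentences
    def sim_to_doc(sentence, doc):
        return sum(sim_sem(sentence, ref) for ref in doc) / len(doc)
    return sum(sim_to_doc(s, reference_summary)
               for s in system_summary) / len(system_summary)
```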
IV. OVERALL ARCHITECTURE OF SUMMARIZER
Fig. 2 shows the stages of our summarization experiments. We use the output of an ASR system to build sentence-based summaries using several techniques: MMR, semantic similarity, and their combinations with the audio model, which is built by training on acoustic features. We then evaluate by calculating the similarity score between the resulting summary and the human summary. The ASR system used here was trained on 157 hours of TED talks released before the cut-off date of 31 December 2010, downloaded from the TED website with the corresponding subtitles.
V. EXPERIMENTAL SET-UP
A. The TED Data
The TED talks used for summarization are the same data used to evaluate the speech recognition system, 20 TED talks in total. The reference summaries were obtained from human summarization: five native speakers were asked to pick, for each speech document, the ten sentences they considered most representative of the speaker's topic.
B. Preprocessing
To build the vector space models (TF and TF-IDF) we did
preprocessing to all TED speech data. We replace all capital
letters of transcription file with lowercase and eliminate all
punctuations that exist in the transcription file. We also remove
some of the unimportant word or segment like laugh and
applause. We use all TED documents to build idf(t,D) score
in calculating TF-IDF.
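A hedged sketch of this text cleanup is below; the exact spelling of the non-speech markers in the TED transcripts is an assumption.

```python
import re
import string

def preprocess(transcript):
    # Drop non-speech markers such as "(Laughter)" / "(Applause)" first,
    # then lowercase and strip all remaining punctuation
    text = re.sub(r"\((?:laughter|applause)\)", " ", transcript, flags=re.I)
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()
```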
For the acoustic features, we perform segmentation based on the time sequences obtained from the srt files and our ASR system, in order to obtain valid timing for every sentence in a document. The segmented audio files are processed by the openSMILE toolkit [11], a feature extraction toolkit that unites feature extraction algorithms from the speech processing and music information retrieval communities [10].
The noun and verb vectors used to calculate semantic similarity are produced by Python code using the NLTK library. Similarity checking is performed only between words with the same tag (two nouns or two verbs).
VI. EXPERIMENT RESULTS
As our baseline system, we perform MMR with the TF and TF-IDF models, investigating various values of the alpha parameter in the MMR formula. The results are then compared with the two proposed methods, which incorporate semantic similarity and acoustic features. We evaluate with SSC, taking the 30% of sentences with the highest MMR scores in each document as its summary.
Our first experiment compares SSC and ROUGE to examine their tendencies when evaluating system summaries. In Fig. 3, we use ROUGE-3 and SSC to evaluate MMR-TF for various alpha values. The results reveal that our proposed evaluation metric behaves in line with ROUGE. However, SSC is still preferable because it is calculated based on semantics.

In the MMR experiments we use alpha values of 0.1, 0.3, 0.5, 0.7, and 0.9. Fig. 4 and Fig. 5 show that semantic similarity boosts the accuracy of MMR for all alpha values. The combination of MMR with semantic and acoustic features is shown by the top line in both graphs, affirming that semantic and acoustic features play an important role in optimizing automatic speech summarization. According to both graphs (Fig. 4 and Fig. 5), the highest accuracy is achieved by MMR with semantic features at alpha = 0.7: 55.29% and 55.30% for the TF and TF-IDF models, respectively.

Fig. 3. SSC and ROUGE performance for MMR-TF.
Fig. 4. MMR-TF experiment results.
Fig. 5. MMR-TF-IDF experiment results.
In Table I, we present the best performance of each method over the varied parameters, comparing MMR, MMR+Audio, and MMR+Audio+Semantic. The results reveal that the combination of MMR with acoustic and semantic features always gives better performance than the standard technique for both vector space models.

TABLE I
HIGHEST ACCURACY OF MMR AND ITS COMBINATIONS

Incorporation        | TF     | TF-IDF
---------------------|--------|-------
MMR                  | 51.38% | 50.71%
MMR+Audio            | 51.77% | 53.18%
MMR+Audio+Semantic   | 54.64% | 55.10%
VII. CONCLUSION
In this study, we incorporated both semantic and acoustic features into automatic speech summarization for open-domain TED talks. The experimental results reveal that they can improve textual speech summarization. In short, our study shows that semantic similarity can be used in speech summarization both (1) as a summarization feature and (2) as an evaluation method. Our experiments also show that combining MMR with semantic and acoustic features achieves the best performance, affirming that both features play an important role in speech summarization. In future work, we will further investigate various approaches for incorporating semantic and acoustic features into MMR, as well as combinations with other summarization techniques.
VIII. ACKNOWLEDGMENT
Part of this work was supported by JSPS KAKENHI Grant
Number 26870371.
REFERENCES
[1] Carbonell, J., and Goldstein, J. "The use of MMR, diversity-based reranking for reordering documents and producing summaries". In SIGIR 1998, pp. 335-336. ACM, New York, 1998.
[2] Hori, C., and Furui, S. “Automatic speech summarization based on word
significance and linguistic likelihood”. Proc. ICASSP2000, Istanbul,
Vol.3, pp.1579-1582, 2000.
[3] Hori, C., and Furui, S. “Improvements in automatic speech summariza-
tion and evaluation methods”. ICSLP2000 4: 326-329, 2000
[4] Maskey, S., and Hirschberg, J. “Comparing lexical, acoustic/prosodic,
structural and discourse features for speech summarization”. In INTER-
SPEECH, pp. 621-624, 2005
[5] Xie, S., and Liu, Y. "Using confusion networks for speech summarization". In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 46-54. Association for Computational Linguistics, 2010.
[6] Liu, D., Liu, Z., and Dong, Q. “A dependency grammar and WordNet
based sentence similarity measure”. Journal of Computational Informa-
tion Systems 8:3 1027-1035, 2012.
[7] Wu, Z., and Palmer, M. "Verb semantics and lexical selection". In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pp. 133-138. Association for Computational Linguistics, 1994.
[8] Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., and Miller, K.
J. “Introduction to wordnet: An on-line lexical database”. International
journal of lexicography 3, no. 4: 235-244, 1990
[9] Available: http://wordnet.princeton.edu/
[10] Eyben, F., Wöllmer, M., and Schuller, B. "openSMILE: the Munich open Speech and Music Interpretation by Large-space Extraction toolkit". Institute for Human-Machine Communication, version 1.0.1, 2010.
[11] Available: http://opensmile.sourceforge.net/
[12] Eyben, F., Wöllmer, M., and Schuller, B. "openSMILE: the Munich versatile and fast open-source audio feature extractor". In Proceedings of the International Conference on Multimedia, pp. 1459-1462. ACM, 2010.
[13] Lin, C.-Y. “ROUGE: A Package for Automatic Evaluation of Sum-
maries”. In Text Summarization Branches Out: Proceedings of the ACL-
04 Workshop, pp. 74-81, 2004.
[14] Chen, Y.-N., Huang, Y., Yeh, C.-F., and Lee, L.-S. "Spoken Lecture Summarization by Random Walk over a Graph Constructed with Automatically Extracted Key Terms". In INTERSPEECH, pp. 933-936, 2011.
[15] Furui, S., Hirohata, M., Shinnaka, Y., and Iwano, K. “Sentence
extraction-based automatic speech summarization and evaluation tech-
niques.” In Proceedings of the Symposium on Large-scale Knowledge
Resources, pp. 33-38, 2005.
[16] Maskey, S. and Hirschberg, J. “Summarizing speech without text using
hidden markov models.” In Proceedings of the Human Language Tech-
nology Conference of the NAACL, Companion Volume: Short Papers,
pp. 89-92. Association for Computational Linguistics, 2006.
[17] Inoue, A., Mikami, T. and Yamashita Y. “Improvement of Speech
Summarization Using Prosodic Information.” In Speech Prosody 2004,
International Conference, 2004.
[18] Schuller, B., Steidl, S., Batliner, A., Burkhardt, F., Devillers, L., Müller, C., and Narayanan, S. "The INTERSPEECH 2010 Paralinguistic Challenge". In INTERSPEECH, pp. 2794-2797, 2010.