
Lexical Choice via Topic Adaptation for Paraphrasing Written Language to Spoken Language

Nobuhiro Kaji¹ and Sadao Kurohashi²
¹ Institute of Industrial Science, The University of Tokyo,
4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan
kaji@tkl.iis.u-tokyo.ac.jp
² Graduate School of Information Science and Technology,
The University of Tokyo, 7-3-1 Hongo,
Bunkyo-ku, Tokyo 113-8656, Japan
kuro@kc.t.u-tokyo.ac.jp
Abstract. Our research aims at developing a system that paraphrases
written language text into spoken language style. In such a system, it is
important to distinguish between words in the input text that are appropriate
for spoken language and those that are not. We call this task lexical choice
for paraphrasing. In this paper, we describe a method of lexical choice
that considers the topic. Our method is based on the word probabilities in
written and spoken language corpora. The novelty of our method is topic
adaptation: the corpora are classified into topic categories, and the
probabilities are estimated using the corpora that have the same topic as the
input text. The evaluation results showed the effectiveness of topic
adaptation.
1 Introduction
Written language is different from spoken language, and the difference has
various aspects. For example, spoken language is often ungrammatical, and it
uses simple words rather than difficult ones. Among these aspects, this paper
examines difficulty. Difficult words are characteristic of written language
and are not appropriate for spoken language.
Our research aims at developing a system that paraphrases written language
text into spoken language style. Such a system helps a text-to-speech system
generate natural speech when the input is in written language. To create such
a system, the following procedure is required: (1) the system has to detect
words in the input text that are inappropriate for spoken language, (2)
generate paraphrases of the inappropriate words, and (3) confirm that the
generated paraphrases are appropriate. This paper examines steps (1) and (3),
which we call lexical choice for paraphrasing written language to spoken
language.
Broadly speaking, lexical choice can be defined as a binary classification
task: the input is a word, and the system outputs whether it is appropriate
for spoken language or not. This definition is valid if we can assume that
word difficulty is independent of such factors as context or listeners.
However, we think this assumption is not always true. One example is business
jargon (or technical terms). Generally speaking, business jargon is difficult
and inappropriate for spoken language. Nevertheless, it is often used in
business talk. This example implies that word difficulty depends on the topic
of the text or talk.
R. Dale et al. (Eds.): IJCNLP 2005, LNAI 3651, pp. 981–992, 2005.
© Springer-Verlag Berlin Heidelberg 2005
In this paper, we define the input of lexical choice as a word and the text
in which it occurs (= the topic). This definition makes it possible for a
system to consider the topic. We think the topic plays an important role in
lexical choice when dealing with words that are specific to a certain topic,
e.g., business jargon. Hereafter, such words are called topical words, and
the others are called non-topical words. Of course, in addition to the topic,
we have to consider other factors such as listeners, but the study of such
factors lies outside the scope of this paper.
Based on the above discussion, we describe a method of lexical choice that
considers the topic. Our method is based on the word probabilities in
written and spoken language corpora. It is reasonable to assume that these two
probabilities reflect whether the word is appropriate or not. The novelty of
the method is topic adaptation. In order to adapt to the topic of the input
text, the corpora are classified into topic categories, and the probabilities
are estimated using the corpora that have the same topic category as the input
text. This process enables us to estimate topic-adapted probabilities. Our
method was evaluated by human judges. Experimental results demonstrated that
our method can accurately deal with topical words.
This paper is organized as follows. Section 2 gives an overview of the
method. Sections 3 and 4 describe the corpora construction. Section 5
describes learning lexical choice. Section 6 reports experimental results.
Section 7 describes related work. We conclude this paper in Section 8.
2 Method Overview
Our method uses written and spoken language corpora classified into topic
categories. They are automatically constructed from the WWW. The construction
procedure consists of the following two processes (Figure 1).
1. Style Classification
Web pages are downloaded from the WWW and are classified into written
and spoken language style. The pages classified as written/spoken language
are referred to as the written/spoken language corpus. In this process, we
discard ambiguous pages that are difficult to classify.
2. Topic Classification
The written and spoken language corpora are classified into 14 topic
categories, such as arts, computers and so on.
The two classification methods are described in Section 3 and Section 4,
respectively.
Given an input word and the text in which it occurs, whether the input word
is appropriate or inappropriate for spoken language is decided as follows.
1. The topic category of the input text is decided by the same method as the
one used to classify Web pages into topic categories.
2. We estimate the probabilities of the input word in the written and spoken
language corpora, using the corpora that have the same topic as the input
text.
3. Using the two probabilities, we decide whether the input word is appropriate
or not. Section 5 describes this method.
[Figure: Web pages are split by style classification (Section 3) into written, spoken, and ambiguous pages; the written and spoken pages are then split by topic classification (Section 4) into topic categories (Arts, Recreation, ..., Science), each holding a written and a spoken corpus.]
Fig. 1. Written and spoken language corpora construction
3 Style Classification
In order to construct written and spoken language corpora classified into
topic categories, Web pages are first classified into written and spoken
language pages (Figure 1). Note that what is called spoken language here is
not actual utterances but chat-like text. Although it is not real spoken
language, it works as a good substitute, as some researchers have pointed out
[2,11].
We follow the method proposed by Kaji et al. (2004). Their method classifies
Web pages into three types: (1) written language pages, (2) spoken language
pages, and (3) ambiguous pages. Web pages classified into type (1) or (2) are
then used. Ambiguous pages are discarded because classification precision
decreases if such pages are used. This section summarizes their method; see
[11] for details. Note that the target language of this method is Japanese,
and its procedure depends on characteristics of Japanese.
3.1 Basic Idea
Web pages are classified based on interpersonal expressions, which imply an
attitude of a speaker toward listeners, such as familiarity, politeness,
honor, or contempt. Interpersonal expressions are often used in spoken
language but not frequently in written language. For example, one of the most
basic situations in which spoken language is used is face-to-face
communication. On the other hand, such a situation hardly ever arises when
written language is used.
Therefore, Web pages containing many interpersonal expressions are
classified as spoken language, and vice versa. Among interpersonal
expressions, those that represent familiarity or politeness are used, because:
- Those two kinds of interpersonal expressions frequently appear in spoken
  language,
- They are represented by postpositional particles in Japanese and,
  therefore, are easily recognized as such.
Hereafter, an interpersonal expression that represents familiarity/politeness
is called a familiarity/politeness expression.
3.2 Style Classification Procedure
Web pages are classified into the three types based on the following two ratios:
- Familiarity ratio (F-ratio): ‘# of sentences including familiarity
  expressions’ divided by ‘# of all the sentences in the page’.
- Politeness ratio (P-ratio): ‘# of sentences including politeness
  expressions’ divided by ‘# of all the sentences in the page’.
The procedure is as follows. First, Web pages are processed by the Japanese
morphological analyzer JUMAN³. Then, in order to calculate F-ratio and
P-ratio, sentences that include familiarity or politeness expressions are
recognized in the following manner. A sentence is considered to include a
familiarity expression if it has one of the following six postpositional
particles: ne, yo, wa, sa, ze, na. A sentence is considered to include a
politeness expression if it has one of the following four postpositional
particles: desu, masu, kudasai, gozaimasu.
After calculating the two ratios, the page is classified according to the
rules illustrated in Figure 2. If both F-ratio and P-ratio are equal to 0, the
page is classified as a written language page. If F-ratio is more than 0.2, or
if F-ratio is more than 0.1 and P-ratio is more than 0.2, the page is
classified as a spoken language page. The other pages are regarded as
ambiguous and are discarded.
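The classification rules above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation of [11]: sentences are assumed to arrive already segmented into romanized tokens (the paper uses JUMAN on Japanese text), any token equal to a listed particle counts as a match, and the names `ratio` and `classify_style` are hypothetical.

```python
from typing import List, Set

# Particles listed in the text, assumed here to arrive as romanized tokens.
FAMILIARITY = {"ne", "yo", "wa", "sa", "ze", "na"}
POLITENESS = {"desu", "masu", "kudasai", "gozaimasu"}


def ratio(sentences: List[List[str]], markers: Set[str]) -> float:
    """Fraction of sentences that contain at least one marker token."""
    if not sentences:
        return 0.0
    hits = sum(1 for tokens in sentences if markers.intersection(tokens))
    return hits / len(sentences)


def classify_style(sentences: List[List[str]]) -> str:
    """Apply the F-ratio / P-ratio thresholds described in the text."""
    f_ratio = ratio(sentences, FAMILIARITY)
    p_ratio = ratio(sentences, POLITENESS)
    if f_ratio == 0 and p_ratio == 0:
        return "written"
    if f_ratio > 0.2 or (f_ratio > 0.1 and p_ratio > 0.2):
        return "spoken"
    return "ambiguous"  # such pages are discarded


chat_page = [["ii", "ne"], ["sou", "da"], ["iku", "yo"]]
print(classify_style(chat_page))  # F-ratio is 2/3, so the page counts as spoken
```

A real system would need the morphological analysis to separate, for example, the sentence-final particle wa from the homographic topic marker; the sketch leaves that to the tokenizer.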
3.3 The Result
Table 1 shows the number of pages and words (nouns, verbs, and adjectives) in
the corpora constructed from the WWW. About 8,680k pages were downloaded from
the WWW, and 994k/1,338k were classified as written/spoken language. The
rest were classified as ambiguous pages and were discarded. The precision of
this method was reported by Kaji et al. (2004). According to their experiment,
the precision was 94%.
³ http://www.kc.t.u-tokyo.ac.jp/nl-resource/juman-e.html
[Figure: the plane spanned by F-ratio and P-ratio (ticks at 0, 0.1 and 0.2), partitioned into written language, ambiguous, and spoken language regions.]
Fig. 2. Style classification rule
Table 1. The size of the written and spoken language corpora
                  # of pages  # of words
Written language  989k        432M
Spoken language   1,337k      907M
Table 2. The size of the training and test data
Topic category        Training  Test
Arts                  2,834     150
Business & Economy    5,475     289
Computers & Internet  6,156     325
Education             2,943     155
Entertainment         6,221     328
Government            3,131     165
Health                1,800     95
News                  2,888     152
Recreation            4,352     230
Reference             1,099     58
Regional              4,423     233
Science               3,868     204
Social Science        5,410     285
Society & Culture     5,208     275
4 Topic Classification
The written and spoken language corpora are classified into 14 topic
categories (Figure 1). This task is known as text categorization. We used a
Support Vector Machine (SVM) because it is reported to achieve high
performance in this task. The training data was automatically built from
Yahoo! Japan⁴.
The categories provided by Yahoo! Japan have a hierarchical structure. For
example, there are Arts and Music categories, and Music is one of the
subcategories of Arts. We used the 14 categories located at the top level of
the hierarchy. We downloaded Web pages categorized in one of the 14
categories. Note that we did not use Web pages assigned to more than one
category. The Web pages were then divided into 20 segments. One of them was
used as the test data, and the others were used as the training data
(Table 2). In the table, the first column shows the names of the 14 topic
categories. The second/third column shows the number of pages in the
training/test data.
⁴ http://www.yahoo.co.jp/
[Figure: bar chart of the number of pages (vertical axis, 0 to 400,000) per topic category (horizontal axis, the 14 categories from Arts to Society & Culture), with separate bars for written language pages and spoken language pages.]
Fig. 3. The size of written and spoken language corpora in each topic category
The SVM was trained using the training data. In order to build a multi-class
classifier, we used the one-vs-rest method. The features of the SVM are the
probabilities of nouns in a page. The kernel function was linear. After the
training, the classifier was applied to the test data. The macro-averaged
accuracy was 86%.
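To make the feature representation and the one-vs-rest decision concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "probabilities of nouns in a page" means the relative frequency of each noun within the page, and it uses hand-written toy weights in place of trained linear SVMs; `noun_probabilities`, `linear_score`, `predict_topic`, and the weights are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# One binary linear model per category: (noun weights, bias).
LinearModel = Tuple[Dict[str, float], float]


def noun_probabilities(nouns: List[str]) -> Dict[str, float]:
    """Relative frequency of each noun within one page (assumed feature)."""
    counts = Counter(nouns)
    total = sum(counts.values())
    return {noun: c / total for noun, c in counts.items()} if total else {}


def linear_score(features: Dict[str, float], model: LinearModel) -> float:
    """Decision value of a linear model: w . x + b."""
    weights, bias = model
    return sum(weights.get(n, 0.0) * p for n, p in features.items()) + bias


def predict_topic(features: Dict[str, float],
                  models: Dict[str, LinearModel]) -> str:
    """One-vs-rest: the category whose binary model scores highest wins."""
    return max(models, key=lambda cat: linear_score(features, models[cat]))


# Toy weights for illustration only; real weights come from SVM training.
toy_models = {
    "Health": ({"cancer": 2.0}, -0.5),
    "Business & Economy": ({"customer": 2.0}, -0.5),
}
page = noun_probabilities(["customer", "customer", "market"])
print(predict_topic(page, toy_models))
```

With a linear kernel, each one-vs-rest model reduces to a weight vector and a bias as above, which is why prediction needs only a dot product per category.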
The written and spoken language corpora constructed from the WWW were
classified into 14 categories by SVM. Figure 3 depicts the number of pages in
each category.
5 Learning Lexical Choice
We have now constructed the written and spoken language corpora classified
into topic categories. The next step is to discriminate between inappropriate
and appropriate words for spoken language using the probabilities in the
written and spoken language corpora (Section 2). This paper proposes two
methods: one is based on a Decision Tree (DT), and the other is based on SVM.
This section first describes the creation of the gold standard data, which is
used for both training and evaluation. Then, we describe the features given to
DT and SVM.
5.1 Creation of Gold Standard Data
We prepared data consisting of pairs of a word and a binary tag. The tag
represents whether the word is inappropriate or appropriate for spoken
language. This data is referred to as the gold standard data. Note that gold
standard data is created for each topic category.
Gold standard data of topic T is created as follows.
1. Web pages in topic T are downloaded from Yahoo! Japan, and words (verbs,
nouns, and adjectives) are sampled from those pages at random.
2. Three human judges individually mark each word as INAPPROPRIATE,
APPROPRIATE or NEUTRAL. The NEUTRAL tag is used when a judge cannot
mark a word as INAPPROPRIATE or APPROPRIATE with certainty.
3. The three annotations are combined into a single gold standard data set.
A word is marked as INAPPROPRIATE/APPROPRIATE in the gold standard if
- all judges agree that it is INAPPROPRIATE/APPROPRIATE, or
- two judges agree that it is INAPPROPRIATE/APPROPRIATE and
  the other marked it as NEUTRAL.
The other words are not used in the gold standard data.
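The combination rule in step 3 can be written down directly. The sketch below is illustrative; the function name `combine_annotations` and the use of `None` for discarded words are assumptions made for this example.

```python
from collections import Counter
from typing import List, Optional


def combine_annotations(tags: List[str]) -> Optional[str]:
    """Combine three judges' tags into one gold label, or None if discarded."""
    if len(tags) != 3:
        raise ValueError("exactly three annotations are expected")
    counts = Counter(tags)
    for label in ("INAPPROPRIATE", "APPROPRIATE"):
        # All three judges agree on the label.
        if counts[label] == 3:
            return label
        # Two judges agree and the third is NEUTRAL.
        if counts[label] == 2 and counts["NEUTRAL"] == 1:
            return label
    return None  # the word is not used in the gold standard


print(combine_annotations(["APPROPRIATE", "APPROPRIATE", "NEUTRAL"]))
print(combine_annotations(["APPROPRIATE", "APPROPRIATE", "INAPPROPRIATE"]))
```

Note that a two-to-one split between APPROPRIATE and INAPPROPRIATE is discarded rather than resolved by majority, which keeps only words the judges did not actively disagree on.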
5.2 The Features
Both DT and SVM use the same three features: the word probability in the
written language corpus, the word probability in the spoken language corpus,
and the ratio of the word probability in the spoken language corpus to that in
the written language corpus. Note that when DT and SVM are trained on the gold
standard of topic T, the probabilities are estimated using the corpora in
topic T.
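As an illustration, the three features can be computed from word counts as follows. This is a sketch under assumptions: probabilities are taken to be unsmoothed relative frequencies, and the ratio is set to infinity when the word never occurs in the written corpus, a case the text does not specify. The name `lexical_features` is hypothetical.

```python
import math
from collections import Counter
from typing import Tuple


def lexical_features(word: str, written: Counter,
                     spoken: Counter) -> Tuple[float, float, float]:
    """Return (P_written, P_spoken, P_spoken / P_written) for one word.

    `written` and `spoken` are word-count tables of the topic-T corpora.
    """
    p_written = written[word] / sum(written.values())
    p_spoken = spoken[word] / sum(spoken.values())
    # Assumption: an unseen written-language word gets an infinite ratio.
    ratio = p_spoken / p_written if p_written > 0 else math.inf
    return p_written, p_spoken, ratio


# Toy counts for illustration only.
written_counts = Counter({"customer": 2, "report": 8})
spoken_counts = Counter({"customer": 3, "talk": 7})
print(lexical_features("customer", written_counts, spoken_counts))
```

Topic adaptation enters only through the choice of count tables: passing the counts of the topic-T corpora gives the topic-adapted features, while passing the counts of the whole corpora gives topic-independent ones.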
6 Evaluation
This section first reports the gold standard creation. Then, we show that DT
and SVM can successfully classify INAPPROPRIATE and APPROPRIATE words
in the gold standard. Finally, the effect of topic adaptation is presented.
6.1 The Gold Standard Data
The annotation was performed by three human judges (Judge1, Judge2 and
Judge3) on 410 words sampled from the Business category and 445 words sampled
from the Health category. Then, we created the gold standard data in each
category (Table 3). The average Kappa value [3] between the judges was 0.60,
which corresponds to substantial agreement.
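For readers unfamiliar with the Kappa statistic, the sketch below computes Cohen's kappa for each pair of judges and averages the pairwise values. Whether the paper used exactly this variant is an assumption; [3] discusses the statistic in general. The function names and the toy annotations are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import List, Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    expected = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    if expected == 1.0:  # both judges always used the same single label
        return 1.0
    return (observed - expected) / (1.0 - expected)


def average_pairwise_kappa(annotations: List[Sequence[str]]) -> float:
    """Mean of Cohen's kappa over all pairs of judges."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Toy annotations (A = APPROPRIATE, I = INAPPROPRIATE), for illustration only.
judge1 = ["A", "A", "I", "I"]
judge2 = ["A", "A", "I", "A"]
print(cohen_kappa(judge1, judge2))
```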
Table 3. Gold standard data
               Business  Health
INAPPROPRIATE  49        38
APPROPRIATE    267       340
Total          316       378
Table 4. # of words in Business and Health categories corpora
                  Business  Health
Written language  29,891k   30,778k
Spoken language   9,018k    32,235k
6.2 Lexical Choice Evaluation
DT and SVM were trained and tested on the gold standard data using
Leave-One-Out (LOO) cross validation. DT and SVM were implemented using the
C4.5⁵ and TinySVM⁶ packages. The kernel function of SVM was Gaussian RBF.
Table 4 shows the number of words in the corpora of the Business and Health
categories. The three features described in Section 5 were used.
⁵ http://www.rulequest.com/Personal/
⁶ http://chasen.org/~taku/software/TinySVM/
Table 5. The result of LOO cross validation
Topic     Method  Accuracy        # of correct answers  Precision  Recall
Business  DT      .915 (289/316)  31 + 258 = 289        .775       .660
          SVM     .889 (281/316)  21 + 260 = 281        .750       .429
          MCB     .845 (267/316)  0 + 267 = 267         -          .000
Health    DT      .918 (347/378)  21 + 326 = 347        .600       .552
          SVM     .918 (347/378)  13 + 334 = 347        .684       .342
          MCB     .899 (340/378)  0 + 340 = 340         -          .000
The result is summarized in Table 5. For example, in the Business category,
the accuracy of DT was 91.5%. 289 out of 316 words were classified
successfully, and the 289 consist of 31 INAPPROPRIATE and 258 APPROPRIATE
words. The last two columns show the precision and recall of INAPPROPRIATE
words. MCB is the Majority Class Baseline, which marks every word as
APPROPRIATE.
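The relation between the columns of Table 5 can be made explicit with a short sketch. It assumes the standard definitions of accuracy, precision, and recall with INAPPROPRIATE as the positive class; the function name `evaluate` is hypothetical. The usage example recomputes the Business SVM row from the correct-answer counts in Table 5 and the class sizes in Table 3.

```python
from typing import Optional, Tuple


def evaluate(correct_inappropriate: int, correct_appropriate: int,
             gold_inappropriate: int,
             gold_appropriate: int) -> Tuple[float, Optional[float], float]:
    """Return (accuracy, precision, recall) for the INAPPROPRIATE class."""
    total = gold_inappropriate + gold_appropriate
    accuracy = (correct_inappropriate + correct_appropriate) / total
    # APPROPRIATE words that were wrongly marked INAPPROPRIATE.
    false_alarms = gold_appropriate - correct_appropriate
    predicted_inappropriate = correct_inappropriate + false_alarms
    # Precision is undefined when nothing is marked INAPPROPRIATE (MCB).
    precision = (correct_inappropriate / predicted_inappropriate
                 if predicted_inappropriate else None)
    recall = correct_inappropriate / gold_inappropriate
    return accuracy, precision, recall


# Business, SVM: 21 + 260 correct out of 49 INAPPROPRIATE + 267 APPROPRIATE.
acc, prec, rec = evaluate(21, 260, 49, 267)
print(round(acc, 3), round(prec, 3), round(rec, 3))  # 0.889 0.75 0.429
```

The undefined precision of MCB corresponds to the missing precision entries in the MCB rows of the table.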
Judging from the accuracy in the Health category, one may think that our
method shows only a little improvement over MCB. However, considering other
evaluation measures such as the recall of INAPPROPRIATE words, it is clear
that the proposed method outperforms MCB. We would like to emphasize that MCB
is not a practical lexical choice method at all. If MCB is used, all words in
the input text are regarded as appropriate for spoken language, and the input
is never paraphrased.
One problem of our method is that the recall of INAPPROPRIATE words is
low. We think that the reason is as follows. The number of INAPPROPRIATE
words in the gold standard is much smaller than that of APPROPRIATE words.
Hence, a system that is biased to classify words as APPROPRIATE often achieves
high accuracy. Improving the recall while keeping high accuracy is left for
future work.
We examined the discrimination rules learned by DT. Figure 4 depicts the
rules learned by DT when the whole gold standard data of the Business category
is used as training data. In the figure, the horizontal/vertical axis
corresponds to the probability in the written/spoken language corpus. Words in
the gold standard can be mapped into this two-dimensional space.
INAPPROPRIATE/APPROPRIATE words are represented by a cross/square. The line
represents the discrimination rules. Words below the line are classified as
INAPPROPRIATE, and the others are classified as APPROPRIATE.
6.3 Effect of Topic Adaptation
Finally, we investigated the effect of topic adaptation by comparing our method
to a baseline method that does not consider topic.
Our method consists of two steps: (1) mapping a word to features, and
(2) applying discrimination rules to the features. In step (1), the
probabilities are estimated using the written and spoken language corpora in a
certain topic T. In step (2), discrimination rules are learned by DT using the
whole gold standard data of topic T. We used DT rather than SVM because its
rules are easy for humans to understand. The baseline, on the other hand, uses
the same discrimination rules as our method but uses the whole written and
spoken language corpora to map a word to features. Hereafter, the two methods
are referred to as Proposed and Baseline. Both methods use the same
discrimination rules but map a word to features in different ways. Therefore,
there are words that are classified as INAPPROPRIATE by Proposed and as
APPROPRIATE by Baseline, and vice versa. In the evaluation, we compared the
classification results of such words.
[Figure: gold standard words of the Business category plotted by probability in written language (horizontal axis) and probability in spoken language (vertical axis), with the line of the decision tree rules.]
Fig. 4. Decision tree rules in Business category
[Figure: the words ‘customer’ and ‘management’ plotted by probability in written language (horizontal axis) and probability in spoken language (vertical axis), once with topic-adapted and once with whole-corpora probabilities.]
Fig. 5. Examples in Business category
We evaluated the results of topical words and non-topical words separately.
This is because we expect Proposed to be good at dealing with topical words,
so that evaluating them separately lets us clearly confirm the effectiveness
of topic adaptation. Here, a word is regarded as a topical word in topic T if
its probabilities in the written and spoken language corpora assigned topic
category T are larger than those in the whole corpora with statistical
significance (at the 5% level). Otherwise, it is regarded as a non-topical
word in topic T. The log-likelihood ratio [4] was used as the statistical
test.
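The topical-word test can be sketched with the log-likelihood ratio statistic (G²) of [4] on a 2×2 table of word counts. The sketch makes several assumptions that the text does not spell out: the statistic is compared with the chi-square critical value for one degree of freedom at the 5% level (3.841), the direction "larger than" is checked separately on the relative frequencies, and the comparison corpus counts are supplied by the caller. All function names are hypothetical.

```python
import math
from typing import Tuple

# Chi-square critical value at the 5% level with 1 degree of freedom.
CHI2_CRITICAL_5PCT_1DF = 3.841


def log_likelihood_ratio(k1: int, n1: int, k2: int, n2: int) -> float:
    """G^2 for a word seen k1 times in n1 tokens vs. k2 times in n2 tokens."""
    observed = [k1, n1 - k1, k2, n2 - k2]
    p = (k1 + k2) / (n1 + n2)  # pooled probability under the null hypothesis
    expected = [n1 * p, n1 * (1 - p), n2 * p, n2 * (1 - p)]
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)


def is_overrepresented(k1: int, n1: int, k2: int, n2: int) -> bool:
    """True if the word is significantly more probable in corpus 1."""
    return (k1 / n1 > k2 / n2
            and log_likelihood_ratio(k1, n1, k2, n2) > CHI2_CRITICAL_5PCT_1DF)


def is_topical(written: Tuple[int, int, int, int],
               spoken: Tuple[int, int, int, int]) -> bool:
    """Each tuple is (k_topic, n_topic, k_comparison, n_comparison)."""
    return is_overrepresented(*written) and is_overrepresented(*spoken)


# Toy counts: 100 occurrences in 10,000 topic tokens vs. 100 in 100,000.
print(is_overrepresented(100, 10_000, 100, 100_000))
```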
The evaluation procedure is as follows.
1. Web pages in the Business category were downloaded from Yahoo! Japan, and
words in those pages were classified by the two methods. Words for which the
results of the two methods disagreed were collected.
2. From the collected words, we randomly sampled 50 topical words in Business
and 50 non-topical words. Note that we did not use words that are
contained in the gold standard.
3. Using Web pages in the Health category, we also sampled 50 topical words in
Health and 50 non-topical words in the same manner.
4. As a result, 100 topical words and 100 non-topical words were prepared. For
each word, two judges (Judge-A and Judge-B) individually assessed which
method successfully classified the word. Some classification results were
difficult even for human judges to assess. In such cases, the results of both
methods were regarded as correct.
Table 6 shows the classification accuracy for the 100 topical words. For
example, according to the assessment by Judge-A, 75 out of 100 words were
classified successfully by Proposed. Similarly, Table 7 shows the accuracy for
the 100 non-topical words. The overall agreement between the two judges
according to the Kappa value was 0.56. We compared the results of the two
methods using McNemar's test [8] and found a statistically significant
difference (at the 5% level), except for the result on non-topical words
assessed by Judge-A, where there was no significant difference.
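A sketch of McNemar's test on these results follows. It rests on two assumptions: the continuity-corrected chi-square form of the test is used (the paper does not say which variant was applied), and, because the two methods always disagree on the sampled words, every word is counted as correct for exactly one method or, in the hard-to-assess cases, for both. Under the second assumption the discordant counts follow from the accuracies in Tables 6 and 7. The function names are hypothetical.

```python
from typing import Tuple

# Chi-square critical value at the 5% level with 1 degree of freedom.
CHI2_CRITICAL_5PCT_1DF = 3.841


def discordant_counts(proposed_correct: int, baseline_correct: int,
                      total: int) -> Tuple[int, int]:
    """Words correct for only Proposed and for only Baseline.

    Assumes no word is wrong for both methods, so a word that is not
    correct for one method is correct for the other method only.
    """
    return total - baseline_correct, total - proposed_correct


def mcnemar_chi2(only_first: int, only_second: int) -> float:
    """Continuity-corrected McNemar statistic on the discordant pairs."""
    n = only_first + only_second
    if n == 0:
        return 0.0
    return (abs(only_first - only_second) - 1) ** 2 / n


# Judge-A, topical words: Proposed 75/100, Baseline 52/100.
b, c = discordant_counts(75, 52, 100)  # (48, 25)
print(mcnemar_chi2(b, c) > CHI2_CRITICAL_5PCT_1DF)

# Judge-A, non-topical words: Proposed 48/100, Baseline 66/100.
b, c = discordant_counts(48, 66, 100)  # (34, 52)
print(mcnemar_chi2(b, c) > CHI2_CRITICAL_5PCT_1DF)
```

Under these assumptions the statistic exceeds the critical value for the topical words judged by Judge-A and stays below it for the non-topical words judged by Judge-A, which agrees with the significance pattern reported in the text.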
Table 6. Accuracy of topical words classification
Judge    Method    Accuracy
Judge-A  Proposed  75% (75/100)
         Baseline  52% (52/100)
Judge-B  Proposed  72% (72/100)
         Baseline  53% (53/100)
Table 7. Accuracy of non-topical words classification
Judge    Method    Accuracy
Judge-A  Proposed  48% (48/100)
         Baseline  66% (66/100)
Judge-B  Proposed  38% (38/100)
         Baseline  78% (78/100)
6.4 Discussion and Future Work
Proposed outperformed Baseline in the classification of topical words. This
result indicates that the difficulty of topical words depends on the topic and
that we have to consider the topic. On the other hand, the result of Proposed
was not good when applied to non-topical words. We think there are two reasons
for this result: (1) the difficulty of non-topical words is independent of the
topic, and (2) Baseline uses larger corpora than Proposed (see Table 1 and
Table 4). Therefore, we think this result does not contradict the
effectiveness of topic adaptation. These results mean that Proposed and
Baseline are complementary to each other, and that it is effective to combine
the two methods: Proposed is applied to topical words and Baseline to
non-topical words. It is clear from the experimental results that such a
combination is effective.
We found that Baseline is prone to classify topical words as inappropriate,
and this bias decreases the accuracy. Figure 5 depicts typical examples
sampled from topical words in Business. Both judges regarded ‘management’ and
‘customer’⁷ as appropriate for spoken language in the Business topic. The
white triangle and diamond in the figure represent their features when the
probabilities are estimated using the corpora in the Business category. They
are located above the line, which corresponds to the discrimination rules, and
are successfully classified as appropriate by Proposed. However, if the
probabilities are estimated using the whole corpora, the features shift to the
black triangle and diamond, and Baseline wrongly classifies the two as
inappropriate. In the Health category, we observed similar examples such as
‘lung cancer’ and ‘metastasis’.
⁷ Our target language is Japanese. The examples illustrated here are
translations of the original Japanese words.
These examples can be explained in the following way. Consider topical words
in Business. When the probabilities are estimated using the whole corpora,
they are influenced by topics other than Business, in which the topical words
of Business are often inappropriate for spoken language. Therefore, we think
that Baseline is biased to classify topical words as inappropriate.
Besides the lexical choice method addressed in this paper, we have proposed a
lexical paraphrase generation method [10]. Our future direction is to apply
these methods to written language texts and evaluate the output of
text-to-speech. So far, the methods have been tested on a small set of reports.
Although the main focus of this paper is lexical paraphrases, we think that
it is also important to deal with structural paraphrases. So far, we have
implemented a system that paraphrases compound nouns into nominal phrases.
Building a system that generates other kinds of structural paraphrases is
left for future work.
7 Related Work
Lexical choice has been widely discussed in both paraphrasing and natural
language generation (NLG). However, to the best of our knowledge, no previous
work addresses topic adaptation. Previous approaches are either
topic-independent or specific to a certain topic.
Lexical choice has been one of the central issues in NLG. However, the main
focus is the mapping from concepts to words (e.g., [1]). In NLG, the work by
Edmonds and Hirst is related to our research [5]. They proposed a
computational model that represents the connotation of words.
Some paraphrasing studies focus on lexical choice. Murata and Isahara
addressed paraphrasing written language to spoken language. They used only
the probability in a spoken language corpus [12]. Kaji et al. also discussed
paraphrasing written language to spoken language, and they used the
probabilities in written and spoken language corpora [11]. On the other hand,
Inkpen et al. examined paraphrasing text to make it more positive or more
negative [9]. They used the computational model proposed by Edmonds and
Hirst [5].
The proposed method is based on word probability, which can be considered
a simple language model. In language modeling, many researchers have
discussed topic adaptation in order to precisely estimate the probability of
topical words [6,7,13]. Our work can be regarded as one application of such
language modeling techniques.
8 Conclusion
This paper proposed a lexical choice method that considers the topic. The
method utilizes written and spoken language corpora classified into topic
categories and estimates word probabilities that are adapted to the topic of
the input text. From the experimental results, we confirmed the effectiveness
of topic adaptation.
References
1. Barzilay, R., Lee, L.: Bootstrapping Lexical Choice via Multiple-Sequence
Alignment. Proceedings of EMNLP. (2002) 50–57
2. Bulyko, I., Ostendorf, M., Stolcke, A.: Getting More Mileage from Web Text
Sources for Conversational Speech Language Modeling using Class-Dependent
Mixtures. Proceedings of HLT-NAACL. (2003) 7–9
3. Carletta, J.: Assessing Agreement on Classification Tasks: The Kappa
Statistic. Computational Linguistics. 22 (2). (1996) 249–255
4. Dunning, T.: Accurate Methods for the Statistics of Surprise and
Coincidence. Computational Linguistics. 19 (1). (1993) 61–74
5. Edmonds, P., Hirst, G.: Near-Synonymy and Lexical Choice. Computational
Linguistics. 28 (2). (2002) 105–144
6. Florian, R., Yarowsky, D.: Dynamic Nonlocal Language Modeling via
Hierarchical Topic-Based Adaptation. Proceedings of ACL. (1999) 167–174
7. Gildea, D., Hofmann, T.: Topic-Based Language Models Using EM.
Proceedings of EUROSPEECH. (1999) 2167–2170
8. Gillick, L., Cox, S.: Some Statistical Issues in the Comparison of Speech
Recognition Algorithms. Proceedings of ICASSP. (1989) 532–535
9. Inkpen, D., Feiguina, O., Hirst, G.: Generating More-Positive and
More-Negative Text. Proceedings of AAAI Spring Symposium on Exploring
Attitude and Affect in Text. (2004)
10. Kaji, N., Kawahara, D., Kurohashi, S., Sato, S.: Verb Paraphrase based
on Case Frame Alignment. Proceedings of ACL. (2002) 215–222
11. Kaji, N., Okamoto, M., Kurohashi, S.: Paraphrasing Predicates from Written
Language to Spoken Language Using the Web. Proceedings of HLT-NAACL. (2004)
241–248
12. Murata, M., Isahara, H.: Automatic Extraction of Differences Between
Spoken and Written Languages, and Automatic Translation from the Written to
the Spoken Language. Proceedings of LREC. (2002)
13. Wu, J., Khudanpur, S.: Building a Topic-Dependent Maximum Entropy Model
for Very Large Corpora. Proceedings of ICASSP. (2002) 777–780
... When a computer is programmed in a way that it interacts with humans for specific purposes, it is referred to as a NLP application. Some examples of NLP applications are document summarization (Jusoh et al. 2011), information retrieval (IR) (Zukerman et al. 2002), information extraction (IE) (Sekine 2006), question and answering (Q&A) (Bernhard and Gurevych 2008), machine translation (Zhou et al. 2006), text-to-speech generation (Kaji and Kurohashi 2006) and sentence compression (Zhao et al. 2009a). ...
... • In Q&A (Bernhard and Gurevych 2008), when user submitted a question to a Q&A system and when this question is absent from its database, the system will identify its question paraphrase and return the answer of the question paraphrase to the user. • In text-to-speech generation (Kaji and Kurohashi 2006), generating voice directly from the written text is inappropriate because written language is different from spoken language. A more appropriate natural voice can be produced by paraphrasing written text into spoken language based on the paraphrase table constructed using both the written and spoken languages. ...
... Since paraphrase extraction usually happens before paraphrasing, paraphrasing can be made possible with (Kaji and Kurohashi 2006;Sekine 2006;Zhao et al. 2009a;Zhou et al. 2006) the use of a paraphrase table. For instance, in machine translation (Zhou et al. 2006), system generated summaries can be evaluated automatically by identifying lexical and phrasal paraphrases matches between summaries generated by system and summaries written by human using a paraphrase table. ...
Article
Full-text available
Recent advances in natural language processing have increased the popularity of paraphrase extraction. Most of the attention, however, has been focused on the extraction methods only without taking the resource factor into the consideration. Unknowingly, there is a strong relationship between them and the resource factor also plays an equally important role in paraphrase extraction. In addition, almost all of the previous studies have been focused on corpus-based methods that extract paraphrases from corpora based solely on syntactic similarity. Despite the popularity of corpus-based methods, a considerable amount of research has consistently shown that these methods are vulnerable to several types of erroneous paraphrases. For these reasons, it is necessary to evaluate whether the trend is moving in a positive direction. This paper reviews the major research on paraphrase extraction methods in detail. It begins by exploring the definition of paraphrase from different perspectives to provide a better understanding of the concept of paraphrase extraction. It then studies the characteristics and potential uses of different types of paraphrase resources. After that, it divides paraphrase extraction methods into four main categories: heuristic-based, knowledge-based, corpus-based and hybrid-based and summarizes their strengths and weaknesses. This paper concludes with some potential open research issues for future directions.
... Adaptation of written language to a more spoken language style has previously been tackled as a lexical adaptation problem (Kaji and Kurohashi 2005). They use style and topic classification to gather data representing written and spoken language styles; thereafter, they learn the probabilities of lexemes occurring in both categories. ...
Preprint
We present a novel approach for adapting text written in standard Finnish to different dialects. We experiment with character-level NMT models using both multi-dialectal and transfer learning approaches. The models are tested with over 20 different dialects. The results seem to favor transfer learning, although not strongly over the multi-dialectal approach. We study the influence dialectal adaptation has on the perceived creativity of computer-generated poetry. Our results suggest that the more the dialect deviates from standard Finnish, the lower the scores people tend to give on an existing evaluation metric. However, on a word association test, people associate creativity and originality more with dialect and fluency more with standard Finnish.
Conference Paper
In most applications of paraphrasing, contextual information should be considered since a word may have different paraphrases in different contexts. This paper presents a method that automatically acquires lexical context-specific paraphrases from the web. The method includes two main stages, candidate paraphrase extraction and paraphrase validation. Evaluations were conducted on a news title corpus whereby the context-specific paraphrasing method was compared with the Chinese synonymous thesaurus. Results show that the precision of our method is above 60% and the recall is above 55%, which outperforms the thesaurus significantly.
Chapter
We present experiments on modifying the semantic orientation of the near-synonyms in a text. We analyze a text into an interlingual representation and a set of attitudinal nuances, with particular focus on its near-synonyms. Then we use our text generator to produce a text with the same meaning but changed semantic orientation (more positive or more negative) by replacing, wherever possible, words with near-synonyms that differ in their expressed attitude.
Conference Paper
This paper describes a method of translating a predicate-argument structure of a verb into that of an equivalent verb, which is a core component of the dictionary-based paraphrasing. Our method grasps several usages of a headword and those of the def-heads as a form of their case frames and aligns those case frames, which means the acquisition of word sense disambiguation rules and the detection of the appropriate equivalent and case marker transformation.
Conference Paper
There are a lot of differences between expressions used in written language and spoken language. It is one of the reasons why speech synthesis applications are prone to produce unnatural speech. This paper presents a method of paraphrasing unsuitable expressions for spoken language into suitable ones. Those two expressions can be distinguished based on the occurrence probability in written and spoken language corpora which are automatically collected from the Web. Experimental results indicated the effectiveness of our method. The precision of the collected corpora was 94%, and the accuracy of learning paraphrases was 76%.
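The idea of separating suitable from unsuitable expressions by their occurrence probability in written and spoken corpora can be illustrated with a small sketch. This is not the authors' implementation: the add-alpha smoothing, the probability-ratio rule, the threshold, the function names, and the toy corpora are all assumptions made purely for illustration.

```python
from collections import Counter


def word_probabilities(tokens, vocabulary, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a shared vocabulary.

    The smoothing scheme is an assumption; the abstract does not say how
    probabilities are estimated.
    """
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


def suitable_for_speech(word, p_written, p_spoken, threshold=1.0):
    """Hypothetical decision rule: a word counts as suitable for spoken
    language when its spoken-to-written probability ratio reaches the
    threshold."""
    return p_spoken[word] / p_written[word] >= threshold


# Toy corpora standing in for automatically collected Web data.
written = "the committee shall commence the review forthwith".split()
spoken = "we will start the review right now you know".split()
vocab = set(written) | set(spoken)

p_w = word_probabilities(written, vocab)
p_s = word_probabilities(spoken, vocab)

for w in ("commence", "start"):
    print(w, suitable_for_speech(w, p_w, p_s))
```

With these toy corpora, a word seen only in the written sample falls below the threshold, while a word seen only in the spoken sample exceeds it. Restricting both corpora to documents on the same topic as the input, as the main paper proposes, would amount to changing which tokens are passed to `word_probabilities`.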
Conference Paper
The authors present two simple tests for deciding whether the difference in error rates between two algorithms tested on the same data set is statistically significant. The first (McNemar's test) requires the errors made by an algorithm to be independent events and is found to be most appropriate for isolated-word algorithms. The second (a matched-pairs test) can be used even when errors are not independent events and is more appropriate for connected speech.
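As a rough illustration of the first test, the sketch below computes a two-sided exact McNemar p-value from the two discordant counts, that is, the items on which exactly one of the two algorithms made an error. The abstract does not say which variant of the test the authors use, so the exact binomial form, the function name, and the example counts are assumptions.

```python
from math import comb


def mcnemar_exact(n01, n10):
    """Two-sided exact McNemar p-value.

    n01: items algorithm A got wrong and algorithm B got right.
    n10: items algorithm A got right and algorithm B got wrong.
    Under the null hypothesis, each discordant item is equally likely to
    fall in either cell, so the smaller count follows Binomial(n, 0.5).
    """
    n = n01 + n10
    if n == 0:
        return 1.0  # no discordant items, so no evidence of a difference
    k = min(n01, n10)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: A alone errs on 1 item, B alone errs on 9 items.
print(mcnemar_exact(1, 9))
```

Items on which both algorithms are right, or both are wrong, do not enter the computation; only the disagreements carry information about which algorithm is better.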
Article
We present experiments on modifying the semantic orientation of the near-synonyms in a text. We analyze a text into an interlingual representation and a set of attitudinal nuances, with particular focus on its near-synonyms. Then we use our text generator to produce a text with the same meaning but changed semantic orientation (more positive or more negative) by replacing, wherever possible, words with near-synonyms that differ in their expressed attitude.
Article
This paper presents a novel method of generating and applying hierarchical, dynamic topic-based language models. It proposes and evaluates new cluster generation, hierarchical smoothing and adaptive topic-probability estimation techniques. These combined models help capture long-distance lexical dependencies. Experiments on the Broadcast News corpus show significant improvement in perplexity (10.5% overall and 33.5% on target vocabulary).
Article
Currently, computational linguists and cognitive scientists working in the area of discourse and dialogue argue that their subjective judgments are reliable using several different statistics, none of which are easily interpretable or comparable to each other. Meanwhile, researchers in content analysis have already experienced the same difficulties and come up with a solution in the kappa statistic. We discuss what is wrong with reliability measures as they are currently used for discourse and dialogue work in computational linguistics and cognitive science, and argue that we would be better off as a field adopting techniques from content analysis.
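For readers unfamiliar with kappa, the sketch below computes one common two-annotator formulation (Cohen's kappa), which corrects observed agreement for the agreement expected by chance. The abstract does not specify which variant of kappa is advocated, so the choice of Cohen's formulation, the function name, and the example labels are assumptions.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (P_o - P_e) / (1 - P_e), where P_o is the observed agreement
    and P_e is the agreement expected by chance from each annotator's
    own label distribution.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgments from two annotators on four items.
annotator_1 = ["y", "y", "n", "n"]
annotator_2 = ["y", "n", "n", "n"]
print(cohen_kappa(annotator_1, annotator_2))
```

In this example the annotators agree on three of four items (raw agreement 0.75), but because half of that agreement is expected by chance, kappa is lower than the raw figure.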
Article
An important component of any generation system is the mapping dictionary, a lexicon of elementary semantic expressions and corresponding natural language realizations. Typically, labor-intensive knowledge-based methods are used to construct the dictionary. We instead propose to acquire it automatically via a novel multiple-pass algorithm employing multiple-sequence alignment, a technique commonly used in bioinformatics. Crucially, our method leverages latent information contained in multi-parallel corpora: datasets that supply several verbalizations of the corresponding semantics rather than just one. We used our techniques to generate natural language versions of computer-generated mathematical proofs, with good results on both a per-component and overall-output basis. For example, in evaluations involving a dozen human judges, our system produced output whose readability and faithfulness to the semantic input rivaled that of a traditional generation system.
Article
Sources of training data suitable for language modeling of conversational speech are limited. In this paper, we show how training data can be supplemented with text from the web filtered to match the style and/or topic of the target recognition task, but also that it is possible to get bigger performance gains from the data by using class-dependent interpolation of N-grams.