
Research of Paraphrasing for Chinese Complex Sentences Based on Templates

Abstract

Building on the paraphrasing of Chinese simple sentences, this paper studies the paraphrasing of complex sentences using templates. Through classification of complex sentences, syntactic analysis, and structural analysis, the proposed method constructs complex sentence paraphrasing templates with associated words as the core. Part-of-speech tagging is used in calculating the similarity between the sentence to be paraphrased and the paraphrasing template. Joint complex sentences are divided into parallel, sequence, selection, progressive, and interpretive relationships. Subordinate complex sentences are divided into transition, conditional, hypothesis, causal, and objective relationships. Joint and subordinate complex sentences are classified according to their associated words. Using preprocessed sentences, a preliminary experiment is carried out to determine the similarity threshold between the sentence to be paraphrased and the template. A small-scale paraphrasing experiment shows that the method is feasible, achieving a paraphrasing template coverage rate of 40.20% and a paraphrase correct rate of 62.61%.
Modern Electronic Technology | Volume 06 | Issue 01 | April 2022
Distributed under creative commons license 4.0
Modern Electronic Technology
http://www.advmodoncolres.sg/index.php/amor/index
*Corresponding Author:
Zhongjian Wang,
Guangzhou College of Technology and Business, Foshan, Guangzhou, 510800, China;
Email: zhongj_w@126.com
DOI: https://doi.org/10.26549/met.v6i1.11421
ResearchofParaphrasing forChinese Complex SentencesBasedon
Templates
ZhongjianWang*LingWang
Guangzhou College of Technology and Business, Foshan, Guangzhou, 510800, China
ARTICLE INFO
Article history
Received: 19 March 2022
Revised: 26 March 2022
Accepted: 9 April 2022
Published Online: 16 April 2022
Basedon the paraphrasingofChinesesimple sentences, the complexsen-
tence paraphrasing by using templates are studied.Through the classica-
tion of complex sentences, syntactic analysisand structuralanalysis, the
proposed methods constructcomplex sentenceparaphrasing templates that
the associated words are as the core. The part of speech tagging is used in
the calculation of the similarity between the paraphrasing sentences and
the paraphrasing template. Thejointcomplex sentencecan bedividedinto
parallel relationship, sequence relationship, selection relationship, progres-
sive relationship, andinterpretive relationship’scomplexsentences. The
subordinate complex sentencecanbedivided into transition relationship,
conditional relationship, hypothesis relationship, causal relationship and
objecti ve relationship’scomplex sentences. Jointcomplex sentencea nd
subordinatecomplex sentencearedivided to associatedwords. By using
pretreatedsentences,thepreliminary experiment iscarried out to decide
the threshold between the paraphrasing sentence and the template. A small
scale paraphrase experiment shows the method isavailability,acquire the
coverage rate of paraphrasing template 40.20% and the paraphrase correct
rate 62.61%.
Keywords:
Complexsentence
Associated word
Paraphrasing template
1. Introduction
Natural language has attracted wide attention from scholars at home and abroad. In many languages, whether written or spoken, the same meaning can be expressed in different ways, and Chinese is no exception. With the rapid development of computers and the Internet, a massive number of sentences must be processed, including a large number of complex sentences, which makes sentence paraphrasing an urgent need.
Accordingtoasimpleclassication ofthecomplexity
of paraphrasing sentences, we can paraphrase the simple
sentences and sentence rewriting. The study of simple
sentence paraphrasing is relatively common, and the com-
plexsentence paraphrasingrelates to a lot oflexical and
syntactic parsing, it is difficult to implement because of
the need for a higher level of language processing tech-
niques.
After carefully reviewing a large body of literature, we found that research on Chinese sentences remains largely at the grammatical level; there is little work on operability, formal model building, mathematical representation, algorithmic procedures, or practical application. For the paraphrasing of Chinese sentences in particular, few results are operational in the field of natural language processing.
2. AnalysisofComplexSentenceTheory
2.1TheClassicationofComplexSentence
In thispaper, theclassificationof complex sentences
is basically based on the sentence grammar literature, but
39
Modern Electronic Technology | Volume 06 | Issue 01 | April 2022
Distributed under creative commons license 4.0
not limited to grammar rules. According to the research
needs, we can take to increase, delete, and summarize the
grammar of the sentence structure, in order to facilitate
the implementation of the paraphrase.
Generally speaking, a simple sentence contains a subject part and a predicate part, whereas a complex sentence is made up of two or more clauses. A clause can be a subject-predicate sentence or a non-subject-predicate sentence.
Grammar studies differ considerably in how they divide sentences into categories, so there is no clear, unified standard for sentence categories [1]. In this paper, the classification of complex sentences follows Jiaoyan Jia [2], which divides them into three categories: joint complex sentences, subordinate complex sentences, and multiple complex sentences. The joint and subordinate categories each contain five subclasses.
Thejointcomplexsentencecanbedividedintoparallel
relationship, sequence relationship, selection relationship,
progressive relationship, and interpretive relationship’s
complexsentences. The subordinate complex sentence
can be divided into transition relationship, conditional re-
lationship, hypothesis relationship, causal relationship and
objectiverelationship’scomplexsentences.Jointcomplex
sentenceandsubordinatecomplexsentencearedividedto
associated words.
Multiplecomplexsentencesaresentences that contain
twoormorerelationswhichisoneofthemostdifcultto
be rewritten and is very low in terms of overwrite cover-
age.
2.2 ComplexSentence Semantic Analysis
Complexsentencesemanticanalysisresultscontaining
the word segmentation,part-of-speech tagging, andthe
grammar of the sentence structure analysis. Word seg-
mentation and part-of-speech tagging is thefirst stepof
rewriting,forofcomplexsentencewordsegmentationand
part-of-speechtagging,thispaperadoptstheICTCLAS[3]
segmentation software. Part-of-speechtagging is judging
each word of the sentence in given grammatical category,
determiningitspart-of-speechandlabelingprocess[4].
This paper mainly targets marked complex sentences, that is, those containing associated words. Compared with marked complex sentences, unmarked complex sentences are harder to paraphrase, so for them we extract only the main components of the sentence, such as the subject, predicate, and object, for paraphrasing.
The f irs t cate gor y is th e joi nt com plex sent ence of
complex sentences.Thejointcomplexsentence includes
parallel, sequence, selection, progressive and interpretive
complex sentences. Parallel complex sentence iscom-
posed of several clauses, each clause shows one thing, a
kind of situation, a phenomenon or a particular aspect of a
thing.
3.ComplexSentence ParaphrasingStrategy
Building on simple sentence paraphrasing, we paraphrase complex sentences using a template method. We construct a corpus as the resource needed to paraphrase complex sentences and combine simple sentence templates to achieve complex sentence paraphrasing. We then expand the corpus to paraphrase further complex sentences and lay the foundation for further study.
Thereare alotoftheoreticalresearchoncomplexsen-
tences, as mentioned in the literature [ 5] proposed three
methods for long sentences into short sentences which
are dispersion method, iterative method and segmentation
method.
The basic principle of paraphrasing is the same for simple and complex sentences: change the sentence structure without changing the meaning of the sentence. We use the following four paraphrasing strategies:
1) Extract the sentencetrunk,extractionof themain
componentsfor no-marked complex sentence and com-
plexsentencehavingmanyclauses.
2)The sentenceinacomplex sentencemergedintoan
attributive clause, other clauses remain unchanged. The
sentence is a set of clauses in the same or similar struc-
tures, the scattered sentence is a set of sentence structure
irregular.
3) On the basis ofsimplesentences,we addthe two
clause positionalinvertedwhich exchange thefront and
rear position between the two clauses. Simple sentence
paraphrasing strategy includes the replacement, deletion,
addition, repetition and locomotion of words.
4)Forasentencewithmetaphor,humanandotherrhe-
torical methods, we change the non obvious, ambiguous
wordstotheobviousmodication.
3.1 Template Extraction
Inthe process ofrewritingtemplateextraction, weuse
the above methods, or combination of several methods.
Thefollowingtemplate“[]”hastwokindsofthecontents,
one is part of speech, the other is the associated word and
its part of speech, there is a comma in the < >”, there
is a replaceable associated word in the “{}”. This thesis
selects P and Q as the variable, the variable P and Q are
characterized as follows:
1)PandQarejustsymbols,representingdifferentsen-
tence elements.
DOI: https://doi.org/10.26549/met.v6i1.11421
40
Modern Electronic Technology | Volume 06 | Issue 01 | April 2022
Distributed under creative commons license 4.0
2)The contentsofP and Q can be a sentence, phrase,
word, punctuation or the combination of the above.
3)Inthe same sentencetemplate fortheoriginal sen-
tence and the paraphrasing sentence, P in the original sen-
tence template and correspondingly in the paraphrasing
template is the same sentence components. Similarly, Q in
the original sentence template and correspondingly in the
paraphrasing template is the same sentence components.
Paraphrase the complex sentence template extraction
method:
A. Perform word segmentation and part-of-speech tagging on the sentence using the ICTCLAS segmentation system.
B. Compare each word and its part of speech in the complex sentence with each word and its part of speech in a template from the template library. If the sentence and the template share a component with the same word and the same part of speech, replace the extracted word with a position label placed at the position the extracted word occupies in the original sentence. If the sentence and the template share no such component, the loop terminates.
C. In the complex sentence, the parts that were not extracted and lie between two labeled positions are bundled into a whole, that is, into a block. An extracted word is also a block, and it is the same keyword as in the template.
Case 1. Original sentence: 如果讳疾忌医,就可能小病拖成大病。 (“If you conceal an illness and avoid the doctor, a minor illness may drag on into a serious one.”)
Word segmentation, part of speech tagging:
如果/c 讳疾忌医/i ,/w 就/d 可能/v 小/a 病/n 拖/v 成/v 大/a 病/n 。/w
The template matching the original sentence is:
[如果/c]+[/i]+{,/w}+[就/d]+[/v]+[/a]+[Q]
The ingredients contained in Q are: { /n, /v, /v, /a, /n }
AsshowninFigure 1 and Figure 2, complexsentence
is divided into 7 blocks, ingredients 1 to 7. Figure 1 and
Figure 2 in the contents of the corresponding relationship,
Figure 2 is a diagram of sentence components. Among
them, the compositionofone to six, andthetemplatein
the same key words. Elements 7 is the uncertain varia-
bles in the template, the component 7 contains the part of
speech bundled into a whole as a variable Q, the contents
of ingredients 1 to 7 are arranged in the order of the origi-
nal sentence.
Figure1. Sentence composition block diagram
如果
/i
/w
/d /v /a Q
Figure2. Sentence component diagram
The templates fall into the following four categories:
1) Templates with no variable: the template contains neither P nor Q.
2) Templates with one variable: the template contains only one P or one Q.
3) Templates with two variables: the template contains P and Q, or P1 and P2, or Q1 and Q2.
4) Templates with three variables: the template contains P1, P2, and Q, or P, Q1, and Q2.
A template with several uncertain variables is more complex. Such templates must be refined during template extraction, which reduces the template coverage rate.
The following are templates for paraphrasing sentences with different associated words.
Example 1: paraphrasing a complex sentence that contains the associated word “如果”.
Original sentence: 如果我听妈妈的话,我就不会拉肚子了。 (“If I listen to my mother, I will not get diarrhea.”)
Original sentence template:
[如果/c]+[/r]+[/v]+[P]+<,/w>+[/r]+{[ /d], [ /d], [ /d]}+[Q]
Paraphrasing sentence templates:
[/r]+[/v]+[P]+<,/w>+[/r]+{[ /d], [ /d], [ /d]}+[Q]
[假如/c]+[/r]+[/v]+[P]+<,/w>+[/r]+{[ /d], [ /d], [ /d]}+[Q]
[/r]+[Q]+<,/w>+[如果/c]+[/r]+[/v]+[P]
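The paraphrased sentences produced by the paraphrasing templates are as follows:
我听妈妈的话,我就不会拉肚子了。 (“I listen to my mother, and I will not get diarrhea.”)
假如我听妈妈的话,我就不会拉肚子了。 (“Supposing I listen to my mother, I will not get diarrhea.”)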
Example2containstheword“
只有……才
”complex
sentence paraphrase
Original sentence:
只有国家强盛了
,
才不会受欺负。
Original sentence template:
[
只有
/c]+[/n]+[P]+<, /w>+[
/d]+[/d]+[/v]+[Q]
Paraphrasing sentence template:
[
唯有
/c]+[/n]+[P]+<, /w>+[
/d]+[/d]+[/v]+[Q]
[
/c]+[
/c]+[/n]+[P]+[
/u]+[
/n]+[
/
f]+<, /w>+[
/d]+[/d]+[/v]+[Q]
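The paraphrased sentences produced by the paraphrasing templates are as follows:
唯有国家强盛了,才不会受欺负。 (“Only when the country is strong will it not be bullied.”)
只有在国家强盛了的条件下,才不会受欺负。 (“Only under the condition that the country has become strong will it not be bullied.”)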
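How an original sentence template binds P and Q and how a paraphrasing template reuses them can be sketched for Example 2 as follows. This is an illustrative sketch only: the helper names (`parse_slot`, `align`, `render`), the backtracking match, the segmentation and tags of the example sentence, and the slot words 才, 在, 的, 条件, 下 (read off the example sentences) are assumptions, and the final period is appended by the sketch rather than taken from a template.

```python
# Hypothetical sketch: slot syntax follows the templates in this section; the
# helper names and the segmentation of the example sentence are assumptions.

def parse_slot(text):
    """'只有/c' -> literal word + POS, '/n' -> POS only, 'P' -> variable."""
    if "/" not in text:
        return ("var", text)
    word, pos = text.rsplit("/", 1)
    return ("lit", word, pos) if word else ("pos", pos)


def align(slots, tokens):
    """Return one list of tokens per slot, or None if the template does not fit."""
    if not slots:
        return [] if not tokens else None
    head, rest = slots[0], slots[1:]
    if head[0] == "var":
        # A variable takes one or more tokens; try the shortest span first.
        for end in range(1, len(tokens) + 1):
            tail = align(rest, tokens[end:])
            if tail is not None:
                return [tokens[:end]] + tail
        return None
    if not tokens:
        return None
    word, pos = tokens[0]
    if pos != head[-1] or (head[0] == "lit" and word != head[1]):
        return None
    tail = align(rest, tokens[1:])
    return None if tail is None else [tokens[:1]] + tail


def render(paraphrase, groups):
    """Fill a paraphrasing template: ints copy a matched slot, strings are new words."""
    parts = []
    for item in paraphrase:
        if isinstance(item, int):
            parts.extend(word for word, _ in groups[item])
        else:
            parts.append(item)
    return "".join(parts) + "。"  # final period added by the sketch (assumption)


original = [parse_slot(s) for s in
            ["只有/c", "/n", "P", ",/w", "才/d", "/d", "/v", "Q"]]
tokens = [("只有", "c"), ("国家", "n"), ("强盛", "a"), ("了", "u"), (",", "w"),
          ("才", "d"), ("不", "d"), ("会", "v"), ("受", "v"), ("欺负", "v")]
groups = align(original, tokens)

first = render(["唯有", 1, 2, 3, 4, 5, 6, 7], groups)
second = render([0, "在", 1, 2, "的", "条件", "下", 3, 4, 5, 6, 7], groups)
print(first)
print(second)
```

In this sketch a paraphrasing template is written as a list in which an integer refers to a slot of the original template and a string is a newly inserted word, so P and Q are carried over unchanged, as the characterization of the variables requires.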
3.2 Paraphrasing Process
To improve the paraphrasing success rate, an input complex sentence must be matched against the templates in the template library, and a sentence similarity calculation is used to find the appropriate paraphrasing template. We therefore need to set a similarity level, the similarity threshold, which is determined by preliminary tests.
We put forward an improved algorithm based on similarity calculation. The paraphrase flowchart is shown in Figure 3.
Extractkeywordsofpartof
speech
Weight value calculation of part of
speech
VectorTvioftheweightvalueof
part of speech
Common word vector Evi
VectorTvj of the weight value
of part of speech
Common word vector Evj
Original
sentence
corpus
Original
sentence
template
library
Start
Results > threshold
End
Paraphrase
template
library
Y
N
Similarity computation
Paraphrasing
Figure3.Paraphrasetheowchart
As shown in Figure 3, the similarity is calculated as follows.
First, we extract keywords from the sentence and from the template and represent them as vectors. A given complex sentence Ti has the vector representation Ti={m1, m2, m3,..., mn}; the number of words in Ti is called the vector length of Ti, and m1 to mn are the keywords of Ti.
Second, we introduce the method for calculating keyword weight values. The initial weight value of each word is 1/n, and the weights together constitute the weight value vector. The length of the keyword vector is Len(Ti). In this method, the keywords are the sentence elements contained in the template, which also include punctuation marks.
Next,we will introducethemethod of calculating the
common word vector. Gi ven two sentences Ti and Tj,
k and n are the length of the vector, respectively, in the
Ti and Tj, among them k<=n. Every word of the mi for
Ti={m1, m2, m3,..., mk}, If mi is also present in vector
Tj={m1, m2, m3,..., mK}, the vector of the same words in Ti
and Tj is called public word vector. This public word vec-
tor and keyword vector are the same, they are expressed
as Ei,j={e1,e2,...,ep}.
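The common word vector can be illustrated with a few lines of Python. This is a sketch under assumptions: the function name `common_word_vector` is hypothetical, words are compared by exact equality, and the two keyword lists are made up from Case 1 for illustration.

```python
def common_word_vector(ti, tj):
    """Keywords of the shorter vector that also occur in the longer one."""
    if len(ti) > len(tj):
        ti, tj = tj, ti  # the text assumes k <= n
    in_tj = set(tj)
    return [m for m in ti if m in in_tj]


# Illustrative keyword vectors (assumed, loosely based on Case 1).
template_keywords = ["如果", ",", "就"]
sentence_keywords = ["如果", "讳疾忌医", ",", "就", "可能"]

e_ij = common_word_vector(template_keywords, sentence_keywords)
print(e_ij)
```

The result keeps the order of the shorter vector, so its length p is at most k.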
Finally, the similarity between the sentence and the paraphrasing template is calculated. The similarity formula is shown below:
(1)
In (1), vk represents the value of item k in the common word vector Evi.
In this formula, the weight values are calculated as follows. Suppose a keyword wi of Ti, or a synonym of it, appears in Tj. If, in both Tj and Ti, the words preceding wi (wi-1) are equal or synonymous with each other, the corresponding weight value bi in Tbi is multiplied by α. In the same way, if the words following wi (wi+1) in Tj and Ti are equal or synonymous with each other, bi is again multiplied by α. After several tests, α was set to 1.3. If wi does not appear in Tj, the corresponding weight value in Tbi remains the same.
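One possible reading of this weighting rule is sketched below. It is an interpretation, not the authors' code: synonyms are ignored (only exact matches count), the first occurrence of a keyword in Tj is used, and the function name `keyword_weights` and the example vectors are hypothetical.

```python
ALPHA = 1.3  # value reported in the text


def keyword_weights(ti, tj, alpha=ALPHA):
    """Weight vector for the keywords of ti, boosted where neighbours also match in tj."""
    n = len(ti)
    weights = [1.0 / n] * n  # initial weight of every keyword is 1/n
    for i, word in enumerate(ti):
        if word not in tj:
            continue  # weight stays at its initial value
        j = tj.index(word)  # first occurrence in tj (assumption)
        # Preceding words agree in both vectors: multiply by alpha.
        if i > 0 and j > 0 and ti[i - 1] == tj[j - 1]:
            weights[i] *= alpha
        # Following words agree in both vectors: multiply by alpha again.
        if i + 1 < n and j + 1 < len(tj) and ti[i + 1] == tj[j + 1]:
            weights[i] *= alpha
    return weights


ti = ["如果", ",", "就"]
tj = ["如果", ",", "就", "可能"]
print(keyword_weights(ti, tj))
```

With these inputs the middle keyword has matching neighbours on both sides and is boosted twice (1.3 × 1.3 / 3), while the first and last keywords are boosted once (1.3 / 3).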
Aftera lot of preliminary experiments we gotapara-
phrasing th reshold of 0.7598, the similarity of input
complex sentence templateand template libraryofupto
75.98%, we can paraphrase the sentence according to the
template.
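The threshold decision can be sketched as a small selection routine. The sketch is hedged in two ways: `pick_template` is a hypothetical name, and `toy_similarity` is a simple set-overlap stand-in used only to make the example runnable; it is not the similarity formula (1) of this paper. Treating a score equal to the threshold as a match is also an assumption.

```python
THRESHOLD = 0.7598  # value reported in the text


def pick_template(sentence, templates, similarity, threshold=THRESHOLD):
    """Return the most similar template if it reaches the threshold, else None."""
    best, best_score = None, 0.0
    for template in templates:
        score = similarity(sentence, template)
        if score > best_score:
            best, best_score = template, score
    return best if best_score >= threshold else None


def toy_similarity(a, b):
    """Stand-in only: set overlap (Jaccard), NOT the paper's formula (1)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Illustrative template library of keyword vectors (assumed).
library = [["如果", ",", "就", "可能"], ["只有", ",", "才"]]

hit = pick_template(["如果", ",", "就", "可能", "小"], library, toy_similarity)
miss = pick_template(["因为", ",", "所以"], library, toy_similarity)
print(hit, miss)
```

Passing the similarity function as a parameter keeps the threshold logic separate from the scoring, so the stand-in can be swapped for the weighted calculation described in this section.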
4. Paraphrasing Experiment and Results Analysis
4.1 Experiment Procedure
We randomly selected 1500 sentences with associated words from joint and subordinate complex sentences; the corresponding number of templates is 603. We used the word segmentation software to perform word segmentation and part-of-speech tagging, so the original sentence corpus consists of sentences annotated with word segmentation and part-of-speech tags.
Theexperimentalprocessisdividedintotwosteps,one
is needed to create a database, two is to write programs.
4.2 ExperimentalResults Analysis
While manually checking the paraphrasing results, we found that small errors in a template have a great impact on the results. A missing space not only causes a serious error in the paraphrasing result; missing spaces at different locations in the same template also lead to many different errors in the result. The absence of a comma or a period has a negligible effect on the correct rate of paraphrasing.
The error types are as follows, each with an example:
(1) The original sentence template is missing a comma. Such errors account for 77% of all errors; see Figure 4.
Figure4. Original sentence template
Figure5. Error type 1
The paraphrasing result contains two commas because the original sentence and the paraphrasing template each have a comma. There is no period in the original sentence template, so the program adds a full stop to the variable A during processing; the comma in the paraphrasing template is also added to the paraphrasing result. The result therefore contains two commas, as shown in Figure 5.
(2) Phrase collocation error. This error accounts for 18% of all errors; see Figure 6.
Figure 6. Error type 2
Because phrase collocation across clauses was not considered, the clause was not moved and joined with the preceding clause; the programming is not reasonable in this respect.
(3) Low similarity for long sentences. This error accounts for 5% of all errors; see Figure 7.
Figure 7. Error type 3
Theexperimental data includedrewriting correctrate,
the template coverage, and the rate of not being rewritten,
thefollowingspecicallyintroducesthecalculationmeth-
od of all kinds of data.
WedenethetotalnumberofsentencesasPsum,inthe
result, the total number of sentences to be rewritten is Pa-
sum, paraphrase the correct number of sentences is Rres,
one of the original sentences only corresponds to a correct
paraphrasing sentence. The total number of templates is
Tsum.
The proportion of sentences that have not been paraphrased is given by (2):
$$NP_{rate} = \frac{P_{sum} - Pa_{sum}}{P_{sum}} \times 100\% \qquad (2)$$
Paraphrasecorrectratecalculationasshowninthe(3):
(3)
The formula for calculating the template coverage is shown in (4):
(4)
Comparing the correctly paraphrased sentences with the total number of sentences, the proportion of sentences that were not paraphrased, obtained by (2), is 7%. The paraphrase correct rate, obtained by (3), is 62.61%. The template coverage rate, obtained by (4), is 40.2%.
5. Conclusions
This paper presents a template-based method for paraphrasing Chinese complex sentences. A corpus built with associated words as its core provides the basis for sentence paraphrasing. The experimental results show both the effectiveness of the method and its deficiencies.
The template coverage rate and the correct rate are the key to template-based sentence paraphrasing. Future work requires deeper syntactic and semantic analysis of sentences to obtain more efficient paraphrasing templates and to raise paraphrasing accuracy and template coverage.
References
[1] Rinaldi, F.,Dowdall,J., Moll,D., etal., 2003. Ex-
ploiting Paraphrases in a Question Answering Sys-
tem. Proceedings of Workshop in Paraphrasing at
ACL2003, Sapporo, Japan.
[2] Li,W.G.,Liu,T., Zhang,Y.,etal.,2005.Automated
Generalization of Phrasal Paraphrases from the Web.
The 3rd International Workshop on Paraphrasing.
JejuIsland,SouthKorea.pp.49-57.
[3] ICTCLAS(InstituteofComputingTechnology,Chi-
nese LexicalAnalysis System): http://www.ictclas.
org/index.html.
[4] Zhao,Sh.Q.,Liu,T.,Yuan,X.Ch.,etal.,2007.Auto-
maticAcquisition of Context-Specic LexicalPara-
phrases. Proceedings of IJCAI, Hyderabad, India. pp.
1789-1794.
[5] Wang, Z., Wang,L.,2010. Paraphrase ofChinese
Sentences Based onAssociated Word.ASIA-ICIM
2010, Wuhan, China.