Available via license: CC BY-NC 4.0
Modern Electronic Technology | Volume 06 | Issue 01 | April 2022
Distributed under creative commons license 4.0
Modern Electronic Technology
http://www.advmodoncolres.sg/index.php/amor/index
*Corresponding Author:
Zhongjian Wang,
Guangzhou College of Technology and Business, Foshan, Guangzhou, 510800, China;
Email: zhongj_w@126.com
DOI: https://doi.org/10.26549/met.v6i1.11421
Research of Paraphrasing for Chinese Complex Sentences Based on Templates
Zhongjian Wang*, Ling Wang
Guangzhou College of Technology and Business, Foshan, Guangzhou, 510800, China
ARTICLE INFO ABSTRACT
Article history
Received: 19 March 2022
Revised: 26 March 2022
Accepted: 9 April 2022
Published Online: 16 April 2022
Building on the paraphrasing of Chinese simple sentences, this paper studies the paraphrasing of complex sentences using templates. Through the classification of complex sentences, syntactic analysis and structural analysis, the proposed method constructs complex sentence paraphrasing templates with associated words at their core. Part-of-speech tagging is used in calculating the similarity between the sentence to be paraphrased and the paraphrasing template. Joint complex sentences are divided into parallel, sequence, selection, progressive, and interpretive relationships. Subordinate complex sentences are divided into transition, conditional, hypothesis, causal, and objective relationships. Joint and subordinate complex sentences are classified according to their associated words. Using preprocessed sentences, a preliminary experiment is carried out to determine the similarity threshold between the sentence and the template. A small-scale paraphrase experiment shows that the method is feasible, with a paraphrasing template coverage rate of 40.20% and a paraphrase correct rate of 62.61%.
Keywords:
Complex sentence
Associated word
Paraphrasing template
1. Introduction
Natural language has received wide attention from scholars at home and abroad. Many languages, whether written or spoken, allow the same meaning to be expressed in different ways, and Chinese is no exception. With the rapid development of computers and the Internet, massive numbers of sentences need to be processed, including a large number of complex sentences, which makes the paraphrasing of such sentences an urgent task.
According to a simple classification by the complexity of the sentences being paraphrased, paraphrasing can be divided into simple sentence paraphrasing and complex sentence paraphrasing. The study of simple sentence paraphrasing is relatively common, whereas complex sentence paraphrasing involves a great deal of lexical and syntactic parsing and is difficult to implement because it requires a higher level of language processing techniques.
A careful review of a large number of documents shows that research on Chinese complex sentences remains largely at the grammar level; work on operability, formal modelling, mathematical representation, algorithmic procedures and practical application is scarce. For the paraphrasing of Chinese sentences in particular, few results can be put into operation in the field of natural language processing.
2. Analysis of Complex Sentence Theory
2.1 The Classification of Complex Sentence
In this paper, the classification of complex sentences is based mainly on the literature on sentence grammar, but is not limited to grammar rules. According to the needs of the research, we add to, delete from, and summarize the grammatical sentence structures in order to facilitate the implementation of paraphrasing.
Generally speaking, a simple sentence contains a subject part and a predicate part, whereas a complex sentence is made up of two or more clauses. Each clause can be either a subject-predicate sentence or a non-subject-predicate sentence.
Grammar studies differ considerably in how they divide sentences into categories, and these differences leave sentence categories without a clear, unified standard [1]. In this paper, the classification of complex sentences follows Jiaoyan Jia [2], which puts complex sentences into three categories: joint complex sentences, subordinate complex sentences and multiple complex sentences. The joint and subordinate categories each contain five subclasses.
Joint complex sentences can be divided into parallel, sequence, selection, progressive, and interpretive relationships. Subordinate complex sentences can be divided into transition, conditional, hypothesis, causal, and objective relationships. Joint and subordinate complex sentences are classified according to their associated words.
Multiple complex sentences are sentences that contain two or more relations. They are among the most difficult to rewrite, and their paraphrase coverage is very low.
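The classification above can be held in a small lookup structure. The following is a minimal Python sketch; the names `SentenceClass`, `RELATIONS` and `class_of` are illustrative assumptions and do not come from the original system.

```python
from enum import Enum


class SentenceClass(Enum):
    JOINT = "joint"
    SUBORDINATE = "subordinate"
    MULTIPLE = "multiple"  # contains two or more relations


# Relation subclasses as listed in the text (five per category).
RELATIONS = {
    SentenceClass.JOINT: (
        "parallel", "sequence", "selection", "progressive", "interpretive",
    ),
    SentenceClass.SUBORDINATE: (
        "transition", "conditional", "hypothesis", "causal", "objective",
    ),
}


def class_of(relation: str) -> SentenceClass:
    """Return the category that a single relation type belongs to."""
    for sentence_class, relations in RELATIONS.items():
        if relation in relations:
            return sentence_class
    raise KeyError(f"unknown relation: {relation}")
```

A sentence carrying two or more of these relations would be labelled `SentenceClass.MULTIPLE`; how that is detected is outside this sketch.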
2.2 Complex Sentence Semantic Analysis
The results of complex sentence semantic analysis comprise word segmentation, part-of-speech tagging, and analysis of the grammatical structure of the sentence. Word segmentation and part-of-speech tagging are the first step of rewriting; for these, this paper adopts the ICTCLAS [3] segmentation software. Part-of-speech tagging is the process of judging the grammatical category of each word in a given sentence, determining its part of speech and labeling it [4].
The research target of this paper is mainly marked complex sentences. Compared with marked complex sentences, unmarked complex sentences are difficult to paraphrase, so for them we only extract the main parts of the sentence, such as the subject, predicate and object, for paraphrasing.
The first category of complex sentences is the joint complex sentence, which includes parallel, sequence, selection, progressive and interpretive complex sentences. A parallel complex sentence is composed of several clauses, each of which describes one thing, one situation, one phenomenon or one particular aspect of a thing.
3. Complex Sentence Paraphrasing Strategy
On the basis of simple sentence paraphrasing, we attempt to paraphrase complex sentences using a template method. We construct a corpus as the resource needed for paraphrasing complex sentences and combine simple sentence templates to achieve complex sentence paraphrasing. The corpus can then be expanded to paraphrase further complex sentences, laying the foundation for further study.
There is a great deal of theoretical research on complex sentences. For example, the literature [5] proposes three methods for converting long sentences into short sentences: the dispersion method, the iterative method and the segmentation method.
The basic principle of paraphrasing is the same for simple and complex sentences: change the sentence structure without changing the meaning of the sentence. We use the following four paraphrasing strategies:
1) Extract the sentence trunk: for unmarked complex sentences and complex sentences with many clauses, extract the main components.
2) Merge one clause of a complex sentence into an attributive clause while the other clauses remain unchanged. A regular sentence is a set of clauses with the same or similar structure, whereas a scattered sentence is a set of clauses with irregular structure.
3) Building on simple sentence paraphrasing, additionally invert the positions of two clauses, exchanging the front and rear clauses. Simple sentence paraphrasing strategies include the replacement, deletion, addition, repetition and movement of words.
4) For a sentence using metaphor, personification or other rhetorical devices, change the non-obvious, ambiguous words into explicit wording.
3.1 Template Extraction
In the process of extracting rewriting templates, we use the above methods, or a combination of several of them. In the templates below, “[ ]” holds one of two kinds of content, either a part of speech alone or an associated word together with its part of speech; “< >” holds a comma; and “{ }” holds a replaceable associated word. This paper uses P and Q as variables, which are characterized as follows:
1) P and Q are just symbols, representing different sentence elements.
2) The content of P or Q can be a sentence, a phrase, a word, punctuation, or a combination of these.
3) Within a pair consisting of an original sentence template and its paraphrasing template, P in the original sentence template and P in the paraphrasing template denote the same sentence component. Similarly, Q in the original sentence template and Q in the paraphrasing template denote the same sentence component.
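To make the notation concrete, the sketch below splits a template string into typed slots. It is only an illustration: the `Slot` type and the `parse_template` function are hypothetical names, and the sketch assumes that slots are joined by “+” and handles only the “[ ]” and “< >” forms; the “{ }” form for replaceable associated words is rejected rather than interpreted.

```python
from typing import List, NamedTuple, Optional


class Slot(NamedTuple):
    kind: str             # "word", "pos", "var" or "punct"
    word: Optional[str]   # literal word, variable name, or None
    pos: Optional[str]    # part-of-speech tag, or None for variables


def parse_template(template: str) -> List[Slot]:
    """Split a '+'-joined template into slots (illustrative sketch)."""
    slots = []
    for raw in template.split("+"):
        raw = raw.strip()
        if raw.startswith("{"):
            raise ValueError("'{ }' alternatives are not handled in this sketch")
        body = raw[1:-1].strip()          # drop the enclosing brackets
        if raw.startswith("<"):
            word, pos = body.split("/")
            slots.append(Slot("punct", word.strip(), pos.strip()))
        elif "/" not in body:
            slots.append(Slot("var", body, None))      # e.g. [P], [Q]
        else:
            word, pos = body.split("/")
            word = word.strip()
            kind = "word" if word else "pos"           # [只有/c] vs [/n]
            slots.append(Slot(kind, word or None, pos.strip()))
    return slots
```

For instance, `parse_template("[只有/c]+[/n]+[P]+<,/w>+[才/d]+[/d]+[/v]+[Q]")` yields eight slots, beginning with a literal associated word and ending with the variable Q.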
The template extraction method for complex sentence paraphrasing is as follows:
A. Perform word segmentation and part-of-speech tagging on the sentence using the ICTCLAS segmentation system.
B. Compare each word and its part of speech in the complex sentence with each word and its part of speech in a template from the template library. If the sentence and the template share a component with the same word and the same part of speech, replace the extracted word with a position label, placed at the position the extracted word occupied in the original sentence. If no component with the same word and the same part of speech exists, the loop terminates.
C. In the complex sentence, the parts not extracted between two position labels are bundled together into a single block. Each extracted word also forms a block, and it is a keyword shared with the template.
Case 1 Original sentence: 如果讳疾忌医,就可能小病拖成大病。
Word segmentation, part of speech tagging:
如果/c 讳疾忌医/i ,/w 就/d 可能/v 小/a 病/n 拖/v 成/v 大/a 病/n 。/w
The template matching the original sentence is:
[如果/c]+[/i]+{,/w}+[就/d]+[/v]+[/a]+[Q]
The ingredients contained in Q are: { /n, /v, /v, /a, /n }
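Case 1 can be traced with a short script. This is a sketch under simplifying assumptions, not the authors' program: the fixed slots are written out by hand, the single variable is assumed to come last, the sentence-final punctuation token is dropped before matching (Q in Case 1 does not include the period), and the names `parse_tagged`, `FIXED_SLOTS` and `bind_trailing_variable` are hypothetical. The full-width comma of the original sentence is used throughout.

```python
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, str]                 # (word, part-of-speech tag)
FixedSlot = Tuple[Optional[str], str]   # (required word or None, required tag)


def parse_tagged(text: str) -> List[Token]:
    """Parse 'word/tag word/tag ...' output into (word, tag) pairs."""
    tokens = []
    for item in text.split():
        word, _, pos = item.rpartition("/")
        tokens.append((word, pos))
    return tokens


# Fixed part of the Case 1 template; the variable Q follows these slots.
FIXED_SLOTS: List[FixedSlot] = [
    ("如果", "c"), (None, "i"), (",", "w"), ("就", "d"), (None, "v"), (None, "a"),
]


def bind_trailing_variable(
    tokens: List[Token], fixed_slots: List[FixedSlot], variable: str = "Q"
) -> Optional[Dict[str, List[Token]]]:
    """Match fixed slots one token each; bundle the remainder into the variable."""
    if tokens and tokens[-1][1] == "w":
        tokens = tokens[:-1]            # drop sentence-final punctuation
    if len(tokens) < len(fixed_slots):
        return None
    for (word, pos), (slot_word, slot_pos) in zip(tokens, fixed_slots):
        if pos != slot_pos or (slot_word is not None and word != slot_word):
            return None
    return {variable: tokens[len(fixed_slots):]}


tagged = "如果/c 讳疾忌医/i ,/w 就/d 可能/v 小/a 病/n 拖/v 成/v 大/a 病/n 。/w"
bound = bind_trailing_variable(parse_tagged(tagged), FIXED_SLOTS)
print([pos for _, pos in bound["Q"]])   # ['n', 'v', 'v', 'a', 'n']
```

The printed tag sequence agrees with the contents of Q listed for Case 1.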
As shown in Figure 1 and Figure 2, the complex sentence is divided into 7 blocks, components 1 to 7. The contents of Figure 1 and Figure 2 correspond to each other; Figure 2 is a diagram of the sentence components. Components 1 to 6 are the keywords shared with the template. Component 7 is the uncertain variable in the template: the parts of speech it contains are bundled into a whole as the variable Q. Components 1 to 7 are arranged in the order of the original sentence.
Figure 1. Sentence composition block diagram
[Figure 2 blocks: 如果 | /i | ,/w | 就/d | /v | /a | Q]
Figure 2. Sentence component diagram
Templates fall into the following four categories:
1) Templates without variables: the template contains neither P nor Q.
2) Templates with one variable: the template contains only one Q or one P.
3) Templates with two variables: the template contains P and Q, or P1 and P2, or Q1 and Q2.
4) Templates with three variables: the template contains P1, P2 and Q, or P, Q1 and Q2.
The more uncertain variables a template has, the more complex it is. During template extraction such templates must be refined, which reduces the template coverage rate.
The following are paraphrasing templates for different associated words:
Example 1: paraphrasing a complex sentence containing the word “如果”.
Original sentence:
Original sentence template:
[ /c]+[/r]+[/v]+[P]+<, /w>+[/r]+{[ /d], [ /d], [ /d]}+[Q]
Paraphrasing sentence template:
[/r]+[/v]+[P]+<, /w>+[/r]+{[ /d], [ /d], [ /d]}+[Q]
[ /c]+[/r]+[/v]+[P]+<, /w>+[/r]+{[ /d], [ /d], [ /d]}+[Q]
[/r]+[Q]+<, /w>+[如果/c]+[/r]+[/v]+[P]
The paraphrased sentences corresponding to the paraphrasing templates are as follows:
我听妈妈的话,我就不会拉肚子了。
假如我听妈妈的话,我就不会拉肚子了。
Example 2: paraphrasing a complex sentence containing the words “只有……才”.
Original sentence:
只有国家强盛了,才不会受欺负。
Original sentence template:
[只有/c]+[/n]+[P]+<, /w>+[才/d]+[/d]+[/v]+[Q]
Paraphrasing sentence template:
[唯有/c]+[/n]+[P]+<, /w>+[才/d]+[/d]+[/v]+[Q]
[只有/c]+[在/c]+[/n]+[P]+[的/u]+[条件/n]+[下/f]+<, /w>+[才/d]+[/d]+[/v]+[Q]
The paraphrased sentences corresponding to the paraphrasing templates are as follows:
唯有国家强盛了,才不会受欺负。
只有在国家强盛了的条件下,才不会受欺负。
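The way a paraphrasing template is filled from the components of the original sentence can be sketched as follows. The bindings for Example 2 are supplied by hand here, because the tagged form of that sentence is not given; the segmentation into 国家 / 强盛了 / 不 / 会 / 受欺负 is therefore an assumption, and `render` and the template constants are hypothetical names.

```python
from typing import Dict, Iterable, List, Optional, Tuple, Union

# A slot is either a variable name ("P", "Q") or a (word, tag) pair,
# where word is None when only the part of speech is fixed.
TemplateSlot = Union[str, Tuple[Optional[str], str]]

PARAPHRASE_1: List[TemplateSlot] = [
    ("唯有", "c"), (None, "n"), "P", (",", "w"),
    ("才", "d"), (None, "d"), (None, "v"), "Q",
]
PARAPHRASE_2: List[TemplateSlot] = [
    ("只有", "c"), ("在", "c"), (None, "n"), "P", ("的", "u"), ("条件", "n"),
    ("下", "f"), (",", "w"), ("才", "d"), (None, "d"), (None, "v"), "Q",
]


def render(
    template: List[TemplateSlot],
    fillers: Iterable[str],
    bindings: Dict[str, str],
    end: str = "。",
) -> str:
    """Fill a paraphrasing template.

    fillers: words for the part-of-speech-only slots, in sentence order.
    bindings: text bound to the variables P and Q in the original sentence.
    """
    remaining = iter(fillers)
    parts = []
    for slot in template:
        if isinstance(slot, str):
            parts.append(bindings[slot])      # variable slot
        else:
            word, _pos = slot
            parts.append(word if word is not None else next(remaining))
    return "".join(parts) + end


fillers = ["国家", "不", "会"]               # assumed [/n], [/d], [/v] words
bindings = {"P": "强盛了", "Q": "受欺负"}     # assumed contents of P and Q
print(render(PARAPHRASE_1, fillers, bindings))
print(render(PARAPHRASE_2, fillers, bindings))
```

Under these assumed bindings the two calls reproduce the two paraphrased sentences of Example 2.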
3.2 Paraphrasing Process
To improve the paraphrasing success rate, each input complex sentence must be matched against the templates in the template library, using a sentence similarity calculation to find the appropriate paraphrase template. This requires a similarity level, the similarity threshold, which is determined by preliminary tests.
We put forward an improved algorithm based on similarity calculation; the paraphrase flow chart is shown below:
[Flow chart elements: Start; original sentence corpus; original sentence template library; extract keywords of part of speech; weight value calculation of part of speech; vectors Tvi and Tvj of the weight value of part of speech; common word vectors Evi and Evj; similarity computation; decision “Results > threshold” (Y/N); paraphrase template library; paraphrasing; End]
Figure 3. Paraphrase flow chart
As shown in Figure 3, the similarity is calculated as follows.
First, keywords are extracted from the sentence and the template and represented as vectors. A given complex sentence Ti has the vector representation Ti={m1, m2, m3,..., mn}; the number of words in Ti is called the vector length of Ti, and m1 to mn are the keywords of Ti.
Secondly, we introduce the calculation of the keyword weight values. The initial weight value of each word is 1/n, and the weights together form the weight value vector. The length of the keyword vector is Len(Ti). Keywords in this method are the sentence elements contained in the template, including punctuation marks.
Next, we introduce the calculation of the common word vector. Given two sentences Ti and Tj, let k and n be the lengths of the vectors of Ti and Tj respectively, with k<=n. For each word mi of Ti={m1, m2, m3,..., mk}, if mi is also present in the vector Tj={m1, m2, m3,..., mn}, it is a common word; the vector of the words shared by Ti and Tj is called the common word vector. It has the same form as the keyword vector and is expressed as Ei,j={e1,e2,...,ep}.
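The common word vector can be illustrated in a few lines of Python. This is a sketch assuming exact string matching between keywords (no synonym lookup); `common_word_vector` is a hypothetical name.

```python
from typing import List


def common_word_vector(ti: List[str], tj: List[str]) -> List[str]:
    """Keywords of the shorter vector that also occur in the longer one.

    The order of the shorter vector is preserved, mirroring the
    definition in which Ti has length k <= n.
    """
    shorter, longer = (ti, tj) if len(ti) <= len(tj) else (tj, ti)
    longer_words = set(longer)
    return [word for word in shorter if word in longer_words]


ti = ["如果", ",", "就"]
tj = ["如果", ",", "那么", "就"]
print(common_word_vector(ti, tj))   # ['如果', ',', '就']
```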
Finally, the similarity between the sentence and the paraphrasing template is calculated. The similarity formula is shown below:
(1)
In (1), vk represents the value of item k in the common word vector Evi.
In this formula, the weight values are calculated as follows:
If a keyword wi of Ti, or a synonym of it, appears in Tj, and the preceding words wi-1 in Tj and Ti are equal or synonymous with each other, the corresponding weight bi in the weight vector Tbi is increased to α times its value. In the same way, if the following words wi+1 in Tj and Ti are equal or synonymous with each other, the corresponding weight bi in Tbi is again increased to α times its value. After several tests, α was set to 1.3. If wi is not in Tj, the corresponding weight in Tbi remains the same.
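One possible reading of this weighting rule is sketched below. It rests on several assumptions that should be checked against the original implementation: keywords are compared by exact match only (synonyms are ignored), the first occurrence of a keyword in Tj is used, and "increase α times" is taken to mean multiplying the weight by α. The name `keyword_weights` is hypothetical.

```python
from typing import List

ALPHA = 1.3  # value reported in the text


def keyword_weights(ti: List[str], tj: List[str], alpha: float = ALPHA) -> List[float]:
    """Weights for the keywords of ti, boosted when neighbours also agree in tj."""
    n = len(ti)
    if n == 0:
        return []
    weights = [1.0 / n] * n                  # initial weight of each keyword
    for i, word in enumerate(ti):
        if word not in tj:
            continue                         # weight stays at 1/n
        j = tj.index(word)                   # first occurrence in tj (assumption)
        if i > 0 and j > 0 and ti[i - 1] == tj[j - 1]:
            weights[i] *= alpha              # preceding keywords agree
        if i + 1 < n and j + 1 < len(tj) and ti[i + 1] == tj[j + 1]:
            weights[i] *= alpha              # following keywords agree
    return weights
```

With identical three-keyword vectors, the middle keyword is boosted twice (to 1.69/3) and the two outer keywords once each (to 1.3/3).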
After a large number of preliminary experiments we obtained a paraphrasing threshold of 0.7598: when the similarity between the input complex sentence and a template in the template library reaches 75.98%, the sentence can be paraphrased according to that template.
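The thresholding step can be sketched independently of the similarity formula. In the sketch below the similarity function is passed in as a parameter, since formula (1) is not reproduced here; the `overlap` function is only a stand-in (a set-overlap ratio) and is not the paper's measure. The strict comparison follows the “Results > threshold” decision in Figure 3, and `select_template` is a hypothetical name.

```python
from typing import Callable, Dict, List, Optional

THRESHOLD = 0.7598  # threshold reported in the text


def select_template(
    sentence_keywords: List[str],
    library: Dict[str, List[str]],
    similarity: Callable[[List[str], List[str]], float],
    threshold: float = THRESHOLD,
) -> Optional[str]:
    """Return the best-scoring template name, or None if below threshold."""
    best_name, best_score = None, 0.0
    for name, template_keywords in library.items():
        score = similarity(sentence_keywords, template_keywords)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score > threshold else None


def overlap(a: List[str], b: List[str]) -> float:
    """Stand-in similarity for demonstration only: shared / total keywords."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0


library = {"if": ["如果", ",", "就"], "only": ["只有", ",", "才"]}
print(select_template(["如果", ",", "就"], library, overlap))    # if
print(select_template(["虽然", ",", "但是"], library, overlap))  # None
```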
4. Paraphrasing Experiment and Results Analysis
4.1 Experiment Procedure
We randomly selected 1500 sentences with associated words from joint and subordinate complex sentences; the number of corresponding template sentences is 603. We used the word segmentation software to carry out word segmentation and part-of-speech tagging, so the original sentence corpus consists of sentences that have been segmented and tagged with parts of speech.
The experimental process is divided into two steps: the first is to create a database, and the second is to write the programs.
4.2 Experimental Results Analysis
In manually checking the paraphrasing results, we found that small errors in the templates have a great impact on the paraphrasing results. A missing space not only causes a serious error in the paraphrasing result; missing spaces at different locations in the same template also lead to many different errors in the result. The absence of a comma or period has a negligible effect on the correct rate of paraphrasing.
The error types, each with an example, are as follows:
(1) The original sentence template is missing a comma. Such errors account for 77% of the total errors; see Figure 4.
Figure 4. Original sentence template
Figure 5. Error type 1
The paraphrasing result contains two commas because the original sentence and the paraphrasing template each contain a comma. There is no period in the original sentence template, so a full stop is added to the variable A during program processing, and the comma in the paraphrasing template is also added to the paraphrasing result; hence there are two commas in the result, as shown in Figure 5.
(2) Phrase collocation errors. These account for 18% of the total errors, as shown in Figure 6.
Figure 6. Error type 2
Because the collocation of phrases within clauses was not considered, a phrase was not moved together with its preceding clause when positions were exchanged; the program design is not reasonable in this respect.
(3) Low similarity for long sentences. These account for 5% of the total errors, as shown in Figure 7.
Figure 7. Error type 3
The experimental data include the paraphrase correct rate, the template coverage rate, and the rate of sentences not rewritten. The calculation of each is introduced below.
We define the total number of sentences as Psum, the total number of sentences rewritten in the result as Pasum, and the number of correctly paraphrased sentences as Rres, where each original sentence corresponds to only one correct paraphrase. The total number of templates is Tsum.
The proportion of sentences that have not been paraphrased is given by (2):
$$\mathrm{NPrate} = \frac{\mathrm{Psum} - \mathrm{Pasum}}{\mathrm{Psum}} \times 100\% \qquad (2)$$
The paraphrase correct rate is calculated as shown in (3):
(3)
The formula for calculating the template coverage is shown in (4):
(4)
Comparing the correctly paraphrased sentences with the total number of sentences, the proportion of sentences that have not been rewritten, obtained by (2), is 7%. The correct rate of rewriting, obtained by (3), is 62.61%. The template coverage rate, obtained by (4), is 40.2%.
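The rate calculations can be checked with a short script. Formula (2) is taken from the text. The definition of template coverage as Tsum/Psum is an assumption made here because formula (4) is not legible in this copy; it is at least consistent with the reported figures, since 603 templates for 1500 sentences gives 40.2%. The correct rate (3) is left out because its denominator cannot be recovered from the text. Function names are hypothetical.

```python
def not_paraphrased_rate(psum: int, pasum: int) -> float:
    """Formula (2): percentage of sentences that were not rewritten."""
    return (psum - pasum) / psum * 100


def template_coverage(tsum: int, psum: int) -> float:
    """Assumed form of (4): templates per sentence, as a percentage."""
    return tsum / psum * 100


# Reported experiment sizes: 1500 sentences, 603 templates.
print(round(template_coverage(603, 1500), 2))   # 40.2
```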
5. Conclusions
This paper presents a template-based method for paraphrasing Chinese complex sentences. By building a corpus with associated words at its core, it provides a basis for sentence paraphrasing. The experimental results show both the effectiveness of the method and its deficiencies.
The template coverage rate and the correct rate are the key to template-based sentence paraphrasing. In rewriting sentences, deeper syntactic and semantic analysis is needed to obtain more efficient paraphrasing templates and to raise paraphrasing accuracy and template coverage.
References
[1] Rinaldi, F., Dowdall, J., Moll, D., et al., 2003. Exploiting Paraphrases in a Question Answering System. Proceedings of Workshop in Paraphrasing at ACL 2003, Sapporo, Japan.
[2] Li, W.G., Liu, T., Zhang, Y., et al., 2005. Automated Generalization of Phrasal Paraphrases from the Web. The 3rd International Workshop on Paraphrasing. Jeju Island, South Korea. pp. 49-57.
[3] ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System): http://www.ictclas.org/index.html.
[4] Zhao, Sh.Q., Liu, T., Yuan, X.Ch., et al., 2007. Automatic Acquisition of Context-Specific Lexical Paraphrases. Proceedings of IJCAI, Hyderabad, India. pp. 1789-1794.
[5] Wang, Z., Wang, L., 2010. Paraphrase of Chinese Sentences Based on Associated Word. ASIA-ICIM 2010, Wuhan, China.