Evaluating the Performance of a New Text Rhythm Analysis Tool

English Studies at NBU, 2020 pISSN 2367-5705
Vol. 6, Issue 2, pp. 217-232 eISSN 2367-8704
https://doi.org/10.33919/esnbu.20.2.3 www.esnbu.org
EVALUATING THE PERFORMANCE OF
A NEW TEXT RHYTHM ANALYSIS TOOL
Elena Boychuk1, Ksenia Lagutina2,
Inna Vorontsova3, Elena Mishenkina4, Olga Belyayeva5
1, 3, 4, 5 K. D. Ushinsky Yaroslavl State Pedagogical University, Yaroslavl, Russia,
2 P. G. Demidov Yaroslavl State University, Russia
Abstract
The paper assesses and evaluates the performance of the ProseRhythmDetector (PRD) Text Rhythm
Analysis Tool. The research is a case study of 50 English and 50 Russian fictional texts (approximately
88,000 words each) from the 19th to the 21st century. The paper assesses the PRD tool accuracy in
detecting stylistic devices containing repetition in their structure such as diacope, epanalepsis, anaphora,
epiphora, symploce, epizeuxis, anadiplosis, and polysyndeton. The article ends by discussing common
errors, analysing disputable cases and highlighting the use of the tool for author and idiolect
identification.
Keywords: text rhythm analysis, diacope, epanalepsis, anaphora, epiphora, symploce, epizeuxis,
anadiplosis
Article history:
Received: 24 May 2020
Reviewed: 30 June 2020
Revised: 15 October 2020
Accepted: 29 November 2020
Published: 21 December 2020
Contributor roles:
Conceptualization, Funding acquisition: E.B. (lead); Data curation, Formal analysis,
Investigation, Validation: E.B., K.L., I.V., E.M., O.B. (equal); Visualization: E.B., K.L., I.V.
(equal); Methodology: E.B., K.L. (lead), I.V., E.M., O.B. (supporting); Software: K.L. (lead),
E.B., I.V., E.M., O.B. (equal supporting); Writing original draft: E.B., I.V. (lead),
K.L., E.M., O.B. (equal supporting)
Copyright © 2020 Elena Boychuk, Ksenia Lagutina, Inna Vorontsova, Elena Mishenkina, Olga Belyayeva
This open access article is published and distributed under a CC BY-NC 4.0 International
License which permits non-commercial use, distribution, and reproduction in any medium,
provided the original author and source are credited. Permissions beyond the scope of this
license may be available at elena-boychouk@rambler.ru. If you want to use the work commercially, you
must first get the authors’ permission.
Citation: Boychuk, E., Lagutina, K., Vorontsova, I., Mishenkina, E., Belyayeva, O. (2020). Evaluating the
Performance of a New Text Rhythm Analysis Tool. English Studies at NBU, 6(2), 217-232.
https://doi.org/10.33919/esnbu.20.2.3
Funding: This research has been sponsored under Project 19-07-00243 of the Russian Foundation for
Basic Research (RFBR).
Corresponding author:
Elena Boychuk, Doctor of Philological Sciences, is an Associate Professor with the Department of
Romance Languages, K.D. Ushinsky Yaroslavl State Pedagogical University, Russia. She teaches courses in
linguistics, stylistics, intercultural communication, and French as a foreign language. Her research
interests include computer linguistics, phonetics, cognitive linguistics, psycholinguistics, and
communication theory.
E-mail: elena-boychouk@rambler.ru http://orcid.org/0000-0001-6600-2971
*Other authors’ notes at the end
Rhythm figure analysis
This research aims to assess and evaluate the performance of the
ProseRhythmDetector (PRD) tool (Larionov et al., 2020) in the automated
identification of rhythm figures in 50 English and 50 Russian fiction texts
(approximately 88,000 words each; see endnote i) from the 19th to the 21st century, as
compared with manual search results.
The PRD tool has been designed to perform a quick and accurate search producing
a quantitative analysis of rhythm figures containing repetition in their structure
(diacope, epanalepsis, anaphora, epiphora, symploce, epizeuxis, anadiplosis,
polysyndeton). These rhythm figures are examples of repetition determined by the
position of repeated units (beginning, end or junction of sentences or clauses, etc.).
Rhythm figure analysis is instrumental in identifying authors’ idiolects and making
conclusions about the uniqueness of their style and language. This is directly related to
the problem of linguistic uniqueness and author identification (e.g. Lagutina et al., 2019;
Boychuk & Belyaeva, 2019). The tool has demonstrated encouraging results in this
respect.
Other stylistic devices containing various forms of repetition in their structure
(chiasmus, polyptoton, derivation, syntactical parallelism etc.) will be considered at a
later stage of the tool performance assessment.
Existing tools: state-of-the-art
Few researchers have addressed the problem of using automated tools for text
rhythm analysis. There are several works on text attribution, where the following
rhythm analysis parameters are considered: rhyme, syllabification, accentuation, and
word repetition. Dumalus and Fernandez (2011) regard text rhythm as a valid author’s
style marker using a simple Naive Bayesian Classifier. Plecháč et al. (2018) apply
rhythm parameters to establishing the authorship of poetic texts. These parameters
include frequencies of stressed syllables at particular metrical positions and frequencies
of particular sounds. Hou and Huang (2020) propose to leverage the phonological
information of tones and rimes in Mandarin Chinese automatically extracted from
unannotated texts. Balint and Trausan-Matu (2016) consider eight features: numbers of
syllables per word, word deemed frequent; normalized numbers of sentence anaphora,
punctuation unit anaphora and commas; percentage of falling word length patterns,
frequent words at the end of sentences and at the beginning of punctuation units.
Dubremetz and Nivre (2018) assess features based on such rhythm figures as
epanaphora, epiphora, and chiasmus. They apply a binary logistic regression classifier
to a combination of words and achieve decent extraction quality: an F-score of over 50%
for all rhythm features.
The authors referred to above consider rhythm as a manifestation of one or two
parameters rather than a complex phenomenon revealing itself at the level of grammar
and lexis. Modern computational linguistics obviously lacks systems capable of both
efficiently extracting rhythm features and presenting them in such a way that would
make it possible for a researcher to analyse the rhythm of a fictional text in its entirety
as well as study its particular aspects.
The Prose Rhythm Detector (PRD) tool
When searching for diacope, epanalepsis, anaphora, epiphora, symploce, epizeuxis,
anadiplosis, the PRD filters out words from a stop word list. Each figure can have its own
list of stop words with the exception of polysyndeton that refers to a set list of
conjunctions.
The search for epanalepsis is based on an algorithm that reviews each sentence for
a match of its beginning and ending. If the match is found and the matching units are not
on the stop word list, the case is attributed to epanalepsis.
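The epanalepsis check described above can be sketched in a few lines of Python. This is a minimal illustration, not the PRD implementation: the tokenizer, the `STOP_WORDS` set and the three-word minimum are assumptions introduced here, since the paper does not list the tool's actual stop words or thresholds.

```python
import re

# Hypothetical stop-word list; the PRD's real lists are not given in the paper.
STOP_WORDS = {"the", "a", "an", "and", "it", "i", "you", "he", "she"}

def words(sentence):
    """Lowercased word tokens (Latin or Cyrillic letters, inner apostrophes kept)."""
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", sentence.lower())

def is_epanalepsis(sentence, stop_words=STOP_WORDS):
    """True if the sentence begins and ends with the same non-stop word."""
    tokens = words(sentence)
    if len(tokens) < 3:  # assumption: at least one word between the repeats
        return False
    first, last = tokens[0], tokens[-1]
    return first == last and first not in stop_words

print(is_epanalepsis("Everyone was going to be a great writer, but everyone!"))
print(is_epanalepsis("There were a great many words there."))
```

A purely positional check of this kind also accepts the second sentence, where the two instances of "there" are homonyms; content-based cases like this cannot be separated by a stop-word list without also suppressing genuine repetitions.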
The tool uses two algorithms for detecting epizeuxis. The first compares the
neighbouring sentences and registers the aspect as epizeuxis if the sentences repeat. The
second checks a single sentence: if it contains words that are repeated in a row, the aspect
is also identified as epizeuxis. In neither case are the matching units identified as
epizeuxis if they contain stop words.
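The two epizeuxis checks can be sketched as follows. The function names, the tokenizer, the `max_len` limit on the length of a repeated unit and the stop-word set are all assumptions made for illustration; the paper only states that neighbouring sentences and in-row repetitions are compared and that units containing stop words are skipped.

```python
import re

# Hypothetical stop words for epizeuxis; each figure may have its own list.
STOP_WORDS = {"you", "it", "that", "had", "was"}

def words(text):
    """Lowercased word tokens (Latin or Cyrillic letters, inner apostrophes kept)."""
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())

def repeated_sentences(sentences, stop_words):
    """Check 1: indices i where sentence i and sentence i + 1 are identical."""
    hits = []
    for i in range(len(sentences) - 1):
        a, b = words(sentences[i]), words(sentences[i + 1])
        if a and a == b and not any(w in stop_words for w in a):
            hits.append(i)
    return hits

def repeated_in_row(sentence, stop_words, max_len=3):
    """Check 2: word sequences of up to max_len words repeated back to back."""
    tokens = words(sentence)
    hits = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - 2 * n + 1):
            unit = tokens[i:i + n]
            if unit == tokens[i + n:i + 2 * n] and not any(w in stop_words for w in unit):
                hits.append(" ".join(unit))
    return hits

print(repeated_in_row("'All right, all right,' he says querulously", STOP_WORDS))
print(repeated_sentences(["Only Bradley.", "Only Bradley.", "He left."], STOP_WORDS))
```

With "you" on the hypothetical stop list, a sequence such as "for you: You were out" is no longer reported, which is the kind of correction a targeted stop-word list can provide.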
The algorithm for the search of diacope is based on detecting the repetition of
words in a particular sentence. If a word is repeated in a position non-relevant to
epizeuxis or epanalepsis and is not on the stop word list, the aspect is registered.
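A simplified sketch of the diacope rule is given below. It assumes that "a position relevant to epizeuxis" means adjacent occurrences and that "a position relevant to epanalepsis" means the first and last word of the sentence; both readings, as well as the helper names and stop words, are assumptions rather than details confirmed by the paper.

```python
import re
from collections import defaultdict

def words(text):
    """Lowercased word tokens (Latin or Cyrillic letters, inner apostrophes kept)."""
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())

def diacope_words(sentence, stop_words):
    """Non-stop words repeated within one sentence, excluding repeats that sit
    in an epizeuxis position (adjacent) or an epanalepsis position (first/last)."""
    tokens = words(sentence)
    positions = defaultdict(list)
    for i, w in enumerate(tokens):
        if w not in stop_words:
            positions[w].append(i)
    last = len(tokens) - 1
    hits = []
    for w, pos in positions.items():
        if len(pos) < 2:
            continue
        adjacent = any(b - a == 1 for a, b in zip(pos, pos[1:]))
        framing = pos[0] == 0 and pos[-1] == last
        if not adjacent and not framing:
            hits.append(w)
    return hits

# Hypothetical stop words for this illustration only.
stop = {"i", "to", "of", "her", "it", "as", "is", "but", "who"}
print(diacope_words("I thought of course you'd want to see her - I don't want to see her!", stop))
```

Because "her" is filtered out here, the sketch reports "want" and "see" as diacope and never considers the sentence-final repetition, which mirrors how a stop-word list can decide which type of repetition is registered.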
Finally, when all aspects have been identified, the tool displays their full list, as
well as the text with the highlighted aspects, and a list of figures with the number of their
aspects.
English-language text analysis
The initial assessment of the efficiency of the PRD tool (Boychuk et al., 2020, pp.
107-119) was performed with the use of randomly selected English fiction texts. The present
research takes a more structured approach with 50 texts covering a three-century span.
The underlying idea was to see whether texts differ in the use of rhythm figures from
century to century. Another interesting point discussed is how the results obtained for
the English texts compare with those acquired for the Russian texts.
The total number of words in the English texts in this research is about 1,500,000
per century, i.e. approximately 4,500,000 in total.
The analysis algorithm involved the following steps. The text was uploaded in the
text box and processed by the application, which resulted in the generation of an
aggregate rhythm figure list (Fig. 1). Selecting a particular figure, the researchers then
assessed its use in context discriminating between the proper and the improper
automated identification of the figure. In case the tool misidentified the figure, the
context was removed from the list and was not accepted for analysis.
Figure 1. Screenshot of the PRD tool output interface
The findings were organized in tables reflecting the rhythm figure statistics for
each text. Since the data are very extensive, they cannot be presented at large in the
body of the article, so we describe them in the form of text supplemented with short
summary tables (Table 1 and Table 2).
Diacope is the most frequent rhythm figure in English fictional texts of the 19th-21st
centuries, ranging between 800 and 9 000 units per text depending on the text size
and the peculiarities of the author’s style:
(1) It may hate him who dares to scrutinise <…> but hate as it will, it is indebted to him
(Ch. Bronte “Jane Eyre”).
The PRD tool demonstrates 87% accuracy of diacope identification (Table 1),
which we undoubtedly consider high. The errors introduced by PRD mainly stem from
the use of stop words which may prove to be a decisive factor for determining the type
of repetition. As has been mentioned previously, all contexts undergo manual
verification for errors as well as cross-identifications:
(2) I thought of course you'd want to see her - I don't want to see her! (I. Murdoch “The
Black Prince”).
In this context the tool reports diacope because the “her” form is filtered out as a
stop word, whereas the figure is in fact epiphora.
Polysyndeton is second only to diacope in relation to the frequency of use:
(3) In fact, he's alert and empty-headed and inexplicably elated (I. McEwan “Saturday”).
Its detection accuracy is moderate at 77%. Some errors occur because the tool does not
distinguish between, for example, the preposition ‘for’ and the conjunction ‘for’:
(4) <…>for Jay Strauss, for there was a possibility of <…> (I. McEwan “Saturday”).
Some inaccuracy of the identification can be explained by the length of the
sentences where the conjunction is repeated not to achieve an artistic effect, but to
connect clauses in one sentence:
(5) Don't you really know, Durbeyfield, that you are the lineal representative of the
ancient and knightly family of the d'Urbervilles, <…> that renowned knight
<…> (Th. Hardy “Tess of the D’Urbervilles”).
Based on the results of the automated processing of all English texts considered,
anaphora ranks third for the frequency of use in English fiction:
(6) Many strange arms were twined round strange bodies. Many liaisons, some
permanent, were formed in the night <…>. (M. Spark “The Girls of Slender
Means”).
The accuracy of anaphora identification is very high, amounting to 92.5%. The
errors are mainly related to cross-identification of anaphora, epizeuxis and symploce,
where the tool attributes a case to two classes simultaneously, e.g. epizeuxis and anaphora:
(7) Only Bradley. Only Bradley. (I. Murdoch “The Black Prince”).
A few cases of misidentification are connected to semantic heterogeneity of the
repeated units associated with different denotata and included in different types of
speech (direct and indirect):
(8) She told me.” [end of dialogue, new paragraph] She appraised him a moment, then
stood <…>. (J. Fowles “The Ebony Tower”).
Epiphora runs fourth in frequency after diacope, polysyndeton and anaphora:
(9) Parallel to this, but further from the fire, is a table with Madame's work-box; her two
pots of flowers, <…> and her books of devotion. But Madame reads more than
books of devotion. (E. Gaskell “French Life”).
Compared to diacope and anaphora, the accuracy for epiphora is significantly
lower and constitutes 69.9%. A large share of the errors is associated with
cross-identification of epiphora, epizeuxis and symploce (similarly to anaphora). Errors
stemming from the isolated location of the repeated units are not uncommon either.
There are a few overlaps with epanalepsis and diacope and a number of misdetections
of commas, hyphens, dashes and speech marks.
Epizeuxis is widely used in English fiction, but less frequently than
anaphora or epiphora:
(10) <…> and I walked along it through valleys and plateaus, valleys and plateaus (N.
Gaiman “M is for Magic”).
The accuracy of detection attains 72.8% on average, with the number of examples
ranging from 4 to 219 per text. A considerable proportion of them are adverbs such as
yes, no, all right, well and the exclamation ok, used for emphasis mainly in dialogue
rather than narrative:
(11) 'All right, all right,' he says querulously (I. McEwan “Saturday”).
The inaccuracy of detection can be explained by the fact that the PRD tool
sometimes identifies a simple repetition of words as epizeuxis, whereas the author uses
negative and positive forms with a different intent:
(12) He may be in denial, knowing and not knowing; knowing and preferring not to
think about it (I. McEwan “Saturday”).
What is more, the repetition of pronouns ‘you’ or ‘it’ is also identified as the
above-mentioned figure of speech:
(13) Let me reconstruct a scene for you: You were out in the garden <…> (N. Gaiman “M is
for Magic”).
Epanalepsis is among the least frequent rhythm figures, ahead of only
anadiplosis and symploce:
(14) Everyone was going to be a great writer, but everyone! (D. Lessing “The Golden
Notebook”).
The number of units per text ranges from 4 to 76 and does not allow for spotting
any particular trends in terms of its dependence on the time period the text belongs to,
the author’s gender or individual style. The tool accuracy is relatively low, constituting
56.01%. The errors are related to its being confused with epizeuxis and to the positional
remoteness of the repeated units (see anaphora, epiphora). A further type of error is tied
to the homonymy of forms recognized as epanalepsis:
(15) There were a great many words there. (I. Murdoch “The Black Prince”).
Anadiplosis comes seventh in terms of the frequency of use, although it is a very
important literary device that helps writers to draw readers’ attention to central
characters, their feelings, and the most significant events, etc.:
(16) And then, <…>, I’m falling. I’m falling into a black tunnel, the same black tunnel<…>
(S. Thomas “The End of Mr Y”).
One of the most common cases is the use of proper names:
(17) What’s he on about, Baxter? Baxter shoves the broken wing mirror <…> (I. McEwan
“Saturday”).
According to the statistics, the accuracy of anadiplosis detection is 71.5%, which is
relatively low. The main issue with its identification is that the PRD
tool detects anadiplosis when there is a repetition of personal pronouns, auxiliary
verbs, question words and demonstrative pronouns:
(18) So I fooled you. You were out of position. (I. McEwan “Saturday”).
Symploce is the least frequent rhythm figure of speech found in our corpus of
English fictional texts:
(19) Maybe it was too late. Maybe we got her too late. (R. Galbraith / J.K. Rowling “The
Cuckoo’s Calling”).
In 1/8 of the texts the PRD tool did not detect any examples of it at all. In the rest
of the texts the number of symploce cases varies from 1 to 12 per text. The accuracy of the
identification of this figure is rather low, reaching only 48.6%. There are quite a few
overlaps with anaphora and epiphora, as the PRD tool regards the repetition of the
whole sentence as symploce:
(20) Get out and run. Get out and run. (S. Thomas “The End of Mr Y”).
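The overlap between anaphora, epiphora, symploce and whole-sentence repetition can be made concrete with a small sketch that assigns exactly one label to a pair of neighbouring sentences. The priority order used here (identical sentences first, then symploce, then anaphora or epiphora) is a hypothetical disambiguation rule, not the PRD's behaviour, and stop words are left out for brevity.

```python
import re

def words(text):
    """Lowercased word tokens (Latin or Cyrillic letters, inner apostrophes kept)."""
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())

def classify_pair(first, second):
    """Assign one label to two neighbouring sentences, or None if nothing repeats."""
    a, b = words(first), words(second)
    if not a or not b:
        return None
    if a == b:
        return "sentence repetition"  # as in example (20); not counted as symploce here
    same_start = a[0] == b[0]
    same_end = a[-1] == b[-1]
    if same_start and same_end:
        return "symploce"
    if same_start:
        return "anaphora"
    if same_end:
        return "epiphora"
    return None

print(classify_pair("Maybe it was too late.", "Maybe we got her too late."))
print(classify_pair("Get out and run.", "Get out and run."))
```

Checking for identical sentences before the positional tests keeps a case like example (20) out of the symploce count, at the cost of fixing one interpretation where a human reader might accept several.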
Table 1
Accuracy of automated rhythm figure detection in 50 English texts

Devices                 Found by the instrument   Real quantity   Accuracy (%)
diacope                 137 958                   120 023         87.00
epanalepsis             1 105                     619             56.01
epiphora                3 090                     2 160           69.90
anaphora                9 808                     9 072           92.50
symploce                183                       89              48.60
epizeuxis               3 288                     2 396           72.80
anadiplosis             1 029                     736             71.50
polysyndeton            53 984                    41 567          77.00
Sum total of devices    210 445                   176 662         83.94
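The accuracy column of Table 1 appears to be the share of tool-detected cases that survived manual verification. The sketch below recomputes it under that assumption; the dictionary name and the `accuracy` helper are introduced here for illustration. Since misidentified contexts were only removed from the tool's list, the measure corresponds to what is usually called precision, and figures missed by the tool are not reflected in it.

```python
# (found by the tool, confirmed manually), copied from Table 1
TABLE_1 = {
    "diacope": (137958, 120023),
    "epanalepsis": (1105, 619),
    "epiphora": (3090, 2160),
    "anaphora": (9808, 9072),
    "symploce": (183, 89),
    "epizeuxis": (3288, 2396),
    "anadiplosis": (1029, 736),
    "polysyndeton": (53984, 41567),
}

def accuracy(found, confirmed):
    """Percentage of tool hits that were confirmed by manual checking."""
    return 100.0 * confirmed / found

total_found = sum(found for found, _ in TABLE_1.values())
total_confirmed = sum(confirmed for _, confirmed in TABLE_1.values())

for name, (found, confirmed) in TABLE_1.items():
    print(f"{name:<13}{accuracy(found, confirmed):6.2f}")
print(f"{'total':<13}{accuracy(total_found, total_confirmed):6.2f}")
```

The recomputed values agree with the published column to within rounding in the second decimal place.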
As can be seen from Table 2 below, the rhythm figure pattern of English
fictional texts changes throughout the centuries. A steady decline in the use of diacope
and polysyndeton is among the most notable trends. Although no objective evidence has
been collected so far, we can hypothesize that such a tendency could be explained by the
20th-21st century authors taking less interest in narrative development and
focusing their effort on the unfolding and improvement of dialogues which are intended
to serve an artistic mould of spontaneous speech. Dialogue (speech)-centred texts are
likely to witness an increase in the use of anaphora, which is another trend indicated by
the research data. The fact is that anaphora is one of the most powerful rhetoric means
capable of producing a strong and convincing impression and thus frequently resorted
to by the speakers to reach their audience. Interestingly, many of the authors analyzed
are (were) university professors or lecturers, which suggests well-developed
speaking skills. The accelerating trend in the use of anaphora in written
texts, as well as the dramatic rise in the use of epizeuxis and epiphora in 20th
century fiction, could also be (have been) inspired by the employment of these rhythm
figures in the audio and audio-visual media: radio, TV and cinema, in the first place.
Finally, a connection could be established between the increase in the use of the above
figures and the growing complexity of the genres and plots of modern fiction, whereby
the clarity as well as the persuasive effect could be achieved through an enhanced role
of rhetoric figures.
Table 2
Rhythm figure distribution statistics for English texts

Devices         XIXc.     XXIc.
diacope         49 432    31 788
epanalepsis     206       203
epiphora        457       738
anaphora        2 380     3 528
symploce        19        39
epizeuxis       806       667
anadiplosis     240       236
polysyndeton    16 638    11 526
Russian-language text analysis
Russian-language texts also cover the period from the 19th to the 21st century (see endnote i). As
is the case with the English texts under analysis, the total number of words in the
Russian texts in this research is around 1,500,000 per century, i.e. approximately
4,500,000 in total.
Polysyndeton. The frequency of its use is very high, and the accuracy of its detection
reaches 86.6% (Table 3). The most common conjunction for polysyndeton is и, which can be
repeated in the text from 2 to 5 times depending on the author:
(21) В другом случае характер его был чрезвычайно мрачен, и когда напивался он
пьян, то прятался в бурьяне, и семинарии стоило большого труда его
сыскать там. (N. Gogol “Viy”).
However, in A. Terekhov's work “The Germans”, for example, this conjunction is
repeated 9 times within one phrase.
Diacope comes second in Russian texts, averaging about 3,000 cases per text:
(22) Староста расчесал себе бороду и важно упирается на палочку из соседней
рощи, палочку, известную многим в деревне. (V. Sollogub “Serezha”).
The tool accuracy in detecting diacope is 72.06% (Table 3); the error rate is quite
large and arises out of cross-identification of diacope, anaphora, epiphora, epanalepsis,
syntactical parallelism, epizeuxis and chiasmus.
Anaphora ranks third for the frequency of use. The level of accuracy in
identifying anaphora is very high reaching 90.13%. As has been mentioned, the errors
mainly occur due to its cross-identification with diacope:
(23) Бабушка до сих пор любит его без памяти <…> Бабушка знала, что Сен-
Жермен мог располагать большими деньгами. (A. Pushkin “The Queen of
Spades”)
or epizeuxis:
(24) Где доктор? Где доктор, я вас спрашиваю! (A. Strugatsky, B. Strugatsky “Hard
to be a God”).
It should be noted that pronominal anaphora prevails over other types making
90% of cases.
Epizeuxis. In terms of its detection by the tool, the degree of accuracy is 87.89%.
(25) Прощайте, прощайте, храни вас господь! (F. Dostoevsky “Poor Folk”).
In some cases, when the number of repeated elements is greater than two, only
the first and last elements are detected by the tool, which attributes the example to
epanalepsis, for instance:
(26) Мой, мой, мой! (I. Turgenev “Annouchka”).
The tool sometimes labels a repetition as epizeuxis where the figure is in fact
anadiplosis (see example 32). This happens because a comma can separate both
homogeneous elements and parts of the sentence.
Epiphora. The tool ensures 87.49% accuracy in detecting epiphora:
(27) А я буду плясать. Жену, детей малых брошу, а пред тобой буду плясать. (F.
Dostoevsky “The Idiot”).
The errors here are reminiscent of those described above and include mistaking
epizeuxis for epiphora:
(28) Я игра-ать мной не позво-олю! Не позво-олю (A. Chekhov “The Duel”),
and misidentification of repeated initials and abbreviations consisting of repeated
letters:
(29) <…> Харитонов А. А. (O. Slavnikova “The Immortal”).
Anadiplosis. The use of words at the junction of the parts of the sentence and
sentences is detected by the tool very well achieving a high level of accuracy which is
89.21%:
(30) <…> он прошел в кабинет. Кабинет медленно осветился внесенной свечой (L.
Tolstoy “Anna Karenina”).
Regarding the improvements that should be made to the tool, abbreviations with
punctuation ought to be taken into consideration:
(31) <…> при своем превосходном уме и положительном знании жизни и пр. и пр.,
<…> (F. Dostoevsky “The Idiot”).
In the following example, the repeated elements are identified as epizeuxis,
although according to the meaning and structure of the sentence this repetition
corresponds to anadiplosis:
(32) Это был наш общий язык, язык, подаренный мне ею, <…> (E. Vodolazkin “The
Abduction of Europa”).
Epanalepsis. A relatively high level of accuracy for epanalepsis (70.79%)
suggests that the tool specifications are sound:
(33) Аглая мне урок дала; спасибо тебе, Аглая. (F. Dostoevsky “The Idiot”).
The tool misdetects epanalepsis confusing it with epizeuxis in the following
examples:
(34) Секунда… секунда… (V. Pikul “Requiem for Convoy”).
In order to avoid such errors, the presence of intermediate components between
repeated units should be included in the tool specifications.
Symploce. The frequency of its use is very low in the texts. The level of accuracy
of symploce detection by the tool is quite high, constituting 72.84%.
(35) Он никак не ожидал того, что он увидал и почувствовал у брата. Он ожидал
найти то же состояние самообманыванья, <…> во время осеннего приезда
брата. (L. Tolstoy “Anna Karenina”).
In some cases, the tool registers symploce where the conjunction и is repeated at
the beginning of the sentences and a content word is repeated at their end, treating
the conjunction on a par with the content word:
(36) И тогда мать заплачет. И.., может, он тоже заплачет (V. Pikul “Requiem for
Convoy”).
These examples are ambiguous, because, on the one hand, the repetition of the
conjunction и can be anaphoric and can bear a certain meaning, and, on the other hand,
the roles of the link-word and the content word are not equal.
Table 3
Accuracy of automated rhythm figure detection in 50 Russian texts

Devices                 Found by the instrument   Real quantity   Accuracy (%)
diacope                 30 701                    22 123          72.06
epanalepsis             493                       349             70.79
epiphora                2 542                     2 224           87.49
anaphora                4 033                     3 635           90.13
symploce                81                        59              72.84
epizeuxis               3 855                     3 388           87.89
anadiplosis             760                       678             89.21
polysyndeton            40 852                    35 376          86.60
Sum total of devices    83 317                    67 832          81.41
The century-based findings recorded for the Russian literary texts are
summarized in Table 4 and reveal a decline in the use of rhythm figures from the 19th to
the 21st century. It is an observation so far. Still, the figures allow for an assumption that
the above tendency may testify to changes in the literary language quality or other
important processes. However, it undoubtedly requires further comprehensive research
which will focus on other linguistic parameters of the text structure alongside text rhythm
exploration.
Table 4
Rhythm figure distribution statistics for Russian texts

Devices         XIXc.     XXc.     XXIc.
diacope         7 838     7 277    7 008
epanalepsis     163       128      49
epiphora        994       797      433
anaphora        1 307     1 278    1 050
symploce        24        18       16
epizeuxis       1 759     852      694
anadiplosis     250       220      208
polysyndeton    18 287    8 095    8 774
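The decline reported for the Russian texts can be quantified directly from Table 4. The sketch below computes the relative change between the 19th and the 21st century for each figure; the variable names are introduced here, and the counts are raw totals that have not been normalized by corpus size (the paper gives roughly 1,500,000 words per century, so the columns should be broadly comparable).

```python
# Counts copied from Table 4 (19th, 20th, 21st century).
# The 21st-century epizeuxis value 694 is read from the source's "6 94".
TABLE_4 = {
    "diacope": (7838, 7277, 7008),
    "epanalepsis": (163, 128, 49),
    "epiphora": (994, 797, 433),
    "anaphora": (1307, 1278, 1050),
    "symploce": (24, 18, 16),
    "epizeuxis": (1759, 852, 694),
    "anadiplosis": (250, 220, 208),
    "polysyndeton": (18287, 8095, 8774),
}

def percent_change(start, end):
    """Relative change from start to end, in percent."""
    return 100.0 * (end - start) / start

changes = {name: percent_change(c19, c21) for name, (c19, _c20, c21) in TABLE_4.items()}

for name, change in sorted(changes.items(), key=lambda kv: kv[1]):
    print(f"{name:<13}{change:7.1f}%")
```

Every figure is less frequent in the 21st-century texts than in the 19th-century ones, although the decline is not monotonic in every row: polysyndeton is slightly more frequent in the 21st-century column than in the 20th-century one.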
Discussion
With regard to the research results, we consider it essential to address the causes
of the discrepancies noticed when testing the tool:
1. Lower than expected accuracy in detecting diacope, epanalepsis, epizeuxis and
symploce resulting from their cross-identification and automatic attribution to
several classes: the solution to the problem is seen in the introduction of new
stop words and word units (“had had”, “was (.,;)was”, “that that”, “you (.,;) you”,
etc.) as well as accounting for intermediate words between repeated units;
2. Misdetection of punctuation marks (commas, hyphens, dashes and quotations)
preventing the tool from accurately identifying certain rhythm figures, diacope,
anaphora and epanalepsis in the first place;
3. Misrecognizing of initials (with a full stop) as full-fledged sentences: the above
problems can be solved by defining specifications for such cases, e.g. listing the
relevant punctuation marks as stop words;
4. Confusion of rhythm figures, e.g. epiphora and mimesis (the latter is currently
not on the list of rhythm figures available for the tool) which calls for the
necessity of formulating a set of specific rules for the case.
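The stop units proposed in item 1 (“had had”, “you (.,;) you” and similar) can be expressed as a single pattern: a word, an optional punctuation mark, and the same word again. The sketch below is one possible reading of that proposal; the `STOP_UNITS` set and the function name are hypothetical and do not come from the PRD code.

```python
import re

# Hypothetical stop units modelled on the examples in item 1.
STOP_UNITS = {"had", "was", "that", "you"}

# A word, an optional . , or ; then whitespace and the same word again.
REPEAT = re.compile(r"\b(\w+)\b[.,;]?\s+\1\b", re.IGNORECASE)

def adjacent_repeats(text, stop_units=STOP_UNITS):
    """Immediately repeated words, skipping those listed as stop units."""
    return [
        m.group(1).lower()
        for m in REPEAT.finditer(text)
        if m.group(1).lower() not in stop_units
    ]

print(adjacent_repeats("So I fooled you. You were out of position."))
print(adjacent_repeats("Прощайте, прощайте, храни вас господь!"))
```

With these stop units the repetition in example (18) is no longer reported, while the emphatic repetition in example (25) is kept.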
Conclusions
The PRD tool has demonstrated a rather high level of accuracy in detecting
rhythm figures: 83.94% for English texts and 81.41% for Russian texts.
Some of the statistical errors discovered in the course of the research can be
rectified by compiling more comprehensive as well as better targeted stop-word lists,
reducing text portions intended for automated analysis and making other rhythm
figures available for the tool, which is almost certain to improve its accuracy level.
However, not all statistical uncertainties can be eliminated. This is particularly true for
misidentifications stemming from homonymy (polysemy) and other content-based
phenomena.
The different quantity of rhythm figures in the texts of different authors allows
for an assumption that each author has their own bank of rhythm-based stylistic
devices. Thus, this tool can also be used for the identification of specific features of
authors’ idiolect and style.
The quantitative research of rhythm figures in English and Russian fictional texts
covering a time span of three centuries has demonstrated a more extensive use of the
above figures in English fiction as compared to Russian. This can be explained by the
peculiarities of the languages' morphological, lexical and semantic structures as well as
their principles of clause and sentence construction. The accuracy of automated rhythm
figure identification is high for both languages: over 83% for English and over 81% for
Russian.
The quantitative data concerning the distribution of rhythm figures show a
downward temporal trend in rhythm figure use in Russian fiction. English
fiction witnesses a steady decline in the use of diacope and polysyndeton along with an
appreciable rise in the use of anaphora. Further comprehensive research is required to
conclude on the statistics obtained.
References
Balint, M., & Trausan-Matu, S. (2016). A critical comparison of rhythm in music and
natural language. Annals of the Academy of Romanian Scientists, Series on Science
and Technology of Information, 9(1), 43-60.
Boychuk, E., & Belyaeva, O. (2019). La téchnique de stylométrie réalisée à la base de
l’analyse informatique du rythme du texte. 10-ièmes Journées Internationales de
Linguistique de Corpus (JLC). Université Grenoble-Alpes, 26-28.11.19. 163-167.
Boychuk, E., Vorontsova, I., Shliakhtina, E., Lagutina, K., & Belyaeva, O. (2020). Automated
Approach to Rhythm Figures Search in English Text. In Wil M. P. van der Aalst, V.
Batagelj, D. I. Ignatov, M. Khachay, V. Kuskova, A. Kutuzov, S. O. Kuznetsov, I. A.
Lomazova, N. Loukachevitch, A. Napoli, P. M. Pardalos, M. Pelillo, A. V. Savchenko,
E. Tutubalina (Eds.), Analysis of Images, Social Networks and Texts. AIST 2019,
Communications in Computer and Information Science, vol 1086 (pp. 107-119).
Springer. https://doi.org/10.1007/978-3-030-39575-9_11
Dubremetz, M., & Nivre, J. (2018). Rhetorical Figure Detection: Chiasmus, Epanaphora,
Epiphora. Frontiers in Digital Humanities, 5(10). 1-16.
https://doi.org/10.3389/fdigh.2018.00010
Dumalus, A., & Fernandez, P. (2011). Authorship attribution using writers rhythm based
on lexical stress. Proceedings of the 11th Philippine Computing Science Congress,
82-88.
Hou, R., & Huang, C. (2020). Robust stylometric analysis and author attribution based on
tones and rimes. Natural Language Engineering, 26(1), 49-71.
https://doi.org/10.1017/S135132491900010X
Lagutina, K., Lagutina, N., Boychuk, E., Vorontsova, I., Shliakhtina, E., Belyaeva, O.,
Paramonov, I. (2019). A Survey on Stylometric Text Features. 25th Conference of
Open Innovations Association (FRUCT), Helsinki, Finland, 2019, 184-195.
https://doi.org/10.23919/FRUCT48121.2019.8981504
Larionov, V., Petryakov, V., Poletaev, A., Lagutina, K., Manakhova, A., Lagutina, N., &
Boychuk, E. (2020). ProseRhythmDetector. K.D. Ushinsky Yaroslavl State
Pedagogical University, Yaroslavl, Russia.
https://github.com/text-processing/prose-rhythm-detector
Plecháč, P., Bobenhausen, K., & Hammerich, B. (2018). Versification and authorship
attribution. A pilot study on Czech, German, Spanish, and English poetry. Studia
Metrica et Poetica, 5(2), 29-54. https://doi.org/10.12697/smp.2018.5.2.02
Reviewers:
1. Anonymous
2. Anonymous
Handling Editor:
Assoc. Prof. Boris Naimushin, PhD, New Bulgarian University
i Note:
English-language text analysis is based on works by:
19th century - Charles Dickens, Charlotte Brontë, Elizabeth Gaskell, Jane Austen, Thomas
Hardy
20th century - Robert Louis Stevenson, Virginia Woolf, James Joyce, Iris Murdoch, Muriel
Spark, Daphne du Maurier, John Fowles, David Herbert Lawrence, Doris Lessing
21st century - Ian McEwan, Neil Gaiman, Scarlett Thomas, J. K. Rowling, Sebastian
Faulks, Jenny Colgan, Kazuo Ishiguro, Paula Hawkins, Sarah Perry, Ruth Hogan, Tony
Parsons
Russian-language text analysis is based on works by:
19th century - Nikolay Gogol, Fyodor Dostoevsky, Alexander Pushkin, Vladimir Sollogub,
Leo Tolstoy, Ivan Turgenev, Anton Chekhov
20th century - Ivan Bunin, Alexander Grin, Mikhail Bulgakov, Maxim Gorky, Vasily
Aksenov, Valentin Pikul, Sergey Dovlatov, Victor Pelevin, Alexander Prokhanov
21st century - Eugeny Vodolazkin, Vladimir Mikushevich, Zakhar Prilepin, Alexander
Terekhov, Dmitry Bykov, Olga Slavnikova
Authors’ note
Ksenia Lagutina, MSc in Computer Science, is a Postgraduate Student and Assistant Professor
with the Department of Theoretical Informatics, P.G. Demidov Yaroslavl State University, Russia.
E-mail: lagutinakv@mail.ru http://orcid.org/0000-0002-1742-3240
Inna Vorontsova, PhD, is an Associate Professor with the Department of Translation and
Interpretation, K.D. Ushinsky Yaroslavl State Pedagogical University, Russia. She teaches
courses in lexicography, legal translation, and English as a foreign language. Her research
interests lie in the field of lexicology and lexicography, discourse analysis and theory of
translation.
E-mail: arinna1@yandex.ru http://orcid.org/0000-0001-5897-9299
Elena Mishenkina, PhD, is an Associate Professor with the Department of Translation and
Interpretation, K.D. Ushinsky Yaroslavl State Pedagogical University, Russia. She teaches
courses in translation, intercultural communication, and English as a foreign language. Her
research interests include linguistics, psycholinguistics, ethnosociolinguistics, communication
theory and theory of translation.
E-mail: vitalt@mail.ru https://orcid.org/0000-0002-1314-4156
Olga Belyayeva is an Assistant Professor with the Department of Translation and
Interpretation, K.D. Ushinsky Yaroslavl State Pedagogical University, Russia. She teaches
courses in translation and English as a foreign language. Her research interests are related to
applied linguistics, corpus linguistics and English literature.
E-mail: olbelyaeva@yandex.ru https://orcid.org/0000-0003-3658-7336
A critical comparison of rhythm in music and natural language
  • M Balint
  • S Trausan-Matu
Balint, M., & Trausan-Matu, S. (2016). A critical comparison of rhythm in music and natural language. Annals of the Academy of Romanian Scientists, Series on Science and Technology of Information, 9(1), 43-60.
Automated Approach to Rhythm Figures Search in English Text
  • E Boychuk
  • I Vorontsova
  • E Shliakhtina
  • K Lagutina
  • O Belyaeva
  • M P Wil
  • V Van Der Aalst
  • D I Batagelj
  • M Ignatov
  • V Khachay
  • A Kuskova
  • S O Kutuzov
  • I A Kuznetsov
  • N Lomazova
  • A Loukachevitch
  • P M Napoli
  • M Pardalos
  • A Pelillo
Boychuk E., Vorontsova I., Shliakhtina E., Lagutina K., Belyaeva O. (2020) Automated Approach to Rhythm Figures Search in English Text. In Wil M. P. van der Aalst, V. Batagelj, D. I. Ignatov, M. Khachay, V. Kuskova, A. Kutuzov, S. O. Kuznetsov, I. A. Lomazova, N. Loukachevitch, A. Napoli, P. M. Pardalos, M. Pelillo, A. V. Savchenko, E. Tutubalina (Eds.), Analysis of Images, Social Networks and Texts. AIST 2019, Communications in Computer and Information Science, vol 1086 (pp. 107-119).
Tony Parsons Russian-language text analysis is based on works by: • 19 th century -Nikolay Gogol, Fyodor Dostoevsky
  • V Larionov
  • V Petryakov
  • A Poletaev
  • K Lagutina
  • A Manakhova
  • N Lagutina
  • E Boychuk
  • K Joan
  • Sebastian Rowling
  • Jenny Faulks
  • Kazuo Colgan
  • Paula Ishiguro
  • Sarah Hawkings
  • Ruth Perry
  • Hogan
Larionov, V., Petryakov, V., Poletaev, A., Lagutina, K., Manakhova, A., Lagutina, N. and Boychuk, E., (2020). ProseRhythmDetector. K.D. Ushinsky Yaroslavl State Pedagogical University, Yaroslavl, Russia. https://github.com/textprocessing/prose-rhythm-detector i Note: English-language text analysis is based on works by: • 19 th century -Charles Dickens, Charlotte Bronte, Elizabeth Gaskell, Jane Austen, Thomas Hardy • 20 th century -Robert Lewis Stevenson, Virginia Woolf, James Joyce, Iris Murdoch, Muriel Spark, Daphne du Maurier, John Fowles, David Herbert Lawrence, Doris Lessing • 21 st century -Ian McEwan, Neil Gaiman, Scarlett Thomas, Joan K. Rowling, Sebastian Faulks, Jenny Colgan, Kazuo Ishiguro, Paula Hawkings, Sarah Perry, Ruth Hogan, Tony Parsons Russian-language text analysis is based on works by: • 19 th century -Nikolay Gogol, Fyodor Dostoevsky, Alexander Pushkin, Vladimir Sollogub, Leo Tolstoy, Ivan Turgenev, Anton Chekhov • 20 th century -Ivan Bunin, Alexander Grin, Mikhail Bulgakov, Maxim Gorky, Vasily Aksenov, Valentin Pikul, Sergey Dovlatov, Victor Pelevin, Alexander Prokhanov • 21 st century -Eugeny Vodolazkin, Vladimir Mikushevich, Zakhar Prilepin, Alexander Terekhov, Dmitry Bykov, Olga Slavnikova