Article

Automated Search and Analysis of the Stylometric Features That Describe the Style of the Prose of 19th–21st Centuries

... In particular, Short [30] emphasizes the fundamentally argumentative role of stylistic figures in speech. Studying the features of styles and genres, scientists analyze stylistic figures in works of various genres: plays [3,21], poetry [6,13,19], poems [17], short stories [18,31], novels [5,33], etc. Kaftandjiev and Kotova [11] study the role of stylistic figures such as metaphor and synecdoche in teaching various disciplines in primary and secondary school, and analyze the role of stylistic figures in business and marketing communications. The works of Hoppmann [9] and Liubchenko et al. [15] are devoted to the analysis of stylistic devices in political discourse. ...
Article
This article examines stylistic figures as a factor in the formation of communicative intention in scientific linguistic texts. For a comprehensive study of stylistic figures in the Ukrainian language, it is extremely important to learn their basic functions in scientific linguistic articles. The relevance of the research topic is determined by the need for a systematic study of stylistic figures that are traditionally considered unusual for such texts; we try to show that they are relevant. The study of linguistic features in scientific discourse is important for finding ways of explaining a given material. The study, using a free associative experiment, concludes that the use of stylistic figures in educational and scientific texts makes it possible to master the material better. Generally speaking, stylistic figures are not widespread in scientific texts, but the student audience prefers texts that use them. If the text is addressed to the reader for educational purposes, the correct use of tropes will help the reader grasp the main idea of the message as quickly as possible.
... Refs. [7, 43, 57–73] ...
Article
Research in computational textual aesthetics has shown that there are textual correlates of preference in prose texts. The present study investigates whether textual correlates of preference vary across different time periods (contemporary texts versus texts from the 19th and early 20th centuries). Preference is operationalized in different ways for the two periods, in terms of canonization for the earlier texts, and through sales figures for the contemporary texts. As potential textual correlates of preference, we measure degrees of (un)predictability in the distributions of two types of low-level observables, parts of speech and sentence length. Specifically, we calculate two entropy measures, Shannon Entropy as a global measure of unpredictability, and Approximate Entropy as a local measure of surprise (unpredictability in a specific context). Preferred texts from both periods (contemporary bestsellers and canonical earlier texts) are characterized by higher degrees of unpredictability. However, unlike canonicity in the earlier texts, sales figures in contemporary texts are reflected in global (text-level) distributions only (as measured with Shannon Entropy), while surprise in local distributions (as measured with Approximate Entropy) does not have an additional discriminating effect. Our findings thus suggest that there are both time-invariant correlates of preference, and period-specific correlates.
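The two entropy measures named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the window length `m=2`, and the absolute tolerance `r=0.2` are assumptions chosen for illustration. The Approximate Entropy function follows the standard definition (difference of the averaged log match frequencies for windows of length m and m+1).

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy (in bits) of the empirical distribution of symbols,
    e.g. a sequence of part-of-speech tags."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def approximate_entropy(series, m=2, r=0.2):
    """Approximate Entropy of a numeric series (e.g. sentence lengths).

    m is the window length and r an absolute tolerance; both defaults
    are illustrative assumptions, not values taken from the study.
    The series must contain at least m + 1 values.
    """
    n = len(series)

    def phi(k):
        windows = [series[i:i + k] for i in range(n - k + 1)]
        log_sum = 0.0
        for w in windows:
            # Count windows within Chebyshev distance r of w (self-match included).
            matches = sum(
                1 for v in windows
                if max(abs(a - b) for a, b in zip(w, v)) <= r
            )
            log_sum += math.log(matches / len(windows))
        return log_sum / len(windows)

    return phi(m) - phi(m + 1)
```

In this sketch a uniform two-tag sequence gives a Shannon entropy of exactly one bit, and a constant series gives an Approximate Entropy of zero, since every window matches every other.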
Article
This paper presents a study of the problem of automatic classification of short coherent texts (essays) in English according to the levels of the international CEFR scale. Determining the level of a natural-language text is an important component of assessing students' knowledge, including checking open-ended tasks in e-learning systems. To solve this problem, vector text models were considered based on stylometric numerical features at the character, word, and sentence-structure levels. The resulting vectors were classified by standard machine learning classifiers. The article presents the results of the three most successful ones: Support Vector Classifier, Stochastic Gradient Descent Classifier, and Logistic Regression. Precision, recall, and F-score served as quality measures. Two open text corpora, CEFR Levelled English Texts and BEA-2019, were chosen for the experiments. The best classification results for six CEFR levels and sublevels from A1 to C2 were shown by the Support Vector Classifier, with an F-score of 67% on CEFR Levelled English Texts. This approach was compared with the application of the BERT language model (six different variants). The best model, bert-base-cased, achieved an F-score of 69%. The analysis of classification errors showed that most of them occur between neighboring levels, which is understandable from the point of view of the domain. In addition, the quality of classification depended strongly on the text corpus: the same text models produced significantly different F-scores on different corpora. In general, the obtained results showed the effectiveness of automatic text level detection and the possibility of its practical application.
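As a rough illustration of what a stylometric feature vector at the character, word, and sentence levels might look like, here is a small Python sketch. The four features and the function name `stylometric_vector` are hypothetical choices for illustration; the abstract does not list the actual features used in the paper.

```python
import re
import string


def stylometric_vector(text):
    """Return a few illustrative stylometric features of an English text.

    The feature set is an assumption made for this sketch, not the
    feature set used in the study.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z']+", text)
    n_chars = max(len(text), 1)
    n_words = max(len(words), 1)
    n_sents = max(len(sentences), 1)
    return {
        # Character level: share of punctuation characters.
        "punct_ratio": sum(ch in string.punctuation for ch in text) / n_chars,
        # Word level: mean word length and type-token ratio.
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len({w.lower() for w in words}) / n_words,
        # Sentence level: mean sentence length in words.
        "avg_sentence_len": len(words) / n_sents,
    }
```

Vectors of this kind could then be passed to any standard classifier; the choice of classifier is independent of the feature extraction step.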
Conference Paper
The paper is devoted to the automatic detection of rhythm in fiction and to investigating how the rhythm of prose texts changed over the 19th–21st centuries, based on the results of such detection. The authors developed algorithms that extract rhythm figures related to word repetitions (anaphora, epiphora, polysyndeton, etc.) and visualized their statistical features in plots and heat maps by decade, using British and Russian literature. The experiments made it possible to identify changes in rhythm across periods and to interpret their causes from a linguistic point of view.
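A very simplified Python sketch of repetition-based rhythm figures is given below. It flags anaphora (adjacent sentences that begin with the same word) and epiphora (adjacent sentences that end with the same word). The function names and the one-word matching rule are assumptions made for illustration; the authors' algorithms are not described in the abstract and are likely more elaborate.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokens(sentence):
    """Lowercased word tokens; \\w+ also matches Cyrillic letters."""
    return re.findall(r"\w+", sentence.lower())


def find_repetitions(text):
    """Return (anaphora, epiphora) as lists of (sentence_index, word) pairs.

    An entry (i, w) means sentences i and i + 1 share the first word
    (anaphora) or the last word (epiphora) w.
    """
    sents = [tokens(s) for s in split_sentences(text)]
    anaphora, epiphora = [], []
    for i in range(len(sents) - 1):
        a, b = sents[i], sents[i + 1]
        if not a or not b:
            continue
        if a[0] == b[0]:
            anaphora.append((i, a[0]))
        if a[-1] == b[-1]:
            epiphora.append((i, a[-1]))
    return anaphora, epiphora
```

Counting such matches per text and aggregating by decade would give the kind of statistics that can be shown in plots and heat maps.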
Article
Cross-lingual Text Classification (CLC) consists of automatically classifying, according to a common set C of classes, documents each written in one of a set of languages L, and doing so more accurately than when “naïvely” classifying each document via its corresponding language-specific classifier. To obtain an increase in the classification accuracy for a given language, the system thus needs to also leverage the training examples written in the other languages. We tackle “multilabel” CLC via funnelling, a new ensemble learning method that we propose here. Funnelling consists of generating a two-tier classification system where all documents, irrespective of language, are classified by the same (second-tier) classifier. For this classifier, all documents are represented in a common, language-independent feature space consisting of the posterior probabilities generated by first-tier, language-dependent classifiers. This allows the classification of all test documents, of any language, to benefit from the information present in all training documents, of any language. We present substantial experiments, run on publicly available multilingual text collections, in which funnelling is shown to significantly outperform a number of state-of-the-art baselines. All code and datasets (in vector form) are made publicly available.
Conference Paper
Recently, many historical texts have been digitized and made accessible for search and browsing. Professionals who work with collections of such texts often need to verify the correctness of the documents' key metadata: their creation dates. In this paper, we demonstrate an interactive system for estimating the age of documents. It may be useful not only for tagging a large number of undated documents, but also for verifying already known timestamps. To infer probable dates, we rely on a large-scale lexical corpus, Google Books Ngrams. Besides estimating the document creation year, the system also outputs evidence that supports the age detection and reasoning process, and allows testing different hypotheses about a document's age.
Conference Paper
We investigate temporal resolution of documents, such as determining the date of publication of a story based on its text. We describe and evaluate a model that builds histograms encoding the probability of different temporal periods for a document. We construct histograms based on the Kullback-Leibler divergence between the language model for a test document and supervised language models for each interval. Initial results indicate that this language modeling approach is effective for predicting the dates of publication of short stories, which contain few explicit mentions of years.
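The core comparison described here, a Kullback-Leibler divergence between a document's language model and a language model per time interval, can be sketched in Python as follows. This is a minimal illustration, not the authors' model: the unigram models, the add-one smoothing (`alpha=1.0`), and all function names are assumptions, and the sketch stops at per-period divergences rather than building a probability histogram.

```python
import math
from collections import Counter


def unigram_model(tokens, vocab, alpha=1.0):
    """Additively smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """D_KL(p || q) in nats for two distributions over the same keys."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)


def period_scores(doc_tokens, period_tokens):
    """Map each period label to the divergence between the document model
    and that period's model; a lower value means a closer match."""
    vocab = set(doc_tokens)
    for toks in period_tokens.values():
        vocab.update(toks)
    p = unigram_model(doc_tokens, vocab)
    return {
        period: kl_divergence(p, unigram_model(toks, vocab))
        for period, toks in period_tokens.items()
    }
```

For example, with toy period corpora, `min(scores, key=scores.get)` picks the period whose vocabulary is closest to the test document. Smoothing over a shared vocabulary keeps every probability nonzero, so the divergence is always finite.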
Article
The analysis of authorial style, termed stylometry, assumes that style is quantifiably measurable for evaluation of distinctive qualities. Stylometry research has yielded several methods and tools over the past 200 years to handle a variety of challenging cases. This survey reviews several articles within five prominent subtasks: authorship attribution, authorship verification, authorship profiling, stylochronometry, and adversarial stylometry. Discussions on datasets, features, experimental techniques, and recent approaches are provided. Further, a current research challenge lies in the inability of authorship analysis techniques to scale to a large number of authors with few text samples. Here, we perform an extensive performance analysis on a corpus of 1,000 authors to investigate authorship attribution, verification, and clustering using 14 algorithms from the literature. Finally, several remaining research challenges are discussed, along with descriptions of various open-source and commercial software that may be useful for stylometry subtasks.
Conference Paper
Rhythm analysis is widely used for texts in a poetic form to determine the individual style of the author, but rarely used in the analysis of prose due to technical problems and human factor influence. To overcome these issues we propose an automated approach that involves the development and use of specialized software for analyzing French literary prose at various stylistic levels: phonetic, lexical, and grammatical. The methods developed for rhythm analysis and implemented in the computer application cover a variety of the phonostylistic devices: calculation of the length of the rhythmic units, finding assonance, alliteration, rhyme, various repetitions, and others. Efficiency of the approach was proved experimentally by the analysis of rhythmization devices in the novels of four French writers. It was shown that the proposed automated approach allows the researcher to analyze the text 15 times faster than using the manual approach.
Metody matematicheskoi lingvistiki v stilisticheskikh issledovaniyakh (Methods of Mathematical Linguistics in Stylistic Studies)
  • G. Ya. Martynenko
On the problem of prose rhythm
  • N. I. Golubeva-Monatkina