February 2025
·
1 Read
Scientific and Technical Information Processing
February 2025
·
1 Citation
Automatic Control and Computer Sciences
December 2024
·
10 Reads
The assessment of text complexity is a significant applied problem with potential applications in drafting legal documents, editing textbooks, and selecting books for extracurricular reading. Different task formulations give rise to various types of text complexity that are weakly correlated. Despite this, researchers typically overlook cross-domain complexity assessment. This study evaluates the applicability of various linguistic features in assessing the complexity of Russian-language texts, adding two new groups of features (rhythmic and cohesion) to those previously studied and introducing a new group of features for lexical complexity. We perform both in-domain and cross-domain comparisons of the features. Our findings indicate that syntactic features are the most significant in terms of Mutual Information. In the in-domain context, lexical and morphological features were found to be the most beneficial, whereas in the cross-domain context, syntactic, morphological, and lexical features proved to be the most effective. Conversely, rhythmic and cohesion features did not significantly impact the quality of the assessment algorithms.
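The abstract above ranks feature groups by Mutual Information. As a rough illustration only, the sketch below estimates mutual information between one feature and a complexity label from co-occurrence counts. It assumes the feature has already been discretized (for example, binned); the function name `mutual_information` and the toy data are hypothetical and are not taken from the study.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Estimate mutual information (in nats) between two discrete sequences.

    xs: discretized feature values, one per text (assumption: already binned).
    ys: complexity labels, one per text.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log(c * n / (count_x[x] * count_y[y]))
    return mi


# Toy data: a feature that fully determines the label vs. an unrelated one.
labels = ["easy", "easy", "hard", "hard"]
informative = [0, 0, 1, 1]
uninformative = [0, 1, 0, 1]
print(mutual_information(informative, labels))
print(mutual_information(uninformative, labels))
```

A feature that determines the label has mutual information equal to the entropy of the label, while an independent feature scores zero, which is why the measure can be used to rank feature groups.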
June 2024
·
20 Reads
·
1 Citation
Modeling and Analysis of Information Systems
The paper presents a study of modern text models used to determine the semantic similarity of English-language texts. Determining the semantic similarity of texts is an important component of many areas of natural language processing: machine translation, information retrieval, question answering systems, and artificial intelligence in education. The authors addressed the problem of classifying how close student answers are to the teacher's reference answer. The study covered the neural network language models BERT and GPT, which have previously been used to determine semantic similarity of texts, the new neural network model Mamba, and stylometric features of the text. Experiments were carried out with two text corpora: the Text Similarity corpus from open sources and a custom corpus collected with the help of philologists. Solution quality was assessed by precision, recall, and F-measure. All neural network language models achieved a similar F-measure of about 86% on the larger Text Similarity corpus and 50–56% on the custom corpus. A completely new result was the successful application of the Mamba model. However, the most interesting result came from vectors of stylometric text features, which reached an F-measure of 80% on the custom corpus and matched the quality of the neural network models on the other corpus.
February 2024
·
8 Reads
·
1 Citation
Automatic Control and Computer Sciences
September 2023
·
224 Reads
·
2 Citations
Modeling and Analysis of Information Systems
This paper studies the automatic classification of short coherent English texts (essays) by level on the international CEFR scale. Determining the level of a natural-language text is an important component of assessing students' knowledge, including checking open-ended tasks in e-learning systems. To solve this problem, vector text models were built from stylometric numerical features at the character, word, and sentence-structure levels. The resulting vectors were classified with standard machine learning classifiers. The article presents the results of the three most successful ones: Support Vector Classifier, Stochastic Gradient Descent Classifier, and Logistic Regression. Precision, recall, and F-score served as quality measures. Two open text corpora, CEFR Levelled English Texts and BEA-2019, were chosen for the experiments. The best classification results for six CEFR levels and sublevels from A1 to C2 were shown by the Support Vector Classifier, with an F-score of 67% on CEFR Levelled English Texts. This approach was compared with the BERT language model (six different variants). The best model, bert-base-cased, reached an F-score of 69%. Analysis of the classification errors showed that most of them occur between neighboring levels, which is understandable from the point of view of the domain. In addition, classification quality depended strongly on the text corpus: the same text models produced significantly different F-scores on different corpora. Overall, the results show that automatic text level detection is effective and can be applied in practice.
February 2023
·
9 Reads
·
4 Citations
Automatic Control and Computer Sciences
December 2022
·
294 Reads
·
1 Citation
Modeling and Analysis of Information Systems
The article investigates modern vector text models for genre classification of Russian-language texts. The models include ELMo embeddings, a pre-trained BERT language model, and a set of numerical rhythm features based on lexico-grammatical features. The experiments were carried out on a corpus of 10,000 texts in five genres: novels, scientific articles, reviews, posts from the social network Vkontakte, and news from OpenCorpora. Visualization and analysis of statistics for rhythm features made it possible to identify both the most rhythmically diverse genres (novels and reviews) and the least diverse one (scientific articles). These genres were subsequently classified best using rhythm features and an LSTM neural network classifier. Clustering and classifying texts by genre using ELMo and BERT embeddings made it possible to separate one genre from another with a small number of errors. The multiclass F-score reached 99%. The study confirms the efficiency of modern embeddings in computational linguistics tasks and highlights the advantages and limitations of the rhythm feature set for genre classification.
April 2022
·
192 Reads
·
7 Citations
The paper is devoted to single-label topical classification of Russian news. The author compares BERT features with standard character-, word-, and structure-level features as text models. Experiments with OpenCorpora show that the BERT model is superior to the standard ones and achieves good classification quality on a small dataset of long news texts. Comparison with state-of-the-art research suggests that BERT can serve as a baseline for future work on the analysis of Russian texts.
December 2021
·
8 Reads
·
4 Citations
Automatic Control and Computer Sciences
... Lagutina et al. (2021) report that implementing "rhythmic patterns" to range research articles, advertisement, tweets, novels, reviews and political articles into classes resulted in the highest accuracy (F1=98%) for fiction. Two years later the same group of researchers, using a similar algorithm, accomplished an even more ambitious task classifying novels, articles, reviews, VKontakte posts and OpenCorpora news with even higher accuracy (F1=99%) (Lagutina, 2023). A more challenging task, i.e. taxonomy of ten genres, including Fiction, Fantasy, Detectives, Prose, History. ...
February 2024
Automatic Control and Computer Sciences
... Next, part-of-speech n-grams are counted for n = 1, 2, 3, 4. Among all n-grams in the corpus, the 40 most frequent are selected for each value of n, and for each individual text the frequency of occurrence of individual n-grams is computed relative to the selected ones. These features have already proven successful in text processing in the authors' previous work [24]. ...
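The part-of-speech n-gram features described in the excerpt above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each text is already given as a list of part-of-speech tags, and it reads "relative to the selected ones" as dividing each n-gram count by the total count of selected n-grams in the text, which is one possible interpretation. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tags, n):
    """Return all contiguous n-grams of a tag sequence as tuples."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def select_top_ngrams(corpus_tags, n, k=40):
    """Pick the k most frequent n-grams over the whole corpus."""
    counts = Counter()
    for tags in corpus_tags:
        counts.update(ngrams(tags, n))
    return [gram for gram, _ in counts.most_common(k)]


def ngram_features(tags, selected, n):
    """Frequency of each selected n-gram in one text.

    Assumption: counts are normalized by the total number of occurrences
    of selected n-grams in the text.
    """
    selected_set = set(selected)
    counts = Counter(g for g in ngrams(tags, n) if g in selected_set)
    total = sum(counts.values())
    return [counts[g] / total if total else 0.0 for g in selected]


# Toy corpus of two pre-tagged texts (tags are illustrative).
corpus = [["NOUN", "VERB", "NOUN"], ["NOUN", "VERB", "ADJ"]]
for n in (1, 2, 3, 4):
    selected = select_top_ngrams(corpus, n, k=40)
    vectors = [ngram_features(tags, selected, n) for tags in corpus]
    print(n, selected, vectors)
```

Concatenating the vectors for n = 1 to 4 gives one fixed-length feature vector per text, with at most 160 components when k = 40.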
September 2023
Modeling and Analysis of Information Systems
... Numerical rhythmic characteristics are based on rhythmic schemes that directly appear in the text [19], [20]. This study uses a set of lexical and grammatical schemes: ...
February 2023
Automatic Control and Computer Sciences
... To determine the amount of information at the level of its semantic content (semantic level), the thesaurus measure (Groot et al., 2016; Lagutina et al., 2016; LeCun et al., 2015; Mai et al., 2017; Mai et al., 2018; Wilson et al., 2019). This characteristic determines semantic properties through the student's ability to accept (perceive and assimilate) the information received (Chernigovskaya et al., 2016; Kiselev, 2018; Popova, 2012). ...
January 2016
Modeling and Analysis of Information Systems
... A text's stylistic features serve as markers of different genres and are frequently employed for automatic analysis in this field because they represent a text's structural quirks, among other things [1,2]. Automatic genre classification makes it possible to solve several computational linguistics problems more quickly, including figuring out a word or phrase's meaning or part of speech, locating documents that are pertinent to a semantic query [3], improving authorship attribution [4][5][6], and more [2,7]. ...
December 2022
Modeling and Analysis of Information Systems
... First, the texts of the corpus were tokenized and lemmatized using the Stanza library (Qi et al., 2020) for the Python 3.7 programming language. We chose this library because it showed good results in processing both structured and unstructured text data of various genres in Russian (Lagutina, 2022; Mamaev et al., 2023). Secondly, on the basis of the Russian National Corpus and a Frequency Dictionary of Russian (Lyashevskaya & Sharov, 2009), a list of stop-words was compiled to exclude lexical units that do not contain an important semantic component: prepositions, conjunctions, auxiliary words. ...
April 2022
... In particular, Short [30] emphasizes the fundamentally argumentative role of stylistic figures in speech. Studying the features of styles and genres, scientists analyze stylistic figures in works of various genres: plays [3,21], poetry [6,13,19], poems [17], short stories [18,31], novels [5,33] etc. Kaftandjiev and Kotova [11] study the role of stylistic figures such as metaphor, synecdoche for the study of various disciplines in primary and secondary school, analyze the role of stylistic figures in communications related to business and marketing. The works of Hoppmann [9] and Liubchenko et al. [15] are devoted to the analysis of stylistic devices in political discourse. ...
December 2021
Automatic Control and Computer Sciences
... • As character-based features. Character- and word-based features are taken from my previous research [18], where these text models show good results in authorship verification. Character-based features include average sentence length in characters, frequencies of occurrences of each letter among all letters, and frequencies of occurrences of each punctuation mark among all punctuation in the text. ...
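The three character-based features named in the excerpt above can be sketched in a few lines. This is only an illustration under stated assumptions: sentences are split naively on whitespace that follows `.`, `!` or `?`, letters are lowercased before counting, and punctuation is detected through Unicode categories. The function name `character_features` and these choices are hypothetical, not the author's implementation.

```python
import re
import unicodedata
from collections import Counter


def character_features(text):
    """Compute simple character-level stylometric features for one text."""
    # Assumption: naive sentence split after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    avg_len = sum(len(s) for s in sentences) / len(sentences) if sentences else 0.0

    # Letter frequencies among all letters (case-insensitive by assumption).
    letters = Counter(ch.lower() for ch in text if ch.isalpha())
    n_letters = sum(letters.values())

    # Punctuation frequencies among all punctuation marks
    # (Unicode categories starting with "P").
    puncts = Counter(ch for ch in text if unicodedata.category(ch).startswith("P"))
    n_puncts = sum(puncts.values())

    return {
        "avg_sentence_length": avg_len,
        "letter_freq": {ch: c / n_letters for ch, c in letters.items()},
        "punct_freq": {ch: c / n_puncts for ch, c in puncts.items()},
    }


features = character_features("Hi there. Go!")
print(features["avg_sentence_length"])
print(features["punct_freq"])
```

Because letters and punctuation are identified by Unicode properties rather than an ASCII list, the same sketch works on Cyrillic text and on marks such as guillemets.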
October 2021
Modeling and Analysis of Information Systems
... classification accuracy for all 5 classes. Lagutina et al. (2021) report that implementing "rhythmic patterns" to range research articles, advertisement, tweets, novels, reviews and political articles into classes resulted in the highest accuracy (F1=98%) for fiction. Two years later the same group of researchers, using a similar algorithm, accomplished an even more ambitious task classifying novels, articles, reviews, VKontakte posts and OpenCorpora news with even higher accuracy (F1=99%) (Lagutina, 2023). ...
October 2021
Modeling and Analysis of Information Systems
... AA approaches have emerged as promising tools in addressing these challenges. AA aims to determine the authorship of texts by analyzing distinctive writing styles and linguistic patterns [4][5][6]. It represents a critical classification challenge in natural language processing, leveraging ML models to recognize and attribute texts to their rightful authors [7]. ...
May 2021