UMap clusters for embeddings a) BERT, b) ELMo

UMap clusters for embeddings a) BERT, b) ELMo

Source publication
Article
Full-text available
The article investigates modern vector text models for solving the problem of genre classification of Russian-language texts. Models include ELMo embeddings, BERT language model with pre-training and a complex of numerical rhythm features based on lexico-grammatical features. The experiments were carried out on a corpus of 10,000 texts in five genr...

Similar publications

Preprint
Full-text available
Keyphrase selection is a challenging task in natural language processing that has a wide range of applications. Adapting existing supervised and unsupervised solutions for the Russian language faces several limitations due to the rich morphology of Russian and the limited number of training datasets available. Recent studies conducted on English te...
Preprint
Full-text available
Generative poetry systems require effective tools for data engineering and automatic evaluation, particularly to assess how well a poem adheres to versification rules, such as the correct alternation of stressed and unstressed syllables and the presence of rhymes. In this work, we introduce the Russian Poetry Scansion Tool library designed for stre...
Conference Paper
Full-text available
Controllable story generation towards keywords or key phrases is one of the purposes of using language models. Recent work has shown that various decoding strategies prove to be effective in achieving a high level of language control. Such strategies require less computational resources compared to approaches based on fine-tuning pre-trained langua...
Preprint
Full-text available
Grammatical error correction is one of the fundamental tasks in Natural Language Processing. For the Russian language, most of the spellcheckers available correct typos and other simple errors with high accuracy, but often fail when faced with non-native (L2) writing, since the latter contains errors that are not typical for native speakers. In thi...

Citations

... A text's stylistic features serve as markers of different genres and are frequently employed for automatic analysis in this field because they represent a text's structural quirks, among other things [1,2]. Automatic genre classification makes it possible to solve several computational linguistics problems more quickly, including figuring out a word or phrase's meaning or part of speech, locating documents that are pertinent to a semantic query [3], improving authorship attribution [4][5][6], and more [2,7]. ...
... A text's stylistic features serve as markers of different genres and are frequently employed for automatic analysis in this field because they represent a text's structural quirks, among other things [1,2]. Automatic genre classification makes it possible to solve several computational linguistics problems more quickly, including figuring out a word or phrase's meaning or part of speech, locating documents that are pertinent to a semantic query [3], improving authorship attribution [4][5][6], and more [2,7]. ...
... These texts span 5 billion words; however, out of the six genres, only two can be attributed to literature-fiction and poetry. The corpus of [2] contains 10,000 texts assigned to five different genres-novels, scientific articles, reviews, posts from the VKontakte social network [10], and news texts from OpenCorpora [11], the open corpus of Russian texts. Only one genre in this corpus is a literature genre-the novels. ...
Article
Full-text available
Within the literary domain, genres function as fundamental organizing concepts that provide readers, publishers, and academics with a unified framework. Genres are discrete categories that are distinguished by common stylistic, thematic, and structural components. They facilitate the categorization process and improve our understanding of a wide range of literary expressions. In this paper, we introduce a new dataset for genre classification of Russian books, covering 11 literary genres. We also perform dataset evaluation for the tasks of binary and multi-class genre identification. Through extensive experimentation and analysis, we explore the effectiveness of different text representations, including stylometric features, in genre classification. Our findings clarify the challenges present in classifying Russian literature by genre, revealing insights into the performance of different models across various genres. Furthermore, we address several research questions regarding the difficulty of multi-class classification compared to binary classification, and the impact of stylometric features on classification accuracy.