Maria Tikhonova’s research while affiliated with National Research University Higher School of Economics and other places


Publications (28)


MMTEB: Massive Multilingual Text Embedding Benchmark
Preprint · February 2025 · 56 Reads · 1 Citation

Kenneth Enevoldsen · Isaac Chung · Imene Kerboua · [...]

Text embeddings are typically evaluated on a limited set of tasks, which are constrained by language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB): a large-scale, community-driven expansion of MTEB, covering over 500 quality-controlled evaluation tasks across 250+ languages. MMTEB includes a diverse set of challenging, novel tasks such as instruction following, long-document retrieval, and code retrieval, representing the largest multilingual collection of evaluation tasks for embedding models to date. Using this collection, we develop several highly multilingual benchmarks, which we use to evaluate a representative set of models. We find that while large language models (LLMs) with billions of parameters can achieve state-of-the-art performance on certain language subsets and task categories, the best-performing publicly available model is multilingual-e5-large-instruct with only 560 million parameters. To facilitate accessibility and reduce computational cost, we introduce a novel downsampling method based on inter-task correlation, ensuring a diverse selection while preserving relative model rankings. Furthermore, we optimize tasks such as retrieval by sampling hard negatives, creating smaller but effective splits. These optimizations allow us to introduce benchmarks that drastically reduce computational demands. For instance, our newly introduced zero-shot English benchmark maintains a ranking order similar to the full-scale version but at a fraction of the computational cost.
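
For readers who want to try the benchmark, the evaluation flow looks roughly like the sketch below, which uses the open-source mteb Python package; the model identifier and the task/language selection are illustrative assumptions, not the paper's exact configuration.

    # Minimal sketch: evaluating an embedding model on a small MMTEB/MTEB
    # subset with the open-source `mteb` package. The model id and the
    # task/language selection are illustrative assumptions.
    import mteb
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("intfloat/multilingual-e5-small")

    # Select a language-filtered subset instead of all 500+ tasks.
    tasks = mteb.get_tasks(tasks=["STS22"], languages=["rus"])

    evaluation = mteb.MTEB(tasks=tasks)
    results = evaluation.run(model, output_folder="mmteb_results")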




Table: Prompts used for the ruMTEB evaluation. For tasks with two different prompts (query and document), the two are written with a slash between them. The E5 prefix column shows the prefixes used for the mE5 small/medium/large models; the E5 instruction column shows the instructions used for the E5-mistral-7b-instruct and mE5-large-instruct models.
The Russian-focused embedders' exploration: ruMTEB benchmark and Russian embedding model design

August 2024 · 55 Reads

Embedding models play a crucial role in Natural Language Processing (NLP) by creating text embeddings used in tasks such as information retrieval and assessing semantic text similarity. This paper focuses on embedding models for the Russian language. It introduces a new Russian-focused embedding model, ru-en-RoSBERTa, and the ruMTEB benchmark, a Russian extension of the Massive Text Embedding Benchmark (MTEB). The benchmark spans seven categories of tasks, including semantic textual similarity, text classification, reranking, and retrieval. The research also assesses a representative set of Russian and multilingual models on the proposed benchmark. The findings indicate that the new model achieves results on par with state-of-the-art models for Russian. We release the ru-en-RoSBERTa model; the ruMTEB framework comes with open-source code, integration into the original MTEB framework, and a public leaderboard.
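
As a rough illustration of how the released model is meant to be used, here is a minimal retrieval-style sketch. It assumes the checkpoint is loadable through sentence-transformers, and the "search_query:"/"search_document:" prefixes mirror the prompt convention in the table above; treat the exact strings as assumptions to be checked against the model card.

    # Minimal sketch: encoding Russian queries and documents with the
    # released ru-en-RoSBERTa checkpoint. The prefix strings are assumed
    # from the paper's prompt table; verify them against the model card.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("ai-forever/ru-en-RoSBERTa")

    queries = ["search_query: Какая столица России?"]
    docs = ["search_document: Москва является столицей России."]

    q_emb = model.encode(queries, normalize_embeddings=True)
    d_emb = model.encode(docs, normalize_embeddings=True)

    # With normalized vectors, cosine similarity is a plain dot product.
    print(q_emb @ d_emb.T)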


Figure 1: The LIBRA benchmark is a set of 21 long-context tasks grouped into four categories based on the complexity of the required skills.
Table: The average fertility of each model.
Table: Model evaluation scores for context lengths of 4k, 8k, 16k, 32k, 64k, and 128k, averaged over all tasks. The Overall score is obtained by averaging the results over all lengths. The best score is in bold; the second best is underlined.
Table: Evaluation results per task, with each task's score averaged over context lengths. The best score is in bold; the second best is underlined.
Table: Evaluation results of LLaMA-2-32K per dataset, with rows 4k, 8k, 16k, 32k, 64k, and 128k showing the scores at each context length. The Overall score is obtained by averaging the results over all lengths.
Long Input Benchmark for Russian Analysis

August 2024 · 55 Reads

Recent advancements in Natural Language Processing (NLP) have fostered the development of Large Language Models (LLMs) that can solve an immense variety of tasks. One of the key aspects of their application is their ability to work with long text documents and to process long sequences of tokens. This has created a demand for proper evaluation of long-context understanding. To address this need for the Russian language, we propose LIBRA (Long Input Benchmark for Russian Analysis), which comprises 21 adapted datasets for a thorough study of LLMs' ability to understand long texts. The tasks are divided into four complexity groups and allow the evaluation of models across context lengths ranging from 4k up to 128k tokens. We provide the open-source datasets, codebase, and public leaderboard for LIBRA to guide forthcoming research.
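
A minimal sketch of the score aggregation described in the result tables above: per-task scores are averaged within each context length, and the Overall score then averages across lengths. The numbers below are made up.

    # Aggregation sketch for LIBRA-style result tables (made-up scores).
    from statistics import mean

    scores = {  # task -> {context length -> score}
        "qa_task":  {"4k": 0.71, "8k": 0.66, "16k": 0.60},
        "sum_task": {"4k": 0.55, "8k": 0.52, "16k": 0.47},
    }

    lengths = ["4k", "8k", "16k"]
    # Average over tasks within each context length.
    per_length = {cl: mean(t[cl] for t in scores.values()) for cl in lengths}
    # The Overall score averages across all lengths.
    overall = mean(per_length.values())
    print(per_length, overall)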



Static, Dynamic, or Contextualized: What is the Best Approach for Discovering Semantic Shifts in Russian Media?

March 2024 · 7 Reads · Lecture Notes in Computer Science

This paper focuses on discovering diachronic semantic shifts in Russian news and social media using different embedding methods. Specifically, we explore the effectiveness of static, dynamic, and contextualized approaches. Using these methods, we reveal social, political, and cultural changes through semantic shifts in the News and Social Media corpora; the latter was collected and released as part of this work. In addition, we compare the performance of the three approaches and highlight their strengths and weaknesses for this task.
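
To make the static approach concrete, here is a hypothetical sketch using gensim: one word2vec model is trained per time slice, and a word's nearest-neighbour overlap across slices serves as an alignment-free shift signal. This is one common recipe for static embeddings, not necessarily the paper's exact setup; the corpora are toy placeholders.

    # Hypothetical sketch of the static approach: one word2vec model per
    # time slice, comparing a word's nearest neighbours across slices
    # (neighbour overlap avoids aligning the two vector spaces).
    from gensim.models import Word2Vec

    # Toy tokenized corpora standing in for two time periods.
    corpus_old = [["мышь", "кот", "ловить"], ["мышь", "нора", "зерно"]] * 50
    corpus_new = [["мышь", "компьютер", "кнопка"], ["мышь", "курсор", "экран"]] * 50

    m_old = Word2Vec(corpus_old, vector_size=50, min_count=1, seed=1)
    m_new = Word2Vec(corpus_new, vector_size=50, min_count=1, seed=1)

    def neighbour_overlap(word, k=2):
        old = {w for w, _ in m_old.wv.most_similar(word, topn=k)}
        new = {w for w, _ in m_new.wv.most_similar(word, topn=k)}
        return len(old & new) / k  # low overlap suggests a semantic shift

    print(neighbour_overlap("мышь"))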


mGPT: Few-Shot Learners Go Multilingual

January 2024 · 190 Reads · 63 Citations · Transactions of the Association for Computational Linguistics

This paper introduces mGPT, a multilingual variant of GPT-3, pretrained on 61 languages from 25 linguistically diverse language families using Wikipedia and the C4 corpus. We detail the design and the pretraining procedure. The models undergo intrinsic and extrinsic evaluation: language modeling in all languages, downstream evaluation on cross-lingual NLU datasets and benchmarks in 33 languages, and world-knowledge probing in 23 languages. Their in-context learning abilities are on par with those of contemporaneous language models while covering a larger number of languages, including underrepresented and low-resource languages of the Commonwealth of Independent States and the indigenous peoples of Russia. The source code and the language models are publicly available under the MIT license.
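
A minimal few-shot usage sketch with the publicly released checkpoint (ai-forever/mGPT on the Hugging Face Hub) follows; the prompt format is an illustrative assumption.

    # Minimal sketch: few-shot prompting with the released mGPT checkpoint.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("ai-forever/mGPT")
    model = AutoModelForCausalLM.from_pretrained("ai-forever/mGPT")

    # A single in-context example followed by the query (illustrative).
    prompt = "English: cat -> Russian: кот\nEnglish: dog -> Russian:"
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=5, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))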




Citations (13)


... In addition, we explore whether neural models consistently outperform traditional lexical methods across diverse scenarios. With this aim, we evaluate a range of models, including BM25, BGE-M3, mE5, RoSBERTa, and LaBSE [6,25,23,12]. This benchmark offers a comprehensive resource for advancing and evaluating IR systems in Russian, fostering research into the comparative strengths of neural and lexical approaches. ...

Reference:

Building Russian Benchmark for Evaluation of Information Retrieval Models
The Russian-focused embedders’ exploration: ruMTEB benchmark and Russian embedding model design
  • Citing Conference Paper
  • January 2025

... "ChatGLM" is a large language model used to revise data. To minimize the computational resources used for model finetuning, we applied the commonly used LoRA technique (Hu et al., 2021), which has been shown to effectively adapt LLMs to specific domain tasks and improve their performance (Lukichev et al., 2023). LoRA applies a simple linear design that allows the trainable matrix to be combined with frozen weights during model deployment. ...

Parameter-Efficient Tuning of Transformer Models for Anglicism Detection and Substitution in Russian
  • Citing Conference Paper
  • June 2023
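
For context, the LoRA update described in the excerpt above can be sketched in a few lines of PyTorch: a frozen weight W is augmented with a trainable low-rank product B @ A, which can be merged back into W for deployment. The shapes and rank below are illustrative.

    # Illustrative sketch of the LoRA update: y = x (W + B A)^T with W
    # frozen and only the low-rank factors A, B trained. After training,
    # B A can be merged into W so deployment costs nothing extra.
    import torch

    d, r = 16, 4
    W = torch.randn(d, d)                      # frozen pretrained weight
    A = torch.randn(r, d, requires_grad=True)  # low-rank down-projection
    B = torch.zeros(d, r, requires_grad=True)  # up-projection, zero-init

    x = torch.randn(1, d)
    y = x @ W.T + x @ (B @ A).T   # training-time forward pass

    W_merged = W + B @ A          # one-time merge for deployment
    assert torch.allclose(y, x @ W_merged.T, atol=1e-6)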

... In this work, to ensure that the results are not an artifact of a specific language model, we generated surprisal measures from both the GPT-3 base (text-davinci-001; Brown et al., 2020) and the mGPT language models (Shliazhko et al., 2024), both being trained on multilingual data. For each token w in the dependency corpora, we obtained its surprisal −log p(w_t | w_<t) given the preceding context from both models. ...

mGPT: Few-Shot Learners Go Multilingual

Transactions of the Association for Computational Linguistics
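
The surprisal measure quoted above can be computed with any causal language model; a minimal sketch using the mGPT checkpoint (an illustrative choice, not the cited study's exact setup) follows.

    # Minimal sketch: per-token surprisal -log p(w_t | w_<t) from a
    # causal LM, here the mGPT checkpoint as an illustrative choice.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("ai-forever/mGPT")
    model = AutoModelForCausalLM.from_pretrained("ai-forever/mGPT")

    ids = tok("Мама мыла раму", return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits

    # Position t is predicted by the logits at position t-1.
    logp = torch.log_softmax(logits[0, :-1], dim=-1)
    n = ids.shape[1] - 1
    surprisal = -logp[torch.arange(n), ids[0, 1:]]
    for token, s in zip(tok.convert_ids_to_tokens(ids[0, 1:].tolist()), surprisal):
        print(token, float(s))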

... The Russian Winograd Schema Challenge dataset from TAPE (Taktasheva et al., 2022) was utilized for the anaphora resolution task to gather information on participants' eye movements. The experiment comprised 150 complex or compound-complex sentences extracted from the Winograd schema challenge dataset, each containing an anaphoric pronoun and its antecedent. ...

TAPE: Assessing Few-shot Russian Language Understanding
  • Citing Conference Paper
  • January 2022

... The previous research has provided us with references regarding the research subjects (pre-trained language models for ancient Chinese) and methods (complex networks). However, most studies [6][7][8][9][10][11][12] on probing language models have not comprehensively elucidated the patterns and principles about the organization of linguistic elements within the models. Therefore, we use complex networks to comprehensively study how these pre-trained language models organize linguistic elements at different levels. ...

Ad astra or astray: Exploring linguistic knowledge of multilingual BERT through NLI task – CORRIGENDUM
  • Citing Article
  • April 2023

Natural Language Engineering

... In the industrial runs, we discuss the practical difficulties (e.g., data collection, computational resources, cost) of RLHF LLM training. The next part is devoted to prompt-tuning methods (Li and Liang, 2021; Lester et al., 2021; Liu et al., 2021; Konodyuk and Tikhonova, 2021) for automatically learning language model prompts. ...

Continuous Prompt Tuning for Russian: How to Learn Prompts Efficiently with RuGPT3?
  • Citing Chapter
  • August 2022

Communications in Computer and Information Science
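
The continuous prompt tuning referenced above prepends a small matrix of trainable prompt embeddings to the input while the language model itself stays frozen. Here is a minimal sketch; the RuGPT-3 checkpoint id and the prompt length are assumptions.

    # Minimal sketch of continuous ("soft") prompt tuning: trainable
    # prompt embeddings are prepended to the input embeddings of a
    # frozen LM. Checkpoint id and prompt length are assumptions.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("ai-forever/rugpt3small_based_on_gpt2")
    model = AutoModelForCausalLM.from_pretrained("ai-forever/rugpt3small_based_on_gpt2")
    for p in model.parameters():
        p.requires_grad = False            # freeze the language model

    n_prompt = 10
    d = model.config.hidden_size
    soft_prompt = torch.nn.Parameter(torch.randn(n_prompt, d) * 0.02)

    ids = tok("Пример входа", return_tensors="pt").input_ids
    tok_emb = model.get_input_embeddings()(ids)
    inputs_embeds = torch.cat([soft_prompt.unsqueeze(0), tok_emb], dim=1)
    out = model(inputs_embeds=inputs_embeds)   # only soft_prompt trains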

... One of the primary reasons for this disparity is the limitation in the amount of training data in our experiment. With only 411 labeled data points, models like BERT and RoBERTa, which rely on supervised fine-tuning, struggle to develop adequate generalization capability [52][53][54]. According to the definitions of the five requirement categories in Section 3.1, Table 1, pre-trained fine-tuned models often encounter significant challenges when processing requirement texts such as interface requirements and design constraints. ...

Ad astra or astray: Exploring linguistic knowledge of multilingual BERT through NLI task
  • Citing Article
  • June 2022

Natural Language Engineering

... In Bukhtiyarov et al. [28], the authors fine-tuned two Transformer-based pre-trained models, mBART and BertSumAbs, and achieved good results. In Sberdevices et al. [29], the authors propose a generative model that combines the Transformer-3 and RuGPT-3 models, which uses zero-shot and minimum optimization methods in news clustering and news HG tasks. ...

Using Generative Pretrained Transformer-3 Models for Russian News Clustering and Title Generation tasks
  • Citing Conference Paper
  • June 2021