The table reports model evaluation scores at different context lengths. Model Name identifies the model; the columns 4k, 8k, 16k, 32k, 64k, and 128k give scores averaged over all tasks at that context length. The Overall score averages the results over all lengths. The best score is shown in bold; the second best is underlined.
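As a rough illustration, the aggregation described in the caption can be reproduced with a short Python sketch. The per-task scores below are placeholder values, and unweighted means are assumed for both the per-length columns and the Overall score; the actual score scale and any weighting come from the benchmark itself.

from statistics import mean

# Hypothetical per-task scores for one model, keyed by context length.
# These values are placeholders, not results from the paper.
scores_by_length = {
    "4k":   [0.71, 0.64, 0.58],
    "8k":   [0.69, 0.61, 0.55],
    "16k":  [0.66, 0.59, 0.52],
    "32k":  [0.62, 0.55, 0.49],
    "64k":  [0.57, 0.50, 0.44],
    "128k": [0.51, 0.44, 0.39],
}

# Each column (4k ... 128k) is the mean score over all tasks at that length.
column_scores = {length: mean(tasks) for length, tasks in scores_by_length.items()}

# The Overall score averages the per-length results.
overall = mean(column_scores.values())

print(column_scores)
print(f"Overall: {overall:.3f}")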

Source publication
Preprint
Full-text available
Recent advancements in Natural Language Processing (NLP) have fostered the development of Large Language Models (LLMs) that can solve an immense variety of tasks. One of the key aspects of their application is their ability to work with long text documents and to process long sequences of tokens. This has created a demand for proper evaluation of l...

Context in source publication

Context 1
... baseline results with respect to context length are shown in Table 4, and with respect to tasks in Tables 5, 6, and 7. Detailed results for each model are given in Appendix D. Based on the obtained results, we can draw the following conclusions for each group of tasks. ...
