JaColBERTv2.5: Optimising Multi-Vector Retrievers to
Create State-of-the-Art Japanese Retrievers with
Constrained Resources
Benjamin Clavié a
Neural Information Retrieval has advanced rapidly in high-resource languages, but
progress in lower-resource ones such as Japanese has been hindered by data scarcity,
among other challenges. Consequently, multilingual models have dominated Japanese
retrieval, despite their computational inefficiencies and inability to capture linguistic
nuances. While recent multi-vector monolingual models like JaColBERT have nar-
rowed this gap, they still lag behind multilingual methods in large-scale evaluations.
This work addresses the suboptimal training methods of multi-vector retrievers in
lower-resource settings, focusing on Japanese. We systematically evaluate and improve
key aspects of the inference and training settings of JaColBERT, and more broadly,
multi-vector models. We further enhance performance through a novel checkpoint
merging step, showcasing it to be an effective way of combining the benefits of fine-
tuning with the generalization capabilities of the original checkpoint. Building on
our analysis, we introduce a novel training recipe, resulting in the JaColBERTv2.5
model. JaColBERTv2.5, with only 110 million parameters and trained in under 15
hours on 4 A100 GPUs, significantly outperforms all existing methods across all com-
mon benchmarks, reaching an average score of 0.754, significantly above the previous
best of 0.720. To support future research, we make our final models, intermediate
checkpoints and all data used publicly available.
Key Words: Information Retrieval, Japanese Information Retrieval, Lower Resource
Languages, Multi-Vector Retrieval, ColBERT, Knowledge Distillation, Model
Training Optimisation
1 Introduction
Text-based Information Retrieval (IR) (Kobayashi and Takeda 2000; Baeza-Yates and
Ribeiro-Neto 1999) is a Natural Language Processing application that enables users to retrieve
relevant documents for a given query (Manning 2008). Historically, the field has been dominated
by lexical-matching methods, where retrieval is performed by directly matching query terms
with document content (Belkin and Croft 1987), with both undergoing various normalization
a Answer.AI, Tokyo, Japan
(C) The Association for Natural Language Processing. Licensed under CC BY 4.0
(https://creativecommons.org/licenses/by/4.0/).
steps to facilitate this process (Trotman et al. 2014). Okapi BM25 (Robertson et al. 1995), the
most successful lexical-matching method, remains widely used today, either independently or in
conjunction with more modern approaches (Thakur et al. 2021).
Despite its inability to capture semantic information, BM25 is still considered a robust base-
line that can be challenging to surpass, even with significantly more computationally intensive
approaches (Kato et al. 2017). However, in recent years, there have been large advances in
the development of Neural IR approaches for the English language, with numerous models now
consistently outperforming BM25 by substantial margins (Muennighoff et al. 2022).
These improvements have been attributed to two factors: the first is the development of much
more powerful neural architectures, such as the transformer architecture (Vaswani et al. 2017),
and the pre-trained models based on it such as BERT (Devlin et al. 2019). The second factor is
the release of large-scale retrieval training datasets, with Microsoft’s MS-MARCO (Nguyen et al.
2016) being widely credited as one of the most important factors in the rapid development of
English Neural IR (Zhang et al. 2021). Moreover, there also exists a large range of high quality,
domain-specific datasets on which Neural retrievers can be further trained, providing English
models with even stronger generalisation capabilities (Craswell et al. 2020; Thakur et al. 2021).
Outside of the English language, and some other very high-resource languages such as
Mandarin Chinese (Xiao et al. 2023), progress in Neural IR has been considerably slower and
has yielded more modest results. In most mid and lower-resource languages, as is the case in
Japanese, retrieval benchmarks have largely been dominated by multilingual models (Wang et al.
2024; Chen et al. 2024), taking advantage of the vast quantities of English training data to bolster
their performance on all languages.
However, these models come with multiple trade-offs. Notably, their retrieval performance ceiling appears to be lower than that of monolingual models: in the case of English, monolingual retrievers trained on the same amount of English data consistently show stronger performance than their multilingual counterparts (Wang et al. 2022; Xiao et al. 2023). Their relative compute costs
are also considerably higher, both at training time, as they require several billion query-document
pairs, and at inference time, with a similarly performing multilingual retriever having 3-to-5 times as many parameters as monolingual variants (Clavié 2023; Chen et al. 2024; Conneau et al. 2020).
Finally, multilingual models have been empirically shown to miss out on cultural understanding in some settings (Wang et al. 2024), which represents an ethical issue as well as a potential performance one.
In the case of Japanese, the performance gap has been particularly wide, with monolingual
neural models often representing a 40% average performance decrease from multilingual ones,
with even starker degradation on larger-scale benchmarks (Clavié 2023).
All existing monolingual Japanese retrieval models have followed the most common approach to Neural IR, which consists of the use of single-vector dense representations, where a document
is represented as a single averaged vector (Yates et al. 2021). However, multi-vector retrievers,
where documents are represented by multiple vectors, one for each token they contain, have shown
remarkable training efficiency and out-of-domain generalisation since their introduction with
the ColBERT model (Khattab and Zaharia 2020), and its further refinement in ColBERTv2
(Santhanam et al. 2022b). The promising results of ColBERT models in English and their strong generalisation ability when trained on a single domain appeared to suggest the potential suitability of this approach in lower-resource settings, such as the Japanese language.
Following these findings, a recent study has introduced Japanese ColBERT variants, the
JaColBERT and JaColBERTv2 models (Clavié 2023), respectively following the ColBERT and
ColBERTv2 training recipes. These models, while trained exclusively on lower-quality Japanese
training data, have demonstrated considerable performance improvements. They have vastly
outperformed existing monolingual retrievers and reached competitive results with the best mul-
tilingual ones, albeit still falling short on large-scale benchmarks containing millions of documents.
Despite this, and while there have been considerable improvements in understanding how best to train both general NLP (Loshchilov and Hutter 2017; Hu et al. 2024; Defazio et al. 2024; Achiam et al. 2023; Touvron et al. 2023) and retrieval models (Hofstätter et al. 2022; Izacard and Grave 2021; Chen et al. 2024; Lin et al. 2023), the training of ColBERT models has followed the standard ColBERTv2 recipe with some changes, but no systematic overhaul or in-depth modifications (Nair et al. 2022; Clavié 2023; Louis et al. 2024).
1.1 Contributions
In this study, we present the hypothesis that it is possible to outperform all existing retrieval
methods, including the best multilingual models, in Japanese with an optimized multi-vector
retriever training recipe.
To achieve this aim, we systematically evaluate potential improvements via many small-scale
ablation studies, seeking to increase both training efficiency and downstream performance.
We begin by evaluating inference-time settings, and demonstrate that dynamic query-length
is strictly superior to fixed-length querying.
The impact of using various teacher models for knowledge distillation is then systematically
evaluated, as well as the performance changes when ensembling various teachers, as is oftentimes
the best option in English (Hofstätter et al. 2020). We show that this does not hold true in a
Japanese setting, and generate teacher scores using the best-performing model in our ablation
studies, BGE-M3 (Chen et al. 2024).
Our subsequent experiments then demonstrate that conventionally used training settings for
ColBERT models are either computationally inefficient, place unnecessary constraints on data
gathering for no performance improvement, or are simply sub-optimal in terms of retrieval per-
formance. Based on our results, we propose an improved training setting.
We propose two final additions to the training process. The first one is a post-training step, where we leverage data with higher-quality Japanese to fine-tune our final model checkpoint, in the hopes of improving its results on common Japanese retrieval tasks. We then introduce a final checkpoint averaging step, where the models resulting from this post-training step are averaged with checkpoints from the pre-training phase, to create a model that retains the performance gains on tasks which are in-domain for the post-trained model, without losing any performance on out-of-domain tasks, further increasing the generalisation potential of our model.
Our resulting model, JaColBERTv2.5, is the best performing model on all evaluation datasets,
reaching an average score of 0.752, representing a 4.5% improvement over JaColBERTv2 (0.720),
a 60% performance improvement over the best-performing monolingual single-vector retriever, GLuCoSE (0.470), and a 5.32% improvement over BGE-M3 (0.714), the strongest multilingual model. These results hold true on MIRACL (Zhang et al. 2023), a large-scale retrieval benchmark on which previous versions of JaColBERT trailed significantly behind BGE-M3, but on which JaColBERTv2.5 reaches a 6.87% performance improvement over it.
Our model prior to the post-training step, JaColBERTv2.4, also outperforms all other existing
approaches, even while being fully out-of-domain on every evaluation dataset.
We achieve these results with a constrained compute budget, where the compute used for our
final model, teacher score generation included, cannot meaningfully exceed that of JaColBERTv2,
to confirm that downstream performance comes from our improved recipe rather than from
increased computational budget.
Moreover, we obtain these results with a simplified training recipe, which fully discards the
“positive” and “negative” labels assigned to each document, focusing entirely on the relative
distribution of teacher scores instead. These results can help support future research in the area
of data curation and hard-negative mining, and help simplify both processes.
We make all the resources in this paper publicly available to support further research. This
includes all of our training data, with teacher scores, for both the ablation and final training runs,[1] the JaColBERTv2.5[2] and JaColBERTv2.4 models,[3] as well as all intermediate model checkpoints generated during training.[4]
Finally, while our study focused entirely on the Japanese language, we believe that our method
can be directly applied to any other language with at least moderate data volumes and lead
to similar performance improvements, as even our 320,000 triplets vastly outperformed previous
monolingual approaches.
2 Background
In this section, we will provide a brief overview of the mechanisms used by multi-vector
retrieval models such as ColBERT and JaColBERT, on which our work builds.
2.1 ColBERT
Multi-vector retrievers, also sometimes called late-interaction models, were introduced and
popularised by the Contextualized Late Interaction over BERT, or ColBERT, model architecture
(Khattab and Zaharia 2020).
The ColBERT architecture relies on a simple intuition: the dominant way of creating document representations, using a single vector for each document, causes too much information loss. While this information loss can be mitigated when evaluated on in-domain tasks, it can result in consistently poorer out-of-domain generalisation. To solve this problem, multi-vector retrievers do not create a single large vector to represent a document, but multiple smaller ones, with each individual vector representing a single token.
Query Augmentation  Recognising that retrieval is by nature a very asymmetrical task, with queries often being short, a query-augmentation mechanism is also introduced. Rather than padding inputs to the maximum length with padding tokens, it leverages the masked-language-modeling objective of its base models (Devlin et al. 2019). To do so, it pads all queries with [MASK] tokens, which are then attended to by the model. These mask tokens have been heavily studied in subsequent work, and appear to provide useful term-weighting and amplify semantic information, resulting in improved retrieval performance (Formal et al. 2021b; Giacalone et al. 2024).
[1] https://huggingface.co/datasets/answerdotai/MMarco-japanese-32-scored-triplets
[2] https://huggingface.co/answerdotai/JaColBERTv2.5
[3] https://huggingface.co/answerdotai/JaColBERTv2.4
[4] https://huggingface.co/collections/bclavie/jacolbertv25-checkpoints-66a37d8da6b0d4d69c14f9c3
MaxSim In order to perform scoring, the ColBERT authors introduce a new scoring mech-
anism, which they dub MaxSim, for Maximum Similarity. In this setting, the score of a [query,
document] pair is obtained via the following formula, where E represents all the embeddings
representing a given document or query:
Score_{query,document} := \sum_{i \in [|E_{query}|]} \max_{j \in [|E_{document}|]} E_{query_i} \cdot E_{document_j}^{T} \qquad (1)
Effectively, for a given query token, its cosine similarity with every document token is com-
puted, and the highest similarity is kept. This process is repeated over all query tokens, and
the final [query, document] score is represented as the sum of all those maximum similarities per
query token, hence the name “MaxSim”.
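To make the scoring concrete, the following is a minimal MaxSim sketch in PyTorch, assuming both sets of token embeddings are already L2-normalised so that dot products equal cosine similarities (tensor names are illustrative, not taken from the official ColBERT codebase):

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction MaxSim score for a single (query, document) pair.

    query_emb: [num_query_tokens, dim] L2-normalised token embeddings.
    doc_emb:   [num_doc_tokens, dim]   L2-normalised token embeddings.
    """
    # Cosine similarity between every query token and every document token.
    sim = query_emb @ doc_emb.T                  # [num_query_tokens, num_doc_tokens]
    # For each query token, keep only its best-matching document token...
    max_per_query_token = sim.max(dim=1).values
    # ...and sum those maxima to obtain the final pair score (Equation 1).
    return max_per_query_token.sum()
```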
ColBERT Initial Performance This approach appears to consistently result in better
out-of-domain generalisation, even when trained on considerably smaller data volumes than com-
petitive single-vector approaches (Khattab and Zaharia 2020; Hofstätter et al. 2022). However,
its performance on in-domain tasks remained lower than that of single-vector retrievers, while
requiring an order of magnitude more memory and storage space to index the same document
volume, due to the need to store significantly more vectors.
2.2 ColBERTv2
A subsequently improved version of ColBERT, named ColBERTv2, seeks to address these
two issues (Santhanam et al. 2022b). This second version overhauls many parts of the original
process, using a more modern training recipe, albeit without a clear evaluation of the impact
of each component. Most notably on the training side, it introduces both in-batch negatives
(Karpukhin et al. 2020) and knowledge distillation (Hinton et al. 2015). To help alleviate the
storage issue, it introduces a novel indexing approach, allowing for a 6-fold index-size reduction
by clustering the individual token representations and offloading most of the stored information
to the cluster centroids before compressing the residual vectors to just 2 bits per dimension. This method successfully
brings the storage and memory requirements of ColBERTv2 down to the same order of magnitude
as that of single-vector models, though still noticeably higher, while reaching even stronger results
on out-of-domain datasets. Further work has shown that this approach also addresses the issue
of weaker in-domain performance, with fine-tuned versions of the model being able to outperform
all other approaches on multiple benchmarks (Saad-Falcon et al. 2023).
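As an illustration of the centroid-plus-residual idea only (not the exact ColBERTv2 implementation), a toy sketch could look as follows; the quantisation scheme and function names are assumptions made for the example:

```python
import torch

def compress(embeddings: torch.Tensor, centroids: torch.Tensor, bits: int = 2):
    """Toy centroid + quantised-residual compression of token embeddings.

    embeddings: [num_tokens, dim] token vectors to index.
    centroids:  [num_centroids, dim] cluster centres (e.g. obtained via k-means).
    """
    # Assign each token vector to its nearest centroid.
    codes = torch.cdist(embeddings, centroids).argmin(dim=1)
    # Keep only the residual to that centroid, quantised to `bits` bits per dimension.
    residuals = embeddings - centroids[codes]
    levels = 2 ** bits
    scale = residuals.abs().max().clamp(min=1e-6)
    quantised = ((residuals / scale + 1) / 2 * (levels - 1)).round().clamp(0, levels - 1).to(torch.uint8)
    return codes, quantised, scale

def decompress(codes, quantised, scale, centroids, bits: int = 2):
    """Approximately reconstruct the original embeddings from the compressed form."""
    levels = 2 ** bits
    residuals = (quantised.float() / (levels - 1)) * 2 - 1
    return centroids[codes] + residuals * scale
```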
ColBERT and ColBERTv2 have garnered a lot of attention, with various studies attempting to better understand and improve their various mechanisms, such as the effects of [MASK]-based query augmentation (Giacalone et al. 2024; Formal et al. 2021b), the impact of introducing full-word rather than token-level representations (Hofstätter et al. 2022), potential improvements
to its scoring approach (Lee et al. 2023), or to the mechanisms around its optimised indexing
approach (Santhanam et al. 2022a; MacAvaney and Tonellotto 2024).
2.3 JaColBERT
In Japanese, both of these approaches have been reproduced, with even greater success than
their English equivalents (Clavié 2023). JaColBERTv1, following the training recipe of the origi-
nal ColBERT model, became the then-strongest monolingual Japanese retriever. However, it fell
short of the strongest multilingual models on multiple benchmarks, with the most notable per-
formance gap being found on large-scale retrieval tasks. Subsequently, JaColBERTv2, trained
following the ColBERTv2 recipe, helped address these issues. JaColBERTv2, at the time of
this work, is the strongest out-of-domain retriever on all existing Japanese benchmarks. How-
ever, on MIRACL (Zhang et al. 2023), a large-scale retrieval dataset which was used to train
most multilingual retrievers and on which they are therefore in-domain, it still noticeably lags in
performance.
3 Experiments
In this section, we will present the various steps of our experimental process. As this study
focuses on systematically evaluating various approaches to using and training multi-vector models,
we will conduct short training runs, also called ablations, on small data scales, to identify the best
possible setting. We believe this sort of small-scale training to identify optimal model settings is well-suited to refining training recipes for constrained-resource settings, as it has proved to be a particularly strong indicator of full-sized model performance. In recent months, it
has notably become the preferred way of predicting model behaviour for large language models
(Achiam et al. 2023; Hestness et al. 2017; Touvron et al. 2023).
In each section, we will discuss the rationale for our proposed settings and, when relevant,
the results and learnings of the corresponding ablations.
Firstly, we will define our hardware constraints in Section 3.1, before discussing our choice of
training data in Section 3.2. We then present an overview of the baselines our models will be evaluated against in Section 3.3 and of our chosen evaluation benchmarks in Section 3.4.
In Section 3.5, we will explore different approaches to defining the query length in ColBERT’s
query-augmentation mechanism.
We will then evaluate the impact of various alterations of the model’s training settings in
Section 3.6, through a series of small-scale training runs.
Our next area of focus in Section 3.7 is systematically studying the use of various teacher
models for knowledge distillation, in the context of Japanese-language retrieval. We will explore
the impact of using a variety of models as teachers, as well as various ensembling methods.
Finally, we propose two ways of improving our final model, after the original pre-training has
concluded. In Section 3.8, we describe a post-training phase, where our model will be fine-tuned
on smaller-scale, considerably higher-quality data. In Section 3.9, we discuss the introduction of
checkpoint averaging (Polyak 1990) as a method to ensure that the final model remains strong across the board and to mitigate potential performance losses on out-of-domain tasks during post-training.
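Although the details are discussed in Section 3.9, checkpoint averaging itself is mechanically simple; a minimal sketch, assuming PyTorch checkpoints of identical architecture averaged with equal weights (both assumptions made for illustration), is shown below.

```python
import torch

def average_checkpoints(paths, output_path):
    """Average the floating-point parameters of compatible checkpoints with equal weights."""
    state_dicts = [torch.load(p, map_location="cpu") for p in paths]
    averaged = {}
    for key in state_dicts[0]:
        # Stack the same tensor from every checkpoint and take the element-wise mean.
        averaged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    torch.save(averaged, output_path)

# Hypothetical usage: merge a post-trained checkpoint with a pre-training checkpoint.
# average_checkpoints(["pretraining_checkpoint.pt", "posttraining_checkpoint.pt"], "merged.pt")
```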
For all training runs, both small and large scale, we follow JaColBERTv2 in initializing our
models off the original JaColBERT checkpoint, which has been trained on a total of 8 million
[query, positive document, negative document] training triplets (Clavié 2023).
All of our experiments are performed with the same random seed, set to 42. Due to the
computational constraints detailed in Section 3.1, we do not perform multiple training runs for
each setting.
3.1 Hardware Constraints
All experiments conducted in this study are done under a compute constraint, in order to
highlight that our final model’s performance is a consequence of our improvements, rather than
due to substantially increased compute.
JaColBERTv2 was trained for 28.5 hours on 8 NVIDIA A100 GPUs (Clavié 2023), representing
a total training budget of 228 A100 hours. All teacher scores used by JaColBERTv2 were re-used
from the original ColBERTv2 teacher scores (Santhanam et al. 2022b), and therefore came at no
additional compute cost.
Based on this, we constrain our total training compute budget to the same 228 hours, plus or
minus 5%, resulting in a final upper bound budget of 239 A100 hours. We include all time spent
training each ablation model, generating teacher scores and training the final JaColBERTv2.5
model under this budget.
3.2 Training Data
For both the final model training and ablation runs, we follow existing practices (Clavié 2023;
Wang et al. 2024) and train our model using the Japanese split of MMarco (Bonifacio et al. 2021).
MMarco is a machine-translated version of MS Marco (Nguyen et al. 2016), a large English Information
Retrieval (IR) dataset which has widely been credited as unlocking vast advances in English IR,
thanks in large part to its scale and wide variety of queries. For a long time, no equivalent dataset
existed in non-English languages, and efforts to create ones were largely unsuccessful due to the
cost of such an endeavour. MMarco was introduced to bridge the gap between English and other
languages, by providing a fully machine-translated version of MS Marco in 14 languages, and
empirically showcasing that, while the resulting dataset produced poorer results than in English,
it still contained useful signal on a scale usually not available in these languages.
Retrieval models are generally trained on triplets. These triplets can either be standard
triplets, as with ColBERTv1 (Khattab and Zaharia 2020) and JaColBERTv1 described above,
where each triplet contains a single query, a single positive document, and a single negative
document, or n-way triplets. n can be any number, and represents the total number of documents
passed to the model: in the case of 16-way triplets, the model would be presented with a query,
and 16 documents, rather than just 2 in the standard setting. Doing so allows us to more
efficiently use knowledge distillation (Hinton et al. 2015), where the model learns from teacher
scores, generally generated by strong cross-encoder models, and attempts to emulate them or
their distribution (Qu et al. 2021).
Our models are trained using 32-way triplets with knowledge distillation. This means that,
for every single query, the model is given 32 documents per triplet, as well as teacher scores for
every [query, document] pair. The goal of the model’s training is to attempt to learn to reproduce
the provided scores, through a knowledge distillation loss function which we explore further in
Section 3.6.4.
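Concretely, a single training example in this setting bundles one query, its 32 candidate documents, and one teacher score per document; a minimal sketch of such a record (field names are illustrative, not the actual dataset schema) is:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class NWayExample:
    """One 32-way distillation example: a query, n candidates and teacher scores."""
    query: str
    documents: List[str]         # 32 candidate passages for this query
    teacher_scores: List[float]  # one cross-encoder score per [query, document] pair

# During training, the student scores all 32 pairs itself and is optimised so that
# its score distribution matches the teacher's (see the losses in Section 3.6.4).
```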
We use a downsample of the set of triplets used to train the original English ColBERTv2
model (Santhanam et al. 2022b). We downsample the triplets in two ways: firstly, in order
to meet the compute constraints of this work, we randomly sample 3,200,000 triplets out of
the 19,000,000 originally provided. Secondly, as ColBERTv2 was trained with 64-way triplets,
we randomly sample 31 negative documents from the original 63 in every individual triplet. We
choose to train on 3,200,000 triplets, which represents 40% of the 8,000,000 triplets JaColBERTv2
was trained on, in order to respect our compute constraints and allocate sufficient compute to
generating teacher scores.
As MMarco is a direct translation of MS Marco, it is possible to reuse the ColBERTv2 triplets
with no further modifications. We use the teacher scores provided by the ColBERTv2 authors
as our baseline teacher scores for all of our experiments, and will extensively cover the effect of
different teachers in Section 3.7.
In recent years, higher quality datasets such as Mr.TyDi (Zhang et al. 2021) and its improved
version, MIRACL (Zhang et al. 2023), have been introduced. However, both of them contain
noticeably fewer queries and relevance labels per language than MMarco, and are more commonly
used to train multi-lingual models (Chen et al. 2024; Wang et al. 2024), while best-performing
monolingual retrievers largely continue to pre-train on MMarco (Louis 2024). However, we believe
that MIRACL could constitute a particularly suitable post-training dataset, which we will further
explore in Section 3.8.
Ablation Training Data For all ablation training runs we conduct as part of our experiment,
we use a further downsampled version of our final training set. We create this set by sampling the
first 10% of triplets of the full training set, resulting in 320,000 training triplets, which represents
4% of the original JaColBERTv2 training data. Following previous work (Achiam et al. 2023;
Hestness et al. 2017), we believe that this data volume is sufficient to show trends which will
scale to the final training run.
3.3 Baselines
Our final models are evaluated against a large range of representative baselines, including
the current best-performing retrievers. To do so, we evaluate our model against BGE-M3 (Chen
et al. 2024), the current best-performing multilingual embedding model. BGE-M3 is a multi-
output model: it is capable of producing single-vector dense representations, but is also able to
output sparse and computationally heavy multi-vector representations to act as a “self-reranker”.
As a result, we report results from BGE-M3 in two settings: dense, using only its single-vector
retriever output, and all, leveraging all three forms of outputs in the way recommended by its
original authors. BGE-M3’s model size is roughly 5.11x that of JaColBERT.
Results for the multilingual-E5 (mE5) family of models (Wang et al. 2024) are also presented in all three existing model sizes: small (~JaColBERT-sized), base (~2.5x JaColBERT) and large (~5x JaColBERT). The mE5 family is one of the most widely used model families for Japanese retrieval, and has consistently shown strong results on benchmarks (Clavié 2023).
We also report results for the best-performing single-vector retrievers in Japanese, GLuCoSE,[5] an embedding model based on LUKE (Yamada et al. 2020), as well as Nagoya University’s sup-simcse family of models (Tsukagoshi et al. 2023), in both base and large sizes.
[5] https://huggingface.co/pkshatech/GLuCoSE-base-ja
Finally, we also report results for JaColBERTv1 and JaColBERTv2, the two previous best
multi-vector retriever models for Japanese, respectively trained following the ColBERTv1
(Khattab and Zaharia 2020) and ColBERTv2 (Santhanam et al. 2022b) training recipes.
3.4 Evaluation Data
We define two evaluation sets: one used for final evaluation, described in Section 3.4.1, and a smaller, quicker-to-run one, described in Section 3.4.2, which will be used for the various experiments we run to find the optimal training setting. All metric calculations are
performed using the ranx evaluation toolkit (Bassani 2022).
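For illustration, computing such metrics with ranx follows the pattern below (the qrels and run contents are made up purely for the example):

```python
from ranx import Qrels, Run, evaluate

# Relevance judgements: query id -> {document id: relevance grade}.
qrels = Qrels({"q_1": {"doc_12": 1}, "q_2": {"doc_7": 1, "doc_9": 1}})

# Retrieval scores produced by the model being evaluated.
run = Run({"q_1": {"doc_12": 0.9, "doc_3": 0.4}, "q_2": {"doc_9": 0.8, "doc_7": 0.7}})

# Metrics used throughout this paper.
print(evaluate(qrels, run, ["ndcg@10", "recall@5", "mrr@10"]))
```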
3.4.1 Final evaluation data
For the final evaluation, five commonly used evaluation datasets will be used to cover the
model’s performance in a variety of settings. For each dataset, we choose a main evaluation
metric in line with previous work in order to provide clear comparisons. However, detailed
evaluation results for our models will also be reported.
JSQuAD is a QA dataset introduced in the JGLUE evaluation set (Kurihara et al. 2022),
inspired by the English SQuAD (Rajpurkar et al. 2016). We evaluate JSQuAD in the same
setting as in previous studies (Clavié 2023), using Nouu.me’s evaluation benchmark, where the
dataset is reduced to 1,600 passages, and the model’s goal is to extract the relevant passage for
each query in its top results. The metric we report for this dataset, following previous work, is
Recall@3. This task is the easiest in the evaluation set.
MIRACL (Zhang et al. 2023) is a large-scale multilingual evaluation benchmark. We use
its Japanese subsplit, which is composed of over 6 million documents, extracted from Wikipedia,
and contains human-created relevance judgements for 860 queries over this corpus. We choose to
use exclusively MIRACL, rather than both MIRACL and Mr.TyDi (Zhang et al. 2021), another
large-scale multilingual information retrieval dataset, as MIRACL is a refinement of Mr.TyDi,
with additional judgements added and dubious labels removed. The main metric reported for
this dataset is NDCG@10. It has been noted in the past that MIRACL contains “holes”: that is,
the positive judgements are not thorough, and the data contains many false negatives.[6] However,
[6] While there is no formal citation for this claim, it can be deduced from the annotation process used for MIRACL and has been frequently noted, notably as part of Cohere’s work on multilingual embeddings: https://huggingface.co/datasets/Cohere/miracl-en-queries-22-12
it remains the only large-scale evaluation benchmark for most languages it covers, including
Japanese, and is one of the most commonly used non-English IR benchmarks (Chen et al. 2024; Wang et al. 2024; Clavié 2023; Louis et al. 2024).
JQaRA (Tateno 2024b) is another dataset built from a QA dataset commonly used for Japanese QA evaluation, JAQKET (Suzuki 2021). The aim is, similarly to SQuAD, to find a document containing the answer to a given query, over 1,667 queries. The dataset was constructed with the help of LLMs, before going through human validation to ensure all negatives are true negatives and all questions have at least one real positive passage. The task is presented as a hard reranking task: for each query, we are provided with 100 documents, with one or more of them containing the information necessary to answer the query and all other documents containing closely related information which does not directly address the query. These related documents are called “hard negatives”, as they are purposefully designed to be hard to differentiate from positive examples. The main evaluation metric for this task is NDCG@10. All evaluations on JQaRA are conducted with the official evaluation code provided by the dataset author (Tateno 2024b).
JaCWIR (Tateno 2024a) is a medium-scale (500,000 documents) retrieval dataset, using a large variety of web-scraped documents. It is an entirely auto-generated dataset, where GPT-3.5 (Brown et al. 2020) was asked to produce queries for which a given document would be relevant. We also use it as a reranking task, similarly to how it was introduced. For each of its 5,000 queries, the model must attempt to identify the relevant document among 99 hard negatives. The main evaluation metric for this task is NDCG@10. All evaluations on JaCWIR are conducted with the official evaluation code provided by the dataset author (Tateno 2024a).
ESCI (Reddy et al. 2022) is an addition to the JaColBERT evaluation set, which was not used in previous work. It is the “Shopping Queries Data Set” provided by Amazon Science as part of the KDD Cup 2022. The goal of this dataset is to evaluate a model’s ability to match very short (1 to 5 tokens) queries with the textual description of relevant products. We use ESCI as a retrieval task, similarly to one of the settings in which it is available in the Japanese Massive Text Embeddings Benchmark (JMTEB),[7] an ongoing effort inspired by the English
MTEB (Muennighoff et al. 2022). For any given product query, the model must attempt to
retrieve the description of relevant products among 149,999 product descriptions. The main
evaluation metric for this task is NDCG@10.
We provide an overview of key information on each dataset in Table 1. Despite the relative
sparsity of Japanese retrieval benchmarks in comparison with English, we believe that these five
[7] https://huggingface.co/datasets/sbintuitions/JMTEB
Dataset | Type | # Queries | # Documents | Task Setting
MIRACL | General domain QA | 860 | 6,953,614 | Large-Scale Retrieval
JSQuAD | General domain QA | 4,420 | 1,145 | Small-Scale Retrieval
JQaRA | Trivia QA | 1,667 | 100 | Reranking
JaCWIR | Synthetic Web QA | 5,000 | 100 | Reranking
ESCI | Amazon Product Search | 4,206 | 149,999 | Large-Scale Retrieval

Table 1: Brief overview of the key information about the datasets used for evaluation. We consider as “Large-Scale Retrieval” any task where more than 100,000 documents are considered at once. For Reranking tasks, the number of documents is the number of documents to rerank.
datasets in these settings provide a good overview of retrieval models’ capabilities on a wide array
of relevant real-world usages.
3.4.2 Development evaluations
A large part of our study focuses on systematically evaluating a large variety of improvements
to the ColBERT training and inference routine. As a result, we need a representative development
set that is computationally inexpensive to run, while providing us with enough information to
make decisions. We decide to use two evaluation sets, and report two key metrics for each of
them. The first one is JQaRA, as presented above. We choose JQaRA due to its small size, as it is presented as a reranking task, and because it has consistently shown a good ability to discriminate between models and a good correlation with performance on other datasets (Tateno 2024b). We
report both NDCG@10 and MRR@10 as our development metrics.
As our second task, we follow ongoing efforts in creating lighter embeddings benchmarks[8]
and introduce a smaller version of MIRACL’s Japanese split, which we dub MIRACL-small-
ja. This dataset is built through hard-negative mining. Using BM25, we retrieve the top 250
results for each of the 860 MIRACL development queries. We then enrich this data with all
positive examples, if they were not present in the BM25 results. The resulting dataset contains
197,610 documents and 860 queries.
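A minimal sketch of this mining procedure, assuming the rank_bm25 package and a tokenize callable standing in for a proper Japanese tokeniser (both assumptions made for the example), could be:

```python
from rank_bm25 import BM25Okapi

def build_small_eval_corpus(corpus, queries, qrels, tokenize, k=250):
    """Reduce a large corpus to BM25 top-k hard negatives plus all known positives.

    corpus:  {doc_id: text} for the full collection.
    queries: {query_id: text} for the development queries.
    qrels:   {query_id: set of relevant doc_ids}.
    """
    doc_ids = list(corpus)
    bm25 = BM25Okapi([tokenize(corpus[d]) for d in doc_ids])

    kept = set()
    for qid, qtext in queries.items():
        scores = bm25.get_scores(tokenize(qtext))
        top_k = sorted(range(len(doc_ids)), key=lambda i: scores[i], reverse=True)[:k]
        kept.update(doc_ids[i] for i in top_k)   # top-k BM25 results per query
        kept.update(qrels.get(qid, set()))       # re-add any missing positives
    return {d: corpus[d] for d in kept}
```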
While the ideal situation would be for the final evaluation and development sets to be fully
separate, this is not possible with the current amount of evaluation resources available. We pick
these two datasets as we believe they provide a good balance between useful signal and minimising
the risks of overfitting too strongly on these development evaluations, which would introduce a
bias to our final results.
[8] The team behind the English MTEB is currently developing a multilingual version of it, with a lite variant, including datasets similar to MIRACL-small, currently under discussion: https://github.com/embeddings-benchmark/mteb/issues/784
3.5 Dynamic Query Length
ColBERT models use a query augmentation mechanism, leveraging the use of [MASK] tokens,
which are appended to every query, replacing traditional padding (Khattab and Zaharia 2020),
until the predefined maximum query length is reached. These tokens have been shown to learn
different types of information, occasionally acting as term-importance weights (Formal et al.
2021b; Giacalone et al. 2024).
However, the exact impact of mask tokens and the best way to use them has been under-
studied. Instead of variable-length augmentation, a past study has explored simply appending 8 [MASK] tokens to every query, regardless of the actual query length (Hofstätter et al. 2020). Another previous study has chosen to remove this augmentation mechanism entirely (Hofstätter et al. 2022). However, no study has compared the effects of these various choices against the
original implementation.
We believe all three of these approaches to be suboptimal, and propose dynamic query
length as a replacement. Dynamic query length effectively aims to improve the initial approach
of padding each query with [MASK] tokens until the maximum query length by allowing it to
more easily adapt to longer queries, while also borrowing from fixed-length augmentation for edge
cases.
Effectively, our approach is to set the maximum query length to the nearest higher multiple
of 32 (ColBERT’s original maximum query length) before performing the [MASK]-padding.
Additionally, if fewer than 8 augmentation tokens would be appended with the new query length,
we ensure that at least 8 tokens are appended, overriding the maximum length.
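A minimal sketch of this rule (function and parameter names are illustrative, not taken from the JaColBERT codebase):

```python
def dynamic_query_length(num_query_tokens: int, base: int = 32, min_mask_tokens: int = 8) -> int:
    """Padded query length under the dynamic query length rule.

    The length is rounded up to the nearest multiple of `base` (ColBERT's original
    maximum query length), and at least `min_mask_tokens` [MASK] tokens are always
    appended, even if that means exceeding the rounded-up length.
    """
    # Round up to the nearest higher multiple of `base`.
    padded_length = ((num_query_tokens + base - 1) // base) * base
    # Guarantee a minimum number of [MASK] augmentation tokens.
    if padded_length - num_query_tokens < min_mask_tokens:
        padded_length = num_query_tokens + min_mask_tokens
    return padded_length

# e.g. a 12-token query is padded to 32, a 30-token query to 38 (at least 8 [MASK]
# tokens), and a 40-token query to 64.
```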
To select the default query augmentation mechanism to use for JaColBERTv2, we evaluate
all four of the discussed approaches.
3.5.1 Results
Results for this experiment are presented in Table 2. The results are clear-cut. On all datasets, disabling [MASK] augmentation is consistently outperformed, by a large margin, by all query augmentation approaches. Fixed 8-token query augmentation, while performing considerably
better than no augmentation, is similarly outperformed by both fixed and dynamic query lengths
on every dataset.
On JQaRA, where the query length fluctuates more and some queries are considerably longer
than others, dynamic query length outperforms a flat, higher token limit, suggesting that append-
ing too many [MASK] tokens can produce a slightly detrimental effect. An empirical analysis of
the results obtained on JQaRA also reveals that dynamic query length has virtually no impact
Method | JQaRA NDCG@10 | JQaRA MRR@10 | MIRACL-small-ja NDCG@10 | MIRACL-small-ja Recall@5 | Average NDCG@10 | Average Overall
Baseline | 0.578 | 0.820 | 0.681 | 0.707 | 0.619 | 0.691
No Augmentation | 0.557 | 0.817 | 0.632 | 0.661 | 0.595 | 0.667
Fixed 8 tokens aug. | 0.577 | 0.813 | 0.670 | 0.692 | 0.624 | 0.688
Dynamic query length | 0.581 | 0.820 | 0.681 | 0.707 | 0.631 | 0.700

Table 2: Results of various query augmentation methods on our development set.
on queries that are noticeably shorter than the maximum query length in a fixed query length
setting. However, it noticeably improves NDCG@10 on queries which are nearer the maximum
length, and would thus lose most of the query augmentation mechanism. This explains the small
increase in overall NDCG@10 on the full dataset.
On MIRACL, where all queries have similar token counts and are all well under 32 tokens,
the results for fixed and dynamic query lengths are identical, as would be expected.
Based on the results of this experiment, every result we report for subsequent ablations, as well as the final model results, uses dynamic query length.
3.6 Training Setting
In this section, we will evaluate the impact of certain changes to common components of
the retrieval model training pipeline. We will first explore concerns around the impact of using
in-batch negatives in Section 3.6.1. We next explore the optimal way of scheduling the model’s
training, and the relevance of the recently proposed schedule-free training method (Defazio et al.
2024) in Section 3.6.2. We will then study the benefits of score normalization, applied to both
teacher and student models, in Section 3.6.3, as well as explore the use and impact of different
commonly used knowledge distillation loss functions in Section 3.6.4.
Finally, we will present the results of all of our experimental small-scale runs and discuss their
implications in Section 3.6.5.
3.6.1 In-batch negatives
In-batch negatives (IBNeg) are frequently used as a way to augment the training of retrieval
models (Karpukhin et al. 2020). Effectively, within a given training batch, the IBNeg approach
treats every other query’s positive documents as additional negative examples in a binary rele-
vance classification exercise. The query’s original positive example is treated as a positive label,
and every other query’s positive example becomes a negative example. The model is asked to
predict relevance over those newly created [query, document] pairs, and a cross-entropy loss is
calculated over this prediction and then added to the model’s main training loss.
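A minimal sketch of this auxiliary loss is shown below; score_matrix is an illustrative name for the batch-level score matrix the model would produce (e.g. via MaxSim) between each query and every query’s positive document:

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(score_matrix: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over in-batch negatives.

    score_matrix[i, j] is the relevance score between query i and the positive
    document of query j: diagonal entries are the true positives, and every
    off-diagonal entry acts as an additional negative example.
    """
    labels = torch.arange(score_matrix.size(0), device=score_matrix.device)
    return F.cross_entropy(score_matrix, labels)

# This auxiliary loss is added to the main distillation loss; Section 3.6.5 finds it
# unnecessary in the 32-way distillation setting used here.
```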
This method has shown modest but consistent performance improvements across studies when used with single-vector retriever models (Karpukhin et al. 2020; Wang et al. 2022). It was added to the ColBERT training recipe in the paper introducing ColBERTv2, following the promising results obtained by other models (Santhanam et al. 2022b). This choice was not
thoroughly evaluated, and its impact is therefore unknown.
However, we empirically note that the resulting cross-entropy loss values are two orders of
magnitude lower than the ones from the main KL-Divergence loss used in typical ColBERTv2
training. Moreover, we hypothesize that in-batch negatives are an unnecessary signal for training
multi-vector models using 32-way triplets for multiple reasons, especially in non-English, data-
constrained settings. Firstly, the information obtained from distilling the ranking distribution of
a strong cross-encoder model over 32 documents should carry a much stronger signal than binary
relevance labelling between a positive document and randomly sampled negatives. Secondly, this
loss relies on the positive examples consistently being well-annotated and true positives, which is
not guaranteed to be the case with lossy annotation processes, or even the partially-automated
positive selection used in ColBERTv2 (Santhanam et al. 2022b).
To confirm the validity of our hypothesis, we train two separate ablation models on the same data: one with and one without in-batch negatives.
3.6.2 Scheduling
The learning rate scheduler used for the training of neural networks has been shown to have a potentially large impact on the performance of the resulting model (Zhai et al. 2022). There are
a few common schedulers yielding strong results, such as WSD (Warmup-Stable-Decay) (Hu et al.
2024), which increases the learning rate steadily before plateauing for the majority of training
and entering a decay phase, which should ideally be performed on higher quality data, or Linear-
Decay, where the model’s learning rate increases during a fixed number of warmup steps before
linearly lowering until the final training step, among others. The original ColBERTv2, as well as
JaColBERTv2 (Clavié 2023), used linear decay scheduling.
However, while tuned learning-rate scheduling consistently outperforms non-scheduled approaches, it is not without constraints. Notably, for the best performance, it either requires knowing the total number of steps in advance, or imposes constraints such as having a higher-quality data mix for schedulers relying on one for their annealing phase. Moreover, an optimal schedule
for a large number of steps is not guaranteed to work as well for lower data quantities. These
constraints are especially noticeable for retrieval models, which are expected to be applied to a wide variety of downstream tasks with varying data distributions, and therefore benefit immensely from being able to easily resume training without a large performance impact.
Recently, schedule-free learning has been proposed (Defazio et al. 2024). This new approach,
while not yet thoroughly tested across all domains, has empirically shown very encouraging results
on a large number of benchmarks. In practice, it introduces additional calculations as part of
the optimizer steps, allowing it to vary the learning rate without the need for a fixed, pre-defined
schedule. This considerably simplifies both the pre-training and fine-tuning processes, as there
is no need to optimise the scheduler used for training and similar parameters can be re-used for
different data scales.
Schedule-free learning has been noted as potentially requiring a higher learning rate than
scheduled learning, with optimal values empirically falling in the range of 1 to 10 times the original
learning rate (Defazio et al. 2024). We thus conduct an experiment comparing the training setting used for JaColBERTv2 and its ablations against schedule-free learning, with learning rates set to 1x, 3x and 5x the original 1e-05 learning rate. For all experiments, we retain the AdamW
(Loshchilov and Hutter 2017) optimizer used in previous work.
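As an illustration of the switch, a sketch using the schedulefree reference package (assuming its AdamWScheduleFree optimizer; the tiny model and random batches below are stand-ins for the real training loop):

```python
import torch
import torch.nn.functional as F
import schedulefree  # reference implementation released by the schedule-free authors

# Tiny stand-ins for the real retriever and its 32-way training batches.
model = torch.nn.Linear(128, 1)
fake_batches = [(torch.randn(32, 128), torch.randn(32)) for _ in range(10)]

optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=3e-5)

optimizer.train()  # schedule-free optimizers must be explicitly switched to train mode
for features, teacher_scores in fake_batches:
    optimizer.zero_grad()
    student_scores = model(features).squeeze(-1)
    # Any distillation loss would go here; plain MSE keeps the sketch minimal.
    loss = F.mse_loss(student_scores, teacher_scores)
    loss.backward()
    optimizer.step()  # no external learning-rate scheduler is required

optimizer.eval()  # switch to eval mode before validation or checkpointing
```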
3.6.3 Score normalization
In the current ColBERTv2 training recipe, scores are unnormalized: the raw logits outputted
by the teacher cross-encoder are used as teacher scores, and the output of the MaxSim scoring function is used as the student model’s score. These scores are on different scales, as the theoretical range of the MaxSim score is [0, Number of Query Tokens], while the range of cross-encoder logits includes negative numbers and is on a scale dependent on the original model’s training,
which varies teacher by teacher.
While some distillation losses can be seen as being partially robust to scale differences, which
partially justified the use of KL-Divergence in the ColBERTv2 paper (Santhanam et al. 2022b),
we believe the lack of normalization to be suboptimal for two main reasons. The first one is that,
given the large difference in scale, the loss calculation is likely to lead to a better approximation
of information loss if operating on a similar scale. The second is that normalized scores allow the
models to focus purely on the relative ranking of results, rather than absolute scores, the latter
of which may provide less useful information due to the automated nature of triplet generation.
We experiment with two previously used normalization approaches: one where only the teacher scores
are normalized (Lassance et al. 2024), and one where both the teacher and student scores are
normalized. In all cases, we use min-max normalization, defined as:
score_{normalized} = \frac{score - score_{least\ relevant}}{score_{most\ relevant} - score_{least\ relevant}} \qquad (2)
Effectively, this gives a score of 1 to the most relevant document identified by the teacher and
0 to the least relevant one, regardless of their absolute score. Every other score is then placed on
this scale depending on their distance to those two scores.
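A minimal sketch of this normalization applied to one query’s candidate scores (applied identically to teacher and student scores before the loss is computed):

```python
import torch

def min_max_normalize(scores: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Min-max normalize one query's candidate scores to [0, 1] (Equation 2).

    The most relevant document receives 1, the least relevant 0, and every other
    document is placed proportionally in between.
    """
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + eps)  # eps guards against identical scores
```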
3.6.4 Loss functions
For knowledge distillation in retrieval models, two loss functions are commonly used (Formal
et al. 2021a; Ren et al. 2021): MarginMSE and KL-Divergence, the latter of which is the one
used by ColBERTv2.
MarginMSE (Hofstätter et al. 2020) consists of computing the Mean-Squared Error on the difference between the margins of the model being trained and of the teacher model. The margin is defined as the difference between the score the model gives to the positive document and the score it gives to a negative document. In the case of n-way training, this margin is calculated over every [positive document score, negative document n score] pair. MarginMSE is thus computed as follows (where N is the batch size):
\text{MarginMSE} = \frac{1}{N} \sum_{i=1}^{N} \left( \text{margin}_{teacher}(x_i) - \text{margin}_{student}(x_i) \right)^2 \qquad (3)
Effectively, the training objective is for the student model’s margins to reproduce the teacher’s margins as closely as possible.
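A sketch of this loss for a batch of n-way examples, assuming score tensors of shape [batch, n] with the positive document in column 0 (a layout assumption made for the example):

```python
import torch

def margin_mse_loss(student_scores: torch.Tensor, teacher_scores: torch.Tensor) -> torch.Tensor:
    """MarginMSE over n-way examples (Equation 3).

    Both tensors have shape [batch, n]: column 0 holds the positive document's score,
    columns 1..n-1 hold the negatives'. The margin is positive minus negative, and the
    loss is the mean squared difference between student and teacher margins.
    """
    student_margins = student_scores[:, :1] - student_scores[:, 1:]  # [batch, n-1]
    teacher_margins = teacher_scores[:, :1] - teacher_scores[:, 1:]
    return ((teacher_margins - student_margins) ** 2).mean()
```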
KL-Divergence (Kim et al. 2021), on the other hand, seeks to directly minimise the differ-
ence between the distributions of scores of the model being trained and its teacher. KL-Divergence
loss is computed as follows (where N is the batch size):
\text{KL-Div} = \frac{1}{N} \sum_{i=1}^{N} \sum_{score} P_{teacher}(score \mid x_i) \log\!\left( \frac{P_{teacher}(score \mid x_i)}{P_{student}(score \mid x_i)} \right) \qquad (4)
In effect, it computes an estimation of how much information appears to be lost between the
teacher model’s distribution and the student model’s one, and minimising this loss becomes the
primary training objective.
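A matching sketch of the KL-Divergence loss over the same [batch, n] score layout, where normalized scores are turned into distributions over the n candidates via a softmax (the softmax step is an implementation assumption for the example):

```python
import torch
import torch.nn.functional as F

def kl_div_loss(student_scores: torch.Tensor, teacher_scores: torch.Tensor) -> torch.Tensor:
    """KL-Divergence distillation loss over n-way examples (Equation 4).

    Scores of shape [batch, n] are converted into probability distributions over the
    n candidate documents; the loss estimates how much information is lost when the
    student's distribution is used in place of the teacher's, averaged over the batch.
    """
    student_log_probs = F.log_softmax(student_scores, dim=-1)
    teacher_probs = F.softmax(teacher_scores, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```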
The use of either MarginMSE or KL-Divergence has been reported for knowledge distillation
into various retrieval models. While both loss functions have been shown to be strictly superior
to more traditional MSE-based losses (Izacard and Grave 2021; Hofstätter et al. 2020), there has
been little head-to-head comparison of the two. Recently, the SPLADE-V3 authors anecdotally
reported noticing overall similar performances between the losses, with MarginMSE favouring
recall and KL-Divergence precision, and opted for a mixed-loss approach for their final model,
using a conjunction of both with a lower weight attributed to MarginMSE (Lassance et al. 2024).
However, SPLADE-V3 was trained on only 8-way triplets, considerably fewer than our 32-way
approach.
Our hypothesis is that, given our training setting with numerous negatives and its reduced reliance on having a strong positive, as only the distribution of scores matters, KL-Divergence remains the optimal choice to train ColBERT models with knowledge distillation. In order to test this hypothesis, we will compare its results with MarginMSE, as well as mixes of both with various MarginMSE weightings: λ = {0.2, 0.1, 0.05}.
3.6.5 Ablation results
We present the results of all the ablation runs related to experiments detailed in the previous
subsections in Table 3. As we run our experiments sequentially, the results are presented in the
order of experiments, and the baseline for each new category is the best performing approach of
the previous one.
In-batch negatives  Our results on in-batch negatives confirm our hypothesis: they do not appear to provide a useful training signal to the model, even resulting in slightly decreased performance on MIRACL-small-ja. We believe this to be due to the factors we highlighted above, the main one being that the signal from distilling a teacher’s score distribution over 32 documents
appears to constitute a strong enough learning signal on its own. Additionally, some of our
positive examples may be false positives, and many of our negative examples are false negatives.
In-batch negatives are costly to compute, as they require an additional scoring stage for every
query against all the other queries’ positive documents, and materialising and computing a cross-
entropy matrix. Moreover, they place additional constraints on training as we need to ensure
higher data quality to fully leverage them. Removing the use of in-batch negatives thus represents
an efficiency gain by lowering the compute and memory use of training at no performance cost.
Following the results of this experiment, we remove the use of in-batch negatives from
our training recipe.
Scheduling The results of our experiments on learning-rate schedulers highlight schedule-
free learning (Defazio et al. 2024) as a strong alternative to the Linear-Decay scheduler used
in standard ColBERTv2 training. With a slight tweak to the learning rate, increasing it from the best-performing linear decay rate of 1e-05 to 3e-05, schedule-free learning results in a slight performance increase across the board. Additionally, schedule-free learning drastically
Setting | JQaRA NDCG@10 | JQaRA MRR@10 | MIRACL-small-ja NDCG@10 | MIRACL-small-ja Recall@5 | Average NDCG@10 | Average Overall
In-Batch Negatives (IBNeg)
With IBNeg (baseline) | 0.581 | 0.820 | 0.681 | 0.707 | 0.631 | 0.697
Without IBNeg | 0.580 | 0.820 | 0.682 | 0.713 | 0.631 | 0.699
Scheduling
Linear Decay (baseline) | 0.581 | 0.820 | 0.681 | 0.713 | 0.631 | 0.699
Schedule-free (1x LR, 1e-05) | 0.576 | 0.820 | 0.669 | 0.707 | 0.623 | 0.693
Schedule-free (3x LR, 3e-05) | 0.581 | 0.821 | 0.683 | 0.717 | 0.632 | 0.701
Schedule-free (5x LR, 5e-05) | 0.575 | 0.815 | 0.681 | 0.709 | 0.628 | 0.695
Normalization
No norm (baseline) | 0.581 | 0.820 | 0.681 | 0.713 | 0.631 | 0.699
Normalize teacher scores | 0.565 | 0.802 | 0.680 | 0.717 | 0.623 | 0.691
Normalize student & teacher | 0.585 | 0.827 | 0.691 | 0.716 | 0.638 | 0.705
Loss Function
KL-Divergence (baseline) | 0.585 | 0.827 | 0.691 | 0.716 | 0.638 | 0.705
MarginMSE | 0.583 | 0.827 | 0.672 | 0.699 | 0.628 | 0.695
Mixed (KL-Div λ = 1.0)
+ MarginMSE λ = 0.2 | 0.576 | 0.813 | 0.688 | 0.716 | 0.632 | 0.698
+ MarginMSE λ = 0.1 | 0.582 | 0.816 | 0.687 | 0.714 | 0.635 | 0.700
+ MarginMSE λ = 0.05 | 0.578 | 0.812 | 0.691 | 0.691 | 0.635 | 0.693

Table 3: Results of all training-setting ablations, compared to the relevant baseline.
reduces the level of optimisations required to operate at different data scales, and removes any
costs associated with continuously stopping and restarting our model’s training to expose it to
different data distributions when attempting to use it in a different domain. We choose to use
schedule-free learning as part of our training. This leads us to disable gradient clipping (Wilson et al. 2017), which has been observed to cause schedule-free learning runs to fail to converge (Defazio et al. 2024).
Normalization  Normalizing both teacher and student scores results in a sizeable performance increase on all datasets, while normalizing only teacher scores yields a consistent performance decrease. This is in line with our expectations. Moreover, normalized teacher scores, which appear to only function well when used in conjunction with normalized student scores, are a prerequisite to being able to best utilise an ensemble of teachers outputting scores on different scales. Based on these results and this constraint, we introduce teacher and student score normalization to our training method.
Loss Functions Unlike our other experiments, our empirical results demonstrate that the
currently most commonly used loss function for knowledge distillation in ColBERT models, KL-
Divergence, is the best performing option. MarginMSE exhibits reduced performance on all
datasets, with the reduction being noticeably more pronounced on MIRACL-small-ja. Interestingly, combining MarginMSE and KL-Divergence consistently leads to worse results than using either loss on its own. We hypothesize that the varying quality of positive examples in our dataset could be a partial explanation for the substandard performance of MarginMSE, as it is calculated based on the margin between the positive example and negative examples, rather than strictly focusing on the score distribution in the same way as KL-Divergence. As it performs strictly worse than KL-Divergence, and as discarding it additionally frees us from data quality constraints, we choose to discard MarginMSE and retain the KL-Divergence loss function for our model training.
3.7 Teacher Models
Another important part of knowledge distillation is the choice of teacher model. In Informa-
tion Retrieval, teacher scores are most often obtained from the logits of cross-encoder reranker
models, which assign a relevance score to query-document pairs (Lin et al. 2021). Cross-encoders
are powerful retrieval models, as they are aware of both the query and the document at scoring
time, whereas most other retrieval methods encode documents and queries separately (Mitra
et al. 2018). However, this means that they are particularly costly, as the model must run a full
forward pass on every single pair in order to output a score. As a result, they are unsuitable as a standalone retriever for medium-to-large scale document collections, but are particularly
powerful as teachers for distillation.
The original ColBERTv2 paper used scores from a lightweight 22-million-parameter MiniLM
(Wang et al. 2020)-based model, itself distilled from a larger, more powerful cross-encoder
(Thakur et al. 2021). This choice was partially motivated by the computational requirements
of generating teacher scores for the entirety of the ColBERTv2 training set. As it comprises
over 19 million triplets, each made up of 64 individual query-document pairs, the score generation
process for ColBERTv2 required computing cross-encoder scores for a total of 1.2 billion pairs.
However, larger cross-encoders generally result in better performance (Thakur et al. 2021;
Nogueira et al. 2020), and successful distilled models in the general domain have shown that
stronger teachers yield stronger distilled models (Wang et al. 2020; Sanh et al. 2019). Specifically,
anecdotal results have shown that using T5-3B based rerankers such as MonoT5-3B (Nogueira
et al. 2020), built on a 3-billion-parameter sequence-to-sequence model (Raffel et al. 2020),
consistently resulted in noticeably stronger distilled multi-vector models than distillation from
smaller rerankers (Yang et al. 2024).
3.7.1 Single-Teacher
Models To investigate the performance of various rerankers as teachers on Japanese retrieval,
we generate teacher scores using a wide variety of models, via the rerankers library
(Clavié 2024). As monolingual Japanese models, we use the jp-small9 reranker, based
on Multilingual-MiniLM (Wang et al. 2020), and the jp-base10 and jp-large11 rerankers, based on
Nagoya University's SimCSE models (Gao et al. 2022). We also generate scores using the BGE-M3-reranker
(Chen et al. 2024) (M3) model, the highest performing multilingual reranker, as
well as BGE-M3-jp12 (M3-jp), a version of it further fine-tuned on small-scale Japanese datasets.
We selected these models among the available Japanese rerankers for their strong results on
existing benchmarks (舘野 2024). Additionally, leveraging the fact that MMarco is a translation of
MS MARCO, we report the baseline performance of using the original ColBERTv2 triplet scores
(original), re-used in JaColBERTv2. Finally, we evaluate using the scores of the MonoT5-3B model
mentioned above, also generated on the English version of the dataset.
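As an illustration of the score-generation step, the snippet below sketches how n-way teacher scores might be obtained with the rerankers library; the model identifier, the example query and documents, and the exact result attributes are assumptions for illustration rather than a verbatim excerpt of our pipeline.

```python
from rerankers import Reranker

# BGE-Reranker-M3 as the teacher; the exact checkpoint name is an assumption.
ranker = Reranker("BAAI/bge-reranker-v2-m3")

query = "日本で一番高い山は何ですか?"
docs = [
    "富士山は日本で最も高い山である。",  # positive
    "琵琶湖は日本最大の湖である。",      # negative
]

ranked = ranker.rank(query=query, docs=docs)
teacher_scores = [result.score for result in ranked.results]
```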
Results The results of models trained using various rerankers to generate teacher scores are
presented in Table 4. The impact of using different teachers is immediately clear. Interestingly,
the jp-* models, in their small, base and large sizes, all yield stronger results on MIRACL-small-ja
than the original teacher's scores. However, their use as teachers for our models appears to result
in a significant performance decrease on JQaRA compared to the original scores. On the other hand,
the BGE-Reranker-M3 models prove to be very strong teachers.
JQaRA MIRACL-small-ja Average
NDCG@10 MRR@10 NDCG@10 Recall@5 NDCG@10 Overall
Baseline 0.585 0.827 0.691 0.716 0.638 0.705
Single Teacher
jp-small 0.567 0.807 0.703 0.737 0.635 0.704
jp-base 0.569 0.810 0.713 0.747 0.641 0.710
jp-large 0.577 0.816 0.713 0.741 0.645 0.712
M3 0.589 0.836 0.740 0.788 0.665 0.738
M3-jp 0.588 0.838 0.728 0.757 0.658 0.728
MonoT5-3B 0.587 0.835 0.594 0.642 0.591 0.665
Table 4 Results of ablation runs using various models as distillation teachers
9 https://huggingface.co/hotchpotch/japanese-reranker-cross-encoder-small-v1
10 https://huggingface.co/hotchpotch/japanese-reranker-cross-encoder-base-v1
11 https://huggingface.co/hotchpotch/japanese-reranker-cross-encoder-large-v1
12 https://huggingface.co/hotchpotch/japanese-bge-reranker-v2-m3-v1
Interestingly, the multilingual version of M3, which has not been fine-tuned on Japanese data,
vastly outperforms its fine-tuned version on MIRACL-small-ja, and roughly matches its performance
on JQaRA, resulting in a noticeably stronger average performance. Ultimately, BGE-Reranker-M3,
in its non-finetuned multilingual version, appears to be the strongest available teacher, reaching
an overall score of 0.738, compared to the noticeably lower 0.705 score of the original training scores.
Finally, it is also worth noting that the MonoT5-3B reranker yields strong results on JQaRA,
but leads to a sizeable performance drop on MIRACL-small-ja. This behaviour indicates a
potentially interesting trend: models generating scores on English-language MS MARCO rather
than MMarco, as is the case for the original scores and MonoT5-3B, appear to result in strong
performance on JQaRA and weaker results on MIRACL-small-ja, while the opposite holds true
for Japanese-language models.13
3.7.2 Ensembled Teachers
Ensembling Teachers Additionally, recent work on English Information Retrieval has increasingly
highlighted that ensembling the scores of multiple teachers produces better distilled
models (Lassance et al. 2024), even when the ensembled teachers' individual performances are
largely similar (Hofstätter et al. 2020). The most common way of ensembling multiple teachers'
scores is a two-step process, where the scores are first normalized using min-max normalization
as in Section 3.6.3, before being averaged. We believe that these results may not reproduce
in a Japanese setting, as there exist far fewer Japanese base models and strong rerankers than
in high-resource language settings. However, we evaluate a large range of teacher ensembles as
part of our study, using the teacher models described above.
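The two-step ensembling procedure we evaluate can be sketched as follows; this is a minimal illustration assuming each teacher provides an (n_queries, n_way) score matrix, not our actual scoring pipeline.

```python
import torch

def ensemble_teacher_scores(score_sets: list[torch.Tensor]) -> torch.Tensor:
    """Min-max normalize each teacher's (n_queries, n_way) scores per query,
    then average the normalized scores across teachers."""
    normalized = []
    for scores in score_sets:
        mins = scores.min(dim=-1, keepdim=True).values
        maxs = scores.max(dim=-1, keepdim=True).values
        normalized.append((scores - mins) / (maxs - mins + 1e-6))
    return torch.stack(normalized).mean(dim=0)

# Two hypothetical teachers scoring 32-way triplets for four queries.
m3_scores = torch.randn(4, 32)
m3_jp_scores = torch.randn(4, 32)
ensembled_scores = ensemble_teacher_scores([m3_scores, m3_jp_scores])
```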
Results The results of the various ensembling combinations are presented in Table 5. Overall,
no ensembling combination outperforms simply using BGE-Reranker-M3 as a single teacher on
all metrics, which confirms our initial impression that its individual performance outweighs any
gains from ensembling with weaker models.
It is worth noting that some ensembling combinations, such as BGE-Reranker-M3 + M3-jp
+ MonoT5-3B, reach promising results, with higher NDCG@10 scores on both development sets.
However, generating teacher scores is costly, particularly with the very large MonoT5-3B model,
and the weaker performance of the ensembled teachers on non-NDCG metrics does not appear to
justify this computational cost. We thus choose to retain BGE-Reranker-M3 as our single-model
teacher for the full training run.
13 We do not explore this phenomenon further as generating scores with a 3 billion parameter model is particularly
costly, and this analysis is outside the scope of this study.
JQaRA MIRACL-small-ja Average
NDCG@10 MRR@10 NDCG@10 Recall@5 NDCG@10 Overall
Baseline (orig.) 0.585 0.827 0.691 0.716 0.638 0.705
Best single teacher (M3) 0.589 0.836 0.740 0.788 0.665 0.738
Ensembled Teachers
M3 + M3-jp 0.587 0.835 0.733 0.764 0.660 0.730
M3 + M3-jp + orig. 0.585 0.832 0.742 0.778 0.664 0.734
M3 + M3-jp + large 0.587 0.828 0.739 0.766 0.663 0.730
M3 + M3-jp + large + base 0.589 0.834 0.738 0.763 0.664 0.731
M3 + MT5 0.598 0.837 0.702 0.745 0.650 0.721
M3-jp + MT5 0.596 0.832 0.694 0.734 0.645 0.714
M3 + M3-jp + MT5 0.598 0.832 0.743 0.769 0.671 0.736
M3 + orig. + MT5 0.597 0.837 0.713 0.747 0.655 0.724
M3-jp + orig. + MT5 0.596 0.835 0.707 0.740 0.652 0.720
M3 + M3-jp + orig. + MT5 0.596 0.835 0.727 0.760 0.662 0.730
All rerankers 0.589 0.841 0.711 0.759 0.650 0.725
Table 5 Results of ablation runs using various ensembles of models as distillation teachers. Best overall
results are reported in bold, and best results within the ensembled category in italic. "orig."
refers to the original training set, "MT5" to MonoT5, "large" to jp-large and "base" to
jp-base.
Concurrently with this work, more recent efforts have attempted to leverage Large Language
Models (LLMs) as rerankers or, more generally, as powerful retrieval models (Zhuang et al. 2024;
Qin et al. 2024; Pradeep et al. 2023; Li et al. 2023; Lee et al. 2024). Many of these approaches are
unsuitable for generating teacher scores, as they do not produce relevance scores but simply output
reranked lists, with the scoring mechanism largely remaining a black box (Pradeep et al. 2023; Qin
et al. 2024). However, very recent work has highlighted the potential of methods generating scores
which could potentially serve as a useful distillation signal, in a similar way to cross-encoder logits (Li
et al. 2023; Lee et al. 2024). Due to the recency of these methods and our computational constraints,
we choose to leave the exploration of such methods to future work, although we believe
that their use may result in further downstream performance improvements.
3.8 Post-Training
As discussed in Section 3.2, our models will be trained on MMarco, a dataset which was
machine translated from English to Japanese in 2019, and therefore often contains lower quality
Japanese or odd sentence constructions.
While we believe that this issue is unlikely to have a large impact on our final model, we
propose post-training (also called fine-tuning) the model on smaller, higher-quality datasets.
Triplets Proportion (w/o MMarco) Proportion (with MMarco)
MIRACL 115,365 64.01% 60.72%
JQaRA 35,733 19.93% 14.17%
JaGovFaqs 28,902 16.06% 15.21%
Total without MMarco 180,000 100% 90.9%
MMarco 18,000 10% (for reference) 9.9%
Total with MMarco 198,000 110% (for reference) 100%
Table 6 Overview of the datasets used in the post-training data mixes.
To do so, we choose to use the previously discussed MIRACL (Zhang et al. 2023) and
JQaRA (Tateno 2024b) datasets, as well as JaGovFaqs, a subcomponent of JMTEB containing questions
and answers from the Japanese Government's FAQ sections.
This post-training phase has two aims. The first is to improve our model's ability to understand
Japanese in more diverse settings than the ones observed during pre-training. The second
is to highlight the potential gains that can be obtained with relatively small-scale post-training
on domain-specific data, which should be reflected in performance improvements on
JQaRA and MIRACL.
We present the full make-up of the post-training data in Table 6. We experiment with two
post-training settings. The first setting contains only the three datasets listed above. The second
additionally includes MMarco triplets randomly sampled from the pre-training set, amounting to
10% of the post-training data, to address the issue of catastrophic forgetting, where the model
forgets previous training when exposed exclusively to new data (Kirkpatrick et al. 2017).
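A minimal sketch of assembling the two post-training mixes is given below, using the dataset sizes from Table 6 and the full MMarco pre-training set size; the triplet lists are placeholders standing in for (query, documents, teacher scores) tuples rather than our actual data-loading code.

```python
import random

random.seed(42)

# Placeholder triplet lists; sizes follow Table 6 and the full MMarco training set.
miracl = [("miracl", i) for i in range(115_365)]
jqara = [("jqara", i) for i in range(35_733)]
jagovfaqs = [("jagovfaqs", i) for i in range(28_902)]
mmarco = [("mmarco", i) for i in range(3_200_000)]

# Setting 1: only the three smaller, higher-quality datasets.
post_train = miracl + jqara + jagovfaqs

# Setting 2: re-inject randomly sampled MMarco triplets (10% of the post-training
# set size) to mitigate catastrophic forgetting.
n_mmarco = int(0.10 * len(post_train))
post_train_with_mmarco = post_train + random.sample(mmarco, n_mmarco)
random.shuffle(post_train_with_mmarco)
```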
3.9 Checkpoint Averaging
Finally, we present the hypothesis that checkpoint averaging can improve our model's generalisation
ability. Checkpoint averaging, also called "model merging", consists of taking multiple
different checkpoints of similarly sized models and averaging their parameters at each layer to
create a merged model.
This practice has a long history in statistical Machine Learning (Polyak and Juditsky 1992),
with early research showing that averaging the parameters of a trained model during its learning
rate decay phase can outperform the final checkpoint (Yin 1992). More recently, this method
has experienced renewed interest, with the merging of different Large Language Models (LLMs)
showing noticeably stronger benchmark results than a single checkpoint (Akiba et al. 2024).
During the writing of this work, Meta released the Llama-3.1 family of models, where the final
models consist of an averaged version of the final few checkpoints (LlamaTeam 2024), following
the intuition of Polyak’s method (Polyak 1990).
A similar line of work, originally observed in Kaggle machine learning competitions14 and
later confirmed in reproducible research, has highlighted the potential
of so-called "model soups" (Wortsman et al. 2022). With this approach, the weights of models from multiple
independent training runs are averaged, consistently yielding statistically significant downstream
improvements.
The intuition behind such averaging methods is conceptually rooted in techniques such as the
Exponential Moving Average (EMA) of weights. EMA is frequently used as a way to improve the
learning effectiveness of Teacher-Student models, where it is found that averaged weights yield
better generalization potential, although the underlying mechanisms of this generalization gain
are rarely further explored (He et al. 2020; Grill et al. 2020). Recent investigations into EMA
have suggested that the role of weight averaging is largely to provide additional regularization,
leading to the improvements consistently observed in downstream tasks (Morales-Brotons et al.
2024).
In Information Retrieval, this practice is largely understudied, with no previous work reporting
its use to the best of our knowledge. We believe that it is especially suitable for creating better
overall retrievers by merging the weights of post-trained (fine-tuned) models, which might hurt
performance on datasets outside their fine-tuning distribution, with the weights of the original model.
We hypothesize that doing so will retain most of the performance improvements on datasets
similar to the post-training set while avoiding degradation on other tasks.
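Concretely, the uniform parameter averaging described above can be sketched as follows; this is a minimal illustration over toy same-architecture models rather than our actual merging script.

```python
import torch

def average_checkpoints(state_dicts: list[dict]) -> dict:
    """Average the parameters of several same-architecture checkpoints, key by key."""
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

# Toy checkpoints standing in for the post-trained and original models.
checkpoint_a = torch.nn.Linear(4, 2).state_dict()
checkpoint_b = torch.nn.Linear(4, 2).state_dict()

merged_model = torch.nn.Linear(4, 2)
merged_model.load_state_dict(average_checkpoints([checkpoint_a, checkpoint_b]))
```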
4 Final Experimental Setting
The final experimental setting, used for our model training, is deduced from the results of the
various ablation experiments detailed in the previous sections. We provide an overview of the
ablation-informed decisions made in Table 7.
We have shown in Section 3.5 that using dynamic query length at inference time is strictly superior to fixed-length
queries, and thus adopt dynamic query length for all our final evaluations.
As for our training setting, following the ablation results, we have confirmed the hypothesis
presented in Section 3.6.1 and opt not to use in-batch negatives. We adopt schedule-free learning,
as presented in Section 3.6.2, due to its reduced constraints with no performance decrease.
14 As discussed in the context of the SETI competition, and referred to as "Checkpoint Averaging":
https://www.kaggle.com/competitions/seti-breakthrough-listen/discussion/266852
Setting Setting Used Changed
Query-Length Dynamic Yes
In-Batch Negatives No Yes
Scheduler Schedule-Free Yes
Gradient Clipping Disabled Yes
Learning Rate 3e-5 Yes
Teacher Scores Normalization Min-Max Normalization Yes
Student Scores Normalization Min-Max Normalization Yes
Mixed Loss No No
Loss Function KL-Divergence No
Teacher Model BGE-M3-Reranker Yes
Batch size (per GPU) 16 No
Maximum Document Length 300 No
Warmup steps 5% of total No
Gradient Accumulation Disabled No
Table 7 Overview of the final optimal training settings resulting from our experiments, and whether
or not they represent a change from previously used settings. Settings in italic are retained
from previous approaches with no further experiments within this paper.
As suggested in Section 3.6.3, we normalize both teacher and student scores using min-max
normalization, and use the KL-Divergence loss presented in Section 3.6.4. We train our models using
knowledge distilled from the teacher scores of BGE-M3, as our results in Section 3.7 show it to
be the best performing option on our development sets. We also report training settings that are
unchanged from JaColBERTv2's original settings,15 to facilitate reproduction.
Potential Impact on Data Selection An important aspect of our training recipe changes
is that the "positive" and "negative" labels for each document go unused. Indeed, our
only learning signal is the KL-Divergence loss computed on min-max normalized teacher scores,
which trains our model to learn the score distribution of its teacher model. While
we do not further modify the training data mix in this work, this lifts a considerable constraint
for future endeavours, as the use of positive and negative labels often renders the curation of
training data more difficult due to the porous nature of relevance judgements. Indeed, curating
the proper mix of "hard" and "easy" negatives is a complex process whose outcomes are not yet
fully understood (Pradeep et al. 2022); some "hard negatives" could very well be positives,
and filtering out "too-hard" negatives remains a mostly empirical, error-prone process (Merrick et al.
2024).
15 After carefully analysing the data make-up of the ablation dataset and verifying it would not have a strong
impact, maximum document length was set to 228 for ablation runs in order to minimise computational cost.
For the full length training run, we reverted to JaColBERTv2’s 300 to account for longer outlier documents.
JQaRA MIRACL-small-ja Average
NDCG@10 MRR@10 NDCG@10 Recall@5 NDCG@10 Overall
JaColBERTv1
(Base Model) 0.550 0.811 0.652 0.681 0.601 0.674
JaColBERTv2 0.585 0.836 0.727 0.751 0.656 0.725
Original settings 0.578 0.820 0.681 0.707 0.630 0.697
Final Settings 0.589 0.836 0.740 0.788 0.665 0.738
Table 8 Results of JaColBERTv1 (the base model for both our experiments and JaColBERTv2),
JaColBERTv2, the original training settings, as well as our final experimental setting ablation
runs on our development evaluation set. Best results are indicated in bold.
Comparison to previous models Table 8 presents a comparison of an ablation run with
the original JaColBERTv2 training settings, an ablation run using all of our final chosen
parameters, JaColBERTv1 (our base model), and JaColBERTv2, the previous best-performing
Japanese ColBERT model. Our results appear to support our claim that the original
JaColBERTv2 training recipe is highly suboptimal. With just 320,000 32-way triplets, our ablation
model vastly outperforms the original setting, reaching an average score of 0.738, while the
original settings only marginally outperform JaColBERTv1, with a score of 0.675. Even more
noteworthy, our ablation run, with just 4% of JaColBERTv2's training data, outperforms it on
both development evaluation sets. Finally, it is important to note that all models discussed here
outperform JaColBERTv1's score of 0.674, further showcasing the importance of n-way training
and knowledge distillation, even in sub-optimal training settings.
Hardware Usage As part of these experiments, a total of 28 ablation models were trained.
Each training run represents 1.3 hours of training time on 4 A100 GPUs, or 5.2 A100 hours
per run. As a result, we estimate that a total of 146 A100 hours were spent on ablation runs.
Additionally, 15 A100 hours were spent generating teacher scores over the full 3,200,000-triplet
training set using the BGE-M3-Reranker model. Finally, an estimated 8 GPU hours were spent
generating teacher scores with all models for the ablation runs, the majority of which was dedicated to
the MonoT5-3B model. This results in a total pre-final-training GPU usage of 169 A100 hours.
5 Final Results
The results for all newly introduced models and the baselines are presented in Table 9. In
the interest of readability, we report the results for all models on the main evaluation of each
dataset, as defined in Section 3.4.1. Full evaluation results of our new models covering other, less
commonly reported metrics, are available in Appendix A.
Initial Results Immediately, we notice that all versions of JaColBERTv2.5, no matter the
post-training and checkpoint averaging setup, largely outperform JaColBERTv2 and all previous
approaches on all five benchmarks. The final checkpoint resulting from the initial training phase
reaches an average score of 0.744, with JaColBERTv2 previously reaching 0.720 and BGE-M3 in
its all setting, combining dense, sparse and multi-vector representations, 0.714.
JSQuAD MIRACL JQaRA JaCWIR ESCI Average
Recall@3 NDCG@10 NDCG@10 MAP@10 NDCG@10
Baselines (Multi)
bge-m3 0.939 0.728 0.539 0.864 0.399 0.694
bge-m3 (all) 0.958 0.752 0.576 0.906 0.380 0.714
multilingual-e5 (large) 0.953 0.706 0.554 0.876 0.320 0.682
multilingual-e5 (base) 0.934 0.647 0.471 0.852 0.347 0.650
multilingual-e5 (small) 0.934 0.636 0.492 0.869 0.331 0.652
Baselines (Mono)
GLuCoSE 0.798 0.348 0.309 0.686 0.207 0.470
sup-simcse-ja (base) 0.793 0.171 0.312 0.578 0.140 0.399
sup-simcse-ja (large) 0.777 0.199 0.392 0.474 0.140 0.396
JaColBERTv1 0.961 0.583 0.550 0.904 0.418 0.683
JaColBERTv2 0.968 0.667 0.585 0.919 0.462 0.720
JaColBERTv2.5
Final checkpoint 0.973 0.756 0.601 0.928 0.462 0.744
+ full post-train 0.970 0.780 0.608 0.924 0.451 0.747
+ post-train (no mmarco) 0.972 0.772 0.613 0.923 0.452 0.746
JaColBERTv2.4
(no post train, averaged) 0.973 0.757 0.601 0.929 0.463 0.745
JaColBERTv2.5 (final)
(post-train, averaged) 0.974 0.778 0.618 0.928 0.462 0.752
Table 9 Results for all baselines and newly introduced models on the main metric for all five evaluation
datasets, as well as their averaged results. Best overall results are indicated in bold. Results
in italics indicate that the model was exposed to the task’s training set.
More interestingly, this checkpoint also achieves an entirely out-of-domain NDCG@10 of 0.756 on
MIRACL, largely surpassing JaColBERTv2's out-of-domain 0.667 and outperforming BGE-M3
(all)'s score of 0.752, despite the latter having been trained on MIRACL. These very strong
results confirm our intuition that the existing and most commonly used training recipe for multi-vector
models, used in training JaColBERT, was largely suboptimal, and that our proposed
improvements result in substantially stronger downstream results. Moreover, they confirm that it is
possible to reach state-of-the-art Japanese retrieval performance while using two orders of magnitude
less data and compute than leading models.
Post-Training We then explore the results of our post-training step. Both post-training settings
result in a slight average score improvement, bringing the average model score to 0.747 when
post-training with MMarco data re-injected and 0.746 without. However, these results are obtained
via large gains on datasets which are now in-domain, namely MIRACL and JQaRA, while causing
moderate degradation on the three other datasets. Interestingly, while adding MMarco data to
the post-training set results in less pronounced performance increases on JQaRA, it counter-intuitively
further increases the model's performance on MIRACL, despite reducing its relative
importance in the training set.
Checkpoint Averaging Finally, it is noticeable that averaging the two best-performing
checkpoints of the initial training with the final one results in a slight overall performance increase,
although it does not represent a substantial improvement. As a result of this slight increase, we
release this model as JaColBERTv2.4. However, averaging the two post-trained checkpoints,
with and without MMarco, with the three checkpoints from the original run results in greatly
improved performance across the board. Indeed, this final version of the model, which is the
version we name JaColBERTv2.5, largely outperforms all other variants, reaching an average
score of 0.752. This model retains most of the performance gain of the most successful post-training
runs on MIRACL, resulting in an increase in its NDCG@10 score from 0.757 to
0.778, just 0.002 points short of the best post-trained model (0.780). Even more interestingly, it
is the strongest performing model on JQaRA, representing a 0.017 NDCG@10 increase over the
non-post-trained model, and a 0.005 increase over the post-trained versions.
These gains are achieved with little or no degradation on any of the other datasets, with
averaging fully compensating for the degradation experienced by the pre-averaging post-trained
models. This suggests that checkpoint averaging has a strong ability to reverse catastrophic
forgetting (Kirkpatrick et al. 2017) in retrieval models, while retaining most of the domain-specific
performance gained from post-training.
Statistical Significance Due to our limited computational budget, all the results discussed
here are produced by single-run trainings, using the same random seed of 42. Such reporting is
common in the literature (Santhanam et al. 2022b; Yates et al. 2021; Ren et al. 2021), where
improvements of more than 0.001 NDCG@10 over multiple datasets are generally presented as
significant with no further discussion (Qu et al. 2021; Xiao et al. 2023; Wang et al. 2022).
Hardware Usage Our final training run took 15.5 hours on 4 A100 GPUs, representing a
total GPU usage of 62 hours. Post-training without MMarco took an hour on the same hardware
setting, and 1.2 hours with MMarco, representing a total usage of 8.8 hours over the two runs, which
we round up to 9 for clearer reporting. Our final training and post-training steps, in total,
required 70 A100 hours. Combined with the 162 GPU hours spent on generating teacher
scores, the total GPU usage of this study represents 233 A100 hours. While slightly above the
JaColBERTv2 computational budget of 228 A100 hours, we fall within the upper bound of our
allocated computational budget of 239 hours, while reaching significantly stronger results than
any previous approach.
6 Conclusion
In this work, we present JaColBERTv2.5, a model obtained by systematically evaluating
the impact of potential improvements to the ColBERTv2 inference and training recipes, through
small-scale ablation runs. Notably, we identify a better way to handle inference-time query length,
devise a much improved training setting, and identify the optimal teacher model for knowledge
distillation.
JaColBERTv2.5, while trained on only 40% as much data as JaColBERTv2, largely outper-
forms all other retrieval methods in Japanese, including multilingual models with five times the
parameter count and trained using two orders of magnitude more compute and data.
Throughout our experiments, we have shown that multi-vector retrieval models can not only
effectively bridge the gap between multilingual and monolingual models, but largely outperform
the former, with limited compute resources and lower quality data than their English equivalents.
The results of our training recipe also show that it is possible to train models, using knowledge
distillation, without any reliance on hard “positive” or “negative” labels, focusing entirely on the
teacher score distribution instead. This represents valuable information for future work, as a step
towards greatly simplifying the data curation process.
We have also demonstrated that checkpoint averaging, where the weights of multiple check-
points of a similarly-shaped model are averaged to create a merged model, can greatly improve the
generalisation potential of fine-tuned JaColBERT models, while retaining the same out-of-domain
performance as the original model.
We make both JaColBERTv2.5,16 our final checkpoint resulting from averaging our final
model with post-training runs on two slightly different distributions, and JaColBERTv2.4,17 the
outcome of merging the three best checkpoints of the original pre-training, publicly available.
We believe that our work can support the development of future monolingual retrievers, both
in Japanese and in other lower-resource languages. Notably, we believe that our improved training
recipe can be directly applied to sparse retrieval models such as SPLADE (Lassance et al. 2024),
which has already shown strong potential in a Japanese setting.18
In order to best support such future work, we make the entirety of our training data for
both ablation and full training runs, teacher scores included, publicly available.19 To support further
studies into better understanding multi-vector retrieval models, we release all mid-training
checkpoints, saved every 2,000 training steps.20
More broadly, while this work modernises and considerably improves the state-of-the-art meth-
ods of multi-vector retrieval training, it largely leaves the exploration of newer, LLM-enhanced
training methods to future work. We believe that combining our method with such models,
through the use of synthetic data or LLM-based teacher scores, could further improve model
capabilities across the board, and intend to explore this in future work.
Finally, while our specific application case is focused on the Japanese language, all of our
training recipe improvements are language-agnostic, and even our ablation-sized models, trained
on just 320,000 triplets, vastly outperform previous monolingual approaches. As a result, we
believe that our method can be directly applied to other languages and domains and yield large
performance gains.
7 Ethical Considerations
We acknowledge the importance of ethical considerations in Natural Language Processing work.
We have strived to make our work as reproducible as possible, making the final model,
in-development versions of it, and the entirety of our training data publicly available to facilitate
reproduction and future work.
There are no extreme ethical risks associated with our models. However, while they are not
16 https://huggingface.co/answerdotai/JaColBERTv2.5
17 https://huggingface.co/answerdotai/JaColBERTv2.4
18 As demonstrated by an early release from the University of Tsukuba, available at https://huggingface.co/aken12/splade-japanese-v3
19 https://huggingface.co/datasets/answerdotai/MMarco-japanese-32-scored-triplets
20 https://huggingface.co/collections/bclavie/jacolbertv25-checkpoints-66a37d8da6b0d4d69c14f9c3
generative and will not, by themselves, generate harmful content, they fall in line with existing
retrieval work. As such, our work is not exempt from potential biases, especially as the largest
part of our training data is a lightly filtered internet corpus which then underwent machine
translation. It is possible that our models might unduly favor certain types of content, and may
rank misinformation or harmful content highly for certain queries.
Acknowledgement
The author thanks Yuichi Tateno for his extensive work in creating Japanese retrieval benchmarks,
as well as training and evaluating Japanese rerankers, and Hayato Tsukagoshi for his
willingness to share his work on Japanese SimCSE models and his exploration of the Japanese
embedding training and data landscape, resulting in the concurrent publication of the Ruri family
of Japanese embeddings (Tsukagoshi and Sasano 2024). Further thanks extend to Benjamin
Warner for sharing his expert advice on the various ways to optimise model training and detect
inefficiencies, as well as Alexis Gallagher, for very insightful feedback during the writing of this
work. The author would also like to acknowledge Omar Khattab, Antoine
Chaffin and Griffin Adams for their eagerness to discuss and help clarify ideas, as well as Professor
Makoto P. Kato for helpful exchanges around building better Japanese retrieval models.
References
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Al-
tenschmidt, J., Altman, S., Anadkat, S., et al. (2023). “GPT-4 Technical Report.” arXiv
preprint arXiv:2303.08774.
Akiba, T., Shing, M., Tang, Y., Sun, Q., and Ha, D. (2024). “Evolutionary Optimization of
Model Merging Recipes.” arXiv preprint arXiv:2403.13187.
Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information Retrieval, Vol. 463. ACM
Press, New York.
Bassani, E. (2022). “ranx: A Blazing-Fast Python Library for Ranking Evaluation and Compar-
ison.” In European Conference on Information Retrieval, pp. 259–264. Springer.
Belkin, N. and Croft, W. (1987). “Retrieval Techniques.” Annual Review of Information Science
and Technology,22, pp. 109–145.
Bonifacio, L., Jeronymo, V., Abonizio, H. Q., Campiotti, I., Fadaee, M., Lotufo, R., and Nogueira,
R. (2021). "mMARCO: A Multilingual Version of the MS MARCO Passage Ranking
Dataset." arXiv preprint arXiv:2108.13897.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A.,
Shyam, P., Sastry, G., Askell, A., et al. (2020). “Language Models are Few-Shot Learners.”
Advances in Neural Information Processing Systems,33, pp. 1877–1901.
Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., and Liu, Z. (2024). “BGE M3-Embedding: Multi-
Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge
Distillation.” arXiv preprint arXiv:2402.03216.
Clavié, B. (2023). "Towards Better Monolingual Japanese Retrievers with Multi-Vector Models."
arXiv preprint arXiv:2312.16144.
Clavié, B. (2024). "rerankers: A Lightweight Python Library to Unify Ranking Methods." arXiv
preprint arXiv:2408.17344.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, É.,
Ott, M., Zettlemoyer, L., and Stoyanov, V. (2020). "Unsupervised Cross-lingual Representation
Learning at Scale." In Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, pp. 8440–8451.
Craswell, N., Mitra, B., Yilmaz, E., Campos, D., and Voorhees, E. M. (2020). “Overview of the
TREC 2019 Deep Learning Track.” arXiv preprint arXiv:2003.07820.
Defazio, A., Yang, X. A., Mehta, H., Mishchenko, K., Khaled, A., and Cutkosky, A. (2024).
“The Road Less Scheduled.” arXiv preprint arXiv:2405.15682.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). “BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding.” In Proceedings of the 2019 Con-
ference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
Formal, T., Piwowarski, B., and Clinchant, S. (2021a). “SPLADE: Sparse Lexical and Expansion
Model for First Stage Ranking.” In Proceedings of the 44th International ACM SIGIR
Conference on Research and Development in Information Retrieval, pp. 2288–2292.
Formal, T., Piwowarski, B., and Clinchant, S. (2021b). “A White Box Analysis of ColBERT.” In
Advances in Information Retrieval: 43rd European Conference on IR Research, ECIR 2021,
Virtual Event, March 28–April 1, 2021, Proceedings, Part II 43, pp. 257–263. Springer.
Gao, T., Yao, X., and Chen, D. (2022). “SimCSE: Simple Contrastive Learning of Sentence
Embeddings.” arXiv preprint cs.CL arXiv:2104.08821.
Giacalone, B., Paiement, G., Tucker, Q., and Zanibbi, R. (2024). “Beneath the [MASK]: An
Analysis of Structural Query Tokens in ColBERT.” In European Conference on Information
Retrieval, pp. 431–439. Springer.
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C.,
Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., Piot, B., Kavukcuoglu, K., Munos, R.,
and Valko, M. (2020). “Bootstrap Your Own Latent: A New Approach to Self-Supervised
Learning.” Advances in Neural Information Processing Systems,33, pp. 21271–21284.
He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. (2020). “Momentum Contrast for Unsupervised
Visual Representation Learning.” In Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, pp. 9729–9738.
Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M. M. A.,
Yang, Y., and Zhou, Y. (2017). “Deep Learning Scaling is Predictable, Empirically.” arXiv
preprint arXiv:1712.00409.
Hinton, G., Vinyals, O., and Dean, J. (2015). “Distilling the Knowledge in a Neural Network.”
arXiv preprint arXiv:1503.02531.
Hofstätter, S., Althammer, S., Schröder, M., Sertkan, M., and Hanbury, A. (2020). "Improving
Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation.” arXiv
preprint arXiv:2010.02666.
Hofstätter, S., Khattab, O., Althammer, S., Sertkan, M., and Hanbury, A. (2022). "Introduc-
ing Neural Bag of Whole-Words with Colberter: Contextualized Late Interactions using
Enhanced Reduction.” In Proceedings of the 31st ACM International Conference on Infor-
mation & Knowledge Management, pp. 737–747.
Hu, S., Tu, Y., Han, X., He, C., Cui, G., Long, X., Zheng, Z., Fang, Y., Huang, Y., Zhao, W.,
et al. (2024). “Minicpm: Unveiling the Potential of Small Language Models with Scalable
Training Strategies.” arXiv preprint arXiv:2404.06395.
Izacard, G. and Grave, E. (2021). “Distilling Knowledge from Reader to Retriever for Question
Answering.” In ICLR 2021-9th International Conference on Learning Representations.
Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., and Yih, W.-t.
(2020). “Dense Passage Retrieval for Open-Domain Question Answering.” In Proceedings
of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP),
pp. 6769–6781.
Kato, S., Togashi, R., Maeda, H., Fujita, S., and Sakai, T. (2017). “LSTM vs. BM25 for open-
domain QA: A Hands-on Comparison of Effectiveness and Efficiency.” In Proceedings of the
40th International ACM SIGIR Conference on Research and Development in Information
Retrieval, pp. 1309–1312.
Khattab, O. and Zaharia, M. (2020). “Colbert: Efficient and Effective Passage Search via Con-
textualized Late Interaction over BERT.” In Proceedings of the 43rd International ACM
SIGIR conference on research and development in Information Retrieval, pp. 39–48.
Kim, T., Oh, J., Kim, N., Cho, S., and Yun, S.-Y. (2021). “Comparing Kullback-Leibler Diver-
gence and Mean Squared Error Loss in Knowledge Distillation.” In Proceedings of the 30th
International Joint Conference on Artificial Intelligence (IJCAI), pp. 2628–2635.
Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan,
K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran,
D., and Hadsell, R. (2017). “Overcoming Catastrophic Forgetting in Neural Networks.”
Proceedings of the National Academy of Sciences,114 (13), pp. 3521–3526.
Kobayashi, M. and Takeda, K. (2000). “Information Retrieval on the Web.” ACM Computing
Surveys (CSUR),32 (2), pp. 144–173.
Kurihara, K., Kawahara, D., and Shibata, T. (2022). “JGLUE: Japanese General Language
Understanding Evaluation.” In Proceedings of the 13th Language Resources and Evaluation
Conference, pp. 2957–2966.
Lassance, C., Déjean, H., Formal, T., and Clinchant, S. (2024). "SPLADE-v3: New Baselines
for SPLADE.” arXiv preprint arXiv:2403.06789.
Lee, C., Roy, R., Xu, M., Raiman, J., Shoeybi, M., Catanzaro, B., and Ping, W. (2024). “NV-
Embed: Improved Techniques for Training LLMs as Generalist Embedding Models.” arXiv
preprint arXiv:2405.17428.
Lee, J., Dai, Z., Duddu, S. M. K., Lei, T., Naim, I., Chang, M.-W., and Zhao, V. (2023).
“Rethinking the Role of Token Retrieval in Multi-Vector Retrieval.” Advances in Neural
Information Processing Systems,36.
Li, C., Liu, Z., Xiao, S., and Shao, Y. (2023). “Making Large Language Models A Better
Foundation For Dense Retrieval.” arxiv preprint arXiv:2312.15503.
Lin, J., Alfonso-Hermelo, D., Jeronymo, V., Kamalloo, E., Lassance, C., Nogueira, R., Ogundepo,
O., Rezagholizadeh, M., Thakur, N., Yang, J.-H., and Zhang, X. (2023). “Simple Yet
Effective Neural Ranking and Reranking Baselines for Cross-Lingual Information Retrieval.”
arXiv preprint arXiv:2304.01019.
Lin, S.-C., Yang, J.-H., and Lin, J. (2021). “In-Batch Negatives for Knowledge Distillation
with Tightly-Coupled Teachers for Dense Retrieval.” In Proceedings of the 6th Workshop on
Representation Learning for NLP (RepL4NLP-2021), pp. 163–173.
LlamaTeam (2024). “The Llama 3 Herd of Models.” Human & Machine Intelligence, AI @ Meta
Reports.
Loshchilov, I. and Hutter, F. (2017). “Decoupled Weight Decay Regularization.” arXiv preprint
arXiv:1711.05101.
Louis, A. (2024). "DécouvrIR: A Benchmark for Evaluating the Robustness of Information Re-
trieval Models in French.” https://huggingface.co/spaces/antoinelouis/decouvrir.
Louis, A., Saxena, V., van Dijck, G., and Spanakis, G. (2024). “ColBERT-XM: A Modular
Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval.” arXiv
preprint arXiv:2402.15059.
MacAvaney, S. and Tonellotto, N. (2024). “A Reproducibility Study of PLAID.” arXiv preprint
arXiv:2404.14989.
Manning, C. D. (2008). Introduction to Information Retrieval. Syngress Publishing.
Merrick, L., Xu, D., Nuti, G., and Campos, D. (2024). “Arctic-Embed: Scalable, Efficient, and
Accurate Text Embedding Models.” arXiv preprint arXiv:2405.05374.
Mitra, B., Craswell, N., et al. (2018). “An Introduction to Neural Information Retrieval.” Foun-
dations and Trends®in Information Retrieval,13 (1), pp. 1–126.
Morales-Brotons, D., Vogels, T., and Hendrikx, H. (2024). “Exponential Moving Average of
Weights in Deep Learning: Dynamics and Benefits.” Transactions on Machine Learning
Research.
Muennighoff, N., Tazi, N., Magne, L., and Reimers, N. (2022). “MTEB: Massive Text Embedding
Benchmark.” arXiv preprint arXiv:2210.07316.
Nair, S., Yang, E., Lawrie, D., Duh, K., McNamee, P., Murray, K., Mayfield, J., and Oard,
D. W. (2022). “Transfer Learning Approaches for Building Cross-Language Dense Retrieval
Models.” In European Conference on Information Retrieval, pp. 382–396. Springer.
Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., and Deng, L. (2016).
“MS MARCO: A Human Generated Machine Reading Comprehension Dataset.” choice,
2640, p. 660.
Nogueira, R., Jiang, Z., and Lin, J. (2020). “Document Ranking with A Pretrained Sequence-to-
Sequence Model.” arXiv preprint arXiv:2003.06713.
Polyak, B. (1990). “New Stochastic Approximation Type Procedures.” Avtomatica i Tele-
mekhanika,7, pp. 98–107.
Polyak, B. T. and Juditsky, A. B. (1992). “Acceleration of Stochastic Approximation by Aver-
aging.” SIAM Journal on Control and Optimization,30 (4), pp. 838–855.
Pradeep, R., Liu, Y., Zhang, X., Li, Y., Yates, A., and Lin, J. (2022). “Squeezing Water from A
Stone: A Bag of Tricks for Further Improving Cross-Encoder Effectiveness for Reranking.”
In European Conference on Information Retrieval, pp. 655–670. Springer.
Pradeep, R., Sharifymoghaddam, S., and Lin, J. (2023). “RankZephyr: Effective and Robust
Zero-Shot Listwise Reranking is a Breeze!” arXiv preprint arXiv:2312.02724.
Qin, Z., Jagerman, R., Hui, K., Zhuang, H., Wu, J., Yan, L., Shen, J., Liu, T., Liu, J., Met-
zler, D., Wang, X., and Bendersky, M. (2024). “Large Language Models are Effective Text
Rankers with Pairwise Ranking Prompting.” In Findings of the Association for Computa-
tional Linguistics: NAACL 2024, pp. 1504–1518.
Qu, Y., Ding, Y., Liu, J., Liu, K., Ren, R., Zhao, W. X., Dong, D., Wu, H., and Wang, H.
(2021). “RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-
Domain Question Answering.” In Proceedings of the 2021 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies,
pp. 5835–5847.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and
Liu, P. J. (2020). “Exploring the Limits of Transfer Learning with a Unified Text-to-Text
Transformer.” Journal of Machine Learning Research,21 (140), pp. 1–67.
Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016). “SQuAD: 100,000+ Questions
for Machine Comprehension of Text.” In Proceedings of the 2016 Conference on Empirical
Methods in Natural Language Processing, pp. 2383–2392.
Reddy, C. K., Màrquez, L., Valero, F., Rao, N., Zaragoza, H., Bandyopadhyay, S., Biswas,
A., Xing, A., and Subbian, K. (2022). “Shopping Queries Dataset: A Large-Scale ESCI
Benchmark for Improving Product Search.” arXiv preprint arXiv:2206.06588.
Ren, R., Qu, Y., Liu, J., Zhao, W. X., She, Q., Wu, H., Wang, H., and Wen, J.-R. (2021). “Rock-
etQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking.” In
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,
pp. 2825–2835.
Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M., and Gatford, M. (1995).
“Okapi at TREC-3.” Nist Special Publication Sp,109, p. 109.
Saad-Falcon, J., Khattab, O., Santhanam, K., Florian, R., Franz, M., Roukos, S., Sil, A., Sultan,
M., and Potts, C. (2023). “UDAPDR: Unsupervised Domain Adaptation via LLM Prompting
and Distillation of Rerankers.” In Proceedings of the 2023 Conference on Empirical Methods
in Natural Language Processing, pp. 11265–11279.
Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019). “DistilBERT, a Distilled Version of
BERT: Smaller, Faster, Cheaper and Lighter.” arXiv preprint arXiv:1910.01108.
Santhanam, K., Khattab, O., Potts, C., and Zaharia, M. (2022a). “PLAID: An Efficient Engine
for Late Interaction Retrieval.” In Proceedings of the 31st ACM International Conference
on Information & Knowledge Management, pp. 1747–1756.
Santhanam, K., Khattab, O., Saad-Falcon, J., Potts, C., and Zaharia, M. (2022b). “ColBERTv2:
Effective and Efficient Retrieval via Lightweight Late Interaction.” In Proceedings of the 2022
Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, pp. 3715–3734.
鈴木潤, 松田耕史, 鈴木正敏, 加藤拓真, 宮脇峻平, 西田京介 (2021). 「ライブコンペティション:AI王〜クイズAI日本一決定戦〜」. 自然言語処理, 28 (3), pp. 888–894. [J. Suzuki et
al. (2021). Live Competition: "AI King: Quiz AI Japan Championship". Journal of Natural
Language Processing, 28 (3), pp. 888–894.].
舘野祐一 (2024). 「日本語 Reranker 作成のテクニカルレポート」. https://secon.dev/entry/2024/04/02/080000-japanese-reranker-tech-report/. [Y. Tateno (2024). Technical Report on Japanese Rerankers. https://secon.dev/entry/2024/04/02/080000-japanese-reranker-tech-report/].
Tateno, Y. (2024a). "JaCWIR: Japanese Casual Web IR." HuggingFace Datasets. https://huggingface.co/datasets/hotchpotch/JaCWIR.
Tateno, Y. (2024b). "JQaRA: Japanese Question Answering with Retrieval Augmentation."
HuggingFace Datasets. https://huggingface.co/datasets/hotchpotch/JQaRA.
Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., and Gurevych, I. (2021). "BEIR: A Heteroge-
nous Benchmark for Zero-Shot Evaluation of Information Retrieval Models.” arXiv preprint
arXiv:2104.08663.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra,
S., Bhargava, P., Bhosale, S., et al. (2023). “Llama 2: Open Foundation and Fine-Tuned
Chat Models.” arXiv preprint arXiv:2307.09288.
Trotman, A., Puurula, A., and Burgess, B. (2014). “Improvements to BM25 and Language Mod-
els Examined.” In Proceedings of the 19th Australasian Document Computing Symposium,
pp. 58–65.
Tsukagoshi, H. and Sasano, R. (2024). “Ruri: Japanese General Text Embeddings.” arxiv
preprint cs.CL arXiv:2409.07737.
Tsukagoshi, H., Sasano, R., and Takeda, K. (2023). “Japanese SimCSE Technical Report.” arXiv
preprint arXiv:2310.19349.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and
Polosukhin, I. (2017). “Attention is All You Need.” Advances in Neural Information Pro-
cessing Systems,30.
Wang, B., Liu, Z., Huang, X., Jiao, F., Ding, Y., Aw, A., and Chen, N. (2024). “SeaEval for
Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning.”
In Proceedings of the 2024 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers),
pp. 370–390.
Wang, L., Yang, N., Huang, X., Jiao, B., Yang, L., Jiang, D., Majumder, R., and Wei, F.
(2022). “Text Embeddings by Weakly-Supervised Contrastive Pre-Training.” arXiv preprint
arXiv:2212.03533.
Wang, L., Yang, N., Huang, X., Yang, L., Majumder, R., and Wei, F. (2024). “Multilingual e5
Text Embeddings: A Technical Report.” arXiv preprint arXiv:2402.05672.
Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., and Zhou, M. (2020). “Minilm: Deep Self-
Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers.” Ad-
vances in Neural Information Processing Systems,33, pp. 5776–5788.
Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., and Recht, B. (2017). “The Marginal Value
of Adaptive Gradient Methods in Machine Learning.” Advances in Neural Information
Processing Systems,30.
Wortsman, M., Ilharco, G., Gadre, S. Y., Roelofs, R., Gontijo-Lopes, R., Morcos, A. S.,
Namkoong, H., Farhadi, A., Carmon, Y., Kornblith, S., and Schmidt, L. (2022). “Model
Soups: Averaging Weights of Multiple Fine-Tuned Models Improves Accuracy with-
out Increasing Inference Time.” In International Conference on Machine Learning,
pp. 23965–23998. PMLR.
Xiao, S., Liu, Z., Zhang, P., and Muennighof, N. (2023). “C-pack: Packaged Resources to Advance
General Chinese Embedding.” arXiv preprint arXiv:2309.07597.
Yamada, I., Asai, A., Shindo, H., Takeda, H., and Matsumoto, Y. (2020). “LUKE: Deep Contex-
tualized Entity Representations with Entity-aware Self-attention.” In Proceedings of the 2020
Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6442–6454.
Yang, E., Lawrie, D., Mayfield, J., Oard, D. W., and Miller, S. (2024). “Translate-Distill:
Learning Cross-Language Dense Retrieval by Translation and Distillation.” In European
Conference on Information Retrieval, pp. 50–65. Springer.
Yates, A., Nogueira, R., and Lin, J. (2021). “Pretrained Transformers for Text Ranking: BERT
and Beyond.” In Proceedings of the 14th ACM International Conference on Web Search and
Data Mining, pp. 1154–1156.
Yin, G. (1992). “Stochastic Approximation via Averaging: The Polyak’s Approach Revisited.” In
Simulation and Optimization: Proceedings of the International Workshop on Computation-
ally Intensive Methods in Simulation and Optimization Held at the International Institute for
Applied Systems Analysis (IIASA), Laxenburg, Austria, August 23–25, 1990, pp. 119–134.
Springer.
Zhai, X., Kolesnikov, A., Houlsby, N., and Beyer, L. (2022). “Scaling Vision Transformers.”
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pp. 12104–12113.
Zhang, X., Ma, X., Shi, P., and Lin, J. (2021). “Mr. TyDi: A Multi-lingual Benchmark for Dense
Retrieval.” In Proceedings of the 1st Workshop on Multilingual Representation Learning,
pp. 127–137.
Zhang, X., Thakur, N., Ogundepo, O., Kamalloo, E., Alfonso-Hermelo, D., Li, X., Liu, Q.,
Rezagholizadeh, M., and Lin, J. (2023). “MIRACL: A Multilingual Retrieval Dataset Cover-
ing 18 Diverse Languages.” Transactions of the Association for Computational Linguistics,
11, pp. 1114–1131.
Zhuang, S., Ma, X., Koopman, B., Lin, J., and Zuccon, G. (2024). “PromptReps: Prompt-
ing Large Language Models to Generate Dense and Sparse Representations for Zero-Shot
Document Retrieval.” arXiv preprint arXiv:2404.18424.
Appendix
A Full JaColBERTv2.5 Results
Baseline Our Models
JaColBERTv2 JaColBERTv2.5 JaColBERTv2.4 Final Checkpoint Post-Train (no mmarco) Post-Train (full)
MIRACL
NDCG@10 0.667 0.778 0.757 0.756 0.772 0.780
MRR@10 0.688 0.795 0.780 0.781 0.793 0.802
Recall@10 0.802 0.887 0.869 0.871 0.880 0.887
Recall@100 0.961 0.989 0.987 0.987 0.987 0.990
JQaRA
NDCG@10 0.585 0.618 0.601 0.601 0.613 0.608
MRR@10 0.836 0.856 0.846 0.843 0.849 0.846
JaCWIR
MAP@10 0.919 0.928 0.929 0.928 0.923 0.924
Hit Rate@10 0.982 0.979 0.980 0.979 0.979 0.978
JSQuAD
Recall@1 0.917 0.930 0.930 0.930 0.928 0.925
Recall@3 0.967 0.974 0.974 0.974 0.972 0.970
Recall@5 0.976 0.982 0.982 0.982 0.980 0.978
Recall@10 0.982 0.987 0.987 0.987 0.986 0.986
ESCI
NDCG@10 0.462 0.462 0.463 0.462 0.452 0.451
MRR@10 0.619 0.619 0.621 0.622 0.610 0.606
Recall@10 0.381 0.386 0.386 0.386 0.379 0.376
Table 10 Presentation of the full results of different variants of our newly introduced models across a
range of metrics. Best results are presented in bold. Results in italic indicate that the task
is in-domain.
Benjamin Clavi´e: Benjamin Clavi´e was born in 1995 and is an R&D Researcher
at Answer.AI. He received an MScR in Informatics from the University of
Edinburgh. He has been an R&D Engineer working in Natural Language
Processing and Information Retrieval for several years at various companies,
including the legal information provider LexisNexis. His research interests include
natural language processing, information retrieval (especially multi-vector
retrieval), and the development of lightweight models to further the performance
of Large Language Model-based applications.
(Received July 31, 2024)
(Revised November 6, 2024)
(Accepted December 4, 2024)