ModernBERT training settings. Dropout and below are shared across all phases.

Source publication
Preprint
Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have been limited Pareto improvements to BERT since its release. In this paper, we introduce ModernBERT, bringing...
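
As a concrete illustration of the retrieval use case the abstract mentions, the sketch below loads a ModernBERT checkpoint and mean-pools its final hidden states into sentence embeddings. It is a minimal sketch under assumptions: the public answerdotai/ModernBERT-base checkpoint name and a recent transformers release with ModernBERT support; it is not code from the paper.

```python
# Minimal sketch: sentence embeddings from an encoder-only model for retrieval.
# Assumes the public "answerdotai/ModernBERT-base" checkpoint and a recent
# transformers version that includes ModernBERT support.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "answerdotai/ModernBERT-base"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

sentences = [
    "ModernBERT is an encoder-only transformer.",
    "Encoder models are widely used for retrieval and classification.",
]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state        # (batch, seq_len, dim)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the two sentence embeddings.
sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity: {sim.item():.3f}")
```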

Context in source publication

Context 1
... training settings can be found in Table 3. During training we used MNLI as a live evaluation, along with validation loss and token accuracy metrics on 500 million randomly sampled sequences from the source datasets. ...
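
The context above names validation loss and token accuracy as live metrics; the sketch below shows one plausible way to compute them for a masked-language-model batch. It is an illustrative reconstruction, not the authors' evaluation code, and the model and dataloader are placeholders.

```python
# Illustrative sketch (not the authors' code): validation loss and
# masked-token accuracy for an MLM-style encoder on held-out sequences.
import torch

@torch.no_grad()
def mlm_validation_metrics(model, dataloader, device="cuda"):
    model.eval()
    total_loss, total_correct, total_masked, n_batches = 0.0, 0, 0, 0
    for batch in dataloader:                      # batches with MLM labels (-100 = not masked)
        batch = {k: v.to(device) for k, v in batch.items()}
        out = model(**batch)                      # e.g. an AutoModelForMaskedLM output
        total_loss += out.loss.item()
        n_batches += 1

        preds = out.logits.argmax(dim=-1)
        labels = batch["labels"]
        masked = labels != -100                   # only score the masked positions
        total_correct += (preds[masked] == labels[masked]).sum().item()
        total_masked += masked.sum().item()

    return {
        "val_loss": total_loss / max(n_batches, 1),
        "token_accuracy": total_correct / max(total_masked, 1),
    }
```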

Citations

... More recently, bias terms are often not included, which improves training stability, throughput, and final performance. Additionally, improvements like SwiGLU activation functions and rotary positional embeddings are also commonly utilized 3,4,34,35. GPT (Generative Pretrained Transformer) models, such as OpenAI's GPT series (GPT-3, GPT-4, etc.), are designed for generative tasks and use transformer decoders [36][37][38]. ...
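
To make the architectural tweaks in this citing excerpt concrete, here is a hedged PyTorch sketch of a bias-free SwiGLU feed-forward block of the kind described; any given model may use a different gate (e.g. GeGLU) or hidden sizes.

```python
# Illustrative sketch of a bias-free SwiGLU feed-forward block, as described in
# the citing text; individual models may differ in gating function and dimensions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # No bias terms, following the stability/throughput argument above.
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: SiLU(x W_gate) gated elementwise with x W_up, then projected down.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

ffn = SwiGLUFeedForward(d_model=768, d_hidden=2048)
print(ffn(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])
```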
... We chose the recent ModernBERT base model as the pretrained model for our experiments 35. This bidirectional model employs efficient implementations of masking and attention to speed up training and inference while reducing memory costs. ...
... We trained ModernBERT directly, both without modification and with our MoE extension. The ModernBERT models offer much higher NLP benchmark performance per parameter than the first-generation BERT models from the 2019 release and their subsequent fine-tuned variants 35. To benchmark against our training scheme, we compiled several popular BERT-like models, BERT models fine-tuned on scientific literature, sentence similarity models, and a recent SOTA GPT-like transformer (Table 2). ...
Article
The advancement of transformer neural networks has significantly enhanced the performance of sentence similarity models. However, these models often struggle with highly discriminative tasks and generate sub-optimal representations of complex documents such as peer-reviewed scientific literature. With the increased reliance on retrieval augmentation and search, representing structurally and thematically-varied research documents as concise and descriptive vectors is crucial. This study improves upon the vector embeddings of scientific text by assembling domain-specific datasets using co-citations as a similarity metric, focusing on biomedical domains. We introduce a novel Mixture of Experts (MoE) extension pipeline applied to pretrained BERT models, where every multi-layer perceptron section is copied into distinct experts. Our MoE variants are trained to classify whether two publications are cited together (co-cited) in a third paper based on their scientific abstracts across multiple biological domains. Notably, because of our unique routing scheme based on special tokens, the throughput of our extended MoE system is exactly the same as regular transformers. This holds promise for versatile and efficient One-Size-Fits-All transformer networks for encoding heterogeneous biomedical inputs. Our methodology marks advancements in representation learning and holds promise for enhancing vector database search and compilation.
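
A rough sketch of the extension idea this abstract describes: each expert starts as a copy of the pretrained MLP block, and a whole sequence is routed to a single expert chosen from a special token, so only one expert runs per sequence and throughput matches a dense model. The code below is an illustrative reconstruction under those stated assumptions, not the authors' released pipeline; the token ids and expert mapping are hypothetical.

```python
# Illustrative sketch (not the authors' code): an MoE layer whose experts are
# copies of a pretrained MLP, routed per sequence by a domain-specific special
# token so that exactly one expert runs per sequence (dense-model throughput).
import copy
import torch
import torch.nn as nn

class TokenRoutedMoE(nn.Module):
    def __init__(self, pretrained_mlp: nn.Module, num_experts: int, token_to_expert: dict):
        super().__init__()
        # Every expert is initialized as a copy of the pretrained MLP block.
        self.experts = nn.ModuleList(copy.deepcopy(pretrained_mlp) for _ in range(num_experts))
        self.token_to_expert = token_to_expert    # maps special-token id -> expert index

    def forward(self, hidden: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        # Route each sequence by the special token assumed to sit at position 0.
        out = torch.empty_like(hidden)
        for i in range(hidden.size(0)):
            expert_idx = self.token_to_expert.get(int(input_ids[i, 0]), 0)
            out[i] = self.experts[expert_idx](hidden[i])
        return out

# Hypothetical usage: two domain tokens (ids 50001 and 50002) routing to two experts.
mlp = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
moe = TokenRoutedMoE(mlp, num_experts=2, token_to_expert={50001: 0, 50002: 1})
hidden = torch.randn(4, 32, 768)
input_ids = torch.full((4, 32), 1)
input_ids[:, 0] = torch.tensor([50001, 50002, 50001, 50002])
print(moe(hidden, input_ids).shape)  # torch.Size([4, 32, 768])
```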
... In this context, the recent introduction of ModernBERT presents a further advancement that may challenge this established standard [16]. Within this landscape of specialized small-scale models, conventional BERT architectures have adequately served radiology applications, yet they exhibit inherent inefficiencies when processing Japanese medical text. ...
Preprint
Objective: This study aims to evaluate and compare the performance of two Japanese language models, conventional Bidirectional Encoder Representations from Transformers (BERT) and the newer ModernBERT, in classifying findings from chest CT reports, with a focus on tokenization efficiency, processing time, and classification performance. Methods: We conducted a retrospective study using the CT-RATE-JPN dataset containing 22,778 training reports and 150 test reports. Both models were fine-tuned for multi-label classification of 18 common chest CT conditions. The training data was split 18,222:4,556 into training and validation sets. Performance was evaluated using F1 scores for each condition and exact match accuracy across all 18 labels. Results: ModernBERT demonstrated superior tokenization efficiency, requiring 24.0% fewer tokens per document (258.1 vs. 339.6) compared to BERT Base. This translated to significant performance improvements, with ModernBERT completing training in 1877.67 seconds versus BERT's 3090.54 seconds (39% reduction). ModernBERT processed 38.82 samples per second during training (1.65x faster) and 139.90 samples per second during inference (1.66x faster). Despite these efficiency gains, classification performance remained comparable, with ModernBERT achieving superior F1 scores in 8 conditions, while BERT performed better in 4 conditions. Overall exact match accuracy was slightly higher for ModernBERT (74.67% vs. 72.67%), though this difference was not statistically significant (p=0.6291). Conclusion: ModernBERT offers substantial improvements in tokenization efficiency and training speed without sacrificing classification performance. These results suggest that ModernBERT is a promising candidate for clinical applications in Japanese radiology report analysis.
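
As a hedged sketch of the multi-label fine-tuning setup this abstract describes, the snippet below configures a ModernBERT checkpoint for 18-label classification with the transformers multi-label problem type (which applies a BCE-with-logits loss). The checkpoint name, example text, and label indices are placeholder assumptions; the study used Japanese models and its own training details.

```python
# Hedged sketch: multi-label fine-tuning setup along the lines described above.
# Checkpoint name and example inputs are assumptions, not the study's exact values.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_LABELS = 18  # number of chest CT conditions in the study
model_name = "answerdotai/ModernBERT-base"  # assumed; the study used Japanese-language models

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # uses BCEWithLogitsLoss internally
)

report = "Example radiology report text."       # placeholder input
labels = torch.zeros(1, NUM_LABELS)             # multi-hot target vector
labels[0, 3] = 1.0                              # e.g. condition index 3 is present

inputs = tokenizer(report, truncation=True, return_tensors="pt")
outputs = model(**inputs, labels=labels)
print(outputs.loss, torch.sigmoid(outputs.logits) > 0.5)  # loss and predicted label mask
```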