Jeremy Howard’s scientific contributions


Publications (1)


Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
  • Preprint
  • File available

December 2024 · 62 Reads · 2 Citations

Benjamin Warner · Antoine Chaffin · Benjamin Clavié · [...] · Iacopo Poli

Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have been limited Pareto improvements to BERT since its release. In this paper, we introduce ModernBERT, bringing modern model optimizations to encoder-only models and representing a major Pareto improvement over older encoders. Trained on 2 trillion tokens with a native 8192 sequence length, ModernBERT models exhibit state-of-the-art results on a large pool of evaluations encompassing diverse classification tasks and both single- and multi-vector retrieval on different domains (including code). In addition to strong downstream performance, ModernBERT is also the most speed- and memory-efficient encoder and is designed for inference on common GPUs.
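
Since the abstract positions ModernBERT as a drop-in modern encoder, a quick way to make this concrete is masked-token prediction with a released checkpoint. A minimal sketch, assuming the publicly released answerdotai/ModernBERT-base checkpoint on the Hugging Face Hub and a transformers release recent enough to include the ModernBERT architecture:

```python
# Minimal sketch: masked-token prediction with ModernBERT via Hugging Face
# transformers. Assumes the answerdotai/ModernBERT-base checkpoint and a
# transformers version that includes the ModernBERT architecture.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = f"ModernBERT supports a native 8192-token context for {tokenizer.mask_token} tasks."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Find the mask position and decode the highest-scoring token there.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```

The same checkpoint could also be loaded with AutoModelForSequenceClassification for the classification and retrieval fine-tuning settings the abstract describes; that pairing is standard transformers usage, not something stated in this listing.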


Citations (1)


... More recently, bias terms are not included, which improves training stability, throughput, and final performance. Additionally, improvements like SwiGLU activation functions and rotary positional embeddings are also commonly utilized [3,4,34,35]. GPT (Generative Pretrained Transformer) models, such as OpenAI's GPT series (GPT-3, GPT-4, etc.), are designed for generative tasks and use transformer decoders [36-38]. ...

Reference:

Citing article: Contrastive learning and mixture of experts enables precise vector embeddings in biological databases
Cited work: Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
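
The excerpt above names two of the architectural choices adopted by ModernBERT and similar recent models: bias-free linear layers and SwiGLU feed-forward blocks. A minimal PyTorch sketch of such a block follows; the class name, dimension names, and sizes are illustrative, not taken from the paper.

```python
# Minimal sketch of a bias-free SwiGLU feed-forward block, the recipe the
# excerpt describes. Names and sizes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # bias=False: omitting bias terms, as the excerpt notes, is common
        # in recent transformer variants.
        self.gate_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.up_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.down_proj = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: SiLU(x W_gate) multiplied elementwise by (x W_up),
        # then projected back down to the model dimension.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

# Example: a batch of 2 sequences, 16 tokens each, model width 768.
block = SwiGLUFeedForward(d_model=768, d_hidden=2048)
out = block(torch.randn(2, 16, 768))
print(out.shape)  # torch.Size([2, 16, 768])
```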