ModernBERT-TR

A Modern Encoder Foundation Model for Turkish

Besher Alkurdi¹, Himmet Toprak Kesgin², Muzaffer Kaan Yuce², Mehmet Fatih Amasyali²

¹Faculty of Information Technology, University of Jyväskylä, Finland · ²Department of Computer Engineering, Yildiz Technical University, Turkey

150M Parameters · 144.4B Tokens

Overview

ModernBERT-TR is a 150M-parameter encoder pretrained from scratch on 144.4 billion Turkish tokens. The model uses the ModernBERT architecture: rotary position embeddings, gated linear units, alternating local-global attention, and sequence-packed training.
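As a quick usage sketch, the encoder can be loaded and run with the transformers library. The Hub repository id below is a placeholder, not a confirmed identifier:

```python
# Minimal usage sketch; "path/to/ModernBERT-TR" is a hypothetical repo id.
import torch
from transformers import AutoModel, AutoTokenizer

repo_id = "path/to/ModernBERT-TR"  # placeholder Hub identifier
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModel.from_pretrained(repo_id)

batch = tokenizer("Ankara Türkiye'nin başkentidir.", return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (1, seq_len, 768) token embeddings
print(hidden.shape)
```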

Under frozen-encoder linear probing, ModernBERT-TR achieves a 60.2% average across 11 Turkish NLP tasks, surpassing the next-best model by 13.1% relative. Under full fine-tuning on the 28-task TabiBench benchmark, it scores 77.28, matching TabiBERT within 0.30 points and leading in 5 of 8 categories despite training on 7x fewer tokens.
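A minimal sketch of the frozen-encoder linear-probing setup, assuming sentence embeddings are taken from the [CLS] position and the probe is scikit-learn's logistic regression; the repo id and toy data are placeholders, and the paper's exact pooling and probe settings may differ:

```python
# Frozen-encoder linear probing sketch; repo id and toy data are placeholders.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

repo_id = "path/to/ModernBERT-TR"  # hypothetical Hub identifier
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModel.from_pretrained(repo_id).eval()  # encoder weights stay frozen

@torch.no_grad()
def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return model(**batch).last_hidden_state[:, 0, :].numpy()  # [CLS] embedding

# Toy stand-ins for a real task's train/test splits.
train_texts, train_labels = ["harika bir film", "berbat bir deneyim"], [1, 0]
probe = LogisticRegression(max_iter=1000).fit(embed(train_texts), train_labels)
print(probe.predict(embed(["çok güzeldi"])))
```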

Benchmark Results

Frozen-encoder linear probing across 11 Turkish NLP tasks spanning classification, regression, and token classification.

[Figure: Benchmark bar chart]

Per-Task Breakdown

[Figure: Per-task heatmap]

Best scores per task are shown in bold. ModernBERT-TR leads on 5 of 11 tasks.

TabiBench Fine-Tuning

Full fine-tuning evaluation on the 28-task TabiBench benchmark with grid search over learning rate, weight decay, and batch size.

[Figure: TabiBench radar chart]
[Figure: TabiBench bar chart]

ModernBERT-TR (77.28) trails TabiBERT (77.58) by just 0.30 points overall and leads in 5 of 8 categories.
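A hedged sketch of that grid search with the transformers Trainer; the grid values, the task dataset handles, and the repo id below are illustrative stand-ins, not the actual TabiBench protocol:

```python
# Grid-search fine-tuning sketch; grid values and dataset handles are illustrative.
from itertools import product
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

repo_id = "path/to/ModernBERT-TR"  # hypothetical Hub identifier

def run_once(lr, wd, bs, train_ds, eval_ds):
    model = AutoModelForSequenceClassification.from_pretrained(repo_id, num_labels=2)
    args = TrainingArguments(
        output_dir=f"runs/lr{lr}-wd{wd}-bs{bs}",
        learning_rate=lr, weight_decay=wd,
        per_device_train_batch_size=bs, num_train_epochs=3,
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds, eval_dataset=eval_ds)
    trainer.train()
    return trainer.evaluate()  # keep the best config per task on dev metrics

# train_ds / eval_ds: tokenized splits of one TabiBench task (assumed prepared).
grid = product([1e-5, 3e-5, 5e-5], [0.0, 0.01], [16, 32])  # illustrative grid
results = {cfg: run_once(*cfg, train_ds, eval_ds) for cfg in grid}
```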

Training Data and Tokenizer

We combine FineWeb-2 Turkish (41.2B tokens) with the BertTurk Corpus upsampled 5x (31.0B tokens), yielding 72.2B tokens per epoch; training runs for 2 full epochs (144.4B tokens in total). A custom 50K WordPiece tokenizer trained on Turkish data achieves strong compression with efficient vocabulary usage.
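A minimal sketch of training a 50K WordPiece tokenizer with the Hugging Face tokenizers library; the corpus file path is a placeholder, and the normalizer choice is an assumption (the paper's preprocessing for Turkish casing is not specified here):

```python
# WordPiece tokenizer training sketch; the corpus path is a placeholder.
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFC()  # assumption: NFC, casing preserved
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.WordPieceTrainer(
    vocab_size=50_000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["turkish_corpus.txt"], trainer=trainer)  # placeholder path
tokenizer.save("modernbert-tr-wordpiece.json")
```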

[Figure: Tokenizer comparison]

Turkish compression ratio across tokenizer configurations.
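A small sketch of one way to compute such a compression metric, taken here as characters per token (higher means fewer tokens per text); the exact definition behind the figure may differ:

```python
# Characters-per-token compression; one assumed definition of the metric.
def compression_ratio(texts, tokenizer):
    chars = sum(len(t) for t in texts)
    tokens = sum(len(tokenizer.encode(t).ids) for t in texts)
    return chars / tokens

# Usage with a tokenizers.Tokenizer, e.g. the one trained above; comparing
# several tokenizers on the same held-out texts reproduces the plot's setup.
# print(compression_ratio(["Ankara Türkiye'nin başkentidir."], tokenizer))
```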

Methodology

ModernBERT-TR follows the ModernBERT-base architecture (22 layers, 768 hidden, 150M parameters) and was pretrained with systematic ablations to select the final configuration.
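A hedged reconstruction of the stated architecture with transformers' ModernBertConfig (available in transformers v4.48+); fields not given in the text keep the library's ModernBERT-base defaults:

```python
# Architecture sketch from the stated numbers; unstated fields use defaults.
from transformers import ModernBertConfig, ModernBertForMaskedLM

config = ModernBertConfig(
    vocab_size=50_000,             # matches the custom 50K WordPiece tokenizer
    hidden_size=768,               # stated hidden size
    num_hidden_layers=22,          # stated depth
    global_attn_every_n_layers=3,  # default: alternating local/global attention
)
model = ModernBertForMaskedLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```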

Acknowledgments

We thank Answer.AI for the ModernBERT codebase, Hugging Face for the FineWeb-2 dataset, and Stefan Schweter for the BertTurk Corpus. This work was supported by the Yildiz Technical University Scientific Research Projects Coordination Unit under project number FDK-2024-6070. This study was also supported by the Scientific and Technological Research Council of Turkey (TUBITAK), Grant No. 124E055.

Citation

@article{alkurdi2025modernberttr,
  title={ModernBERT-TR: A Modern Encoder Foundation Model for Turkish},
  author={Alkurdi, Besher and Kesgin, Himmet Toprak and Yuce, Muzaffer Kaan and Amasyali, Mehmet Fatih},
  year={2025}
}