⚠️ Disclaimer: This is ongoing academic research — the methodology and results are subject to change, and new models will be added over time.
This benchmark evaluates open and closed LLMs on Turkish ↔ English document-level machine translation. Each model is scored in both directions (en→tr and tr→en) over 1,000 source documents drawn from five diverse domains, using XCOMET, a learned neural metric that correlates strongly with human MQM judgments.
Datasets (200 documents each, 1,000 total per direction):
- FLORES Wiki — encyclopedic Wikipedia content from the FLORES-200 evaluation set.
- NTREX News — news articles from the NTREX-128 multilingual news translation benchmark.
- AYM Legal — Turkish Constitutional Court decisions, formal legal register.
- TWiST Tech — technical and software documentation.
- XQuAD Educational — educational/QA passages spanning multiple subjects.
Methodology: Each model translates the English documents into Turkish and the Turkish documents into English. Each output is then scored with XCOMET against the corresponding reference translation.
Prompt: All models receive the same instruction-style prompt, taken from the TranslateGemma technical report:

```
You are a professional {source_lang} ({src_lang_code}) to {target_lang} ({tgt_lang_code}) translator. Your goal is to accurately convey the meaning and nuances of the original {source_lang} text while adhering to {target_lang} grammar, vocabulary, and cultural sensitivities. Produce only the {target_lang} translation, without any additional explanations or commentary. Please translate the following {source_lang} text into {target_lang}:

{text}
```
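A minimal sketch of how this template might be instantiated for the two directions (the `build_prompt` helper and the exact language-name/code strings are our illustration, not part of the benchmark code):

```python
PROMPT = (
    "You are a professional {source_lang} ({src_lang_code}) to {target_lang} "
    "({tgt_lang_code}) translator. Your goal is to accurately convey the meaning "
    "and nuances of the original {source_lang} text while adhering to "
    "{target_lang} grammar, vocabulary, and cultural sensitivities. Produce only "
    "the {target_lang} translation, without any additional explanations or "
    "commentary. Please translate the following {source_lang} text into "
    "{target_lang}:\n\n{text}"
)

# Language names and ISO codes for the two evaluated directions.
DIRECTIONS = {
    "en-tr": ("English", "en", "Turkish", "tr"),
    "tr-en": ("Turkish", "tr", "English", "en"),
}

def build_prompt(direction: str, text: str) -> str:
    """Fill the shared template for one direction ("en-tr" or "tr-en")."""
    src, src_code, tgt, tgt_code = DIRECTIONS[direction]
    return PROMPT.format(
        source_lang=src, src_lang_code=src_code,
        target_lang=tgt, tgt_lang_code=tgt_code, text=text,
    )
```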
XCOMET (Guerreiro et al., 2023, arXiv:2310.10482) is a state-of-the-art reference-based MT evaluation metric.
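Scoring follows the standard `unbabel-comet` workflow; a minimal sketch, assuming the publicly released XCOMET-XL checkpoint (the checkpoint size is an assumption, not stated above). XCOMET emits segment scores in [0, 1], so the table values below are on the 0–100 scale:

```python
# pip install unbabel-comet
from comet import download_model, load_from_checkpoint

# Assumption: XCOMET-XL; the benchmark may use a different checkpoint size.
model_path = download_model("Unbabel/XCOMET-XL")
model = load_from_checkpoint(model_path)

# One entry per document: source, model translation (mt), and reference.
data = [
    {"src": "Merhaba dünya.", "mt": "Hello world.", "ref": "Hello, world."},
]

output = model.predict(data, batch_size=8, gpus=1)
print(output.scores)              # per-document scores in [0, 1]
print(output.system_score * 100)  # corpus average on the table's 0-100 scale
```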
Prepared by Atahan Uz and Hüseyin Emir Akdağ.
| Rank | Model | Average | XCOMET (en→tr) | XCOMET (tr→en) |
|---|---|---|---|---|
| 1 | google/translategemma-27b-it | 78.39 | 77.65 | 79.12 |
| 2 | google/translategemma-12b-it (FP8) | 77.09 | 76.40 | 77.77 |
| 3 | google/translategemma-12b-it | 76.92 | 76.13 | 77.71 |
| 4 | google/translategemma-12b-it (FP4) | 75.95 | 74.95 | 76.94 |
| 5 | google/gemma-3-27b-it | 75.91 | 74.23 | 77.58 |
| 6 | openai/gpt-oss-120b | 75.75 | 74.32 | 77.17 |
| 7 | ytu-ce-cosmos/Turkish-Gemma-9b-v0.1 | 74.99 | 72.61 | 77.37 |
| 8 | google/gemma-3-12b-it | 74.89 | 72.11 | 77.66 |
| 9 | google/translategemma-4b-it | 72.08 | 68.48 | 75.67 |
| 10 | openai/gpt-oss-20b | 70.29 | 65.12 | 75.46 |
| 11 | ytu-ce-cosmos/Turkish-Gemma-9b-T1 | 68.12 | 62.59 | 73.64 |
| 12 | google/gemma-3-4b-it | 67.72 | 61.88 | 73.56 |
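The Average column is the unweighted mean of the two directional scores, for example:

```python
# google/translategemma-12b-it (rank 3)
en_tr, tr_en = 76.13, 77.71
print(f"{(en_tr + tr_en) / 2:.2f}")  # 76.92, matching the Average column
```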