About the Benchmark

⚠️ Disclaimer: This is ongoing academic research — the methodology and results are subject to change, and new models will be added over time.

This benchmark evaluates open and closed LLMs on Turkish ↔ English document-level machine translation. Each model is scored in both directions (en→tr and tr→en) over 1,000 source documents drawn from five diverse domains, using XCOMET, a learned neural metric that correlates strongly with human MQM judgments.

Datasets (200 documents each, 1,000 total per direction):

  • FLORES Wiki — encyclopedic Wikipedia content from the FLORES-200 evaluation set.
  • NTREX News — news articles from the NTREX-128 multilingual news translation benchmark.
  • AYM Legal — Turkish Constitutional Court decisions, formal legal register.
  • TWiST Tech — technical and software documentation.
  • XQuAD Educational — educational/QA passages spanning multiple subjects.

Methodology: Each model translates the English documents into Turkish and the Turkish documents into English, using the prompt below. Each translation is then scored with XCOMET against its reference.
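
For concreteness, here is a minimal sketch of the translation step, assuming a Hugging Face transformers chat pipeline and greedy decoding. The model choice, token budget, and decoding settings below are illustrative assumptions, not the exact harness behind the leaderboard:

    # Sketch of the translation step; model choice, token budget, and
    # decoding settings are assumptions, not the exact leaderboard harness.
    from transformers import pipeline

    translator = pipeline(
        "text-generation",
        model="google/translategemma-4b-it",  # any leaderboard entry works here
        device_map="auto",
    )

    def translate(prompt: str) -> str:
        # Chat-style input: the pipeline returns the full conversation,
        # so the model's reply is the last message.
        messages = [{"role": "user", "content": prompt}]
        out = translator(messages, max_new_tokens=2048, do_sample=False)
        return out[0]["generated_text"][-1]["content"].strip()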

Prompt: All models receive the same instruction-style prompt (taken from the TranslateGemma technical report):

You are a professional {source_lang} ({src_lang_code}) to {target_lang} ({tgt_lang_code}) translator. Your goal is to accurately convey the meaning and nuances of the original {source_lang} text while adhering to {target_lang} grammar, vocabulary, and cultural sensitivities. Produce only the {target_lang} translation, without any additional explanations or commentary. Please translate the following {source_lang} text into {target_lang}:


{text}
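
In code, the template is a plain format string. The language names and ISO codes below are our reading of the placeholders for the en→tr direction; the section above does not pin them down:

    # Filling the template for en→tr; the placeholder values are assumptions.
    PROMPT_TEMPLATE = (
        "You are a professional {source_lang} ({src_lang_code}) to {target_lang} "
        "({tgt_lang_code}) translator. Your goal is to accurately convey the "
        "meaning and nuances of the original {source_lang} text while adhering "
        "to {target_lang} grammar, vocabulary, and cultural sensitivities. "
        "Produce only the {target_lang} translation, without any additional "
        "explanations or commentary. Please translate the following "
        "{source_lang} text into {target_lang}:\n\n\n{text}"
    )

    document = "Machine translation quality has improved rapidly."  # sample input

    prompt = PROMPT_TEMPLATE.format(
        source_lang="English", src_lang_code="en",
        target_lang="Turkish", tgt_lang_code="tr",
        text=document,  # one source document per request
    )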

XCOMET (Guerreiro et al., 2023, arXiv:2310.10482) is a state-of-the-art reference-based MT evaluation metric.
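
The scores can be reproduced with the Unbabel COMET toolkit (pip install unbabel-comet). A minimal scoring sketch follows; the XCOMET-XL checkpoint is an assumption, since the section does not state which size the leaderboard uses, and the checkpoints are gated on the Hugging Face Hub:

    # Scoring sketch; checkpoint size is an assumption. Checkpoints are gated
    # on the Hugging Face Hub and require accepting the license.
    from comet import download_model, load_from_checkpoint

    ckpt_path = download_model("Unbabel/XCOMET-XL")
    model = load_from_checkpoint(ckpt_path)

    # Tiny stand-in data; in the benchmark these are full documents.
    sources = ["Merhaba dünya."]
    hypotheses = ["Hello world."]
    references = ["Hello, world."]

    data = [
        {"src": s, "mt": h, "ref": r}
        for s, h, r in zip(sources, hypotheses, references)
    ]
    out = model.predict(data, batch_size=8, gpus=1)  # gpus=0 for CPU
    print(out.system_score)  # in [0, 1]; the table reports scores scaled to 0-100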

Prepared by Atahan Uz and Hüseyin Emir Akdağ.

Leaderboard
Rank  Model                                Average  XCOMET (en→tr)  XCOMET (tr→en)
   1  google/translategemma-27b-it           78.39           77.65           79.12
   2  google/translategemma-12b-it (FP8)     77.09           76.40           77.77
   3  google/translategemma-12b-it           76.92           76.13           77.71
   4  google/translategemma-12b-it (FP4)     75.95           74.95           76.94
   5  google/gemma-3-27b-it                  75.91           74.23           77.58
   6  openai/gpt-oss-120b                    75.75           74.32           77.17
   7  ytu-ce-cosmos/Turkish-Gemma-9b-v0.1    74.99           72.61           77.37
   8  google/gemma-3-12b-it                  74.89           72.11           77.66
   9  google/translategemma-4b-it            72.08           68.48           75.67
  10  openai/gpt-oss-20b                     70.29           65.12           75.46
  11  ytu-ce-cosmos/Turkish-Gemma-9b-T1      68.12           62.59           73.64
  12  google/gemma-3-4b-it                   67.72           61.88           73.56

Scores are XCOMET on a 0–100 scale; Average is the arithmetic mean of the en→tr and tr→en scores.