About the Benchmark

⚠️ Disclaimer: This is ongoing academic research — the methodology and results are subject to change, and new models will be added over time.

This benchmark evaluates open and closed LLMs on Turkish ↔ English document-level machine translation. Each model is scored in both directions (en→tr and tr→en) over 1,000 source documents drawn from five diverse domains, using XCOMET, a learned neural metric that correlates strongly with human MQM judgments.

Datasets (200 documents each, 1,000 total per direction):

  • FLORES Wiki — encyclopedic Wikipedia content from the FLORES-200 evaluation set.
  • NTREX News — news articles from the NTREX-128 multilingual news translation benchmark.
  • AYM Legal — Turkish Constitutional Court decisions, formal legal register.
  • TWiST Tech — technical and software documentation.
  • XQuAD Educational — educational/QA passages spanning multiple subjects.

Methodology: Each model translates the English documents into Turkish and the Turkish documents into English, using the prompt below. Each translation is then scored with XCOMET against its reference.
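
For concreteness, here is a minimal sketch of the translation step, assuming a Hugging Face transformers chat pipeline and greedy decoding. The model choice, token budget, and decoding settings below are illustrative assumptions, not the exact harness behind the leaderboard:

    # Sketch of the translation step; model choice, token budget, and
    # decoding settings are assumptions, not the exact leaderboard harness.
    from transformers import pipeline

    translator = pipeline(
        "text-generation",
        model="google/translategemma-4b-it",  # any leaderboard entry works here
        device_map="auto",
    )

    def translate(prompt: str) -> str:
        # Chat-style input: the pipeline returns the full conversation,
        # so the model's reply is the last message.
        messages = [{"role": "user", "content": prompt}]
        out = translator(messages, max_new_tokens=2048, do_sample=False)
        return out[0]["generated_text"][-1]["content"].strip()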

Prompt: All models receive the same instruction-style prompt (taken from the TranslateGemma technical report):

You are a professional {source_lang} ({src_lang_code}) to {target_lang} ({tgt_lang_code}) translator. Your goal is to accurately convey the meaning and nuances of the original {source_lang} text while adhering to {target_lang} grammar, vocabulary, and cultural sensitivities. Produce only the {target_lang} translation, without any additional explanations or commentary. Please translate the following {source_lang} text into {target_lang}:


{text}
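
In code, the template is a plain format string. The language names and ISO codes below are our reading of the placeholders for the en→tr direction; the section above does not pin them down:

    # Filling the template for en→tr; the placeholder values are assumptions.
    PROMPT_TEMPLATE = (
        "You are a professional {source_lang} ({src_lang_code}) to {target_lang} "
        "({tgt_lang_code}) translator. Your goal is to accurately convey the "
        "meaning and nuances of the original {source_lang} text while adhering "
        "to {target_lang} grammar, vocabulary, and cultural sensitivities. "
        "Produce only the {target_lang} translation, without any additional "
        "explanations or commentary. Please translate the following "
        "{source_lang} text into {target_lang}:\n\n\n{text}"
    )

    document = "Machine translation quality has improved rapidly."  # sample input

    prompt = PROMPT_TEMPLATE.format(
        source_lang="English", src_lang_code="en",
        target_lang="Turkish", tgt_lang_code="tr",
        text=document,  # one source document per request
    )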

XCOMET (Guerreiro et al., 2023, arXiv:2310.10482) is a state-of-the-art reference-based MT evaluation metric.
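
The scores can be reproduced with the Unbabel COMET toolkit (pip install unbabel-comet). A minimal scoring sketch follows; the XCOMET-XL checkpoint is an assumption, since the section does not state which size the leaderboard uses, and the checkpoints are gated on the Hugging Face Hub:

    # Scoring sketch; checkpoint size is an assumption. Checkpoints are gated
    # on the Hugging Face Hub and require accepting the license.
    from comet import download_model, load_from_checkpoint

    ckpt_path = download_model("Unbabel/XCOMET-XL")
    model = load_from_checkpoint(ckpt_path)

    # Tiny stand-in data; in the benchmark these are full documents.
    sources = ["Merhaba dünya."]
    hypotheses = ["Hello world."]
    references = ["Hello, world."]

    data = [
        {"src": s, "mt": h, "ref": r}
        for s, h, r in zip(sources, hypotheses, references)
    ]
    out = model.predict(data, batch_size=8, gpus=1)  # gpus=0 for CPU
    print(out.system_score)  # in [0, 1]; the table reports scores scaled to 0-100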

Prepared by Atahan Uz and Hüseyin Emir Akdağ.

Leaderboard
Rank  Model                                Average  XCOMET (en→tr)  XCOMET (tr→en)
   1  google/translategemma-27b-it           78.39           77.65           79.12
   2  google/translategemma-12b-it (FP8)     77.09           76.40           77.77
   3  google/translategemma-12b-it           76.92           76.13           77.71
   4  google/translategemma-12b-it (FP4)     75.95           74.95           76.94
   5  google/gemma-3-27b-it                  75.91           74.23           77.58
   6  openai/gpt-oss-120b                    75.75           74.32           77.17
   7  ytu-ce-cosmos/Turkish-Gemma-9b-v0.1    74.99           72.61           77.37
   8  google/gemma-3-12b-it                  74.89           72.11           77.66
   9  google/translategemma-4b-it            72.08           68.48           75.67
  10  openai/gpt-oss-20b                     70.29           65.12           75.46
  11  ytu-ce-cosmos/Turkish-Gemma-9b-T1      68.12           62.59           73.64
  12  google/gemma-3-4b-it                   67.72           61.88           73.56

Scores are XCOMET on a 0–100 scale; Average is the arithmetic mean of the en→tr and tr→en scores.