A Comparative Benchmark Study of Large Language Models on Turkish NLP Tasks: A Comparison of ChatGPT and DeepSeek


Ünsal E., Karakuş A. T.

International Journal of Innovative Engineering Applications, vol. 9, no. 2, pp. 184-192, 2025 (TRDizin)

Abstract

Large language models (LLMs) have demonstrated remarkable success in high-resource languages such as English, yet their effectiveness in low-resource, morphologically rich languages like Turkish has not been sufficiently investigated. Despite the increasing development of multilingual models, comprehensive, task-diverse benchmarking for Turkish Natural Language Processing (NLP) remains scarce. This study presents a comprehensive performance evaluation of two leading LLMs, ChatGPT (GPT-4o) and DeepSeek-v3, on Turkish NLP tasks, addressing the challenges posed by low-resource, morphologically complex languages. Five task-specific datasets (Turkish NLP Question-Answering, XQuAD, MLSUM, Turkish News Headlines, and Turkish-English Translation) were used to evaluate model performance on question answering, summarization, headline generation, and translation. Evaluation metrics included BLEU, ROUGE, METEOR, and BERTScore, capturing both syntactic accuracy and semantic relevance. ChatGPT consistently outperformed DeepSeek on most tasks: GPT scored ROUGE-1: 0.52, METEOR: 0.62, and BERTScore: 0.68, while DeepSeek scored 0.26, 0.30, and 0.52, respectively. On the MLSUM dataset, GPT scored BLEU: 0.04 and ROUGE-1: 0.62 against DeepSeek's 0.03 and 0.26. Both models performed equally well on the Turkish News Headlines dataset (ROUGE-1, ROUGE-L, METEOR: 1.0; BLEU: 0.83). For translation, GPT held a slight advantage (BLEU: 0.29 vs. 0.23; METEOR: 0.62 vs. 0.60). Although GPT's overall average metric score was 18% higher, DeepSeek occasionally performed better on BERTScore, which reflects surface-level semantic matching (e.g., XQuAD: 0.89 vs. 0.61). Error analysis revealed that semantically valid outputs were sometimes penalized by ROUGE-L because of differences in expression, such as "1156–1241" versus "He was born in 1156 and died in 1241". These findings highlight the need for Turkish-specific LLM development and improved evaluation metrics.
This study provides comprehensive comparison data and methodological insights to guide future improvements.
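The metric limitation noted in the error analysis can be illustrated with a minimal sketch: the function below is a simplified unigram-overlap ROUGE-1 F1 (not the paper's evaluation pipeline, and far less sophisticated than standard implementations), showing how two semantically equivalent answers can receive a near-zero n-gram score.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: F-measure over unigram overlap
    between a candidate and a reference string."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# The abstract's error-analysis example: the compact date-range answer
# shares no whitespace-delimited token with the full-sentence answer,
# so an n-gram metric scores it zero despite semantic equivalence.
print(rouge1_f1("1156–1241", "He was born in 1156 and died in 1241"))  # 0.0
```

Embedding-based metrics such as BERTScore compare contextual representations rather than surface tokens, which is consistent with the abstract's observation that BERTScore sometimes favors outputs that n-gram metrics penalize.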