Developing Deep Learning Models for Turkish Automatic Punctuation Restoration Using a Novel Dataset


Görmez Y., Arslan H., Elyakan M.

ACM Transactions on Asian and Low-Resource Language Information Processing, vol.24, no.11, pp.1-15, 2025 (SCI-Expanded, Scopus)

Abstract

Automatic speech recognition systems are now widely used by individuals, institutions, and organizations. However, the text these systems produce lacks punctuation marks, which makes it harder to read and hinders advanced text analysis. Consequently, there is a growing need for automatic punctuation restoration models. A review of existing studies shows that most research focuses on English, while agglutinative languages such as Turkish remain relatively underexplored. In this study, a new dataset was created for Turkish automatic punctuation restoration. Models built from convolutional neural network, transformer encoder, and FNet encoder layers were trained and analyzed on this dataset, and their hyperparameters were tuned with Bayesian optimization. The transformer encoder-based model achieved the best performance, with an overall F-score of 90.10%. In addition, all models predicted periods and spaces more accurately than commas. This study contributes to the literature by focusing on Turkish and by providing both a new dataset and trained models for automatic punctuation restoration.
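
To make the task and the reported metric concrete, the following is a minimal sketch of how punctuation restoration can be framed as per-token classification and how a per-class F-score can be computed. It is not the authors' implementation. The label names (`PERIOD`, `COMMA`, `SPACE`), the helper functions, and the assumption that only the three classes named in the abstract are used are illustrative choices, and the sample sentence is invented.

```python
from collections import Counter

# Hypothetical label set: the abstract names only periods, commas, and spaces.
PUNCT_TO_LABEL = {".": "PERIOD", ",": "COMMA"}


def to_labeled_tokens(text):
    """Turn punctuated text into (token, label) pairs.

    The label describes what follows the token: a period, a comma,
    or a plain space (no punctuation).
    """
    pairs = []
    for word in text.split():
        label = PUNCT_TO_LABEL.get(word[-1], "SPACE")
        token = word[:-1] if label != "SPACE" else word
        if token:  # skip a punctuation mark standing alone
            pairs.append((token, label))
    return pairs


def per_class_f1(gold, pred):
    """Return {label: F1} for two equal-length label sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    scores = {}
    for label in set(gold) | set(pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Invented example sentence; the gold labels come from its punctuation.
pairs = to_labeled_tokens("Merhaba, bugün hava güzel.")
gold = [label for _, label in pairs]
# A hypothetical model output that misses the comma.
pred = ["SPACE", "SPACE", "SPACE", "PERIOD"]
scores = per_class_f1(gold, pred)
```

In this toy case the missed comma lowers the `COMMA` score to zero while `PERIOD` stays perfect, which mirrors the kind of per-class comparison the abstract reports. How the overall F-score in the paper is averaged across classes is not stated in the abstract, so no averaging is shown here.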