Efficient TF-IDF method for alignment-free DNA sequence similarity analysis


DELİBAŞ E.

Journal of Molecular Graphics and Modelling, vol. 137, 2025 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 137
  • Publication Date: 2025
  • DOI: 10.1016/j.jmgm.2025.109011
  • Journal Name: Journal of Molecular Graphics and Modelling
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, BIOSIS, Chemical Abstracts Core, Chimica, Compendex, EMBASE, INSPEC, MEDLINE
  • Keywords: Alignment-free method, DNA sequence analysis, Genomic data, Phylogenetic analysis, TF-IDF
  • Sivas Cumhuriyet University Affiliated: Yes

Abstract

This study proposes a novel alignment-free approach to DNA sequence similarity analysis. DNA sequences are represented as n-grams, and the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm is adapted to genomic data to weight them. By identifying the most informative n-grams, the approach aims to improve accuracy while reducing computational cost. In doing so, it addresses limitations of both traditional alignment-based methods and existing alignment-free methods. The proposed method was tested on three different datasets and achieved high agreement with reference phylogenetic trees in the AFProject benchmark system. The results show that TF-IDF-based similarity matrices effectively capture phylogenetic relationships and significantly reduce processing time. The accuracy obtained indicates that the method offers a scalable, robust, and computationally inexpensive alternative for similarity analysis of large genomic datasets.
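
The general idea described in the abstract (n-gram representation, TF-IDF weighting, pairwise similarity matrix) can be illustrated with a minimal sketch. This is not the author's implementation: the n-gram length `k=3`, the smoothed IDF formula, the use of cosine similarity, and all function names are assumptions chosen for illustration, and the step that selects the most informative n-grams is not shown.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Split a DNA sequence into overlapping n-grams (k-mers) of length k."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf_vectors(seqs, k=3):
    """Return one sparse TF-IDF vector (dict: n-gram -> weight) per sequence.

    Assumption: TF is the relative frequency within a sequence and IDF is
    the smoothed form ln((1 + N) / (1 + df)) + 1. The paper may use a
    different weighting.
    """
    docs = [Counter(kmers(s, k)) for s in seqs]
    n_docs = len(docs)

    # Document frequency: in how many sequences each n-gram occurs.
    df = Counter()
    for doc in docs:
        df.update(doc.keys())

    idf = {g: math.log((1 + n_docs) / (1 + c)) + 1.0 for g, c in df.items()}

    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({g: (c / total) * idf[g] for g, c in doc.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(seqs, k=3):
    """Pairwise cosine similarity of the TF-IDF vectors of all sequences."""
    vecs = tfidf_vectors(seqs, k)
    return [[cosine(a, b) for b in vecs] for a in vecs]
```

Under these assumptions, calling `similarity_matrix(["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC"], k=3)` returns a 3×3 matrix in which the first two sequences score higher with each other than with the third, which shares no 3-grams with them. A distance matrix derived from such similarities (for example, one minus the similarity) is the kind of input that tree-building methods typically expect.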