Centroid-based language identification using letter feature set

Takci H., Sogukpinar I.

COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, vol.2945, pp.640-648, 2004 (Journal Indexed in SCI)

  • Volume number: 2945
  • Publication date: 2004
  • Pages: pp.640-648


In recent years, unexpected growth in the volume of text documents has been observed on the internet, on intranets, in digital libraries and in newsgroups. To obtain useful information and meaningful patterns from these documents, a great deal of research, known under the term "text mining", has been carried out. Within this field, text categorization deserves mention: it covers the problem of classifying documents according to their similarities. One of the techniques applied in this area is the centroid-based document classification method. All research on text categorization uses the notion of frequency in one way or another. In this study, letter frequencies (LF) have been used for text categorization. Using letter frequency information, centroid-based document classification has been carried out. An experiment has been conducted on language detection for text documents. Its results suggest that letter-based text categorization should be done prior to term-based text categorization.
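
The general idea of centroid-based classification over letter frequencies can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the alphabet, the normalization, or the similarity measure, so the 26-letter basic Latin alphabet, relative frequencies, and cosine similarity used here are assumptions, and all function names are hypothetical.

```python
import math
import string
from collections import Counter

# Assumption: only the 26 basic Latin letters are used as features.
ALPHABET = string.ascii_lowercase


def letter_frequencies(text):
    """Return the relative frequency of each letter in ALPHABET."""
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(ALPHABET)
    return [counts[ch] / total for ch in ALPHABET]


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity (an assumed choice of similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def train(samples):
    """Build one centroid per language from {language: [texts]}."""
    return {
        lang: centroid([letter_frequencies(t) for t in texts])
        for lang, texts in samples.items()
    }


def identify(text, centroids):
    """Return the language whose centroid is most similar to the text."""
    vec = letter_frequencies(text)
    return max(centroids, key=lambda lang: cosine(vec, centroids[lang]))
```

In use, `train` would be called once with labeled sample texts per language, and `identify` would then compare the letter-frequency vector of a new document against each stored centroid. Because the feature vector has a small fixed length, both steps are cheap compared with term-based representations.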