Centroid-based language identification using letter feature set

Takci, HİDAYET; Sogukpinar, I

COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, vol.2945, pp.640-648, 2004 (SCI-Expanded)

Publication Type: Article / Full Article
Volume: 2945
Publication Date: 2004
Journal Name: COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus
Pages: pp.640-648
Affiliated with Sivas Cumhuriyet University: No

Abstract

In recent years, unexpectedly rapid growth in the volume of text documents has been observed on the internet, on intranets, in digital libraries, and in newsgroups. To obtain useful information and meaningful patterns from these documents, a great deal of research, known under the term "text mining", has been carried out. One such area is text categorization, which covers the problem of classifying documents according to their similarities. One of the techniques applied in this area is the centroid-based document classification method. All research on text categorization uses the notion of frequency in one way or another. In this study, letter frequencies (LF) are used for text categorization: centroid-based document classification is carried out using letter frequency information. An experiment was performed on language detection for text documents. Its results suggest that letter-based text categorization should be done prior to term-based text categorization.
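
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the alphabet, the similarity measure, or the training data, so the 26-letter ASCII alphabet, the cosine similarity, and the toy sample sentences below are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
import string
from collections import Counter

# Assumption: only the 26 ASCII letters are used as features.
ALPHABET = string.ascii_lowercase


def letter_frequencies(text):
    """Return the relative frequency of each letter in ALPHABET."""
    letters = [ch for ch in text.lower() if ch in ALPHABET]
    if not letters:
        return [0.0] * len(ALPHABET)
    counts = Counter(letters)
    return [counts[ch] / len(letters) for ch in ALPHABET]


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity (an assumed choice of similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def train(samples):
    """Build one centroid per language from {language: [texts]}."""
    return {
        lang: centroid([letter_frequencies(t) for t in texts])
        for lang, texts in samples.items()
    }


def identify(text, centroids):
    """Return the language whose centroid is most similar to the text."""
    vec = letter_frequencies(text)
    return max(centroids, key=lambda lang: cosine(vec, centroids[lang]))


# Hypothetical toy training data; a real experiment would need far more text.
samples = {
    "english": ["the quick brown fox jumps over the lazy dog"],
    "german": ["der schnelle braune fuchs springt ueber den faulen hund"],
}
centroids = train(samples)
print(identify("the dog jumps over the fox", centroids))
```

Because each document is reduced to a fixed-length vector of letter frequencies, both training and classification are cheap compared with term-based features, which is consistent with the abstract's suggestion that a letter-based step could precede term-based categorization.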