Centroid-based language identification using letter feature set


Takci H. , Sogukpinar I.

COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, vol.2945, pp.640-648, 2004 (Journal Indexed in SCI)

  • Publication Type: Article
  • Volume: 2945
  • Publication Date: 2004
  • Title of Journal: COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING
  • Page Numbers: pp.640-648

Abstract

In recent years, the volume of text documents on the internet, on intranets, in digital libraries, and in newsgroups has grown at an unexpected rate. To obtain useful information and meaningful patterns from these documents, a great deal of research, known under the term "text mining", has been carried out. Among these efforts is text categorization, which addresses the problem of classifying documents according to their similarities. One of the techniques applied in this area is the centroid-based document classification method. All research on text categorization uses the notion of frequency in one way or another. In this study, letter frequencies (LF) are used for text categorization: centroid-based document classification is carried out using letter frequency information. An experiment was conducted on language detection for text documents. The results suggest that letter-based text categorization should be performed prior to term-based text categorization.
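The approach described in the abstract can be illustrated with a minimal sketch. The abstract does not state the alphabet, the similarity measure, or the training data, so everything below is an assumption made for illustration: the feature set is restricted to the 26 unaccented Latin letters, cosine similarity (a common choice in centroid-based classification) is used to compare a document with each language centroid, and the function names `letter_frequencies`, `centroid`, `cosine`, `train`, and `identify` are hypothetical rather than taken from the paper.

```python
import math
import string
from collections import Counter

# Assumption: the feature set is the 26 unaccented Latin letters.
# The paper's actual letter feature set is not given in the abstract.
ALPHABET = string.ascii_lowercase


def letter_frequencies(text):
    """Return the relative frequency of each letter in ALPHABET."""
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(ALPHABET)
    return [counts[ch] / total for ch in ALPHABET]


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(labeled_docs):
    """Build one centroid per language from (language, text) pairs."""
    by_language = {}
    for language, text in labeled_docs:
        by_language.setdefault(language, []).append(letter_frequencies(text))
    return {lang: centroid(vecs) for lang, vecs in by_language.items()}


def identify(text, centroids):
    """Return the language whose centroid is most similar to the text."""
    vector = letter_frequencies(text)
    return max(centroids, key=lambda lang: cosine(vector, centroids[lang]))
```

In use, `train` would be called once on a labeled corpus, for example `centroids = train([("en", english_text), ("de", german_text)])`, and `identify(new_text, centroids)` would then return the label of the nearest centroid. Because each document is reduced to a fixed-length vector of letter frequencies, the feature space stays very small compared with a term-based representation, which is consistent with the abstract's suggestion that letter-based categorization can precede term-based categorization.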