Letter based text scoring method for language identification


Takci H., Sogukpinar I.

ADVANCES IN INFORMATION SYSTEMS, PROCEEDINGS, vol.3261, pp.283-290, 2004 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 3261
  • Publication Date: 2004
  • Journal Name: ADVANCES IN INFORMATION SYSTEMS, PROCEEDINGS
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus
  • Pages: pp.283-290
  • Affiliated with Sivas Cumhuriyet University: No

Abstract

In recent years, the volume of text documents on the internet, intranets, digital libraries and newsgroups has grown unexpectedly fast. Obtaining useful information and meaningful patterns from these documents is an important issue. Identifying the language of such documents is an important problem that many researchers have studied. These studies have generally used words (terms) for language identification, and researchers have explored different approaches, both linguistic and statistical. In this work, a Letter Based Text Scoring Method is proposed for language identification. The method is based on the letter distributions of texts. Text scoring is performed to identify the language of each text document, and text scores are calculated from the letter distribution of the new text document. In addition to its acceptable accuracy, the proposed method is easier and faster than short-term and n-gram methods.
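
The abstract does not give the scoring formula, so the following is only a minimal sketch of one way letter-based scoring could work, not the authors' method. It assumes that each language is represented by a relative letter-frequency profile built from sample text, and that a document's score for a language is the average profile frequency of its letters. The function names `letter_profile`, `score_text` and `identify_language` are hypothetical.

```python
from collections import Counter


def _letters(text):
    """Return the alphabetic characters of text, lower-cased."""
    return [ch for ch in text.lower() if ch.isalpha()]


def letter_profile(text):
    """Relative frequency of each letter in a sample text (assumed profile form)."""
    counts = Counter(_letters(text))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}


def score_text(text, profile):
    """Assumed scoring rule: mean profile frequency over the letters of text."""
    letters = _letters(text)
    if not letters:
        return 0.0
    return sum(profile.get(ch, 0.0) for ch in letters) / len(letters)


def identify_language(text, profiles):
    """Pick the language whose profile gives the highest score for text."""
    return max(profiles, key=lambda lang: score_text(text, profiles[lang]))
```

Under these assumptions, one would call `letter_profile` once per language on a training corpus, collect the results in a dictionary keyed by language name, and pass that dictionary to `identify_language` together with the new document. Scoring needs only a single pass over the document's characters, which is consistent with the abstract's claim that a letter-based approach is simpler than term or n-gram methods.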