Basic article information

  • Title: A MODEL FOR OVERLAPPING TRIGRAM TECHNIQUE FOR TELUGU SCRIPT
  • Authors: B. Vishnu Vardhan; L. Pratap Reddy; A. VinayBabu
  • Journal: Journal of Theoretical and Applied Information Technology
  • Print ISSN: 1992-8645
  • Electronic ISSN: 1817-3195
  • Year of publication: 2007
  • Volume: 3
  • Issue: 3
  • Pages: 9-14
  • Publisher: Journal of Theoretical and Applied
  • Abstract: N-grams are consecutive, overlapping N-character sequences formed from an input stream. They are used as an alternative to word-based retrieval in a number of systems. In this paper we propose a model for categorizing Telugu documents. Telugu is an official language whose script derives from the ancient Brahmi script; it is the official language of the state of Andhra Pradesh. Brahmi-based scripts are noted for complex conjunct formations. The canonical structure is described as ((C) C) CV: any character is built from a set of basic syllables, the vowels and consonants, where the consonant-vowel (CV) core is the basic unit, optionally preceded by one or two consonants. From this canonical structure a huge set of characters is derived, each with a shape corresponding to its phonetic value. Words formed from this character set have grown into a large corpus, and word formation in this corpus follows stringent grammar rules. One issue to be addressed is that certain word combinations form a single word, in which the last character of the first word and the first character of the following word are combined. In view of these complexities, we propose a trigram-based system that provides a reasonable alternative to a word-based system for categorizing documents in Telugu.
  • Keywords: canonical structure; text categorization; trigram; bigram; conjuncts
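
The abstract's definition of N-grams as consecutive, overlapping N-character sequences can be illustrated with a minimal sketch. It is not the authors' model: the function names `char_ngrams` and `trigram_profile` are hypothetical, and a "character" here is assumed to be a single Unicode code point, which may be finer-grained than the ((C) C) CV syllable units the abstract describes.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Return the consecutive, overlapping n-character sequences of `text`.

    Assumption: one "character" is one Unicode code point, which is how
    Python indexes a str. A Telugu syllable usually spans several code points.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # Slide a window of width n one position at a time; inputs shorter
    # than n produce an empty list.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def trigram_profile(text: str) -> Counter:
    """Count how often each overlapping trigram occurs in `text`."""
    return Counter(char_ngrams(text, 3))
```

For example, `char_ngrams("telugu")` yields the four trigrams `tel`, `elu`, `lug`, `ugu`. A frequency profile of this kind is one possible input to a document categorizer; how the paper builds and compares its profiles is not stated in the abstract.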