Journal: Journal of Theoretical and Applied Information Technology
Print ISSN: 1992-8645
Electronic ISSN: 1817-3195
Year: 2007
Volume: 3
Issue: 3
Pages: 9-14
Publisher: Journal of Theoretical and Applied
Abstract: N-grams are consecutive, overlapping N-character sequences formed from an input stream, and a number of systems use them as an alternative to word-based retrieval. In this paper we propose a model for categorizing Telugu documents. Telugu, written in a script derived from the ancient Brahmi script, is the official language of the state of Andhra Pradesh. Brahmi-based scripts are noted for complex conjunct formations. The canonical structure is described as ((C)C)CV. Under this structure, every character is built from a basic set of vowels and consonants, with a consonant-vowel (CV) core as the basic unit, optionally preceded by one or two consonants. The canonical structure yields a very large set of characters, each with a shape that reflects its phonetic nature. Words formed from this character set make up a large corpus, and word formation is governed by stringent grammar rules. One issue to be addressed is that certain word combinations fuse into a single word, in which the last character of the first word and the first character of the following word are combined. In view of these complexities, we propose a trigram-based system that provides a reasonable alternative to a word-based system for categorizing Telugu documents.
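The definition of N-grams given in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `char_ngrams` and `trigram_profile` are hypothetical, and the sketch assumes that one "character" is one Unicode code point. A Telugu system might instead count whole syllables or conjuncts as units, which the abstract does not specify.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return the consecutive, overlapping n-character sequences of text.

    Assumption: each Unicode code point counts as one character.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # A window of width n slides one position at a time, so adjacent
    # n-grams overlap in n - 1 characters. Text shorter than n yields [].
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def trigram_profile(text):
    """Count how often each trigram occurs in text (hypothetical helper)."""
    return Counter(char_ngrams(text, 3))
```

For example, `char_ngrams("TEXT", 3)` returns `['TEX', 'EXT']`. A frequency profile such as the one built by `trigram_profile` is one possible document representation that a categorizer could compare across categories; the abstract does not state which representation the paper uses.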