Article information

  • Title: AUTHORSHIP ATTRIBUTION OF TELUGU TEXTS BASED ON SYNTACTIC FEATURES AND MACHINE LEARNING TECHNIQUES
  • Authors: N V GANAPATHI RAJU ; Dr V VIJAY KUMAR ; Dr O SRINIVASA RAO
  • Journal: Journal of Theoretical and Applied Information Technology
  • Print ISSN: 1992-8645
  • Electronic ISSN: 1817-3195
  • Publication year: 2016
  • Volume: 85
  • Issue: 1
  • Publisher: Journal of Theoretical and Applied
  • Abstract: Authorship attribution is the automatic identification of a document's author from linguistic features of the text. This paper applies it to "Telugu", one of the most widely spoken languages of India. It assumes that each author has a unique writing style that serves as that author's signature. Authorship attribution resembles text categorization, but it relies on stylistic properties, that is, on the form of linguistic expression rather than the content of a text. The study uses "shallow" features such as function-word frequencies and part-of-speech (POS) tags. The experiments use a corpus of Telugu editorial articles written by different journalists. Token-based and lexical features are not considered because all the documents belong to a similar genre, so these features are roughly constant across authors. The paper focuses on syntax-based (shallow) features of an author's style and, after a preprocessing step, evaluates the most frequently used POS-tag N-grams (unigrams, bigrams, and trigrams, with and without overlapping). It also performs attribution using Avyayas, which play a role in Telugu similar to that of stop words in English, and then combines the two feature sets (POS tags with Avyayas). Modern supervised machine learning algorithms are applied to the resulting large feature vectors to achieve high attribution accuracy. The authors report an average attribution rate above 85% across all classifiers and feature vectors.
  • Keywords: N-Gram; POS Tagging; Function Words; Shallow Features; Lexical; Stop Words
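
The two feature families named in the abstract, POS-tag N-grams (with and without overlapping) and function-word frequencies, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that "without overlapping" means sliding the window by N tags instead of one, and that the text has already been tokenized and POS-tagged. The function names, the tag labels, and the placeholder tokens are all hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List, Sequence, Set, Tuple


def pos_ngrams(tags: Sequence[str], n: int, overlapping: bool = True) -> List[Tuple[str, ...]]:
    """Return n-grams over a POS-tag sequence.

    Assumption: the overlapping variant slides the window one tag at a time,
    the non-overlapping variant slides it n tags at a time.
    """
    step = 1 if overlapping else n
    return [tuple(tags[i:i + n]) for i in range(0, len(tags) - n + 1, step)]


def relative_frequencies(items: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Map each distinct item to its share of all items."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: count / total for item, count in counts.items()} if total else {}


def function_word_profile(tokens: Sequence[str], function_words: Set[str]) -> Dict[str, float]:
    """Frequency of each listed function word, relative to document length."""
    counts = Counter(t for t in tokens if t in function_words)
    total = len(tokens)
    return {w: (counts[w] / total if total else 0.0) for w in sorted(function_words)}


# Hypothetical tag sequence for one sentence; the labels are placeholders.
tags = ["NN", "PSP", "NN", "VM", "SYM", "NN"]
overlapping_bigrams = pos_ngrams(tags, 2)                         # 5 bigrams
disjoint_bigrams = pos_ngrams(tags, 2, overlapping=False)         # 3 bigrams
bigram_features = relative_frequencies(overlapping_bigrams)

# Placeholder tokens stand in for Telugu words and an Avyaya list.
tokens = ["fw_a", "x", "fw_b", "fw_a"]
avyaya_features = function_word_profile(tokens, {"fw_a", "fw_b"})
```

Per-document dictionaries like `bigram_features` and `avyaya_features` could then be merged and passed to any supervised classifier; the abstract does not state which classifiers or feature weights were used, so that step is left out here.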