
Article Information

  • Title: Punjabi Text Clustering by Sentence Structure Analysis
  • Authors: Saurabh Sharma; Vishal Gupta
  • Journal: Computer Science & Information Technology
  • Electronic ISSN: 2231-5403
  • Publication year: 2012
  • Volume: 2
  • Issue: 4
  • Pages: 237-244
  • DOI: 10.5121/csit.2012.2420
  • Publisher: Academy & Industry Research Collaboration Center (AIRCC)
  • Abstract: Punjabi text document clustering groups documents that share a topic by analyzing their sentence structure. The prevalent algorithms in this field use the vector space model, which treats each document as a bag of words. Meaning in natural language inherently depends on word sequences, which such models ignore during clustering. This paper presents a new Punjabi text clustering algorithm, Clustering by Sentence Structure Analysis (CSSA), which has been applied to 221 Punjabi news articles available on news sites. Phrases are extracted by analyzing the structure of each sentence according to the basic grammatical rules of Karaka. Sequences formed from these phrases are used to identify the topic and to find similarities among all documents, which results in the formation of meaningful clusters.
  • Keywords: Punjabi language; Text clustering; Sentence structure analysis; Karaka theory.
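
The abstract's point that a bag-of-words representation overlooks word sequences can be illustrated with a minimal sketch. This is not the CSSA algorithm: the record does not give its details, so word bigrams are used here as a simple stand-in for the Karaka-based phrase sequences, and the English sentences and the helper names (`cosine`, `bag_of_words`, `bigrams`) are assumptions made only for illustration.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def bag_of_words(tokens: list[str]) -> Counter:
    """Vector space model features: word counts, order discarded."""
    return Counter(tokens)


def bigrams(tokens: list[str]) -> Counter:
    """Order-aware features: counts of adjacent word pairs.

    Hypothetical stand-in for phrase sequences; the paper's Karaka-based
    phrase extraction is not reproduced here.
    """
    return Counter(zip(tokens, tokens[1:]))


# Two toy sentences with the same words in a different order.
doc_a = "dog bites man".split()
doc_b = "man bites dog".split()

bow_sim = cosine(bag_of_words(doc_a), bag_of_words(doc_b))
seq_sim = cosine(bigrams(doc_a), bigrams(doc_b))

print(f"bag-of-words similarity: {bow_sim:.2f}")
print(f"bigram similarity:       {seq_sim:.2f}")
```

Under these assumptions the two sentences are indistinguishable to the bag-of-words model, while the sequence-based features share nothing, which is the kind of difference a sequence-aware clustering method is meant to exploit.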