
Article Information

  • Title: Distributional Features for Text Categorization Based on Weight
  • Authors: CH.GOWTHAMI; P.RAJA SEKHAR
  • Journal: International Journal of Computer Science and Information Technologies
  • Electronic ISSN: 0975-9646
  • Year of publication: 2011
  • Volume: 2
  • Issue: 5
  • Pages: 2116-2120
  • Publisher: TechScience Publications
  • Abstract: Text categorization is the task of assigning predefined categories to natural language text. With the widely used "bag-of-words" representation, previous research usually assigns each word a value that expresses whether the word appears in the document concerned or how frequently it appears. Although these values are useful for text categorization, they do not fully express the abundant information contained in the document. This paper explores the effect of other types of values, which express the distribution of a word in the document. These novel values assigned to a word are called distributional features, and they include the compactness of the appearances of the word and the position of the first appearance of the word. The proposed distributional features are exploited by a tfidf-style equation, and different features are combined using ensemble learning techniques. Experiments show that the distributional features are useful for text categorization. Compared with using traditional term frequency values alone, including the distributional features requires only a little additional cost, while the categorization performance can be significantly improved. Further analysis shows that the distributional features are especially useful when documents are long and the writing style is casual.
  • Keywords: text categorization; text mining; machine learning; distributional feature; tfidf.
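
The abstract names two distributional features, the position of a word's first appearance and the compactness of its appearances, but does not give their formulas. The following is a minimal illustrative sketch under stated assumptions, not the authors' definitions: the function name `distributional_features`, the parameter `n_parts`, and the choice to measure compactness as the fraction of equal-sized document parts that contain the word are all hypothetical.

```python
def distributional_features(tokens, word, n_parts=10):
    """Return (first_position, spread) for `word`, or None if it is absent.

    Assumed definitions (illustrative only, not taken from the paper):
      first_position: index of the first occurrence divided by document
                      length (0.0 means the word opens the document).
      spread:         fraction of `n_parts` equal-sized parts of the
                      document that contain the word; a lower value means
                      the occurrences are more compact.
    """
    positions = [i for i, tok in enumerate(tokens) if tok == word]
    if not positions:
        return None

    n = len(tokens)
    first_position = positions[0] / n

    # Map each occurrence to the index of the part it falls into.
    parts = {min(p * n_parts // n, n_parts - 1) for p in positions}
    spread = len(parts) / n_parts
    return first_position, spread


if __name__ == "__main__":
    doc = "a b c a d e f g h a".split()
    print(distributional_features(doc, "a", n_parts=5))
    print(distributional_features(doc, "h", n_parts=5))
```

In this sketch a word that appears early and in many parts of the document receives a low `first_position` and a high `spread`, whereas a word mentioned once near the end receives the opposite. How such values would be weighted in a tfidf-style equation is left to the paper itself.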