Basic Article Information

  • Title: The textcat Package for n-Gram Based Text Categorization in R
  • Authors: Kurt Hornik; Patrick Mair; Johannes Rauch
  • Journal: Journal of Statistical Software
  • Print ISSN: 1548-7660
  • Electronic ISSN: 1548-7660
  • Publication year: 2013
  • Volume: 52
  • Issue: 1
  • Pages: 1-17
  • Language: English
  • Publisher: University of California, Los Angeles
  • Abstract: Identifying the language used will typically be the first step in most natural language processing tasks. Among the wide variety of language identification methods discussed in the literature, the ones employing the Cavnar and Trenkle (1994) approach to text categorization based on character n-gram frequencies have been particularly successful. This paper presents the R extension package textcat for n-gram based text categorization which implements both the Cavnar and Trenkle approach as well as a reduced n-gram approach designed to remove redundancies of the original approach. A multi-lingual corpus obtained from the Wikipedia pages available on a selection of topics is used to illustrate the functionality of the package and the performance of the provided language identification methods.