Basic article information

  • Title: Unsupervised Spam Detection by Document Probability Estimation with Maximal Overlap Method
  • Authors: Takashi Uemura; Daisuke Ikeda; Takuya Kida
  • Journal: Information and Media Technologies
  • Electronic ISSN: 1881-0896
  • Year of publication: 2011
  • Volume: 6
  • Issue: 1
  • Pages: 231-240
  • DOI: 10.11185/imt.6.231
  • Publisher: Information and Media Technologies Editorial Board
  • Abstract: In this paper, we study content-based spam detection for spam generated by copying a seed document with some random perturbations. We propose an unsupervised detection algorithm based on an entropy-like measure called document complexity, which reflects how many similar documents exist in the input collection. Because document complexity is an ideal measure, like Kolmogorov complexity, we substitute an estimated occurrence probability of each document for its complexity. We also present an efficient algorithm that estimates the probabilities of all documents in the collection in time linear in the collection's total length. Experimental results show that our algorithm works especially well for word salad spam, which is believed to be difficult to detect automatically.
  • Keywords: unsupervised spam detection; document complexity; suffix tree; maximal overlap method; word salad
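
The idea in the abstract, scoring each document by an estimated occurrence probability so that documents with many near-copies in the collection stand out, can be illustrated with a small sketch. This is not the authors' implementation. It makes several assumptions that the record above does not confirm: a naive substring counter stands in for the suffix tree (so it does not run in linear time), each document is scored against the other documents only, add-one smoothing handles unseen substrings, and the probability is chained over maximal overlapping substrings in the general style of maximal overlap estimation. All function names and the `max_len` parameter are hypothetical.

```python
import math
from collections import Counter


def substring_counts(docs, max_len):
    """Count every substring of length <= max_len (naive stand-in for a suffix tree)."""
    counts = Counter()
    for doc in docs:
        for i in range(len(doc)):
            for j in range(i + 1, min(len(doc), i + max_len) + 1):
                counts[doc[i:j]] += 1
    return counts


def maximal_pieces(doc, counts):
    """Return (start, end) spans of the longest known substrings, left to right."""
    pieces, best_end = [], 0
    for i in range(len(doc)):
        j = i
        while j < len(doc) and doc[i:j + 1] in counts:
            j += 1
        j = max(j, i + 1)  # an unseen character still forms a one-character piece
        if j > best_end:   # skip spans already covered by an earlier piece
            pieces.append((i, j))
            best_end = j
    return pieces


def log_probability(doc, reference_docs, max_len=8):
    """Estimate log P(doc) by chaining maximal pieces conditioned on their overlaps."""
    counts = substring_counts(reference_docs, max_len)
    total = sum(len(d) for d in reference_docs)

    def log_p(w):
        if not w:
            return 0.0  # the empty overlap has probability 1
        # Assumption: add-one smoothing so unseen substrings get a nonzero probability.
        return math.log((counts.get(w, 0) + 1) / (total + 1))

    logp, prev_end = 0.0, 0
    for start, end in maximal_pieces(doc, counts):
        overlap = doc[start:prev_end]  # empty string when the pieces do not overlap
        logp += log_p(doc[start:end]) - log_p(overlap)
        prev_end = end
    return logp


def score_collection(docs, max_len=8):
    """Per-character log probability of each document given all the others."""
    scores = []
    for k, doc in enumerate(docs):
        others = docs[:k] + docs[k + 1:]
        scores.append(log_probability(doc, others, max_len) / max(len(doc), 1))
    return scores


if __name__ == "__main__":
    docs = [
        "buy cheap pills now",
        "buy cheap pills today",
        "the weather was mild on sunday",
    ]
    for doc, score in zip(docs, score_collection(docs)):
        print(f"{score:8.3f}  {doc}")
```

Under these assumptions, a higher (less negative) score means the document is well explained by the rest of the collection, which is the signature of text copied from a shared seed. A threshold on this score would be one way to flag candidates, though the record does not say how the paper sets its decision rule.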