Article Information

  • Title: Authorship Attribution and Optical Character Recognition Errors
  • Authors: Patrick Juola; John I. Noecker Jr; Michael V. Ryan
  • Journal: Traitement Automatique des Langues
  • Print ISSN: 1248-9433
  • Electronic ISSN: 1965-0906
  • Year of publication: 2012
  • Volume: 53
  • Issue: 3
  • Publisher: ATALA - Assoc Traitement Automatique Langues
  • Abstract: Stylometric authorship attribution is a fundamental problem. The basic idea behind the research is that one can determine the authorship of a document on the basis of cognitive and linguistic quirks that uniquely identify a person. In many cases, however, noise in the original documents can make this analysis more difficult and less reliable. We investigate the errors introduced by a typical optical character recognition (OCR) process. Using simulated (random) errors in a standard benchmark corpus, we test to see how sensitive the authorship attribution process is to character mis-recognition. Our results indicate that, while accuracy decreases measurably with noise, the decrease is not substantial.
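
The abstract mentions simulating random OCR errors in a corpus. A minimal sketch of one way to do that follows. The noise model is an assumption for illustration, not the authors' procedure, which this record does not describe: each letter is independently replaced by a different random ASCII letter of the same case. The names inject_ocr_noise and char_error_rate are hypothetical.

```python
import random
import string


def inject_ocr_noise(text, error_rate, seed=None):
    """Return a copy of text with simulated character mis-recognition.

    Assumed noise model (not taken from the paper): every alphabetic
    character is replaced, with probability error_rate, by a different
    ASCII letter of the same case. Other characters are left untouched.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    rng = random.Random(seed)  # local generator so runs are reproducible
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < error_rate:
            alphabet = (string.ascii_uppercase if ch.isupper()
                        else string.ascii_lowercase)
            # Exclude the original character so a "hit" always changes it.
            out.append(rng.choice([c for c in alphabet if c != ch]))
        else:
            out.append(ch)
    return "".join(out)


def char_error_rate(original, noisy):
    """Fraction of positions at which two equal-length strings differ."""
    if len(original) != len(noisy):
        raise ValueError("strings must have the same length")
    if not original:
        return 0.0
    return sum(a != b for a, b in zip(original, noisy)) / len(original)
```

To probe sensitivity in the spirit of the abstract, one could apply inject_ocr_noise to each test document at several error rates and re-run an attribution method of choice on the noisy copies, comparing accuracy against the clean baseline.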