Article information

Title: Authorship Attribution and Optical Character Recognition Errors
Authors: Patrick Juola; John I. Noecker Jr; Michael V. Ryan et al.
Journal: Traitement Automatique des Langues
Print ISSN: 1248-9433
Online ISSN: 1965-0906
Year of publication: 2012
Volume: 53
Issue: 3
Publisher: ATALA - Assoc Traitement Automatique Langues
Abstract: Stylometric authorship attribution is a fundamental problem. The basic idea behind the research is that one can determine the authorship of a document on the basis of cognitive and linguistic quirks that uniquely identify a person. In many cases, however, noise in the original documents can make this analysis more difficult and less reliable. We investigate the errors introduced by a typical optical character recognition (OCR) process. Using simulated (random) errors in a standard benchmark corpus, we test to see how sensitive the authorship attribution process is to character mis-recognition. Our results indicate that, while accuracy decreases measurably with noise, the decrease is not substantial.
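
The abstract mentions simulated (random) character errors but does not specify the error model. As a rough illustration only, the sketch below assumes the simplest possible one: each alphabetic character is independently replaced by a different random lowercase letter with a fixed probability. The function name `add_ocr_noise` and its parameters `error_rate`, `seed`, and `alphabet` are hypothetical and are not taken from the paper.

```python
import random
import string


def add_ocr_noise(text, error_rate, seed=None, alphabet=string.ascii_lowercase):
    """Return a copy of `text` with simulated character mis-recognitions.

    Assumption (not from the paper): each alphabetic character is replaced,
    independently and with probability `error_rate`, by a different letter
    drawn uniformly from `alphabet`. Non-alphabetic characters are kept.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    rng = random.Random(seed)  # a local generator keeps runs reproducible
    noisy = []
    for ch in text:
        if ch.isalpha() and rng.random() < error_rate:
            # Exclude the original letter so every substitution is a real error.
            candidates = [c for c in alphabet if c != ch.lower()]
            noisy.append(rng.choice(candidates))
        else:
            noisy.append(ch)
    return "".join(noisy)


if __name__ == "__main__":
    sample = "The quick brown fox jumps over the lazy dog."
    for rate in (0.0, 0.05, 0.2):
        print(rate, add_ocr_noise(sample, rate, seed=1))
```

A sensitivity test in the spirit of the abstract would apply such a function to the test documents at several error rates and compare attribution accuracy at each rate against the noise-free baseline. Real OCR errors are not uniform (visually similar characters are confused more often), so a confusion-weighted substitution table would be a closer model.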