
Article Information

  • Title: Orthographic Errors in Web Pages: Toward Cleaner Web Corpora
  • Authors: Christoph Ringlstetter; Klaus U. Schulz; Stoyan Mihov
  • Journal: Computational Linguistics
  • Print ISSN: 0891-2017
  • Electronic ISSN: 1530-9312
  • Publication year: 2006
  • Volume: 32
  • Issue: 3
  • Pages: 295-340
  • DOI: 10.1162/coli.2006.32.3.295
  • Language: English
  • Publisher: MIT Press
  • Abstract: Since the Web by far represents the largest public repository of natural language texts, recent experiments, methods, and tools in the area of corpus linguistics often use the Web as a corpus. For applications where high accuracy is crucial, the problem has to be faced that a non-negligible number of orthographic and grammatical errors occur in Web documents. In this article we investigate the distribution of orthographic errors of various types in Web pages. As a by-product, methods are developed for efficiently detecting erroneous pages and for marking orthographic errors in acceptable Web documents, reducing thus the number of errors in corpora and linguistic knowledge bases automatically retrieved from the Web.