Basic Article Information

Title: Web Text Corpus for Natural Language Processing
Authors: Vinci Liu; James R. Curran
Journal name: Conference on European Chapter of the Association for Computational Linguistics (EACL)
Publication year: 2006
Volume: 2006
Publisher: ACL Anthology
Abstract: Web text has been successfully used as training data for many NLP applications. While most previous work accesses web text through search engine hit counts, we created a Web Corpus by downloading web pages to create a topic-diverse collection of 10 billion words of English. We show that for context-sensitive spelling correction the Web Corpus results are better than using a search engine. For thesaurus extraction, it achieved similar overall results to a corpus of newspaper text. With many more words available on the web, better results can be obtained by collecting much larger web corpora.