Journal: Conference on European Chapter of the Association for Computational Linguistics (EACL)
Publication year: 2006
Volume: 2006
Publisher: ACL Anthology
Abstract: Web text has been successfully used as
training data for many NLP applications.
While most previous work accesses web
text through search engine hit counts, we
created a Web Corpus by downloading
web pages to create a topic-diverse collection
of 10 billion words of English. We
show that for context-sensitive spelling
correction the Web Corpus results are better
than using a search engine. For thesaurus
extraction, it achieved similar overall
results to a corpus of newspaper text.
With many more words available on the
web, better results can be obtained by collecting
much larger web corpora.