Journal name: Conference of the European Chapter of the Association for Computational Linguistics (EACL)
Publication year: 2006
Volume: 2006
Publisher: ACL Anthology
Abstract: In this paper, we present an automated, quantitative, knowledge-poor method to evaluate the randomness of a collection of documents (corpus) with respect to a number of biased partitions. The method is based on comparing the word frequency distribution of the target corpus to word frequency distributions from corpora built in deliberately biased ways. We apply the method to the task of building a corpus via queries to Google. Our results indicate that this approach can be used reliably to discriminate between biased and unbiased document collections and to choose the most appropriate query terms.
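
The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract names neither the distance measure nor the way distances are combined into a randomness score, so the Jensen-Shannon divergence, the add-one smoothing, the whitespace tokenizer, and all function names below are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def word_distribution(docs, vocab, alpha=1.0):
    """Smoothed unigram distribution over a fixed vocabulary.

    Tokenization (lowercase + whitespace split) and add-alpha smoothing
    are illustrative assumptions, not taken from the paper.
    """
    counts = Counter(w for doc in docs for w in doc.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions
    defined over the same vocabulary. Assumed measure, see lead-in."""
    m = {w: 0.5 * (p[w] + q[w]) for w in p}

    def kl(a, b):
        return sum(a[w] * math.log2(a[w] / b[w]) for w in a)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def distances_to_biased(target_docs, biased_corpora):
    """Distance from the target corpus to each deliberately biased corpus.

    biased_corpora maps a (hypothetical) label to a list of documents.
    """
    all_docs = list(target_docs)
    for docs in biased_corpora.values():
        all_docs.extend(docs)
    vocab = sorted({w for doc in all_docs for w in doc.lower().split()})

    p_target = word_distribution(target_docs, vocab)
    return {
        label: js_divergence(p_target, word_distribution(docs, vocab))
        for label, docs in biased_corpora.items()
    }


if __name__ == "__main__":
    target = ["the cat sat on the mat", "stocks fell sharply today"]
    biased = {
        "pets": ["the cat chased the dog", "the dog sat on the mat"],
        "finance": ["stocks fell today", "bond markets rallied sharply"],
    }
    for label, dist in distances_to_biased(target, biased).items():
        print(f"{label}: {dist:.4f}")
```

Each returned value lies between 0 and 1, with 0 meaning the two word frequency distributions are identical. How such distances should be aggregated to judge a corpus as biased or unbiased is left open here, since the abstract does not specify it.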