Venue: Conference of the European Chapter of the Association for Computational Linguistics (EACL)
Year: 2012
Volume: 2012
Publisher: ACL Anthology
Abstract: Classically, training relation extractors relies on high-quality, manually annotated training data, which can be expensive to obtain. To mitigate this cost, NLU researchers have considered two newly available sources of less expensive (but potentially lower-quality) labeled data: distant supervision and crowdsourcing. There is, however, no study comparing the relative impact of these two sources on the precision and recall of post-learning answers. To fill this gap, we empirically study how state-of-the-art techniques are affected by scaling these two sources. We use corpus sizes of up to 100 million documents and tens of thousands of crowdsourced labeled examples. Our experiments show that increasing the corpus size for distant supervision has a statistically significant, positive impact on quality (F1 score). In contrast, human feedback has a positive and statistically significant, but lower, impact on precision and recall.
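For reference, the F1 score used here as the quality measure is the standard harmonic mean of precision (P) and recall (R); this definition is standard usage and is not spelled out in the abstract itself:

\[
F_1 = \frac{2 \, P \, R}{P + R}, \qquad
P = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}, \qquad
R = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}
\]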