文章基本信息

标题：Reliability Measurement without Limits
本地全文：下载
作者：Dennis Reidsma ; Jean Carletta
期刊名称：Computational Linguistics
印刷版ISSN：0891-2017
电子版ISSN：1530-9312
出版年度：2008
卷号：34
期号：3
页码：319-326
DOI：10.1162/coli.2008.34.3.319
语种：English
出版社：MIT Press
摘要：In computational linguistics, a reliability measurement of 0.8 on some statistic such as κ is widely thought to guarantee that hand-coded data is fit for purpose, with 0.67 to 0.8 tolerable, and lower values suspect. We demonstrate that the main use of such data, machine learning, can tolerate data with low reliability as long as any disagreement among human coders looks like random noise. When the disagreement introduces patterns, however, the machine learner can pick these up just like it picks up the real patterns in the data, making the performance figures look better than they really are. For the range of reliability measures that the field currently accepts, disagreement can appreciably inflate performance figures, and even a measure of 0.8 does not guarantee that what looks like good performance really is. Although this is a commonsense result, it has implications for how we work. At the very least, computational linguists should look for any patterns in the disagreement among coders and assess what impact they will have.