Article Information

Title: A Case Study of Efficacy and Challenges in Practical Human-in-Loop Evaluation of NLP Systems Using Checklist
Authors: Shaily Bhatt; Rahul Jain; Sandipan Dandapat et al.
Journal: Conference on European Chapter of the Association for Computational Linguistics (EACL)
Publication year: 2021
Volume: 2021
Pages: 120-130
Language: English
Publisher: ACL Anthology
Abstract: Despite state-of-the-art performance, NLP systems can be fragile in real-world situations. This is often due to insufficient understanding of the capabilities and limitations of models and the heavy reliance on standard evaluation benchmarks. Research into non-standard evaluation to mitigate this brittleness is gaining increasing attention. Notably, the behavioral testing principle ‘Checklist’, which decouples testing from implementation, revealed significant failures in state-of-the-art models for multiple tasks. In this paper, we present a case study of using Checklist in a practical scenario. We conduct experiments for evaluating an offensive content detection system and use a data augmentation technique for improving the model using insights from Checklist. We lay out the challenges and open questions based on our observations of using Checklist for human-in-loop evaluation and improvement of NLP systems. Disclaimer: The paper contains examples of content with offensive language. The examples do not represent the views of the authors or their employers towards any person(s), group(s), practice(s), or entity/entities.