Article Information

  • Title: Misalignment Detection for Web-Scraped Corpora: A Supervised Regression Approach
  • Authors: Arne Defauw; Sara Szoc; Anna Bardadym
  • Journal: Informatics
  • Electronic ISSN: 2227-9709
  • Publication year: 2019
  • Volume: 6
  • Issue: 3
  • Pages: 35-55
  • DOI: 10.3390/informatics6030035
  • Publisher: MDPI Publishing
  • Abstract: To build state-of-the-art Neural Machine Translation (NMT) systems, high-quality parallel sentences are needed. Typically, large amounts of data are scraped from multilingual web sites and aligned into datasets for training. Many tools exist for automatic alignment of such datasets. However, the quality of the resulting aligned corpus can be disappointing. In this paper, we present a tool for automatic misalignment detection (MAD). We treated the task of determining whether a pair of aligned sentences constitutes a genuine translation as a supervised regression problem. We trained our algorithm on a manually labeled dataset in the FR–NL language pair. Our algorithm used shallow features and features obtained after an initial translation step. We showed that the Levenshtein distance between the target and the translated source and the cosine distance between sentence embeddings of the source and the target were the two most important features for the task of misalignment detection. Using gold standards for alignment, we demonstrated that our model can increase the quality of alignments in a corpus substantially, reaching a precision close to 100%. Finally, we used our tool to investigate the effect of misalignments on NMT performance.
  • Keywords: data-curation; web crawling; neural machine translation
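
The two features the abstract singles out can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the length normalization of the Levenshtein distance, and the assumption that sentence embeddings arrive as plain lists of floats are all hypothetical choices made for illustration. The machine translation step and the embedding model are left to the caller.

```python
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the longer length (normalization is an assumption)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def cosine_distance(u: list[float], v: list[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def pair_features(target: str, translated_source: str,
                  source_vec: list[float], target_vec: list[float]) -> list[float]:
    """Hypothetical feature vector for one aligned sentence pair.

    translated_source is the source sentence after machine translation into
    the target language; the vectors are sentence embeddings computed elsewhere.
    """
    return [normalized_levenshtein(target, translated_source),
            cosine_distance(source_vec, target_vec)]
```

A feature vector of this kind could then be passed to any supervised regressor trained on manually labeled pairs. Low values for both features would suggest a genuine translation, high values a likely misalignment.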