Basic Article Information

Title: Paraphrasing Training Data for Statistical Machine Translation
Authors: Eric Nichols; Francis Bond; D. Scott Appling; et al.
Journal: Information and Media Technologies
Online ISSN: 1881-0896
Year of publication: 2010
Volume: 5
Issue: 2
Pages: 950-971
DOI: 10.11185/imt.5.950
Publisher: Information and Media Technologies Editorial Board
Abstract: Large amounts of data are essential for training statistical machine translation systems. In this paper we show how training data can be expanded by paraphrasing one side of a parallel corpus. The new data is made by parsing then generating using an open-source, precise HPSG-based grammar. This gives sentences with the same meaning, but with minor variations in lexical choice and word order. In experiments paraphrasing the English in the Tanaka Corpus, a freely-available Japanese-English parallel corpus, we show consistent, statistically-significant gains on training data sets ranging from 10,000 to 147,000 sentence pairs in size as evaluated by the BLEU and METEOR automatic evaluation metrics.
Keywords: Natural Language Processing; Machine Translation; Paraphrasing; HPSG
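
Illustrative sketch (not from the paper): the abstract describes expanding training data by paraphrasing one side of a parallel corpus. The Python below shows only the corpus-expansion step under stated assumptions. The function names `expand_corpus` and `toy_paraphrase`, the `max_paraphrases` cap, and the lookup table are all hypothetical. The lookup table stands in for the HPSG parse-then-generate step, which the paper performs with a real grammar and which is not reproduced here.

```python
from typing import Callable, Iterable, List, Tuple

# (Japanese sentence, English sentence)
SentencePair = Tuple[str, str]


def expand_corpus(
    pairs: Iterable[SentencePair],
    paraphrase: Callable[[str], Iterable[str]],
    max_paraphrases: int = 2,
) -> List[SentencePair]:
    """Return the original pairs plus new pairs built from English paraphrases.

    Each paraphrase is aligned with the unchanged Japanese sentence.
    The cap `max_paraphrases` is an assumption of this sketch.
    """
    expanded: List[SentencePair] = []
    for ja, en in pairs:
        expanded.append((ja, en))  # always keep the original pair
        seen = {en}                # skip paraphrases identical to the original
        added = 0
        for candidate in paraphrase(en):
            if added >= max_paraphrases:
                break
            if candidate in seen:
                continue
            seen.add(candidate)
            expanded.append((ja, candidate))
            added += 1
    return expanded


# Hypothetical stand-in for a grammar-based paraphraser: a fixed lookup table.
TOY_PARAPHRASES = {
    "I will go there tomorrow.": [
        "Tomorrow I will go there.",
        "I will go there tomorrow.",  # duplicate of the input, filtered out
        "I'll go there tomorrow.",
    ],
}


def toy_paraphrase(sentence: str) -> List[str]:
    """Return canned paraphrases, or an empty list if none are known."""
    return TOY_PARAPHRASES.get(sentence, [])


if __name__ == "__main__":
    corpus = [
        ("明日そこに行きます。", "I will go there tomorrow."),
        ("これはペンです。", "This is a pen."),
    ]
    for ja, en in expand_corpus(corpus, toy_paraphrase):
        print(f"{ja}\t{en}")
```

In this sketch a sentence with no available paraphrase contributes only its original pair, so the expanded corpus is never smaller than the input.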