Basic Article Information

  • Title: Improving Machine Translation Performance by Exploiting Non-Parallel Corpora
  • Authors: Dragos Stefan Munteanu; Daniel Marcu
  • Journal: Computational Linguistics
  • Print ISSN: 0891-2017
  • Electronic ISSN: 1530-9312
  • Publication year: 2005
  • Volume: 31
  • Issue: 4
  • Pages: 477-504
  • DOI: 10.1162/089120105775299168
  • Language: English
  • Publisher: MIT Press
  • Abstract: We present a novel method for discovering parallel sentences in comparable, non-parallel corpora. We train a maximum entropy classifier that, given a pair of sentences, can reliably determine whether or not they are translations of each other. Using this approach, we extract parallel data from large Chinese, Arabic, and English non-parallel newspaper corpora. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art statistical machine translation system. We also show that a good-quality MT system can be built from scratch by starting with a very small parallel corpus (100,000 words) and exploiting a large non-parallel corpus. Thus, our method can be applied with great benefit to language pairs for which only scarce resources are available.