Journal: The Prague Bulletin of Mathematical Linguistics
Print ISSN: 0032-6585
Electronic ISSN: 1804-0462
Publication year: 2017
Volume: 108
Issue: 1
Pages: 283-294
DOI: 10.1515/pralin-2017-0027
Language: English
Publisher: Walter de Gruyter GmbH
Abstract: Corpora are precious resources, as they allow for a proper estimation of statistical machine translation models. Data selection is a form of domain adaptation that aims to extract, from an out-of-domain corpus, the sentences most useful for translating a different target domain. We address the data selection problem in statistical machine translation as a classification task. We present a new method, based on neural networks, that can handle both monolingual and bilingual corpora. Empirical results show that our data selection method provides slightly better translation quality than a state-of-the-art method (cross-entropy) while requiring substantially less data. Moreover, the results are consistent across different language pairs, demonstrating the robustness of our proposal.