Article Information

Title: Optimizing Tokenization Choice for Machine Translation across Multiple Target Languages
Authors: Nasser Zalmout; Nizar Habash
Journal: The Prague Bulletin of Mathematical Linguistics
Print ISSN: 0032-6585
Online ISSN: 1804-0462
Publication year: 2017
Volume: 108
Issue: 1
Pages: 257-269
DOI: 10.1515/pralin-2017-0025
Language: English
Publisher: Walter de Gruyter GmbH
Abstract: Tokenization is very helpful for Statistical Machine Translation (SMT), especially when translating from morphologically rich languages. Typically, a single tokenization scheme is applied to the entire source-language text and regardless of the target language. In this paper, we evaluate the hypothesis that SMT performance may benefit from different tokenization schemes for different words within the same text, and also for different target languages. We apply this approach to Arabic as a source language, with five target languages of varying morphological complexity: English, French, Spanish, Russian and Chinese. Our results show that different target languages indeed require different source-language schemes; and a context-variable tokenization scheme can outperform a context-constant scheme with a statistically significant performance enhancement of about 1.4 BLEU points.