Basic article information

Title: Noise-aware Character Alignment for Extracting Transliteration Fragments
Authors: Katsuhito Sudoh; Shinsuke Mori; Masaaki Nagata et al.
Journal: Information and Media Technologies
Electronic ISSN: 1881-0896
Publication year: 2015
Volume: 10
Issue: 1
Pages: 88-112
DOI: 10.11185/imt.10.88
Publisher: Information and Media Technologies Editorial Board
Abstract: This paper proposes a novel noise-aware character alignment method for automatically extracting transliteration fragments in phrase pairs that are extracted from parallel corpora. The proposed method extends a many-to-many Bayesian character alignment method by distinguishing transliteration (signal) parts from non-transliteration (noise) parts. The model can be trained efficiently by a state-based blocked Gibbs sampling algorithm with signal and noise states. The proposed method bootstraps statistical machine transliteration using the extracted transliteration fragments to train transliteration models. In experiments using Japanese-English patent data, the proposed method was able to extract transliteration fragments with much less noise than an IBM-model-based baseline, and achieved better transliteration performance than sample-wise extraction in transliteration bootstrapping.
Keywords: Statistical Machine Transliteration; Bayesian Many-to-many Alignment; Machine Translation