Journal title: The Prague Bulletin of Mathematical Linguistics
Print ISSN: 0032-6585
Electronic ISSN: 1804-0462
Publication year: 2016
Volume: 106
Issue: 1
Pages: 169-179
DOI: 10.1515/pralin-2016-0016
Language: English
Publisher: Walter de Gruyter GmbH
Abstract: FaDA is a free/open-source tool for aligning multilingual documents. It employs a novel cross-lingual information retrieval (CLIR)-based document-alignment algorithm that combines the distances between embedded word vectors with the word overlap between the source-language and target-language documents. In this approach, we first construct a pseudo-query from a source-language document. We then represent the target-language documents and the pseudo-query as word vectors and compute the average similarity between them. This word vector-based similarity measure is then combined with the term overlap-based similarity. Our initial experiments show that our CLIR-based approach outperforms a standard Statistical Machine Translation (SMT)-based approach in finding the correct alignment pairs. Subsequent experiments with the word vector-based method show further improvements in system performance.
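The scoring idea in the abstract can be illustrated with a minimal sketch. It is not FaDA's actual implementation, and it rests on several assumptions that the abstract does not state: the pseudo-query tokens are already comparable with the target-language vocabulary, the word-vector similarity is taken as the cosine between averaged vectors, and the two scores are mixed by linear interpolation with a hypothetical weight `alpha`. All function names and the toy embeddings are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def average_vector(tokens, embeddings):
    """Mean of the embeddings of the known tokens, or None if none are known."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def term_overlap(query_tokens, doc_tokens):
    """Fraction of distinct query terms that also occur in the document."""
    q, d = set(query_tokens), set(doc_tokens)
    return len(q & d) / len(q) if q else 0.0


def combined_score(query_tokens, doc_tokens, embeddings, alpha=0.5):
    """Assumed combination: alpha * vector similarity + (1 - alpha) * overlap."""
    qv = average_vector(query_tokens, embeddings)
    dv = average_vector(doc_tokens, embeddings)
    vec_sim = cosine(qv, dv) if qv is not None and dv is not None else 0.0
    return alpha * vec_sim + (1 - alpha) * term_overlap(query_tokens, doc_tokens)


def best_alignment(query_tokens, target_docs, embeddings, alpha=0.5):
    """Return the id of the target document with the highest combined score."""
    return max(
        target_docs,
        key=lambda doc_id: combined_score(
            query_tokens, target_docs[doc_id], embeddings, alpha
        ),
    )


# Toy data, purely illustrative.
toy_embeddings = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}
toy_targets = {"d1": ["dog", "cat"], "d2": ["car"]}
print(best_alignment(["cat"], toy_targets, toy_embeddings))
```

With `alpha=1.0` the score reduces to the word vector-based similarity alone, and with `alpha=0.0` to term overlap alone, which makes it easy to compare the two signals on the same candidate set.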