Basic article information

Title: FaDA: Fast Document Aligner using Word Embedding
Authors: Pintu Lohar; Debasis Ganguly; Haithem Afli; et al.
Journal: The Prague Bulletin of Mathematical Linguistics
Print ISSN: 0032-6585
Electronic ISSN: 1804-0462
Publication year: 2016
Volume: 106
Issue: 1
Pages: 169-179
DOI: 10.1515/pralin-2016-0016
Language: English
Publisher: Walter de Gruyter GmbH
Abstract: FaDA is a free/open-source tool for aligning multilingual documents. It employs a novel crosslingual information retrieval (CLIR)-based document-alignment algorithm involving the distances between embedded word vectors in combination with the word overlap between the source-language and the target-language documents. In this approach, we initially construct a pseudo-query from a source-language document. We then represent the target-language documents and the pseudo-query as word vectors to find the average similarity measure between them. This word vector-based similarity measure is then combined with the term overlap-based similarity. Our initial experiments show that a standard Statistical Machine Translation (SMT)-based approach is outperformed by our CLIR-based approach in finding the correct alignment pairs. In addition to this, subsequent experiments with the word vector-based method show further improvements in the performance of the system.
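
The scoring idea in the abstract, a word-vector similarity combined with a term-overlap similarity, can be illustrated with a minimal Python sketch. This is not FaDA's actual code. The function names, the weight `alpha`, the use of cosine similarity averaged over all term pairs, and the overlap ratio are all assumptions made for illustration; the abstract does not specify them. The sketch also assumes that the pseudo-query terms have already been mapped into the target language and that `vectors` is a hypothetical dictionary from words to embedding lists.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def avg_vector_similarity(query_terms, doc_terms, vectors):
    """Average cosine similarity over all (query term, document term) pairs.

    Assumption: terms without an embedding are skipped.
    """
    sims = [
        cosine(vectors[q], vectors[d])
        for q in query_terms
        for d in doc_terms
        if q in vectors and d in vectors
    ]
    return sum(sims) / len(sims) if sims else 0.0


def term_overlap(query_terms, doc_terms):
    """Fraction of distinct query terms that also occur in the document."""
    query_set, doc_set = set(query_terms), set(doc_terms)
    if not query_set:
        return 0.0
    return len(query_set & doc_set) / len(query_set)


def combined_score(query_terms, doc_terms, vectors, alpha=0.5):
    """Linear mix of the two similarities; alpha is a hypothetical weight."""
    vector_part = avg_vector_similarity(query_terms, doc_terms, vectors)
    overlap_part = term_overlap(query_terms, doc_terms)
    return alpha * vector_part + (1.0 - alpha) * overlap_part


def best_alignment(query_terms, candidates, vectors, alpha=0.5):
    """Return the name of the candidate document with the highest combined score."""
    return max(
        candidates,
        key=lambda name: combined_score(query_terms, candidates[name], vectors, alpha),
    )


if __name__ == "__main__":
    # Toy two-dimensional embeddings, invented for this example only.
    toy_vectors = {"cat": [1.0, 0.0], "bank": [0.0, 1.0]}
    toy_candidates = {"doc_a": ["cat"], "doc_b": ["bank"]}
    print(best_alignment(["cat"], toy_candidates, toy_vectors))
```

A linear mix is only one plausible way to combine the two signals. It keeps each component inspectable and lets the weight be tuned on held-out aligned pairs.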