Basic Article Information

Title: Extracting Parallel Paragraphs from Common Crawl
Authors: Jakub Kúdela; Irena Holubová; Ondřej Bojar et al.
Journal: The Prague Bulletin of Mathematical Linguistics
Print ISSN: 0032-6585
Electronic ISSN: 1804-0462
Publication year: 2017
Volume: 107
Issue: 1
Pages: 39-56
DOI: 10.1515/pralin-2017-0003
Language: English
Publisher: Walter de Gruyter GmbH
Abstract: Most current methods for mining parallel texts from the web assume that the pages of a web site share the same structure across languages. We believe that a non-negligible amount of parallel data is still spread across sources that do not satisfy this assumption. We propose an approach based on a combination of bivec (a bilingual extension of word2vec) and locality-sensitive hashing, which allows us to efficiently identify pairs of parallel segments located anywhere on the pages of a given web domain, regardless of their structure. We validate our method by realigning segments from a large parallel corpus. Another experiment with real-world data provided by the Common Crawl Foundation confirms that our solution scales to a web-crawled data set hundreds of terabytes in size.
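
Illustrative note: the abstract names locality-sensitive hashing over bilingual embeddings but does not say which hashing scheme is used. The sketch below is therefore only an assumption-laden illustration, not the authors' implementation. It assumes random-hyperplane (sign) hashing for cosine similarity, and it uses small hand-written vectors in place of real bivec segment embeddings. All function names (`make_hyperplanes`, `signature`, `candidate_pairs`) and segment IDs are hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes in a dim-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, hyperplanes):
    """Hash a vector to a bit tuple: one bit per hyperplane, set by the sign of the dot product."""
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    )


def candidate_pairs(source_vecs, target_vecs, hyperplanes):
    """Return (source_id, target_id) pairs whose vectors fall into the same hash bucket."""
    buckets = defaultdict(list)
    for t_id, vec in target_vecs.items():
        buckets[signature(vec, hyperplanes)].append(t_id)
    pairs = []
    for s_id, vec in source_vecs.items():
        for t_id in buckets.get(signature(vec, hyperplanes), []):
            pairs.append((s_id, t_id))
    return pairs


# Toy stand-ins for segment embeddings in a shared bilingual space (assumed data).
planes = make_hyperplanes(dim=4, n_bits=8)
source = {"en-1": [0.2, -0.5, 0.1, 0.9]}
target = {
    "cs-1": [0.2, -0.5, 0.1, 0.9],    # same direction as en-1
    "cs-2": [-0.2, 0.5, -0.1, -0.9],  # opposite direction
}
print(candidate_pairs(source, target, planes))
```

Vectors pointing in the same direction always receive the same signature, and vectors separated by a small angle usually do, so only segments sharing a bucket need a full similarity comparison. That reduction in pairwise comparisons is the general reason hashing of this kind is used on large collections.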