Article information

  • Title: Extracting Parallel Paragraphs from Common Crawl
  • Authors: Jakub Kúdela; Irena Holubová; Ondřej Bojar
  • Journal: The Prague Bulletin of Mathematical Linguistics
  • Print ISSN: 0032-6585
  • Electronic ISSN: 1804-0462
  • Year of publication: 2017
  • Volume: 107
  • Issue: 1
  • Pages: 39-56
  • DOI: 10.1515/pralin-2017-0003
  • Language: English
  • Publisher: Walter de Gruyter GmbH
  • Abstract: Most current methods for mining parallel texts from the web assume that the pages of a web site share the same structure across languages. We believe that a non-negligible amount of parallel data is still spread across sources that do not satisfy this assumption. We propose an approach based on a combination of bivec (a bilingual extension of word2vec) and locality-sensitive hashing, which allows us to efficiently identify pairs of parallel segments located anywhere on the pages of a given web domain, regardless of their structure. We validate our method by realigning segments from a large parallel corpus. Another experiment, with real-world data provided by the Common Crawl Foundation, confirms that our solution scales to a set of web-crawled data hundreds of terabytes in size.
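
The abstract names locality-sensitive hashing over bilingual embeddings but gives no details, so the following is only a minimal sketch of one common LSH family (random-hyperplane hashing for cosine similarity), not the authors' implementation. The function names, the toy 3-dimensional vectors, and the segment IDs are all hypothetical; in practice the vectors would come from a bilingual embedding model such as bivec.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (Gaussian normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, hyperplanes):
    """One bit per hyperplane: 1 if vec lies on its non-negative side, else 0."""
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    )


def candidate_pairs(src_vecs, tgt_vecs, hyperplanes):
    """Pair each source segment with target segments that share its signature.

    Only segments that land in the same bucket are compared, which avoids
    scoring every source segment against every target segment.
    """
    buckets = defaultdict(list)
    for tgt_id, vec in tgt_vecs.items():
        buckets[signature(vec, hyperplanes)].append(tgt_id)

    pairs = []
    for src_id, vec in src_vecs.items():
        for tgt_id in buckets.get(signature(vec, hyperplanes), []):
            pairs.append((src_id, tgt_id))
    return pairs


# Hypothetical toy data: segment vectors assumed to live in a shared
# bilingual embedding space (made-up values, not real embeddings).
planes = make_hyperplanes(dim=3, n_bits=8)
english = {"en1": [1.0, 2.0, -0.5]}
czech = {"cs1": [1.0, 2.0, -0.5], "cs2": [-1.0, -2.0, 0.5]}
print(candidate_pairs(english, czech, planes))
```

Vectors pointing in the same direction always share a signature, while vectors pointing in opposite directions differ in every bit, so near-parallel segments tend to collide and unrelated ones tend not to. A real pipeline would typically use several hash tables and re-score the surviving candidates with an exact similarity measure.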