
Article information

  • Title: Combining Content-Based and URL-Based Heuristics to Harvest Aligned Bitexts from Multilingual Sites with Bitextor
  • Authors: Miquel Esplà-Gomis; Mikel Forcada
  • Journal: The Prague Bulletin of Mathematical Linguistics
  • Print ISSN: 0032-6585
  • Electronic ISSN: 1804-0462
  • Year of publication: 2010
  • Volume: 93
  • Issue: 1
  • Pages: 77-86
  • DOI: 10.2478/v10108-010-0003-9
  • Language: English
  • Publisher: Walter de Gruyter GmbH
  • Abstract: Many websites on the Internet are multilingual and may be considered sources of parallel corpora. This paper describes the free/open-source tool Bitextor, created to harvest aligned bitexts from these multilingual websites; such bitexts may be used to train corpus-based machine translation systems. The tool builds on previous approaches, with modifications and improvements intended to make it as adaptable as possible, so that it can process any kind of website and work with any pair of languages. The content-based and URL-based heuristics and algorithms used to identify and align the parallel web pages in a website are described, and some results are presented to show the functionality of the application and to set out future lines of work for the project.