期刊名称:The Prague Bulletin of Mathematical Linguistics
印刷版ISSN:0032-6585
电子版ISSN:1804-0462
出版年度:2010
卷号:93
期号:1
页码:77-86
DOI:10.2478/v10108-010-0003-9
语种:English
出版社:Walter de Gruyter GmbH
摘要:Nowadays, many websites in the Internet are multilingual and may be considered sources of parallel corpora. In this paper we will describe the free/open-source tool Bitextor, created to harvest aligned bitexts from these multilingual websites, which may be used to train corpus-based machine translation systems. This tool uses the work developed in previous approaches with modifications and improvements in order to obtain a tool as adaptable as possible to make it easier to process any kind of websites and work with any pairs of languages. Content-based and URL-based heuristics and algorithms applied to identify and align the parallel web pages in a website will be described and, finally, some results will be presented to show the functionality of the application and set the future work lines for this project.