
Article Information

  • Title: Extracting Comparable Articles from Wikipedia and Measuring their Comparabilities
  • Authors: Motaz Saad ; David Langlois
  • Journal: Procedia - Social and Behavioral Sciences
  • Print ISSN: 1877-0428
  • Year: 2013
  • Volume: 95
  • Pages: 40-47
  • DOI: 10.1016/j.sbspro.2013.10.620
  • Language: English
  • Publisher: Elsevier
  • Abstract: Parallel corpora are not available for all domains and languages, but statistical methods in multilingual research require large parallel or comparable corpora. Comparable corpora can be used when parallel corpora are insufficient or unavailable for specific domains and languages. In this paper, we propose a method to extract all comparable articles from Wikipedia for multiple languages based on interlanguage links. We also extract comparable articles from the Euro News website. We present two comparability measures (CM) to compute the degree of comparability of multilingual articles. We extracted about 40K and 34K comparable articles from Wikipedia and Euro News, respectively, in three languages: Arabic, French, and English. Experimental results on the comparability measures show that our measure can capture the comparability of multilingual corpora and makes it possible to retrieve articles in different languages concerning the same topic.
  • Keywords: computational linguistics; comparable corpora; comparability measure