Basic Article Information

  • Title: Compiling and Using the IJS-ELAN Parallel Corpus
  • Author: Tomaž Erjavec
  • Journal: Informatica
  • Print ISSN: 1514-8327
  • Electronic ISSN: 1854-3871
  • Year of publication: 2002
  • Volume: 26
  • Issue: 3
  • Pages: 299-308
  • Publisher: The Slovene Society Informatika, Ljubljana
  • Abstract: With increasing amounts of text being available in electronic form, it is becoming relatively easy to obtain digital texts together with their translations. The paper presents the processing steps necessary to compile such texts into parallel corpora, an extremely useful language resource. Parallel corpora can be used as a translation aid for second-language learners, for translators and lexicographers, or as a data source for various language technology tools. We present our work in this direction, which is characterised by the use of open standards for text annotation, the use of publicly available third-party tools, and wide availability of the produced resources. We explain the corpus annotation chain, which involves normalisation, tokenisation, segmentation, alignment, word-class syntactic tagging, and lemmatisation. Two exploitation results over our annotated corpora are also presented, namely a Web concordancer and the extraction of bilingual lexica.
  • Keywords: natural language processing; corpus annotation; multilinguality; lexicon extraction