Basic Article Information

  • Title: A Web Corpus and Word Sketches for Japanese
  • Authors: Irena Srdanović Erjavec; Tomaž Erjavec; Adam Kilgarriff
  • Journal: Information and Media Technologies
  • Online ISSN: 1881-0896
  • Publication year: 2008
  • Volume: 3
  • Issue: 3
  • Pages: 529-551
  • DOI: 10.11185/imt.3.529
  • Publisher: Information and Media Technologies Editorial Board
  • Abstract: Of all the major world languages, Japanese is lagging behind in terms of publicly accessible and searchable corpora. In this paper we describe the development of JpWaC (Japanese Web as Corpus), a large corpus of 400 million words of Japanese web text, and its encoding for the Sketch Engine. The Sketch Engine is a web-based corpus query tool that supports fast concordancing, grammatical processing, ‘word sketching’ (one-page summaries of a word's grammatical and collocational behaviour), a distributional thesaurus, and robot use. We describe the steps taken to gather and process the corpus and to establish its validity, in terms of the kinds of language it contains. We then describe the development of a shallow grammar for Japanese to enable word sketching. We believe that the Japanese web corpus as loaded into the Sketch Engine will be a useful resource for a wide number of Japanese researchers, learners, and NLP developers.
  • Keywords: Japanese web corpus; Corpus query tool; Sketch Engine; Word sketches