
Article information

  • Title: From characters to words: the turning point of BPE merges
  • Authors: Ximena Gutierrez-Vasques; Christian Bentz; Olga Sozinova
  • Venue: Conference of the European Chapter of the Association for Computational Linguistics (EACL)
  • Year: 2021
  • Volume: 2021
  • Pages: 3454-3468
  • DOI: 10.18653/v1/2021.eacl-main.302
  • Language: English
  • Publisher: ACL Anthology
  • Abstract: The distributions of orthographic word types are very different across languages due to typological characteristics, different writing traditions and potentially other factors. The wide range of cross-linguistic diversity is still a major challenge for NLP and the study of language. We use BPE and information-theoretic measures to investigate if distributions become similar under specific levels of subword tokenization. We perform a cross-linguistic comparison, following incremental merges of BPE (we go from characters to words) for 47 diverse languages. We show that text entropy values (a feature of probability distributions) tend to converge at specific subword levels: relatively few BPE merges (around 350) lead to the most similar distributions across languages. Additionally, we analyze the interaction between subword and word-level distributions and show that our findings can be interpreted in light of the ongoing discussion regarding different types of morphological complexity.
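
The procedure the abstract describes (apply BPE merges one at a time, moving from characters toward words, and record the entropy of the token distribution after each merge) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names are hypothetical, words are assumed to be whitespace-separated, merges are assumed not to cross word boundaries, and the entropy is a simple plug-in unigram estimate in bits, which may differ from the estimator used in the paper.

```python
import math
from collections import Counter


def unigram_entropy(tokens):
    """Plug-in Shannon entropy (in bits) of the empirical token distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def merge_pair(word, pair):
    """Merge every adjacent occurrence of `pair` inside one word (a tuple of symbols)."""
    merged, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            merged.append(word[i] + word[i + 1])
            i += 2
        else:
            merged.append(word[i])
            i += 1
    return tuple(merged)


def bpe_entropy_curve(text, num_merges):
    """Return the entropy after 0, 1, 2, ... BPE merges (at most num_merges).

    Assumption: words are split on whitespace and start as character tuples.
    The loop stops early once every word is a single symbol (word level).
    """
    words = [tuple(w) for w in text.split()]
    entropies = [unigram_entropy([s for w in words for s in w])]
    for _ in range(num_merges):
        # Count adjacent symbol pairs within words only.
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair is merged next
        words = [merge_pair(w, best) for w in words]
        entropies.append(unigram_entropy([s for w in words for s in w]))
    return entropies
```

Calling `bpe_entropy_curve(corpus_text, 350)` on corpora in several languages and comparing the resulting curves at the same merge index would be one way to look at the kind of convergence the abstract reports; the last value of a fully merged curve is the word-level entropy.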