Article information

  • Title: An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL
  • Authors: Valentin Zhikov ; Hiroya Takamura ; Manabu Okumura
  • Journal: 人工知能学会論文誌 (Transactions of the Japanese Society for Artificial Intelligence)
  • Print ISSN: 1346-0714
  • Electronic ISSN: 1346-8030
  • Year: 2013
  • Volume: 28
  • Issue: 3
  • Pages: 347-360
  • DOI: 10.1527/tjsai.28.347
  • Publisher: The Japanese Society for Artificial Intelligence
  • Abstract: This paper proposes a fast and simple unsupervised word segmentation algorithm that utilizes the local predictability of adjacent character sequences while searching for a least-effort representation of the data. The model uses branching entropy to constrain the hypothesis space, so that it can efficiently obtain a solution that minimizes the length of a two-part MDL code. An evaluation on corpora in Japanese, Thai, and English, together with the "CHILDES" corpus for research in language development, shows that the algorithm achieves an F-score comparable to that of state-of-the-art methods in unsupervised word segmentation, in significantly less computation time. Because it can induce the vocabulary of large-scale corpora of domain-specific text, the method has the potential to improve the coverage of morphological analyzers for languages without explicit word boundary markers. A semi-supervised word segmentation approach is also proposed, in which the word boundaries obtained through the unsupervised model are used as features for a state-of-the-art word segmentation method.
  • Keywords: unsupervised word segmentation ; semi-supervised word segmentation ; branching entropy ; minimum description length
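
The abstract names branching entropy as the signal used to constrain candidate word boundaries. As a rough illustration only, and not the authors' implementation, the sketch below estimates the entropy of the character that follows a given context in a raw string. The function name `branching_entropy`, its parameters, and the toy input are assumptions made for this example; the MDL search described in the abstract is not shown.

```python
import math
from collections import Counter


def branching_entropy(text: str, context: str) -> float:
    """Entropy (in bits) of the character that follows `context` in `text`.

    Hypothetical helper for illustration; counts come from the raw
    string rather than from a trained n-gram model.
    """
    followers = Counter()
    start = text.find(context)
    while start != -1:
        nxt = start + len(context)
        if nxt < len(text):
            # Record which character follows this occurrence of the context.
            followers[text[nxt]] += 1
        # Step forward by one so overlapping occurrences are also counted.
        start = text.find(context, start + 1)
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in followers.values())


if __name__ == "__main__":
    toy = "thedogthecattheman"
    # Inside a word the next character is predictable, so entropy is low.
    print(branching_entropy(toy, "th"))
    # At the end of a word many characters can follow, so entropy is higher.
    print(branching_entropy(toy, "the"))
```

In this toy string, "th" is always followed by "e", while "the" is followed by three different characters, so the entropy rises at the position where a word boundary is plausible. That rise is the kind of local unpredictability the abstract refers to.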