Basic article information

Title: An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL
Authors: Valentin Zhikov; Hiroya Takamura; Manabu Okumura
Journal: Information and Media Technologies
Electronic ISSN: 1881-0896
Publication year: 2013
Volume: 8
Issue: 2
Pages: 514-527
DOI: 10.11185/imt.8.514
Publisher: Information and Media Technologies Editorial Board
Abstract: This paper proposes a fast and simple unsupervised word segmentation algorithm that utilizes the local predictability of adjacent character sequences, while searching for a least-effort representation of the data. The model uses branching entropy as a means of constraining the hypothesis space, in order to efficiently obtain a solution that minimizes the length of a two-part MDL code. An evaluation with corpora in Japanese, Thai, English, and the “CHILDES” corpus for research in language development reveals that the algorithm achieves an F-score comparable to that of the state-of-the-art methods in unsupervised word segmentation, in a significantly reduced computational time. In view of its capability to induce the vocabulary of large-scale corpora of domain-specific text, the method has the potential to improve the coverage of morphological analyzers for languages without explicit word boundary markers. A semi-supervised word segmentation approach is also proposed, in which the word boundaries obtained through the unsupervised model are used as features for a state-of-the-art word segmentation method.
Keywords: unsupervised word segmentation; semi-supervised word segmentation; branching entropy; minimum description length
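
Illustrative note: the abstract refers to branching entropy, the uncertainty about the next character given the characters that precede it. The sketch below shows one common way to estimate it from raw character counts. It is not the authors' implementation: the function name `branching_entropy`, the fixed context length `order`, and the toy input string are all assumptions made for illustration, and the MDL search described in the abstract is not covered.

```python
import math
from collections import Counter, defaultdict


def branching_entropy(text, order=2):
    """Estimate, for each context of `order` characters, the entropy (in bits)
    of the character that follows it in `text`.

    Hypothetical helper for illustration; a fixed context length is an
    assumption, not something stated in the record above.
    """
    # Count which characters follow each context.
    followers = defaultdict(Counter)
    for i in range(len(text) - order):
        context = text[i:i + order]
        followers[context][text[i + order]] += 1

    # Entropy of the empirical next-character distribution per context.
    entropy = {}
    for context, counts in followers.items():
        total = sum(counts.values())
        entropy[context] = -sum(
            (n / total) * math.log2(n / total) for n in counts.values()
        )
    return entropy


if __name__ == "__main__":
    # Toy input: "ab" is followed by three different characters,
    # so its branching entropy is higher than that of the other contexts.
    for context, h in sorted(branching_entropy("abcabdabe").items()):
        print(context, round(h, 3))
```

A high value after a context suggests that many different continuations are possible there, which is the kind of position a segmenter of this family would treat as a candidate word boundary; a value near zero suggests the position lies inside a word.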