
Article Information

  • Title: Segmenting Words in Thai Language Using Minimum Text Units and Conditional Random Field
  • Authors: Kannikar Paripremkul; Ohm Sornil
  • Journal: Journal of Advances in Information Technology
  • Print ISSN: 1798-2340
  • Year: 2021
  • Volume: 12
  • Issue: 2
  • Pages: 135-141
  • DOI: 10.12720/jait.12.2.135-141
  • Publisher: Academy Publisher
  • Abstract: Word segmentation is essential to natural language processing tasks. Thai, like many other Asian languages, has no word delimiters. Thai word segmentation requires not only dividing a sequence of characters into meaningful words, but also ensuring that the division is correct for the context of the sentence. With the popularity of social media, unknown, informal, and slang words are widely used, in addition to words adopted from other languages. Word segmentation methods, generally trained on formal corpora or dictionaries, do not perform well on such text. This research proposes a novel technique for Thai word segmentation in which the smallest units constituting words are first extracted and then combined into syllables using a Conditional Random Field. Words are then segmented by merging the syllables with a set of rules learned from language characteristics. The technique is evaluated on both formal and informal datasets against a method based on a convolutional neural network, which currently gives the best performance for Thai word segmentation. The results show that the proposed method outperforms the comparison system, achieving F-scores of 0.9965 and 0.9857 on formal and informal text, respectively.
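
The abstract reports results as F-scores. As a rough illustration of how a word-level F-score for segmentation is commonly computed, the sketch below compares gold and predicted segmentations of the same text by their character spans. This is an assumption about the metric's general form, not the authors' evaluation code; the function names `word_spans` and `segmentation_f_score` are hypothetical.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f_score(gold_words, predicted_words):
    """Word-level F-score (assumed definition): a predicted word counts as
    correct only if both of its boundaries match a gold word exactly."""
    gold = word_spans(gold_words)
    predicted = word_spans(predicted_words)
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Both lists must concatenate to the same underlying text for the spans to be comparable. For example, `segmentation_f_score(["ab", "cd", "e"], ["ab", "c", "de"])` matches only the first word, so precision and recall are both 1/3.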