
Article Information

  • Title: Word Segmentation Model for Sindhi Text
  • Authors: Zeeshan Bhatti; Imdad Ali Ismaili; Waseem Javaid Soomro
  • Journal: American Journal of Computing Research Repository
  • Print ISSN: 2377-4606
  • Online ISSN: 2377-4266
  • Publication year: 2014
  • Volume: 2
  • Issue: 1
  • Pages: 1-7
  • DOI: 10.12691/ajcrr-2-1-1
  • Language: English
  • Publisher: Science and Education Publishing
  • Abstract: This research addresses the problem of Sindhi word segmentation and discusses various techniques for solving it. Word segmentation is the preliminary phase of any tool based on Natural Language Processing (NLP): for a system to understand written text, it must first break the text into individual tokens for processing. Sindhi is written in a cursive, ligature-based Perso-Arabic script that is complex and rich, with a large number of characters, each having multiple glyphs depending on its position in the text. This paper proposes a Sindhi word tokenization model and implements various algorithms that tokenize Sindhi text into individual words, both for corpus building and for creating a word repository for a Sindhi spell checker, grammar checker, and other NLP applications. Tokenization proceeds by first identifying sentence boundaries and extracting the sentences into a list in which each element is a complete sentence. The separated sentences are then broken into words, using the hard space character as the word boundary; soft spaces are treated as part of the word and are not used for segmentation. Finally, each word is filtered to remove special characters, validated, and saved as a token.
  • Keywords: word segmentation; Sindhi tokenization; Sindhi language; Sindhi spell checker
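
The three stages described in the abstract (sentence splitting, word splitting on hard spaces, and special-character filtering) might be sketched roughly as follows. This is an illustrative Python sketch, not the authors' implementation. The function names, the set of sentence terminators, the choice of the zero-width non-joiner (U+200C) as the "soft space", and the validation rule (a token must contain at least one letter) are all assumptions made for the example.

```python
import re
import unicodedata

# Assumed sentence terminators: Arabic full stop (U+06D4),
# Arabic question mark (U+061F), '!' and '.'.
SENTENCE_END = "\u06D4\u061F!."
# Assumption: the "soft space" is the zero-width non-joiner (U+200C).
SOFT_SPACE = "\u200C"


def split_sentences(text):
    """Return the sentences of `text` as a list, one sentence per element."""
    parts = re.split("[" + re.escape(SENTENCE_END) + "]+", text)
    return [p.strip() for p in parts if p.strip()]


def clean_word(word):
    """Drop punctuation and symbol characters; keep letters, marks, soft spaces."""
    kept = [ch for ch in word if unicodedata.category(ch)[0] not in ("P", "S")]
    # A soft space at the very edge of a word joins nothing, so trim it.
    return "".join(kept).strip(SOFT_SPACE)


def tokenize(text):
    """Split text into sentences, then words, then filtered and validated tokens."""
    tokens = []
    for sentence in split_sentences(text):
        # str.split() breaks on ordinary whitespace (the hard space) but not
        # on U+200C, so a soft space stays inside its word.
        for raw in sentence.split():
            word = clean_word(raw)
            # Assumed validation rule: keep only tokens with at least one letter.
            if word and any(ch.isalpha() for ch in word):
                tokens.append(word)
    return tokens


if __name__ == "__main__":
    sample = "سنڌي ٻولي\u06D4 ڪتاب\u060C قلم!"
    print(tokenize(sample))
```

Splitting on every "." is a simplification: it would also break decimal numbers and abbreviations, which a production tokenizer would need to handle separately.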