Basic Article Information

Title: Word Segmentation Model for Sindhi Text
Authors: Zeeshan Bhatti; Imdad Ali Ismaili; Waseem Javaid Soomro et al.
Journal: American Journal of Computing Research Repository
Print ISSN: 2377-4606
Electronic ISSN: 2377-4266
Publication year: 2014
Volume: 2
Issue: 1
Pages: 1-7
DOI: 10.12691/ajcrr-2-1-1
Language: English
Publisher: Science and Education Publishing
Abstract: This research addresses the problem of Sindhi word segmentation and discusses various techniques for solving it. Word segmentation is the preliminary phase of any tool based on Natural Language Processing (NLP): for a system to understand written text, it must be able to break that text into individual tokens for processing. Sindhi, written in a cursive, ligature-based Perso-Arabic script, is complex and rich, with a large number of characters, each having multiple glyphs depending on its position in the text. This paper proposes a Sindhi word tokenization model, implementing various algorithms that show the process of tokenizing Sindhi text into individual words for corpus building and for creating a word repository for a Sindhi spell checker, grammar checker, and other NLP applications. Tokenization is performed by first identifying the sentence boundaries and extracting the sentences into a list, where each list element is a complete sentence. The separated sentences are then broken down into words, with the hard space character used as the word boundary; soft spaces are considered part of a word and are therefore ignored during segmentation. Finally, each word is filtered again to remove special characters and then, after validation, converted and saved as a token.
Keywords: word segmentation; Sindhi tokenization; Sindhi language; Sindhi spell checker
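
The three stages described in the abstract (sentence splitting, word splitting on hard spaces, and special-character filtering with validation) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The set of sentence terminators, the modeling of the "soft space" as the zero-width non-joiner (U+200C), the use of Unicode punctuation and symbol categories as "special characters", and the validation rule are all assumptions made here for illustration; the function names are hypothetical.

```python
import re
import unicodedata

# Assumption: sentences end at Latin ". ? !", the Arabic full stop (U+06D4)
# or the Arabic question mark (U+061F). The paper's actual set may differ.
SENTENCE_END = re.compile(r"[.?!\u06D4\u061F]+")

# Assumption: the "soft space" is modeled as the zero-width non-joiner.
SOFT_SPACE = "\u200c"
HARD_SPACE = " "


def split_sentences(text):
    """Stage 1: return the non-empty sentences of text as a list."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def strip_special(word):
    """Stage 3a: drop punctuation (P*) and symbol (S*) characters.

    Soft spaces have Unicode category Cf, so they are kept inside the word.
    """
    return "".join(
        ch for ch in word if unicodedata.category(ch)[0] not in ("P", "S")
    )


def is_valid(token):
    """Stage 3b (assumed rule): a token must contain at least one letter."""
    return any(ch.isalpha() for ch in token)


def tokenize(text):
    """Run the full pipeline and return a flat list of tokens."""
    tokens = []
    for sentence in split_sentences(text):
        # Stage 2: only the hard space is treated as a word boundary.
        for word in sentence.split(HARD_SPACE):
            cleaned = strip_special(word)
            if is_valid(cleaned):
                tokens.append(cleaned)
    return tokens
```

Calling `tokenize` on a string returns a flat list of words. Because only the hard space is treated as a boundary, a word containing a soft space stays a single token, which mirrors the rule stated in the abstract.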