Article information

Title: Rule-based Text Normalization for Malay Social Media Texts
Authors: Siti Noor Allia Noor Ariffin; Sabrina Tiun
Journal: International Journal of Advanced Computer Science and Applications (IJACSA)
Print ISSN: 2158-107X
Electronic ISSN: 2156-5570
Publication year: 2020
Volume: 11
Issue: 10
DOI: 10.14569/IJACSA.2020.0111021
Publisher: Science and Information Society (SAI)
Abstract: Malay social media text is text written on social media networks such as Twitter. It commonly contains non-standard words and is filled with dialects, foreign languages, abbreviations, grammatical neglect, spelling errors, and more. This type of text is well known to be difficult to process because of its high noise and distinct structure. Such problems can be resolved with rigorous text normalization, which is critical before any technique can be implemented and evaluated on social media text. This paper proposes an improved normalization method for Malay social media text that converts non-standard Malay words using a rule-based model. The method normalizes words from the languages Malaysian users commonly write in, such as non-standard Malay (including dialect and slang), Romanized Arabic, and English. The resulting Malay text normalizer uses a set of rules that extend across different domains of natural language processing (NLP) and is expected to address the challenges of processing Malay social media text. To evaluate the normalizer's performance, the study applies it in a Part-of-Speech (POS) tagging application. The implementation demonstrates a substantial improvement in POS tagging across several pre-processing stages, with accuracy improving by up to 31.8%. The increase in POS tagging accuracy indicates two main points. First, the normalizer's rules improve the performance of a Malay text normalizer on social media text. Second, the proposed normalizer improves POS tagging accuracy and demonstrates the importance of normalization as a pre-processing step in any NLP application.
Keywords: Malay normalization; Malay text normalization; informal Malay text; Malay tweets; rule-based normalizer