Journal title: Journal of Theoretical and Applied Information Technology
Print ISSN: 1992-8645
Electronic ISSN: 1817-3195
Publication year: 2017
Volume: 95
Issue: 2
Publisher: Journal of Theoretical and Applied
Abstract: Social media text such as tweets is one of the challenges of natural language processing. In contrast to the highly edited, standard-language genres for which traditional NLP tools were developed, conversational text contains many non-standard lexical items and syntactic patterns. These result from dialectal variation, topic diversity, orthography, unintended errors, conversational errors and creative language use. Because Twitter text is characterized by idiosyncratic style, noise and linguistic errors, it is difficult to part-of-speech (POS) tag. The aim of this paper is to design and implement part-of-speech tagging models for Arabic tweets by investigating several machine learning models, such as K-Nearest Neighbour, Naive Bayes and Decision Tree. The paper introduces a novel Arabic Twitter corpus and assesses various state-of-the-art POS taggers that are retrained on this corpus. A state-of-the-art accuracy of 87.97% is achieved when tagging Twitter text.
Keywords: Arabic part-of-speech tagging; Arabic tweet classification; feature extraction