Article Information

Title: SENTIMENT ANALYSIS FOR ARABIC TWEETS DATASETS: LEXICON-BASED AND MACHINE LEARNING APPROACHES
Authors: AHMAD ALOQAILY ; MALAK AL-HASSAN ; KAMAL SALAH et al.
Journal: Journal of Theoretical and Applied Information Technology
Print ISSN: 1992-8645
Electronic ISSN: 1817-3195
Year of publication: 2020
Volume: 98
Issue: 4
Pages: 612-623
Publisher: Journal of Theoretical and Applied
Abstract: Recently, Sentiment Analysis applied to social media data has gradually become one of the significant research interests in the data mining domain due to the large volume of data available on social media networks. Sentiment Analysis is concerned with analyzing text to identify opinions or emotions and categorizing them as positive, negative or neutral. Applying sentiment analysis to short texts such as Twitter messages is a challenging task because tweets might contain a combination of formal and informal language, special characters, emojis and symbols. Therefore, it is often difficult to understand the semantics of the text and to extract the emotions expressed by users. In this paper, two sentiment analysis approaches, namely lexicon-based and machine learning approaches, are applied and evaluated on an Arabic tweets dataset (short texts) regarding the Syrian civil war and crises. The experimental results revealed that the machine learning approaches outperformed the lexicon-based approach in predicting the subjectivity of tweets. In terms of machine learning, five popular machine learning algorithms were applied and evaluated. According to the experimental results, the Logistic Model Trees (LMT) algorithm achieved the highest performance, followed by the simple logistic and the SVM algorithms, respectively. The results also showed that performance improves when feature selection approaches are used. Based on all performance evaluation measures, the LMT algorithm reported the best performance results (Acc = 85.55, F1 = 0.92 and AUC = 0.86).
Keywords: Machine Learning; Lexicon-Based Approach; Sentiment Analysis; Opinion Mining; Social Media; Twitter Datasets.