Basic Article Information

Title: ARABIC TEXT CLUSTERING BASED ON K-MEANS ALGORITHM WITH SEMANTIC WORD EMBEDDING
Authors: HASNAA R. H. SOLIMAN; MOHAMED GRIDA; MOHAMED HASSAN et al.
Journal: Journal of Theoretical and Applied Information Technology
Print ISSN: 1992-8645
Electronic ISSN: 1817-3195
Publication year: 2019
Volume: 97
Issue: 21
Pages: 2497-2509
Publisher: Journal of Theoretical and Applied
Abstract: With the massive growth of Arabic content on the web, clustering Arabic textual data into a small number of meaningful groups has become an essential component of various information retrieval applications, such as recommender systems, sentiment analysis, question answering systems, and search engines. Clustering methods, which are traditionally based on the bag-of-words (BOW) model for text representation, do not consider the order relationships between terms and may produce unsatisfactory clusters, especially for complex languages such as Arabic. This study introduces a model for enhancing the accuracy of Arabic document clusters by integrating the K-means clustering algorithm with embedding approaches, including Word to Vector (Word2Vec), as a representational basis instead of BOW, in order to capture the semantic information between individual terms. The model's performance was investigated on a news clustering dataset used in previous similar studies. It was concluded that combining embedding techniques with the K-means algorithm improves various clustering evaluation measures, such as purity, F-measure, and accuracy.
Keywords: Arabic Text Clustering; Document Embeddings; Word Embeddings; Doc2vec; Word2Vec
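
Illustrative sketch: the abstract describes replacing BOW features with word embeddings before running K-means and then scoring the clusters with measures such as purity. The following is a minimal, standard-library-only sketch of that general idea, not the authors' implementation. Everything named here is an assumption made for illustration: the two-dimensional `TOY_VECTORS` table stands in for a trained Word2Vec model, the English tokens stand in for preprocessed Arabic tokens, averaging word vectors is one common way to build a document vector (the record does not say which aggregation the paper uses), and the K-means initialization simply takes the first k points.

```python
import math
from collections import Counter

# Hypothetical toy embeddings (assumption): in practice these vectors would
# come from a trained Word2Vec model over an Arabic corpus.
TOY_VECTORS = {
    "goal": [0.9, 0.1],
    "match": [0.8, 0.2],
    "team": [0.85, 0.15],
    "market": [0.1, 0.9],
    "bank": [0.2, 0.8],
    "stocks": [0.15, 0.85],
}


def doc_vector(tokens, vectors):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("document has no known tokens")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def kmeans(points, k, iterations=20):
    """Plain Lloyd's K-means; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels


def purity(labels, truth):
    """Fraction of documents that belong to the majority class of their cluster."""
    total = 0
    for c in set(labels):
        counts = Counter(t for lab, t in zip(labels, truth) if lab == c)
        total += max(counts.values())
    return total / len(labels)


# Toy documents with known categories (assumption, for illustration only).
docs = [
    ["team", "goal", "match"],
    ["market", "stocks"],
    ["match", "team"],
    ["bank", "market", "stocks"],
]
truth = ["sports", "economy", "sports", "economy"]

points = [doc_vector(d, TOY_VECTORS) for d in docs]
labels = kmeans(points, k=2)
score = purity(labels, truth)
```

On this toy data the two sports documents and the two economy documents land in separate clusters. With real data, the embedding table would be replaced by vectors from a trained model, and a library K-means with better initialization would likely be preferable.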