Article Information

  • Title: ARABIC TEXT CLUSTERING BASED ON K-MEANS ALGORITHM WITH SEMANTIC WORD EMBEDDING
  • Authors: HASNAA R. H. SOLIMAN ; MOHAMED GRIDA ; MOHAMED HASSAN
  • Journal: Journal of Theoretical and Applied Information Technology
  • Print ISSN: 1992-8645
  • Electronic ISSN: 1817-3195
  • Publication year: 2019
  • Volume: 97
  • Issue: 21
  • Pages: 2497-2509
  • Publisher: Journal of Theoretical and Applied
  • Abstract: With the massive growth of Arabic content on the web, clustering Arabic textual data into a small number of meaningful groups has become an essential component of various information retrieval applications, such as recommender systems, sentiment analysis, question answering systems, and search engines. Clustering methods, which are traditionally based on the bag-of-words (BOW) model for text representation, do not consider the order relationships between terms and may produce unsatisfactory clusters, especially for complex languages such as Arabic. This study introduces a model for enhancing the accuracy of Arabic document clusters by integrating the K-means clustering algorithm with embedding approaches, including Word to Vector (Word2Vec), as a representational basis instead of BOW, in order to capture the semantic information between individual terms. The model's clustering performance was investigated on a news dataset used in previous similar studies. It was concluded that combining embedding techniques with the K-means algorithm improves various clustering evaluation measures, such as purity, F-measure, and accuracy.
  • Keywords: Arabic Text Clustering; Document Embeddings; Word Embeddings; Doc2vec; Word2Vec
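
The approach summarized in the abstract (represent each document by word embeddings rather than BOW, cluster with K-means, then score with purity) can be illustrated with a minimal sketch. This is not the authors' implementation: the 2-D vectors in `TOY_VECTORS` are hand-made stand-ins for trained Word2Vec embeddings, the English tokens stand in for Arabic text, averaging word vectors is only one possible way to form a document vector, and the helper names (`doc_vector`, `kmeans`, `purity`) are hypothetical.

```python
from collections import Counter

# Toy 2-D "embeddings" with made-up values (assumption); a real pipeline
# would train Word2Vec on an Arabic corpus and use its learned vectors.
TOY_VECTORS = {
    "goal": (0.9, 0.1), "match": (0.8, 0.2), "team": (0.85, 0.15),
    "bank": (0.1, 0.9), "market": (0.2, 0.8), "price": (0.15, 0.85),
}

def doc_vector(tokens, vectors):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no known tokens in document")
    dim = len(known[0])
    return tuple(sum(v[i] for v in known) / len(known) for i in range(dim))

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain Lloyd's K-means; the first k points seed the centroids."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                dim = len(members[0])
                centroids[c] = tuple(
                    sum(m[i] for m in members) / len(members)
                    for i in range(dim)
                )
    return labels

def purity(labels, truth):
    """Share of documents belonging to the majority true class of their cluster."""
    total = 0
    for c in set(labels):
        classes = [t for lab, t in zip(labels, truth) if lab == c]
        total += Counter(classes).most_common(1)[0][1]
    return total / len(labels)

docs = [["goal", "match"], ["bank", "price"], ["team", "goal"], ["market", "bank"]]
truth = ["sports", "economy", "sports", "economy"]
points = [doc_vector(d, TOY_VECTORS) for d in docs]
labels = kmeans(points, k=2)
print(labels, purity(labels, truth))
```

Seeding the centroids with the first `k` points keeps the sketch deterministic; a practical setup would more likely use randomized or k-means++ initialization with several restarts.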