Article Information

  • Title: HMATC: Hierarchical multi-label Arabic text classification model using machine learning
  • Authors: Nawal Aljedani; Reem Alotaibi; Mounira Taileb
  • Journal: Egyptian Informatics Journal
  • Print ISSN: 1110-8665
  • Year: 2021
  • Volume: 22
  • Issue: 3
  • Pages: 225-237
  • DOI: 10.1016/j.eij.2020.08.004
  • Language: English
  • Publisher: Elsevier
  • Abstract: Multi-label classification assigns multiple labels to each document concurrently. Many real-world classification problems involve high-dimensional label spaces that are naturally structured as a hierarchy. In this type of problem, each instance may belong to multiple labels, and the labels are organized in a hierarchical structure. This is a more complex problem than flat classification, because the classification algorithm has to take into account the hierarchical relationships between labels and be able to predict multiple labels for the same instance. Few studies have investigated multi-label text classification for the Arabic language, and most of them have focused on flat classification and neglected the hierarchical structure. This paper therefore explores hierarchical multi-label classification in the context of the Arabic language. It proposes a hierarchical multi-label Arabic text classification (HMATC) model with a machine learning approach. The impact of feature selection methods and feature set dimensions on classification performance is also investigated. In addition, the Hierarchy Of Multilabel ClassifiER (HOMER) algorithm is optimized by examining different sets of multi-label classifiers, clustering algorithms and numbers of clusters to improve the hierarchical classification. Moreover, this study contributes to existing research by introducing a hierarchical multi-label Arabic dataset in a format appropriate for hierarchical classification and making it publicly available. The results reveal that the proposed model has the lowest computational cost (2 h) of all models considered in the experiments. In addition, it shows a significant improvement over the state-of-the-art model (Fatwa model) in terms of Hamming loss (0.004), hierarchical loss (1.723), multi-label accuracy (0.758), subset accuracy (0.292), micro-averaged precision (0.879), micro-averaged recall (0.828), and micro-averaged F-measure (0.853).
  • Keywords: Text classification; Multi-label classification; Hierarchical classification; Machine learning; Arabic natural language processing
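
Several of the evaluation measures named in the abstract can be illustrated with a minimal sketch. The functions below use the common textbook definitions of Hamming loss, subset accuracy and the micro-averaged F-measure; they are assumptions for illustration and are not taken from the paper's evaluation code. The label matrices `y_true` and `y_pred` are hypothetical toy data. Only the precision (0.879) and recall (0.828) values come from the abstract, and they are used to check that the reported F-measure (0.853) is their harmonic mean.

```python
def micro_f_measure(precision, recall):
    """Harmonic mean of micro-averaged precision and recall."""
    return 2 * precision * recall / (precision + recall)


def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


def subset_accuracy(y_true, y_pred):
    """Fraction of instances whose entire label set is predicted exactly."""
    exact = sum(true_row == pred_row for true_row, pred_row in zip(y_true, y_pred))
    return exact / len(y_true)


# Consistency check on the figures reported in the abstract.
reported_f = round(micro_f_measure(0.879, 0.828), 3)

# Hypothetical toy data: 3 documents, 4 candidate labels (1 = label assigned).
y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
y_pred = [[1, 0, 1, 0], [0, 1, 1, 0], [1, 0, 0, 0]]

toy_hamming = hamming_loss(y_true, y_pred)    # 2 wrong slots out of 12
toy_subset = subset_accuracy(y_true, y_pred)  # 1 exact match out of 3

print(reported_f, toy_hamming, toy_subset)
```

On the toy data, two of the three documents each have a single wrong label, so the Hamming loss stays low while subset accuracy drops to one in three. The same effect would explain how the abstract can report a very small Hamming loss (0.004) alongside a modest subset accuracy (0.292): subset accuracy only credits predictions whose whole label set is correct.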