
Article Information

  • Title: A High-Performing Similarity Measure for Categorical Dataset with SF-Tree Clustering Algorithm
  • Authors: Mahmoud A. Mahdi; Samir E. Abdelrahman; Reem Bahgat
  • Journal: International Journal of Advanced Computer Science and Applications (IJACSA)
  • Print ISSN: 2158-107X
  • Online ISSN: 2156-5570
  • Year: 2018
  • Volume: 9
  • Issue: 5
  • DOI: 10.14569/IJACSA.2018.090565
  • Publisher: Science and Information Society (SAI)
  • Abstract: Tasks such as clustering and classification assume the existence of a similarity measure to assess the similarity (or dissimilarity) of a pair of observations or clusters. The key difference between most clustering methods lies in their similarity measures. This article proposes a new similarity measure function called PWO (“Probability of the Weights between Overlapped items”), which can be used to cluster categorical datasets; proves that PWO is a metric; presents a framework implementation to detect the best similarity value for different datasets; and improves the F-tree clustering algorithm with a semi-supervised method to refine the results. The experimental evaluation on real categorical datasets (Mushrooms, KrVskp, Congressional Voting, Soybean-Large, Soybean-Small, Hepatitis, Zoo, Lenses, and Adult-Stretch) shows that PWO is more effective in measuring the similarity between categorical data than state-of-the-art algorithms; that clustering based on PWO with a predefined number of clusters results in a good separation of classes, with high purity averaging 80% coverage of the real classes; and that the overlap estimator perfectly estimates the value of the overlap threshold using a small sample of around 5% of the dataset.
  • Keywords: Algorithm; clustering; similarity; measurement; categorical; F-Tree; SF-Tree