Article Information

  • Title: An Efficient Spark-Based Hybrid Frequent Itemset Mining Algorithm for Big Data
  • Authors: Mohamed Reda Al-Bana; Marwa Salah Farhan; Nermin Abdelhakim Othman
  • Journal: Data
  • Print ISSN: 2306-5729
  • Publication year: 2022
  • Volume: 7
  • Issue: 1
  • Pages: 1-22
  • DOI: 10.3390/data7010011
  • Language: English
  • Publisher: MDPI Publishing
  • Abstract: Frequent itemset mining (FIM) is a common approach for discovering hidden frequent patterns from transactional databases used in prediction, association rules, classification, etc. Apriori is an FIM elementary algorithm with iterative nature used to find the frequent itemsets. Apriori is used to scan the dataset multiple times to generate big frequent itemsets with different cardinalities. Apriori performance descends when data gets bigger due to the multiple dataset scan to extract the frequent itemsets. Eclat is a scalable version of the Apriori algorithm that utilizes a vertical layout. The vertical layout has many advantages; it helps to solve the problem of multiple datasets scanning and has information that helps to find each itemset support. In a vertical layout, itemset support can be achieved by intersecting transaction ids (tidset/tids) and pruning irrelevant itemsets. However, when tids become too big for memory, it affects algorithms efficiency. In this paper, we introduce SHFIM (spark-based hybrid frequent itemset mining), which is a three-phase algorithm that utilizes both horizontal and vertical layout diffset instead of tidset to keep track of the differences between transaction ids rather than the intersections. Moreover, some improvements are developed to decrease the number of candidate itemsets. SHFIM is implemented and tested over the Spark framework, which utilizes the RDD (resilient distributed datasets) concept and in-memory processing that tackles MapReduce framework problem. We compared the SHFIM performance with Spark-based Eclat and dEclat algorithms for the four benchmark datasets. Experimental results proved that SHFIM outperforms Eclat and dEclat Spark-based algorithms in both dense and sparse datasets in terms of execution time.
  • Keywords: big data; frequent pattern mining; horizontal layout; vertical layout; diffset; Spark
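
The abstract contrasts two ways of computing itemset support in a vertical layout: intersecting tidsets (Eclat) and tracking differences with diffsets (dEclat). The sketch below illustrates only that contrast on a toy in-memory database. It is not the SHFIM algorithm and does not use Spark; the data and the function names (to_vertical, support_by_tidset, support_by_diffset) are hypothetical and chosen for illustration.

```python
from itertools import combinations

# Toy horizontal-layout database: transaction id -> items (made-up data).
transactions = {
    1: {"a", "b", "c"},
    2: {"a", "c"},
    3: {"a", "b"},
    4: {"b", "c"},
    5: {"a", "b", "c"},
}


def to_vertical(db):
    """Convert a horizontal layout into a vertical one: item -> tidset."""
    vertical = {}
    for tid, items in db.items():
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical


def support_by_tidset(vertical, x, y):
    """Eclat-style: support of {x, y} is the size of the tidset intersection."""
    return len(vertical[x] & vertical[y])


def support_by_diffset(vertical, x, y):
    """dEclat-style: d(xy) = t(x) - t(y), so support(xy) = support(x) - |d(xy)|."""
    diffset = vertical[x] - vertical[y]
    return len(vertical[x]) - len(diffset)


vertical = to_vertical(transactions)

# Support of every 2-itemset, computed from diffsets only.
pair_supports = {
    (x, y): support_by_diffset(vertical, x, y)
    for x, y in combinations(sorted(vertical), 2)
}
```

Both functions return the same support for any pair. The difference is in what has to be stored: the diffset of a pair holds only the transaction ids that drop out, which tends to be small when items co-occur often (dense data), whereas the intersection can stay close to the size of the original tidsets.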