Article Information

  • Title: SCALABLE PARALLEL BIG DATA SUMMARIZATION TECHNIQUE BASED ON HIERARCHICAL CLUSTERING ALGORITHM
  • Authors: VERONICA S. MOERTINI; MATTHEW ARIEL
  • Journal: Journal of Theoretical and Applied Information Technology
  • Print ISSN: 1992-8645
  • Electronic ISSN: 1817-3195
  • Year: 2020
  • Volume: 98
  • Issue: 21
  • Pages: 3559-3581
  • Publisher: Journal of Theoretical and Applied Information Technology
  • Abstract: Data reduction or summarization techniques can be applied to obtain a reduced representation of a data set that is much smaller in volume yet closely maintains the integrity of the original data. For summarizing data, the agglomerative hierarchical clustering algorithm has several advantages: it is quite simple, it can produce summaries at a specific level (in the form of cluster patterns) with a simple adjustment, and it can be parallelized. Apache Spark is a data processing framework that can quickly perform processing tasks on very large data sets, and it is currently a standard tool for analyzing big data. In processing data, Spark can distribute data processing tasks across multiple machines. Spark can run on top of Hadoop with its distributed file system (HDFS) and resource manager (YARN), so it can access HDFS files and use network resources efficiently. To achieve high performance and scalability of a data analysis technique in the Spark environment, stages, narrow and wide transformations, and I/O and network costs must be considered. In this research, we develop a data summarization technique that employs the agglomerative algorithm on Spark. To avoid biased results, records in the given big data are randomly split into bags stored as Resilient Distributed Dataset (RDD) partitions on the worker machines. To reduce network and I/O cost, we employ only one wide transformation that involves data shuffling across the network; the RDD partitions are then processed locally by worker tasks to produce cluster patterns. Functions with complex computations are designed as Spark parallel tasks. A series of experiments was conducted on a Spark cluster with one driver and ten worker nodes, varying the data size (5 to 20 GB), the number of machine cores used (10 to 50), and the application variables (number of data splits and maximum objects per dendrogram). The results show that the technique is scalable and efficient; the execution time is mostly determined by the parallel tasks run locally on the workers. (A minimal code sketch of this pipeline appears after the keywords below.)
  • Keywords: Big Data Summarization; Parallel Agglomerative; Apache Spark Application Design
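
The following is a minimal, illustrative Scala/Spark sketch of the pipeline the abstract describes, not the authors' code: records are randomly tagged with a bag id, one wide transformation (`partitionBy`) shuffles each bag into its own RDD partition, and each partition is then summarized locally with a naive agglomerative clustering pass. The input path and the variables `numBags` and `maxClusters` are assumptions, as is the use of centroid distance as the merge criterion.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession
import scala.util.Random

object ParallelSummarySketch {

  type Point = Array[Double]

  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  def centroid(c: Seq[Point]): Point = {
    val sums = c.foldLeft(new Array[Double](c.head.length)) { (acc, p) =>
      acc.indices.foreach(d => acc(d) += p(d)); acc
    }
    sums.map(_ / c.size)
  }

  // Naive agglomerative clustering of one bag: repeatedly merge the two
  // clusters with the closest centroids until k clusters remain, then
  // return their centroids as the "cluster patterns" summarizing the bag.
  def summarizeBag(points: Seq[Point], k: Int): Seq[Point] = {
    var clusters = points.map(p => Vector(p)).toVector
    while (clusters.size > k) {
      val pairs = for (i <- clusters.indices; j <- i + 1 until clusters.size)
        yield (i, j, dist(centroid(clusters(i)), centroid(clusters(j))))
      val (i, j, _) = pairs.minBy(_._3)
      val merged = clusters(i) ++ clusters(j)
      clusters = clusters.zipWithIndex
        .collect { case (c, idx) if idx != i && idx != j => c } :+ merged
    }
    clusters.map(centroid)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SummarySketch").getOrCreate()
    val sc = spark.sparkContext

    val numBags = 40      // assumed application variable: number of data splits
    val maxClusters = 10  // assumed: cluster patterns kept per bag

    val records = sc.textFile("hdfs:///data/input.csv") // illustrative path
      .map(_.split(',').map(_.toDouble))

    // Randomly assign each record to a bag, then use ONE wide
    // transformation (partitionBy) so all shuffling happens in one stage.
    val bagged = records
      .map(p => (Random.nextInt(numBags), p))
      .partitionBy(new HashPartitioner(numBags))

    // Each partition (bag) is clustered locally on a worker; no further
    // shuffle is needed, keeping network and I/O cost low.
    val patterns = bagged.mapPartitions { it =>
      summarizeBag(it.map(_._2).toSeq, maxClusters).iterator
    }

    patterns.collect().foreach(p => println(p.mkString(",")))
    spark.stop()
  }
}
```

In this sketch the single shuffle occurs at `partitionBy`; everything afterward is narrow, partition-local work, which is consistent with the abstract's observation that execution time is mostly determined by the parallel tasks run locally on the workers.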