Journal: Journal of Theoretical and Applied Information Technology
Print ISSN: 1992-8645
Online ISSN: 1817-3195
Year: 2020
Volume: 98
Issue: 21
Pages: 3559-3581
Publisher: Journal of Theoretical and Applied Information Technology
Abstract: Data reduction or summarization techniques can be applied to obtain a reduced representation of a data set that is much smaller in volume yet closely maintains the integrity of the original data. For summarizing data, agglomerative hierarchical clustering algorithms have several advantages: they are quite simple, they can produce summaries at a specific level (in the form of cluster patterns) with a simple adjustment, and they can be parallelized. Apache Spark is a data processing framework that can quickly perform processing tasks on very large data sets, and it is currently a standard tool for analyzing big data. When processing data, Spark can distribute processing tasks across multiple machines. Spark can run on top of Hadoop with its distributed file system (HDFS) and resource management (YARN), so that it can access HDFS files and use network resources efficiently. To achieve high performance and scalability of a data analysis technique in the Spark environment, stages, narrow and wide transformations, and I/O and network costs must be considered. In this research, we develop a data summarization technique that employs the agglomerative algorithm on Spark. To avoid biased results, records in the given big data are randomly split into bags of data stored as Resilient Distributed Dataset (RDD) partitions on the worker machines. To reduce network and I/O cost, we employ only one wide transformation, which involves data shuffling across the network. The RDD partitions are then processed locally by worker tasks to produce cluster patterns. Functions with complex computations are designed as parallel Spark tasks. A series of experiments was conducted on a Spark cluster with one driver and ten worker nodes, varying the data size (5 to 20 GB), the number of machine cores used (10 to 50), and the application variables (number of data splits and maximum objects per dendrogram). The results show that the technique is scalable and efficient. The execution time is mostly determined by the parallel tasks run locally on the workers.
Keywords: Big Data Summarization; Parallel Agglomerative; Apache Spark Application Design
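
To make the pipeline described in the abstract concrete, the following is a minimal Scala/Spark sketch of the overall design: one wide transformation that shuffles records into random bags, followed by narrow, local agglomerative clustering inside each RDD partition. This is an illustration under stated assumptions, not the paper's actual code: the input path, the variables `numBags` and `maxObjects`, the single-linkage merge criterion, and the use of centroids as cluster patterns are all hypothetical choices made for the example.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession
import scala.util.Random

// Hypothetical sketch of the summarization pipeline; names and parameter
// values are illustrative, not taken from the paper.
object AgglomerativeSummarySketch {
  type Point = Array[Double]

  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Naive agglomerative (single-linkage) clustering, run locally on one
  // partition: repeatedly merge the two closest clusters until only
  // `maxClusters` remain, then emit each cluster's centroid as a pattern.
  def agglomerate(points: Seq[Point], maxClusters: Int): Seq[Point] = {
    val clusters = points.map(p => List(p)).toBuffer
    while (clusters.size > maxClusters) {
      var (bi, bj, best) = (0, 1, Double.MaxValue)
      for (i <- clusters.indices; j <- i + 1 until clusters.size) {
        val d = (for (a <- clusters(i); b <- clusters(j)) yield dist(a, b)).min
        if (d < best) { bi = i; bj = j; best = d }
      }
      clusters(bi) = clusters(bi) ++ clusters(bj)
      clusters.remove(bj)
    }
    clusters.map { c =>
      val dim = c.head.length
      (0 until dim).map(d => c.map(_(d)).sum / c.size).toArray
    }.toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("AgglomerativeSummary").getOrCreate()
    val sc = spark.sparkContext

    // Assumed input: numeric records, one CSV row per object, on HDFS.
    val raw = sc.textFile("hdfs:///data/records.csv")
      .map(_.split(",").map(_.toDouble))

    val numBags = 50      // illustrative: number of random data splits (bags)
    val maxObjects = 100  // illustrative: maximum objects per dendrogram

    // The single wide transformation: shuffle records into random bags so
    // that each RDD partition holds an unbiased sample of the data.
    val bagged = raw.map(p => (Random.nextInt(numBags), p))
      .partitionBy(new HashPartitioner(numBags))
      .values

    // Narrow, local work: each worker task clusters its own bag and emits
    // its cluster patterns; no further shuffling occurs.
    val patterns = bagged.mapPartitions { it =>
      agglomerate(it.take(maxObjects).toSeq, maxClusters = 10).iterator
    }

    patterns.collect().foreach(p => println(p.mkString(",")))
    spark.stop()
  }
}
```

The design point the sketch tries to capture is the cost argument from the abstract: `partitionBy` is the only stage boundary, so network traffic is paid exactly once, while `mapPartitions` runs the computationally heavy agglomerative merge entirely inside each worker, which is why execution time would be dominated by the local parallel tasks.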