Journal: International Journal of Computer Science and Network Security
Print ISSN: 1738-7906
Year: 2019
Volume: 19
Issue: 5
Pages: 81-86
Publisher: International Journal of Computer Science and Network Security
Abstract: Apache Spark is a widely used big data platform built on the Resilient Distributed Dataset (RDD). This data-structure abstraction handles large datasets by partitioning the data and computing on the partitions in parallel across many nodes. Apache Spark also features fault tolerance and interoperability with the Hadoop ecosystem. However, Spark is written in high-level programming languages that do not support the degree of parallelism offered by native parallel programming models such as the Message Passing Interface (MPI) and OpenACC. Furthermore, Spark's reliance on the Java Virtual Machine (JVM) negatively affects performance. Conversely, parallel tools such as MPI and OpenACC are not designed to handle the tremendous volume of big data while sustaining a high level of parallelism. The distributed architecture of big data platforms also differs from that of High Performance Computing (HPC) clusters, so big data applications running on HPC clusters cannot fully exploit the capabilities HPC affords. In this paper, a hybrid approach is proposed that takes the best of both worlds by combining Spark's big data handling with MPI's fast processing. Moreover, the graphics processing units (GPUs) present in modern systems can further reduce an application's computation time, so the hybrid Spark+MPI approach may be extended with OpenACC to exploit the GPU as well. To evaluate the approach, the PageRank algorithm was implemented with all three methods: Spark, Spark+MPI, and Spark+MPI+OpenACC.
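For orientation, the following is a minimal PySpark sketch of the Spark-only PageRank baseline the abstract refers to. It is illustrative only and not the paper's implementation: the input file name, edge format, iteration count, and damping factor are all assumptions.

```python
# Minimal PageRank sketch in PySpark (assumed baseline, not the paper's code).
# Input format assumption: one edge per line, "src dst".
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PageRankSketch").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("edges.txt")  # hypothetical input file
# Build adjacency lists as an RDD of (page, iterable of neighbors) and cache
# them, since they are reused in every iteration.
links = (lines.map(lambda l: tuple(l.split()))
              .groupByKey()
              .cache())

ranks = links.mapValues(lambda _: 1.0)  # uniform initial rank

for _ in range(10):  # fixed iteration count for simplicity
    # Each page distributes its rank evenly over its outgoing links.
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]])
    # Standard damping factor of 0.85.
    ranks = (contribs.reduceByKey(lambda a, b: a + b)
                     .mapValues(lambda s: 0.15 + 0.85 * s))

for page, rank in ranks.collect():
    print(page, rank)
```

In the hybrid variants described in the abstract, the iterative rank computation above would be handed off to MPI processes (and, with OpenACC, to the GPU), while Spark remains responsible for data partitioning and distribution.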