摘要:With the deep development and application of Internet technology, data need to be processed more and more, when dealing with large amounts of data. Spark is a versatile high-performance and parallel computing framework, which can be applied to data mining. This paper is based on the parallelization of platforms’ K-means algorithm, by building a YARN cluster environment and making experiments to analyze performance of two distributed platforms, and finally find out that the match of Spark and YARN shows more effective on clustering results and consumes less time on the execution of programs, so it’s more suitable for cluster analysis of big data.
其他关键词:Clustering algorithm, distributed platforms, research of performance.