Journal: International Journal of Advanced Research in Computer Engineering & Technology (IJARCET)
Print ISSN: 2278-1323
Year of publication: 2014
Volume: 3
Issue: 1
Pages: 077-080
Publisher: Shri Pannalal Research Institute of Technology
Abstract: Data clustering is a common data analysis technique used in fields such as data mining and image analysis. It partitions data into groups of similar items and relies on a measure of similarity between data items or groups of items. Data clustering can be expensive and time consuming because it is iterative and repeatedly passes over the data. Parallelizing and distributing the computation is therefore attractive, both for the speed-up and for the additional memory it makes available. Traditional distributed programming methods are complicated to use for parallel computing because they face problems such as deadlocks and synchronization. MapReduce is a programming paradigm for solving certain problems on a computing cluster. It is a simple two-step process. In the Map step, a master node divides the problem into a number of parts, which are forwarded to map tasks. Each map task processes its part and outputs the results as key-value pairs. The Reduce step receives the outputs of the map tasks; each reducer receives only a particular part of the map output and processes it. Apache Hadoop provides the MapReduce programming paradigm, which allows parallel and distributed programmers to program data clustering easily. This paper covers various data clustering algorithms that use MapReduce, and their benefits.
Keywords: MapReduce; Data clustering; Hadoop; HDFS; Distance metrics
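
Illustrative sketch: the Map and Reduce steps described in the abstract can be made concrete with one iteration of k-means clustering, simulated in a single Python process. This is a minimal sketch under stated assumptions, not the paper's implementation and not Hadoop's API: the choice of k-means, the Euclidean distance metric, and the names `map_step`, `shuffle`, `reduce_step`, and `kmeans_iteration` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from math import dist  # Euclidean distance (assumed metric), Python 3.8+


def map_step(points, centroids):
    """Map: emit a (nearest centroid index, point) key-value pair per point."""
    for point in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: dist(point, centroids[i]))
        yield nearest, point


def shuffle(pairs):
    """Group mapped values by key so each reducer sees only its own key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_step(key, values):
    """Reduce: average the points assigned to one centroid."""
    n = len(values)
    return key, tuple(sum(coord) / n for coord in zip(*values))


def kmeans_iteration(points, centroids):
    """Run one map/shuffle/reduce round and return the updated centroids."""
    groups = shuffle(map_step(points, centroids))
    updated = list(centroids)  # a centroid with no assigned points is kept
    for key, values in groups.items():
        _, updated[key] = reduce_step(key, values)
    return updated


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
new_centroids = kmeans_iteration(points, centroids)
```

On a real cluster the map tasks would run in parallel over separate parts of the input and the grouping by key would be done by the framework; here `shuffle` stands in for that step. Repeating `kmeans_iteration` until the centroids stop moving gives the iterative behavior the abstract identifies as the expensive part.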