期刊名称:International Journal of Computer Science and Information Technologies
电子版ISSN:0975-9646
出版年度:2015
卷号:6
期号:1
页码:194-197
出版社:TechScience Publications
摘要:Similarity Join is an important operation in data integration and cleansing, record linkage, data deduplication and pattern matching. It finds similar sting pairs from two collections of strings. Number of approaches have been proposed as well as compared for string similarity joins. The rising era of big data demands for scalable algorithms to support large scale string similarity joins. In this paper we study the string similarity joins, their use. Further we look at three different techniques for scalable string similarity join using MapReduce, which are- Parallel set-similarity join, MGJoin and MassJoin. Finally, we try to compare them based on some common characteristics