Journal: International Journal of Engineering and Computer Science
Print ISSN: 2319-7242
Publication year: 2013
Volume: 2
Issue: 11
Pages: 3097-3100
Publisher: IJECS
Abstract: One of the biggest challenges on the web today is dealing with the “Big data” problem. Finding documents that are near duplicates of each other is a related challenge, one that is itself brought on by Big data. This paper focuses on finding near-duplicate documents using a technique called shingling. It also presents the different types of shingling that can be used. Further, a measure called the Jaccard coefficient is discussed, which can be used to judge the degree of similarity between documents.
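
The two ideas named in the abstract can be illustrated with a minimal sketch. It assumes word-level shingles (every run of k consecutive words in a document) and the standard definition of the Jaccard coefficient, J(A, B) = |A ∩ B| / |A ∪ B|. The function names word_shingles and jaccard, the whitespace tokenization, the lowercasing, and the choice k = 4 are illustrative assumptions, not details taken from the paper.

```python
def word_shingles(text, k=4):
    """Return the set of k-word shingles of `text`.

    Assumption: tokens are lowercased, whitespace-separated words.
    A text shorter than k words yields a single shingle of all its words.
    """
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two shingle sets."""
    if not a and not b:
        # Convention chosen here: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(a | b)


s1 = word_shingles("a rose is a rose is a rose")
s2 = word_shingles("a rose is a flower which is a rose")
print(len(s1), len(s2), jaccard(s1, s2))  # 3 6 0.125
```

The first text has only three distinct 4-word shingles because its phrases repeat, the second has six, and the two share exactly one, so the coefficient is 1/8. A value near 1 would suggest near-duplicate documents; where to set the threshold is a separate choice that this sketch does not address.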