期刊名称:International Journal of Multimedia and Ubiquitous Engineering
印刷版ISSN:1975-0080
出版年度:2016
卷号:11
期号:7
页码:159-168
DOI:10.14257/ijmue.2016.11.7.17
出版社:SERSC
摘要:With development, access of Internet has allowed storage of huge documents containing information. Identifying near duplicate documents among those documents is a major problem in information retrieval due to their dimensionality which leads to high cost time. We propose an algorithm based on tf-idf method with importance and discriminative power of a term within a single document to speed up search process for detecting how documents are similar in collection. Using only 26.6% of original document size, our method performs well on efficiency and memory usage as we have reduced compare to the original one and that leads to a decreased time in searching process for similar documents in a collection.