Journal: International Journal of Computer Science and Network
Print ISSN: 2277-5420
Publication year: 2016
Volume: 5
Issue: 3
Pages: 523-525
Publisher: IJCSN publisher
Abstract: The deduplication process is typically tuned using a set of manually labeled pairs. In very large datasets, however, producing manually labeled pairs is tedious. This article therefore proposes a two-stage sampling selection procedure, T3S, that reduces the set of pairs needed to tune the deduplication process. In the first stage, a balanced subset of the data is produced for labeling. In the second stage, redundant and duplicated data are removed, so that only deduplicated data are produced as output.
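
The two stages described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the abstract does not specify how the balanced subset is drawn or how duplicates are detected, so the similarity measure (Python's `difflib.SequenceMatcher`), the bucketing by similarity level, the fixed `threshold`, and all function and parameter names are assumptions made for this sketch.

```python
import random
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def stage_one_balanced_sample(records, per_bucket=2, n_buckets=5, seed=0):
    """Stage 1 (assumed form): pick candidate pairs for manual labeling,
    drawing at most `per_bucket` pairs from each similarity level so that
    no level dominates the sample."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for i, j in combinations(range(len(records)), 2):
        s = similarity(records[i], records[j])
        # Map the similarity to a bucket index; s == 1.0 goes in the last one.
        b = min(int(s * n_buckets), n_buckets - 1)
        buckets[b].append((i, j, s))
    sample = []
    for b in sorted(buckets):
        pairs = buckets[b]
        sample.extend(rng.sample(pairs, min(per_bucket, len(pairs))))
    return sample


def stage_two_remove_duplicates(records, threshold=0.85):
    """Stage 2 (assumed form): keep a record only if it is not a
    near-duplicate of a record that was already kept. The fixed threshold
    stands in for whatever boundary the labeled pairs would determine."""
    kept = []
    for r in records:
        if all(similarity(r, k) < threshold for k in kept):
            kept.append(r)
    return kept


if __name__ == "__main__":
    data = ["John Smith", "john smith", "Jon Smith", "Alice Brown", "Bob Stone"]
    for i, j, s in stage_one_balanced_sample(data):
        print(f"label this pair: {data[i]!r} vs {data[j]!r} (similarity {s:.2f})")
    print(stage_two_remove_duplicates(data))
```

In this sketch the pairs returned by the first function are the ones a person would label, and the labels would then inform the choice of `threshold` used by the second function. The pairwise comparison is quadratic in the number of records, so a real system for very large datasets would need a blocking or indexing step that is not shown here.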