Journal: International Journal of Computer Science & Technology
Print ISSN: 2229-4333
Online ISSN: 0976-8491
Publication Year: 2012
Volume: 3
Issue: 4
Pages: 618-624
Language: English
Publisher: Ayushmaan Technologies
Abstract: In a Web database scenario, most state-of-the-art record matching methods, such as SVM, OSVM, PEBL, and Christen, are efficient in IR systems, but they require large training datasets for pre-learning. To address this problem, Unsupervised Duplicate Detection (UDD), a query-dependent record matching method, was developed earlier. For a given query, it can effectively identify duplicates in the query results of multiple Web databases. Non-duplicate records from the same source can be used as training examples. Starting from this non-duplicate set, UDD employs two cooperating classifiers, a weighted component similarity summing (WCSS) classifier and an SVM classifier, which iteratively identify duplicates in the query results from the various Web databases. For string similarity calculation, UDD can use any similarity measure; we propose a faster and more accurate string similarity calculation (SimString, via its SWIG bindings) to optimize the performance of UDD. Record matching is an essential step in duplicate detection, as it identifies records that represent the same real-world entity. Supervised record matching methods require users to provide training data and therefore cannot be applied to Web databases whose query results are generated on the fly. To overcome this problem, a new record matching method named Unsupervised Duplicate Elimination (UDE) is proposed for identifying and eliminating duplicates among records in dynamic query results. The idea of this paper is to adjust the weights of record fields when calculating similarities among records. The two classifiers, the WCSS classifier and the SVM classifier, are employed iteratively in UDE: the first classifier uses the weight set to match records from different data sources; then, with the matched records as the positive set and non-duplicate records as the negative set, the second classifier identifies new duplicates. In addition, a new methodology is presented for automatically interpreting and clustering knowledge documents using an ontology schema, and a fuzzy logic control approach is used to match suitable document cluster(s) to given patents based on their derived ontological semantic networks. Thus, this paper exploits the similarity among records from Web databases and solves the online duplicate detection problem.
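
As an illustration of the proposed string similarity component, the following is a minimal sketch that queries a SimString database through its SWIG-generated Python binding; the database file name, the sample strings, the cosine measure, and the 0.6 threshold are illustrative assumptions rather than settings taken from the paper.

    import simstring

    # Build an approximate string matching database from record field values.
    writer = simstring.writer('records.db')
    writer.insert('International Journal of Computer Science and Technology')
    writer.insert('Journal of Computer Science and Engineering')
    writer.close()

    # Retrieve all stored strings whose character n-gram cosine similarity
    # to the query string meets the chosen threshold.
    reader = simstring.reader('records.db')
    reader.measure = simstring.cosine
    reader.threshold = 0.6
    print(reader.retrieve('Intl. Journal of Computer Science & Technology'))

Because SimString retrieves approximately matching strings directly from an n-gram index, a lookup of this kind can stand in for repeated pairwise edit-distance computations when a record field must be compared against many candidate values.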
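
The iterative cooperation between the WCSS classifier and the SVM classifier described in the abstract can be sketched roughly as follows; the per-field weights, the 0.85 seeding threshold, the use of fixed weights across iterations (the paper's method re-adjusts them), and the use of scikit-learn's LinearSVC are assumptions made for illustration, not the exact procedure of UDD/UDE.

    import numpy as np
    from sklearn.svm import LinearSVC

    def wcss_score(sim_vector, weights):
        # Weighted Component Similarity Summing: weighted sum of the
        # per-field similarities of one candidate record pair.
        return float(np.dot(weights, sim_vector))

    def iterative_duplicate_detection(pairs, weights, seed_threshold=0.85, max_iter=5):
        # pairs: per-field similarity vectors of candidate record pairs.
        # Returns the indices of pairs judged to be duplicates.
        sims = np.asarray(pairs, dtype=float)
        # Classifier 1 (WCSS) seeds the positive (duplicate) set.
        duplicates = {i for i, v in enumerate(sims)
                      if wcss_score(v, weights) >= seed_threshold}
        non_duplicates = set(range(len(sims))) - duplicates
        for _ in range(max_iter):
            if not duplicates or not non_duplicates:
                break
            # Classifier 2 (SVM) is trained on the current positive and negative sets.
            X = np.vstack([sims[sorted(duplicates)], sims[sorted(non_duplicates)]])
            y = np.array([1] * len(duplicates) + [0] * len(non_duplicates))
            svm = LinearSVC().fit(X, y)
            # Move newly predicted duplicates from the negative to the positive set.
            new = {i for i in non_duplicates
                   if svm.predict(sims[i].reshape(1, -1))[0] == 1}
            if not new:
                break
            duplicates |= new
            non_duplicates -= new
        return duplicates

    # Hypothetical usage with three candidate pairs over three record fields.
    weights = np.array([0.5, 0.3, 0.2])
    candidate_pairs = [[0.95, 0.9, 0.9], [0.2, 0.1, 0.3], [0.8, 0.7, 0.6]]
    print(iterative_duplicate_detection(candidate_pairs, weights))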