Journal title: Journal of Theoretical and Applied Information Technology
Print ISSN: 1992-8645
Electronic ISSN: 1817-3195
Publication year: 2016
Volume: 86
Issue: 1
Publisher: Journal of Theoretical and Applied
Abstract: Traditional methods of document comparison rely on so-called "surface" similarity: a model of similarity based on the descriptive properties of objects that ignores the relationships between those properties. We have proposed a new structural measure, based on sub-graph isomorphism, that takes into account both the distribution (order, position, etc.) of the components of the compared documents and the relationships between these components, and thus preserves more of their meaning. Our measure reflects both the contextual and structural aspects of the compared documents. In this work, we present our similarity measure in detail and study the impact of the similarity threshold (a parameter fixed in advance) on the generated clusters. We evaluate our approach on a corpus of multimedia documents randomly extracted from the INEX 2007 corpus and on a corpus of descriptive book records in XML format from the library of the University of Toulouse 1 Capitole.