摘要:We study a document retrieval problem in the new framework where D text documents are organized in a category tree with a pre-defined number h of categories. This situation occurs e.g. with taxomonic trees in biology or subject classification systems for scientific literature. Given a string pattern p and a category (level in the category tree), we wish to efficiently retrieve the t categorical units containing this pattern and belonging to the category. We propose several efficient solutions for this problem. One of them uses n(logÏf(1+o(1))+log D+O(h)) + O(Î") bits of space and O( p +t) query time, where n is the total length of the documents, Ïf the size of the alphabet used in the documents and Î" is the total number of nodes in the category tree. Another solution uses n(logÏf(1+o(1))+O(log D))+O(Î")+O(Dlog n) bits of space and O( p +tlog D) query time. We finally propose other solutions which are more space-efficient at the expense of a slight increase in query time.
关键词:pattern matching; document retrieval; category tree; space-efficient data structures