文章基本信息

标题：A framework for multilevel indexing in search engines.
作者：Gupta, Parul ; Gupta, A.K.
期刊名称：International Journal of Applied Engineering Research
印刷版ISSN：0973-4562
出版年度：2009
期号：August
语种：English
出版社：Research India Publications
摘要：
Search Engines are used to find the data in response to the user queries. The indexing phase of these search engines can be viewed as a web content mining process. Starting from a collection of unstructured documents, the indexer extracts a large amount of information like the list of documents, which contain a given term. It also keeps account of number of all the occurrences of each term within every document. This information is maintained in a global index, which is usually represented using an inverted file (IF). The index consists of an array of the posting lists where each posting list is associated with a term and contains the term as well as the identifiers of the documents containing the term. For example, consider the posting list ((job; 5) 1, 4, 14, 20, 27) indicating that the term job appears in five documents having the document identifiers 1, 4,14,20,27 respectively. The above posting list can be written as ((job; 5) 1, 3, 10, 6, 7) where the items of the list represent the difference between the successive document identifiers. The following figure shows the structure of index.
```
 Figure 1: Structure of an Index in search engines  Term   No. of docs    Doc id of docs in   Doc ids stored        in which       which term          with difference        term appears   appears             coding technique 
```
关键词：Indexing;Indexing (Content analysis);Information management;Information services;Internet/Web search services;Online information services;Online services;Search engines

A framework for multilevel indexing in search engines.

Gupta, Parul ; Gupta, A.K.

Introduction

Search Engines are used to find the data in response to the user queries. The indexing phase of these search engines can be viewed as a web content mining process. Starting from a collection of unstructured documents, the indexer extracts a large amount of information like the list of documents, which contain a given term. It also keeps account of number of all the occurrences of each term within every document. This information is maintained in a global index, which is usually represented using an inverted file (IF). The index consists of an array of the posting lists where each posting list is associated with a term and contains the term as well as the identifiers of the documents containing the term. For example, consider the posting list ((job; 5) 1, 4, 14, 20, 27) indicating that the term job appears in five documents having the document identifiers 1, 4,14,20,27 respectively. The above posting list can be written as ((job; 5) 1, 3, 10, 6, 7) where the items of the list represent the difference between the successive document identifiers. The following figure shows the structure of index.

Figure 1: Structure of an Index in search engines

Term   No. of docs    Doc id of docs in   Doc ids stored
       in which       which term          with difference
       term appears   appears             coding technique

Clustering is a widely adopted technique aimed at dividing a collection of data into disjoint groups of homogenous elements. A cluster is therefore a collection of objects, which are similar between them and are dissimilar to the objects belonging to other clusters. Clustering is used to organize data for efficient retrieval and to perform efficient search of elements in a data set. Document clustering has been widely investigated as a technique to improve effectiveness and efficiency in information retrieval. Clustering algorithms attempt to group together the documents based on their similarities. Thus documents relating to a certain topic will hopefully be placed in a single cluster. So if the documents are clustered, comparisons of the documents against the user's query are only needed with certain clusters and not with the whole collection of documents.

Related Work

In this paper, a review of previous work on index organization is given. In this field of index organization and maintenance, many algorithms and techniques have already been proposed but they seem to be less efficient in efficiently accessing the index.

In [2] the authors introduce a novel pipelining technique for structuring the index- building system that substantially reduces the index constructing time. They demonstrate that for large collections, the pipelining technique can speed up index construction by several hours. They propose and compare different schemes for storing and managing inverted files using an embedded database system. They show that an intelligent scheme for packing inverted lists in the storage structures of the database can provide performance and storage efficiency comparable to tailored inverted file implementations.

Another work proposed was the reordering algorithm which partitions the set of documents into k ordered clusters on the basis of similarity measure. According to this algorithm, the biggest document is selected as centroid of the first cluster and n/k1 most similar documents are assigned to this cluster. Then the biggest document is selected and the same process repeats. The process keeps on repeating until all the k clusters are formed and each cluster gets completed with n/k documents. This algorithm is not effective in clustering the most similar documents. The biggest document may not have similarity with any of the documents but still it is taken as the representative of the cluster.

Another proposed work was the threshold based clustering algorithm [8] in which the number of clusters is unknown. However, two documents are classified to the same cluster if the similarity between them is below a specified threshold. This threshold is defined by the user before the algorithm starts. It is easy to see that if the threshold is small, all the elements will get assigned to different clusters. If the threshold is large, the elements may get assigned to just one cluster. Thus the algorithm is sensitive to specification of threshold.

The given paper discusses the multilevel index structure in which the clustering of data is done at multiple levels so as to direct the search to a specific path from high level index to low level index.

Proposed Work

The indexing phase of the search engine collects information from the web documents gathered in the web repository. This global large sized index is stored in the form of inverted files. The proposed work explains the concept of multilevel indexes which are created at multiple levels of hierarchy of documents clustering. At first level of document hierarchy, the similar documents are clustered into clusters on the basis of documents similarity. Then at the next level, the similar clusters are clustered into mega clusters and at the last level of hierarchy, the similar mega clusters are clubbed together to form super clusters.

Architecture of Multilevel Indexes

At the lowest level that is at the document level, the index [13] consists of the union of the terms in all the documents in different clusters. This index is a descriptive index whose structure is similar as that of the global large size index that exists in the current architecture. This document level index is stored as array of posting lists containing the term as well the document identifiers of the documents containing the given term. The second level index that exists at the mega cluster level consists of the terms that come after the intersection of the words of the similar clusters. The index at this level is a brief index which contains only the terms with no description at all. The highest level index which exists at the super cluster level consists of the intersection of the terms of the similar mega clusters. This highest level index is also a small size index containing only the terms. The search proceeds from the highest level index to the document level index due to which there is reduction in the search space. The proposed architecture is shown in figure 2 which shows that the documents from the repository are initially preprocessed and then the terms are extracted on the basis of which the index is formed. The preprocessing involves stemming, removal of stop words etc. The index at the document level is formed first, then comes the index at the mega cluster level and finally the index at the super cluster level is formed. The searcher after getting the query from the user searches the index at the highest level, then moves to the next level index and finally to the lowest level index by matching the keywords and hence get the document identifiers of the relevant documents which are stored in the index that exists in the form of the posting lists.

[FIGURE 2 OMITTED]

Structure of Multilevel Indexes

The structure of the index at the lowest level is in the form of posting lists as shown in figure:

Figure 3: Structure of Cluster level index.

                No. of docs in       Docid of docs in
Term            which term appears   which term appears

Indexing        50                   12,34,45,49 ...
Search engine   59                   15,20,34,55 ...
Pipeline        15                   3,6,9,12 ...

However the indexes at the higher level consist of only the document terms with no extra information stored in them. The structure of the higher level indexes is shown in fig.

Figure 4: Structure of higher level indexes.

Term
Indexing
Crawling
Search engine
Pipeline
Parallel
Distributed

Algorithm

Algorithm for how search proceeds in case of multilevel indexing:

The user query is accepted by the searcher module.

The query is matched against the highest level indexes which exist at the super cluster level.

Now the highest level index which contains the terms in the query is considered and rest of the indexes are discarded.

Now the indexes that exist at the mega clusters levels of this super cluster are matched against the user's query.

Now the mega cluster index that has the best possible match with query is considered and the remaining indexes are discarded.

Then the search proceeds to the lowest level of indexes i.e. indexes that exist at the cluster and the user's query is matched with them.

Now from these indexes, the searcher extracts the document identifiers of the documents that contain the best possible with the user's query.

The extracted web pages are provided to the user in response to his query.

Example Illustrating the Formation of Hierarchy of the Web Documents

[FIGURE 5 OMITTED]

The above figure shows the hierarchy of the documents created. The index formation starts at the lowest level where each cluster has a separate index consisting of the terms contained in the similar documents contained within that cluster. This is a descriptive index which is stored in the form of inverted files. The index consists of an array of posting lists as in case of global large size index. The index at the mega cluster level consists of the terms that come after the intersection of the terms in the clusters that appear in that particular mega cluster. This index is of small size and contains only the similar terms of the clusters within the same mega cluster. The index does not contain any additional information. The highest level index is formed last at the super cluster level which consists of the terms that are obtained after the intersection of the terms in the similar mega cluster within that super cluster and the search starts with this index and proceeds downwards to the lowest level index.

Conclusion

The given paper proposes a multilevel indexing structure for a search engine index so as to speed up the search process by directing the search to a specific path from the higher level index to the lower level indexes. The proposed architecture and the structure of the indexes are the answer to one of the biggest problems for search optimization in search engines. The proposed work aims at reducing the search space and search time by maintaining the index at various levels. This paper shows an ongoing work and further improvements will be made to produce the best quality index.

References

[1] Sergey Melnik, Sriram Raghavan, Beverly Yang, and Hector Garcia-Molina. Building a Distributed Full-Text Index for the Web. In World Wide Web, pages 396-406, 2001.

[2] Fabrizio Silvestri, Raffaele Perego and Salvatore Orlando. Assigning Document Identifiers to Enhance Compressibility of Web Search Engines Indexes. In the proceedings of SAC, 2004.

[3] S.Brin and L.Page.The Anatomy of a Large-Scale Hypertextual Web Search Engine. In the 7th International WWW Conference, (WWW7), pp. 107-117, Brisbane, Australia.

[4] Van Rijsbergen C.J. Information Retrieval. Butterworth 1979

[5] Soumen Chakrabarti. Mining the Web - Discovering Knowledge from Hypertext Data. Morgan Kaufmann Publications, 2003.

[6] Oren Zamir and Oren Etzioni. Web Document Clustering: A feasibility demonstration. In the proceedings of SIGIR, 1998.

[7] A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988 [8] P. Elias. Universal codeword sets and representation of the integers. IEEE Trans. on Information Theory, 21(2):194-203, Feb 1975.

[9] Sanjiv K. Bhatia. Adaptive K-Means Clustering. American Association for Artificial Intelligence, 2004.

[10] Krishna Kummarmuru, Rohit Lotlikar, Shourya Roy. Hierarchical Monothetic Document Clustering Algorithm for Summarization and Browsing Search Results. WWW 2004, New York.

[11] Bhatia, S.K. and Deougan , J.S. 1998. Conceptual Clustering in Information Retrieval. IEEE Transactions on Systems, Man and Cybernetics.

[12] Dan Bladford and Guy Blelloch. Index compression through document reordering. In IEEE, editor, Proc. Of DCC'02. IEEE, 2002.

[13] Parul Gupta, Dr. A.K. Sharma, Hierarchical Clustering based Indexing in Search Engines, communicated to International Journal of Information and Communication Technology.

(1) Parul Gupta and (2) Dr. A.K. Gupta

Y.M.C.A. Institute of Engineering, Faridabad, Haryana, India

(1) parulgupta_gem@yahoo.com

(2) ashokkkale2@rediffmail.com