A framework for multilevel indexing in search engines.
Gupta, Parul ; Gupta, A.K.
Introduction
Search Engines are used to find the data in response to the user
queries. The indexing phase of these search engines can be viewed as a
web content mining process. Starting from a collection of unstructured
documents, the indexer extracts a large amount of information like the
list of documents, which contain a given term. It also keeps account of
number of all the occurrences of each term within every document. This
information is maintained in a global index, which is usually
represented using an inverted file (IF). The index consists of an array
of the posting lists where each posting list is associated with a term
and contains the term as well as the identifiers of the documents
containing the term. For example, consider the posting list ((job; 5) 1,
4, 14, 20, 27) indicating that the term job appears in five documents
having the document identifiers 1, 4,14,20,27 respectively. The above
posting list can be written as ((job; 5) 1, 3, 10, 6, 7) where the items
of the list represent the difference between the successive document
identifiers. The following figure shows the structure of index.
Figure 1: Structure of an Index in search engines
Term No. of docs Doc id of docs in Doc ids stored
in which which term with difference
term appears appears coding technique
Clustering is a widely adopted technique aimed at dividing a
collection of data into disjoint groups of homogenous elements. A
cluster is therefore a collection of objects, which are similar between
them and are dissimilar to the objects belonging to other clusters.
Clustering is used to organize data for efficient retrieval and to
perform efficient search of elements in a data set. Document clustering
has been widely investigated as a technique to improve effectiveness and
efficiency in information retrieval. Clustering algorithms attempt to
group together the documents based on their similarities. Thus documents
relating to a certain topic will hopefully be placed in a single
cluster. So if the documents are clustered, comparisons of the documents
against the user's query are only needed with certain clusters and
not with the whole collection of documents.
Related Work
In this paper, a review of previous work on index organization is
given. In this field of index organization and maintenance, many
algorithms and techniques have already been proposed but they seem to be
less efficient in efficiently accessing the index.
In [2] the authors introduce a novel pipelining technique for
structuring the index- building system that substantially reduces the
index constructing time. They demonstrate that for large collections,
the pipelining technique can speed up index construction by several
hours. They propose and compare different schemes for storing and
managing inverted files using an embedded database system. They show
that an intelligent scheme for packing inverted lists in the storage
structures of the database can provide performance and storage
efficiency comparable to tailored inverted file implementations.
Another work proposed was the reordering algorithm which partitions
the set of documents into k ordered clusters on the basis of similarity
measure. According to this algorithm, the biggest document is selected
as centroid of the first cluster and n/k1 most similar documents are
assigned to this cluster. Then the biggest document is selected and the
same process repeats. The process keeps on repeating until all the k
clusters are formed and each cluster gets completed with n/k documents.
This algorithm is not effective in clustering the most similar
documents. The biggest document may not have similarity with any of the
documents but still it is taken as the representative of the cluster.
Another proposed work was the threshold based clustering algorithm
[8] in which the number of clusters is unknown. However, two documents
are classified to the same cluster if the similarity between them is
below a specified threshold. This threshold is defined by the user
before the algorithm starts. It is easy to see that if the threshold is
small, all the elements will get assigned to different clusters. If the
threshold is large, the elements may get assigned to just one cluster.
Thus the algorithm is sensitive to specification of threshold.
The given paper discusses the multilevel index structure in which
the clustering of data is done at multiple levels so as to direct the
search to a specific path from high level index to low level index.
Proposed Work
The indexing phase of the search engine collects information from
the web documents gathered in the web repository. This global large
sized index is stored in the form of inverted files. The proposed work
explains the concept of multilevel indexes which are created at multiple
levels of hierarchy of documents clustering. At first level of document
hierarchy, the similar documents are clustered into clusters on the
basis of documents similarity. Then at the next level, the similar
clusters are clustered into mega clusters and at the last level of
hierarchy, the similar mega clusters are clubbed together to form super
clusters.
Architecture of Multilevel Indexes
At the lowest level that is at the document level, the index [13]
consists of the union of the terms in all the documents in different
clusters. This index is a descriptive index whose structure is similar
as that of the global large size index that exists in the current
architecture. This document level index is stored as array of posting
lists containing the term as well the document identifiers of the
documents containing the given term. The second level index that exists
at the mega cluster level consists of the terms that come after the
intersection of the words of the similar clusters. The index at this
level is a brief index which contains only the terms with no description
at all. The highest level index which exists at the super cluster level
consists of the intersection of the terms of the similar mega clusters.
This highest level index is also a small size index containing only the
terms. The search proceeds from the highest level index to the document
level index due to which there is reduction in the search space. The
proposed architecture is shown in figure 2 which shows that the
documents from the repository are initially preprocessed and then the
terms are extracted on the basis of which the index is formed. The
preprocessing involves stemming, removal of stop words etc. The index at
the document level is formed first, then comes the index at the mega
cluster level and finally the index at the super cluster level is
formed. The searcher after getting the query from the user searches the
index at the highest level, then moves to the next level index and
finally to the lowest level index by matching the keywords and hence get
the document identifiers of the relevant documents which are stored in
the index that exists in the form of the posting lists.
[FIGURE 2 OMITTED]
Structure of Multilevel Indexes
The structure of the index at the lowest level is in the form of
posting lists as shown in figure:
Figure 3: Structure of Cluster level index.
No. of docs in Docid of docs in
Term which term appears which term appears
Indexing 50 12,34,45,49 ...
Search engine 59 15,20,34,55 ...
Pipeline 15 3,6,9,12 ...
However the indexes at the higher level consist of only the
document terms with no extra information stored in them. The structure
of the higher level indexes is shown in fig.
Figure 4: Structure of higher level indexes.
Term
Indexing
Crawling
Search engine
Pipeline
Parallel
Distributed
Algorithm
Algorithm for how search proceeds in case of multilevel indexing:
The user query is accepted by the searcher module.
The query is matched against the highest level indexes which exist
at the super cluster level.
Now the highest level index which contains the terms in the query
is considered and rest of the indexes are discarded.
Now the indexes that exist at the mega clusters levels of this
super cluster are matched against the user's query.
Now the mega cluster index that has the best possible match with
query is considered and the remaining indexes are discarded.
Then the search proceeds to the lowest level of indexes i.e.
indexes that exist at the cluster and the user's query is matched
with them.
Now from these indexes, the searcher extracts the document
identifiers of the documents that contain the best possible with the
user's query.
The extracted web pages are provided to the user in response to his
query.
Example Illustrating the Formation of Hierarchy of the Web
Documents
[FIGURE 5 OMITTED]
The above figure shows the hierarchy of the documents created. The
index formation starts at the lowest level where each cluster has a
separate index consisting of the terms contained in the similar
documents contained within that cluster. This is a descriptive index
which is stored in the form of inverted files. The index consists of an
array of posting lists as in case of global large size index. The index
at the mega cluster level consists of the terms that come after the
intersection of the terms in the clusters that appear in that particular
mega cluster. This index is of small size and contains only the similar
terms of the clusters within the same mega cluster. The index does not
contain any additional information. The highest level index is formed
last at the super cluster level which consists of the terms that are
obtained after the intersection of the terms in the similar mega cluster
within that super cluster and the search starts with this index and
proceeds downwards to the lowest level index.
Conclusion
The given paper proposes a multilevel indexing structure for a
search engine index so as to speed up the search process by directing
the search to a specific path from the higher level index to the lower
level indexes. The proposed architecture and the structure of the
indexes are the answer to one of the biggest problems for search
optimization in search engines. The proposed work aims at reducing the
search space and search time by maintaining the index at various levels.
This paper shows an ongoing work and further improvements will be made
to produce the best quality index.
References
[1] Sergey Melnik, Sriram Raghavan, Beverly Yang, and Hector
Garcia-Molina. Building a Distributed Full-Text Index for the Web. In
World Wide Web, pages 396-406, 2001.
[2] Fabrizio Silvestri, Raffaele Perego and Salvatore Orlando.
Assigning Document Identifiers to Enhance Compressibility of Web Search
Engines Indexes. In the proceedings of SAC, 2004.
[3] S.Brin and L.Page.The Anatomy of a Large-Scale Hypertextual Web
Search Engine. In the 7th International WWW Conference, (WWW7), pp.
107-117, Brisbane, Australia.
[4] Van Rijsbergen C.J. Information Retrieval. Butterworth 1979
[5] Soumen Chakrabarti. Mining the Web - Discovering Knowledge from
Hypertext Data. Morgan Kaufmann Publications, 2003.
[6] Oren Zamir and Oren Etzioni. Web Document Clustering: A
feasibility demonstration. In the proceedings of SIGIR, 1998.
[7] A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice
Hall, 1988 [8] P. Elias. Universal codeword sets and representation of
the integers. IEEE Trans. on Information Theory, 21(2):194-203, Feb
1975.
[9] Sanjiv K. Bhatia. Adaptive K-Means Clustering. American
Association for Artificial Intelligence, 2004.
[10] Krishna Kummarmuru, Rohit Lotlikar, Shourya Roy. Hierarchical
Monothetic Document Clustering Algorithm for Summarization and Browsing
Search Results. WWW 2004, New York.
[11] Bhatia, S.K. and Deougan , J.S. 1998. Conceptual Clustering in
Information Retrieval. IEEE Transactions on Systems, Man and
Cybernetics.
[12] Dan Bladford and Guy Blelloch. Index compression through
document reordering. In IEEE, editor, Proc. Of DCC'02. IEEE, 2002.
[13] Parul Gupta, Dr. A.K. Sharma, Hierarchical Clustering based
Indexing in Search Engines, communicated to International Journal of
Information and Communication Technology.
(1) Parul Gupta and (2) Dr. A.K. Gupta
Y.M.C.A. Institute of Engineering, Faridabad, Haryana, India
(1) parulgupta_gem@yahoo.com
(2) ashokkkale2@rediffmail.com