Journal: International Journal of Engineering and Computer Science
Print ISSN: 2319-7242
Year of publication: 2015
Volume: 4
Issue: 5
Pages: 11956-11961
Publisher: IJECS
Abstract: The Internet is used more extensively than ever, and data of all types is readily available on it. A user submits a query to a search engine, and thousands of related documents are returned as results. Web documents contain different types of data such as text, images, and videos, so they are unstructured and unorganized. It becomes difficult for users to find a specific document among thousands. The solution to this problem is clustering the web documents. Clustering groups together documents that share a similar context with respect to the user query, so similar documents are assembled in one cluster. Clustering thus reduces the user's task of discriminating among the thousands of documents returned for a query. Ranking can then be applied to show the most relevant documents at the top: the documents in a cluster are ranked and arranged according to their similarity. Different functions can be used to calculate the similarity measure among documents. We combine these two concepts and propose a tf-idf based apriori scheme for web document clustering and ranking. In this scheme, clustering is first applied to the documents using a modified tf-idf based apriori algorithm. Ranking is then performed to place the most pertinent documents at the top with regard to the user query. We use online web pages returned as results for a query as the dataset for our experimental work. This approach gives a good F-measure value of 81%. The proposed method is found to be superior to some traditional clustering approaches.
Keywords: apriori algorithm; web documents; search results; term frequency; inverse document frequency
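
Illustrative note: the ranking step described in the abstract, ordering documents by their similarity to the user query, can be sketched with plain tf-idf weights and cosine similarity. This is a minimal sketch under stated assumptions, not the authors' method: the abstract does not specify the modified tf-idf based apriori algorithm, so the clustering step is not reproduced here. The whitespace tokenizer, the smoothed idf formula, the choice of cosine similarity, and the function names `tokenize`, `tf_idf_vectors`, `cosine`, and `rank` are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the paper's
    # preprocessing is not described in the abstract.
    return text.lower().split()


def tf_idf_vectors(docs):
    """Return one sparse tf-idf vector (dict) per document, plus the idf table."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumption: idf = ln(N / df) + 1, so terms found in every document
    # keep a small non-zero weight.
    idf = {term: math.log(n_docs / count) + 1.0 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return document indices ordered from most to least similar to the query."""
    vectors, idf = tf_idf_vectors(docs)
    q_counts = Counter(tokenize(query))
    q_total = sum(q_counts.values()) or 1
    # Query terms that appear in no document carry no weight.
    q_vec = {t: (c / q_total) * idf[t] for t, c in q_counts.items() if t in idf}
    scored = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    # Sort by descending score; ties keep the original document order.
    return [i for _, i in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


if __name__ == "__main__":
    docs = [
        "apriori algorithm frequent itemsets",
        "web search engine results ranking",
        "cooking pasta recipes",
    ]
    print(rank("search results", docs))
```

In the scheme the abstract outlines, a ranking of this kind would be applied to the documents within a cluster after clustering, rather than to the full result set as in this sketch.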