Journal: International Journal of Future Generation Communication and Networking
Print ISSN: 2233-7857
Publication year: 2014
Volume: 7
Issue: 6
Pages: 13-20
DOI: 10.14257/ijfgcn.2014.7.6.02
Publisher: SERSC
Abstract: With the rapid development of network and information technology, a huge amount of data is now available on the internet. A major problem faced by many researchers, however, is how to effectively filter information on a particular subject or field out of these data. In this paper, we build a focused crawler based on the vector space model (VSM) and TF-IDF text correlation analysis. We take the seed URL as the collection entry point and fetch web pages from the internet. We then analyze the page information using techniques such as web content extraction and page link analysis to obtain the main content of each page. Using the correlation analysis method based on VSM and TF-IDF, we calculate the correlation between pages and the predefined topics, so that the crawl stays focused on the relevant areas of the web.
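
The relevance step described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names (`tokenize`, `tf_idf_vectors`, `cosine`, `relevance`), the naive whitespace tokenizer, and the smoothed IDF formula are all assumptions chosen for illustration. Each text is mapped to a TF-IDF weighted term vector, and a page is scored against the topic by cosine similarity in that vector space.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase/whitespace tokenizer that keeps only
    # purely alphabetic tokens. A real crawler would first extract the
    # main content from the HTML.
    return [t for t in text.lower().split() if t.isalpha()]


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document.

    Weight = relative term frequency * smoothed IDF, where the smoothed
    IDF is log((1 + n) / (1 + df)) + 1 (an assumed, commonly used variant).
    """
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty docs
        vectors.append({
            term: (count / total) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def relevance(topic, pages):
    """Score each page against the topic description; higher is more relevant."""
    vectors = tf_idf_vectors([topic] + pages)
    return [cosine(vectors[0], v) for v in vectors[1:]]


if __name__ == "__main__":
    topic = "focused crawler topic relevance web pages"
    pages = [
        "a focused crawler fetches web pages relevant to a topic",
        "recipe for chocolate cake with butter and sugar",
    ]
    for page, score in zip(pages, relevance(topic, pages)):
        print(f"{score:.3f}  {page}")
```

In a focused crawler, a score like this would typically be compared against a threshold (a hypothetical tuning parameter here) to decide whether a page is kept and whether its outgoing links are followed.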