Journal: International Journal of Hybrid Information Technology
Print ISSN: 1738-9968
Publication year: 2013
Volume: 6
Issue: 6
Publisher: SERSC
Abstract: The internet has become an indispensable part of people's lives. For enterprises, the internet holds a mass of valuable information, including not only competitor information but also customers' evaluations of products. This information is an important source of business intelligence. This paper aims to build a focused crawler that filters business intelligence from the vast amount of information on the internet. The crawler takes a certain number of web pages as seeds, extracts the URLs in these pages, and parses the main text of the page at each URL. It then calculates the relevancy between each main text and the crawler's topic based on the VSM (vector space model) and TF-IDF (Term Frequency-Inverse Document Frequency). If a web page is relevant, it is saved; otherwise, it is discarded. Finally, an experiment is conducted to test the crawler's performance. The results of this experiment show that the recall rate and accuracy of the crawler are very high.
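
The relevancy step described in the abstract can be illustrated with a minimal sketch, which is not the paper's implementation. It assumes a simple alphabetic tokenizer, a smoothed IDF computed over a small reference corpus, and a cutoff of 0.3. The function names (`tokenize`, `idf_weights`, `tfidf_vector`, `cosine`, `is_relevant`) and the threshold are hypothetical; the abstract does not specify them.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def idf_weights(corpus_tokens):
    """Return (idf, default_idf) using smoothed IDF: log((1 + N) / (1 + df)) + 1.

    default_idf is the weight given to terms never seen in the corpus (df = 0).
    """
    n_docs = len(corpus_tokens)
    df = Counter()
    for tokens in corpus_tokens:
        df.update(set(tokens))  # count each term once per document
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    default_idf = math.log(1 + n_docs) + 1.0
    return idf, default_idf


def tfidf_vector(tokens, idf, default_idf):
    """Map a token list to a sparse TF-IDF vector (dict of term -> weight)."""
    if not tokens:
        return {}
    counts = Counter(tokens)
    total = len(tokens)
    return {t: (c / total) * idf.get(t, default_idf) for t, c in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors in the vector space model."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_relevant(page_text, topic_text, idf, default_idf, threshold=0.3):
    """Keep a page when its similarity to the topic reaches the (assumed) threshold."""
    page_vec = tfidf_vector(tokenize(page_text), idf, default_idf)
    topic_vec = tfidf_vector(tokenize(topic_text), idf, default_idf)
    return cosine(page_vec, topic_vec) >= threshold


# Illustrative inputs only; these strings are made up for the example.
topic = "competitor product price customer review"
on_topic = "customer review of the product: good price compared with competitor"
off_topic = "football match score last night"

idf, default_idf = idf_weights([tokenize(t) for t in (topic, on_topic, off_topic)])
keep_on_topic = is_relevant(on_topic, topic, idf, default_idf)
keep_off_topic = is_relevant(off_topic, topic, idf, default_idf)
```

In this sketch the on-topic page would be saved and the off-topic page discarded, mirroring the save-or-discard rule in the abstract. A real crawler would compute IDF over a much larger corpus and tune the threshold empirically.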