Abstract: Topic-oriented information collection is one of the key technologies of vertical search engines, and it directly affects the speed and relevance of search results. Topic information collection algorithms are widely used because of their accuracy. In the algorithm presented here, a Hidden Markov Model (HMM) is used to learn and judge the relevance of a Uniform Resource Locator (URL) to the topic information. The Rocchio method is used to construct prototype vectors relevant to the topic information, and the HMM is used to learn the preferred browsing paths. Concept maps that capture the semantics of webpages are constructed, from which the link structure of the web can be determined. Experiments confirm the validity of the algorithm: compared with the Best-First algorithm, it collects more topic-information pages and achieves a higher precision ratio.
Keywords: Topic Information Collection; Hidden Markov Model; Crawler; URL (Uniform Resource Locator); Prototype Vector; Precision Ratio; Recall Ratio
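
As an illustration of the prototype-vector step mentioned in the abstract, the following is a minimal sketch of a Rocchio-style prototype and a cosine relevance score. It is not the paper's implementation: the function names `rocchio_prototype` and `cosine`, the weights `beta` and `gamma`, and the toy term-weight vectors are all assumptions made for the example.

```python
import math


def rocchio_prototype(relevant, non_relevant, beta=1.0, gamma=0.25):
    """Rocchio-style prototype: beta * mean(relevant) - gamma * mean(non_relevant).

    `relevant` must be a non-empty list of equal-length term-weight vectors.
    The beta and gamma defaults are illustrative assumptions, not values
    taken from the paper. Negative components are clipped to zero.
    """
    dim = len(relevant[0])
    proto = [0.0] * dim
    for vec in relevant:
        for i, w in enumerate(vec):
            proto[i] += beta * w / len(relevant)
    for vec in non_relevant:
        for i, w in enumerate(vec):
            proto[i] -= gamma * w / len(non_relevant)
    return [max(0.0, w) for w in proto]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy three-term vocabulary (hypothetical data).
on_topic = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
off_topic = [[0.0, 0.0, 1.0]]
prototype = rocchio_prototype(on_topic, off_topic, beta=1.0, gamma=1.0)
```

A crawler could rank a candidate page by `cosine(page_vector, prototype)`, so that pages sharing terms with the on-topic examples score higher than pages dominated by off-topic terms.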