Abstract: Structural elements of a text, such as its title, often convey its main content, and these structures can affect the effectiveness of text categorization. However, the most common feature weighting algorithm, term frequency-inverse document frequency (TF-IDF), does not take the structural information of texts into account. To address this problem, we propose a new feature weighting algorithm based on particle swarm optimization (PSO) that considers the structural information (i.e., HTML tags) of web pages. First, web pages are crawled and pre-processed, and the content of four HTML tags is retained. Second, the chi-squared (CHI) statistic is used to select features. Third, a new feature weighting algorithm, called the feature tag weighting algorithm, is introduced, in which PSO is used to calculate the tag weighting coefficients. Finally, k-nearest neighbor (kNN) is used as the web text classifier. The experimental results show that the feature tag weighting algorithm outperforms TF-IDF in web text categorization.
Other keywords: Text categorization, TF-IDF, PSO, web text, HTML tag.
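
The idea of weighting a term by the HTML tag it appears in can be illustrated with a minimal sketch. The abstract does not give the exact formula or name the four tags, so everything here is an assumption: the tag names, the coefficient values, and the form of the weight (a coefficient-weighted term count multiplied by IDF). In the described method the coefficients are found by PSO rather than set by hand.

```python
import math
from collections import Counter

# Hypothetical tag names and coefficients, chosen only for illustration.
# The abstract does not name the four tags, and the method tunes the
# coefficients with PSO instead of fixing them manually.
TAG_COEFFS = {"title": 4.0, "h1": 3.0, "a": 2.0, "p": 1.0}


def tag_weighted_tf(doc, coeffs):
    """Sum tag coefficients over every occurrence of each term.

    doc maps a tag name to the list of tokens found inside that tag.
    """
    tf = Counter()
    for tag, tokens in doc.items():
        coeff = coeffs.get(tag, 0.0)  # unknown tags contribute nothing
        for token in tokens:
            tf[token] += coeff
    return tf


def inverse_document_frequency(docs):
    """Plain IDF, log(N / df), computed over the whole collection."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        # Count each term once per document, regardless of tag.
        df.update({tok for tokens in doc.values() for tok in tokens})
    return {term: math.log(n_docs / count) for term, count in df.items()}


def feature_tag_weights(docs, coeffs=TAG_COEFFS):
    """Return one {term: weight} dict per document."""
    idf = inverse_document_frequency(docs)
    return [
        {term: w * idf[term] for term, w in tag_weighted_tf(doc, coeffs).items()}
        for doc in docs
    ]


if __name__ == "__main__":
    pages = [
        {"title": ["pso"], "p": ["pso", "text"]},
        {"title": ["knn"], "p": ["text", "knn"]},
    ]
    for weights in feature_tag_weights(pages):
        print(weights)
```

With all coefficients set to 1.0 this reduces to ordinary TF-IDF on raw counts, which makes the contribution of the tag coefficients easy to isolate when comparing the two schemes.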