摘要:Problem statement: Spam is an irrelevant or inappropriate message sent on the internet to a large number of newsgroups or users. A spam word is a list of well-known words that often appear in spam mails. Bloom Filter (BF) is used for identification of spam word. Approach: BF is a simple but powerful data structure that can check membership to a static set. The trade-off to use BF is a certain configurable risk of false positives. The odds of a false positive can be made very low if the hash bitmap is sufficiently large. Bin Bloom Filter (BBF) has number of BFs which assign group of words into bins with different false positive rates based on weight of the spam words. Genetic Algorithm (GA) was employed to minimize the total membership invalidation cost of BBF. GA had premature convergence problem. Simulated Annealing (SA) was incorporated with GA to prevent the premature convergence effectively. Results: The experimental results of total membership invalidation cost are analyzed for various sizes of bins. The results showed that the combined GA-SA model outperforms SA and GA model. Conclusion: GA has premature convergence due to its genetic operators that are not able to generate offsprings which are superior to the parents. So more number of similar chromosomes presented on the population. When GA is incorporated with SA new genes were introduced which causes diversity in the population and prevents premature convergence. The combined GA-SA outperforms GA and SA.
关键词:Spam word; hash function; genetic algorithm; simulated annealing; static set; premature convergence; long vector; bit array