文章基本信息

标题：Asigning relevancies to individual features for large patterns in ARTMAP networks.
作者：Cosoi, Alexandru Catalin ; Vlad, Madalin Stefan ; Sgarciu, Valentin 等
期刊名称：Annals of DAAAM & Proceedings
印刷版ISSN：1726-9679
出版年度：2007
期号：January
语种：English
出版社：DAAAM International Vienna
摘要：Key words: ART, ARTMAP, spam, AntiSpam, heuristics
关键词：Artificial neural networks;Neural networks;Spam (Junk email)

Asigning relevancies to individual features for large patterns in ARTMAP networks.

Cosoi, Alexandru Catalin ; Vlad, Madalin Stefan ; Sgarciu, Valentin 等

Abstract: Spam has become a global problem. Latest studies estimate that as much as 9 out of 10 emails are spam (Routers, 2006). Many solutions have been published so far, but every time a suitable solution is found, spam mutates into something new, so new ways to fight it must be found. A good method to fight spam at a proactive level would be the use of neural networks (Cosoi, 2006), but, as you will see in this paper, applying neural network theory per see is not enough.

Key words: ART, ARTMAP, spam, AntiSpam, heuristics

1. INTRODUCTION

Although sending billions of email messages advertising ridiculous products that most of us would never in our lives consider buying, what makes spamming profitable is its large volume. According to the New York Times, people click and buy products advertised in pharmaceutical spam emails. Other articles suggest that it costs about 300$ to send 1 million emails. Assuming that a spammer makes just 25$ from each sale (which is the lowest profit he can make), it's easy to see that it makes only slightly more than 2 million messages to make an immediate 10 000$ profit (Beckman, 2007).

Over time, several techniques have been proposed to address this problem, like Bayesian Filtering Techniques (Graham, 2002), URL filtering, heuristic filtering, spam image filtering (Cosoi, 2006) and so on, but each time an acceptable solution was found, spam quickly mutated to something new and harder to catch.

Due to the fact that all the techniques enumerated above are all reactive, the need for a proactive solution is obvious. Heuristic filters look for patterns in the content of an email and match them against a database of known spam characteristics. These characteristics can be in the form of certain words, phrases, punctuation and altered dates. These are strong patterns and they match a single type of spam, offering zero false positives (a legitimate email mistakenly classified as spam), but the process of creating strong patterns is usually insidious and time consuming.

A good way to create strong patterns would be to use a neural network that combines short weaker patterns (if the email has words like "Viagra", "Valium", or if the date of the message is in the future and so on, which individually have a high false positive rate) and to use a neural network in order to combine these into stronger and longer patterns.

2. PROPOSED METHOD

A good neural network type up for this task would be ARTMAP networks (Cosoi, 2006). ARTMAP architectures are neural networks that develop stable recognition codes in real time into response to arbitrary sequences of input patterns. They were designed to solve the stability-plasticity dilemma that every intelligent machine learning system is facing: how to keep learning from new events without forgetting previously learned information. ARTMAP networks were designed to accept binary or fuzzy input patterns (Carpenter & Grossberg, 1991). ARTMAP networks consist of two ART1 networks, ARTa and ARTb, bridged via an inter-ART module, as shown on Fig 1. An ART module has three layers: the input layer (F0), the comparison layer (F1), and the recognition layer (F2).

The neurons, or nodes, in the F2 layer represent input categories. The F1 and F2 layers interact with each other through weighted bottom-up and top-down connections, which are modified when the network learns. There are additional gain control signals in the network that regulate its operation.

[FIGURE 1 OMITTED]

In the training phase, the system has to receive a list of features extracted from the email messages and an output category. For example, ARTa will receive an input vector where each field indicates the existence of a certain spam or legitimate characteristic. Also, each input vector will be associated to a label which indicates if the current pattern was extracted from a spam or a legitimate email message, which will be fed to the ARTb module. When the training phase starts, the system will quickly associate inputs and outputs by creating strong patterns for each category.

The results are very good (Cosoi, 2006), with a false positive rate of almost 1% (which is not exactly the best yet obtained, but it can be rated among the top 3 AntiSpam filters) and a false positives rate (spam messages mistakenly misclassified as legitimate messages) under 10%. The problem that appears is that since the training phase is performed on a few million legitimate and spam messages samples, and since the individual heuristics are generally weak, the extracted patterns can be quite confusing for the neural network algorithm. For example we can have a situation where important legitimate features and standard weak spam features can determine a mistakenly "this is spam" answer, and vice-versa.

These situations are generally determined by the large corpus of messages on which the neural network has to train in order to achieve an acceptable accuracy. In many situations, in our experiments, the training phase stopped after a fixed number of training iterations was achieved, and not when reaching a pre-established accuracy.

The solution we found to address this problem is to a priory offer a numerical relevance to each individual feature, and also the category (spam or legitimate) for which this feature was created. Our purpose was to create in inhibited connection, in order to stop the neural network to give an answer if the relevance of the pattern was smaller then a pre-established threshold T. Of course, this means that good hits would be eliminated to, but common-sense would say that we can't actually say an email is a spam message only because it contains the word "Viagra".

If we consider I and S the relevance for the legitimate heuristics within a subset of a pattern and respectively S the relevance for the spam heuristics, we can combine them in a total relevance for a pattern by using the following simple rule:

R = 1-I+S / 2 (1)

Where, I and S are computed as percents of the total sum of the relevancies within a pattern.

By using this result, the neural network can determine if this is an important pattern for the decision process or not. Of course, now this approach is more of a heuristic filter than a neural network. In order to keep all the facilities that a neural network would offer, (and we also chose this type of neural network in order to solve the stability-plasticity dilemma), we had to add a punishment-reward system in the control subsystem of the ARTa module (see Fig. 2). The process we developed is quite simple to explain. Each time the prediction matched the expectation we increased by a small amount the relevance of that pattern. If the prediction and the expectation were different, we decreased the relevance with a small amount. The process can be defined using the following formula.

[R.sub.i+1] = (1-w) [R.sub.i] + w(R + [(-1).sup.c] x w/100) (2)

Where [(-1).sup.c] has a negative value when the expectation and the prediction are different, and a positive one when the two are the same.

3. RESULTS

Our tests showed that by applying the improvements presented in this paper, the false positives rates dropped radically from an initial 1% to 1 in a million, while the false negative rate reached 13%, compared to an initial value of 10%. Although this method provides a slightly increase of the false negatives rate, it is far more important to prevent tagging as spam a legitimate email message than overlooking a few spam messages.

[FIGURE 2 OMITTED]

The conditions in which the experiments took place are the following:

* 2.5 million spam messages

* Almost 1 million legitimate email messages

* 75% of the message corpus were used for training the neural network and,

* 25% were used in testing the neural network.

4. CONCLUSIONS

From the end-consumer point of view, not being able to receive crucial information (a salary increase or getting fired), represents a greater loss than having it's inbox bombed with another 3% of spam messages.

By using our approach, we minimize the risk of false positives which is an important problem in the AntiSpam community, although this solution has its side effects.

We are confident that this theory can also be applied to other neural network types, providing us a base for future research projects.

5. REFERENCES

Beckman S. (2007). High-Performance Asynchronous IO for SMTP Multiplexing Available from: http://www.spamconference.org Accessed: 2007-03-31

Cosoi, A. C. (2006). The medium or the message? Dealing with image spam. Available from: http://www.virusbtn.com Accessed: 2006-12-3

Cosoi A. C. (2006). An AntiSpam filter based on adaptive neural networks Available from: http://www.spamconference.org Accessed: 2006-04-15

Graham P. (2002). A plan for spam Available from: http://www.paulgraham.com/spam.html Accessed: 2007-05-27

Carpenter, G. & Grossberg, S. (1991). Supervised real-time learning and classification of nonstationary data by a self-organizing neural network, In: Pattern recognition by self organizing neural networks, Carpenter, G. & Grossberg, S., (Ed. MIT press), 501-544, Publisher MIT press, ISBN 0-262-03176-0, Cambridge Massachusetts