文章基本信息

标题：Heuristic image recognition.
作者：Cosoi, Alexandru Catalin ; Vlad, Madalin Stefan ; Sgarciu, Valentin 等
期刊名称：Annals of DAAAM & Proceedings
印刷版ISSN：1726-9679
出版年度：2008
期号：January
语种：English
出版社：DAAAM International Vienna
摘要：Internet users have already become used to a new category of spam messages, one in which the content is actually delivered in the form of a picture attachment. Also, as they probably noticed, each image is slightly different from its predecessors, by having some noise added.
关键词：Algorithms;Email filtering software;Spam (Junk email)

Heuristic image recognition.

Cosoi, Alexandru Catalin ; Vlad, Madalin Stefan ; Sgarciu, Valentin 等

1. INTRODUCTION

Internet users have already become used to a new category of spam messages, one in which the content is actually delivered in the form of a picture attachment. Also, as they probably noticed, each image is slightly different from its predecessors, by having some noise added.

Two years ago, such "image spam" consisted of approximately 10% of the total amount of spam. Such message series typically consisted of about 10-15 spam images with some minor modifications. In the last 6 months, however, spammers noticed that many of the current AntiSpam solutions are almost ineffective against this new trick so they started attacking this niche in earnest. Image spam increased to 30-40% of the total amount of circulating spam, with random noise changing at almost every image sent. Detection rates dropped even further. (Graham 2006)

Usually noise consisted in:

* Adding random pixel in the image

* Animated GIF's with random bogus frames

* Modifying the size of the image

* A long line at the end of the image (some sort of border) with random part missing

* Splitting the image into sub-images and using the HTML tables to reconstruct it.

* Similar colors between different parts of text in the image.

* GIF's with transparent frames, with parts of the text in different frames to confuse OCR filters

As one may notice, part of this noise types are also met in an image retrieval system. Let's consider that a professional photographer has to take a picture of the Eiffel Tower. He will fix his camera on his tripod in the place where he can have the best view on the tower and take a few pictures. Even though he will take consecutive pictures during a small time interval, and for his eye the pictures are quite similar, computationally speaking, the pictures will be different. If a bird or a plane flies over the tower, the picture will be even more different.

2. PROPOSED METHOD

If in spam images this process is voluntary, in reality this process can happen by mistake, or simply because of the medium in which the pictures are taken. Another problem would be that the information contained in the image can be rotated with different angles in order to confuse OCR filters. The same problem would appear if the photographer would want to take a landscape picture, or a diagonal picture

BitDefender solved this problem by performing analysis only on the histograms of the spam images by using a special distance developed for this type of pictures, called SID (spam image distance). Using a frequently updated database of histograms of images met in spam emails, almost 98% of the spam images were correctly identified. (Cosoi, 2006)

Many histogram distances have been presented so far, (Stevens et al., 2002) among which we can enumerate: Histogram Euclidian distance (Sablak et al., 2006), Histogram intersection distance, Histogram quadratic (cross) distance, (Scheunders, 2003) where, if we consider h and g two color histograms, then Histogram Intersection Distance (the most appealing one for our purpose) represents

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (1)

Where [absolute value of h] and [absolute value of g] give the magnitude of each histogram, which is equal to the number of samples. Colors not present in the user's query image do not contribute to the intersection distance. This reduces contribution of the background colors. The sum is normalized by the histogram with fewest samples. (Ling & Okada, 2006)

It is useful because it can ignore changes in background colors, but problematic when changes appear in foreground. Also, if the noise added by the spammers uses the same colors but in different quantity, these colors will be considered in the distance. The algorithm used was quite simple. There had to be a component that cleaned the image as much as possible from the noise added by spammers, and there had to be a distance which compared two histograms. Since this type of noise is particular for spam images, one way to avoid unnecessary problems was to ignore all colors that could be found less than x% in that image. Some noise persisted, but the major part is out. Of course, this eliminates good colors too, but only insignificant ones. The cleaning function Ch accepts a given histogram and returns a new shorter histogram in which unnecessary colors are eliminated. Ch can be defined like this:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (2)

In other words, only colors that can be found in a large amount (normalized to the surface of the picture) are kept. Experimentally, a good value for [gamma] would be 0.01. SID is a modified intersection distance which can be defined like this:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (3)

where

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (3)

wherein h and g denote pixel counts of the target and reference histograms, respectively, a coordinate triplet (a, b, c) tags an individual color bin, A, B and C denote the basis colors of the color space, parameters [delta] and [DELTA] denote a color neighborhood size and a color bin similarity threshold, respectively, while [absolute value of h] and [absolute value of g] denote the magnitude (i.e., total number of samples) of the target and reference histograms, respectively. The distance of eq. [3] is symmetrical with respect to h and g--i.e. the distance depends on the contents of the two images, but not on which of the two images is a target histogram and which is a reference histogram. In other words, we do not consider only identical colors, but also similar, which can differ from the original with no more than [delta] values, with the restriction that we consider this element in the sum only if the difference between the sizes of the bins is smaller than [DELTA]%. These two parameters can be determined using ROC curves, trial and error or using any machine learning technique.

This way, SID is comparing a first bin of a first histogram representation to a range of second bins of a second histogram representation to determine a minimum distance between the first bin and the range of second bins, wherein the first histogram representation and the second histogram representation are selected from the target histogram representation and the reference histogram representation, and wherein the first histogram representation is distinct from the second histogram representation, and employing the minimum distance to determine the histogram distance between the target histogram representation and the reference histogram representation. A better representation of the SID algorithm can be seen in Figure 1.

Basically. this filter compares the query image with all the images contained in the database in order to find a match. The database contains only clean histograms.

The algorithm can be presented in the following steps

1. Extract histogram

2. q = Clean histogram (skip the unnecessary colors)

3 y = max( SID (q, [h.sub.i]))

4. If y [greater than or equal to] [ohm] return "We have a match"

It is obvious that this filter can be modified in order to develop an image retrieval system. The only difference is that the database has to contain the original images and not only the pre-computed histograms. All we have to do in the algorithm is to add another step and to modify the one where we extract only the maximum hit. We want all the hits.

3. RESULTS

On our test corpus, our image retrieval system has an accuracy of 97.9%, but since our query system permits the user to choose the degree of accepted noise, the degree of similarity between colors or the degree of similarity between pixel counts, it's accuracy can rise up even to 100%.

Also, working with complex pictures with a high number of colors, might cause some problems to SID, but if this method would be combined with other techniques for image similarity, we believe that there will be some amazing results.

[FIGURE 1 OMITTED]

4. CONCLUSIONS

By comparing each bin with a plurality of bins, SID is able to identify images that are slightly different from the original with high accuracy. Of course, since retrieval is performed only on the histograms of the images, we can't say that this process is flawless, but we are confident that this distance is one step further in image retrieval systems research.

Acknowledgements

This work was entirely supported by BitDefender--AntiSpam Laboratory. Web: www.bitdefender.com

5. REFERENCES

Cosoi A. C. (2006). The medium or the message--Dealing with image spam, Spam Bulletin

Graham-Cumming J. (2006). Title of conference paper, The Rise and Rise of Image-Based Spam, Spam Bulletin Ueda,

Ling H., Okada K. (2006). Diffusion Distance for Histogram Comparison, Center for Automatic Research, Computer Science Department, University of Maryland and Imaging and Visualization Department, Siemens Corporate Research Inc

Stevens, M.; Culberston B.; Malzbender T(2002) A histogram-based Color Consistency Test for Voxel Coloring, HP research labs

Scheunders P. (2003). A comparison of clustering algorithms applied to color image, Dept. of Physics, RUCA university of Antwerp

Sablak, S.; Boult T S. & Udiljak T. (2006). Multilevel Color Histogram Representation of Color Images by Peaks for Omni-Camera, Department of Electrical engineering & Computer Science (IASTED proceedings, October 18-21 1999, Nassau, Bahamas