Heuristic image recognition.
Cosoi, Alexandru Catalin ; Vlad, Madalin Stefan ; Sgarciu, Valentin 等
1. INTRODUCTION
Internet users have already become used to a new category of spam
messages, one in which the content is actually delivered in the form of
a picture attachment. Also, as they probably noticed, each image is
slightly different from its predecessors, by having some noise added.
Two years ago, such "image spam" consisted of
approximately 10% of the total amount of spam. Such message series
typically consisted of about 10-15 spam images with some minor
modifications. In the last 6 months, however, spammers noticed that many
of the current AntiSpam solutions are almost ineffective against this
new trick so they started attacking this niche in earnest. Image spam
increased to 30-40% of the total amount of circulating spam, with random
noise changing at almost every image sent. Detection rates dropped even
further. (Graham 2006)
Usually noise consisted in:
* Adding random pixel in the image
* Animated GIF's with random bogus frames
* Modifying the size of the image
* A long line at the end of the image (some sort of border) with
random part missing
* Splitting the image into sub-images and using the HTML tables to
reconstruct it.
* Similar colors between different parts of text in the image.
* GIF's with transparent frames, with parts of the text in
different frames to confuse OCR filters
As one may notice, part of this noise types are also met in an
image retrieval system. Let's consider that a professional
photographer has to take a picture of the Eiffel Tower. He will fix his
camera on his tripod in the place where he can have the best view on the
tower and take a few pictures. Even though he will take consecutive
pictures during a small time interval, and for his eye the pictures are
quite similar, computationally speaking, the pictures will be different.
If a bird or a plane flies over the tower, the picture will be even more
different.
2. PROPOSED METHOD
If in spam images this process is voluntary, in reality this
process can happen by mistake, or simply because of the medium in which
the pictures are taken. Another problem would be that the information
contained in the image can be rotated with different angles in order to
confuse OCR filters. The same problem would appear if the photographer
would want to take a landscape picture, or a diagonal picture
BitDefender solved this problem by performing analysis only on the
histograms of the spam images by using a special distance developed for
this type of pictures, called SID (spam image distance). Using a
frequently updated database of histograms of images met in spam emails,
almost 98% of the spam images were correctly identified. (Cosoi, 2006)
Many histogram distances have been presented so far, (Stevens et
al., 2002) among which we can enumerate: Histogram Euclidian distance
(Sablak et al., 2006), Histogram intersection distance, Histogram
quadratic (cross) distance, (Scheunders, 2003) where, if we consider h
and g two color histograms, then Histogram Intersection Distance (the
most appealing one for our purpose) represents
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (1)
Where [absolute value of h] and [absolute value of g] give the
magnitude of each histogram, which is equal to the number of samples.
Colors not present in the user's query image do not contribute to
the intersection distance. This reduces contribution of the background
colors. The sum is normalized by the histogram with fewest samples.
(Ling & Okada, 2006)
It is useful because it can ignore changes in background colors,
but problematic when changes appear in foreground. Also, if the noise
added by the spammers uses the same colors but in different quantity,
these colors will be considered in the distance. The algorithm used was
quite simple. There had to be a component that cleaned the image as much
as possible from the noise added by spammers, and there had to be a
distance which compared two histograms. Since this type of noise is
particular for spam images, one way to avoid unnecessary problems was to
ignore all colors that could be found less than x% in that image. Some
noise persisted, but the major part is out. Of course, this eliminates
good colors too, but only insignificant ones. The cleaning function Ch
accepts a given histogram and returns a new shorter histogram in which
unnecessary colors are eliminated. Ch can be defined like this:
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (2)
In other words, only colors that can be found in a large amount
(normalized to the surface of the picture) are kept. Experimentally, a
good value for [gamma] would be 0.01. SID is a modified intersection
distance which can be defined like this:
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (3)
where
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (3)
wherein h and g denote pixel counts of the target and reference
histograms, respectively, a coordinate triplet (a, b, c) tags an
individual color bin, A, B and C denote the basis colors of the color
space, parameters [delta] and [DELTA] denote a color neighborhood size
and a color bin similarity threshold, respectively, while [absolute
value of h] and [absolute value of g] denote the magnitude (i.e., total
number of samples) of the target and reference histograms, respectively.
The distance of eq. [3] is symmetrical with respect to h and g--i.e. the
distance depends on the contents of the two images, but not on which of
the two images is a target histogram and which is a reference histogram.
In other words, we do not consider only identical colors, but also
similar, which can differ from the original with no more than [delta]
values, with the restriction that we consider this element in the sum
only if the difference between the sizes of the bins is smaller than
[DELTA]%. These two parameters can be determined using ROC curves, trial
and error or using any machine learning technique.
This way, SID is comparing a first bin of a first histogram
representation to a range of second bins of a second histogram
representation to determine a minimum distance between the first bin and
the range of second bins, wherein the first histogram representation and
the second histogram representation are selected from the target
histogram representation and the reference histogram representation, and
wherein the first histogram representation is distinct from the second
histogram representation, and employing the minimum distance to
determine the histogram distance between the target histogram
representation and the reference histogram representation. A better
representation of the SID algorithm can be seen in Figure 1.
Basically. this filter compares the query image with all the images
contained in the database in order to find a match. The database
contains only clean histograms.
The algorithm can be presented in the following steps
1. Extract histogram
2. q = Clean histogram (skip the unnecessary colors)
3 y = max( SID (q, [h.sub.i]))
4. If y [greater than or equal to] [ohm] return "We have a
match"
It is obvious that this filter can be modified in order to develop
an image retrieval system. The only difference is that the database has
to contain the original images and not only the pre-computed histograms.
All we have to do in the algorithm is to add another step and to modify
the one where we extract only the maximum hit. We want all the hits.
3. RESULTS
On our test corpus, our image retrieval system has an accuracy of
97.9%, but since our query system permits the user to choose the degree
of accepted noise, the degree of similarity between colors or the degree
of similarity between pixel counts, it's accuracy can rise up even
to 100%.
Also, working with complex pictures with a high number of colors,
might cause some problems to SID, but if this method would be combined
with other techniques for image similarity, we believe that there will
be some amazing results.
[FIGURE 1 OMITTED]
4. CONCLUSIONS
By comparing each bin with a plurality of bins, SID is able to
identify images that are slightly different from the original with high
accuracy. Of course, since retrieval is performed only on the histograms
of the images, we can't say that this process is flawless, but we
are confident that this distance is one step further in image retrieval
systems research.
Acknowledgements
This work was entirely supported by BitDefender--AntiSpam
Laboratory. Web: www.bitdefender.com
5. REFERENCES
Cosoi A. C. (2006). The medium or the message--Dealing with image
spam, Spam Bulletin
Graham-Cumming J. (2006). Title of conference paper, The Rise and
Rise of Image-Based Spam, Spam Bulletin Ueda,
Ling H., Okada K. (2006). Diffusion Distance for Histogram
Comparison, Center for Automatic Research, Computer Science Department,
University of Maryland and Imaging and Visualization Department, Siemens
Corporate Research Inc
Stevens, M.; Culberston B.; Malzbender T(2002) A histogram-based
Color Consistency Test for Voxel Coloring, HP research labs
Scheunders P. (2003). A comparison of clustering algorithms applied
to color image, Dept. of Physics, RUCA university of Antwerp
Sablak, S.; Boult T S. & Udiljak T. (2006). Multilevel Color
Histogram Representation of Color Images by Peaks for Omni-Camera,
Department of Electrical engineering & Computer Science (IASTED proceedings, October 18-21 1999, Nassau, Bahamas