Build your antiphishing technology in just 5 minutes.
Cosoi, Alexandru Catalin; Sgarciu, Valentin; Vlad, Madalin Stefan
1. INTRODUCTION
Phishing is a form of social engineering in which an attacker
attempts to fraudulently acquire sensitive information from a victim by
impersonating a trustworthy third party.
Current AntiSpam technologies have achieved competitive detection
rates on phishing emails. Recently, however, phishers have been
advertising their fake websites through a variety of communication
channels (e.g. email spam, instant messaging, social networks, blog
posts and even SMS (Cosoi & Petre, 2008; Hatlestad, 2006)), and,
starting from information about their victims gathered from social
network profiles (Jagatic et al., 2005), they can easily
social-engineer their way into the user's trust. This means that
browser-level protection must be provided in order to prevent the user
from accessing the website even after he has been persuaded to follow
the fake URL.
Current browser-based technologies employ whitelists, blacklists,
various heuristics that check whether a URL resembles a well-known URL,
community ratings, content-based heuristics (Cranor et al., 2006) and,
lately, visual similarity (Wenyin et al., 2005). Blacklisting has worked
well so far, but the timeframe needed for a URL to become blacklisted
worldwide in most cases overlaps with the period in which the phishing
attack is most successful. Also, current content-based solutions,
mostly blacklists and body heuristics (Cranor et al., 2006), do not
always make use of whitelists, which can sometimes cause the filter to
flag eBay's official website as a phishing website
(SpamConference, 2008; Wu et al., 2006).
2. PROPOSED METHOD
In developing our method we started from the following hypothesis:
in a given language, the number of possible rephrasings of a text that
conveys the same or similar information (e.g. "This is your online
banking account. You must log in first in order to access your funds.
Please be aware of phishing attempts."), obfuscation aside, is limited
by the speaker's common sense (i.e. the information will be phrased in
a simple, readable and understandable form). In other words, we assume
that all English log-in pages of financial institutions will share a
large set of common words, since they serve common purposes and rely on
the same specialized financial vocabulary
(Landauer et al., 1998; Kelleher, 2004; Shin & Choi, 2004;
McConnell-Ginet, 1973; Merlo et al., 2003; Biemann & Quasthoff,
2007).
Considering two documents A and B (in our situation, websites of
financial institutions such as PayPal or BankOfAmerica), we can
represent them as $A = C \cup N_1$ and $B = C \cup N_2$, where $C$
represents the common words and $N_1$, $N_2$ the distinct words. This
means that the number of words needed to build a database of triples of
the form (word, document, occurrences) is
$|C| + |N_1| + |N_2| \le 2|C| + |N_1| + |N_2|$, or in short
$|A \cup B| \le |A| + |B|$. If we consider the case of only two
documents, this technique might not bring considerable improvements,
but in the case of many documents that serve the same purpose
(e.g. financial institution websites), it is reasonable to expect a
large number of common words.
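As a minimal illustration of this vocabulary overlap, the following
Python sketch counts the common and distinct words of two hypothetical
log-in page snippets (the texts and variable names are our own, not
taken from the corpus described below):

    from collections import Counter

    # Hypothetical log-in page snippets; the real corpus uses full bank web pages.
    doc_a = "please log in to your online banking account to access your funds"
    doc_b = "log in to your account to access your online funds securely"

    words_a = Counter(doc_a.split())          # word -> occurrences in document A
    words_b = Counter(doc_b.split())          # word -> occurrences in document B

    common     = set(words_a) & set(words_b)  # C: shared vocabulary
    distinct_a = set(words_a) - set(words_b)  # N1: words only in A
    distinct_b = set(words_b) - set(words_a)  # N2: words only in B

    # |C| + |N1| + |N2| = |A union B| <= |A| + |B|
    print(len(common) + len(distinct_a) + len(distinct_b),
          "<=", len(set(words_a)) + len(set(words_b)))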
We will now define a similarity indicator between two documents,
known as the Jaccard distance for sets:
$D = 1 - \frac{|A \cap B|}{|A \cup B|}$ (1)
On identical documents this distance is zero, and on nearly identical
documents it stays close to zero. Since these are not standard sets
(in an ordinary set each element appears only once, whereas here each
word appears as many times as it occurs in the document), the distance
actually provides an acceptable similarity value, judging by the number
of words.
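A minimal sketch of this multiset variant of the Jaccard distance,
assuming documents are given as plain word lists (the function name and
tokenization are our own illustration, not part of the published
filter):

    from collections import Counter

    def jaccard_distance(words_a, words_b):
        """Distance of equation (1) on word multisets: each word counts
        as many times as it occurs in the document."""
        a, b = Counter(words_a), Counter(words_b)
        intersection = sum((a & b).values())   # min(count in A, count in B) per word
        union = sum((a | b).values())          # max(count in A, count in B) per word
        return 1.0 - intersection / union if union else 0.0

    # Identical documents yield 0.0; completely different documents approach 1.0.
    print(jaccard_distance("log in to your account".split(),
                           "log in to your account".split()))   # 0.0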
On a corpus of 101 financial institution websites from 3 different
countries (the top 5 phished banks from Romania, 7 websites from
Germany and 89 randomly chosen phished US institutions that showed a
high frequency of email phishing in our internal email corpus), with an
average of 100 words per page, we obtained a database of only 4422
different words, instead of an expected minimum of 10000 words.
Considering a pool of web pages (e.g. the ones presented above), we
can construct a database in the format presented in Table 1.
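A sketch of how such a dictionary could be built, assuming each trained
institution is supplied as a tokenized word list (the helper name and
sample data are illustrative and simply mirror the layout of Table 1):

    from collections import Counter, defaultdict

    def build_database(pages):
        """pages: {institution: list of words} -> {word: {institution: occurrences}},
        i.e. the (word, document, occurrences) triples of Table 1."""
        database = defaultdict(dict)
        for institution, words in pages.items():
            for word, count in Counter(words).items():
                database[word][institution] = count
        return database

    db = build_database({
        "Bank 1": "please log in to your bank account".split(),
        "Bank 2": "secure online banking log in".split(),
    })
    print(db["log"])   # {'Bank 1': 1, 'Bank 2': 1}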
When the forgery filter has to be run on a target webpage, we
compute the Jaccard distance to each institution the filter has been
trained on (i.e. the words of the learned webpages are looked up in the
database). The lowest distance obtained indicates the highest
similarity (judging by the specified distance) between the target
webpage and one of the reference webpages in our database. If the
computed distance is smaller than a predefined threshold, we consider
the target a forged page.
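A minimal sketch of this decision step, reusing the jaccard_distance
function and the page dictionary from the sketches above (the 0.25
threshold is the value reported in Section 3; the helper names are our
own):

    def classify(target_words, pages, threshold=0.25):
        """Return (best matching institution, distance) if the target looks
        like a forgery of a trained page, otherwise (None, distance)."""
        best_name, best_dist = None, 1.0
        for institution, words in pages.items():
            dist = jaccard_distance(target_words, words)
            if dist < best_dist:
                best_name, best_dist = institution, dist
        if best_dist < threshold:
            return best_name, best_dist      # likely a forgery of best_name
        return None, best_dist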
When deploying this technology, an up-to-date whitelist is a
necessity, because once the filter has learned the original website, it
will score a perfect match when that very website is visited. An
up-to-date whitelist inhibits running the forgery filter on the
original websites, thereby avoiding false positives.
Based on this initial background, our proposed method can be better
understood from Figure 1. First, the presented webpage is checked
against a blacklist and a whitelist. Afterwards, some simple heuristics
are run on the webpage's content to check whether the page actually
tries to mimic an official log-in page (e.g. whether it contains a
submit button or words like eBay, PayPal, etc.). We introduced this
step for speed optimization purposes (it would be pointless to check
whether a webpage without a submit form duplicates a webpage which has
such a form).
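A hedged sketch of the pipeline in Figure 1, under the assumption that
the blacklist/whitelist lookups and the submit-form heuristic are
available as simple checks (all helper names and the brand keywords are
illustrative, not taken from the original implementation):

    def check_page(url, page_words, page_html, blacklist, whitelist, pages):
        """Pipeline of Figure 1: blacklist -> whitelist -> cheap heuristics -> forgery filter."""
        if url in blacklist:
            return "phishing (blacklisted)"
        if url in whitelist:
            return "legitimate (whitelisted)"   # never run the forgery filter on originals
        # Cheap pre-filter: analyse only pages that could mimic a log-in page.
        looks_like_login = "<form" in page_html.lower() and any(
            brand in page_words for brand in ("ebay", "paypal", "bank"))
        if not looks_like_login:
            return "not analysed"
        target, dist = classify(page_words, pages)
        return f"forgery of {target} (d={dist:.2f})" if target else "no match"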
3. RESULTS
Usually, if the filter has been trained on a certain webpage, we
obtain a similarity distance of at least 0.01, and experimentally (on
10 000 samples) we observed that on phishing websites we never obtained
a distance higher than 0.2. For training, we used the corpus of 101
pages presented earlier and a value of 0.25 for the similarity
threshold.
We tested our filter on two different corpora: one containing 10 000
forged versions of the exact pages on which the filter had been trained
(randomly selected from real phishing pages), and the URLs published on
PhishTank over a timeframe of 10 days.
We obtained a 99.8% detection rate on the first corpus, which means
we had 20 false negatives, mostly pages generated as a screenshot of
the original webpage and therefore not showing enough text content for
a discriminative decision, and a 42.8% detection rate on the PhishTank
URLs. Although the latter may seem low, our data indicates two major
reasons for these results:
* According to the Anti-Phishing Working Group, in December 2007
there were 144 hijacked brands co-opted in phishing attacks (far more
than our training corpus covers).
* PhishTank's database of fresh phishing submissions is sometimes
polluted, since anyone can submit a website (we even found
BitDefender's website submitted as a possible phishing site).
This experiment can easily be reproduced with a multi-category
Bayesian filter in which the probability function is replaced by
equation (1) and the probability of each word belonging to a category
is replaced by the number of occurrences of that word in that category.
Then, choosing the category with the smallest distance instead of the
one with the highest probability yields the same results as presented
above.
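A sketch of this substitution, assuming a generic multi-category filter
whose per-word, per-category "probabilities" are plain occurrence
counts (the class is our own illustration of the idea, not an existing
Bayesian filter API; it reuses the jaccard_distance function sketched
earlier):

    from collections import Counter

    class MultiCategoryFilter:
        """Multi-category filter where per-word scores are raw occurrence
        counts and categories are ranked by the distance of equation (1)."""

        def __init__(self):
            self.categories = {}          # category name -> Counter of word occurrences

        def train(self, category, words):
            self.categories.setdefault(category, Counter()).update(words)

        def classify(self, words):
            # Choose the category with the smallest distance rather than
            # the one with the highest probability.
            return min(self.categories,
                       key=lambda c: jaccard_distance(words, self.categories[c].elements()))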
As for false positives, on a corpus of 25 000 samples of web pages
containing login forms, or any other content that would activate the
forgery filter, we obtained 10 false alarms. Eight of them were
actually real financial institutions that should have been in the
whitelist had the filter been properly trained, while the other two
were genuine false positives (two online financial newspapers), a
problem that can easily be solved by whitelisting them.
4. CONCLUSIONS
Since phishing websites are no longer advertised through email spam
alone, we believe it is time for companies to invest more in research
and development of browser-level antiphishing protection.
[FIGURE 1 OMITTED]
The proposed method comes as an add-on to current technologies,
providing the user with extra information about the visited web pages.
Although it is not a complete solution on its own (it is ineffective on
phishing websites that do not mimic the original website), used
together with other technologies (e.g. blacklists, content and URL
heuristics) it increases the value of any antiphishing toolbar.
The obtained results show that this is a viable method of providing
forgery detection for official financial institution websites. Also, it
is not necessary to run this system on all the pages visited by the
user; focusing only on the ones that require information submission
greatly increases the user's tolerance by decreasing the time spent on
analysis.
Acknowledgements
This work was entirely supported by the BitDefender AntiSpam
Laboratory (www.bitdefender.com). We also gratefully thank Mr. Lucian
Lupsescu and Mr. Razvan Visan for their valuable help in developing
this project.
5. REFERENCES
Biemann C., Quasthoff U. (2007). Similarity of documents and
document collections using attributes with low noise, Institute of
Computer Science, NLP Department, University of Leipzig, Johannisgasse
26, 04103 Leipzig, Germany
Cosoi A.C., Petre G. (2008). Spam 2.0. Workshop on Digital Social
Networks, SpamConference 2008, Boston, MIT
Cranor L., Egelman S., Hong J., Zhang Y. (2006). Phinding Phish: An
Evaluation of Anti-Phishing Toolbars, November 13, 2006,
CMU-CyLab-06-018
Jagatic T., Johnson N., Jakobsson M., Menczer F. (2005). Social
Phishing, School of Informatics, Indiana University, Bloomington,
December 12, 2005
Kelleher D. (2004). Spam Filtering Using Contextual Network Graphs,
Available from: https://www.cs.tcd.ie/courses/csll/dkellehe0304.pdf,
Accessed: 2006-06-18
Merlo P., Henderson J., Schneider G., Wehrli E. (2003). Learning
Document Similarity Using Natural Language Processing, Geneva
Shin S., Choi K. (2004). Automatic Word Sense Clustering Using
Collocation for Sense Adaptation, KORTERM, KAIST 373-1 Guseong-dong,
Yuseong-gu, Daejeon, Republic of Korea
Wenyin L., Huang G., Xiaoyue L., Min Z., Deng X. (2005). Detection
of phishing webpages based on visual similarity, WWW2005, May 10-14,
Chiba, Japan, ACM 1-59593-051-5/05/0005
Wu M., Miller R. C., Garfinkel S. L. (2006). Do Security Toolbars
Actually Prevent Phishing Attacks?, MIT Computer Science and Artificial
Intelligence Lab, CHI 2006, April 22-27, 2006, Montreal, Quebec, Canada
Tab. 1: Format of the filter dictionary
Words     Occurrences
          Bank 1    Bank 2    Bank 3    Bank 4    Bank 5
Word 1    3         1         0         2         1
Word 2    0         0         3         1         0