
Article information

  • Title: Build your antiphishing technology in just 5 minutes.
  • Authors: Cosoi, Alexandru Catalin; Sgarciu, Valentin; Vlad, Madalin Stefan
  • Journal: Annals of DAAAM & Proceedings
  • Print ISSN: 1726-9679
  • Year: 2008
  • Issue: January
  • Language: English
  • Publisher: DAAAM International Vienna
  • Abstract: Phishing is a form of social engineering in which an attacker attempts to fraudulently acquire sensitive information from a victim by impersonating a trustworthy third party.
  • Keywords: Detection equipment; Detectors; Internet fraud; Phishing

Build your antiphishing technology in just 5 minutes.


Cosoi, Alexandru Catalin; Sgarciu, Valentin; Vlad, Madalin Stefan


1. INTRODUCTION

Phishing is a form of social engineering in which an attacker attempts to fraudulently acquire sensitive information from a victim by impersonating a trustworthy third party.

Current anti-spam technologies have achieved competitive detection rates on phishing emails. Recently, however, phishers have begun advertising their fake websites through a plurality of communication channels (e.g. email spam, instant messaging, social networks, blog posts and even SMS (Cosoi & Petre, 2008; Hatlestad, 2006)). With some starting information about their victims gathered from social network profiles (Jagatic et al., 2005), they can easily social-engineer their way to the user's trust. This means that browser-level protection must be assured in order to prevent the user from accessing the fake website, even when he has been persuaded to follow the fake URL.

Current browser-based technologies employ whitelists, blacklists, various heuristics that check whether a URL is similar to a well-known URL, community ratings, content-based heuristics (Cranor et al., 2006) and, lately, visual similarity (Wenyin et al., 2005). Blacklisting has worked well so far, but the time needed for a URL to become blacklisted worldwide in most cases overlaps with the time in which the phishing attack is most successful. Also, current content-based solutions, mostly blacklists and body heuristics (Cranor et al., 2006), do not always make use of whitelists, which can sometimes cause the filter to flag eBay's official website as a phishing website (SpamConference, 2008; Wu et al., 2006).

2. PROPOSED METHOD

In developing our method we started from the following hypothesis: in a given language, and setting deliberate obfuscation aside, the number of possible rephrasings of a given text that transmit the same or similar information (e.g. "This is your online banking account. You must log in first in order to access your funds. Please be careful to phishing attempts") is limited by the speaker's common sense (e.g. the information will be phrased in a simple, readable and understandable form). In other words, we assume that all English log-in pages of financial institutions will share a large set of common words, since they share common purposes and a specialized financial vocabulary (Landauer et al., 1998; Kelleher, 2004; Shin & Choi, 2004; McConnell-Ginet, 1973; Merlo et al., 2003; Biemann & Quasthoff, 2007).

Considering two documents A and B (in our situation, websites of financial institutions like PayPal or BankOfAmerica), we can represent them as A = C ∪ N1 and B = C ∪ N2, where C represents the common words, and N1 and N2 the distinct words. This means that the number of words necessary to construct a database with triples of the form (word, document, occurrences) is |C| + |N1| + |N2| = |A ∪ B| ≤ |A| + |B|, since the common words in C are stored only once. If we consider the case of only two documents, this technique might not bring considerable improvements, but in the case of several documents which serve the same purpose (e.g. financial institutions' websites), it is reasonable to presume that the outcome will be a large number of common words.
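The storage saving from the shared vocabulary can be illustrated with a small Python sketch; the word lists below are invented for illustration, real log-in pages would share far more financial terms:

```python
# Two toy "documents" as sets of distinct words (illustrative only).
a = {"bank", "login", "account", "funds", "secure", "paypal"}
b = {"bank", "login", "account", "funds", "secure", "transfer"}

common = a & b  # C: words shared by both documents
union = a | b   # A ∪ B: the words the database must actually store

# Storing the union needs |C| + |N1| + |N2| entries, never more than |A| + |B|.
print(len(union), "<=", len(a) + len(b))  # 7 <= 12
```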

We now define a similarity indicator between two documents, the Jaccard distance for sets:

D = 1 - |A ∩ B| / |A ∪ B| (1)

On identical documents this distance is zero, while on nearly identical documents it is close to zero. Since these are not standard sets (in an ordinary set each element appears just once, whereas here each word appears as many times as it is found in the document), the distance actually provides an acceptable similarity value based on word counts.
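As a sketch, the multiset variant of this distance (each word counted as many times as it occurs, per the remark above) can be written in Python; the whitespace tokenization is our assumption, since the paper does not specify one:

```python
from collections import Counter

def jaccard_distance(text_a: str, text_b: str) -> float:
    """Jaccard distance over word multisets: each word counts as many
    times as it occurs in the document."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    union = sum((a | b).values())          # per-word maximum of counts
    if union == 0:
        return 0.0
    intersection = sum((a & b).values())   # per-word minimum of counts
    return 1.0 - intersection / union

print(jaccard_distance("please log in", "please log in"))   # identical -> 0.0
print(jaccard_distance("please log in", "wire the funds"))  # disjoint  -> 1.0
```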

On a corpus of 101 financial institution websites from 3 different countries (the top 5 phished banks from Romania, 7 websites from Germany, and 89 randomly chosen phished US institutions that showed a high frequency of email phishing in our internal email corpus), with an average of 100 words per page, we obtained a database of just 4422 distinct words, instead of an expected minimum of 10000 words.

Considering a pool of webpages (e.g. the ones presented above), we can construct a database in the format presented in Table 1.

If the presented forgery filter needs to run on a target webpage, we compute the Jaccard distance to each institution on which the filter has been trained (i.e. whose words are found in the database). The lowest distance obtained indicates the highest similarity (judging by the specified distance) between the target webpage and one reference webpage from our database. If the computed distance is smaller than a predefined threshold, we consider the website a forged page.
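The matching step just described can be sketched as follows; the 0.25 threshold is the value reported in Section 3, while the function names and reference dictionary are illustrative stand-ins, not the paper's implementation:

```python
from collections import Counter

THRESHOLD = 0.25  # similarity threshold reported in Section 3

def distance(a: Counter, b: Counter) -> float:
    """Jaccard distance over word multisets."""
    union = sum((a | b).values())
    if union == 0:
        return 0.0
    return 1.0 - sum((a & b).values()) / union

def check_forgery(page_text: str, references: dict) -> tuple:
    """Compare a target page against every trained reference page and
    return (best_match_name, lowest_distance, is_forged)."""
    target = Counter(page_text.lower().split())
    name = min(references, key=lambda n: distance(target, references[n]))
    d = distance(target, references[name])
    return name, d, d < THRESHOLD
```

A caller would consult the whitelist first, since the original page itself would match its own reference perfectly.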

When deploying this technology, an up-to-date whitelist is a necessity, because once the filter has learned the original website, it will score a perfect match when visiting that original website. An up-to-date whitelist inhibits running the forgery filter on original websites, in order to avoid false positives.

Based on this background, our proposed method can be better understood from Figure 1. First, the presented webpage is verified against a blacklist and a whitelist. Afterwards, some simple heuristics are run on the webpage's content to check whether the page actually tries to mimic an official log-in page (e.g. it contains a submit button or words like eBay, PayPal, etc.). We introduced this step for speed optimization (e.g. it would be pointless to check whether a webpage without a submit form duplicates a webpage which has one).
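Under the assumption that Figure 1 chains the checks in the order described (blacklist, whitelist, content heuristics, forgery filter), the flow could look like this sketch; the marker words and function names are illustrative:

```python
def should_flag(url: str, page_text: str, blacklist: set, whitelist: set,
                forgery_filter) -> bool:
    """Pipeline sketch of Figure 1: cheap list lookups first, heuristic
    pre-filter next, expensive forgery filter last."""
    if url in blacklist:
        return True              # already known as phishing
    if url in whitelist:
        return False             # known original site: skip the filter
    # Heuristic pre-filter: only pages that look like log-in pages
    # (submit form, well-known brand names) reach the expensive step.
    markers = ("submit", "log in", "login", "ebay", "paypal")
    if not any(m in page_text.lower() for m in markers):
        return False
    return forgery_filter(page_text)
```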

3. RESULTS

Usually, if the filter was trained on a certain webpage, we will obtain a similarity distance of at least 0.01, and experimentally (on 10 000 samples) we observed that on phishing websites we never obtained a distance higher than 0.2. For training, we used the corpus of 101 pages presented earlier and a value of 0.25 for the similarity threshold.

We tested our filter on two different corpora: one containing 10 000 forged versions of the exact pages on which the filter had been trained (randomly selected from real phishing pages), and the URLs published on PhishTank over a timeframe of 10 days.

We obtained a 99.8% detection rate on the first corpus, meaning 20 false negatives, mostly pages generated as a screenshot of the original webpage and thus not showing enough text content for a discriminative decision, and a 42.8% detection rate on the PhishTank URLs. Although the latter may seem low, our data indicates two major reasons for these results:

* According to the Anti-Phishing Working Group, in December 2007 there were 144 hijacked brands co-opted in phishing attacks (far more than our training corpus covers)

* PhishTank's database of fresh phishing submissions is sometimes polluted, since anyone can submit a website (we even found BitDefender's website submitted as a possible phishing site).

This experiment can easily be reproduced with a multicategorial Bayesian filter: replace the probability function with equation (1), so that the score of each word belonging to a category actually represents the number of occurrences of that word in that category. Then, if instead of choosing the category with the highest probability we choose the category with the smallest distance, we obtain the same results as presented above.
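That substitution can be sketched on the Table 1 layout; the per-bank occurrence counts below are invented for illustration:

```python
from collections import Counter

# Toy occurrence table in the Table 1 format (counts are invented).
table = {
    "bank1": Counter({"account": 3, "login": 1, "funds": 2}),
    "bank2": Counter({"secure": 3, "login": 1}),
}

def classify(page_text: str) -> str:
    """Multicategorial variant: instead of the highest Bayesian
    probability, pick the category with the smallest Jaccard distance."""
    words = Counter(page_text.lower().split())

    def dist(ref: Counter) -> float:
        union = sum((words | ref).values())
        if union == 0:
            return 0.0
        return 1.0 - sum((words & ref).values()) / union

    return min(table, key=lambda cat: dist(table[cat]))
```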

As for false positives, on a corpus of 25 000 webpage samples containing login forms or other information that would activate the forgery filter, we obtained 10 false alarms. Eight of them were actually real financial institutions, which would have been in the whitelist if the filter had been properly trained, while the other two were real false positives (two online financial newspapers), a problem easily solved by whitelisting them.

4. CONCLUSIONS

Since phishing websites are no longer advertised on just email spam, we believe that it is time for companies to invest more in research and development on browser level antiphishing protection.

[FIGURE 1 OMITTED]

The proposed method comes as an add-on to current technologies, providing the user with extra information about visited webpages. Although not a complete solution on its own (it is ineffective on phishing websites that do not mimic the original website), used together with other technologies (e.g. blacklists, content and URL heuristics) it increases the value of any antiphishing toolbar.

The obtained results show that this is a viable method of providing forgery detection for official financial institution websites. Moreover, it is not necessary to run the system on every page the user visits; focusing only on pages that require information submission greatly increases the user's tolerance by decreasing the time spent on analysis.

Acknowledgements

This work was entirely supported by the BitDefender AntiSpam Laboratory. Web: www.bitdefender.com

Also, grateful thanks to Mr. Lucian Lupsescu and Mr. Razvan Visan for their precious help in developing this project.

5. REFERENCES

Biemann C., Quasthoff U. (2007). Similarity of documents and document collections using attributes with low noise, Institute of Computer Science, NLP Department, University of Leipzig, Johannisgasse 26, 04103 Leipzig, Germany

Cosoi A.C., Petre G. (2008). Spam 2.0. Workshop on Digital Social Networks, SpamConference 2008, Boston, MIT

Cranor L., Egelman S., Hong J., Zhang Y. (2006). Phinding Phish: An Evaluation of Anti-Phishing Toolbars, November 13, 2006, CMU-CyLab-06-018

Jagatic T., Johnson N., Jakobsson M., Menczer F. (2005). Social Phishing, School of Informatics, Indiana University, Bloomington, December 12, 2005

Kelleher D. (2004). Spam Filtering Using Contextual Network Graphs, Available from https://www.cs.tcd.ie/courses/csll/dkellehe0304.pdf, Accessed: 2006-06-18

Merlo P., Henderson J., Schneider G., Wehrli E. (2003). Learning Document Similarity Using Natural Language Processing, Geneva

Shin S., Choi K. (2004). Automatic Word Sense Clustering Using Collocation for Sense Adaptation, KORTERM, KAIST 373-1 Guseong-dong, Yuseong-gu, Daejeon, Republic of Korea

Wenyin L., Huang G., Xiaoyue L., Min Z., Deng X. (2005). Detection of phishing webpages based on visual Similarity, WWW2005, May 10-14, Chiba, Japan, ACM 1-59593-051-5/05/0005

Wu M., Miller R. C., Garfinkel S. L. (2006). Do Security Toolbars Actually Prevent Phishing Attacks?, MIT Computer Science and Artificial Intelligence Lab, CHI 2006, April 22-27, 2006, Montreal, Quebec, Canada
Tab. 1: Format of the filter dictionary

Words     Occurrences
          Bank 1   Bank 2   Bank 3   Bank 4   Bank 5
Word 1    3        1        0        2        1
Word 2    0        0        3        1        0