A Framework for Detection and Measurement of Phishing Attacks Reporter: Li, Fong Ruei National...

Slide 1 (of 35)
A Framework for Detection and Measurement of Phishing Attacks
Reporter: Li, Fong Ruei
National Taiwan University of Science and Technology
Machine Learning and Bioinformatics Laboratory
2/25/2016

Slide 2. Reference
Workshop On Rapid Malcode
Proceedings of the 2007 ACM workshop on Recurring malcode
Alexandria, Virginia, USA
SESSION: Threats
Pages:
Year of Publication: 2007
ISBN:

Slide 3. Outline
- Introduction
- Phishing URL Types
- Modeling Phishing URLs
  - Feature Analysis
  - Training With Features
- Analysis and Findings
- Conclusion

Slide 4. INTRODUCTION
- Phishing is a form of identity theft that combines social engineering techniques and sophisticated attack vectors to harvest financial information from unsuspecting consumers.
- Often a phisher tries to lure her victim into clicking a URL pointing to a rogue page.

Slide 5. PHISHING URL TYPES
- We examined a black list of phishing URLs maintained by Google.
- This black list is used to provide phishing protection in Firefox.

Slide 6. PHISHING URL TYPES
The prominent obfuscation techniques are:
- Type I: Obfuscating the host with an IP address
- Type II: Obfuscating the host with another domain
- Type III: Obfuscating with large host names
- Type IV: Domain unknown or misspelled

Slide 7. PHISHING URL TYPES
[Figure]

Slide 8. MODELING PHISHING URLS
- The model is a logistic regression classifier.
- The training black list and white list are built as follows:
  - We use 1245 URLs from the Google black list as our training black list.
  - We use a list of the top 1000 most popular URLs as the basis of our training white list.

Slide 9. MODELING PHISHING URLS: Feature Analysis
We categorize our features into four groups:
- Page Based
- Domain Based
- Type Based
- Word Based

Slide 10. MODELING PHISHING URLS: Page Based
- Page Rank: a numeric value on a scale of [0,1] that indicates the relative importance of a page within a set of web pages.

Slide 11. MODELING PHISHING URLS: Page Based
[Figure: Page Rank distribution for the hostnames of the white list and black list URLs]

Slide 12. MODELING PHISHING URLS: Domain Based
- This category contains only one feature: whether or not the URL's domain name can be found in the White Domain Table.

Slide 13. MODELING PHISHING URLS: Domain Based
- 51.2% of the white list URLs were present in the table.
- 0.2% of the black list URLs were found in this table.

Slide 14. MODELING PHISHING URLS: Type Based
- Type I URL
  - Almost all non-phishing (white list) URLs in our training data do not contain host obfuscation.
  - A significant portion of the phishing URLs are host obfuscated with an IP address.
- Type II URL
  - A portion of the black list URLs are Type II URLs.
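
The Domain Based and Type I features above can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function names `host_is_ip_address` and `domain_in_table`, the example URLs, and the suffix-matching rule used for the White Domain Table lookup are all assumptions, since the slides do not say how the lookup or the IP check was done.

```python
import ipaddress
from urllib.parse import urlsplit


def host_is_ip_address(url):
    """Return True if the URL's host is a literal IPv4/IPv6 address.

    Assumption: this only catches dotted-quad IPv4 and bracketed IPv6 hosts.
    Hex- or decimal-encoded hosts are not decoded here.
    """
    host = urlsplit(url).hostname  # None when the URL has no scheme/netloc
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True


def domain_in_table(url, white_domains):
    """Return True if the host, or any parent domain of it, is in the table.

    `white_domains` is a hypothetical stand-in for the White Domain Table.
    """
    host = urlsplit(url).hostname
    if host is None:
        return False
    labels = host.lower().split(".")
    # Check every dot-separated suffix: www.example.com, example.com, com
    return any(".".join(labels[i:]) in white_domains for i in range(len(labels)))


if __name__ == "__main__":
    table = {"example.com"}  # made-up table contents for illustration
    print(host_is_ip_address("http://192.0.2.10/login"))
    print(domain_in_table("http://www.example.com/a", table))
    print(domain_in_table("http://example.com.evil.test/", table))
```

Matching on hostname suffixes rather than substrings is a deliberate choice in this sketch: a host such as `example.com.evil.test` contains a trusted name but does not end with it, so it is not treated as being in the table.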

Slide 15. MODELING PHISHING URLS: Type Based
[Figure: Distribution of Type I and Type II URLs in the training data]

Slide 16. MODELING PHISHING URLS: Type Based
- Type III URL
  - We determine the number of characters present after an organization name in the hostname.

Slide 17. MODELING PHISHING URLS: Type Based
- Non-phishing URL (ending in bin/getmsg)
  - 0 characters after msn.com and before the path separator
  - The maximum number noticed in a white list URL is 14 characters.
- Type III phishing URLs
  - 7.34 characters (on average) after the target and before the path separator
  - A maximum of 63 characters

Slide 18. MODELING PHISHING URLS: Word Based Features
- Phishing URLs are found to contain several suggestive word tokens.
  - login and signin are very often found in a phishing URL.
- We discarded all tokens with length < 5, since these contain several common URL parts such as www.
- We discarded organization name tokens.
- We further removed query parameters.

Slide 19. MODELING PHISHING URLS
[Figure: Distribution of these features in our training set]

Slide 20. MODELING PHISHING URLS: Training With Features
- Our labeled data consisted of 2508 URLs:
  - 1245 were phishing URLs
  - 1263 were benign URLs
- Phishing URLs were placed under the positive (true) class; non-phishing ones were under the negative (false) class.
- 66% of the URLs were used for training and the remaining 34% were used as the test set.

Slide 21. MODELING PHISHING URLS
- To indicate the relative strength of each feature in identifying a phishing URL, we report the corresponding odds ratios, e^(coefficient).

Slide 22. MODELING PHISHING URLS
[Figure]

Slide 23. MODELING PHISHING URLS: Evaluation Result
- We evaluated the trained model on the 34% test set split.
- We performed our evaluation over multiple runs with randomized partitioning.
- This evaluation gave us an average accuracy of 97.31%, with a True Positive Rate of 95.8% and a False Positive Rate of 1.2%.

Slide 24. ANALYSIS AND FINDINGS
- We collected several million URLs from August 20th to August
- The data consisted of two main components:
  - unique URLs which are visited each day
  - consecutive look up requests to these URLs

Slide 25. ANALYSIS AND FINDINGS: Average Phishing URLs per day
- The average number of phishing URLs which have been visited from Google's toolbar in a day.
- We find that on average there are 777 URL phishing attacks in a day and 5073 viewers to a phishing page.

Slide 26. ANALYSIS AND FINDINGS: Average Phishing URLs per day
[Figure: The distribution of phishing attacks on each day of our study]

Slide 27. ANALYSIS AND FINDINGS: Average Phishing URLs per day
[Figure]

Slide 28. ANALYSIS AND FINDINGS: Average Phishing URLs per day
[Figure]

Slide 29. ANALYSIS AND FINDINGS: Average Potential Phishing Victims per day
- The goal is to determine how many users interact with a phishing page.
- A user that has any interaction at a site classified as phishing is regarded as a potential phishing victim.

Slide 30. ANALYSIS AND FINDINGS: Average Potential Phishing Victims per day
- Based on the number of users who view phishing pages in a day, we can further infer the Potential Success Rate of a phisher as follows:
[Equation]

Slide 31. ANALYSIS AND FINDINGS: Average Potential Phishing Victims per day
[Figure: The distribution of phishing attacks on each day of our study]

Slide 32. ANALYSIS AND FINDINGS: Distribution of Phishing by Organization
[Figure]

Slide 33. ANALYSIS AND FINDINGS: Geographical Distribution of Phishing
- To determine the country that hosts a particular phishing URL, we used Google's IP to Geo-Location infrastructure.

Slide 34. Anti-Phishing Tools
[Figure]

Slide 35. CONCLUSION
- We use our features in a logistic regression classifier that achieves a very high accuracy.
- One of the major contributions of this work is a large scale measurement study conducted on Google Toolbar URLs.
- On average we found around 777 unique phishing pages per day, and on average 8.24% of the users who view phishing pages are potential phishing victims.
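
As a companion to the training and evaluation slides, the quantities reported there (odds ratio as e raised to a coefficient, accuracy, True Positive Rate, False Positive Rate) can be computed as in the sketch below. It is a minimal illustration under stated assumptions: the helper names `odds_ratio` and `evaluate` and the toy labels are invented for this example and are not taken from the study. Following the slides, phishing is treated as the positive class (1) and non-phishing as the negative class (0).

```python
import math


def odds_ratio(coefficient):
    """Odds ratio of a logistic regression feature: e ** coefficient."""
    return math.exp(coefficient)


def evaluate(y_true, y_pred):
    """Compute accuracy, TPR, and FPR from 0/1 labels (1 = phishing)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # Share of phishing URLs that were flagged as phishing
        "tpr": tp / (tp + fn) if (tp + fn) else 0.0,
        # Share of benign URLs that were wrongly flagged as phishing
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
    }


if __name__ == "__main__":
    # Toy labels, made up purely to exercise the functions
    truth = [1, 1, 1, 0, 0, 0, 0, 0]
    predicted = [1, 1, 0, 0, 0, 0, 0, 1]
    print(evaluate(truth, predicted))
    print(odds_ratio(0.0))  # a zero coefficient leaves the odds unchanged
```

An odds ratio above 1 means the feature raises the odds of the positive (phishing) class, and a value below 1 means it lowers them, which is why the slides use it to compare feature strength.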