How to detect Phishing URLs using PySpark Decision Trees - PyCon India 2015


The topic is how to detect phishing URLs using PySpark. Our speaker is an independent security researcher whose interests lie in network security, data science, and big data. Over to you.

Hello everyone, thanks for showing up. I mainly work on internet threats — security, mail, privacy, and malware engineering efforts. If you want access to these slides, you can go to the web page now and download them yourself, in case you can't see them at the back.

So what is this talk about? Essentially, it is about my attempt to solve a problem. I wouldn't yet say it has been a successful attempt, because it is still ongoing. The problem is detecting phishing URLs: what has been done until now to detect these malicious entities on the web and protect ordinary people against them, and why I made the choices I made — PySpark, MLlib, and various other things. I am by no means a machine learning expert, so you will have to take what I say about machine learning with a grain of salt, and this is not a success story about a finished solution.

So what in the world is phishing? I take it to be any form of credential theft — theft where intellectual property or personal details like usernames, passwords, and credit card numbers are taken from you when you thought you were giving that data to a legitimate party and you were in fact not. Why solve this problem? It stems from a very local issue: in my city, the police department gets about 50 complaints a day regarding phishing. Whether that is in the form of malicious emails, or social engineering — someone calling you, claiming to be from a bank, and taking your credit card number — it all ties into the phishing problem. And this is a problem that has not been solved as thoroughly as many others. For viruses and malware you have anti-viruses; for APT-style threats there are plenty of products out there you can go and buy; but there is not much that stops phishing scams, at least for the common man. There is also a moral side to it: phishing preys on the gullible rather than the tech-savvy. I can pretty much assume that nobody here is going to be affected by phishing, because you know what you're doing, but there is a majority of the populace that does not, and for whom this problem is yet to be solved. Where it gets even more difficult — there were two news articles that came out relatively close to each other in which a senior police official of Karnataka actually lost money because he fell victim to a phishing scam.

So what do we do in today's world to beat these malicious entities? We have the ever-prevalent blacklists: you can download a list of URLs every single hour telling you that these URLs are now malicious, or are hosting content that is detrimental to people in a broad sense. Then there are people who write YARA rules, if you are familiar with them, over email bodies: if these words appear in an email, assume it is a phishing email, or something of that sort. But there is no solution that people are really happy with yet. Google actually did a pretty good job with the Safe Browsing initiative, but that is only applicable if you are using Chrome or a browser bundle tied in with Google that can offer you that protection. If I get a link on my corporate account and I have to click on it, there is no filter, no Safe Browsing, that I can leverage at that point to really make a difference.

People also say: let's be strict about it — we will download this beautiful list of the Alexa top five million domains and only allow clicks on those domains, and if anyone wants to go to any other website we will show a small warning before letting them go ahead. But that only makes things a tiny bit harder; it doesn't stop the problem at all. DMARC does a great job — it can stop spam to a great degree — but not phishing URLs, because, if you know about the, what shall we call it, job-hiring spree that happens every time graduates come out of school, you will notice that people create identities like "my company job interviews 2015" at gmail.com, and you can't block gmail.com from sending you email. So there is nothing you can do unless you actually look at the content being clicked on and decide for yourself whether it is good or bad.

Like any other approach, I had to start at ground zero. My ground zero was existing research on detecting phishing URLs, which led me in the machine learning direction for some time — I had fiddled with these things before, but I am, again, by no means an expert at machine learning. The most success I had — extracting features from these data sets of phishing URLs and emails and running them through various classifiers — was with decision trees. That intuitively makes sense to anyone in the security industry, because we as a community tend to think of maliciousness and benign activity in terms of rules. We say: if this happened, and this, and this, then it is bad, or it does not meet my level of comfort; and if this and this happen, then I am relatively okay with it.
Decision trees were also pretty expressive in saying what I wanted to come out of such a model: hey, if we find a URL and the content of the web page has such-and-such entities, then we don't like this. Testing it out and inspecting it by hand became very simple and easy to do.

Which brings me to my second choice: PySpark and MLlib. I tend to be a little biased towards Spark for my own reasons, but it allows me to crunch through far more web pages than I normally could. There is a very good resource for phishing URLs called PhishTank: you can get brand-new phishing URLs from there every hour, and you just run a crawler that fetches the HTML for you, time and again. This gave me a pretty good data set. On a parallel note, MLlib's API resembles scikit-learn, which you all know about, which made it much easier to find documentation, cross-reference, and see whether I was on the right track.

For this tiny experiment of mine I gathered about twelve gigabytes of web pages, which doesn't seem like a lot — until you realize it is a pain point to parse HTML and extract features out of every single web page. When you have two and a half lakh web pages, with about five to ten thousand added every day, you realize why you need a cluster computing engine like Spark and can't just write a for loop and go to bed. Which again brings me to the point that I did not want to roll out my own multiprocessing framework, where I say: okay, this process consumes this section of the web pages, this other process consumes that section, and we bring the features together at the end — not a wheel I wanted to reinvent. Also, you can save a model in Spark and load it anywhere else, which makes deployment easier. I don't know why they had to wait until 1.4 to do that.
But whatever. So what are the features that actually work? Typically, dynamic DNS domains are used mostly for malicious activity: if you see any traffic going to a dynamic DNS domain that you have not explicitly visited, I can assure you it is not something good — something is definitely trying to contact you. You also never go to a direct IP address in normal situations: you typically go to Google, or a search engine of your choice, type in some search query, and follow a link, so you never actually interact with IP addresses directly. These are the things I thought would be indicators of phishing URLs, because phishing web pages are very short-lived — they last only until the hosting provider realizes this content is causing more harm than anything else and takes it down.

But the crux of everything is the dynamic part. The only way humans detect phishing pages is that we look at the web page and the URL and say: you know what, this Yahoo logo stopped shipping in 2002, how can they still have a web page with that logo on it? Or you may see that the page doesn't load properly, or certain errors. These are the mistakes phishing attackers make, and they are okay for you and me to spot, but it is very difficult to convince an algorithm to look at a logo and tell you which logo it is, unless you bring in a separate science of its own to figure out whether a particular page is genuine. A little counterpoint here is SSL and certificate pinning — but most people don't pay attention to whether they are on a pinned HTTPS site. When they go to google.com, if they see Google's logo and it says "enter your email address", they will happily type it away.
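The static URL indicators above — dynamic-DNS hostnames and raw IP addresses in the URL — can be sketched in plain Python. This is a minimal illustration, not the talk's actual feature extractor; the provider list and the extra features (`@` in the URL, many subdomains) are assumptions added for the example.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical sample of dynamic-DNS providers; the real feature set in the
# talk is larger and not reproduced here.
DYNAMIC_DNS_SUFFIXES = ("no-ip.org", "dyndns.org", "duckdns.org")

def hostname_is_ip(hostname):
    """True when the URL points at a raw IP address instead of a name."""
    try:
        ipaddress.ip_address(hostname)
        return True
    except ValueError:
        return False

def uses_dynamic_dns(hostname):
    """True when the hostname sits under a known dynamic-DNS provider."""
    return any(hostname == s or hostname.endswith("." + s)
               for s in DYNAMIC_DNS_SUFFIXES)

def static_url_features(url):
    """Binary features computed from the URL text alone (no page content)."""
    host = urlparse(url).hostname or ""
    return {
        "direct_ip": hostname_is_ip(host),
        "dynamic_dns": uses_dynamic_dns(host),
        "has_at_symbol": "@" in url,        # classic URL-obfuscation trick
        "many_subdomains": host.count(".") >= 4,
    }
```

Features like these are "static" in the talk's sense: they need only the URL string, so they are cheap enough to compute for every link before any HTML is fetched.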
The closest thing to a bulletproof feature came from forms: if you see a form asking for an email address and a password, and you know that the POST request of that form is not actually going to google.com or the corresponding legitimate service, then you definitely know this is not something that is good for anyone. We leverage those kinds of features in MLlib: all we do is take about 10 or 12 features, put them in a true-or-false vector — a one-hot vector, I think you call it — and let the model train. It gives you a very beautiful tree that says: if this, then malicious; if not, then benign; and so on and so forth.

For someone who doesn't know a lot about decision trees: the tree also tells me which features are useless. If I claim that having a form with a password field is a useful feature for detecting a phishing page, but that feature is present in benign pages and phishing pages with equal probability, then the feature is useless — it gives me no ability to distinguish between the two sets — and I might as well throw it out. The algorithm does this over and over until it finds the set of features, arranged in a particular way, that gives the best fit, so to speak, and whatever does not add value is thrown away.

What we got surprised me at the beginning: once the model was trained and I ran it on the test data, it classified about 99 percent of those web pages correctly. If you are thinking that seems too good to be true, you are correct — because in the real world we actually have a lot more phishing pages than real pages. If you want to find out whether a Gmail login page is real, you will get only one example of it, from google.com; but if you go looking for phishing pages imitating the same thing, you will find hundreds of them.
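The "useless feature" intuition above is exactly what information gain measures: a feature that appears with equal probability in both classes reduces no uncertainty and scores zero. A minimal sketch on toy data (the feature names and samples are invented for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in Counter(labels).values())

def information_gain(samples, labels, feature):
    """How much knowing `feature` reduces uncertainty about the label.
    `samples` is a list of dicts of binary features; `labels` are 0/1."""
    parent = entropy(labels)
    remainder = 0.0
    for value in (True, False):
        subset = [lab for s, lab in zip(samples, labels)
                  if s[feature] is value]
        if subset:
            remainder += len(subset) / len(labels) * entropy(subset)
    return parent - remainder

# Toy data: 'password_form' separates the classes perfectly, while
# 'has_logo' appears with equal probability in both classes.
samples = [
    {"password_form": True,  "has_logo": True},   # phishing
    {"password_form": True,  "has_logo": False},  # phishing
    {"password_form": False, "has_logo": True},   # benign
    {"password_form": False, "has_logo": False},  # benign
]
labels = [1, 1, 0, 0]
```

Here `information_gain(samples, labels, "password_form")` is 1.0 bit while `information_gain(samples, labels, "has_logo")` is 0.0 — the tree learner would split on the first and discard the second, which is the pruning behaviour described above.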
So there is this imbalance problem: the data set for benign pages is far smaller than the data set for malicious pages, and you cannot actually get more benign pages, because there are only so many services in the world that you have to defend. And there were a lot of false positives — about 35 to 37 percent when you gave it benign URLs and they came back misclassified as malicious — which we quickly offset with some whitelisting. Typically, barring some edge cases, you will be okay if you are going to google.com or, say, docs.google.com; there are cases where things might go wrong, but you are relatively safe. Offsetting these misclassifications with a whitelist — some mechanism of trusted sites — gives far better results, and if you could incorporate something like Safe Browsing (hopefully we get to, if Google releases it as open source), we could do much, much better.

We also quickly realized that having Alexa as a feature — "is this URL in the Alexa list?" — does not help at all, because dyndns.org actually appears in the Alexa list, at which point all the dynamic DNS domains go out the window. People might also say: these phishing URLs come up on very new domains — someone registers a domain last week and uses it for something malicious this week — so give a reputation score to the domain name and decide from that whether it is good or bad. That does not work either, because people register and keep domain names for a long time, or use dynamic DNS services, or these attackers may have moved from malware-like activity to phishing-like activity; you cannot really judge what is going on just from the reputation of the domain.
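The whitelist offset described above can be sketched as a post-processing step on the model's verdict. The whitelist contents are invented for the example, and the registered-domain heuristic is deliberately naive (real code should consult the Public Suffix List so that things like `.co.uk` are handled correctly):

```python
from urllib.parse import urlparse

# Hypothetical curated whitelist of trusted registered domains. Blindly
# importing Alexa does not work -- as noted above, dyndns.org is in it.
WHITELIST = {"google.com", "github.com", "python.org"}

def registered_domain(hostname):
    """Naive last-two-labels heuristic for the registered domain."""
    parts = hostname.rstrip(".").lower().split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else hostname

def final_verdict(url, model_says_malicious):
    """Offset model false positives with a whitelist of known-good domains."""
    host = urlparse(url).hostname or ""
    if registered_domain(host) in WHITELIST:
        return "benign"   # trusted domain overrides the classifier
    return "malicious" if model_says_malicious else "benign"
```

So a false positive on `docs.google.com` is silently corrected, while verdicts on unknown domains pass through unchanged.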
There is still a tiny problem with this approach: every time you need to classify a web page, you need to actually fetch the HTML for that page, figure out what features it has, send them to a model, and get the answer back. That is both a good and a bad thing, because if you have a browser extension that can compute these features locally and send them up to some service, the lookup takes less than a microsecond — once you have the model loaded and listening on some port, the decision itself is very quick. So the idea is to extend this approach: the features are computed locally by whoever has the extension, only the feature vector comes to us, and we just reply with "don't click on this" or "go ahead". It would benefit a lot of ordinary people in the world to be protected against phishing attacks this way.

A couple of other problems arise out of this approach. Once you put it out in the wild — and this is true of any problem in security — attackers will find ways to subvert your feature-gathering capability. If you are looking for a password field inside a form, they might instead put two bare text inputs and do some crazy things to evade your feature scanning, which brings you back to square one. As they rightfully say, the defender has to be right 100% of the time and the attacker needs to get lucky just once, so that is going to remain an active problem. Also, getting new pages every now and again means you need dedicated infrastructure to crawl these web pages time and again and update your model every single time. It is still not as bulletproof as a human looking at the page, but it is better than most approaches out there.
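The cheap server-side lookup described above — the client computes the feature vector, the server just walks the trained tree — can be sketched like this. The nested-dict serialization below is a hypothetical format invented for the example, not MLlib's own model format, and the tree itself is a toy:

```python
# A trained decision tree exported as a nested dict (illustrative shape):
# internal nodes test one binary feature; leaves carry the verdict.
TREE = {
    "feature": "form_posts_offsite",
    "true":  {"verdict": "malicious"},
    "false": {
        "feature": "dynamic_dns",
        "true":  {"verdict": "malicious"},
        "false": {"verdict": "benign"},
    },
}

def classify(feature_vector, node=TREE):
    """Walk the tree with a client-supplied feature vector. No HTML
    fetching or parsing happens here -- the browser extension already
    computed the features, so this lookup is essentially instant."""
    while "verdict" not in node:
        branch = "true" if feature_vector[node["feature"]] else "false"
        node = node[branch]
    return node["verdict"]
```

A service built around `classify` only ever sees booleans, never page content, which is also a nice privacy property for the people being protected.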
One more problem is that Spark does not have a serving API, so you can't just take a model, have it accept any input that comes in on port 80, tunnel it to the model, and write the response to a database. You still can't do that, so you have to find hacks — writing tiny pipes between things, saying "you communicate with this guy, you communicate with that guy" — and it gets messy really fast. But we will deal with such problems as they come up.

That is a rough sketch of the phishing URL work. This technique doesn't apply only to URLs but also to emails: the same way you extract features out of HTML, you can extract features out of email communications, be it Exchange email or whatever else. That is basically the approach we have taken so far, and what we have is a mix of static and dynamic features. By static features I mean features that rely only on the URL — just the text of the URL, nothing to do with the content of the web page. Then there are features that rely on the HTML of the page, which means we are actually looking at the content and deciding for ourselves whether it is a legitimate page or not. So this is where we are now, and that's all I have to share. If you have approaches you have tried and tested, or if you would like to contribute, you are welcome to take the data set or the code; once it meets some measure of quality it will be open source, since the point is to have it accessible and usable by everyone in the world who can fall victim to phishing. I can take any questions you have on security or phishing.

Audience: As you said, for most of these things you need to parse the content of the page and then decide. Have you tried the approach of playing with plugins, reverse proxies, or some other such setup?
Because that helps, right? As a reverse proxy you have all the requests going through and the responses coming back, and it is fairly easy — like a honeypot — to identify the patterns you want to learn, and then block or take some other action. There are certain things we have done in the past along these lines; I'd be happy to interact after your talk about the tools we have used — which we use on email to mark ham or spam — applied to phishing as well, using a reverse proxy.

Speaker: So, two things. The reverse proxy is useful only when you know where the traffic is going to pass. For example, here at PyCon I know all my web requests are routed through some server, and I can put defensive measures there. The second alternative is to do this inside a browser extension or something like it, because browser extensions are really lightweight; they run per computer, regardless of where that computer is, and they already have access to the HTML of the page — which means you don't have to re-examine the page in transit. Extracting features is simpler per computer, in the browser, than at a gateway.

Audience: I meant the learning part — use a honeypot or reverse proxy to learn those things, then put those features in your browser extension or elsewhere.

Speaker: True, I agree, that is definitely something for the future.

Audience: Are the features mostly about textual content, or features of the DNS — which features exactly?

Speaker: The features are all binary except for one. One feature is whether the domain name you see in the URL is the same domain the form's POST request is going to.
That's one feature. Another is this: if someone is phishing for eBay credentials, you will traditionally see a URL that looks something like signin.ebay.com.hello.xyz.some-bad-domain.ru. In that case the top-level domain is actually not ebay.com — the URL is just dressing itself up to look like ebay.com. We try to find these brands in the URL and say: if we find a legitimate brand name in a URL that does not belong to that brand, mark the feature true. I'd be glad to share the rest with you — there are about 15 of them and I can't remember them all off the top of my head right now — so we can talk about all the features offline.

Audience: Sorry, this is a machine learning question. Just out of curiosity, which techniques other than decision trees have you tried, and what made you zero in on decision trees?

Speaker: I have tried everything that comes in the MLlib package, from logistic regression to naive Bayes classifiers — all of them. I found decision trees to be good. I have just discovered that even random forests give somewhat similar performance, but I haven't had a chance to deep-dive into random forests and see what they do. Spark doesn't require you to change things much between classifiers, so I could basically run all of them and check at the end which gives the best detection rate, and decision trees turned out to be the one.

Audience: Is classification accuracy the only metric you were looking at when deciding between them?

Speaker: Yes — that is what I am aiming for, from a detection-efficacy point of view. My problem is to classify something as malicious when it is malicious, so to me classification is the primary motive. I realize there is going to be a false positive rate and a false negative rate, but it doesn't really hurt not to track those separately right now.
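The brand-in-URL feature just described can be sketched as follows. The brand list is a small hypothetical sample, and the registered-domain heuristic is naive (production code should use the Public Suffix List):

```python
# Hypothetical sample of brands and their real registered domains; the
# talk's actual brand list is not reproduced here.
BRAND_DOMAINS = {"ebay": "ebay.com", "paypal": "paypal.com",
                 "google": "google.com"}

def registered_domain(hostname):
    """Naive last-two-labels heuristic for the registered domain."""
    parts = hostname.rstrip(".").lower().split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else hostname

def brand_spoof(hostname):
    """True when a known brand name appears in the hostname but the
    registered domain is not that brand's real domain -- e.g. the
    signin.ebay.com.hello.xyz.some-bad-domain.ru pattern."""
    host = hostname.lower()
    reg = registered_domain(host)
    return any(brand in host and reg != real
               for brand, real in BRAND_DOMAINS.items())
```

So `signin.ebay.com.hello.xyz.badsite.ru` is flagged (its registered domain is `badsite.ru`, not `ebay.com`), while the genuine `signin.ebay.com` is not.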
Audience: [question, partly inaudible, about comparing suspect pages against known-good copies of the legitimate page]

Speaker: I am not doing that right now, and I have found out why not. Typically, if you go to, say, the Gmail login page, that page does not actually show up consistently from time to time — it may be one thing today and something else tomorrow. Not the viewing experience; I'm talking about the underlying JavaScript and HTML that comes with it. It also begs the point that I could then only protect people against web pages I have previously seen benign versions of. For example, if I have the page for ICICI's web-login interface but not for IDBI Bank, and I get a phishing page for IDBI Bank, I have no reference to compare it against. That seemed like a bottleneck that would not be a scalable solution — but I am happy to be proven wrong on this.

Audience: And what has the experience of using Spark from Python been like?

Speaker: Spark cheats a little bit: when you use PySpark, it is not implementing everything natively in Python. They have a gateway of sorts — like slipping a note to someone through a channel under the door — that ships tasks underneath, and everything actually runs on the JVM rather than on the Python interpreter. But yes, by far, the experience of using Python is far greater than using, I don't know, R, or anything else from the first line onwards. The experience of using Spark as a cluster computing framework, as a means of processing large data sets very quickly, is very, very easy in Python compared to anything else. Thank you.