Bringing Research to Real Life

Let me get my [mic] on. Okay, can everyone hear me all right? Good. All right, hi folks. So today I'm going to be talking about freeing the research papers, and we'll get a little bit more into that soon.

So this story kind of starts December 15th. I was at work in the office and I had about 1.2 million rows of data. That was the data I was working with, and I had 83 potential features, so that's the number of columns in my data set that, you know, were cleaned and that we kind of had an understanding of. My job was essentially to cluster; it was a clustering problem. At that point we had done exploratory data analysis, we had talked to domain experts and figured out some key features we should cluster on, and that was my job.

So I dug into it, pulled out scikit-learn, and started playing around. I used the k-means clustering technique, I used an agglomerative clustering technique, got a bit into hierarchical clustering, and then I tried the DBSCAN clustering technique, which is a spatial clustering technique. And I really liked DBSCAN. I hadn't played around with it before and it was kind of interesting to work with.

One of the things about DBSCAN is there's actually a parameter to the model called an epsilon value that specifies the maximum distance between two points for them to be considered in the same cluster. That means parameter tuning, and, you know, had I known about Kevin's technique yesterday from his talk, maybe I wouldn't have had to speak about this, but I was not about parameter tuning. So I set out and did some Googling. I tried to figure out if there was a newer version of the DBSCAN clustering technique, or an improvement on it, out there that could help me out. And I did: I found a paper out of the University of Munich in Germany, written in 1999, that describes a technique to solve my problem, which is not having to tinker with the epsilon value for my model. It's not a clustering technique per se. If you're interested in learning more about that, there's
the link to the paper; feel free to check it out.

So I set out to do what I usually do when I find a new machine learning technique or a modeling technique. I went over to the scikit-learn website and I Googled "OPTICS clustering", and this is what I came up with: there is nothing. And so I had a panic attack, because that means there's no way I'll ever be able to solve this problem. But I'm paid hourly, so I thought, is there a way for me to turn this research paper into something that I can play with, into code that I can play with? And it took me a while, but I get paid hourly, so it was a good thing. So this talk is going to be the techniques and the lessons I learned on taking this research paper and turning it into a project, or a bit of code, that I was able to publish online and put up there.

So the first thing you have to do is RTFA, which is read the abstract. So many people don't do this, right? Like, actually read it. Get a sense of the mission of the paper, the purpose of the paper, why this group of, you know, four intelligent people decided to spend years of their life doing this. You can get a sense of that from the abstract, so please do read the abstract.

The second thing I did was look at the pretty pictures. Pictures are very informative. They give you a sense of where the paper is going and they orient you in the proper direction to understand the paper. So look at the pretty pictures.

And then you want to translate the ancient alien hieroglyphics. I am not a fan of math. At some point in my life I wanted a PhD in applied math, but I was like, that's never going to happen, and it hasn't happened. So I always try to give meaning to those symbols that you often see in research papers. The way I see it is: if you see these squiggles, think set in Python. If you see these squiggles, think about iteration, so for loops or list comprehensions. And if you see these squiggles, think about boolean logic, so your not, your or, and your and, in that
order for those specific symbols.

The next thing you want to do, if this is a computational paper, a paper outlining an algorithm: you'll find some pseudocode in it. So take a peek at the pseudocode, read it, understand it, and you'll be good to go for now.

At this point you're going to RTP, which is read the paper. After you've done all that, you're kind of aware of the context you're working in, what direction you're going to be headed, and what kinds of things you're supposed to be understanding. It'll be much easier than just reading the abstract and diving into the paper.

And then what you're going to do is write pseudo-Python. The way this happens is you take a block; so this is, in the paper for OPTICS clustering, a definition describing a function to find the core distance of a point. And you'll notice, just reading through it, you've got an if statement there, obviously, and you've got a lot of variable definitions. So I just took that and wrote it into what I thought would be Python code. This is the first version of my function. Note that in no way does it work. This doesn't work at all, it makes no sense. But what I've done is I've transferred the language in this paper, and the symbols and the math that are usually really overwhelming, into something that I understand and something I'm comfortable with. It's way easier for me to debug this and work through it than it is to figure out what's going on from that. That's very important.

Then you're going to read the paper closely, and then refine and fix that pseudo-Python until it's better and better, and you're going to repeat that about 9,000 times. Don't get frustrated; it happens to all of us. I worked on this for like two weeks non-stop, nine hours a day, and it eventually happened, so that's great.

And then at the end of all that you're going to write tests. So this is a quick little test I wrote for the OPTICS clustering technique. I used some basic data points that
I kind of had an idea of how they should be clustered eventually, and wrote some unit tests for it. And then once you pass those tests and you're done, make sure you share it, because this is what it's all about. I feel, as programmers in Python, we have this really robust library called scikit-learn, and when it doesn't work for us we freak out. But don't freak out. Just look at the paper, understand it, produce the code that relates to it, and then publish it online. Keep the movement moving.

At this point I'm done. I went through this really fast, which I guess just happens when you talk like a really fast person, but that's more time for you guys to ask your questions and give comments and rants and any, like, deep confessions you have. Questions?

Audience: What's your top-level domain? Is that .com?

Speaker: My top-level domain is safia.rocks. There is a .rocks top-level domain. Get on it, folks.

Audience: [Question inaudible.]

Speaker: No, I didn't. That's a good idea, I should have done that. But at that point I was more interested in just learning this; doing it is the best way to do it, and to really immerse myself in the process. And there was actually code out there for OPTICS clustering, but I was like, no, I want to really understand this and do it myself. And also, that code was messy.

Audience: So at the beginning of the previous century, I guess, there was this program of constructivism in mathematics, in which, when you have a theorem, they wanted to show a computational view of how to actually achieve the results. But the problem was always [unclear], essentially, well, existential [unclear]. Sorry, [unclear]. What I mean is, not all of [unclear] papers can be instantiated in code, right? Sometimes you show that something exists but you don't know where it is.

Speaker: Yeah. So I guess that sort of depends on what domain you're working in. So this was specifically a machine learning paper. Not only a machine learning paper, it
was an algorithmic paper, so they had to be able to show pseudocode and kind of explain not just the theoretical math behind why it works, but how you could implement it. As you may have noticed, a little bit ago I showed that they had pseudocode in the paper. So this is kind of specific to certain types of papers, but I feel like the papers that this works for are very prevalent in machine learning and in computational science.

Audience: Do you frequently read papers and implement them, or was this just a one-time thing?

Speaker: This was my first time doing it. I haven't done it since because I haven't had the opportunity, but I decided to outline that process and put it out there.

Audience: How often do you find [unclear] papers when you do that? You know, sometimes people skip parts of the implementation, so if you just implement the paper straight, it actually may not work. Does that happen to you?

Speaker: Yeah, I think when that happens you have to be able to read it. I remember another situation where I was implementing, sorry, I'm trying to remember what I was looking for. It was rare event prediction: given time series data, can you predict when something unique in the data is going to happen? And in that case they were actually referencing a genetic algorithm technique that was not outlined in the paper. So I had to go out, look in the references, find that paper, and implement it through that. So sometimes you just have to look through all the possible papers when they [unclear], and go through the references. But rarely is a paper incomplete in that sense. Other questions?

Audience: What's the next thing you're going to do?

Speaker: You know what, that rare event prediction paper, I actually started working on a library for it called Endive. I was working on it for a couple of weeks in January, and I just got so busy with life and everything that I haven't touched it since then. But it's in progress; I'll probably finish it at some point.

Audience: [Partly inaudible] the authors of that
paper after you had published the code?

Speaker: No, mostly because I, [unclear], suffer from imposter syndrome, like Mark touched on yesterday. Like, [unclear], even though I put it on my GitHub.

Audience: I cannot help but think it would be wonderful if, parallel to a paper repository, there were a code repository for the people who then replicated the papers, even if not the original authors, right? So that then you could download the code and try it for yourself. Have you seen [unclear]?

Speaker: I have not.

Audience: [Unclear] that website, researchcompendia.org, is an attempt at that.

Speaker: Yeah, yeah. So this is a fairly old paper, 1999. Hopefully in the future people like me won't have to do stuff like this, or give talks like this, because the researchers will just provide their code in Python. Any other questions?

Audience: [Question inaudible.]

Speaker: Yes, there were so many bugs. It ended up being essentially the same results I would have had tinkering with DBSCAN clustering, because OPTICS clustering doesn't necessarily produce the actual clusters. It produces an ordering of clustered data, which I then just looked at and visualized, and got something similar to what I would have had I done DBSCAN clustering. So, fruitful in some senses. I think they're about to kick me off stage. Crap. No, not necessarily. But are there any more questions? Okay then. Yes, thank you all. This talk is on my website: if you want to go to safia.rocks/[path unclear, transcribed as "d 5 2015"], you can find it there.
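
The "pseudo-Python" step from the talk can be illustrated with a minimal sketch of the core-distance definition. This is not the speaker's code, which is not shown in the transcript. It assumes the usual reading of that definition: the core distance of a point is undefined when fewer than `min_pts` points lie within `eps` of it, and is otherwise the distance to its `min_pts`-th closest neighbor. The function name, the argument names, and the choice to count the point as its own neighbor are all assumptions made for illustration.

```python
import math


def core_distance(point, points, eps, min_pts):
    """Return the core distance of `point`, or None if it is undefined."""
    # Set-builder notation in the paper becomes a comprehension:
    # keep the distance to every point that lies within `eps`.
    # Assumption: `point` is in `points` and counts as its own neighbor.
    neighbor_dists = sorted(
        d for d in (math.dist(point, q) for q in points) if d <= eps
    )
    # The "if" in the definition: too few neighbors means UNDEFINED.
    if len(neighbor_dists) < min_pts:
        return None
    # Otherwise, the distance to the min_pts-th closest neighbor.
    return neighbor_dists[min_pts - 1]


# Hypothetical toy data: three nearby points and one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
print(core_distance((0.0, 0.0), points, eps=2.0, min_pts=3))
print(core_distance((10.0, 10.0), points, eps=2.0, min_pts=3))
```

Small, hand-checkable points like these are also the kind of data the talk suggests using for unit tests: the first point has enough close neighbors to get a core distance, and the isolated point does not.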