Data Science on Clojure


Good afternoon. This is a talk about data science and Clojure. I am Soren Macbeth; that's my Twitter handle.

I like to get this one out of the way very quickly: what is data science? I like this definition that I made up, which is "stuff a person who [inaudible]." Yes, that is the best definition, I think.

I work at a company called Yieldbot. We are an intent marketplace, which is to say advertising technology, everyone's favorite. If you know how AdWords works, we are very similar, except we're not Google and we don't advertise on search result pages. What we do is (yes, you're welcome) resolve every page view from the publishers that are in our network. These are premium publishers like Martha Stewart. If you've ever searched for a recipe online and you don't run an ad blocker like I do, then you have probably seen one of our ads. So we resolve these pages into sets of keywords, aggregate them, and sell them to advertisers the same way that they might buy search keywords.

We are about 80 people. We have an office here in Portland; the data team, except for one person, is located here. We have offices in New York and Boston and other places.

We are a big data company. We currently process about 1 billion page views a week across the publishers in our network. This colors the type of data science that we do, and it's one of the things that Clojure lends itself to very well.

So how do we deal with data science for a billion page views a week? We use a lot of these technologies: Kafka, Storm, Spark, Mesos, all of the large distributed computation engines that you know. How many people have used Hadoop before? Wow, that's scary. OK, good. We don't use Hadoop anymore, and I'll get to that in a little bit. So I feel sorry for everyone. It's not good.

We do a lot of machine learning on all of this data. I think we cover the four major bases of machine learning. We do a lot of text processing. As I mentioned, this involves looking at HTML and seeing what words are there, and looking at a lot of
other things from these page views that we get.

We do a lot of analytics, similar to Google Analytics but for the publishers in our network, so we're mostly interested in the intent and the keywords on the pages, not a lot of the other stuff.

Supervised machine learning: everyone's favorite, logistic regression, trying to predict the probability of clicks on ads, things like that.

Reinforcement learning: our main optimization algorithm, which runs in real time, is a reinforcement learning algorithm. This is, you know, pick a choice, see how it performed, record it, and do that over and over again to try to find the optimal choice; in our case, the optimal ad to show.

Unsupervised learning: I'm not sure if we're actually doing this right now. I've been the chief data scientist for about five years, so we've done many different things in our past. We used to do clustering for anomaly detection and some other things like that. So I have to double-check whether we are still doing unsupervised learning at the moment, but we have done it in the past.

And we do all this stuff in Clojure. We have always used Clojure; we never used anything else. When it was just me, I started with Clojure, and we have built the team and built the company with Clojure at its core. There are a number of reasons why.

The systems on the list that I showed before almost all run on the JVM. The large big data distributed processing systems run on the JVM, so that's the first thing: we need to be able to run on the JVM. That limits our choices pretty substantially.

The REPL, I think, is huge. I think it's why a lot of data scientists like Python or R or these other things. We do data exploration: being able to load data, look at it in different ways, mess around with it. I don't think it makes sense to do data science without a REPL. R, right, is just a REPL.

Clojure is fun, and this is the one that matters the most to me, because I have to write whatever language it is every day. Clojure is fun. It's been fun
since I started using it. I enjoy writing it, so I force it on everyone else on my team.

I think the most interesting thing about Clojure, and it being on the JVM, is that the delta between prototyping some code and actually getting it into production is essentially zero. We write the code, we can test in a REPL while we write, we write tests, we work with a little bit of data or a lot of data, jar it up, ship it off to whatever distributed processing system it's going to run on, try to run it on a lot of data, maybe tune it a little bit. But really it's: write it on your local machine, jar it up, send it off, and you're done.

This is, I think, very different from a lot of other places that do data science. We all write code. We all ship code to production. This code is at the core of our business; it's how we make money. The optimization is critically important. So I don't like the idea of having to write something in R, test something out, and then maybe ship it off to another engineering group who writes Java (or, God forbid, Scala) and gets it to run and has to translate it. I don't think that's necessary. It's allowed our team to stay very concise and small and do a ton of stuff, because we can all work in parallel and we can all work very quickly. We work at a REPL; when we're done, we're done.

So way back in ancient history, 2011 or so, I had to start writing some code for Hadoop. At the time there was the vanilla Java Hadoop stuff, which was a definite no. There was Pig, which I tried for a very brief amount of time, and that was OK until you wanted to do anything remotely interesting besides grouping and counting. Then you had to write Java against Pig's Java API. I know it's come a long way since then; there are now language things for the UDFs or something that I don't pay attention to. But fortunately Nathan Marz released Cascalog at about the same time. I had never written Clojure, but it was interesting, and I was sort of aware
of Clojure before this, so this gave me an excuse to learn Clojure and write Hadoop code that wasn't actually terrible to develop. I could go quickly. It was just me, and I had to write a lot of stuff, so the speed of development was the thing that, after I got over the hump of learning Clojure, made it worthwhile, and the reason we kept doing it.

We did it for a number of years and went through various machinations, and I grew to hate life and Hadoop and myself. Not Cascalog; Cascalog was good. I liked writing Datalog. It was trying to get it to run on clusters of various types: we used EMR, we used our own clusters, at some point we used spot instances, we did all sorts of crazy things.

Cascalog was good. Again, the testing was much, much better than anything else that was out there. Vanilla Hadoop didn't really have any good testing story, and I don't think Pig did either. They just weren't good. Being able to write a little function in a REPL, with your whole Hadoop job running right in Emacs, meant you could actually test the thing and then be pretty confident that it was going to work, and that was much different. So that's what kept us going on Cascalog.

In December of this year we finally got rid of our last Hadoop job, so we don't run Hadoop anymore, and life is great.

What we replaced it with was Storm. Storm came with a very nice Clojure DSL. A lot of the core of Storm, again written by Nathan Marz, was Clojure, so it had a nice Clojure API that we were able to start using right away. We didn't have to do any of the work there. We got it going quickly; we already knew Clojure, so it was very straightforward to just start writing this stuff. A lot of the libraries that we had (we favor writing small libraries so that they are reusable) transferred over into Storm with no work. So that was all very good.

Shortly after that came something called Trident. I don't have a good sense for how popular Trident is. I don't see a lot about
it on the mailing list. I'm not sure how much of a success it was in the Java community or the other communities that write Storm topologies. But we liked it. We liked it very much. We liked it so much that we wanted to use it, and we didn't want to write Java, so we wrote something called Marceline, which we've open sourced. Marceline is just a Clojure DSL on top of Trident, the Java library.

How many of you have used Storm or know about Storm at all? Less. OK. Well, Storm has the concept of topologies, and a topology is just a streaming computation: some computation that you want to do forever, streaming in data and storing the results somewhere for some other system to consume. You write things called bolts and spouts, and Trident abstracts those into functions, filters, and the like. It's modeled after Cascading, so if you've used Hadoop and you know Cascading, Trident is essentially Cascading but for Storm.

So here's an example of just a regular function that does something in a Storm topology. This partial sum JSON event looks like normal Clojure. They're mostly macros. There are a lot of docs for Marceline if you're interested in it. I think it is actually the only decent way to write Trident. I am guessing, but I suspect that the uptake for Trident in Java is not so great, mostly because writing it in Java is kind of weird and it doesn't look right. Whenever I see examples, it's odd. I think it's much better in Clojure. Obviously I'm biased.

Here is a combiner. Like Cascading, you write your transformations and you write some aggregators. It has a concept of state, and it manages the state for you, along with exactly-once processing or at-least-once; there are different semantics that you can choose. So here's an example of an aggregator that has different arities. If there's nothing on one side, it will be empty. I won't go too deeply into the code there, but it's aggregating over some bolt or spout that's sending a stream of tuples. It has an
abstraction called tuples, from Cascading basically, which are basically named field lists.

So here's what the actual topology looks like. You build it up using this threading macro. You say: here's my topology, here's where I'm going to get some data from, I want to transform it this way, it's going to produce these fields, my next aggregator or function or filter is going to accept this set of fields and return these things, and then you store them. It also has DRPC, which is nice and which we make a lot of use of. The state is constantly being stored and updated; we use Cassandra currently for our state store. So as it's going in, here we have a state query. It will... I'm losing my place. Yes, that's Marceline. I skipped forward in my slides. No, it's OK. Good.

Flambo. How many people know about Spark? Yes, lots of Spark people. Good. Flambo is our Clojure DSL on top of Spark's Java API. I'm very grateful to them for producing a Java API, because I didn't want to wrap Scala, which is why I ignored it for a very long time. I looked at it six months ago or so, and they had released a Java API that was fairly nice, so we wrote a little DSL for it.

We actually borrowed... is anybody from The Climate Corporation here? I think they're here at the conference. Yes. The Climate Corporation, who have also done some Cascalog stuff, had something called clj-spark which was sitting around, so we picked that up, started adding to it, renamed it, expanded a bunch on it, and are using it in production now.

It's similar to the way that Marceline looks. You write functions that mirror the Spark Java API functions: map, filter, reduce, fold, all the things that you're used to are there. It even looks a lot like the Scala, if that's not a horrible thing to say. I think you can look at a Scala example and figure out what it would look like in Flambo pretty easily.

So again, this is all just Clojure stuff. The reason we have a spark function is so that we
serialize them; they're serializable functions. We didn't have to do that in Marceline because everything gets compiled. There's no concept of a REPL in Marceline, really. You can run topologies locally for testing, but there's no interactive REPL in Storm. Flambo does have a REPL, and we wanted to be able to do interactive stuff in Spark on a cluster, in a REPL, so we serialize the functions and ship them around. When you define a function in the REPL, it gets serialized and then deserialized in the JVM worker processes. You don't have to think about any of that. You just use this defsparkfn macro and it'll do all that. You can use higher-order functions; you can write normal Clojure.

This is what a function looks like, like the actual work that you're doing. It's normal Clojure. You thread things through a threading macro, you pass data from one function to the other, all the kind of stuff that you all know and love. It feels a lot like writing normal Clojure. So you go and bang out some logistic regression implementation in Clojure that runs locally, and porting it over to Flambo and getting it to distribute and run is really cool.

Quickly, on the way that Spark works, for anyone who doesn't know: it has this thing called an RDD, a resilient distributed dataset. It makes these things and shoves them up in memory, and then it has a planner that ships around the transformations, their dependencies, and the data all together, so that if a task fails it can regenerate its dependencies. So it's all very nice. Everything's in memory. It's really fast for anything like iterative machine learning, where you have to go back over the same data set. What Hadoop is absolutely atrocious at, Spark does extremely well. We had a lot of things like that that we'd written in Cascalog that were not pretty, so this was an obvious thing for us to try, and it works great. It's super fast. We have a bunch of different jobs.
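
[Editor's sketch] To make the "iterative machine learning" point concrete, here is a minimal local logistic regression in plain Clojure, trained by batch gradient descent. It is an illustration only, not the code from the talk: it uses no Spark or Flambo calls, and every name in it (sigmoid, dot, predict, gradient, train, the toy examples) is hypothetical. What it shows is the access pattern described above: each training iteration makes a full pass over the same data set.

```clojure
;; Hypothetical local sketch: plain Clojure, no Spark or Flambo involved.
;; All names here are illustrative and do not come from the talk.

(defn sigmoid
  "Logistic function, mapping any real number into (0, 1)."
  [z]
  (/ 1.0 (+ 1.0 (Math/exp (- z)))))

(defn dot
  "Dot product of two equal-length numeric sequences."
  [xs ys]
  (reduce + (map * xs ys)))

(defn predict
  "Predicted probability of the positive class for feature vector x."
  [weights x]
  (sigmoid (dot weights x)))

(defn gradient
  "Average log-loss gradient over a seq of [features label] examples.
  This is one full pass over the data set."
  [weights examples]
  (let [n (count examples)]
    (->> examples
         (map (fn [[x y]]
                (let [err (- (predict weights x) y)]
                  (mapv #(* err %) x))))
         (reduce #(mapv + %1 %2))
         (mapv #(/ % n)))))

(defn train
  "Batch gradient descent: every iteration revisits the same examples."
  [examples learning-rate iterations]
  (loop [weights (vec (repeat (count (ffirst examples)) 0.0))
         i 0]
    (if (= i iterations)
      weights
      (recur (mapv - weights
                   (map #(* learning-rate %) (gradient weights examples)))
             (inc i)))))

;; Toy data: the first feature is a constant bias term, the second is the input.
(def examples
  [[[1.0 0.0] 0] [[1.0 1.0] 0] [[1.0 3.0] 1] [[1.0 4.0] 1]])

(def weights (train examples 0.5 2000))
```

In the gradient function, the per-example step is a map and the vector sum is a reduce. That is the shape a distributed version would take as map and reduce over an in-memory RDD, which is why revisiting the data on every iteration is cheap there and expensive when each pass has to go back to disk.
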
Our supervised learning stuff is all in Flambo. We do a bunch of aggregations and things for analytics, and a bunch of other stuff, in Flambo now.

What slide is... actually, good. So I have a quick demo. Because this is Clojure, and because it's just a jar, you can put an nREPL server in it, run that on your cluster, connect to it, and do data science on it. So let me make sure my video... yeah. Yes.

So I started a tunnel, and I'm going to connect to the nREPL from Emacs. This is an nREPL actually connected to our Mesos cluster, where we run Spark. Load up a namespace. We have a bunch of Parquet files that live up in S3. Did the Java update thing pop up? Yeah, that's my favorite part.

So that's the configuration. You define a configuration and make a Spark context, which is what has the references to all the different RDDs and data structures. That's the output on the cluster, so you can see things are actually happening. There's the little job UI, just so you believe me that this is actually on a cluster and not pretend. So I don't actually do it here in my little demo.

This is just going to read a day's worth of Parquet files. It's defining the RDD first, so it gives you a pointer to the RDD, and then when you perform one of a class of things they call actions, it will actually go and make a plan and run the function. Asking for the first element of an RDD is an action, so it starts doing things. This is the point where I would open up Twitter, but for those of you who aren't data scientists, I recorded the entire seven minutes that this takes to run. There it is: Java wanted me to update. And there's the first element of our Parquet file.

So this is on my local Emacs. I've got a REPL, I can do all the things you can do, but I'm connected to the cluster, loading whatever 30 gigs of data it read through there. The count thing actually has to count every single thing. RDDs are partitioned, so if you're doing operations it'll keep partitions around and
you can just go to certain partitions to get data back. First was very fast because it just got the first chunk out of the Parquet file and returned that. Count is actually going to have to go through and count all of the partitions, so this takes much longer, and rather than make you sit here for the seven minutes that it took, I'll just assume that you believe me.

So Clojure's been really, really good for us. It's been great. As the Clojure community has expanded, it's allowed us to do a lot of other things with it. My team gets a lot of requests for data or analysis or ad hoc things, so along the way we have been able to build tools to do these things, and we haven't ever had to leave Clojure. We built little web servers that we can run reports on, or display tables, or do all these sorts of different things. ClojureScript has brought a whole other avenue of exploration, so doing actually interesting data visualizations, all those sorts of things, make it really great as a basis for data science. And I think, beyond all of the other choices, you can do everything you could do in any of the other languages, except that there's no translation. You can go right to your cluster. You can run a REPL from your Emacs and do work on the cluster, and not have to write something and then have it translated, or have a whole system of things to generate static reports or something else. That's been really valuable for us.

So I want to give a quick demo of something else that we have been working on recently that uses ClojureScript, called Qbert. What we did was embed a REPL in a web page, hooked up to one of our data stores. We have a lot of Elasticsearch, so this uses the Elasticsearch client, ClojureScript, and nREPL to embed a REPL and allow us to do interactive data exploration right from a browser, which is cool. So that's just a normal map, sorted or grouped by one of the fields that are there. Well, it'll
generate a nice HTML table. So I like to call them right there. These are in worksheets, so you can actually save all this code, ship it around, and share it with other members of the team or other parts of the business. We're training folks from the business side and other places to actually use this to write Clojure. There's a nice fancy plot that can be generated. Another one. So yeah, we can save these.

It's actually an nREPL server, so the other cool thing is that each person gets a namespace, a randomly generated namespace, but if you switch to the namespace of another person, you can both work on the same variables from different browsers. While we've been training people on how to do this, we can each sit down at our computers, connect to the namespace, define something there, and have the other person use the variable and look at the data. They can try to define something; we can then show them how it's used, or print out a plot, or modify it.

So this is just an example of one of the great things about Clojure as it continues to grow. I despise the word full-stack, but it allows our team to do everything: visualizations, front ends, all the back-end work that we need to do, machine learning, all the things that a person who has the title of data scientist might do. We can do it, and we don't ever have to leave Clojure. I think that provides a ton of benefit in the amount of code that we save by being able to write libraries and reuse them across all the things that we do. We have common clients and interfaces, so when we have to write a new service or something else, we're not rewriting a ton of code. I know other libraries and languages can also share code. Clojure's very concise; we don't have to write a lot of it. We're five people on my team, of the 80-plus in the company, doing all of the data science, all the production machine learning, all the pipelines, everything else, which I think is impressive, and I think we owe a lot of it to Clojure.

So that said, I
think there are different opportunities for expanding upon the tools that are available. Visualization is the big one, and the one where I think I'm most disappointed with the availability or expansion. Everyone wants ggplot and refuses to use or build ggplot somewhere else, and I think that's pretty unreasonable. Qbert uses Vega. If anyone's familiar with Vega, it's like a JSON grammar-of-graphics spec. You can basically just make a Clojure map and then do the thing. It's a little rough around the edges; it's not perfect. But with ClojureScript and all the other things that are out there, I don't see any reason that we don't have something that's as good as or better than ggplot in Clojure now. It's something I'd like to build, but I haven't done it completely or been able to release it. Qbert is not open source yet.

The other area, and this is sort of JVM as well as Clojure, is some of the more hardcore numerical libraries. There isn't a good go-to matrix library in Clojure. There has been some work there: there's core.matrix and some of the other things, and someone recently (I forget the name of it) has been wrapping LAPACK and BLAS. So there have been some attempts there, which I think are great. I'd like to see them expand. I think Clojure is a great place to do all this stuff, and the JVM is the platform of choice for this large-scale machine learning, and I don't really want to have to write C code or JNI to do this kind of stuff. So I hope that other people will write it for me. Maybe we'll write it too, but it'd be great to have some help.

And that's it, actually. I ran through it pretty fast. We are hiring data scientists here in Portland. If you like the weather, the coffee, all that kind of stuff, look us up. I hope there are questions, since we have a fair amount of time.

Sure. So for Flambo, for the most part, you just use normal uberjar semantics. You can define a main function where you actually do your work. You create the context
that's necessary for Spark, and you can create multiple contexts within a single main function and do different things with different contexts. But the way that Spark works, it's really just up to you. You have a jar and you submit it. You tell it what namespace, what class, to run, and you don't have to do anything special; it will just call the main function of that class. So for the REPL example that I showed, the main function just starts the nREPL server, and that's all it does. Spark has a submit script that will put the jar somewhere accessible to the cluster, depending on the cluster manager. We use Mesos, but it looks the same regardless of whether you're using YARN or Mesos or whatever. You just run that spark-submit command, send it your jar, tell it what the main class is, and it'll just start it up. So you don't have to think about that.

In Spark, the abstraction between local stuff and cluster stuff is pretty clear. Anything that's an RDD is going to be distributed, and the actions that I mentioned actually pull data back to what they call your driver. That's the thing that you submitted: the driver program is the thing that will actually create the context and RDDs and the other stuff. Actions pull data back to the driver, so you can do local computations on your driver. You can run a big reduce to get a small set of data that's going to fit in your driver's JVM, return it, do normal Clojure stuff for a while, whatever you want to do: dump it in an Elasticsearch store, or send it to Kafka, or do whatever you want from your driver program.

In practice the semantics are very clear whenever you have an RDD or you're doing distributed things. Otherwise you're working with normal Clojure lists and whatnot, and all of the function serialization and all that other stuff is abstracted away. You don't have to think about it. It uses Kryo for the serialization business; Spark supports that. So when you set up
a Flambo job, it does a bunch of stuff in the background for you, for the context, to register all the different serializers for Clojure data structures and for the functions and all that other stuff. So you just use the macros when you want RDDs, and when you pull things back to your driver program you're using normal Clojure data structures.

Yeah, we were, I guess, somewhat lucky. We never had a ton of data in a data warehouse, HDFS disks living somewhere with all this stuff that we could never get. We've always used S3 for our long-term data, and would read and write data to S3, which caused no small number of issues while using Hadoop. So we were not deeply integrated in the whole Hadoop landscape. We didn't spend five million dollars on some huge Hadoop infrastructure that we're now sunk into forever, Hadoop for the enterprise or whatever. So it was mostly just moving the computation logic that we were doing into a different library.

I think Spark does have nice pathways to using your large Hadoop infrastructure. It runs on YARN (sorry I say it that way, but it's terrible), so you can use a Hadoop cluster to run Spark jobs, and I think that's a great avenue for a larger shop or someone who can't just move away wholesale. HDFS is actually the only useful part of Hadoop, I think. Spark still unfortunately uses a lot of Hadoop libraries, like for inputs and outputs. You can use Hadoop libraries for that, which is nice. The Parquet stuff, for instance, just uses the Hadoop input and output formats for Parquet, so you kind of get it for free, and you can use S3 as an HDFS thing. That's how we have always interacted with it, and still do.

But yeah, if I had a huge cluster somewhere, I would try to get Spark installed using YARN as the cluster manager and then just try to start moving things into Spark, which I think is effectively similar to what we did. I wouldn't choose that as my optimal way to run Spark, but it's certainly a path
away from MapReduce to Spark.

We use Lucene for most things for tokenization. We don't do a ton of really heavy NLP. We use TF-IDF in a lot of places; it's not crazy amounts of NLP. So mostly we need fast tokenizers and cleaners and things like that, and we use Lucene for most of that.

Yep. Yeah, we looked at Gorilla REPL. We played around with it, and we actually tried to use it first. The architecture of it (and this may have changed; it was a few versions ago) didn't want to be run as a service. It was very much tied to running on a local machine, interacting with code. It didn't like being set up as a server where multiple people were connecting to it. And the bulk of it is actually JavaScript; there's not a lot of ClojureScript stuff in there, really. None of us know JavaScript, so we gave up trying to get it to do what we wanted. There's another one called Session, I think, which we pulled some stuff out of for Qbert and some of the other things. But it's really just an nREPL client, and there are probably easier ways to do it now. I know they've been doing a lot of better REPL support in ClojureScript very recently, so it might be even easier now. But we have Emacs key bindings in there, and syntax highlighting, and some other stuff. It was largely an exercise for one of our data scientists to play around with ClojureScript, but it's already super popular as a product within the company for pulling data and doing plots, and it's much faster and easier than a lot of the other things that we have available. So I think it's going to be a really good avenue to get our chief revenue officer and our CFO to write Clojure.

Yes. We have not yet; we will. So Spark has something they call MLlib, which has a bunch of machine learning algorithms. We kind of wrapped some of it or ported some of it. Like I said, you can look at it and just sort
of write it as Clojure very easily. We found some of the implementations to be a bit suspect; they're not the way that we would prefer to implement them, so we've rewritten some of that. And yeah, when we have a more complete or more useful set of things on top of Flambo, we will open-source it.

Last question? No more questions. No more questions. Thank you all for