Simplifying ETL with Clojure and Datomic


It's my great pleasure to present our next speaker, who I work with on a regular basis. He's been teaching me the gentle art of how to troll Rich, and I'm getting better at it, but I still have a long way to go. So without further ado: Stu Halloway.

Good morning everybody, my name is Stuart Halloway, and I'm going to be talking about ETL jobs in Clojure and Datomic. There's big enterprise ETL, which has all kinds of -ilities and buzzwords, but there's also small ETL: all the time, as developers, we find ourselves in a situation where we have data in a box over here, we need to move that data to a box over there, and along the way filter, transform, correct errors, and so forth. When I started working with Clojure I was confident that Clojure was a fantastic language for doing this kind of work, for a variety of reasons. When we first built Datomic, we did a theoretically one-time ETL job to take the MusicBrainz data set, which is in Postgres, and put it into Datomic as a sample data set. Multiple people have looked at that ETL job on and off over the years, but frankly, until a month ago it had been years since I'd thought about it, and we've never really talked about ETL into Datomic — I don't think there's been a conference talk or an article or anything like that. I decided it would be fun to revisit it in light of things Clojure has picked up since we did the original job. When we wrote that original ETL job, Clojure did not have spec, Clojure did not have transducers, Clojure did not have core.async for that matter; Datomic did not have nested entity maps or lookup refs. So there were all kinds of things I wanted to go back and see how I would feel about, revisiting this job with these new tools.

From a simple-minded perspective, extract-transform-load is just that: three steps. You've got a data source; you extract, you transform, you load, and you end up with data in some destination. But when you start to break it apart, you realize that is a very facile assumption, and there are probably a lot more steps. There might be more than one data source. There might be more than one destination. You might need to do integrity checks on the data coming in to make sure it's not corrupted. You might want to clean up errors in the data, filter some data out, or join in data from an alternate source. So it quickly turns into a potential wad of spaghetti, where originally it was just a three-step process.

I'll tell you right now, you can go look at articles and books that break up these steps — if you read the first thirty Wikipedia hits on something, which I always do before I give a conference talk, because you don't want to be surprised by some random craziness people out in the world have said — you'll see this job sliced up a million different ways. I'm here to tell you that the exact steps don't really matter. Whatever you think the exact steps are, you're probably going to discover after you finish that you need different steps, or that you need to add a step, or fold some steps together for performance, or parallelize a step, or whatever. But how you compose the steps matters a lot, and the right way to think about this is as a functional pipeline: a series of steps, each of which is a functional transformation of data to data. There's no magic mutability or updates or other things in there; it really is a pure function of data to data, followed by another pure function of data to data, followed by another. And when you think about it this way, all kinds of nice things become possible that might not have been possible if you approached it a different way.
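The pipeline idea can be sketched as a composition of pure data-to-data functions. This is a minimal illustration under my own assumptions — the step names and sample data are made up, not code from the actual MusicBrainz job:

```clojure
;; A minimal sketch of an ETL pipeline as pure data-to-data steps.
;; Step names and sample data are hypothetical.
(defn clean
  "Filter step: drop rows that are missing a name."
  [rows]
  (filter :name rows))

(defn transform
  "Transform step: rename the legacy :name key to a qualified key."
  [rows]
  (map (fn [row]
         (-> row
             (assoc :artist/name (:name row))
             (dissoc :name)))
       rows))

(defn etl
  "Each stage is a pure function of data to data, so the whole
  pipeline is just function composition."
  [rows]
  (->> rows clean transform (into [])))

(etl [{:name "Led Zeppelin"} {:gid 42}])
;; => [{:artist/name "Led Zeppelin"}]
```

Because every stage is pure, any stage can later be parallelized, made durable, or checkpointed without rethinking the others.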
If you decide to go back and parallelize, you're a good fraction of the way towards being able to do that. If you want to make intermediate steps durable, you've already described your transformations in those terms. If you need to checkpoint your job — and something's going to go wrong, and it will — then having broken it into these steps gives you straightforward places to do that, and also the ability to get feedback: you start the job, you realize something didn't work, you get feedback, you adjust and recover, and you don't have to go all the way back to the beginning.

There are several obvious advantages to using Clojure for this kind of work: the JVM is super powerful, the REPL is a fantastic interactive place to explore, we encourage a data-oriented approach to programming, and now we have clojure.spec. I'm not going to talk about all of these in detail — that's Clojure 101 type stuff — but I do want to mention a couple of things. One is this notion of data-oriented functional programming. I think we can do a better job, when we're evangelizing Clojure to people coming to the language, of talking about exactly what this means and how it distinguishes Clojure from other approaches — in particular, the pair of words systemic generality. Clojure encourages you to program with a small set of functions that operate on generic data structures. It doesn't just do that, though; it takes that systemic generality everywhere in your application stack. It's not just about your program: it's also about configuration data, which could be in edn or Transit; it's about data on the wire; it's about exceptions and errors, representing what went wrong as information. It's not a local phenomenon — it really is a systemic phenomenon, thinking about programming this way.
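As a tiny illustration of that systemic generality — the maps below are hypothetical, but the point is that one generic function works unchanged on domain data, configuration, and error information alike, because they are all just maps:

```clojure
;; Hypothetical maps standing in for domain data, configuration,
;; and an error, all represented as plain Clojure data.
(def order  {:order/id 1 :order/total 42.0})
(def config {:server/port 8080 :server/host "localhost"})
(def error  {:error/category :fault :error/message "disk on fire"})

;; One generic function works on all three, because they are all
;; just maps. No classes, no per-type API.
(defn redact [m k] (dissoc m k))

(redact order  :order/total)   ;; => {:order/id 1}
(redact config :server/host)   ;; => {:server/port 8080}
(redact error  :error/message) ;; => {:error/category :fault}
```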
This is important, because when people talk about Clojure they tend to lump it in with the scripting languages. They say: over here you have the statically typed languages, C# and Java and so forth, and over here you have Ruby and Python — well, Clojure is dynamically typed, so it goes over here in the Ruby and Python bucket. I think that undersells Clojure, because the more important distinction is that, regardless of whether they're statically or dynamically typed, most of these languages encourage what I'm going to call encapsulated specificity. Whether you're in a language like Ruby or a language like Java, you walk up to a problem domain and you immediately start writing specific things into code: I'm going to make a Person class and an Order class and an OrderLineItem class. The static-versus-dynamic typing thing is a bit of a distraction from that — we do it regardless of what language we're in. Unless you're in Clojure. I think this is a big point; it's a little ancillary to my talk today, but I want to get it out there so that when you're trying to explain Clojure to people, you don't have to accept being dropped into a bucket with Ruby and Python, which is not the bucket we want to be in, I think.

Then you talk about spec. Spec is obviously new in Clojure 1.9, and it is designed to allow you to talk about specific things. I just said we're going to take this approach of systemic generality, and now we can introduce specificity in a very Clojure-oriented way: we say we're going to describe specific things with data, and then we're going to leverage that data at times and places of our choosing. I've given several talks about this — the one I gave last night for the Austin group will be up on video, and there's one from Strange Loop — so I won't go through all the details again, and in particular I'm not going to talk through all the different points made on this comparison table.
The table compares spec, static types, and tests as roads to understanding whether your program works. I just want to point out the bottom row, which is reach, because doing an ETL job is all about reach: by definition you're in at least three places — the source data, your program, and the destination data. This is a place where spec really shines.

When I approached this MusicBrainz ETL job, I was approaching a brownfield thing. It was written by some malicious Clojure developers — people like Bobby Calderwood — over the last several years, and it had a lot of traps laid there for the unwary. I walked back to this program and of course saw the classic Clojure code written five years ago: a bunch of functions manipulating stuff, and you look at a function and go, I have no idea what keys this map is supposed to have. This is one of the motivating things that led people to things like Prismatic Schema, and it's also one of the motivating things that led to spec. I had a hypothesis that I could use spec in a really dynamic way to rediscover a code base I had forgotten, and that really turned out to be the case.

What I ended up doing was describing the input side of the data as map specs. An artist in the system is required to have a global ID and a sort name and a name, and optionally it also has a type, a gender, and a country. By doing this I could strap the spec on, run all the data through a conformer, and ask: am I right, or what did I not know about? I could also strap it onto the functions that were processing the data, with instrument, and say: stop me, pull me up short, if I'm mistaken about the facts of the system. And the dynamic nature of this meant I could bandage exactly enough to get my understanding up to a level where I could work in the system — which is a lot different from, say, flowing static types through, where you stick your finger in the wall and now you have to do it everywhere in your system. I could do exactly as much as I wanted to.

I also want to make one other point here. I was working with legacy data, in the sense that the maps had unqualified keywords, and people wonder why I and other people keep going on and on about qualified keywords. If my data had had qualified keywords, then specs would have been able to pick up and validate things I didn't even notice. I have a bunch of keywords listed here in a comment that also happen to appear in artists sometimes; imagine I didn't know that when I wrote the map spec. If I were using qualified keywords everywhere, those things would be self-identifying, and when spec was validating — even though I didn't know about them when I wrote the spec — spec would know about them and be able to validate them.
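A sketch of what that artist map spec might look like. The predicates here are my guesses, not the ones from the actual job; the legacy data's unqualified keys are why this uses :req-un and :opt-un:

```clojure
(require '[clojure.spec.alpha :as s])

;; A sketch of the artist spec described above. The legacy data used
;; unqualified keys, hence :req-un / :opt-un. Predicates are guesses.
(s/def ::gid string?)
(s/def ::name string?)
(s/def ::sort-name string?)
(s/def ::type keyword?)
(s/def ::gender keyword?)
(s/def ::country string?)

(s/def ::artist
  (s/keys :req-un [::gid ::sort-name ::name]
          :opt-un [::type ::gender ::country]))

(s/valid? ::artist {:gid       "xyz"
                    :sort-name "Zeppelin, Led"
                    :name      "Led Zeppelin"})
;; => true

(s/valid? ::artist {:name "Led Zeppelin"})
;; => false (missing :gid and :sort-name)
```

Running all the input data through `s/conform` with a spec like this is how you ask the data, rather than the code, what you misunderstood.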
Now, for the purposes of this talk — which I've written sort of between midnight and 2:00 a.m. over the last couple of weeks — I chose not to go back and make that data change. Which is another important point: that would have been additional leverage. If I had used namespace-qualified keywords I would have gotten more leverage here, but I decided it was good enough; it's not the place I wanted to focus, so I moved on.

I did write a little helper function called conform! that is like conform but throws an exception with data in it. I end up writing this function in every spec project I ever work on, which means I now hereby submit it to the Rich name-rejection system. Rich will mock that name for a while, and then either come up with a better name and include it somewhere, or come up with a way to think about this more generally and include that somewhere. I'm looking forward to seeing what happens there.

So those are the kind of obvious advantages of using Clojure: having the ability to describe things as functions, and being able to use spec and come back to existing systems. There are some less obvious advantages, and I want to spend more time on these: transducers, strong namespaces, reified transactions in Datomic, and universal schema in Datomic.

I'll start with transducers. I have to admit that when transducers came out, I didn't use them a ton. I was involved in their development, but I was working on this legacy code base, Datomic, that was on an older version of Clojure at some point, and I didn't want to go back and change things willy-nilly, so I didn't. But I thought: what would it be like to approach an existing thing that's not written with transducers and replace the pieces with transducers? Three things happened. The first was kind of obvious: if you go look at the single paragraph about transducers at the top of the page, it's going to tell you that they decouple transformation from input and output sources. That's what they're about: you describe algorithmic transformations without talking about inputs and outputs, and then you can strap them on in the manner of your choosing later. That's obvious, and by itself it sounds like a good fit for functional pipelines.

But two things fell out that are maybe a little less evident. One: it made accidental complexity in the system more evident, because it gave it fewer places to hide. Imagine a function doing a transformation that does a little bit of work with input, a little bit of transformation, a little bit of output, and some stupid stuff — a function doing four things. When you're reading it, especially when you didn't write it, and you're trying to fix something else, the stupid stuff represents 25% of the function and it's easy to miss. Now imagine a world where you're using transducers and the input and the output have been pulled out of that function. All you have left in that function is transformation and stupid stuff, and you look at it and go: wow, the stupid stuff looks really stupid, I should get rid of that. It becomes easier to see.

The other thing that happens when you start doing this is that you discover commonality — and admittedly this is commonality that would have been present if you'd thought about it for five minutes anyway, but you know me, I'm like most developers, I just like to type; I don't really want to think for five minutes before I start typing, so I'm just banging along. With transducers those common things become more visible. What happened when I went back to the ETL job was that I ended up writing a case statement that said: I've got to convert eight or nine or ten different kinds of entities, and I'm going to force myself to go through and describe every one of those entity conversions as a transduction. None of those ten entity conversions had been described as transductions before, because transducers didn't exist.
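To make that concrete, here is a hypothetical sketch — not the real job's code — of a case statement choosing a transducer per entity type, with a shared transform-entity step factored out in the way the talk describes:

```clojure
;; A sketch of the idea: every entity conversion is a transducer,
;; which exposes a shared transform-entity step. Entity types and
;; rename tables here are hypothetical stand-ins.
(defn transform-entity
  "Generic step: rename keys per a lookup table of old -> new names.
  Returns a transducer over entity maps."
  [renames]
  (map (fn [m]
         (reduce-kv (fn [acc old new]
                      (if (contains? acc old)
                        (-> acc (assoc new (acc old)) (dissoc old))
                        acc))
                    m
                    renames))))

(defn entity-xform
  "Pick the transducer for one kind of entity, case-statement style."
  [entity-type]
  (case entity-type
    :artist (comp (filter :name)
                  (transform-entity {:name :artist/name}))
    :track  (transform-entity {:name :track/name})))

(into [] (entity-xform :artist) [{:name "Led Zeppelin"} {:gid 1}])
;; => [{:artist/name "Led Zeppelin"}]
```

Because each conversion is just a transducer, the common transform-entity piece has nowhere to hide once every entity type is forced through the same shape.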
Some of those transductions were entirely trivial, but once I wrote them this way it became blindingly obvious that there was a helper function hiding in there, called transform-entity, whose logic was split across a bunch of different places, didn't appear in some places where it could have appeared, and had some ad hoc code. Could I have discovered this by thinking about it while lying in a hammock? Yes, absolutely. But just by adopting the rule that I had to write this stuff as transducers, it was actually impossible to avoid noticing this opportunity for reuse, which had not been noticed through at least a half dozen revisions of this ETL job. So I was able to discover this piece of generic code and pull it out.

The other thing going on with transducers right now is that Clojure 1.9 gets halt-when — which is also an easy thing to backport if you want to just grab it and use it on an older version. It's a transducer that stops the transduction when some predicate matches. The transducer I'm showing here actually transacts against the Datomic client API and then stops if a transaction failed for some reason. This is really cool, because when you're in the transduction business you really don't want to be in the errors-as-exceptions business; you want to be in the errors-as-information business. And the idea of halt-when brings up the question: how are we going to represent errors as information?

The obvious thing to do, to some degree — because we're running on the JVM — is to ask Java how it thinks about errors. When you go down that road, you end up looking at Java's exception hierarchy, and Java's exception hierarchy is problematic for this job, because its dominant axis is checked versus unchecked exceptions — a type-theory exercise that has perhaps been proven not to be that valuable. And because that's the dominant axis, it doesn't provide a categoric split.
You'd want one branch of the exception hierarchy for the kinds of errors that mean your program is broken and you should go fix it; another branch for the ones that mean the program you're talking to is broken and you should tell them to go fix it; and a branch for the different kinds of things that are actually actionable when an exception happens. If you've tried to write programs around the JVM, you've felt this problem: you want to say, these are recoverable things I can take action on, these are things I want to report to somebody, these are things I want to report to somebody else — and you try to map that to exceptions, and it doesn't work. Then you look at things like JDBC: down underneath SQLException, all those categories occur again, which means that if your thing might ever talk to SQL, you have to do that whole branching again. So I pretty quickly decided that I did not want to make the Java exception model the centerpiece of errors-as-information.

If Duke can't help, let's go ask the guy who invented the Internet: how does the Internet think about this problem? It thinks about it with HTTP status codes, and where Java had a type-system bias, these status codes are better. If you told me I had to use either Java exceptions or HTTP status codes as my mechanism for talking about errors across the whole universe, I might actually choose HTTP status codes, even though I'm very rarely going to have a program that contains a teapot or some of the various other things that are in the HTTP status code set. But it still has a very place-oriented bias: all the 3xx codes are about "you should be talking to a different place," all the 4xx codes are about "it happened at your place," and all the 5xx codes are about "it happened over there." That doesn't really make sense either, because my question is: is my input invalid? If my input is invalid and the client decides it's invalid, that's a 4xx; if the server decides it's invalid, that's a 5xx. That doesn't fit so well either.

And I realized that as much as Duke and Al Gore have been important figures contributing to the history of software development, I was really going to have to bring out the big guns. As we always do in Clojure, I was going to have to go back to the 1970s, dig up some ancient tome, and find out how we should solve this problem. And so I did. What I discovered was that, in secret, over several decades, Hall and Oates have carefully thought through this problem, and they have produced a set of songs that indicate the categoric things you might want to do when you encounter an error in your program. The resource could be unavailable: you're out of touch. You could have been interrupted: it doesn't matter anymore. You could be incorrect: you'll never learn. It could be forbidden: I can't go for that. It could be unsupported: in your imagination I'll do that operation for you. Something could be not found: she's gone, baby. There could be a conflict: give it up, this isn't going to work. You could have a fault: we're failing here. Or it could be busy: wait for me, come back and try again and I'll be good.

So this is now my favorite open source library that I've ever created. It is exactly eleven lines long: nine lines of categories in a spec — a spec that says there's a category keyword that can refer to one of these — and then another spec that says there's a string message that can add detail on top. This is shipping with the Datomic clients; it's open source, and I'll get a GitHub repo up with it, or maybe I'll just put it in a tweet.
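Here's a sketch of what that eleven-line errors-as-information library looks like — the nine category names are the ones from the talk, the surrounding names are illustrative — together with halt-when stopping a transduction at the first anomaly. The transacting step is a stand-in map-producing function, not a real Datomic transact call:

```clojure
(require '[clojure.spec.alpha :as s])

;; A sketch of the tiny errors-as-information library described
;; above: a category keyword plus an optional message string.
(s/def ::category #{:unavailable :interrupted :incorrect :forbidden
                    :unsupported :not-found :conflict :fault :busy})
(s/def ::message string?)
(s/def ::anomaly (s/keys :req [::category] :opt [::message]))

(defn anomaly? [x] (s/valid? ::anomaly x))

;; halt-when (new in Clojure 1.9) ends the transduction as soon as a
;; step produces an anomaly, and returns the anomaly itself as the
;; result. The "transacting" step here is a fake stand-in.
(def outcome
  (transduce (comp (map (fn [tx]
                          (if (:bad? tx)
                            {::category :fault
                             ::message  "transact failed"}
                            (assoc tx :status :ok))))
                   (halt-when anomaly?))
             conj
             []
             [{:id 1} {:id 2 :bad? true} {:id 3}]))
;; outcome is the anomaly map, and {:id 3} is never processed
```

The nice property is that the error is just a map: the caller can inspect, log, or retry based on the category, with no try/catch in sight.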
Another less obvious advantage is strong namespaces. This is a real sleeper when you first approach Clojure, because you're used to namespaces — but namespaces in most programming languages correspond to stuff. For a namespace to exist there has to be stuff: there has to be a class, there has to be a package, or whatever. Clojure's namespaces are really at the bottom: they are properties of names. They're a property of symbols, they're a property of keywords, they're connected to spec, and — literally, a word I hate for people to misuse — they are literally easy to use, because they are literals. They exist down in edn, separate from anything else.

What this lets you do — and I didn't say this earlier when talking about ETL jobs — is this: when you go back and think about all the different phases of the job, each phase might have some knowledge of names that you want to spec, and you can put each one of those phases in a different namespace, and then that data can cohabitate. Maybe there's some job going from namespace 1 to namespace 2 and you want to have both of those namespaces in play: you don't burn the good names, and you don't have to reify anything. So this is a really big help.

And a message I would send to Clojure programmers in 2016: namespaces are good. If you make a new thing, make a new namespace. Do not change the meaning of names in an existing namespace — just stop doing that. We do not have a namespace shortage; there are even more namespaces than there are UUIDs. In fact, each one of you, under any internet domain of your choice, has more namespaces available than there are UUIDs. And by the way, Microsoft wasn't wrong: it's perfectly OK to name something foo2. If foo really was the best name for it, and you make another one and you change the semantics, then foo2 is perfectly OK.
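As a small sketch of that cohabitation — the namespace names here are hypothetical, not from the actual job — two phases' "name" keys live side by side in one map without colliding:

```clojure
;; Namespaced keywords from two phases of a job can cohabitate in
;; one map without colliding. Namespace names are hypothetical.
(def in-flight
  {:mbrainz.source/name "Led Zeppelin"    ; the name as extracted
   :artist/name         "Led Zeppelin"})  ; the name as it will load

;; Both "name"s coexist; neither phase burns the good name.
(:mbrainz.source/name in-flight) ;; => "Led Zeppelin"
(:artist/name in-flight)         ;; => "Led Zeppelin"
```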
Another less obvious advantage is reified transactions. Tim Ewald gave a talk on reified transactions last year at the Datomic conference before the Conj, and I'd recommend you watch that if you're interested in this. The idea is that in Datomic, transactions are entities like any other entity in the system, and transactions are a durable record of operation: just as you have records about people — or in our case songs and tracks and whatever — you have records about what you did to the system, and you can put your own attributes on these transactions. So when I put together the importer, I added an attribute called :mbrainz.initial-import/batch-id, and that attribute is on every transaction that puts data into the system.

This gives you a whole bunch of different things. One, it establishes provenance: anybody who comes back to this data later can ask where it came from — oh look, there's an attribute on the transactions that says it's the mbrainz initial import. I could put more attributes there too: URLs pointing back to the Postgres data it came from, and so on and so forth. And that data will live with the actual mbrainz data as long as it's there. It also provides a way of tracking progress: I can walk up to a system whose import state is unknown to me, issue a query for these batch IDs, and ask, has the import happened? Did part of the import happen? That "part of the import" question means I can make the import restartable — when I screw it up, as I've done about 7,000 times now, I can go back and ask which part of the import was done. And this also supports — although it does not by itself fully enable — parallel and pipelined imports. Now, there may be other things about my data; there may be ordering dependencies that say you have to put Fred in the system before you put Ethel in the system, or the system is not going to be happy. But at least at this level of the system, nothing stands in the way.
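A sketch of what such annotated transaction data might look like. The batch-id attribute name is from the talk; the surrounding shape follows Datomic's transaction-data format, with "datomic.tx" as the tempid of the transaction entity itself, and the query is illustrative (this is plain data, no running Datomic needed):

```clojure
;; A sketch of transaction data that annotates the transaction
;; entity itself with a batch id. "datomic.tx" is the tempid of
;; the transaction entity in Datomic's transaction-data format;
;; the batch id value and artist map are illustrative.
(def batch-tx
  [{:db/id "datomic.tx"
    :mbrainz.initial-import/batch-id "artists-0042"}
   {:artist/gid  "xyz"
    :artist/name "Led Zeppelin"}])

;; An illustrative query for the batch ids already in the database,
;; which is what makes the import restartable:
(def imported-batches-q
  '[:find ?batch
    :where [_ :mbrainz.initial-import/batch-id ?batch]])
```

Because the annotation rides along in the same transaction as the data, provenance can never drift out of sync with what was actually loaded.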
And in fact, if I don't know whether there are ordering dependencies, I can find out: I can just try to parallelize and pipeline the data using this technique, and anything that fails, I can just try again. Oh look, there was an ordering dependency — fine, I'll do that data again.

Another less obvious advantage is the universal relation, or universal schema, in Datomic. If you have a relational database, every time you want to model anything you make a new relation: there's a relation for people, a relation for tracks, a relation for albums, a relation for this. Everything in Datomic is a single relation; it's all about datoms. Everything distills down to these five-tuples: entity, attribute, value, transaction, and operation. And the obligatory "Jane likes pizza" example: you have a datom that says some entity, Jane, has some property, likes, whose value is whatever, as of some point in time, and are we adding this or removing this from the system. In this case we've added broccoli and pizza, and then removed pizza.

What does this have to do with ETL jobs? A couple of things. When you use a universal schema, the logical schema — the thing people talk about in the SQL world — has an almost one-to-one correspondence with your physical schema, because you don't have to talk about things like join tables. How many tables are there in Datomic? Really, one. Are there any join tables? You don't have to talk about that. Likewise, it automatically biases you towards being a database that stores information, not one that stores the answers to questions. This is in stark contrast to, for example, Cassandra or Mongo, where — because of the other properties of those systems — the best practice is to make a table per query you're going to ask. Oh, I might need to ask this question in the future: all right, make a table with all the information it needs. You might need to ask that other question in the future: make another table with all the information in it.
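The "Jane likes pizza" example can be written out as data — illustrative five-tuples, plus a toy replay function (my own, not Datomic code) showing how adds and retracts distill to current values:

```clojure
;; Everything in Datomic distills to five-tuples: entity, attribute,
;; value, transaction, and an op (true = add, false = retract).
;; These datoms are illustrative, in transaction order.
(def datoms
  [[:jane :likes :broccoli 1000 true]
   [:jane :likes :pizza    1000 true]
   [:jane :likes :pizza    1001 false]])  ; pizza later retracted

(defn current
  "Toy replay: fold datoms in transaction order down to the current
  set of values for each [entity attribute] pair."
  [datoms]
  (reduce (fn [acc [e a v _tx op]]
            (if op
              (update acc [e a] (fnil conj #{}) v)
              (update acc [e a] disj v)))
          {}
          datoms))

(current datoms)
;; => {[:jane :likes] #{:broccoli}}
```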
When you're using a system with that bias, your ETL job has to anticipate what questions you're going to ask; it has to do more work. You don't build a generic information model, you build a specific model tailored to answering particular questions.

So this is the mbrainz schema. It shows that you have artists, artists have albums, albums have tracks, artists have countries, and so on. This is the data set we're working on — about a hundred million datoms. As of the time we imported it, it was everything people knew about these aspects of the music industry.

And the result of my reorganization, along the principles we've talked about here, is a system that looks like this. I know it may be a little bit hard to read the words in the back, but walking through from the top: you have a Postgres database, and there's a process called the extractor. The extractor pulls stuff out of Postgres using clojure.java.jdbc and queries, and the one other thing the extractor does is go ahead and do the joins: it finds globally unique identities for the things being joined to, as opposed to the Postgres-relative numbers, which I don't care about. On the outside of that you get a bunch of entity files, which are just edn data. Then there's a process called the validator, which uses spec; it can tool through the edn data and ask whether it looks correct per our specs. If that fails, there's a human — that's me, back in my long-hair days, by that icon — I come in, I make a fix, and you run that part of the job again.

Then there's a transformer, and this is where the most interesting stuff happens. The processes at the top of the diagram have solid arrows, which show that they actually know about the things they're talking to, but the transformer is connected to its inputs and outputs only by a channel — core.async also didn't exist when this job was first written — which means the transformer really doesn't know anything about where data comes from or where data goes to. It's easy to monkey around with the system: tap channels, multiplex channels, play around with stuff. The transformer uses two things. It has a bunch of data tables that are really just pure data — this was the name in the old system, and this is the name we want to call it in the new system; we're going to have namespaced names and things like that. And it has a lookup table of a bunch of transducers — that case statement I showed you before — that it can use to do the job. It takes all those edn files, processes them, and puts out another set of edn files, which are all transaction data ready to go into Datomic.

Then the last piece, the loader. The loader doesn't know anything about the semantics of MusicBrainz; it's actually purely generic Datomic code. It knows there's a bunch of stuff lying around in the shape of transaction data, and it does two things: it batches that data up so it gets into bigger chunks, and it assigns a batch ID. And it guarantees, as part of its job, that if it tries a batch that's already in there, it can say "okay, good, I'm fine with that," or it can avoid trying things it knows are already in there. So it talks to the database.

Now, the objective of organizing things this way is to localize my stupidity. I'm sure there are a dozen decisions in here that you could second-guess, but the boundary of most of those decisions is blocked at wherever those channels appear. So when I've done something dumb, we can make it smarter by just making a change locally. That's the real benefit here. My objective as a software developer is not to be smart; it's really to limit my stupidity to small local scopes.
That way, when I need to make a change, it doesn't percolate throughout the system. And to that end, because I've described these transformations as transducers: I actually had threading information on this diagram showing how parallel the various jobs were, but I've taken it out — not because you might not want to communicate that to somebody if you were handing off the system, but because I want to emphasize how much this design doesn't care. If you said to me, "I'd like you to parallelize some step," I'd say: okay, that's going to be a one- or two-line code change somewhere. If you said, "you missed a step; this thing needs to be broken into two steps," then I'm going to take something, split it apart, put two more channels between the pieces, and add the steps. If you said, "I don't need the durability, I don't need all those files in the middle" — well, how much does this system care about those files in the middle? It doesn't, really. (In fact, the old part of the system that I didn't touch this time around still talks directly to files, but if I continued down this road I would have that talk to channels as well.) And at that point, if you said, "this is going too slow, and I see you've stopped in the middle and made this thing durable — why don't you just take the output of that one and pipe it into the input of the next one?" — no problem. You pull that piece out, and instead of having a channel serialize to edn and deserialize back from edn onto a channel, you wire those two channels together, or you just make it a single channel and compose the transducers. This also uses channels that have transducers on the channel, and whenever you have a transducer on a channel, that transducer may be a composition of four jobs.
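That composition point can be sketched with plain transducers — core.async isn't required for the illustration; the same composed xform could sit on a channel via something like `(async/chan 16 fused)`. Stage names and data here are hypothetical:

```clojure
(require '[clojure.set :as set])

;; Two hypothetical pipeline stages, each a transducer that knows
;; nothing about its input or output source.
(def parse-stage  (map #(assoc % :parsed? true)))
(def rename-stage (map #(set/rename-keys % {:name :artist/name})))

;; Fusing two formerly durable steps into one is just transducer
;; composition; no stage needs to change.
(def fused (comp parse-stage rename-stage))

(into [] fused [{:name "Led Zeppelin"}])
;; => [{:parsed? true :artist/name "Led Zeppelin"}]
```

Adding a fifth job, or changing one of the four, is a one-line change to `fused` that knows nothing about sources or destinations.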
decide that one of those jobs needs to be different, or you add a fifth job, that's a one-line change on the transducer, and that one-line change doesn't know anything about the source data and doesn't even know anything about the target data. It just knows about the job that it's doing. So I think this is getting much closer to delivering on the promise of what a good ETL job would look like in Datomic and Clojure.

Just to give you some numbers from doing this job: there are about a hundred lines of code that are general-purpose ETL helpers, which really have nothing to do with the job at all. They do things like grabbing data from edn and putting it on a channel, and that stuff could live in its own library. There are about 50 lines of generic Datomic ETL helpers: batching, checking to see if a batch has already been submitted, that sort of thing, none of which knows anything about what's in the batches. There are about a hundred lines of pure data lookup tables, and I don't really think of that as code at all; it doesn't contribute to the complexity of the system. It's just data that says, for example, the name column in the artists table becomes artist/name in the destination database. There are about 50 lines of specs. And there are about 150 lines that actually do the job, split roughly half and half between transducers and wiring. By wiring I mean instantiating channels, knowing the names of files if you're writing to files, and so on. That's really all glue code, and the interesting part, the transducers, is under 50 lines of code. The other interesting thing I want to point out is that the most variable number here is the lines of specs. That could have been zero; in fact it used to be zero. You could fly completely naked without specs and not care. And it could easily be triple if
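The "lookup tables are just data" idea above can be sketched like this (the attribute names and the helper `row->entity` are hypothetical, not taken from the actual job): a map from source columns to Datomic attributes, plus one generic function that applies it. Renaming or dropping a column becomes a data change, not a code change.

```clojure
;; Pure data: source column -> destination attribute.
(def artist-cols
  {:name    :artist/name
   :gid     :artist/gid
   :country :artist/country})

(defn row->entity
  "Rename the keys of a source row according to a column table,
   dropping columns the table doesn't mention."
  [cols row]
  (into {}
        (keep (fn [[k v]] (when-let [attr (cols k)] [attr v])))
        row))

(row->entity artist-cols {:name "Miles Davis" :gid "561d854a" :rowid 42})
;; => {:artist/name "Miles Davis", :artist/gid "561d854a"}
```

Because `artist-cols` is plain data, the same `row->entity` serves every table in the import; each table contributes only another map.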
not quadruple. We could spec more things: I could spec not only the shape of the entities but also the transaction data, and have a second validator at every step. I could do all of those things, and what I want to drive home here is the importance of being able to spec exactly as much as you want in order to make yourself feel confident, and then, having written those specs, run them exactly as much as you want in order to feel confident. For example, having run all the data through the specs once, let's say this is an import job that has to be run again and again. Rather than the MusicBrainz import, imagine a batch import into a Datomic system that happens once a week: it's an online transaction system, but it also receives a batch of reference data that gets updated occasionally. You might decide at some point that the specs were more about proving to yourself that you understood the system, and less about distrusting whoever was giving you the data, so you just take that step out again. All of those things end up being one-line code changes.

So in summary, the combination of tools I have here is enough to help my dumb self feel powerful. The many times I approached this job before, I felt powerful, but not quite as powerful as I should have, and this time I just felt awesome. Going through this data, the combination of core.async, spec, transducers, and Datomic's features, its universal schema and reified transactions, the whole picture, made me feel like I had total control of this job and could go in and make changes with surgical precision. So I'm pretty excited about it. Of course it's going to be open source. I'm still putting the finishing touches on it, but I'll put up a GitHub repo where you can look at it, download the data, and try it for yourself. This also targets Datomic's new client API, which we shipped on Monday, so if you haven't seen the client API it's a chance to look at that.
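A sketch of "spec exactly as much as you want, run it exactly as much as you want," using hypothetical specs (`::artist` and `conform!` are illustrations, not names from the actual job). The validation is itself just one step in the composed transducer, so removing it later really is a one-line change:

```clojure
(require '[clojure.spec.alpha :as s]
         '[clojure.string :as str])

(s/def :artist/name (s/and string? (complement str/blank?)))
(s/def ::artist (s/keys :req [:artist/name]))

(defn conform!
  "Return x unchanged if it satisfies spec, else throw with explain-data."
  [spec x]
  (if (s/valid? spec x)
    x
    (throw (ex-info "invalid entity" (s/explain-data spec x)))))

(def xform
  (comp (map #(conform! ::artist %))          ; delete this line to fly naked
        (map #(assoc % :artist/type :person))))

(into [] xform [{:artist/name "Nina Simone"}])
;; => [{:artist/name "Nina Simone", :artist/type :person}]
```

Spec'ing the transaction data as well would just mean one more spec and one more `conform!` step composed after the entity-building step.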
Obviously it will be another example of a few-line code change to make it target the existing Datomic peer API instead. That's it. Thank you.