Anatomy of a Data Analytics MVP


Thanks for the introduction. My name, again, is Ken, and this is the anatomy of a data analytics MVP. A little bit about me: I care about this a lot because it's what I do for a living, and also for the last two years I was doing my own startup in data analytics. When you're in a startup, the pressure of time and the pressure on resources are a lot greater, so this talk is really about getting you up and running. Today I want to flip things around and show you a demo first, something I hacked together in roughly a day and a half, and then really go through the architecture, how I set the whole data pipeline up, as well as some considerations for when you're building your own MVP.

Before we start the demo, here's a disclaimer: this is an MVP, so the idea is to get your product out there, in front of a client, in front of a user, as soon as possible, with the core functionality that they're looking for. And with that, you really want to focus on the problem. For the demo I'm showing you, the problem we're trying to solve is: how do people like the latest Godzilla movie? So now it's demo time. Here's the link, so if you're interested you can go on there yourself, but I already have it open here. It's a fairly straightforward demo website, a very simple site, but the key thing here is this chart of the latest tweets mentioning Godzilla and whether each one is positive, neutral, or negative: how do people feel about it? You can also highlight the chart and see some of the positive posts and how many there are, or the neutral posts (quite a lot), and the negative ones (not as many); some are pretty thorough about this new movie.

Let's jump back into the presentation and really look at what components this is composed of. As with any good MVP, I designed my architecture on a napkin. You can see we have a data feed, and from there I put it through a data pipeline, an analysis stage, and a database, all the way to the visualization layer, which is what you just saw.
Let's jump into the data feed. Because we're analyzing Twitter, the data feed here is, well, Twitter. I'm using the Twitter streaming API along with the Twython client library, but there are plenty of Python libraries that hook onto the Twitter API, so you can definitely find your own favorite. When you're working with a data feed, though, the real big question is not which library to use, but whether you want push or you want pull. When you're doing a pull, which is what I'm doing here, you're really connecting to an outside API's endpoint and going and grabbing their data. Alternatively, you can define your own web endpoint and have other people submit data to you; this would be a push type of architecture. Generally you find push architectures, or push paradigms, in internal data systems. For example, if you're working with internal emails, logs, or even clicks, sales, marketing KPIs, it's really nice to have those pushed to your data pipeline instead, because you get to define the interface, you get to define what's important to you and make it adhere to your requirements, rather than the other way around.

Once we get the data, a great thing to do is push it into a queue, because that separates the speed of the data traffic, the load of the data feed, from the rest of your system, and you can handle it as you need to. In this implementation I'm using IronMQ, which is a SaaS product from Iron.io, and a great thing about it is that it has a really powerful, really easy to use Python client library that hooks you right onto it.
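To make the pull-and-enqueue shape concrete, here is a minimal sketch. `fetch_batch` and `enqueue` are placeholders I've invented for illustration; with Twython you would typically subclass `TwythonStreamer` and enqueue from its `on_success` callback, and `enqueue` would wrap the IronMQ client's post call.

```python
import json

def make_queue_message(tweet):
    # Keep the payload lean: the analysis worker only needs an id and text.
    return json.dumps({"id": tweet["id"], "text": tweet["text"]})

def poll_feed(fetch_batch, enqueue, batches=1):
    # Pull paradigm: we connect out to the feed, grab data, hand each
    # item to the queue, and forget about it.
    for _ in range(batches):
        for tweet in fetch_batch():
            enqueue(make_queue_message(tweet))
```

A push architecture would invert this: instead of `poll_feed` reaching out, you would expose a web endpoint and have internal producers call it with their data.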
One of the reasons I chose a message queue, instead of the other paradigm I'll introduce in a moment, is that a message queue allows a fairly decentralized system: the components pushing onto the queue have no idea what is being done with the messages on the other end, and at the same time, whoever takes the data off the queue doesn't care, and shouldn't care, who put the messages onto the queue. So it allows you to have a distributed system, which definitely has its advantages and disadvantages. Alternatively, you can have a worker system, where a centralized component, a particular layer, assigns jobs to various workers: it puts them on a job queue, and other workers pick up those jobs and execute them. That paradigm tends to be better for a centralized system, because the delegator, the manager in this paradigm, needs to know what jobs need to be executed, and it needs to know what modules or libraries are available on the worker in order to say, "I want you to execute that particular function, that particular operation." So it's better for a centralized system, and you tend to see a lot more of it when you're working on a big data-analysis problem, for example when you're distributing computation for a big NumPy job. There are some alternative technologies for both paradigms: for the message queue you can also use RabbitMQ, or even memcached; on the worker side, Iron.io also provides its own worker architecture, and RQ, which only requires that you have a Redis server set up, enables you to implement job queues extremely easily.

Now let's move on to the analysis component. This is actually the lightest component, but it's really where you as a developer, you as a startup, have your core, because this is the job that you're meant to be solving, the question that you're meant to be solving.
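For this demo, the job in question is sentiment bucketing. Here is a minimal sketch of just the decision logic; the polarity score comes from a sentiment library, and the width of the neutral band is my own assumption, not a value from the talk.

```python
def classify(polarity, neutral_band=0.1):
    # polarity is a float in [-1.0, 1.0], as produced by sentiment
    # libraries such as TextBlob; scores near zero count as "neutral".
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"
```

Hooked up to a real library this would be roughly `classify(TextBlob(text).sentiment.polarity)`.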
Of course, the implementation varies quite a lot. For me, I used TextBlob, which is a text-analytics library that lets you quickly implement text-analytics features. All I did was create a TextBlob, grab its sentiment attribute, and say: given this sentiment polarity, is it positive, neutral, or negative? There are some alternative libraries you may want to use for this: NLTK, scikit-learn, and of course NumPy and SciPy. But the real key for this component, because it's dequeuing from the message queue, is that you want it to be as lean as possible: it should need as little data as possible to carry out the analysis, to carry out its operation. Why? Because that allows you to break up your problem set, a big-data problem set or whatever your load is, into smaller pieces that one machine, one web server, one process can handle. You're really using the architecture itself to scale out your calculation, rather than trying to do too much in the software and running into concurrency problems and all the other problems that come with that complexity.

After the analysis is done, the component posts the results to the database. In this particular implementation I'm using Cloudant, which is a database-as-a-service implementing CouchDB, specifically the BigCouch flavor, which has more support for distributed computing. The number one reason I chose CouchDB is that I've used it for a long time, so it gets me going really quickly, but it also has a really great strength: you can define a view and apply incremental MapReduce to it, which, as the name says, incrementally map-reduces the results that are in the database, and it's done on every read.
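As a sketch of what such a view looks like: a design document whose map function emits each document's sentiment, with CouchDB's built-in `_count` as the reduce. The Python function below mirrors what the view computes, so you can see the aggregation without a CouchDB server; the document fields are my assumption about the demo's schema.

```python
# CouchDB design document (stored as JSON in the database): the map
# emits one row per tweet, and _count tallies rows per key, updated
# incrementally as documents arrive.
DESIGN_DOC = {
    "_id": "_design/tweets",
    "views": {
        "by_sentiment": {
            "map": "function(doc) { emit(doc.sentiment, 1); }",
            "reduce": "_count",
        }
    },
}

def count_by_sentiment(docs):
    # Pure-Python equivalent of querying the view with group=true.
    counts = {}
    for doc in docs:
        counts[doc["sentiment"]] = counts.get(doc["sentiment"], 0) + 1
    return counts
```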
That makes it a great system if you have a read-heavy workload: as data comes in at one end while it's continuously being read by clients at the other, the view is continuously updated on a very, very small incremental basis, and that keeps the particular view, the particular index, up to date all the time. There are some alternatives: you can also use MongoDB or Hadoop. I put Hadoop in there because this is really the step where you aggregate all your data, all your analysis. Right at the beginning we had the data feed grabbing a bunch of data and putting it in the queue, and the analysis doing the piece-by-piece work, so this is really where you pull everything together and create that internal data structure, that report, that gets fed out to the client. Of course, you can also do that aggregation via a SQL database; however, for an MVP I tend to recommend a NoSQL solution, and the real reason is that NoSQL solutions do not require a schema. Because you're working on an MVP, your goal is to get your solution demonstrated to the user, to the client, as soon as possible and get their response; the more time you spend updating the schema, updating the data structure, the less time you have for getting real feedback from the users.

Finally, let's get the analysis results onto the client. This is the visual component, and for it I'm using Rickshaw (that's actually not its logo on the slide, it's the Acumen Fund's logo). Rickshaw is a JavaScript library built on top of D3 that renders really beautiful time-series charts really quickly. But there are some alternatives: you can always go with D3 itself and get your hands dirty, and NVD3 I also love a lot; it's a library of reusable D3 charts, so your typical charts, pie charts, bar graphs, time-series charts, are all there, and it gets you up and running really quickly.
Now, the reason I put Django up there is that the current implementation uses Flask, and of course it depends on how interested you are in Flask versus Django, how much you like either one, how familiar you are with each. I tend to prefer Flask because this particular layer is actually fairly light: it doesn't do a lot of calculation; all it does is sit between the database and the client-side JavaScript. Another charting library is gRaphaël, which is not a D3 library; it's based on Raphaël, another vector-based graphics engine.

Okay, so where are all these things hosted? This is really the big MVP part. Everything here right now is hosted on one Heroku app, and I have three processes running; it's costing me seven dollars a month. However, when you're working on an MVP, a lot of times your data load is not necessarily that large, so you can keep yourself within the freemium model, the free plan, for a lot of these products. You can see here I have Cloudant running and I have IronMQ running, and both of these are still very much on the free tier. The dynos, I have three dynos running, are what really contribute to the monthly cost; however, if you do separate those three components into their own Heroku apps, you can actually stay within the free plan for Heroku, so you're really running your MVP for free while getting your product out there to your clients. An additional benefit of using a platform-as-a-service such as Heroku is that integration is extremely seamless: I didn't really touch any config values or fight any configuration issues lining them up; it's all done internally through environment configs. The only config file I had to deal with myself is really for my dev machine, not for production.
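To show how thin that web layer actually is, here is a sketch: one helper that reshapes the view's counts into a series for the chart, plus the Flask wiring in comments. The series shape and the `fetch_counts` helper are illustrative assumptions, not Rickshaw's actual input format or code from the demo.

```python
def counts_to_series(counts):
    # Reshape {"positive": 3, ...} into a list the front-end can chart.
    return [{"name": name, "value": value}
            for name, value in sorted(counts.items())]

# With Flask, the wiring would be roughly:
#
#   from flask import Flask, jsonify
#   app = Flask(__name__)
#
#   @app.route("/sentiment")
#   def sentiment():
#       # fetch_counts (hypothetical) reads the CouchDB view.
#       return jsonify(counts_to_series(fetch_counts()))
```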
And you can see here that because I'm keeping every layer separate, I can scale each process separately. If there is a lot of data load on the analysis side but not as many users, I can scale up the analysis component without worrying about the rest of the system; at the same time, if there are a lot of users but not so much data feed, I can do the same for the web component.

Now, to finish up, I want to share some considerations for when you're building your own system. First of all, it's all about the problem: what particular problem are you trying to solve? That is the focus of an MVP, that's the focus of your system; don't really worry about anything else, let someone else worry about it. Second of all, try not to learn too much. When you're building an MVP you actually want to minimize your learning, because you want to minimize the number of unfamiliar technologies you're working with; the goal is to get things up and ready as quickly as possible, so if there's something you already know, use it. Again, on the larger software-engineering side, we all learned that when you're dealing with a big problem, you break it down into smaller problems; likewise, when you're dealing with a big-data problem, break it down into small-data problems. And let SaaS back you up: there are a lot of free technologies out there, a lot of technologies with free plans, and I've introduced a number of them today that you can utilize to get up and running. For anything that you're not really solving for your client, try to use the resources developed out there; queues and DB layers in particular are great buffers that help make sure you can scale each component as swiftly, smoothly, and seamlessly as possible. Finally, if speed is a concern for you, if you're really worried about latency, a lot of these SaaS technologies, the platform-as-a-service and database-as-a-service technologies, are all built on top of EC2 as well.
So when you're creating a plan, selecting and setting up these SaaS products, these platforms, try to select the same region for all of them; that minimizes the communication latency between each component, and it will drastically improve your speed. That's it, thank you very much. I think we have some time for questions.

[Audience] I have a question about the legal issues, like how to get a data source. For example, you just mentioned that you use the Twitter streaming API, and I remember that Twitter licenses their data to a couple of companies, so if we use those kinds of APIs, do we need to pay anything to them?

[Ken] So, the Twitter streaming API currently has a flavor that is free; it's limited, of course, in what you can do with it. What you may be thinking of is the firehose, which is a paid product, so you do have to pay for the firehose. The free streaming API limits you to particular keywords, particular mentions, so you only get a stream of those.

[Audience] I have another question. I'm pretty naive about Heroku, and I heard you just mention that you use Flask. Is Flask the default web framework you can choose on Heroku, or do you need to install it in some way onto Heroku?

[Ken] For those not familiar, Heroku is a platform-as-a-service, so what you're really submitting to Heroku is just your code: the libraries you use, as well as a requirements.txt file. As long as Flask is within that, you can use it. Heroku also has great tutorials for integrating with Flask, as well as Django and many other popular web frameworks; you're definitely not limited to Flask. Any other questions or comments? Okay, if you have any questions you can talk to me later. Thank you very much.
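To make that answer concrete: for the stack described in this talk, the requirements.txt submitted alongside the code might look roughly like the sketch below. These are the PyPI names as I understand them (the IronMQ client name in particular is worth double-checking), and the demo's actual file may have differed.

```text
Flask
twython
textblob
iron-mq
```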