How Booking.com Serves Deep Learning Model Predictions at Large Scale

Hello, my name is Sahil Dua, and welcome to this talk. It's going to be a unique combination of two technical areas: containers and deep learning. We are going to talk about how we productionize our deep learning models at Booking.com using containers, and how we manage those containers using Kubernetes. I hope I got all the buzzwords covered.

Before I start, let me see how familiar you are with these technologies. Can I see hands of people who are familiar with deep learning? Okay. And those who have written some apps which run in containers? Okay, quite a lot.

Let me go through the agenda of what we are going to talk about today. I will start with applications of deep learning at Booking.com: what are the applications that made sense for us? Next, I'm going to talk about the life cycle of a deep learning model: what are the different phases of a model, and what do people do in those different stages? Then we're going to talk about what our deep learning production pipeline is: how do we make sure that we put our models in production and that we are able to serve those models at huge traffic?

Before I start, who am I? I am Sahil Dua. I work at Booking.com as a back-end developer. I work in deep learning infrastructure, where we are building this pipeline to help the data scientists put their models in production as fast as possible. I'm a machine learning enthusiast, which means I spend a lot of time learning about different techniques of machine learning and deep learning. I'm also an open source contributor: I've contributed a couple of patches to Git, a tool that probably most of you use every day. I've also contributed to the pandas library, which is a Python library for data analysis, and a bunch of other projects: Kinto by Mozilla and go-github by Google. And I'm a tech speaker, that's why I'm here today on this stage.

So let's start with the applications of
deep learning at Booking.com. What are the different applications that we saw at Booking.com that made sense for us to use?

Before I start, let me talk about the scale that we operate on. We have more than 1.5 million room nights booked every 24 hours, in more than 1.4 million properties across 220 countries. And let me make one thing clear: I'm not bragging about these numbers. The whole point I'm making here is that we work at such a huge scale that it provides us access to a large amount of data, which we can then utilize to improve the customer experience. So let's see how we do that.

The first application is image tagging. If you see this image, the first question that arises is: what's there in this image? It's a really easy question as well as a really difficult one; it depends on who you ask. If you ask a person, it's an easy one, because we can see what's there in a particular image. But it's not as easy as it looks for a machine to answer.

Why is it a difficult problem? Because the context matters a lot. For example, if we pass this image to some publicly available ImageNet or DenseNet networks, which are available for image tagging, these are the results we get. We know that it's a beach house, building, penthouse, apartment. Okay, what do we do with that? How do we make sure that we are able to use these tags to improve the customer experience? We can't, because these are the things that matter for us: we care about whether there is a sea view from this room or not, whether there is a balcony, whether there's a bed in this photo, whether it's a breakfast photo or a swimming pool or what.

So the problem is hard because what matters more than identifying the images is your context. From Booking.com's point of view things are going to be totally different; the domain of the things that we care about is going to be totally different from what other people do. So we can't really say that we are
better at doing this thing; it's just that we are better at doing this thing in our own domain, because this domain is not interesting for others. It's only because we care about it.

Another thing that makes this problem really interesting is that, first of all, we have to come up with our own labels. At first it is like an unsupervised learning problem, because we don't really know what things we want to identify. So we have to label the objects and the themes that we want to identify from an image, and then we go about doing the supervised learning of identifying what's there in an image.

Another problem is that it's not like an image classification problem where you decide whether this is an image of this particular thing or that particular thing. Every image is going to have multiple tags, which means any photo can have a sea view as well as a balcony, as you can see in this case. And another interesting thing is that there is going to be a hierarchy of these tags. That means if we see there is a bed in the image, we can be sure that it's going to be an inside view of a room, unless it's a case where there is a bed but no room.

Once we know what's there in a particular image, we can use this information to help our customers look for properties that they really want, show them properties which have the features they are looking for, and hence make the process of booking a hotel easier for them.

Another interesting problem that we saw at Booking.com was the recommendation engine. The problem statement is really simple: user X booked hotel Y. Now we have a new user Z, and we want to predict what hotel will be booked by this person. So the objective is to find the probability of a particular user booking a particular hotel. We have some features,
like user features: where the person is coming from, what language they're using, what their history is, and all that. Then we have some contextual features: when they are looking to book, the day of the week, the season, and all the localization stuff. Next we have item features, which are properties of a particular hotel or apartment, like the price, the location of the hotel, and all that stuff.

I'm not going to go into details of exactly how we solve this problem, but research by Google shows that if we use wide and deep networks, we get better accuracy and better precision on this kind of problem. A deep network has multiple layers of hidden units, and a wide network is only one layer; we combine both of these together. It's a really nice paper that I suggest you read later. It's called wide and deep networks.

Once we saw some applications of deep learning that we could work on, we started exploring more, and credit goes to my colleagues [inaudible]. They started exploring it, and today we have some really nice models in production.

So this was about why we use deep learning and some example applications. Now let's talk about how we use deep learning, why containers are a part of this talk, and how we manage containers.

First of all, what's the life cycle of a model? What are the different phases? There are three stages: code, train, and deploy. What does that mean? In the first stage, a data scientist writes the code for a model. We use the TensorFlow Python API to write the models. In this stage the data scientist is responsible for trying out different kinds of features, different kinds of hidden units, different kinds of model architectures, or different kinds of interactions
between different features. This is like a hyperparameter tuning phase, where people try out different things and see what's the best architecture for this particular problem. Once they are happy with this testing of different features and trying out different things, the next stage is to train this model on production data. And once you train this model on production data, the next stage is to put it in production, and that's when things get really interesting.

But first of all, let's talk about the training part. These two parts, train and deploy, are the part of our deep learning production pipeline that I have been talking about. You may wonder why training a model is a part of our production pipeline, because you can also train your model on your laptop. But if you do that, you're going to end up something like this. There are a couple of reasons; for example, you may not have sufficient resources on your laptop, like number of cores or GPUs. So it's better to test your models on your local machine and, once you're happy with the performance and you want to train on your production data, move to bigger servers so that you can speed up the process of training.

This is what we do: we have big servers with a lot of CPU cores and GPU support, and we run our training script on those servers. Looks simple, right? But there's a problem. There are going to be multiple data scientists running their trainings at the same time, and we want to make sure that they are able to run their trainings in independent environments. So what we do is wrap this entire training script in a container and run that on our server.

So what's a container? A container is a lightweight package of software which contains all the dependencies that the app running in it needs to run. Here's what we do: we have our training script that's written by a data
scientist. We know what the dependencies are, like in this case TensorFlow, or Keras, or some other library. We package all of this, with the versions that the script demands, into a container and ship it to run on a server. This way we are able to provide an independent environment for all the trainings that are running. They are able to use the number of cores and all the other resources independently. The data scientists are not responsible for taking care of locking conditions or deadlocks or anything like that on the resources on the servers. The containers are also able to get GPU support from the servers: we can link particular GPUs to a particular container, and it is able to use them to speed up the computations.

So this is what the production pipeline looks like from a training point of view. We have our production data in Hadoop storage. Whenever we want to train a model, we create a new container with the training script and all the dependencies inside it. It takes the production data from Hadoop storage; we specify in the training script what data it needs. Next it runs the training and does all the stuff that's mentioned in the training script. Once we're done with the training, we want to make sure that we store the model weights that were trained. In TensorFlow terms we call them model checkpoints. So the next step is to save the model checkpoints back to Hadoop storage, and that's it: the container dies. Who can be more selfless than a container? It takes birth to do the stuff that you wanted it to do, and then it dies away. So that's what a training pipeline looks like.

The next step after training is to deploy a model. Let's see what we have done so far: we have written the model, tested it with different parameters, then trained it on production data. Now we have the trained model, and we want to make sure that we can use it. We can have some
clients, some web client or some app client, and we want to make sure that they can use the models we have trained in production. Let's see how we do that.

We have a Python app which runs, again, in a container, to make sure that we can easily scale up and scale down and also provide an independent environment to all of the models running. What this app does is two steps: it takes the model weights from the Hadoop storage, which we stored in the previous step while training, and it takes the model definition that's already there in the app. It combines them and loads the model in memory. So now our Python app is able to serve predictions once it gets any sort of inputs.

Let me recap. To be able to serve predictions from a TensorFlow model, we want two things: the model definition and the model weights. We combine both of them, load the model in memory, and now we are able to serve requests. The app provides a really nice, easy GET URL to get predictions.

This is how it looks from the client's point of view. We have an app running, and the client can be anything, a mobile app or a web app. All it has to do is send a GET request with all the input features and get back a response which contains the predictions that the model gives back. So it boils down to sending a GET request with all the features and getting the prediction back.

But since we work on a large scale, we have a lot of requests coming to these models, so we need to have more than one container. What we do is replicate these containers and put them behind a load balancer. The client doesn't know how many apps are serving; it only cares about the load balancer IP and sends the HTTP GET request to the load balancer IP. The load balancer is responsible for making sure that it distributes the traffic among all the apps that are serving.

But since we work on a really large scale, as I mentioned earlier as well, we need to have many
more containers, and as we keep on increasing the number of containers it becomes really difficult to manage them. This is where Kubernetes comes into the picture. How many of you know about Kubernetes? Okay.

Kubernetes is an open source project created by Google, and it's basically a container orchestration platform which helps in managing, scheduling, and a bunch of other things with containers. What we do is wrap this entire setup in a Kubernetes object, and we use replication controllers. What that means is that we can have a specified number of containers at any point during the production time of this model. For example, let's say we want to have 50 containers for this particular app, which is running a model A. The replication controller is going to take care that at any point there are at least 50 containers running. If some container dies in the meanwhile, it is going to create new containers with the specification that we mentioned while creating this replication controller.

There are a bunch of other features of Kubernetes. It's really easy to scale the number of containers up or down. If you know that you're going to have more traffic on weekdays, for example, and less traffic on weekends, you can specify how many containers you want to have. You can automate this process as well: if you're not sure how many containers you're going to need at any time, you can specify some metrics, like CPU usage or number of requests, and the Kubernetes object is going to make sure that it has a sufficient number of containers to serve the requests that you are getting. For example, you can specify the maximum threshold of the WSGI queue size; once it gets to that threshold, you want to spawn new containers to make sure that you are able to serve requests without dropping items from that queue. There are a bunch of other applications like this
where you can make use of Kubernetes to easily manage the cluster of containers, and then you can have multiple clusters of containers for different models.

Once we put these models in production, the next step is: how do we measure the performance? How do we make sure that we are not hurting our customers in terms of performance by deploying these models? Let's say your model takes some computation time to compute the predictions, because there is still a lot of mathematics going on behind the computations that turn your input features into the prediction output. But that's not going to be the time that your client sees, because there's going to be some request overhead. So the total prediction time becomes the request overhead added to the computation time. And if you have more than one instance to predict on in one request, you're going to have n times the computation time. We can see from here that for simple models, which have a really low number of features, or a low number of hidden layers, or no hidden layer, the computation time is not going to be the bottleneck, while the request overhead is. So it really depends on what kind of model you have. You should be careful to see what model you have and how much time it takes, and be sure what exactly you want to optimize for: do you want to optimize for latency, or do you want to optimize for throughput? Let's see both of them one by one.

What is latency? Latency is the time it takes to serve one request from the client's point of view. You want to optimize your serving for latency when you are serving live traffic, for example when you are getting the prediction in real time and you want to show some output to your user. That's when you should be optimizing for latency, so that you can serve that request as soon as possible. And what are the things that you can do to
optimize for latency? First of all, do not predict in real time if you can precompute. It may sound a little silly, because the whole point is to be able to serve in real time, but sometimes, when you have a fixed, finite set of features, it's better to precompute all the predictions from your model on those data sets, keep a lookup table, and serve the requests in real time from that lookup table instead of computing the predictions in real time.

The next thing you can do is reduce the request overhead. When you know that request overhead is the bottleneck for your latency, you want to optimize it. One of the things that we did to reduce the request overhead was embedding the model right into the app that's serving the predictions. As I mentioned earlier, we have this Python app, and we load the model right into memory, so that there's no latency between getting the GET request with the features and being able to actually start predicting. When the request comes, the model is right there in memory, predicting the output.

Next is: predict for one instance. Let's say you have more than one instance that you want to predict on for your app. Make sure that you send one instance to predict per request.

Another thing you can do is quantization, which means you lower your precision. You lose some precision because you change your 32-bit float values to 8 bits, but now your processor is able to handle four times more data, and that means you are able to predict faster.

There are also some TensorFlow-specific techniques, like freezing the network, which means you change all the TensorFlow variable nodes into constant nodes, and that speeds up the process of computing a prediction. Another is optimizing for inference, which means you remove all the
redundant nodes that are not required at prediction time, so that you can predict faster. These are some of the things that you want to optimize when you care a lot about the latency of your app.

Another thing you may want to optimize for is throughput. What does that mean? It is the amount of work done in a unit of time. You may want to optimize for throughput when your application is something like a cron job or an Oozie workflow, which runs in the background every 24 hours, or every hour, or every 10 minutes, and you want to predict a lot of instances. There you don't really care about when a particular instance gets predicted; you care about when the whole workload gets done. That's when you want to optimize for throughput, to make sure that you can get a lot of work done in less time.

Again, the first technique is: do not predict in real time if you can precompute and have a lookup table to serve those requests. Another one is really interesting: you batch your requests. You send thousands of instances, thousands of your input feature sets, to predict on in one request, so that you reduce the request overhead. If you send these one by one, every request carries the request overhead, but if you send multiple predictions in one request, you're going to get the output much faster. Another thing is to parallelize your requests: instead of waiting for your previous request to complete, send requests asynchronously. This is what we do a lot as well, to make sure that we can get a lot of work done in less time.

Let me summarize what we talked about. First of all, training of models in containers: whenever we want to train a model, we spawn a new container, it takes the data from the Hadoop storage and runs the training, and once the training is done we store the model checkpoints back to Hadoop storage. Now these model checkpoints
are ready to be served. The model checkpoints are the model weights of all the parameters that you have in your model. The next step is: how do we serve? We get these model checkpoints from Hadoop storage and load them into the memory of a Python app, which again is running in a container, and now this container is ready to serve requests. We have a cluster of these containers, which is managed by a Kubernetes replication controller object. Once we have this entire setup ready to serve predictions, we optimize the serving for latency or throughput, depending on what the purpose is. If you have a real-time app which uses these predictions, you want to optimize for latency; otherwise, if you want to get a lot of work done in a unit of time, you want to optimize for throughput.

That's all I have today. If you want to get in touch with me, you can follow me or contact me on all of these social media networks. I usually go by the username sahildua2305. And yeah, that's it. Thank you.

[Applause]
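
Editor's sketch: the timing model described in the talk (total prediction time is the request overhead plus n times the computation time) can be written out in a few lines of Python. This is only an illustration under stated assumptions, not Booking.com's code. The function names and the example numbers (5000 microseconds of request overhead, 200 microseconds of computation per instance) are hypothetical, chosen to show why batching helps throughput when the overhead dominates.

```python
def total_prediction_time(overhead_us, compute_us, n_instances):
    """Time in microseconds for ONE request carrying n_instances instances.

    Follows the model from the talk:
    total = request overhead + n * computation time.
    """
    return overhead_us + n_instances * compute_us


def time_one_by_one(overhead_us, compute_us, n_instances):
    """Total time when each instance is sent in its own request, one after another."""
    return n_instances * total_prediction_time(overhead_us, compute_us, 1)


def time_batched(overhead_us, compute_us, n_instances):
    """Total time when all instances are sent together in a single request."""
    return total_prediction_time(overhead_us, compute_us, n_instances)


if __name__ == "__main__":
    # Hypothetical numbers for a simple model where overhead dominates.
    overhead_us, compute_us, n = 5000, 200, 1000
    print("one by one:", time_one_by_one(overhead_us, compute_us, n), "us")
    print("batched:   ", time_batched(overhead_us, compute_us, n), "us")
```

With these assumed numbers, sending 1000 instances one by one costs 5,200,000 microseconds (5.2 seconds), while one batched request costs 205,000 microseconds (0.205 seconds). The per-instance computation is identical in both cases; only the number of times the request overhead is paid changes. The same model also shows why a latency-sensitive client sends a single instance per request: the response time grows with n.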