How we serve Deep Learning model predictions at large scale


[Applause] Hello everyone, my name is Sahil Dua, as Phil already introduced, and I work at Booking.com. Today I'm going to talk about a very unique combination: deep learning, which is a special branch of machine learning, and containers, something that all the cool kids have been talking about these days. In short, this talk is about how we productionize our deep learning models and how we deploy them using containers and Kubernetes. I hope I covered all the buzzwords.

Before I start, I'd like to go over the agenda, but first, can you please raise your hands if you have used deep learning, or know what deep learning is? Okay, quite a lot. And now, those who have written apps which run in containers? Okay, cool. I will start by discussing some of the applications of deep learning at Booking.com which were really unique to us and how we solved them. Next we're going to talk about the lifecycle of a model: what the different phases in the life of a model are, from the initial idea stage to production. And then I'm going to talk about our deep learning pipeline, which contains all the buzzwords that I mentioned.

Before that, who am I? I'm Sahil Dua, a back-end developer building deep learning infrastructure at Booking.com. I'm a machine learning enthusiast, which means I spend a lot of time learning about different technologies and new things coming up in this industry. I'm also an open source contributor and have contributed to a bunch of projects that you probably use in your daily life, like the pandas library, which is a library for data analysis written in Python, and a bunch of other projects by Google, Mozilla and a lot of other organizations. I'm also a tech speaker, and that's why I'm here on the stage today.

Let's start with the applications of deep learning. But before that, let's talk about the scale: why are our problems interesting or hard to solve? We have more than 1.5 million room nights booked every 24 hours across more than 1.4 million properties in 220 countries. I'm not here to brag about these numbers; my only point is that we work at a huge scale, and that provides us a huge amount of data that we can then utilize to improve the customer experience further.

The first application is image tagging. When we see this image, the question that arises is: what do we see in this image, what is in this image? For a human this is a really easy problem: you can look at it and tell there's a balcony, there's a sea view, and all that. But how do we make sure that our machines are able to learn that and detect it in any new image that comes in? This is a hard problem to solve, and an interesting one as well, because context matters a lot. What I mean by that is, if we pass this image to some existing models by Google or by Amazon, these are the results we get: "oceanfront", "nature", it tells us it's a building, an apartment. Okay, it's an apartment, but what do we do with that? It's not that these models are not good; they are really good at predicting what they see in this image. But in our context there are different things that matter to us, like sea view: we care about whether there is a sea view from a particular room or not, we care about whether there is a balcony, whether there is a bed or a seating area, and so on.

And before you start thinking it might be an easy problem to solve, there are a couple of challenges with that. First of all, it's not an image classification problem, which means there are going to be multiple tags for every image; it's not like you just classify the image into one of the categories you have, there are going to be multiple tags for one image. Another challenging part is that there is some sort of hierarchy to these tags, because if you know that this image is of a bed, you can say it's an inside view of a room, unless you're in such a room where there's no room, only the bed.
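Because each image can carry several tags at once, a model like this typically ends in an independent sigmoid per tag instead of a softmax over mutually exclusive classes. Here is a minimal sketch of that idea with tf.keras; the tag list, the backbone and the layer sizes are just illustrative assumptions, not our actual tagging model.

```python
import tensorflow as tf

# Hypothetical tag vocabulary -- in reality there are many more tags.
TAGS = ["sea_view", "balcony", "bed", "seating_area"]

def build_multilabel_tagger(num_tags=len(TAGS)):
    # Any image backbone works here; MobileNetV2 is just a small example.
    backbone = tf.keras.applications.MobileNetV2(
        include_top=False, pooling="avg", input_shape=(224, 224, 3))
    x = tf.keras.layers.Dense(256, activation="relu")(backbone.output)
    # One independent sigmoid per tag: a single image can activate several of them.
    outputs = tf.keras.layers.Dense(num_tags, activation="sigmoid")(x)
    model = tf.keras.Model(backbone.input, outputs)
    # Binary cross-entropy treats every tag as its own yes/no decision.
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```

The tag hierarchy can then be applied as a post-processing step on top of these per-tag probabilities, for example inferring "inside view of a room" whenever "bed" fires.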
So once we know what's in an image, we can use that information to help our customers make the decision about the property they want to book easier and faster.

Another application we saw at Booking was a recommendation engine. The problem statement is really simple: user X booked hotel Y; now we have a new user Z and we want to predict what hotel they are going to book, or in general what hotel they are more likely to book. So the objective here is to find the probability of a particular hotel being booked by a particular user. As you can already guess, there are some features that we have, like user features, which are about the particular user: where are they from, what language are they browsing in. Then there are some contextual features, like what day of the week it is, what they are looking for, what the season is, depending on the localization. And then we have item features, meaning the properties of a particular hotel: what's the price, what's the location, what's in the neighborhood, and all that. Research has shown that if you use deep learning and build deep models with a lot of features, they become better at recommending a hotel to a user compared to collaborative filtering, which is the non-deep-learning way of doing it.

Once we saw there were applications where we could utilize deep learning, because we had a lot of data and a lot of context, a lot of features about both the properties and the users, we started exploring deep learning. The credit goes to my colleagues Stas and Amira, who started that exploration, and these days we actually have a lot of models which work on the basis of deep learning and are in production.

Next, let's talk about the lifecycle of a model. That was about the different areas where we explored deep learning; now let's talk about what exactly a model is and what the different phases are. There are three phases: first code, then train, then deploy. In the first stage, a data scientist is responsible for writing the code for the model, using any kind of framework or library available; we use the TensorFlow Python API to write our models. In this stage the data scientist tries out different kinds of features, different kinds of embeddings, different architectures of the model, and tries things out on sample data. Once they're happy with their model, with all the stuff they have been doing, and with the accuracy they get on the small data set, it's time to actually train the model on bigger data, the actual production data, so that we have a good model which learns a lot and is able to predict accurately. That's the train part of the lifecycle. Once we have trained the model using production data, we go to the deployment stage.
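To make that code stage a bit more concrete, here is a minimal sketch of what a model definition with a few embedded user, context and item features could look like with the TensorFlow Python API. The feature names, vocabulary sizes and layer sizes are made-up placeholders, not our production model.

```python
import tensorflow as tf

def build_booking_probability_model():
    # Hypothetical inputs: one user feature, one context feature, one item feature.
    country = tf.keras.Input(shape=(1,), dtype=tf.int32, name="user_country_id")
    weekday = tf.keras.Input(shape=(1,), dtype=tf.int32, name="day_of_week")
    price = tf.keras.Input(shape=(1,), dtype=tf.float32, name="hotel_price")

    # Learn dense embeddings for the categorical features.
    country_emb = tf.keras.layers.Flatten()(
        tf.keras.layers.Embedding(input_dim=250, output_dim=8)(country))
    weekday_emb = tf.keras.layers.Flatten()(
        tf.keras.layers.Embedding(input_dim=7, output_dim=4)(weekday))

    x = tf.keras.layers.Concatenate()([country_emb, weekday_emb, price])
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    x = tf.keras.layers.Dense(32, activation="relu")(x)
    # Output: probability that this user books this hotel.
    prob = tf.keras.layers.Dense(1, activation="sigmoid")(x)

    model = tf.keras.Model([country, weekday, price], prob)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```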
Training and deployment are the two parts which build up the production pipeline that I have been talking about, and we will see them in detail next.

Those who have worked with machine learning might ask: why do we need training in the production pipeline at all? Can't we just train our models on our laptops? Well, if you try to train your model on your laptop, this is what you may end up looking like. There are a bunch of reasons why that's the case. One is the lack of resources on your machine: it's possible that you don't have enough cores or enough GPU power to run the training at a good speed on your laptop. Another is that your production data may be too large to fit in memory. This is why we train our models on our servers, which have plenty of resources, and not on local machines.

So this is how the training of a model in production looks: we have our training scripts, we have big servers with a lot of resources, and we run those training scripts on those servers. Easy, right? You just replace your laptop with a bigger server and that's it. But there are problems with this approach as well, because there are going to be multiple data scientists running their model trainings at the same time, and we are not going to be able to provide them independent environments or support for different versions of the libraries. This is where I move to containers.

So this is what we do: we take the training script and wrap it up inside a container. What is a container? A container is a lightweight software package which includes all the dependencies that your software needs to run. Every time we want to run a training of a model, we take the training script, package all the dependencies that the training needs, wrap it up in a container, and run that container on one of our servers. These containers are also able to use the GPU support available on those servers, so we can link some of these containers directly to the GPUs we have and run the training faster than it would run on a laptop.

So this is how it looks: we have our production data in Hadoop storage. Every time we want to train a model, we spawn a new container which has the training script shipped with all the dependencies it requires. It takes the production data from the Hadoop storage, runs the training, and does all the computation on that data. Once the training is done, we want to make sure it stores back the model checkpoints that have been trained. Model checkpoints are the model weights that the model has learnt during the training process; "checkpoint" is the term used by TensorFlow for these weights, these parameters of a model. So once the container is done with the training, it stores the model checkpoints back to Hadoop storage and then it goes away. Exactly: who can be more selfless than a container? It does the work that you want it to do and then it dies. So once we are done with the training, we have a model that has been trained on the production data and we have exported its checkpoints.
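As an illustration, the entrypoint of such a training container could look roughly like the sketch below. The paths, the flat-CSV data format and the stand-in network are assumptions for illustration; the point is only the shape of the flow: read data, train, write the checkpoint back, exit.

```python
# train.py -- hypothetical entrypoint baked into the training container image.
import argparse
import pandas as pd
import tensorflow as tf

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-path", required=True)        # e.g. a CSV export from Hadoop storage
    parser.add_argument("--checkpoint-path", required=True)  # where the learned weights go back
    args = parser.parse_args()

    # Toy assumption: the training set is a flat table with a "booked" label
    # and numeric feature columns; a real pipeline would stream from HDFS.
    df = pd.read_csv(args.data_path)
    labels = df.pop("booked").values
    features = df.values.astype("float32")

    # Small stand-in network; the real model definition comes from the
    # data scientist's code that is shipped inside this same container.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(features.shape[1],)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.fit(features, labels, epochs=5, batch_size=1024)

    # Persist only the learned weights (the "checkpoint"); the container
    # exits afterwards and is thrown away.
    model.save_weights(args.checkpoint_path)

if __name__ == "__main__":
    main()
```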
The next step is to deploy this model so that we can actually serve predictions from it to our clients. So let's talk about what deploying a model looks like. We have a Python app which runs in a container, again in a container because we want to make sure that all of our different models run in their own specific environments, for the same reasons as before, such as having different versions of the dependencies. The app takes the model weights from the Hadoop storage, where we stored them in the previous step, and loads the model in memory. To load a model in memory we need two things: the model definition, which means all the features, all the interactions between features, and everything else about what the model looks like, its architecture; and the model weights, the parameters that we trained in the previous step. We combine these two and keep the model in memory, ready to serve predictions, and on top of that the app exposes a really nice URL to get predictions from. So this is how it looks: we have our app running, doing exactly what I described on the last slide, and it is able to serve predictions using this mechanism. The client just has to send a GET request with all the input features and gets back a response which contains the predictions it is expecting from that particular model.

But since we work at a large scale, our problems don't end with having one server. We create multiple replicas of those containers serving the app and we put them behind a load balancer. Now the client doesn't care how many apps we have running; it just sends requests to the load balancer, and the load balancer is responsible for distributing the traffic so that we are able to serve more requests per second.

As our problems get bigger, we have more and more containers for a particular app, and at some point, when you keep increasing the number of containers, you need a way to manage them, because it's really difficult if you have this whole cluster of containers running somewhere and no proper way to manage it. We use Kubernetes for that. Kubernetes is an open source project by Google, later handed over to the CNCF, the Cloud Native Computing Foundation, and it is a container orchestration platform which helps us in scheduling, managing, creating and doing all kinds of things with a cluster of containers. Kubernetes now handles all the pain of managing the containers and maintaining the number of replicas in a cluster. For example, if we want to make sure that we always have 50 containers for a particular app, we just specify that in the configuration, and Kubernetes makes sure that if some of the containers die because of some error or some network issue, it recreates more replicas of the same specification that we mentioned and keeps the number of replicas at the number we specified. It's also really easy to scale up or scale down based on some parameters. For example, you want to scale up when you have a lot of requests, and how do you measure "a lot of requests"? You might check the size of your WSGI queue, or you might check the CPU usage of your apps, and once it crosses a threshold you may want to increase the number of replicas that you have. Kubernetes provides a really nice API to do all of that.
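For example, scaling a deployment up or down can be driven through the Kubernetes API; here is a small sketch using the official Python client. The deployment name, namespace and replica count are made-up values for illustration, not our actual setup.

```python
from kubernetes import client, config

def scale_model_serving(replicas: int):
    # Inside a cluster you would use config.load_incluster_config() instead.
    config.load_kube_config()
    apps = client.AppsV1Api()

    # Patch the desired replica count of a hypothetical "model-serving" deployment;
    # Kubernetes then creates or removes containers until the count matches.
    apps.patch_namespaced_deployment_scale(
        name="model-serving",
        namespace="default",
        body={"spec": {"replicas": replicas}},
    )

# e.g. scale out before an expected traffic peak
scale_model_serving(50)
```

In practice a Horizontal Pod Autoscaler watching CPU usage can make this decision automatically, which matches the CPU-based scaling signal mentioned above.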
So once we have deployed our model, what's next? Since we work at a large scale, the problems don't end with putting the software in production; there is also performance measurement: how do we measure the performance of a model that's in production? Let's say you have a model, and it takes some computation time to compute the predictions for a given set of input features. But that's not the time your client sees, because there is also some request overhead. So the total time is the request overhead added to the computation time for all the predictions you want in one request. As you can see, if we have a really simple model, a model with fewer input features, fewer interactions, or maybe fewer hidden layers or neurons, the request overhead is going to be the main bottleneck of serving, because the computation time is going to be really small: there is less computation, so it takes less time. In those cases the request overhead is the main thing you need to optimize in your serving.

Once we know what the prediction time looks like, we can try to optimize for our use case, and we can optimize either for latency or for throughput. Let's talk about both of these one by one.

First, optimizing for latency. Latency is the amount of time it takes for one request to return its response to the client. You may want to optimize for latency when you are serving predictions in real time, for example when you are serving something online on your website or in your app and you really want to make sure the prediction is back by the time you show something to your users. The first thing you can do when optimizing for latency is: do not predict in real time if you can pre-compute. This may look funny at first: we are actually trying to predict in real time, and here I am saying don't predict. But the latency is always going to be better if you don't do anything in real time. The point is that sometimes, when you have a limited set of features, you can pre-compute all the results and store them in a lookup table for every combination of features you may have. Using that, you can serve your predictions faster than you would if you were computing them in real time.

Another thing you can do is reduce the request overhead. When you know the request overhead is the main bottleneck for your serving, as we saw on the last slide, you may want to reduce it, and one of the things that we do for that is to cut down the latency between getting the request and being able to predict. Basically, we load the model right into the memory of the app, so it's embedded right there, and once the app gets the GET request it can forward the input features directly to the model and get the predictions back. So we reduce the request overhead by embedding the model right into the app; there might be other ways to solve this as well.
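A minimal sketch of that "model embedded in the app" pattern, assuming Flask and the hypothetical model-definition helper from the earlier sketch, could look like this; the endpoint, feature parsing and checkpoint path are illustrative assumptions, not our actual app.

```python
# serve.py -- hypothetical serving app; one container runs one model.
import flask
import numpy as np

# Assumed module containing the build_booking_probability_model() sketch shown earlier.
from model_definition import build_booking_probability_model

app = flask.Flask(__name__)

# Load the model once, at startup: the definition comes from code,
# the weights come from the checkpoint we trained and stored earlier.
model = build_booking_probability_model()
model.load_weights("/models/booking_probability/checkpoint")

@app.route("/predict", methods=["GET"])
def predict():
    # Input features arrive as query parameters on the GET request.
    country = np.array([[int(flask.request.args["user_country_id"])]])
    weekday = np.array([[int(flask.request.args["day_of_week"])]])
    price = np.array([[float(flask.request.args["hotel_price"])]], dtype="float32")

    # No network hop to a separate model server: the model lives in this process.
    prob = float(model.predict([country, weekday, price])[0][0])
    return flask.jsonify({"booking_probability": prob})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```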
The next thing is: always predict for one instance. That means if you have, say, three or four different sets of input features and you want to optimize for latency, it's better to send only one row of input features per request, because that way you get each prediction back as fast as possible.

Then there are some other things like quantization: you can lower the precision of your values from, say, 32-bit floats to fixed 8-bit values. That helps because you can now fit four times more data through your processor, which means your CPU can compute things faster, which means you serve your predictions faster than before. And since we use TensorFlow, there are some TensorFlow-specific techniques, like freezing the network, which means you change all the variables to constants, and that helps a little bit with the computation time. The next one is optimizing for inference, which means you remove all the nodes that are not used at prediction time, and hence you make the computation faster.

The other use case is optimizing the serving for throughput. Throughput is the amount of work you get done per unit of time. You may want to optimize for throughput when you care about how much work is getting done, or how quickly a batch of work is getting done; for example, if you have some job which runs every day or even every hour, all you care about is when all the work is done, not when one individual prediction comes back. So you may want to optimize for throughput to get the work you want done quicker.

Again, the first thing is: do not predict in real time if you can pre-compute, as I discussed earlier. The next thing is to batch the requests: if you have a lot of inputs that you want predictions for, always batch your requests. For example, if you batch them so that you send 10,000 rows in one HTTP request, you remove the per-request overhead of all the requests you would otherwise have made one by one. This is what we actually use a lot when we have a lot of work to be done in a cron job or in other workflows. Another thing is to parallelize your requests: don't wait for one request to complete before sending the next one, send them asynchronously, so that you can get a lot of work done quickly.
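To illustrate the batching and parallelizing ideas on the client side, here is a rough sketch using `requests` and a thread pool; the endpoint URL, payload format and batch size are assumptions for illustration.

```python
import concurrent.futures
import requests

PREDICT_URL = "http://model-serving.example.internal/predict_batch"  # hypothetical endpoint
BATCH_SIZE = 10_000

def predict_batch(rows):
    # One HTTP request carries a whole batch of feature rows,
    # so the request overhead is paid once instead of once per row.
    response = requests.post(PREDICT_URL, json={"instances": rows}, timeout=60)
    response.raise_for_status()
    return response.json()["predictions"]

def predict_all(rows):
    batches = [rows[i:i + BATCH_SIZE] for i in range(0, len(rows), BATCH_SIZE)]
    # Send several batches in parallel instead of waiting for each one to finish.
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        results = pool.map(predict_batch, batches)
    return [pred for batch in results for pred in batch]
```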
Now I'll try to summarize what we talked about, so that you get the whole picture, because it was a lot of different things. First of all, we talked about training models in containers: every time we want to train a model, we create a new container which has the training script and all the dependencies; it gets the data from Hadoop storage, runs the training, and once the training is done it saves the model checkpoints back to Hadoop storage, and that's it, the container dies; it's so selfless. Once we have the model weights ready, once we have done the training, the next step is to put the model in production. For that we have a Python app which loads the model in memory using the model definition and the model weights that we collected in the training step, and we run those containers as part of clusters managed by Kubernetes, because Kubernetes helps a lot with managing all these different clusters of containers according to the specifications we give it. And the last part is optimizing the serving for different use cases: if you have a real-time use case, you may want to optimize for latency, and if you have a use case where you want to do a lot of work in one burst, it's better to optimize for throughput; I mentioned some of the techniques that we can use for both of these use cases.

If you want to get in touch with me, I'm available on almost all social media as @sahildua2305, and yeah, that's it. Thank you! [Applause]