Increase response rate with predictive analytics


This morning I'm talking about customer [unclear], and it's great to be here. I just want to do a quick level set, because these talks can be really deep or really shallow, and a lot of different people come here to find out different things. So, a show of hands: who has maybe used Google Analytics to look at their customers in the past? OK. And anyone doing any predictive modelling stuff, you know, random forests, trying to see the future? OK, that's much smaller. And is there anyone in the audience who does neither of those things and is just here today to try to figure out how to understand your customers better? [unclear] Cool, cool.

So hopefully what I'll try to do is first lay out the principles, then maybe the application and a little bit of colour, with a few [unclear] along the way. Feel free to ask any question, but let's not make it a saga: if you ask me a question and my response [unclear], we can save that for later. OK?

So, the agenda. I'm going to give you an introduction to me, then talk about the business problem, which gives a lot of context, then our data and data quality checks, [unclear]. Then the tools we used, so a little bit about the stack we put together, and our methods, so how we actually did it, and then the results and next steps, which are a little bit open. [unclear]

An introduction to me: I've been a business consultant for about five years, and right now I'd call myself a data scientist, whatever that means. I don't have a PhD in statistics and I'm not a mathematician. What I pride myself on is quite strong business logic, [unclear]. And the problem itself is
one we worked on with [unclear]. The problem was, we had a client who sent millions and millions of SMSes and received them. They wanted to try and understand how they could get people to respond to their campaigns, because for every SMS they received they got money, and for every piece of user information they captured they got money. Consumer products companies would pay them to gather details and effectively promote their products. It's mainly in the developing world. So that's the kind of loose context. [unclear]

Effectively, what they really wanted to know was: which behaviours are important, which ones tell you that a user will definitely engage? [unclear] And then, can these behaviours be compared across different regions? They operated in India, South Africa, Brazil, Indonesia, all of these different places.

One of the pitfalls people fall into is that sometimes they go really, really narrow on their cohorts when they're trying to understand the audience. You're trying to get commonality on the data side when you do this sort of work; you try to get everyone to be the same. So that was one of the goals: to figure out whether the guy who buys noodles in Indonesia is the same as the guy who buys ice cream in Brazil, because then we've got a bigger cohort to tackle than the guy who buys noodles in Indonesia, owns a low-cost mobile phone and has two kids, right? That's smaller, and you can't leverage anything off that.

And then of course the final point: so we know what these behaviours are, we know what's common. OK, so how do we actually get some value out? We've done all this great maths and all this data engineering, so what's next? So what? I just want to drive home that last point, because someone said
to me a while ago: no one cares how you do it, right? What they do care about is what they can do with it. I think sometimes we need to switch that around a little bit, and that's important to drive home. I'm going to go to the next slide really quick.

So, data quality, the data checks: what we got and what we had to do with it. We had 1.5 million user profiles, so a lot of people. We had five brands, eight campaigns and about 6.2 million SMS transactions, so there's a decent amount going on, right? You can see the regions there: South Africa, Indonesia, Brazil, India. You can see some of the brands there; they're very big consumer brands. You also have to remember that there would be above-the-line and below-the-line advertising going on during these campaigns, so not only was there the SMS incentive for the consumer, there would be TV, radio promotion, billboards, whatever. So it's quite a noisy data set, with stuff that we actually couldn't measure or manage.

So how did that all come in, split up in terms of what we had? This diagram looks completely [unclear], but I think it's quite important to get these things written down. It was a team of two of us doing this. The analyst was 24 at the time, so she was razor sharp but had very little real-world experience, and this was to make sure that we were all speaking the same language. The first thing is a list of all the people we surveyed. The second thing is every SMS that was sent out during all the campaigns. The third thing is the details of the campaign.

This is something I want to talk about a little bit more. We have this environment around the product, and you also have to account for the type of product the consumer is engaging with. So for instance, the ice cream campaign is a luxury brand in Brazil; the noodles are a low-cost, really marginal kind of brand [unclear]. That's a factor to include and consider when you're actually trying to examine consumer behaviour. So the guy might really
really go nuts on the noodle campaign because he wants to eat, whereas the guy might go nuts on the ice cream campaign for a different reason. There are all these different things that you need to be aware of. The fourth thing is [unclear]. And then the fifth thing, which is kind of a little bit cool, is this idea of features we think are important. So we looked at GDP in the different regions, we looked at the mobile phone networks, so, you know, high value, low value, all of these different things. This is where you get to have the most fun, because you can just pick anything you want. Obviously you've got time constraints and all that, but it does give you an opportunity to explore and use your imagination around what you could use.

OK, so it was perfect: we got given it all in a single file and they just handed it straight to us. No, it wasn't like that. They gave it to us in about 20 CSV files, right? It was just disgusting. There was no unique consumer ID, so everyone was identified by a mobile phone number, or by country, or by campaign. Even within a year they could have different providers on the same phone number. It was just bonkers. So we had to manufacture a unique ID.

Duplicate responses: [unclear], you're going to get that with anything that big. Incomplete profiles. And then non-uniform dates, so the date format differed from one campaign to another. Effectively our client just provided the numbers and then made a request to this thing called an IVR, and an IVR is a supplier of the engine that fired the SMS. So some IVRs used one date format and others used another. There was a business reason, and we had to deal with it. These are the problems that data scientists kind of de-glamorise and don't talk about.
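
The two clean-up jobs described above, a manufactured unique ID and a single date format, can be sketched roughly as follows. This is a minimal sketch, not the code used in the project: the date formats, the `make_user_id` recipe (country code plus digits of the phone number) and both function names are assumptions made for illustration.

```python
from datetime import datetime
import hashlib

# Hypothetical formats: the talk does not say which formats the SMS suppliers used.
KNOWN_DATE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M", "%m-%d-%Y")


def parse_timestamp(raw):
    """Try each known format in turn; return a datetime, or None if none match."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    return None


def make_user_id(country_code, phone_number):
    """Build a stable surrogate ID from country and phone number (assumed recipe)."""
    # Keep only the digits so "+91 98765-43210" and "919876543210" give the same ID.
    digits = "".join(ch for ch in phone_number if ch.isdigit())
    key = "{}:{}".format(country_code.upper(), digits)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]
```

Returning `None` for an unparseable date, rather than raising, makes it easy to count how many rows each supplier file loses before deciding whether another format needs adding.
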
This kind of clean-up is what breaks your heart; we spent so much time on it. So that was kind of the dataset.

I'm going to talk a little bit about the tools we used. There was a whole session on Jupyter yesterday, [unclear]. Were you guys around yesterday? Yeah, everyone's talking about it. I just want to run through quickly the reasons we chose what we chose. Amazon Web Services were generous enough to give us a voucher. They're expensive, but they're quite easy to use and they're well documented, so if you're beginning, that would be my suggestion. There are loads of other providers out there, so don't feel tied to them. Also understand what cost base you want to incur and what your process is. For instance, if we were to do this again: we had our database on RDS [unclear], and we had our Jupyter box on Amazon as well, and we probably would have moved off RDS to something cheaper. [unclear] We actually burnt most of our money on the database rather than on the actual EC2. And because it was free money, we took the biggest thing we could get. [unclear] I cannot stress enough how cool that is, just to be able to run that. And you can pass your notebook around; it's really, really easy.

There's one thing that I've yet to resolve, and I'd love to hear if anyone has cracked it: this idea of version control on a shared notebook. I'd make a change, and my colleague would go off and do something and make a change as well, and then you come back and you get an error message saying it's changed. It'd be great to have some kind of version control, or something like a Google Doc. That was the only challenge. Otherwise it's just there, switched on all the time, and you don't have to worry about uploading and downloading. And don't forget your database, right?
We're data scientists; databases are important. This whole concept of putting it on your laptop, having a copy and messing with it: it's not good practice, it's not something you should do. This is client data, part of their business. It's critical, they spent money on this, so be mindful of it. It's cheap enough that you can do that. And then MySQL Workbench: it's free, it's easy. And anyone doing machine learning who doesn't know SQL as well, you really want to get on that. Don't go off and learn machine learning and then forget the basics. [unclear] It's just golden. I learned a lot of SQL on this project, I have to say, and more about performance and how to index.

Cool. So that's why we picked what we picked. Jupyter's an obvious choice, Anaconda is an obvious choice. It all came in this gist here: you just click on it and it gives you step-by-step instructions. It doesn't tell you about RDS or the Workbench, but that's pretty straightforward as well. And I used Python 2.7. [unclear] For the next project we'll try [unclear]. I've had problems with 2.7: encoding issues when you have funny characters, [unclear]. But for us this was already pretty innovative, so we said, right, let's not go completely bonkers.

So that was the stack. Getting it set up, guys, shouldn't take you more than two days; a day at the most, or hours if you know what you're doing. Don't be scared, don't be worried about it. The instructions are so straightforward, it's just tick, tick, tick, tick, tick. And if you're finding it challenging, keep at it and you will get there. [unclear] Even when you're at the command line, logged in with your SSH key, and then Python goes
and crashes, just log in and reset it; it starts up again. It's worth the effort. One of the things I wanted to do is to tell you guys these things are worth the effort, so invest in them. [unclear] We'll be doing something in Spark next, because that's the next thing we want to tackle. The dataset was big, and we spent a lot of time loading and a lot of time joining. And they run two campaigns a week, so if they want to do this on their whole dataset, a relational database is not going to work. It's just not at the races. So we're going to have to do it in Spark; there's our next challenge.

OK, that's the tools. I'm going to take questions at the end, if that's OK. Oh, and switch it off when you're done. Switch it off when you're done. Switch it off when you're done. Really, really important. Everyone I know has a horror story about AWS. [unclear]

Right, we'll go to the packages, and I'm not going to talk too much about them. We used standard packages, basic stuff: scikit-learn, pandas and NumPy. pandas is a whole massive topic on its own, [unclear]; I'm not going to go there. I'm just going to give you some guidance around what we used. We also used this SQL driver here, which again: use it. Don't bother exporting your CSV onto your desktop and then loading it, or exporting it to your S3 bucket or whatever. The database is there; connect to the database. That's what you're supposed to do. Then you can take the script, and the next time you have a project you just connect to a different database; you don't have to change filenames or do anything. It's important you build these efficiencies into your process, because it saves you time to do the sexy stuff later, right?

OK. Oh yeah, we did actually do a little bit of plotting. Matplotlib is a standard tool for plotting [unclear], but we did so many plots.
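
The "connect to the database instead of exporting CSVs" advice above can be sketched like this. It is a minimal sketch under stated assumptions: the table and column names (`sms_responses`, `user_id`, `campaign_id`, `sent_at`) and the `load_responses` helper are hypothetical, and an in-memory SQLite database stands in for the MySQL instance so that the example runs anywhere.

```python
import sqlite3

import pandas as pd


def load_responses(conn, campaign_id):
    """Pull one campaign's responses straight from the database into a DataFrame."""
    # "?" is SQLite's placeholder style; a MySQL driver would typically use "%s".
    query = (
        "SELECT user_id, campaign_id, sent_at FROM sms_responses "
        "WHERE campaign_id = ? ORDER BY user_id"
    )
    return pd.read_sql_query(query, conn, params=(campaign_id,))


# Stand-in database with made-up rows, so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sms_responses (user_id TEXT, campaign_id INTEGER, sent_at TEXT)"
)
conn.executemany(
    "INSERT INTO sms_responses VALUES (?, ?, ?)",
    [("u1", 1, "2015-03-01"), ("u2", 1, "2015-03-02"), ("u3", 2, "2015-03-02")],
)
conn.commit()

campaign_one = load_responses(conn, 1)
```

Because the function takes the connection as an argument, pointing the same script at a different database only means creating a different connection object.
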
With that many plots in this project, I forget which package we used for what, because you're just constantly trying to get things out of the data. Usually by now I have it in my image: in my notebooks I just have a standard block with all my imports in it. In fact, if there's an exciting new one, I'll just pip install that and put it in again. So that would be my advice. It's up to you how you want to do it, whatever, but I usually just plumb it all in. Things are cheap at the moment. And I'd tell you the notebook isn't a performance tool, it's an interrogation tool. People are challenging that a little bit, but that's how I see it.

So now we get to the fun stuff: the methods, how we actually did this. I'm just going to pause for a second. Are there any questions before I go on? We have Q&A at the end, but is there anything really burning that people want to ask?

So, to summarise: we've got our data, we've set up a nice big toolbox to really go and do some fun stuff, and we have an idea of the business problem, an idea of what we're trying to solve. But as everyone knows, when the rubber meets the road it's a completely different story, right? That's why I put the business problem at the start, because you need some sort of compass to guide you through all the messiness that you get to later on.

So this is roughly how we did it. We did a thing called feature preparation: we prepped our features, trying to figure out what was lost, what worked and what didn't. Because remember, they had never done this before, [unclear], so no one really knew what the underlying drivers of response rate were. Then there's predictive modelling and user clustering, which ended up being an iterative thing, in that you're trying to figure out: well, what am I predicting for particular users? One of the challenges we faced was this idea of user clustering: how to do it properly, and then how to predict that cluster properly. And then eventually you kind
of get it; it's a virtuous circle. Some people will go clustering first, then predictive model second, and there are people who do it the other way around. I just feel that if you can do it in one pass and it works, you know, publish a paper. And then the last one is feature selection. So that's: I've done my dancing with the data, I have a feel for what it means, and now I can start actually picking the stuff that matters, that would impact the business and deliver on this business problem.

OK, feature preparation. We had 45 features per consumer, 20 per campaign, and then we made synthetic ones, so a lot of stuff. And two people, one of whom was not a native English speaker, right? So you've got to make sure that your naming conventions make sense and that everyone understands what they mean, so that if someone doesn't come into work one day, the other person doesn't make a mistake, or vice versa. And then in six months' time, you're looking at the code and you actually need to go and run it again. So think about naming conventions and spend some time on them; it's time well spent.

Kanban for processing. We had over 20 files, the different tables with everything in them. We had a big board with stickies: each sticky was a feature, and each colour was a different data type, for the one, two, three, four, five data types. As the stickies rolled in, as the features entered the database, we just moved them across. Then for our synthetic features and our joins we used a thing called Trello. Anyone know Trello? It's fantastic for short-term bursts. I wouldn't use it as a project management tool, but it's definitely good for collaborative working. So we used this idea of the Kanban, and it was well worthwhile.

And then synthetic features. There's a typo on the slide there. OK, so synthetic features are features that aren't
in the data set; they're features you manufacture. So for instance, time between responses: that's a synthetic feature. It wasn't in the dataset. We manufactured it, and we wanted to see if it was useful or not. As I said before, this is different to the features you're bringing in from outside, like GDP and all that sort of stuff, because these come from the data itself. I think this is where domain expertise comes in and you really start to understand the data.

The way we approached it the first time round was kitchen sink: absolutely anything we could possibly think of, we made a feature for it, made it binary, or made it a continuous variable, or whatever. I think we might have overdone it a little bit; I'll explain why, and why it was detrimental, later on. You've got to be aware of time. I always feel that it's the academic person against the business guy: the academic wants to pursue the intellectual challenge, and the business guy wants answers straight away. As you guys work on these problems yourselves, figure out what your balance is and where your cut-off point is. Sometimes I can be a little bit hasty, and my business partner [unclear], so it's a delicate balance to maintain.

OK, so that's the feature prep. I don't know if anyone does any cooking, or has been a chef before, but mise en place is this whole idea that every time a chef comes to his workstation to prep for his session, for dinner, he organises absolutely everything, so that when the orders start coming in, everything is ready to go. That's effectively what we did here: we were getting ready to do our hardcore data science by organising everything, so we could just reach for an ingredient there and reach for an ingredient there, with everything where we needed it. We would have spent, I don't know, maybe thirty or forty percent of our time on this bit, and it paid dividends. OK, and, you know, let's just have a butcher's at the data.
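
Going back to the synthetic-feature example mentioned above, time between responses, here is a minimal pandas sketch of how such a feature could be manufactured from raw response timestamps. The column names (`user_id`, `responded_at`, `gap_minutes`) and the sample rows are made up for illustration; the talk does not show the actual schema or code.

```python
import pandas as pd

# Made-up response log: one row per SMS response.
responses = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b"],
    "responded_at": pd.to_datetime([
        "2015-03-01 10:00", "2015-03-01 10:30", "2015-03-01 12:30",
        "2015-03-02 09:00", "2015-03-02 09:10",
    ]),
})

responses = responses.sort_values(["user_id", "responded_at"])

# Gap to the same user's previous response, in minutes
# (NaN for each user's first response, which has no predecessor).
responses["gap_minutes"] = (
    responses.groupby("user_id")["responded_at"].diff().dt.total_seconds() / 60
)

# One value per user: the synthetic feature that would be joined onto the profile.
mean_gap = responses.groupby("user_id")["gap_minutes"].mean()
```

The per-user mean skips the NaN from each first response, so a user with a single response ends up with no value for this feature and needs handling separately.
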
So, we've got it all in our database now, [unclear], and there's nothing like a little plot just to check it out, just to have a look. We had eight campaigns, and you want to have a look at them, making sure that there are no gaps. In one of them we had a hole straight down the middle: we had like three days of nothing. And one of them was skewed over to the right. So always check, always have a look, before you start doing the heavy-duty machine learning, because you can't go back and do the check after you've written your algorithms; by then you're like, "oh". So catch the really easy mistakes quickly and flush them out very, very fast. [unclear] The SMS data and metadata seem balanced; I'm happy.

Right, now we get into a little bit more interesting stuff, around outliers. We all have our ways of dealing with them. With a dataset this big there's always going to be junk. I think we had one mobile phone number that responded eleven hundred times. It was a test number for the campaign, and we weren't informed. So we spent a good bit of time histogramming it, box-plotting it, figuring out where it was. Eventually what we came out with was that there were about 10 users, 12 users, above 50 responses in total across all the campaigns. So we said, right, those twelve users are going to mess with our heads; let's just put them off on their own. We've got 1.5 million, so taking out 12 is OK. Like I said, I'm not a statistician, but I'm pretty sure you can just get rid of them.

And then you get this kind of plot, up to 50. It's not a particularly great plot, but I just want to talk about the outliers. Usually you just want to get rid of anything beyond two or three standard deviations, [unclear]; it's a pretty good indicator. Anything less and you're probably going to get a bit annoyed; anything more, [unclear]. These are really, really loose rules, and there's a reason for that: you can go and spend whole weekends and weeks on it.
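
The two loose rules mentioned above, a hard cap of 50 responses and a "two or three standard deviations" check, can be sketched as follows. This is an illustration only: the function names and the sample counts (200 ordinary users plus one test number with 1,100 responses) are invented, and the talk does not show how the cut was actually coded.

```python
import pandas as pd


def drop_heavy_responders(counts, hard_cap=50):
    """Keep only users whose total response count is at or below the cap."""
    return counts[counts <= hard_cap]


def sigma_cutoff(counts, k=3.0):
    """Return the 'mean plus k standard deviations' threshold for the counts."""
    return counts.mean() + k * counts.std()


# Made-up response counts: 200 ordinary users and one runaway test number.
counts = pd.Series([i % 10 for i in range(200)] + [1100])

kept = drop_heavy_responders(counts)          # hard-cap rule
flagged = counts[counts > sigma_cutoff(counts)]  # standard-deviation rule
```

On data like this both rules isolate the same record, which is the kind of sanity check the talk describes: if the simple rule and the eyeball test agree, move on.
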
Rather than agonise over which outliers to remove, we just do a sanity check: if it looks grand, I'm happy; if it doesn't look grand, then I'm not happy. The underlying reason is that I have to solve a problem, not prove a point in statistics land. You guys are allowed to do it to whatever degree you want, but for me: 50 looks grand, let's just get rid of those and move on. We now know that, below 50, we probably have a high density of zeros, and then a group up to around ten, and then anything greater than 10 is probably another cohort. Cool, so we're looking at something like three cohorts, but we'll get on later to how you can really test whether it's three cohorts or not.

Oh, one thing: valid responses. Always check. It's like when you look at an email campaign in marketing: "sent" is not the same thing as "received", right? So make sure that the events you're measuring are actually consumer events, not events that the process is generating. Those SMSes were declared OK by the receiver, so they were actually received by the consumer rather than just sent, because the IVR, the sending engine, would send them ten times over if they couldn't get through, especially in India.

Right, so that's great: we now know roughly what our users look like. Then we go down another level. I want to check out our campaigns, because remember, we're looking for commonality, right, across the campaigns and across the consumers. So always have a butcher's. We took out our crazy greater-than-50 guys and we did these things called box plots. They're literally just an idea of how the campaigns are responding and what the distribution is. You can see campaign one actually had quite a high response, like lots of people responded more than once, which is interesting. The others are kind of low; they're skewed towards the right, sorry, the left, and there's a few out between three and four. And we had two other campaigns where they were
absolutely insane; they were all the way over here, I think up in the 40s and 50s. It actually turned out campaigns one to six were for product registrations, so each response represented the purchase by a consumer of a particular consumer product, and campaigns seven and eight were just a reward: if the consumer responded, they got credit on their phone. So they had an immediate desire to respond, whereas this consumer had to invest some time and money, and given that they're in a developing nation, [unclear], there's a lower propensity to respond.

OK, so here we get a little bit technical. I didn't really know where to put this slide, and people will probably doze off a little bit, but if you do care, it's an interesting thing to think about. I want to talk a little bit about mobile phone networks. We had eight campaigns across four countries, and I think we had about 60 or 70 mobile phone networks in our data set. We didn't know if the mobile phone networks were like a Meteor or a Vodafone; we had no idea which ones were really premium. [unclear] So we said: OK, well, we're data scientists, we're machine learning experts; if we categorise that variable and stick it in the model, all will be revealed. So to model the mobile networks we did this thing called one-hot encoding. Who's familiar with it? OK, right, let's go back to basics.

Imagine I have someone's name, I have how much they spend a month, and then I have the shop that they spent it in, so it's Tesco, SuperValu or Centra. Machine learning only works on numbers: I can't tell it "Centra", I can't tell it "SuperValu". It needs either a continuous or a discrete variable. So what you do with one-hot encoding is, instead of having one column with text in it, you have three columns, and each column represents a shop: SuperValu, Tesco or Centra.
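
The shop example can be sketched in pandas as follows. This is a minimal illustration using `pandas.get_dummies`; the names and spend figures are made up, and it is not necessarily the snippet shown on the speaker's slide.

```python
import pandas as pd

# Made-up shoppers: a name, a monthly spend and the shop as a text string.
shoppers = pd.DataFrame({
    "name": ["Aoife", "Brian", "Ciara"],
    "monthly_spend": [120.0, 45.5, 80.0],
    "shop": ["Tesco", "SuperValu", "Centra"],
})

# Replace the single text column with one 0/1 column per distinct shop.
encoded = pd.get_dummies(shoppers, columns=["shop"], prefix="shop", dtype=int)
```

The text column `shop` disappears and three new columns (`shop_Centra`, `shop_SuperValu`, `shop_Tesco`) take its place, so a variable with 60 distinct values would add 60 columns in the same way.
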
And if it's a Centra shopper, the Centra column will have a 1 in it and the other two will have zeros. So you're immediately giving the machine numbers to work with instead of text strings. That's called one-hot encoding.

Based on my experience, I would be very, very hesitant to use it. In the shopping example we're trying to model how much they spend, and spend is continuous, so it can be from 0 all the way up to a hundred and all the numbers along the way. But with one-hot encoding you only get a one or a zero, a binary result, and in my opinion it over-influences the machine learning algorithm, or the model, to a degree that maybe it shouldn't. It also adds extra features to the model: in our case we had 60 mobile phone companies, which means we've got 60 extra features, or 60 extra columns. And when you use binary variables in clustering, you get problems. Binary variables [unclear]; I don't like them. I need to think about them a bit more, and about how you apply them.

But I just wanted to put this in so that, if you start coming up against text strings when you want to predict, you know one-hot encoding is a way you can do it in Python. It's a really easy little snippet of code: you just give it the column name, say the campaign name, and that builds you a set of columns of ones and zeros for the mobile phone networks or whatever variable you want. I'll take questions on that afterwards. It might be a bit boring, but I just want you to go, "ah, now I know how to do that".

So we've done our one-hot encoding, [unclear]. So, back to the data. We had 45 pieces of information per consumer, right; it's a lot. We then had 30 pieces of information about each campaign.
Then we had 20 items of behavioural information, so that's the synthetic stuff: how often they text, the time between texts, and all of these different things. So there's our database, as I showed you at the start. We took our campaign-based features, we took our time-based features, and we took our user characteristics, and then we just distilled them down into key features which we felt were important. That's basically what you're trying to achieve: you're trying to get to what matters, and the rest is noise around it.

So, stuff like: time between in and out; was it in English or not; the delta between the user getting a message and sending one out; how much the reward was worth; average time between responses; and how many messages they received, so whether they were peppered with messages. Some of these are very obvious; others are a little bit interesting. The English-or-not one is kind of cool, and the reward info is kind of cool: you can start playing with rewards and figuring things out, which is really, really nice. And that's effectively it. [unclear]

Now we get into this iterative loop of user clustering and predictive modelling. I'm going to try and get through this at a high level, and I'm happy to take questions, technical or otherwise. This could be really boring on a Sunday morning, or it could be the best [unclear]. It's all about this iteration.

OK, so we talked about clustering. Who's familiar with k-means clustering? OK, yeah. And who gets the idea of why you cluster? OK, so I got more hands for k-means than I did for why; that's the audience, whatever. So we see here: there's our histogram of response rate, so how many users there are and how often they responded. And there's our sexy k-means clustering of continuous variables, right? So which
one do you think is better? I mean, the one on the left is obviously easier to interpret: you know you've got a cluster here, a cluster here and a cluster here. Here, you're like, what the actual hell is going on? They're different coloured dots; they're not blobs like they were on the GitHub repo I looked at. They're not distinct. What am I going to do with that? We spent a long time trying different values of k, taking out discrete variables, so zeros and ones, using only continuous variables, all this different stuff, and we just kept getting junk. So we just went: right, histogram it is. Intuitively it feels good, we can explain it, let's go with that.

The reason I put that up on the board is that, unless you're trying to prove a theorem on clustering, you've got a job to do, so do the thing that you're comfortable with, not the thing that you read about in a blog. Because eventually, if you can't explain your output, [unclear].

[Audience question about the k-means plot, largely inaudible.]

Sorry? Oh, that's just the output of the k-means algorithm. On the left, yeah, that's the user count, and then the X axis, sorry, is response rate, so it's how many people responded: [unclear], and then less than 50.

[Audience member: just visually looking at it, I'd say there are three clusters ... there are three distinct clusters. Does my classification make sense?]

Yes, but then you've got that black one at the top. How do I explain the features? [unclear] I appreciate what you're saying, but normally with clusters you're looking for distinct blobs. [unclear] You could call it that; I'm not saying you couldn't. But for me you have to think about the logic further down the line.
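
The trade-off discussed here, cohorts read off a histogram versus k-means on continuous features, can be sketched side by side. Everything in this example is an assumption for illustration: the data is randomly generated, the column names are invented, and the bin edges (0, 10, 50) simply follow the "zeros, up to around ten, greater than ten" cohorts and the cap of 50 mentioned earlier in the talk.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Randomly generated stand-in data; not the campaign data from the talk.
rng = np.random.RandomState(0)
users = pd.DataFrame({
    "responses": rng.poisson(lam=3, size=300),
    "mean_gap_minutes": rng.exponential(scale=60.0, size=300),
})

# Option 1: cohorts read straight off the response-count histogram.
users["cohort"] = pd.cut(
    users["responses"], bins=[-1, 0, 10, 50], labels=["none", "low", "high"]
)

# Option 2: k-means on standardised continuous features.
continuous = users[["responses", "mean_gap_minutes"]]
standardised = (continuous - continuous.mean()) / continuous.std()
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
users["kmeans_cluster"] = kmeans.fit_predict(standardised)
```

The binned cohorts come with a one-line explanation ("this user never responded"), while the k-means labels are just integers whose meaning has to be reverse-engineered from the cluster centres, which is the interpretability concern raised in the talk.
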
then put those clusters into a machine learning algorithm, right, and you get a prediction around the features or whatever, and then you have to go and explain it to someone. Yeah, that's a really valid point: do you need interpretability or do you need prediction? Those are two very different things. And I think if we were a black box and the client just wanted us to send messages to people we knew would respond, I'd have spent more time on k-means and worked it into my model. But I had to go back to the client and tell him roughly what he should be looking out for, because he needed to go and get more data in a particular manner. So, yeah. So regardless, those must be interpreted? Yeah, probably... probably not. And this goes back to: I always feel like I'm just hanging on by my fingertips with this sort of stuff. I'm pretty sure I could have spent some time on it, but I just didn't feel comfortable enough, because with billable work you want to make sure you're on top of it. However, I had a chat with a friend of mine who is infinitely more clever than me, and he suggested that... Oh yeah, the other thing was that we had no discrete variables in this clustering; that was another reason. So no ones and zeros, everything was continuous, which dropped sixty or seventy percent of our features, right? So I just wasn't comfortable with it. And the next thing is, if you are inclined, and especially the gentleman in the front row, I'd highly suggest you have a look (I have a paper if you want) at spectral clustering. It knocks it out of the park for that sort of complex, odd clustering stuff. So, you know, there's always a way to crack it. But spectral clustering for now is maybe three or four days of me going around figuring out how to implement it cleanly on my data and doing some test cases. But at least I know it exists, and I don't have to worry that I'm
doing this wrong. I know it's right, and I know there's a different way to solve it. That's kind of important: don't be intimidated by the simple things or the big things or whatever it is. Just get it done. Okay, so now we get to some interesting stuff. Now we've got our clusters, and we've got like eighty features per consumer. We have a massive Amazon box, so we don't much care, but if we were doing this for every single data point and predicting on every feature, it's going to cost a lot of money and, most importantly, it takes time: you watch and wait, and then the answer comes out. And then what happens is you get a tree that's unbelievably massive, and you're trying to figure out what's going on in this tree and which features are important. You know, we had to print them out on A3 because they were so big. Does everyone kind of understand what a decision tree is? Basically it's the influence of each feature as we go down, and you're trying to spot the big nodes. How big the campaign was, was a really interesting feature; what the reward involved was another feature; and then the 'failed but succeeded later' feature was actually quite important, and I'll talk about that in a minute. And here's the code to do it: you do your cross-validation, you set your max depth, and in the decision tree classifier we give it a max depth of 3, which I think is pretty important. And entropy, everyone's familiar with it: a measure of chaos. I just use these measures as a point of reference; I don't tend to explore them too much. I'm like, okay, it says use entropy. I don't specifically understand what the measure of entropy means in the model, but I know that, relative across the whole tree, that one has a big number and that one has a little number, so the big one
is more important to pick, right? That sounds like a really tacky and cheap way to do things, but it works, so don't knock it. So yeah, that's your decision tree. Then you can start to figure out: okay, I'm chopping it up into four or five layers, I'm starting to see what features are important, then I can go back, take out the features that I don't need, and start rerunning the model against the clusters of ones, more-than-ones, fifties, etc. Okay, so then we get to this thing called a confusion matrix. Anyone used one of these before? Okay, who hasn't? Right. So these were developed in the 40s by the Royal Air Force for radar predictions. Effectively, what they were trying to do was figure out: if they made a prediction that a plane was going to be somewhere and it was there, then great; if they made a prediction that a plane was going to be there and it wasn't there, then they put a tick in another box. So you can actually start to see how good your predictions are. All right, so what we did, based on our histogram and a little bit of pixie dust: we figured out that there is a cohort of people who never respond. Okay, they're interesting. There's a cohort of people who respond one time. Interesting. And then there are people who respond more than once. You saw from that graph that big block, 10 to 50, but realistically, between two and fifty, the size of that cohort was, I think, nearly sixty percent of the entire consumer base. You'll see the numbers in a second, I think. But it was basically: never respond, respond one time, and respond more than one time. So, three distinct use cases. Back to explainability: I can go back in and I can say that to the client, and everyone understands it. Yeah? Just to say, can you tell me who your customer was, at a high level? Like, was it the
chief marketing officer? Okay, so the customer at our level was the base manager. For those of you who don't know, he's the guy who manages the customer base and how they interact with it. He reports to the chief of marketing. The marketing guy would say, 'We're running a campaign in India on noodles, make it work.' So the base manager would go off, look at his base, and go, 'Right, I want this, I want that,' all this sort of stuff. So that's the customer. He understood data; he wasn't a data scientist. His goal was just to execute and do it well. So we had to feed him things so he could go back to the BI team and say, 'Okay, BI team, I want this type of consumer for this campaign, and BI team, I want that type.' There was too much of that happening. What you really want is to push it a little bit further: we just build a model that can extract consumers straight out of the database based on campaign criteria. So they tell us what the campaign is, and the model just goes and pulls the group of consumers they need. But you have to walk before you can run. Is anyone doing any talks on getting stuff into production? For me that's the holy grail: getting this sort of thing into production and actually working, because there's a lot of people... Yeah, that was a bit inappropriate; I apologize. So what we do with this graph is: here we tried to predict frequency zero, frequency one, and frequency greater than one. So down here we're getting good prediction rates, because it's one and one, two and two. Yeah, we're getting a strong picture. Here, we predicted 362 people would never respond, but they actually responded more than once. So that's what that chart kind of means. Effectively, you're looking for a good, strong sense of things. And what I want to call out is, right, and yes, we did,
you've got to be careful what vectors you feed into the model, right? Because we got a really strong model, we got like ninety-nine percent, and we were like, yes, look at that! If anyone says their model works that well, either it's overfitted, or the problem is already solved, or they've made a mistake somewhere. And we made a mistake. What was happening was that some customers were sending SMSs and the SMSs weren't being processed, weren't being credited, because either the campaign was closed, or it exceeded the limit, or basically there was a failure. We wanted to capture that as a feature and ask: what's the likelihood of them trying again? Are they persistent? So we said, if they tried once and failed, and then they tried and failed or tried and succeeded, then that's a one; that's a feature, because we know they're persistent consumers. We put that into the model and, lo and behold, it seems to be a really strong indication they'll actually respond. Because they've already tried; of course they're going to try again. You only realize it when you do it. So we had to take it out of a fantastic model. Just be careful: if it seems too good to be true, it is. And that's this point about explainability again. So always be careful with that sort of thing. You're looking for around 75, 80. Five minutes? Yeah. Anything around 75, 80, yeah, good. Anything over 80, you're a genius or something, right? And this other thing about cross-validation and folding, right? We've talked a lot about train/test; that's one step. Folding is another step. I'm running a little short on time, but I want you all to put folding in all of your models from now on. Just do it. There are no constraints on processing power anymore, because you'll be using Spark or AWS. So just do folding; it makes
for better modeling. It's just good practice. And there is k-fold; it's kind of a benchmark, and we've explained it briefly there; you can go into more detail on it. Right, so now we are at the end. Okay, so effectively we did all this predictive stuff, and then we figured out the features that were important across all consumers. All the consumers: which features are important? And we saw one or two campaigns causing some features to dominate the entire set. So this idea of features across all consumers was actually a misnomer, and we effectively just proved it. So then we said we have to do the model per customer, per campaign, and then we get a graph like this. Effectively, each one here is a campaign, and each one of these is the percentage by which that feature influences the likelihood. Now, for each campaign there were 30 or 40 variables that came out as predictors, some of them so low we just stripped them out; we ignored them. And then we set our top eight, our top nine, to all add up to one, so you get these nice little graphs that you can understand and look at: okay, everything adds up to one here, everything adds up to one there. Part of that is not statistically valid or correct, but it's good to look at, and it gives a non-statistics person an anchor point to go, 'Aha, I understand now, we know what you're talking about.' So that's kind of interesting. So user behavior and language dominate, which is kind of nice. Language: English or not. Yes, in some campaigns the SMSs were going out in English and also in the native language, and the native-language ones were doing better. So you're like, well, then just send them all in Portuguese, or stop sending them, whatever. So that was one aspect we drew out. The second one is probably more pertinent and way more important. So we're like, great, we now know
the features. However, eighteen percent of our users account for seventy-eight percent of our responses. That's, like, straight off the bat, Pareto. Do I have to explain Pareto? 80/20, right? You can look it up on Wikipedia. So these are your power users. These are the guys you want to send to nearly every time. If you've got a new client coming on board, or a big one, these are the guys you want to go to. And so that's the other thing you've got to think about: okay, I've built my sexy machine learning, I have an idea of how each campaign works, but where are we going to drive real value? What are we going to give the client that they can actually use? And this is what we figured out. So we did basically the same graph for the high-value cohort, and you can see some that are different. Cohort size is kind of an interesting one: it's the size of the messaging batches they were sending. Were they sending micro-batches to individual regions, or just sending them all at once? So that's kind of interesting. And then to conclude: we did what we said we'd do; we got to our key features. And this is the bit that's nothing to do with data science, but I wanted to put it in. Okay, so this is to do with data quality. Data collection was really, really, really poor on this, right? Some of it is like 51% valid, and three percent on the date of birth. Just imagine the date of birth data capture: 3.8 percent of the dates of birth captured in South Africa were valid. We couldn't use date of birth or demographic information, which would have been huge if we could have. And it worked in Indonesia... Right, so then we get to our final slide. This is the slide we presented to the client; this is the one they wanted to see. This is the whole idea of 'what the actual hell have you done for the last five weeks?' And we figured out
that if they improved, if we could get them using our methods for a hundred campaigns, they'd earn an extra... an extra yield, whatever, by way of three different things. Data is probably, like, straight off the bat: it will open up your stuff, and what we've seen is terrible. It doesn't take a data scientist to go, 'Guys, pick up the pace here and you'll actually get more value from your execution.' That's the type of stuff you should say to people. Don't be, 'Well, you know, my such-and-such value is whatever.' It's like, no, this is what's important. Then the cohorts: generating high-value cohorts, that's the eighteen percent. And then this last one, and this is something that people really, really like and really get hooked on. People who don't respond, don't respond; the likelihood of getting them to respond is kind of in the wind, right? People who respond loads will probably respond loads, but you want to know who they are. People who respond once, you want to make them respond more than once, and if you can do that, that's the whole thing right there. And that's where the behavior stuff comes in, right? It's not about knowing who they are; it's about encouraging them to respond more. If I were to send one message to you guys about this type of work, it's: find the ones who respond once and get them to respond one time more, and the return there is absolutely through the roof. Okay guys, that's the end. If you've any questions, I think we've got time. Okay guys, this is