Advanced Redis data structures


Thank you! Hello guys, it's awesome to be here. It's a bit cold, but awesome. Today I will speak about Redis and Python, and how to extend the data structures that are available in Redis with something more powerful. I will start with some simple examples and then move on to some more complex ones, and I will also tell you how we actually used Redis to save tens of thousands of dollars on servers and on paying for other services.

A bit about me: I'm the founder of Doist, the company that makes Todoist, which is a pretty popular to-do list app with millions of data items. I'm also a co-founder and former CTO of a social network called Plurk, which is big in Asia Pacific, more specifically Taiwan and the Philippines. At Plurk I helped it scale to billions of data items for tens of millions of users, and Redis was a big part of that scaling; later on I will show one of the examples of how we used Redis to scale.

So, how many of you have actually used or know a bit about Redis? Please raise your hands. OK, so there are some. The thing about Redis is that it's a database where everything is in memory, but you still have persistence. It has amazing performance and it can do a lot of things: you get really deep data structures that you can use to solve a lot of different problems. Later on I will showcase an analytics library that we have built on top of Redis to store hundreds of millions of events using just one server.

I also think that the lead developer of Redis, antirez, is a really cool guy. He has developed Redis from the beginning and has probably committed most of the code, and he is working on some amazing new things such as Sentinel and Cluster that make it much easier to scale Redis from just one
server. So basically you can have a cluster of servers and scale to much, much bigger levels. Some big companies use it too: I think Baidu has terabytes of data inside Redis. So it's really, I think, a database with a lot of future.

Let me explain a bit about the different database styles we have today. You have relational databases, which are the most common: MySQL, Postgres. Then you have column databases, which are a bit more popular today: Cassandra, BigTable, HBase and so on. And then you have Redis and some other similar databases, such as Tokyo Tyrant, where you basically have a key-value database but with very deep data structures such as sets, lists, hashes and bitmaps, and a lot of other things. For instance, you can use Redis to do a pub/sub (publish-subscribe) implementation pretty easily, and you can use that to implement server push, like Comet communication and things like that.

In all the startups I've ever done we have used relational databases and Redis together, and I think you need to use them both. The problem with Redis is that everything needs to fit inside memory, so if you store a lot of data in it you will probably fail, because it's not really built for that. It's built for highly specialized data structures that solve a particular problem you have, where the data growth is huge.

So why is Redis cool? For me it's really that it represents the data types defined in normal programming languages, and working with those is very natural for us. For instance, you can easily work with lists or sets or hashes, and you can easily model and solve your problem that way. In the upcoming slides I will showcase some tiny libraries I have built to demonstrate how you can use Redis to implement different data types.

The first of these, redis_wrap, is basically a
wrapper that makes Redis a bit more Pythonic: it makes Redis lists and hashes work like normal Python data structures. Here is the usage: you get a list, you can append to it, take its length, and check whether an item is inside it, and all of these get transformed into Redis commands. Another part is the mimicking of hashes: you have a Python API, you get a hash, you can check whether something is inside it, do assignment, take its length. And the last thing is sets: you get a set, you can check membership, add something to it and so on, and you can also iterate over it.

This is just an example of how you can make working with Redis very Pythonic; you don't need to compromise and make things ugly. Implementing this is actually really simple: it's maybe 50 lines of Python code to implement this interface.

Another simple example is a graph database. Maybe you need some kind of graph for the problem you're solving, and if this graph does not grow a lot you can use Redis for it; even if it grows a lot, you can still use Redis. LinkedIn has their whole social graph inside an in-memory database, because that's the only way you can efficiently answer queries like "how far are you connected from somebody". If you do it the normal way, without memory, using normal databases, it would require much, much more resources. I don't think LinkedIn uses Redis, but they do store the whole social graph in in-memory databases. So basically we can model a graph where we have edges: we can add an edge from one node to another.
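The list part of that Pythonic wrapper idea can be sketched like this. All class and method names are illustrative, not the actual library's API, and a tiny in-memory stand-in replaces the real redis-py client so the sketch runs without a server (with redis-py you would pass `redis.Redis()` and the same commands would go over the wire):

```python
# Sketch of a Pythonic list wrapper over Redis commands (illustrative
# names, not the actual library API). FakeRedis is an in-memory stand-in
# implementing only the commands the wrapper needs.

class FakeRedis:
    def __init__(self):
        self.store = {}

    def rpush(self, key, value):            # RPUSH
        self.store.setdefault(key, []).append(value)

    def llen(self, key):                    # LLEN
        return len(self.store.get(key, []))

    def lrange(self, key, start, stop):     # LRANGE (inclusive stop, -1 = end)
        items = self.store.get(key, [])
        return items[start:] if stop == -1 else items[start:stop + 1]


class RedisList:
    """Makes a Redis list behave like a normal Python list."""
    def __init__(self, client, key):
        self.client, self.key = client, key

    def append(self, value):
        self.client.rpush(self.key, value)

    def __len__(self):
        return self.client.llen(self.key)

    def __contains__(self, value):
        return value in self.client.lrange(self.key, 0, -1)

    def __iter__(self):
        return iter(self.client.lrange(self.key, 0, -1))


tasks = RedisList(FakeRedis(), "tasks")
tasks.append("write slides")
tasks.append("give talk")
assert len(tasks) == 2
assert "give talk" in tasks
assert list(tasks) == ["write slides", "give talk"]
```

Hashes and sets follow the same pattern: `__setitem__`/`__getitem__` map to HSET/HGET, and set membership maps to SISMEMBER.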
Then we can check if two nodes are connected, get the neighbors of a node, delete edges, and set node and edge values. This is pretty simple, and that is the whole implementation. Maybe you don't need a graph, maybe you need another data structure, but you can easily model it using Redis, and it does not have to be a thousand lines of code; this is maybe 50.

Another thing we use a lot at Todoist is queues. Our queues are implemented in Redis, and today we process millions of jobs per day. This implementation is maybe 50 lines of code or so, so you don't need to use Celery or something much more complex to solve this problem. We have also done extensions for our queues: for instance, we can delay a job, or we can make jobs get distributed to specific workers, and things like that. If you work with Celery or something much more complex, some of these extensions are much harder; but if you just have 50 lines of code with really good performance, you can do all kinds of extensions and fit it to the problem you're solving. So basically you can delete jobs, you can put jobs (or "add jobs", that's probably a better name), you can get some stats about jobs, and you can also reserve a job and process it. This is the actual implementation of a very, very simplified queue inside Redis.

All of these are very simple things you can do, and now I will show something we have built that is more complex, but that also solves a more interesting problem. How many of you know what cohort analytics is? OK, not many, so I will explain what cohorts are, how we actually implemented them, and why the implementation matters.
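A minimal sketch of such a Redis-backed job queue (names are illustrative, not the talk's actual code): jobs are JSON blobs in a Redis list, LPUSH adds, RPOP reserves. An in-memory stand-in again replaces the real client so the example runs without a server:

```python
import json

class FakeRedis:
    """In-memory stand-in for the few list commands the queue needs."""
    def __init__(self):
        self.store = {}
    def lpush(self, key, value):            # LPUSH: add to the head
        self.store.setdefault(key, []).insert(0, value)
    def rpop(self, key):                    # RPOP: take from the tail
        items = self.store.get(key, [])
        return items.pop() if items else None
    def llen(self, key):                    # LLEN
        return len(self.store.get(key, []))

class SimpleQueue:
    def __init__(self, client, name):
        self.client, self.key = client, "queue:" + name

    def put(self, payload):
        """Add a job (LPUSH a JSON blob)."""
        self.client.lpush(self.key, json.dumps(payload))

    def reserve(self):
        """Take the oldest job off the queue (RPOP), or None if empty."""
        raw = self.client.rpop(self.key)
        return json.loads(raw) if raw is not None else None

    def stats(self):
        return {"pending": self.client.llen(self.key)}

q = SimpleQueue(FakeRedis(), "emails")
q.put({"to": "user@example.com", "template": "welcome"})
q.put({"to": "other@example.com", "template": "digest"})
assert q.stats() == {"pending": 2}
job = q.reserve()                 # FIFO: the first job put is reserved first
assert job["template"] == "welcome"
```

With a live server, `reserve` would typically use BRPOP so workers block until a job arrives instead of polling.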
So basically there's a startup called Mixpanel which does retention tracking, and I thought it was really good. In most products you actually want to track retention: these are cohort analytics, where you look at users that signed up on a given day and what they did afterwards (there's a better example later, so I will explain cohorts in more detail then). One of the problems with Mixpanel is the cost: we have millions of users, and tracking them would make us pay much more than two thousand dollars per month. So I looked at this and thought it could be solved better, that I could probably spend a day or two working on it and implement something equivalent, and it would save us a lot of money. That's basically what I did, and the project is called bitmapist.

It's an open source project; it handles hundreds of millions of events for Todoist, and it has O(1) execution. It's implemented on top of Redis, on top of bitmaps. So what can you use bitmaps for? You can answer questions like: has user 123 been online today, or this week? You can look at how many users have performed some action on a specific day, and then you can aggregate these things and ask very specific questions, like: this month, how many users have signed up and have added a task or completed a task? The thing is that everything is constant time, so it's very fast both to add and to update, and the reason is that it uses bitmaps. So what are bitmaps? They are basically operations working on bits, and Redis has commands that you can use to manipulate bits. That's basically what bitmapist uses.
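To make the bit commands concrete, here is what they do, simulated with plain Python integers (a Python int works as an arbitrarily large bitmap, so no Redis is needed to run this); in Redis these would be SETBIT, GETBIT, BITCOUNT and BITOP AND on string keys:

```python
def setbit(bitmap, offset):          # SETBIT key offset 1
    return bitmap | (1 << offset)

def getbit(bitmap, offset):          # GETBIT key offset
    return (bitmap >> offset) & 1

def bitcount(bitmap):                # BITCOUNT key
    return bin(bitmap).count("1")

# Mark users 3, 123 and 1000 as "active today" by setting their bits.
active_today = 0
for user_id in (3, 123, 1000):
    active_today = setbit(active_today, user_id)

assert getbit(active_today, 123) == 1    # "has user 123 been online today?"
assert getbit(active_today, 5) == 0
assert bitcount(active_today) == 3       # "how many users were active today?"

# BITOP AND combines bitmaps: users active today who also signed up today.
signed_up_today = setbit(setbit(0, 123), 77)
both = active_today & signed_up_today    # BITOP AND dest key1 key2
assert bitcount(both) == 1               # only user 123 is in both
```

Setting, reading and updating a single bit is constant time, which is where the O(1) claim comes from; counts and AND/OR combinations are linear in bitmap size but extremely cheap per user.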
With these commands it implements an analytics library that does not use much memory and has really nice performance. Basically, you can set a bit for some key, you can get the bit back, and you can do bit operations such as AND between keys; there's more about bits and bit operations in the Redis documentation.

This is the usage of bitmapist: you can mark an event, so here you mark that some ID is active, and you mark that some ID has signed up. Then you can ask whether this ID is active this month, and you can also do it for this hour, this day, this week or this year. You can also aggregate, for example take the count of all the users that have been active in a month. And then you can do operations on this: you can AND things and OR things, so you can implement almost any query you want. For instance, you can get all users that in a specific month have been active and have signed up and maybe have upgraded, and then you can use this to build very custom reports. For us this is a huge help, because we can answer almost anything about user behavior.

This is where cohorts come in: you use bitmapist to implement cohort analytics. In cohort analytics we section out the filtered people: say we want to see everybody that has signed up via iOS and has completed a task. Then you have the days (we bucket it by days here, but you can filter by months or years or whatever you want). So here we see users that signed up via iOS on the first of February, and there were 800 of those, and then we see how many of those have actually completed a task. The day columns are offsets: 0 is the first day, 1 is the second day, 2 is the third day, and so on. So basically you can see how your retention moves along: for example, you can see that after a while users stop adding or completing tasks, and that is a problem.
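The cohort computation described above can be sketched by keying bitmaps on (event, day) and ANDing them; plain Python ints again stand in for Redis bitmap keys, and the event names, day numbers and IDs are made up for illustration:

```python
from collections import defaultdict

events = defaultdict(int)   # (event_name, day) -> bitmap

def mark(event, day, user_id):
    """Set this user's bit in the bitmap for (event, day)."""
    events[(event, day)] |= 1 << user_id

def count(bitmap):
    """Population count, like BITCOUNT."""
    return bin(bitmap).count("1")

# Day 0: users 1, 2, 3 sign up via iOS; user 4 signs up elsewhere.
for uid in (1, 2, 3):
    mark("signup:ios", 0, uid)
mark("signup:web", 0, 4)

# Days 0-2: some users complete a task.
mark("task:completed", 0, 1)
mark("task:completed", 0, 2)
mark("task:completed", 1, 1)
mark("task:completed", 2, 4)

cohort = events[("signup:ios", 0)]          # the people we section out
retention = [count(cohort & events[("task:completed", day)])
             for day in range(3)]

assert count(cohort) == 3       # 3 iOS signups on day 0
assert retention == [2, 1, 0]   # of those: 2 completed a task on day 0,
                                # 1 on day 1, none on day 2 (user 4 is
                                # not in the cohort, so day 2 counts 0)
```

Each retention cell is just one AND plus one popcount over the stored bitmaps, which is why arbitrary filters (signed up via iOS AND completed a task AND upgraded) stay cheap.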
It means that your retention maybe isn't that good. You can also do much more complex filters on top of this: for instance, if you make a change in your app, maybe release a new version or optimize something, you can ask "this day we made this change; how has this affected our numbers?" This is really powerful and useful in product development. So that was bitmapist.

Seeing the time, now I will talk about another structure we have built, primarily for Plurk, but we also use it for Todoist. It's called fixed list. I will try to explain what the problem is and how we use Redis to solve it. The problem is that in social networks you usually have timelines, and these timelines grow exponentially if you're not careful: if you have a lot of interconnection between users and they need more and more data, the growth you get is exponential, and this is a huge problem to solve. The easy solution is to go out, build huge data centers and throw money at the problem. But there's a smarter way, and we learned it at Plurk, because when we started we didn't have that much money, so we couldn't build data centers to solve our problems.

So basically we cheated. We figured out that most users only look at the recent messages, which means we don't really need to keep messages from one year back, because almost no users scroll back that far. So what we came up with was to limit the size of the timeline to something like a thousand elements. Then it has a fixed size, we can do constant-time updates, inserts and gets, and the results are cacheable. The same solution is used at Facebook and Twitter: on your home timeline in Twitter you can only scroll back so far, because it is capped; I think their cap is around a thousand tweets.
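The capped-timeline trick can be sketched with the plain Redis primitives the speaker says would also work: every insert does an LPUSH followed by an LTRIM to a fixed size, so the newest items stay at the front and old items fall off the end. The cap, key names and in-memory stand-in below are illustrative:

```python
MAX_TIMELINE = 5   # a real deployment would use something like 1000

class FakeRedis:
    """In-memory stand-in for LPUSH / LTRIM / LRANGE."""
    def __init__(self):
        self.store = {}
    def lpush(self, key, value):
        self.store.setdefault(key, []).insert(0, value)
    def ltrim(self, key, start, stop):          # keep only [start, stop]
        items = self.store.get(key, [])
        self.store[key] = items[start:stop + 1]
    def lrange(self, key, start, stop):
        items = self.store.get(key, [])
        return items[start:] if stop == -1 else items[start:stop + 1]

def push_to_timeline(client, user_id, item):
    key = "timeline:%d" % user_id
    client.lpush(key, item)                     # newest first
    client.ltrim(key, 0, MAX_TIMELINE - 1)      # cap the size

r = FakeRedis()
for n in range(8):                              # 8 posts, cap is 5
    push_to_timeline(r, 42, "post-%d" % n)

timeline = r.lrange("timeline:42", 0, -1)
assert len(timeline) == MAX_TIMELINE
assert timeline[0] == "post-7"                  # newest kept...
assert "post-0" not in timeline                 # ...oldest dropped
```

This is the pure-Redis variant; the open-sourced implementation mentioned in the talk adds things like deduplication and compressed storage on top of the same idea.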
The same goes for the Facebook news feed: it's also capped, and the reason is that if you don't do this, you have this exponential growth problem, and it's really expensive to solve. So we did this: we implemented fixed lists to solve it.

I have done some benchmarks, and you can actually do this in pure Redis plus Python, using the primitives of Redis to implement a fixed list. Our implementation, which we have open sourced, does not use the pure Redis constructs, and it compresses the storage, so you get about 2.5 times faster performance and about 1.4 times less memory. That said, I think just using the pure Redis solution would work as well. With a fixed list you can add a value to it, you can add multiple values to multiple keys, and you can return values from it. It handles things like deduplication, so you can't have the same element inside the list twice, and you can remove a value. It's a very simple implementation, not many lines of code, but it saved Plurk tens of thousands of dollars, because instead of using MySQL or another database we could use Redis and get huge performance for the problem we were solving, of course by cheating a bit.

The last part of this talk is how you can take this even further. At some point you may have data types that need even more complexity, or that need to be even faster, and this is where you may want to go into Lua scripting. The Python library for Redis has great support for this, and I will show an example. The use cases would be: you need something more complex, more specific and unique, or you need better performance. I will show you how I implemented an increment operation in Python, just using the standard Python Redis client,
and there's nothing special about implementing it that way. Then I also implemented a Lua version of the same functionality: here I use register_script, and then I call the resulting script; this is the Lua for the increment command. I'm unsure how useful this is in general, but for some cases where you need to push performance to the limits, or you need something very complex, it works well, because it's executed directly inside Redis. You can do Redis calls from the script, like we do here; you can do the same things as from Python, but the performance will always be better. The Lua version is about three times faster than the Python version, and this is a very small example; you can do something much more complex, and I think the Lua implementation will be much faster.

There are some problems here, though. You must be aware that the script can execute for a long time, and it's basically locking on the keys here, so you should be careful not to do something monstrous in it; if you do, it will probably kill the performance of the Redis server. And the last thing here is a proof of concept: a fixed list implementation inside Lua. I have not really had a chance to play that much with it, but I think you could take the fixed list we have and implement it directly inside Lua, and you would get much faster performance than using Python.

So that's about it, I hope it was useful. There will be questions and answers, and there's also a special offer for you guys: we have Todoist, probably the best to-do app if you want to get more productive. You can get a free upgrade, worth $29: just send me an email and I'll upgrade you.
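The Lua increment idea can be sketched like this. The script and variable names are my own reconstruction, not the talk's exact code; `register_script` is the redis-py call for registering a Lua script, and a pure-Python reference function is included so the logic can be checked without a server:

```python
# A Lua script that implements increment-by-amount atomically inside
# Redis: read the key, add, write it back, return the new value.
LUA_INCRBY = """
local current = tonumber(redis.call('GET', KEYS[1]) or '0')
current = current + tonumber(ARGV[1])
redis.call('SET', KEYS[1], current)
return current
"""

# With a live server you would run (redis-py API):
#   import redis
#   r = redis.Redis()
#   incrby = r.register_script(LUA_INCRBY)
#   incrby(keys=["counter"], args=[5])

def incrby_reference(store, key, amount):
    """Pure-Python reference of what the script does, for checking the
    logic here without a Redis server."""
    store[key] = store.get(key, 0) + amount
    return store[key]

store = {}
assert incrby_reference(store, "counter", 5) == 5
assert incrby_reference(store, "counter", 2) == 7
```

Because the whole read-add-write runs inside Redis, no WATCH/retry loop is needed; the server guarantees the script executes atomically.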
OK, thank you.

[Q&A]

Q: How do you handle Redis connection errors? You use Redis as if it were a simple data structure, but imagine the client library raises an exception. Do you assume that Redis is absolutely available and just let exceptions propagate to the top of your code, or not?

A: Currently our Redis infrastructure is a bit fragile, but it has gotten a lot better with Sentinel, which is shipping with Redis. There's a lot of work being done on making Redis much more scalable and fault tolerant: with Sentinel you can easily have failover to another server and have this handled automatically.

Q: Thank you. Is it possible to remove items from a fixed list? In your example it was a timeline; for instance, when somebody deletes a post, you want to remove it from the timelines.

A: Yes, it definitely is. I think there's an example: you can remove items from it by value.

Q: Is it possible to specify an attribute of the item's structure, which defines the ID of the post, so you don't have to match the whole value?

A: I think it's probably possible, but it depends on how you design it.

Q: I mean, when the post is something like JSON and we put it into Redis, and one of the fields is the ID: is there a quick operation to remove the post by this attribute?

A: No, I think you would need to loop over all the affected users and use a direct call to remove it. You can't just say "remove this from all the lists"; I mean, you probably could, but I don't think it would be that performant.

Q: Thank you. Does Redis have transactions, and do you use them across your installations, and how do you manage that?

A: Yes. In this example, this is actually a transaction that is happening:
basically we are implementing the increment operation, and we use a pipeline, which means the whole thing is atomic: no other client can update the value while we are in the middle. You can see that you call WATCH here, and then UNWATCH here. So yes, Redis does support transactions. And in the Lua example, you specify the keys, and when Redis executes the script it basically blocks those keys, so you don't have two threads or two processes updating the same data.

And the other question, what was that? Ah, yes: we have a lot of Redis installations. Currently you need to do custom sharding on top of Redis to distribute the keys to other servers. This is basically what Redis Cluster wants to do: make it possible to have automatic sharding. But for now, in most cases you just have to implement your own sharding on top of it: you will always have a key pointing to a specific server, and you need to resolve this yourself. All the operations are atomic, so as long as a key lives on one server, you can do it.

Q: The max length of a set in Redis is around four billion elements. Did you ever see a situation when this is not enough, and what do you do then?

A: I think you should definitely be very careful with the data you store inside Redis, because it's not a general-purpose database. For instance, at some point we tried to store JSON inside Redis, and this was not a good idea, because it just uses a lot of memory and you will hit the limit.

Q: I mean, you showed this cohort analysis example, and as I understand it, it uses bitmaps or some kind of sorted sets. What about the situation when you have more than four billion users and the length is not enough?
A: Yeah, that's a good question. In most cases you could probably just buy bigger machines. I think it is a limitation of Redis; I've just never seen it in reality, but I'm pretty sure you could work around it with bigger IDs. I'm not sure about this; we have not hit this limit yet, so it's probably worth researching.

Q: Another question. We have some kind of holy war: some recommend using Redis only in a synchronous manner, because it's really fast, and when you use it in an asynchronous manner there is some overhead from the asynchronous calls. What do you think, should we use Redis only synchronously?

A: I don't think so. In our code we use a lot of transactions like these, and there isn't a huge penalty, because it's only blocking on the keys that you actually watch. At least, that's based on our experience.
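As a footnote to the transaction discussion above, the WATCH/pipeline pattern the answer refers to looks roughly like this in redis-py (a hedged sketch, not the talk's exact code): WATCH makes the MULTI/EXEC abort if the watched key changed, so you retry, giving optimistic locking on just the keys you touch. A runnable stand-in mirrors the retry control flow without a server:

```python
# With a live server the pattern would be:
#   import redis
#   r = redis.Redis()
#   def safe_incr(key, amount):
#       with r.pipeline() as pipe:
#           while True:
#               try:
#                   pipe.watch(key)                  # WATCH
#                   current = int(pipe.get(key) or 0)
#                   pipe.multi()                     # MULTI
#                   pipe.set(key, current + amount)
#                   pipe.execute()                   # EXEC
#                   return current + amount
#               except redis.WatchError:
#                   continue                         # key changed: retry

class WatchError(Exception):
    """Stand-in for redis.WatchError."""

def safe_incr(store, key, amount, conflict_once):
    """Same retry loop against a dict; conflict_once is a one-element
    list used to simulate a single concurrent write."""
    while True:
        try:
            current = store.get(key, 0)
            if conflict_once[0]:            # another client wrote the key
                conflict_once[0] = False
                raise WatchError
            store[key] = current + amount   # the "MULTI/EXEC" step
            return store[key]
        except WatchError:
            continue                        # re-read and retry

store = {"n": 10}
assert safe_incr(store, "n", 5, [True]) == 15   # succeeds after one retry
```

This is exactly why, as the answer says, the penalty is small: only writers touching the same watched key ever conflict and retry.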