5 Years of Rails Scaling to Support Massive Sales


My name is Simon. I work on the infrastructure team at Shopify, and today I'm going to talk about the past five years of scaling Rails at Shopify. I've only been around at Shopify for four years, so the first year took a little bit of digging, but I want to talk about all of the things that we've learned, and I hope that people in this audience can maybe place themselves on this timeline and learn from some of the lessons that we've had to learn over the past five years.

This talk is inspired by a guy called Jeff Dean from Google. He's a genius, and he did a talk about how they scaled Google for the first couple of years: he showed how they ran with a couple of MySQLs, then went through the NoSQL paradigm, and finally to the NewSQL paradigm that we're now starting to see. This was really interesting to me because you saw why they made the decisions they made at that point in time. I've always been fascinated by what made, say, Facebook decide that now is the time to write a VM for PHP to make it faster. So this talk is about that: it's an overview of the decisions we've made at Shopify, and less so about the very technical details of all of them. It's meant to give you an overview and a mental model for how we evolved our platform. There's tons of documentation out there on all of the things I'm going to talk about today: other talks by co-workers, blog posts, READMEs, and things like that, so I'm not going to get into the weeds.

I work at Shopify, and at this point you're probably tired of hearing about Shopify, so I'm not going to talk too much about it. Overall, Shopify is something that allows merchants to sell things to other people, and that's relevant for the rest of this talk. We have hundreds of thousands of people who depend on Shopify for their livelihood. We run almost 100K requests per second at peak, during the largest sales, on our Rails servers; our steady state is around 20 to 40K requests per second, and we run this on tens of thousands of workers across two data centers. About 30 billion dollars have made it through this platform, which means that downtime is costly, to say the least. You should keep these numbers in the back of your head as the numbers we've used to get to the point that we're at today. Roughly, these metrics double every year, so if you go back five years, you just have to cut them in half five times.

I want to introduce a little bit of vocabulary for Shopify, because I'm going to use it loosely in this talk to explain how Shopify works. Shopify has at least four sections. One of them is the storefront: this is where people are browsing collections, browsing products, and adding to the cart. This is the majority of traffic; somewhere between 80 and 90 percent of our traffic is people browsing storefronts. Then we have the checkout. This is where it gets a little bit more complicated: we can't cache as heavily as we do on the storefront, and this is where we have to do writes, decrement inventory, and capture payments. The admin is more complex: you have people who apply actions to hundreds of thousands of orders concurrently, people who change their billing and need to be billed, and all these things that are much more complex than both checkout and storefront in terms of consistency.
Finally, the API allows you to change the majority of the things you can change in the admin; the only real difference is that computers hit the API, and computers can hit the API really fast. Recently I saw an app for people who want their order numbers to start at an offset of, say, a million, so this app will create a million orders and then delete all of them to get that offset. People do crazy things with this API, and it is our second largest source of traffic after the storefront.

I want to talk a little bit about a philosophy that has shaped this platform over the past five years: flash sales are really what has built and shaped the platform that we have. When Kanye wants to drop his new album on Shopify, it is the team that I am on that is terrified. We had a fork in the road five years ago. That's when we started seeing customers who could drive more traffic for one of their sales than the entire platform was otherwise serving; they would drive a multiple of it. So if we were serving a thousand requests per second for all the stores on Shopify, some of these stores could get us to 5,000, and this happens in a matter of seconds: their sale might start at 2:00 p.m., and that's when everyone comes in. So there was a fork in the road: do we become a company that supports these sales, or do we kick them off the platform, throttle them heavily, and say this is not the platform for you? That's a reasonable path to take; 99.9-something percent of the stores don't have this pattern and can't drive that much traffic. But we decided to go the other route. We wanted to be a company that could support these sales, and we decided to form a team that would solve this problem of customers who can drive enormous amounts of traffic in a very short amount of time. I think that was a fantastic decision, and it happened exactly five years ago, which is why the time frame of this talk is five years. It was a powerful decision because these flash sales serve as a canary in the coal mine: the amount of traffic they can drive today, say 80K RPS, is what the steady state is going to look like next year. So when we prepare for these sales, we know what next year is going to look like, and we know we'll be able to handle it because we're already working on that problem. They help us stay one to two years ahead.

In the meat of this talk I will walk through the past five years of the major infrastructure projects that we've done. These are not the only projects we've done; there have been other apps and many other efforts, but these are the most important to the scaling of our Rails application.

2012 was the year where we sat down and decided that we were going to go the antifragile route with flash sales: we were going to become the best place in the world to have flash sales. So a team was formed whose sole job was to make sure that Shopify as an application would stay up and be responsive under these circumstances. The first thing you do when you start optimizing an application is try to identify the low-hanging fruit. In many cases the low-hanging fruit is very application dependent; the lowest-hanging fruit on the infrastructure side has already been harvested by your load balancer, by Rails itself, and by your operating system. They take care of all the generic optimization and tuning.
So at some point that work has to be handed off to you, and you have to understand your problem domain well enough that you know where the biggest wins are. For us, the first ones were things like backgrounded checkouts, and this sounds crazy: what do you mean they weren't backgrounded before? Well, the app was started in 2004 or 2005, and back then backgrounding jobs in Ruby or Rails was not really a common thing, and we hadn't really done it after that either because it was such a large source of technical debt. So in 2012 a team sat down and collected the massive amount of technical debt needed to move the checkout process into background jobs, so that payments were captured not in a request that took a long time, but in jobs, asynchronously with the rest. This of course was a massive source of speed-up: you're no longer occupying all these workers with long-running requests.

Another of these domain-specific problems was inventory. You might think that inventory is just decrementing one number, and doing that really fast if you have thousands of people, but MySQL is not good at that. If you're trying to decrement the same number from thousands of queries at the same time, you will run into lock contention. So we had to solve that problem as well, and these are just two of many problems we solved. In general, what we did was print out the debug log of every single query on the storefront, the checkout, and all of the other hot paths, and basically start checking them off. I couldn't find the original picture, but I found one from a talk that someone from the company did three years ago, where you can see the wall where the debug logs were taped; the team at the time would go and cross off queries, write their name on them, and figure out how to reduce them as much as possible.

But you need a feedback loop for this. We couldn't wait until the next sale to see if the optimizations we'd done actually made a difference; we needed a better way than just crashing at every single flash sale, a tighter feedback loop. Just like when you run the tests locally, you know whether it works pretty much right away, and we wanted the same for performance. So we wrote a load testing tool, and what this load testing tool does is simulate a user performing a checkout: it goes to the storefront, browses around a little bit, finds some products, adds them to its cart, and performs the full checkout. It fuzzes this entire checkout procedure, and then we spin up thousands of these in parallel to test whether the performance change actually made any difference. This is now so deeply embedded in our infrastructure culture that whenever someone makes a performance change, people ask: how did the load testing go? This was really important for us, whereas Siege just hitting the storefront, running a bunch of identical requests, is just not a realistic benchmark of realistic scenarios.

Another thing we did at this time was write a library called Identity Cache. We had a problem: we had one MySQL at this point, hosting tens of thousands of stores, and when you have that you're pretty protective of that single database. We were doing a lot of queries to it, and especially these sales were driving a massive amount of traffic to it at once, so we needed a way of reducing the load on the database.
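To make the backgrounded checkout mentioned earlier a bit more concrete, here is a minimal sketch of capturing a payment in a background job instead of in the request. This assumes modern ActiveJob-style Rails conventions, and every class and method name here is hypothetical; it is only an illustration of the pattern, not Shopify's actual code.

```ruby
# Hypothetical sketch: capture the payment asynchronously instead of in the web request.
class PaymentCaptureJob < ActiveJob::Base
  queue_as :checkout

  def perform(checkout_id)
    checkout = Checkout.find(checkout_id)
    checkout.capture_payment!              # the slow call to the payment gateway
    checkout.update!(status: "completed")
  end
end

class CheckoutsController < ApplicationController
  def create
    checkout = Checkout.create!(checkout_params.merge(status: "processing"))
    # The request returns immediately; a job worker captures the payment,
    # so web workers aren't tied up waiting on slow gateway calls.
    PaymentCaptureJob.perform_later(checkout.id)
    redirect_to processing_checkout_path(checkout)
  end
end
```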
The most common way of reducing load on the database is to start sending queries to read slaves: you have databases that feed off of the one you write to, and you start reading from those. We tried to do that at the time, and it has a lot of nice properties compared to other methods, but back then there weren't any really good libraries for it in the Rails world. We tried to fork some and figure something out, but we ran into data corruption issues, and management of the read slaves was really problematic because we didn't have any DBAs. Mind you, this was a team of Rails developers who had to turn into infrastructure developers and learn all of this on the job, because we were required to handle flash sales the way that we did. We just didn't know enough about MySQL and these things to go down that path, so we decided to figure out something else.

Deep inside of Shopify, Toby had written a commit many years ago introducing the idea behind Identity Cache: managing your cache out of band in memcached. The idea is that if I query for a product, I look in memcached first. If it's there, I just grab it and don't even touch the database; if it's not there, I put it there so that for the next request it will be, and every time we do a write we just expire those entries. This has a lot of drawbacks, because that cache is never going to be a hundred percent consistent with what is in the database. So when we do a read from that managed cache, we never write it back to the database; it's too dangerous. That's also why the API is opt-in: you have to call fetch instead of find to use Identity Cache, because we only want to use it on these hot paths, and it returns read-only records so you cannot change them and corrupt your database. This is the massive downside of read slaves, Identity Cache, or anything like this: you have to decide what you're going to do when the cache is expired or stale. This is what we decided to do at the time; I don't know if it's what we would do today. We've gotten much better at handling read slaves, and they have a lot of other advantages, such as being able to run much more complicated queries. But this is what we did at the time, and if you're having severe scaling issues already, Identity Cache is a very simple thing to adopt and use.

So after 2012, in what would otherwise probably have been our worst Black Friday and Cyber Monday ever, the team was working night and day to make this happen. There's a famous picture of our CTO face-planted on the ground after the exhausting work of scaling Shopify at the time; someone woke him up and told him, hey dude, checkout is down. We were not in a good place, but Identity Cache, load testing, and all this optimization saved us. Once the team had decompressed after this massive sprint to survive these sales and survive Black Friday and Cyber Monday that year, we raised the question of how we could never get into this situation again. We had spent a lot of time optimizing checkout and storefront, but that is not sustainable: if you keep optimizing something for that long, it becomes inflexible. Often fast code is hard-to-change code.
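As an aside, here is roughly what the opt-in, fetch-based API described above looks like. The IdentityCache gem is open source; the model and the lookup here are hypothetical, and this is only a sketch of the pattern, not Shopify's production setup.

```ruby
require "identity_cache"

class Product < ActiveRecord::Base
  include IdentityCache

  # Allow cache lookups by these columns as well as by primary key.
  cache_index :shop_id, :handle, unique: true
end

# Opt-in: fetch goes through memcached and falls back to the database,
# filling the cache for the next request. Records come back read-only.
product = Product.fetch(product_id)
product = Product.fetch_by_shop_id_and_handle(shop_id, "red-t-shirt")

# Regular #find still goes straight to the database and returns writable records.
Product.find(product_id).update!(title: "New title")
```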
To come back to flexibility: if you've optimized storefront and checkout, and only one team knows how to work on them, there's going to be a developer who comes in, adds a feature, and adds a query as a result, and this should be OK. People should be allowed to add queries without understanding everything about the infrastructure. Often the slower thing is more flexible. Think of a completely normalized schema: it is much easier to change and build upon, and that's the entire point of a relational database, but once you make it fast, it is often at the cost of becoming more inflexible. Or think of an algorithm, say bubble sort, whose complexity is n squared. You can make that really fast; you can make the fastest bubble sort in the world, write a C extension for Ruby with inline assembly, and have the best bubble sort on the planet. But my terrible implementation of quicksort, which is n log n, is still going to be faster. So at some point you have to stop optimizing, zoom out, and rearchitect, and that's what we did starting at this point. We needed that flexibility back, and sharding seemed like a good way to get it. We also had the problem that, fundamentally, Shopify is an application that takes a lot of writes: during these sales there are a lot of writes to the database, and you can't cache writes, so we had to find a way to scale those, and sharding was it.

So basically we built this API. A shop is fundamentally isolated from other shops; shop A should not have to care about shop B. So we did per-shop sharding, where one shop's data all lives on one shard, another shop might be on another shard, and a third shop might be together with the first one. This is basically all of the sharding API that is exposed internally: within that block, it selects the correct database where the products for that shop live, and within that block you can't reach the other shards; that's illegal. In a controller this might look something like this. At this point most developers don't have to care about it: it's all done by an around filter that finds the shop on another database, wraps the entire request in the connection that the shop is on, and any product query then goes to the correct shard. This is really simple, and it means that the majority of the time developers don't have to care about sharding; they don't even have to know it exists. Jobs work the same way.

But it has drawbacks; there are tons of things you now can't do. I talked about how you might lose flexibility with optimization; with architecture you lose flexibility at a much grander scale. Fundamentally shops should be isolated from each other, but in the few cases where you want them not to be, there's nothing you can do. That's the drawback of changing the architecture. For example, you might want to do a join across shops; you might want to gather some data or run an ad hoc query about app installations across shops. That might not seem like something you would need to do, but the partners interface, for all of our partners who build applications, actually needs it: it needs to get all the shops and the installations for them. It was just written as a join across all the shops, and that had to be changed. The same went for our internal dashboard that would do things across shops, like finding all the shops with a certain app installed; you just couldn't do that anymore, so we had to find alternatives. If you can get around it, don't shard. Fundamentally Shopify is an application that takes a lot of writes, but that might not be your application. It's really hard, and it took us a year to do and figure out.
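To give a flavor of the shape of the sharding API described above, here is a hypothetical sketch. The `Shard.with` name and the lookup details are invented for illustration; the transcript only describes a block that selects the shop's database and an around filter that wraps the request.

```ruby
# Hypothetical sketch of a per-shop sharding API.
# Inside the block, queries go to the shard that holds this shop's data;
# reaching across to another shard from inside the block is not allowed.
Shard.with(shop.shard_id) do
  Product.where(shop_id: shop.id).first
end

# In a controller, an around filter hides this from most developers:
class StorefrontController < ApplicationController
  around_action :select_shard

  private

  def select_shard(&block)
    shop = Shop.find_by!(domain: request.host)  # the shop lookup lives outside the shards
    Shard.with(shop.shard_id, &block)           # the whole request runs against the shop's shard
  end
end
```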
We ended up sharding at the application level, but there are many different levels at which you can shard. If your database is magical, you don't have to do any of this; some databases are really good at handling this, and you can make trade-offs at the database level so you don't have to do it in the application. But there are really nice things about being on a relational database: transactions, schemas, and the fact that most developers are simply familiar with them are massive benefits, and they're reliable; they've been around for 30 years, so they're probably going to be around for another 30 years at least. We decided to do it at the application level because we didn't have the experience to write a proxy, and the databases we looked at at the time were just not mature enough. I recently looked at some of the databases we were considering back then, and most of them have gone out of business, so we were lucky that we didn't buy into that proprietary technology and instead solved it at the level we felt most comfortable with at the time. Today we have a different team, and we might have solved this at the proxy level or somewhere else, but it was the right decision at the time.

In 2014 we started investing in resiliency, and you might ask: what is resiliency doing in a talk about performance and scaling? Well, as a function of scale you're going to have more failures, and this brought us to a threshold in 2014 where we had enough components that failures were happening quite rapidly, and they had a disproportionate impact on the platform. When one of our shards was experiencing a problem, requests to other shards, and shops that were on other shards, were either much slower or failing altogether. It didn't make sense that when a single Redis server blew up, all of Shopify was down. This reminds me of a concept from chemistry, where the rate of a reaction is proportional to the amount of surface area you expose. If you have two glasses of water and you put a teaspoon of loose sugar in one and a sugar cube in the other, the glass with the loose sugar dissolves it into the water quicker because the surface area is larger. The same goes for technology: when you have more servers and more components, there are more things that can react, potentially fail, and make it all fall apart. This means that if you have a ton of components, all tightly knitted together in a web where one failing component drags a bunch of others down with it, and you have never thought about this, adding a component will probably decrease availability, and this compounds as you add more components: your overall availability goes down. If you have ten components with four nines each, you have a lot more downtime than that if they're tightly webbed together in a way where each of them is a single point of failure. And at this point we hadn't had the luxury of finding out what our single points of failure even were. We thought it was going to be okay, but I bet you that if you haven't actually verified this, you will have single points of failure all over your application, where one failure takes everything down with it. Do you know what happens if your memcached cluster goes down? We didn't, and we were quite surprised to find out. This means that you're really only as good as your weakest single point of failure, and if you have multiple single points of failure, you multiply the availabilities of all of them together and you have the final probability of your app being available.
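To make that arithmetic concrete, here is a small back-of-the-envelope sketch (not Shopify code) of how availability multiplies across single points of failure, using the four-nines example from above:

```ruby
# Overall availability of a system whose components are all single points of
# failure is the product of the individual availabilities.
component_availability = 0.9999   # "four nines" per component
components = 10

overall = component_availability**components
puts overall                      # => ~0.9990, i.e. roughly three nines overall

minutes_per_year = 365 * 24 * 60
puts (1 - component_availability) * minutes_per_year  # ~52.6 minutes/year for one component
puts (1 - overall) * minutes_per_year                 # ~525 minutes/year (~8.8 hours) overall
```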
Very quickly, what looks like hours of downtime per component turns into days or even weeks of downtime globally, amortized over an entire year, if you're not paying attention to this. It means that adding a component will probably decrease your overall availability. The outages look something like this: your response time increases. This is a real graph of an incident at the time in 2014 where something became slow, and as you can see the timeout was probably exactly 20 seconds, so something was being really slow and hitting a timeout of 20 seconds. If all of the workers in your application are spending 20 seconds waiting for something that is never going to return, because it's going to time out, then there's no time to serve any requests that might actually work. So shard 1 is slow, and requests for shard 0 lag behind in the queue, because the requests to shard 1 will never complete. The mantra you have to adopt when this starts becoming a problem is that a single component failure cannot compromise the availability or performance of your entire system; your job is to build a reliable system from unreliable components.

A really useful mental model for thinking about this is the resiliency matrix. On the left-hand side we have all the components in our infrastructure; at the top we have the sections of the infrastructure, such as admin, checkout, and storefront, the ones I showed before. Every cell tells you what happens to that section if the component on the left is unavailable or slow. So if Redis goes down: is storefront up, is checkout up, is admin up? This is not what it actually looked like in reality; when we drew it out it was probably a lot worse, and we were shocked to find out how red and blue, how down and degraded, Shopify looked when what we thought were tangential data stores like memcached and Redis took everything down with them. The other thing we were shocked about when we wrote this down was how hard it is to figure out. How do you determine the value of each of these cells? Do you go into production and just start taking things down? What would you do in development?

So we wrote a tool that helps with this. The tool is called Toxiproxy, and what it does is that, for the duration of a block, it emulates network failures at the network level by sitting between you and the component on the left. This means you can write a test for every single cell in that grid, so that when you flip a cell from red to green, from bad to good, you know that no one will ever reintroduce that failure. These tests might look something like this: when some component, say Redis, is down, I hit this section and assert that the response is a success. At this point at Shopify we have very good coverage of our resiliency matrix by tests that are all backed by Toxiproxy, and this is really simple to do.

Another tool we wrote is called Semian. Exactly how all of Semian's components work, and how they work together, is fairly complicated, so I'm not going to go into it, but there's a README that goes into vivid detail about how Semian works. Semian is a library that helps your application become more resilient, and I encourage you to check out the README to find out how it does that. This tool was also invaluable for us in building a more resilient application.
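As a rough illustration of what one of those resiliency-matrix tests can look like, here is a sketch using the open-source toxiproxy-ruby gem. The proxy names, the route, and the test setup are assumptions for illustration, not Shopify's actual suite.

```ruby
require "test_helper"
require "toxiproxy"

class StorefrontResiliencyTest < ActionDispatch::IntegrationTest
  test "storefront stays up when Redis is down" do
    # Toxiproxy sits between the app and Redis; inside this block,
    # all connections through the :redis proxy are cut off.
    Toxiproxy[:redis].down do
      get "/products/some-product"
      assert_response :success
    end
  end

  test "storefront tolerates a slow MySQL shard" do
    # Add 1000ms of downstream latency on the :mysql proxy for the block.
    Toxiproxy[:mysql].downstream(:latency, latency: 1000).apply do
      get "/products/some-product"
      assert_response :success
    end
  end
end
```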
The mental model we mapped out for how to work with resiliency was that of a pyramid. We had a lot of resiliency debt, because for 10 years we hadn't paid any attention to this. The web I talked about before, of certain components dragging everything down with them, was imminent; it was happening everywhere, and the resiliency matrix was completely red when we started. Nowadays it's in pretty good shape. So we started climbing the pyramid, writing and incorporating all these tools, and when we got to the very top, someone asked the question: what happens if you flood the data center? That's when we started working on multi-DC.

In 2015 we needed a way such that if the data center caught fire, we could fail over to the other data center; resiliency, sharding, and optimization had simply been more important for us than going multi-DC before then. Multi-DC was largely an infrastructure effort of going from 1 to n. It required a massive amount of changes in our cookbooks, but finally we had procured all the inventory, all the servers and so on, to spin up a second data center. At this point, if you want to fail Shopify over to another data center, you just run this script and it's done: all of Shopify has moved to a different data center. The strategy it uses is actually quite simple, and one that most Rails apps could use pretty much as is, if the traffic and things like that are set up correctly. Shopify is running in a data center in Ashburn, Virginia, and one in Chicago right now. If you go to a Shopify-owned IP, you will go to the data center that is closest to you: if you are in Toronto, you're going to go to the data center in Chicago; if you are in New Orleans, you might go to the data center in Virginia. When you hit that data center, the load balancers in that data center, inside our network, know which of the two data centers is active, whether that's Chicago or Ashburn, and they route all the traffic there. So when we do a failover, we tell the load balancers in all the data centers what the primary data center is: if the primary data center was Chicago and we're moving to Ashburn, we tell the load balancers in both data centers to route all traffic to Ashburn, in Virginia.

When the traffic gets there, right after we've moved over, any write will fail: the databases at that point are read-only, because they cannot be writable in both locations at once; the risk of data corruption is too high. But that means most things actually work. If you're browsing around a Shopify storefront looking at products, which is the majority of traffic, you won't notice anything; even if you are in the admin, you might just be looking at your products and not notice it at all. While that's happening, we're failing over all of the databases, which means checking that they're caught up in the new data center and then making them writable. So the shards recover very quickly, over a couple of minutes; it can be anywhere from 10 to 60 seconds per database, and then Shopify works again. We then move the jobs, because when we moved all the traffic we stopped the jobs in the old data center; so we move all the jobs over to the new data center, and everything just ticks along. But then, how do we use both of these data centers? We have one data center that is essentially doing nothing: very expensive hardware sitting there doing absolutely nothing. How can we get to a state where we're running traffic out of multiple data centers at the same time, utilizing both of them?
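Before getting to that, the failover procedure described above could be sketched roughly like this. This is a simplified, hypothetical script, not Shopify's actual tooling; every class and method name here is made up for illustration.

```ruby
# Hypothetical sketch of the data center failover sequence described above.
class DatacenterFailover
  def initialize(from:, to:)
    @from, @to = from, to
  end

  def run!
    # 1. Tell the load balancers in every data center to route traffic to the
    #    new primary. Writes now fail; cached and read-only paths keep working.
    LoadBalancer.all.each { |lb| lb.set_primary_datacenter(@to) }

    # 2. Stop job workers in the old data center so no writes race the failover.
    JobWorkers.in(@from).stop!

    # 3. For each shard: wait for replication to catch up, then make it writable.
    Shard.all.each do |shard|
      shard.wait_until_caught_up(in_dc: @to)   # typically 10-60 seconds
      shard.promote_writable!(in_dc: @to)
    end

    # 4. Start the job workers in the new data center and resume queued jobs.
    JobWorkers.in(@to).start!
  end
end

DatacenterFailover.new(from: "chicago", to: "ashburn").run!
```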
The architecture at first looked something like this: it was shared. We had shared Redis instances and shared memcached between all of the shops. When we say a shard, we're referring to a MySQL shard; we hadn't sharded Redis, memcached, and the other things, so all of that was shared. What if, instead of running one big Shopify like this that we move around, we ran many small Shopifys that are independent from each other and have everything they need to run? We call this a pod. A pod has everything that a Shopify needs to run: the workers, the Redis, the memcached, the MySQL, whatever else there needs to be for a little Shopify to run. If you have these mini Shopifys and they're completely independent, they can be in multiple data centers at the same time: you can have some of them active in data center 1 and some active in data center 2. Pod 1 might be active in data center 2, and pod 2 might be active in data center 1.

So that's good, but how do you get traffic there? For Shopify, every single shop has a domain; it might be a free domain that we provide, or their own domain. When a request hits one of the data centers, the one you're closest to, Chicago or Virginia depending on where in the world you are, it goes to a little script that's very aptly named Sorting Hat. What Sorting Hat does is look at the request and work out which shop, which pod, which mini Shopify the request belongs to. If the request is for a shop that belongs to pod 2, it routes it to data center 1 on the left; if it's for another one, it goes to the right. So Sorting Hat just sits there sorting requests and sending them to the right data center; it doesn't care which data center you landed in, and it will route to the other data center if it needs to.

OK, so we have an idea of what this multi-DC strategy can look like, but how do we know if it's safe? It turns out there are just two rules that need to be honored. Rule number one is that any request must be annotated with the shop or the pod it's going to. All of the storefront requests are on the shop's domain, so they're indirectly annotated with the shop they're going to; through the domain we know which pod, which mini Shopify, the request belongs to. The second rule is that any request can only touch one pod; otherwise it would have to go across data centers, which potentially means that one request might have to reach Asia, Europe, and maybe also North America, all in the same request, and that's just not reliable. Again, fundamentally, shops and requests to shops should be independent, so we should be able to honor these two rules. You might think: well, that sounds reasonable; Shopify should just be an application with a bunch of controller actions that each go to a single shop. But there were hundreds, if not a thousand, endpoints that violated this. They might iterate over every shard and count something, or uninstall a PayPal account and check whether any other stores use it, things that touch multiple stores. When you have hundreds of endpoints violating something you're trying to do, and you have a hundred developers doing all kinds of other things and introducing new endpoints every single day, that's going to be a losing battle if you just send an email, because tomorrow someone joins who has never read that email and is going to violate it.
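Stepping back for a moment, here is a very rough sketch of the kind of routing decision Sorting Hat makes. This is hypothetical code: the real implementation lives in the load balancing layer, and the lookups, names, and data shapes here are invented for illustration.

```ruby
# Hypothetical sketch of Sorting Hat's routing decision.
class SortingHat
  # Routing table: which pod each shop domain belongs to, and which
  # data center each pod is currently active in.
  SHOP_TO_POD      = { "snowdevil.myshopify.com" => 2, "kanye.example.com" => 1 }
  POD_TO_ACTIVE_DC = { 1 => "dc-chicago", 2 => "dc-ashburn" }

  def self.route(request_host)
    pod = SHOP_TO_POD.fetch(request_host)   # rule 1: every request maps to exactly one pod
    POD_TO_ACTIVE_DC.fetch(pod)             # forward to the DC where that pod is active
  end
end

SortingHat.route("snowdevil.myshopify.com")  # => "dc-ashburn"
```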
Raphael talked a little bit about this problem yesterday; he called it whitelisting. We called it shitlist-driven development. The idea is that if you want to honor rules one and two, your job is to build something that gives you a list of all the things that violate the rule, and if something is not on that list, you raise an error telling people what to do instead. This needs to be actionable: you can't just tell people not to do something unless you provide an alternative, even if the alternative is that they come to you and you help them solve the problem. But this means you stop the bleeding, and going forward you can rely on rules one and two being honored. When we had this for Shopify, and rules one and two were honored, our multi-DC strategy worked, and today, with all of this built on top of five years of work, we're running 80,000 requests per second out of multiple data centers. And that's how we got here. Thank you.

Q: Do you have any global data that doesn't fit into a shard?

Yes, we have a dreaded master database, and it holds data that doesn't belong to a single shop. In there is, for example, the shop model: we need something that stores the shop globally, because otherwise the load balancers can't know globally where the shop is. Other examples are apps, which are inherently global and installed by many shops; billing data, because it might span multiple shops; and partner data. There's actually a lot of this data. I didn't go into this at all, but I spent six months of my life solving this problem. So we have a master database, and it spans multiple data centers. The way we solve this is essentially that we have read slaves in every data center that feed off of the master database, which lives in one of the data centers. If you do a write, it's a cross-DC write. This sounds super scary, but we've eliminated pretty much every high-SLO path from writing to it. So billing has a lower SLO than the rest of Shopify, because its writes have to cross data centers. But the thing is that billing, partners, and the other sections of this master database are in different sections; they're fundamentally different applications, and as we speak they're actually being extracted out of Shopify, because Shopify should be a completely sharded application. And if they're extracted out of Shopify, you're doing a cross-DC write anyway, because you don't know where that thing is, so it's not really making the SLOs worse. It's okay that some of these things have lower SLOs than the checkout, storefront, and admin, which have the highest SLOs. So that's how we deal with that: we don't really deal with it.

Q: How do you deal with a disproportionate amount of traffic to a single pod or a single shop?

I showed a diagram earlier that suggests the workers are isolated per pod. That is actually a lie: the workers are shared, which means that a single pod can grab up to 60 to 70 percent of all of Shopify's capacity. What is actually isolated in the pod is the data stores; the workers can move between pods fluidly, they're fungible, they move between pods on the fly. The load balancer just sends a request to a worker, and the worker connects to the correct pod. So the maximum capacity of a single store is somewhere between 60 and 70 percent of an entire data center, and it's not a hundred percent because that would cause an outage due to a single store, which we are not interested in. But that's how we move this around; does that answer the question?
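Going back to the shitlist-driven development idea from a moment ago, a minimal sketch of the pattern might look like this. The names and the enforcement hook are hypothetical, not Shopify's actual implementation.

```ruby
# Shitlist-driven development: enumerate the known violators of a rule,
# let them keep working for now, and fail loudly for any new violation.
module PodAnnotationEnforcer
  # Endpoints known to touch more than one pod. Shrink this list over time;
  # never add to it without a very good reason.
  SHITLIST = %w[
    PartnersController#app_installations
    InternalDashboardController#shops_with_app
  ].freeze

  def self.check!(controller_action, pods_touched)
    return if pods_touched.size <= 1
    return if SHITLIST.include?(controller_action)

    raise "#{controller_action} touched #{pods_touched.size} pods. " \
          "Requests must be scoped to a single pod; see the sharding guide " \
          "or ask the infrastructure team for help."
  end
end
```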
Q: How do you deal with large amounts of data, like someone importing a hundred thousand customers or a hundred thousand orders?

Well, this is where the multi-tenancy architecture sort of shines. These databases are massive: half a terabyte of memory, many tens of cores. So if one customer has tons of orders, that just fits, and if a customer is so large that they need to be moved, that's what the defragmentation work is about: moving these stores to somewhere with more space for them. So basically we deal with it by having massive data stores that can handle this without a problem. The import itself is just done in a job. Some of these jobs are quite slow for the big customers, and there we need to do some more parallelization work, but most of the time it's not a big deal: if you have millions of orders and it takes a week to import them, you have plenty of other work to do during that time, so this has not been high on the list.

How much time do I have? OK. Thank you. [Applause]