Distributed systems from scratch: lessons learned the hard way!


We have Becky Lewis from EnergyDeck, who's going to be talking to us about building distributed systems. Over to you, Becky.

Hi. Well, hello PyCon UK. My name is Becky Lewis. I've been coming to PyCon for several years now, so I thought I'd actually get up this year and try to give a talk. This is my first conference talk, so please bear with me if I stumble around a little bit. My talk is "Distributed systems from scratch: lessons learned the hard way". For the past couple of years I've been getting more into creating distributed systems, moving away from monoliths, and of course when you start to do that without a lot of experience you make an awful lot of mistakes, so hopefully this talk will be useful to some people here who may be starting off writing new systems. Over the past couple of years there have been a couple of talks more focused on how to break up a monolith, so I thought I would avoid that, because I think there are an awful lot of lessons to be learned when you actually try to write something from scratch. It might sound easier, but there are certainly different challenges. In fact there is, I believe, a talk on breaking up the monolith tomorrow at five o'clock, although I'm not sure which track that's on.

Okay, so this slide says infrastructure, provisioning and deployment. When you first start a project, probably like me, the first thing you actually want to do is set up your local environment, get things running and start writing some code. One of the first things I learned was not to just jump into the code. The code is really important, naturally, when you're doing software development, but as you do distributed systems you're going to find out more and more that system operations is really, really important. If you don't get your sysops right, you're going to be scrambling at the last minute, all of your things aren't going to work together, and you're really going to have a bit of a mess on your hands.

So the first thing I did on my latest project was infrastructure as code. The aim is to be completely hands off the servers: you don't want to be touching servers if you're doing distributed systems. That's probably not what everybody is used to; I certainly wasn't. There are lots of tools to do this, and if you're on cloud systems it's much, much easier — there are tools such as Terraform and CloudFormation, and you really, really need to make use of these. What you need to do is start this from scratch. Anybody who's used Terraform will already know that if you don't do it from the beginning, it's really, really hard to backtrack later, and you don't want to be backtracking when you're doing your system operations. It's almost impossible, if you've already got a production system running, to then start introducing something like Terraform into the mix, because you're probably going to do what I did and take down your production server by accident.

Something else I noticed about setting up all the system operations is that the more you do it — the more systems you have running, the more individual pieces you have — the more centralized your system operations is going to have to become. So it's not just about infrastructure as code; it's about making sure you have the entire system automated. You also need to have separate steps for everything: you need your infrastructure separated from your provisioning, and your provisioning separated from your deployment.
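A rough sketch (not from the talk) of keeping those stages as separately runnable steps in Python; the terraform/salt/deploy commands are placeholders for whatever tooling is actually in use:

```python
# Sketch: one entry point per pipeline stage, each runnable on its own.
import argparse
import subprocess

def infrastructure():
    # Bring up servers, networking, security groups, etc.
    subprocess.run(["terraform", "apply", "-auto-approve"], check=True)

def provision():
    # Install system-level dependencies: Docker, RabbitMQ, and so on.
    subprocess.run(["salt", "*", "state.apply"], check=True)

def deploy():
    # Push the application code only; no infrastructure or provisioning here.
    subprocess.run(["./deploy.sh"], check=True)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run one pipeline stage at a time")
    parser.add_argument("stage", choices=["infrastructure", "provision", "deploy"])
    args = parser.parse_args()
    {"infrastructure": infrastructure, "provision": provision, "deploy": deploy}[args.stage]()
```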
A lot of people I see writing their automation scripts write really long pipelines that aren't split up properly. You don't want to do that. What you need to do is write it in little pieces, make it modular, write it like your software: make sure that every single part of it can be run individually. Otherwise you're going to end up in a position where you actually just want to push the code, but you've tied it into your infrastructure build, which means it's going to take longer — and it also means that if you want to change something about your infrastructure, you have to run your entire deployment. That's not a good position to be in. Your deployment needs to be able to run by itself, without all the rest of the guff around it, and your servers need to be provisioned already. It doesn't really matter whether you're using containers or going onto bare metal: make sure that if you're pushing containers you're not doing any provisioning at the same time, and if you're running on bare metal that you can deploy without needing to install any software.

So once you've written your infrastructure as code, once you've got your provisioners running — whether you're using Salt or Puppet or whatever — and you've got your code going up onto the server, what's left? What else do you need to do? Well, you've got to make sure you've got everything centralized. I'm sure people here are using CI — continuous integration — but you also want to be doing proper continuous delivery. A lot of the CI tools will do continuous delivery somewhat, but they don't tend to do it in parallel. I would really, really recommend getting a proper continuous delivery service. I prefer something like GoCD; I believe Jenkins has some plugins to do CD, but you really want something that can run as much in parallel as possible. If you have all of these pieces in place you get a really nice bonus, which is that you have complete disaster recovery set up already: if everything is automated, you should be back up and running very, very quickly, depending on the scale of your system of course.

Now, this slide says logging and monitoring. I don't know about you, but quite often I leave my logging until the last minute, which is really bad, and monitoring is traditionally done by your system operations team, if you have one at all. But both of these are really important if you're doing distributed systems. You cannot do anything distributed if you can't tell what's going on. If you're jumping onto a server every time you want to look at your logs, you're going to have a nasty surprise when you have five servers running with sixteen different services on them: trying to find a problem in that haystack is impossible. The best way to solve this is to get some centralized log aggregation. This can be hosted by yourself; I actually quite like using the ELK stack — for those who don't know, that's Elasticsearch, Logstash and Kibana. Those three things together give you a really nice searchable dashboard, so you can put all of your logs in one place and search them.

That also brings up a point: how do you follow logs through? How do you do your searching on them if all you're doing is firing logs at a server? The answer is to make sure that you put a unique ID on at any entry point for your logging. When you start a process off that's going to pass down the whole pipeline, put in an ID at the start, and then the thing that's doing the logging can check: is there an ID? If yes, use that ID with your logs. If you do that, it's very, very easy to trace something through your entire system.
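A minimal sketch (not from the talk) of that idea in Python, using the standard logging module and a context variable; the names request_id_var, RequestIDFilter and handle_incoming are illustrative only:

```python
# Sketch: stamp every log record with the request ID set at the entry point.
import contextvars
import logging
import uuid

request_id_var = contextvars.ContextVar("request_id", default=None)

class RequestIDFilter(logging.Filter):
    """Attach the current request ID to every log record, or '-' if unset."""
    def filter(self, record):
        record.request_id = request_id_var.get() or "-"
        return True

handler = logging.StreamHandler()
handler.addFilter(RequestIDFilter())
handler.setFormatter(logging.Formatter("%(asctime)s %(request_id)s %(name)s %(levelname)s %(message)s"))
logging.basicConfig(level=logging.INFO, handlers=[handler])

log = logging.getLogger("checkout")

def handle_incoming(message):
    # Reuse the ID stamped at the entry point if the message carries one,
    # otherwise mint a new one so downstream services can still correlate.
    request_id_var.set(message.get("request_id") or uuid.uuid4().hex)
    log.info("processing order %s", message.get("order_id"))

if __name__ == "__main__":
    handle_incoming({"request_id": None, "order_id": 42})
```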
Now, I know a lot of people don't like to do sysops, because a lot of people here will just be developers who want to get on with the actual work of developing systems. There are an awful lot of really good hosted solutions for this, so you don't have to set up your own stack. Of course they will charge you for the honor, but it's well worth the investment, because your time is more important than anything else, really, and you want to be developing with your time — you don't want to be tracing things through a huge logging stack.

Likewise, as well as being able to follow your logs, you want to know when things are going to go wrong — ideally before they go wrong, so you can fix them at that point. This is where monitoring services come in, and they really come into their own. For example, if you suddenly start getting a whole bunch of 404s on your checkout, you kind of want to know that before somebody rings up your call center and complains at you. If you start looking for patterns of things that aren't quite right, not quite normal, you'll get alerted to problems before they turn into bigger problems, and quite often you can find out about issues before your customers do. So get the monitoring in place. There are many tools for this. If you want to host it yourself you can use Nagios, among others — I'm sure there are more around; those are the ones I'm most familiar with. If you don't mind the cost, there's New Relic, which is a wonderful tool but super expensive, and there's Datadog, which is actually a lot like New Relic these days; we've found it quite useful in the past. All of these tools can help you to locate patterns in what's happening, they can help you to find inconsistencies in what's going on, and they can also help you backtrack. Perhaps every two weeks you get a huge spike in your CPU usage at two o'clock in the morning; it's really useful to know that that's happening. It might not even be breaking anything yet, but it's something that you can go and actually respond to.
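A toy sketch (not from the talk) of that kind of pattern check — counting checkout 404s over a sliding window and alerting past a threshold; the window, threshold and send_alert function are invented placeholders rather than any particular monitoring tool:

```python
# Sketch: alert when the 404 rate on checkout looks abnormal.
import time
from collections import deque

WINDOW_SECONDS = 300   # look at the last five minutes
THRESHOLD = 50         # more 404s than this triggers an alert

recent_404s = deque()

def send_alert(message):
    # In practice this would go to Nagios/New Relic/Datadog, email, or SMS.
    print("ALERT:", message)

def record_404():
    now = time.time()
    recent_404s.append(now)
    # Drop events that have fallen out of the window.
    while recent_404s and recent_404s[0] < now - WINDOW_SECONDS:
        recent_404s.popleft()
    if len(recent_404s) > THRESHOLD:
        send_alert(f"{len(recent_404s)} checkout 404s in the last 5 minutes")

if __name__ == "__main__":
    for _ in range(60):
        record_404()
```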
Now, this slide says "you never used to take this long". Probably the biggest lesson that I have taken away from building distributed systems from first principles isn't a technical one: it's managing your managers. There is so much more overhead when you are dealing with distributed things — not because they're distributed, but because you have all of this other stuff to do. You have to have your monitoring, you have to have your logging, you have to completely automate your entire infrastructure, your deployment, everything. This was actually quite a recent thing for me, as I'm sure my CEO will tell you: how do you explain to somebody why you're taking so long? It's really important to get buy-in from everybody. They all have to know that you have all this extra overhead; they all have to know that even a text change might take longer, or that if we want to build another piece of this it's not going to take the two hours the coding alone would usually take — it's going to take a day. So this is the biggest takeaway that I've had: make sure that everybody involved knows that everything is going to take longer because there is more to do, but at the same time you also have to be able to explain to them why that's a good thing. And that's really hard to do, because all they see you doing is playing around, having some fun, building something that they think is a little bit monstrous. It's really difficult, but make sure that you have buy-in at every level, from your technical project manager — and technical project managers, surprisingly, quite often don't understand that it's going to take longer — up to the CEO. Anybody who's going to come and give you hassle needs to know why it's taking longer, and that it's a good thing because it's going to decrease costs over time.

The distributed monolith: that's one of those sneaky beasts. Luckily I didn't manage to go full-on and create a complete distributed monolith — we got really close, though. Don't do it; be so careful. If you are starting to have to deploy two things at once because they rely on each other, stop there, backtrack, and separate them, please. We did find ourselves in quite a terrible situation where we had three things that all needed to be up at once, we couldn't do it, and we ended up with errors every time we were deploying anything. It just snuck up on us. We all know very well how to build things in separate pieces, but you cut a little corner here or there, or you go "well, it can't really hurt this time", and you will end up with a horrible ball of mud — except it's going to be distributed and really hard to deploy. So don't do it; just keep an eye on everything, and if you do start realizing that you're deploying more than one thing at a time, stop, review it, and get rid of that problem. Thank you very much.

Thank you very much, Becky. Any questions, anyone? We'll start at the front here and then one at the back.

Question: You mentioned at the beginning the importance of separating your infrastructure from your provisioning and your deployment — how do you actually make that distinction?

Okay, I put things in three different categories: you have your infrastructure, you have your provisioning, and then you have your deployment. To me, bringing your infrastructure up is bringing up your servers, getting everything together, making sure that all of your security groups — if you're using AWS, things like that — are correct and everything can talk to each other. Provisioning is about, if you're using Docker containers, making sure Docker is on everything that you need it to be on; it's about installing RabbitMQ, all that sort of thing. I've not really run into many situations when I'm deploying where I actually have to install anything, except obviously the software I'm deploying. So the distinction is a little bit subtle and can be kind of hard to spot, but generally speaking, if you are installing things at deploy time, you're probably making a bit of a rod for your own back in the long run. Of course you can't get away from installing requirements and things like that, but I count that kind of thing as a dependency of the project rather than a dependency of the system as a whole. So the only thing to install at deployment is really the requirements, I think, and if you can get around even that, that's even better.

Question: Do you use any kind of generalized registry for your services?

Not as such, but then I tend to do things in a bit of a simplistic way most of the time — my team's quite small. We're working on improving that kind of thing so that we have better views, better ways of looking at our stuff, so that might even be something that I'm going to run into.

Question: What are you using in your architecture?

Oh right, a few things — let me think about what we're using. We're using Django for the web stuff. We're using RabbitMQ to push messages around our system; we decided to stay away from using APIs internally — RESTful APIs, that is — just because it's a bit more difficult: if the message doesn't get through, you've got to write some code to actually handle that. We've got Postgres running for our services. We use AWS, so that's quite helpful. We also stray away from Python a little bit — there's some Java here and there for things where we need concurrency and very, very fast code — and we've also got a Flask service in there.
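A minimal sketch (not from the talk) of pushing a message through RabbitMQ with pika, the sort of thing that stands in for an internal REST call; the queue name and payload are invented for illustration:

```python
# Sketch: publish an order event onto a durable RabbitMQ queue via pika.
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="orders", durable=True)

channel.basic_publish(
    exchange="",
    routing_key="orders",
    body=json.dumps({"order_id": 123, "request_id": "abc123"}),
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message
)
connection.close()
```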
I was a little bit fortunate, because I went into a company that had a legacy product where it's very hard to change anything, so everything takes quite a long time anyway. It's also completely unscalable and, frankly, a little bit unmaintainable. So I was quite lucky in that I could just point at it: the existing stuff needs to be scalable, we need to go international, and we couldn't even run it on more than one server — when clients were asking us "can we have a local version of this?", we've had to say no. So I managed to justify it by basically saying: all of these business problems go away if we build it properly, like this.

Hello — can I just say, good talk, especially if it was your first talk in front of an audience like this. Really good; I've been there and it can be very unnerving, so good job. Can I just ask if you have an on-call rotation as developers, and if so whether you cherry-pick different types of alerts to wake you up in the middle of the night versus what you care about the next day? Or, if you don't have an on-call rotation of developers, do you have someone else doing it — sort of a DevOps versus ops question. Thank you.

We're a very small company, so I am the on-call, which is kind of good and bad at the same time, because it means I have to make sure stuff works so that I don't get a phone call at two o'clock in the morning. So we don't have any on-call rotation at the moment — I should say "at the moment", really, because we are pushing out to the other side of the world, which means we can't do 24/7 on-call support ourselves. We're going to get some help in for that, so I'm going to see how that goes, but the rotation will probably still end up with me, unfortunately. Does that answer your question?

I think the question was also about types of alert. Yes — we try to make sure that we're getting soft alerts by email, so that when I come in in the morning I can see if there are any growing issues. If something has gone completely down, then it's just critical: that sends text messages and things like that, and not just to me but to other people in the company, just in case I don't hear it or don't receive it. So we stick with soft alerts as well as critical issues, as well as uncaught errors — obviously we're using Sentry to report to us what's been going on that we may not see because the error is just getting swallowed. Yes, I really like Sentry; it's certainly saved us a few times.

Okay, that concludes our session. We've got a half-hour break coming up; your next set of talks will be starting at quarter past eleven. Can I just say a great big thank you to Becky Lewis and her very