Graphs: The Fabric of DevOps


Hi everyone, my name is Ashley. I'm a DevOps engineer at Lending Club, and I'm here to talk to you today about how we use a graph model to manage our infrastructure. Lending Club is America's largest online credit marketplace: we offer personal loans, small business loans, patient financing, and, as of earlier this year, auto refinancing. I'm sure you're all more interested in DevOps at Lending Club, though. I'm on the infrastructure and tools team, formerly known as (and oftentimes still referred to as) the DevOps team. We build software and infrastructure automation to enable Lending Club to efficiently and seamlessly deliver our apps into production while ensuring stability, reliability, scalability, all the -ilities. More specifically, we write software that handles our infrastructure monitoring, alerting, deployment automation, and cloud orchestration, all the way up to the common app frameworks that all of our platform apps use. A little bit about our architecture: until recently, Lending Club was data center only. We have a primary and a secondary data center, and a couple of years back we started migrating our services into the cloud, into AWS.

Before I start talking about our graph model, I'd like to frame this talk with a saying that I think embodies everything my team does at Lending Club: be pragmatic, not dogmatic. Over the years we've worked toward a consistent, standard, unified build, packaging, and deployment model. Whether an app is written in Java or Node or Go, and whether we're deploying it into our production AWS environment or our non-prod data center environment, we want everything to look, feel, deploy, and run exactly the same. We've also tried to avoid what we call tool trends: today the big thing is Docker and microservices, tomorrow it could be something else, and we want our infrastructure and tooling to be able to handle that. So oftentimes we'll wrap third-party tools within our own internal interfaces and automation tools.

Over the years, as we've grown from five microservices to over 400 and moved from the data center to the cloud, one of our goals has been to automate all the things. Well, we soon learned that we have a lot of things, and it's really hard to automate them all when you don't know what those things are. Looking at this slide, you probably recognize most if not all of these technologies; you probably use a lot of them at your companies too. And so we all face this problem of figuring out the relationships and integrations between and among these tools. Oftentimes when we're onboarding new tools, or considering multiple options for a solution, we look at the built-in third-party integrations. What if New Relic doesn't play nice with GitHub? Does that mean we can't use one of those tools? How do we manage those integrations? Are we making dozens or hundreds of REST calls to dozens of different endpoints?

Enter Mercator. At Lending Club we've written an internal Java application called Mercator, and its job is to communicate with all of our infrastructure and build a graph model of it. I also want to mention a talk yesterday on analyzing system failures: the speaker mentioned that we need models to help us visualize our systems and, by extension, to help us diagnose and track down problems when they arise within those systems. That is what Mercator does for us. Mercator periodically scans all of our infrastructure components and third-party tools, makes sense of the responses, and builds a graph map of all of those interconnected infrastructure components.
This then provides us with metadata around which we build monitoring, alerting, and automation. As I mentioned, three years ago we were struggling with manual deployments; we were keeping track of our services via an Excel spreadsheet. We needed greater visibility into our infrastructure, and we really wanted a way to get the real-time, or near real-time, state of it. So we created Mercator.

This right here is a visualization of our graph database. As I mentioned, Mercator is a Java application, and it stores information into a graph database; we use Neo4j. Just to go over what you're looking at: each of these circles is what we call a node, and they're different colors to represent different node types. Every node has a label; you can think of each individual circle as an instance of a certain object type, or label. If you look at the yellow circle called LC UI, you'll see up at the top left that the label of that node is VirtualService. This is our concept of an app. The lines between the nodes are relationships, and both nodes and relationships can store properties. For example, our LC UI virtual service node contains two pools, pool A and pool B, and each pool contains a number of virtual servers. What we're actually looking at is our blue-green model, which we use in our data center; many of you probably also use blue-green.

The question now is: how did we get to this visualization? This is Mercator feeding into Neo4j, our graph database. What we did in this data center example is have all of our app instances phone home to Mercator every minute with information on the apps running on them: things like the app ID (in this example, LC UI), what environment it's running in, the revision and version it's running, the IP, and the hostname. Just by adding this, we gained service discovery, which we didn't have before. Similarly, Mercator started to scan our load balancer, and we got back information on our load balancer servers: not so much app info, but the state of each server, whether it's active or inactive, how much traffic it's taking. So now we have these two objects, the app info and the load balancer server info, and if you map those two things together by their hostname, you can combine them. This is what gives us the virtual server you saw a few slides back. The virtual server has all these properties; in this example it's LC UI, and you'll see that it's in production, the version and revision it's running, and that it's active, which means it's in the live pool.

If we go back to this slide, hopefully it makes a little more sense now. The virtual server we just saw is one of the pink nodes, and they all have these properties: IP, hostname, live or dark. You can start to see that if we group these servers together by app ID, state, and some hostname naming conventions that we have, we can split them into these two pools. And if we group the pools by app ID and environment, we get this concept of a virtual service. This then allowed us to automate our deployments, and we were able to set up a lot of monitoring and alerting just off this model, roughly sketched below.
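To make that concrete, here is a minimal sketch of how scan data might get merged into the graph. The labels, properties, relationship types, and values here are illustrative guesses, not Mercator's actual schema:

```cypher
// App instances phone home with app metadata; the load-balancer scan reports
// per-server state. Joining the two on hostname yields a VirtualServer node.
MERGE (vs:VirtualServer {hostname: 'lc-ui-prod-101'})
SET vs.appId    = 'LC_UI',
    vs.env      = 'prod',
    vs.revision = '1f3a9c2',
    vs.version  = '4.2.1',
    vs.ip       = '10.0.1.101',
    vs.state    = 'active';

// Group servers into pools (by app ID, state, and hostname convention),
// and pools into a virtual service (by app ID and environment).
MATCH (vs:VirtualServer {hostname: 'lc-ui-prod-101'})
MERGE (p:Pool {appId: 'LC_UI', env: 'prod', name: 'pool-a', state: 'live'})
MERGE (svc:VirtualService {appId: 'LC_UI', env: 'prod'})
MERGE (p)-[:CONTAINS]->(vs)
MERGE (svc)-[:CONTAINS]->(p);
```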
For example, we never want to have multiple revisions of an app within the same pool, especially if that pool is live. This model allowed us to visualize that, and now we have alerts that go off when it happens. Similarly, if one of our servers is down, we'll get an alert on whether the pool is degraded or fully down. Also, once we hooked vCenter into the scanning, we were able to map our app instances and virtual servers to the vCenter instances and arrays, which gave us the ability to monitor for single points of failure: if all of one pool's instances are on one array, that's a single point of failure, we'll get an alert on it, and we'll redistribute.

That was a data center example, but it works exactly the same way in the cloud. We periodically scan a bunch of AWS services; you'll see there's EC2, RDS, SNS, SQS, a whole bunch of stuff. Again, this is a visualization of some of our AWS components. There's no need to understand exactly everything that's going on in that picture; I think the key is just to show how quickly all these components get pretty complicated. This isn't even everything that's in AWS, and there's a lot more in our graph database besides AWS. There's no way our minds would ever be able to conceptualize or visualize all of this on their own, but the graph model allows us to track our infrastructure, not just in the cloud or the data center but even beyond that, and keep track of all our interdependent infrastructure. This particular cloud model actually allowed developers to start spinning up their own instances in AWS. They were able to self-service more, instead of having to create tickets for us or wait for ops to spin up servers for them. In addition, this is what we built our cloud orchestration on, and we mirrored our blue-green deployment model from the data center into AWS. I know AWS recently released a blue-green CodeDeploy feature, but I just want to say that we made ours first.

Okay, so we've seen a data center example and a cloud example, and I think we have a basic understanding of how the graph model works, but there's a lot more to it. Let's just walk through the app lifecycle. An app will start out with documentation in Confluence; PMs and engineers work together to write stories and tickets in JIRA; engineers commit to Git; they build with Jenkins; artifacts get stored in Artifactory or S3; we deploy to AWS; the app goes into our load balancer; I mentioned VMware; then there are our monitoring tools, Splunk, Wavefront, and New Relic; we diagnose and talk about these apps and any problems on HipChat; we get paged; and underneath it all there's storage and Cisco UCS. All of these things, throughout the app lifecycle, get fed into Mercator. As I mentioned, we have over 400 microservices now, and they've all followed this lifecycle; they've all been managed from conception to deployment to monitoring and beyond, all through Mercator.

So we went from having low service visibility to having a unified graph model that we can query at any time and that gives us real-time information on the state of our infrastructure. We can now answer questions like: do we have a single point of failure? Are revisions synced across environments? How much is our AWS infrastructure costing us?
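As a hedged illustration of the kind of check this enables, here is what a revision-drift alert query might look like against the model sketched earlier; again, the label, property, and relationship names are assumptions rather than Mercator's real schema:

```cypher
// Flag any live pool that is running more than one revision of its app.
MATCH (p:Pool {state: 'live'})-[:CONTAINS]->(vs:VirtualServer)
WITH p, collect(DISTINCT vs.revision) AS revisions
WHERE size(revisions) > 1
RETURN p.appId, p.name, revisions;
```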
Another point I want to make is that the growth of our graph model happened very naturally and organically. When we first built it, we did not set out to include all these things; we really just wanted to answer the question of what we have deployed out there. The service discovery bit is what we started with, it's evolved over the past three years, and we've ended up with this. I think that highlights the fact that the graph model is very flexible: it grows and evolves with your infrastructure. Now, I'm not saying that everyone should go use a graph database to map out their infrastructure, but for this particular use case it's worked really well for us, and that's why we've continued to use it.

Okay, so I'm going to do a demo. Last time I presented, my demo didn't work, but I don't care, I'm going to do it again and hopefully have better luck this time. This is the Neo4j console; it comes with your Neo4j download. This is my local database, but it's actually a copy of our production database with some stuff scrubbed. Let me enlarge this. First I just want to show you some of the stuff we have in our graph database, with a really simple query just to get us warmed up. This is Cypher, which is kind of like SQL but for graph databases; it's a very visual query language. Anything in parentheses is a node. You'll see there are three AWS account nodes, and if I click into one, you'll see the properties within the node. This one's kind of boring: the only two properties are the AWS account and the update timestamp. We have non-prod, prod, and infrastructure. Again, kind of boring.

I can also filter according to certain properties. If I do this, it's going to filter to all the nodes that have AWS account equals prod, so it's just going to return this one node. That's not super interesting, but now we start to dive a little deeper. With this next query I'm asking for all nodes that have a relationship to the prod AWS account node, and because I did not specify a label here, it's going to return all label types that have a relationship. We'll see now that prod owns some AWS S3 buckets, some SNS topics, and four VPCs.

We're going to dive deeper again. Now I'm asking it to return everything that has a relationship to the four VPCs contained in the prod account, and I'm going to enlarge this. One thing to note: as your queries get a little more complicated, it does take a little time to render here; in real life, when you're hitting the back end, it's much faster. You can see now we're pulling in, in purple, the four VPCs we saw earlier, and within those VPCs we now have subnets, VPC endpoints, security groups, and regions. Not to belabor the point, but I'm going to do one more. Again, the query language itself is pretty visual: the parentheses mirror the nodes, and the two dashes mean a relationship. I'm going to limit this so it doesn't take forever to render. Okay, so now you see we're starting to pull in our EC2 instances, auto-scaling groups, and elastic load balancers, and this isn't even everything; there are AMIs, CodeDeploy deployments, stuff like that. Actually, if you think back to the slide where I showed you that mess of nodes, we're looking at either the same thing or something very similar. And we could dive deeper and deeper and start to see how the EC2 instances are related to CodeDeploy deployments, and so on.
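For readers following along, these are rough reconstructions of the demo queries from the narration; the label and property names (AwsAccount, AwsVpc, aws_account, and so on) are guesses at Mercator's schema rather than its actual definitions:

```cypher
// Warm-up: all AWS account nodes (the demo showed non-prod, prod, infrastructure).
MATCH (a:AwsAccount) RETURN a;

// Filter by property: just the prod account node.
MATCH (a:AwsAccount {aws_account: 'prod'}) RETURN a;

// Everything with a relationship to the prod account; no label on n,
// so all node types come back (S3 buckets, SNS topics, VPCs, ...).
MATCH (a:AwsAccount {aws_account: 'prod'})--(n) RETURN a, n;

// One level deeper: everything attached to the VPCs in the prod account,
// with a LIMIT so the visualization doesn't take forever to render.
MATCH (a:AwsAccount {aws_account: 'prod'})--(v:AwsVpc)--(n)
RETURN a, v, n LIMIT 300;
```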
You know, the EC2 instances are parts of ELBs and auto-scaling groups, and the auto-scaling groups and ELBs are also attached to each other. It's a very complicated network of stuff; on our own, our brains could never handle that type of load, but the graph model allows us to visualize and make sense of it.

All right, so that was the exploratory stuff, which is kind of interesting, I guess, and it's what we used to build all our cloud orchestration, but maybe not that useful on its own. This next one is the type of thing that my boss, or my boss's boss, might ask. I have a query here that takes those EC2 instance node types we just saw in the previous query and matches them with another label called AWS EC2 instance type. The properties within the type node are just the model, for example c4.large or t2.medium, and the hourly cost of that model. We're mapping those two together; again, the two dashes in between carry the specific label for that relationship, and here the EC2 instance's relationship label is HAS_TYPE. So we map those, and then we group the instances by account, region, and instance type. You see from this query that we get, in descending order, the monthly cost of our instances. It's kind of funny, actually: I was in a meeting with some of our billing folks and my boss and my boss's boss, and we were trying to get better visibility into our billing. Because this information was in Neo4j, it was a matter of minutes to spin out this Cypher, and we actually have reports generated off of it now; every week we send them to Wavefront. It's a thing now.

Just to show you another way to look at this information: here is the same instance data in another view, and we could group it to just show the total cost, or the cost by account, or by region, or by instance type. We can also, as in this example, show the cost by app ID. Up here, the app definition is another label that we have: we have a service catalog where we keep track of all of our services, and by mapping the EC2 instance tags to the app definition we can get a picture of how much each app costs us per month, plus a description of the app, what type it is, how many instances are running, and again the monthly cost. You look at it and go, this one is kind of expensive; this one is expensive, but we have a lot of instances. It's interesting stuff, and useful to know.

And finally: what is this thing? This is another case where my manager, or my manager's manager, or someone from InfoSec might come running down. You have an IP, you don't know what it is, and they want to know what's on it, where it's running, and what it's doing. Again, Neo4j can help. Those app instances we saw toward the beginning of the presentation store the IP as one of their properties, so we match according to the IP, and then we match the virtual server, the pool, the virtual service, and the app definition that it's all connected to. We go from not knowing what this IP is to knowing exactly what it is: we have the hostname, we know what app is deployed on it as well as the revision and the version, and we know it's running in prod. So okay, maybe this is a real problem. But then, according to our naming convention, the 200s are pool B, and we see that the dark pool is pool B, so this server isn't actually taking traffic; maybe it's not as big of a problem. If it were live, though, and we needed to restart it, we would see, oh no, there are this many connections; maybe we should wait, maybe we should drain it first. Then we can take action.
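Here are hedged reconstructions of those last two queries, the cost rollup and the IP lookup, built against the same assumed schema as the earlier sketches. Only HAS_TYPE is a relationship name actually mentioned in the talk; everything else, including the example IP, is illustrative:

```cypher
// 1. Monthly EC2 cost in descending order, grouped by account, region,
//    and instance type. The ~730 hours/month conversion is an assumption.
MATCH (i:AwsEc2Instance)-[:HAS_TYPE]->(t:AwsEc2InstanceType)
RETURN i.account AS account,
       i.region  AS region,
       t.model   AS instanceType,
       count(i)  AS instances,
       round(sum(t.hourlyCost) * 730) AS monthlyCost
ORDER BY monthlyCost DESC;

// 2. "What is this IP?" Walk from the app instance up through the virtual
//    server, pool, and virtual service to the app definition.
MATCH (ai:AppInstance {ip: '10.0.1.200'})--(vs:VirtualServer)
      --(p:Pool)--(svc:VirtualService)--(app:AppDefinition)
RETURN ai.hostname, vs.revision, vs.version, p.name, p.state,
       svc.env, app.appId, app.description;
```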
With this one too, it's easy to see how, if you wanted to go even deeper, you could map the app definition to a Git repo or a Jenkins job, and then you can really start to dig down and figure out all the different components behind this IP.

So that is the story of how we've used a graph model at Lending Club to manage our infrastructure and run the company. I'd like to recap a few of the things our graph model has allowed us to do. I mentioned automating all the things earlier on: you can't automate them all if you don't know what they are, and with our graph model we have a pretty clear visualization of our infrastructure. I mentioned some of the monitoring and alerting capabilities, and some of the deployment automation and cloud orchestration, that we've been able to build using our graph model. Some other examples I can think of: automated patching, for one. We keep track of the image attached to each of our EC2 instances, and if that image is over, let's say, 30 days old, we have software that will automatically roll those instances forward to use the most recent image.
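A hypothetical version of that staleness check, expressed as a query against the assumed schema; the relationship name and timestamp property are illustrative, and Mercator may well implement this differently:

```cypher
// Find EC2 instances whose image is more than 30 days old.
// timestamp() is milliseconds since the epoch in Cypher.
MATCH (i:AwsEc2Instance)-[:USES_IMAGE]->(img:AwsAmi)
WHERE img.createdTs < timestamp() - 30 * 24 * 60 * 60 * 1000
RETURN i.instanceId, img.name, img.createdTs;
```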
We also have, for example, automated nightly replication from our primary data center to our secondary data center, and from our primary region in AWS to our secondary region. It's very hands-off: we get paged less in the middle of the night, and because we spend less time patching or replicating from our primary to our secondary site, we have more time to build cool tools.

The graph model has also allowed us to push for DevOps as a culture, not as our team name or as a title. I mentioned we have more free time to build cool tools, and some of those tools are built around Mercator. By exposing the information in Mercator not just to our team but also to release engineering, QA, and even the risk teams, those teams can use this data and leverage it to build their own automation, and they can self-service more instead of having to ask us for things, or file a ticket and then wait on us. We don't have to be the blocker anymore; it's allowed engineering to take more ownership.

Finally, circling back to my opening statement, our graph model has allowed us to be pragmatic, not dogmatic. With this model we treat almost all of our third-party tools exactly the same, we have a standard and unified pipeline, and we're not locked into any particular technology or tool. If tomorrow our CTO said we're leaving AWS and going to Microsoft Azure or Google Cloud or Oracle Cloud, for whatever reason, then in terms of this model very little would change. That gives us a lot of flexibility when it comes to infrastructure, and when it comes to the next big thing at Lending Club.

So that's it. This is my Twitter handle if you have any questions. If you're interested in maybe trying out Mercator, we have an open-source version out there; if it doesn't work, you also have my handle, so you can yell at me on Twitter. Thank you, guys.

Q: All right, so I guess you kind of answered this at the end: clearly you have open-sourced some version of this. My question was, when you first determined you had a need for this type of tool, how much exploration did you do in terms of what was available in the open-source world, and how did you decide that none of it was going to cut it, or that it was just easier and simpler to build your own? What was that process like?

A: Right. So I actually joined right after we chose Neo4j, but from what I've heard from my manager, he did try out some other graph databases; I think TinkerPop was one, and there was one more, and then he settled on Neo4j. It did kind of start as an experiment, like, oh, this is cool, and then once I joined I started adding to it. Once we saw how useful it was and how easy it was to visualize our infrastructure, we never looked back. In terms of open source, there are some things; I think there's a tool, I forget what it's called, that also maps out your AWS infrastructure in a graph-like model, but you do have to pay for it, and it was simpler and easier for us to build this ourselves. This way we have a rhythm now where, if we have a new tool or some third-party technology, we'll just build a new integration with Mercator and it'll get loaded into the graph model, which is huge.

Q: GraphQL is a thing that Facebook does, right? I've been looking at options as well. Do you just have a single REST endpoint? You said you built tools off of this tool, so how do you handle those requests? Is it querying your graph database directly, and is that through a web API or direct?

A: Through an API. We've also written a wrapper around some of the Neo4j stuff, and we use that wrapper; we like to wrap our third-party tools within our own internal stuff.

MC: I think there are more questions, but for the sake of moving on and getting to lunch at some point, you can find Ashley in the hallway to answer all the questions, or start an open space on the subject. Thank you so much, Ashley, that was amazing.