Distributed Tracing: From Theory to Practice


[Music] Sweet, all right, everybody ready to go? I'm sorry, this is a very technical talk in a very sleepy talk slot, so if you fall asleep in the middle I will be super offended, but I won't call you on it too hard. So yeah, I'm Stella Cotton. If you don't know me, I'm an engineer at Heroku, and today we're going to talk about distributed tracing.

Before we get started, a couple of housekeeping notes. I'll tweet out a link to my slides afterwards, so they'll be on the internet. There are some code samples and some links, so you'll be able to check those out if you want to take a closer look. And I also have a favor to ask. If you have seen me speak before, I have probably asked you this favor. Ruby karaoke last night, anybody go? Yeah, it totally destroyed my voice, so I'm going to need to take some drinks of water, but otherwise I get really awkward and I don't like to do that. So to fill the silence, I'm going to ask you to do something that my friend Lola Chalene came up with, which is: each time I take a drink of water, start clapping and cheering. All right, we're going to try this out. I'm going to do this. Yeah! All right, hopefully that happens a lot during this talk so that I won't lose my voice.

So, back to distributed tracing. I work on a tools team at Heroku, and we've been working on implementing distributed tracing for our internal services there. Normally I do this whole Brady Bunch team thing with the photos, but I just want to acknowledge that a lot of the trial and error and discovery that went into this talk was really a team effort across my entire team.

So, the basics of distributed tracing. Who knows what distributed tracing is? Okay, okay, cool. Who has it at their company right now? Oh, I see a few, okay. So if you don't actually know what it is, or you're not really sure how you would implement it, you're in the right place; this is the right talk for you. It's basically just the ability to trace a request across distributed
system boundaries. And you might think, "Stella, we are Rails developers. This is not a distributed systems conference. This is not Scala or Strange Loop, you should go to those." But really, there's this idea of a distributed system, which is just a collection of independent computers that appear to a user to act as a single coherent system. So if a user loads your website and more than one service does some work to render that request, you actually have a distributed system. And technically, because somebody will definitely "well, actually" me, if you have a database and a Rails app, that's technically a distributed system too, but I'm really going to talk more about just the application layer today.

So, a simple use case for distributed tracing: you run an e-commerce site, and you want users to be able to see all of their recent orders. In a monolithic architecture you've got one web process, or multiple web processes, but they're all running the same kind of code, and they're going to return information: users have many orders, and orders have many items. A very simple Rails web app: we authenticate our user, our controller grabs all of the orders and all of the items, and renders them on a page. Not a big deal: single app, single process.

Now we're going to add some more requirements. We've got a mobile app or two, and they need authentication, so suddenly it's just a little more complicated. There's a team dedicated to authentication, so now you maybe have an authentication service, and they don't care at all about orders. That makes sense: they don't need to know about your stuff, and you don't need to know about theirs. It could be a separate Rails app on the same server, or it could be on a different server altogether. And it's going to keep getting more complicated: now we want to show recommendations based on past purchases, and the team in charge of these recommendations is a bunch of data-science-y folks who only write Python and do a bunch of machine learning.
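Before things get complicated, that monolithic starting point can be sketched in a few lines. This is a toy, not the talk's actual code: plain Ruby Structs stand in for the ActiveRecord models, and every name here is an assumption.

```ruby
# A toy version of the monolith described above: users have many orders,
# and orders have many items. Structs stand in for ActiveRecord models.
Item  = Struct.new(:name)
Order = Struct.new(:items)
User  = Struct.new(:orders)

# One controller-ish method: grab all of the user's orders and their items.
def recent_orders(user)
  user.orders.map { |order| order.items.map(&:name) }
end

user = User.new([Order.new([Item.new("book"), Item.new("pen")])])
p recent_orders(user) # => [["book", "pen"]]
```

One request, one process: no tracing needed yet.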
So naturally the answer is microservices, obviously. But seriously, it might be services. As your engineering team and your product grow, you don't have to have jumped on the microservices bandwagon to find yourself supporting multiple services. Maybe one is written in a different language, or maybe one has its own infrastructure needs, like, for example, our recommendation engine. And as our web apps and our teams grow larger, these services that you maintain might begin to look less and less like a very consistent garden and more like a collection of different plants in different kinds of boxes.

So where does distributed tracing fit into this big picture? One day your e-commerce website starts loading very, very slowly. If you look in your application performance monitoring, like New Relic or Skylight, or you use a profiling tool, you can see the recommendation service is taking a really long time to load. But with single-process monitoring tools, all of the services that you or your company own are going to look just like third-party API calls. You're getting as much information about their latency as you would about Stripe or GitHub or whoever else you're calling out to. So from that user's perspective, you know there are 500 extra milliseconds to get the recommendations, but you don't really know why, not without reaching out to the recommendations team, figuring out what kind of profiling tool they use for Python, who knows, and digging into their services. And it just gets more complicated as your system gets more complicated. At the end of the day, you cannot tell a coherent macro story about your application by monitoring these individual processes. And if you have ever done any performance work, you know people are very bad guessers when it comes to understanding bottlenecks. So what can we do to increase our visibility into the system and tell that macro story? Distributed tracing can help. It's a way of
commoditizing knowledge. Adrian Cole, one of the Zipkin maintainers, talks about how, in increasingly complex systems, you want to give everyone tools to understand the system as a whole without having to rely on experts.

So, cool, you're on board, I've convinced you that you need this, or at least that it makes sense. But what might actually be stopping you from implementing it at your company? There are a few different things that make it tough to go from theory to practice with distributed tracing. First and foremost, it's kind of outside the Ruby wheelhouse; Ruby is not well represented in the ecosystem at large. Most people are working in Go or Java or Python, and you're not going to find a lot of sample apps or implementations written in Ruby. There's also a lot of domain-specific vocabulary in distributed tracing, so reading through the docs can feel pretty slow. And finally, the most difficult hurdle of all is that the ecosystem is extremely fractured. It's changing constantly, because it's about tracing everything, everywhere, across frameworks and across languages, and it needs to support everything. So navigating the solutions that are out there and figuring out which ones are right for you is not a trivial task.

We're going to work on getting past some of these hurdles today. We'll start by talking about the theory, which will help you get comfortable with the fundamentals, and then we'll cover a checklist for evaluating distributed tracing systems. Yeah, all right, I love that trick.

So, let's start with the basics: black box tracing. The idea of a black box is that you do not know about, and cannot change, anything inside your applications. An example of black box tracing would be capturing and logging all of the traffic that comes in and out at a lower level in your application, like at your TCP layer. All of that data goes into a single log aggregator, and then, with the power of statistics, you just kind of get to magically
understand the behavior of your system based on timestamps. But I'm not going to talk a lot about black box tracing today, because for us at Heroku it was not a great fit, and it's not a great fit for a lot of companies, for a couple of reasons. One: you need a lot of data to get accuracy based on statistical inference, and because it uses statistical analysis, it can have some delays in returning results. But the biggest problem is that in an event-driven system, like Sidekiq, or a multi-threaded system, you can't guarantee causality. What does that mean, exactly? This is sort of an arbitrary code example, but it helps to show that if service one kicks off an async job and then immediately, synchronously, calls out to service two, and there's no delay in your queue, your timestamps will correlate correctly: service one caused the async job, awesome. But if you start getting queuing delays and latency, the timestamps might actually make it consistently look like your second service is making that call.

White box tracing is a tool people use to get around that problem. It assumes that you have an understanding of the system and that you can actually change it. So how do we understand the paths that our requests take through our system? We explicitly include information about where a request came from, using something called metadata propagation, which is a type of white box tracing. That's just a fancy way of saying we can change our Rails apps, or any kind of app, to explicitly pass along information, so that you have an explicit trail of how things flow. Another benefit of white box tracing is real-time analysis: you can get results in almost real time.

A very short history of metadata propagation. The example everyone talks about is Dapper, and the open source library it inspired, called Zipkin. The Dapper paper was published by Google in 2010, but it's not actually the first distributed
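A quick sketch of that causality problem. This is an invented event log, not real tracer output: with a queueing delay, sorting by timestamp alone suggests the wrong causal order, while an explicit parent ID recovers it.

```ruby
# Hypothetical event log. The async job sits in a queue and runs late,
# so by timestamp it *looks* caused by service_2's work.
events = [
  { service: "service_1", action: "enqueue_async_job", at: 0.0, id: "a", parent_id: nil },
  { service: "service_1", action: "call_service_2",    at: 0.1, id: "b", parent_id: nil },
  { service: "service_2", action: "handle_request",    at: 0.2, id: "c", parent_id: "b" },
  { service: "worker",    action: "run_async_job",     at: 0.9, id: "d", parent_id: "a" }
]

# Black-box view: order of events by timestamp only.
by_time = events.sort_by { |e| e[:at] }.map { |e| e[:action] }

# White-box view: follow the explicit parent_id pointer instead.
caused_by = ->(e) { events.find { |p| p[:id] == e[:parent_id] } }
job = events.find { |e| e[:action] == "run_async_job" }
puts caused_by.call(job)[:service] # => "service_1", the job's true parent
```

Timestamps still matter for latency, but causality comes from the propagated metadata.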
systems debugging tool to be built. So why is Dapper so influential? Honestly, it's because, in contrast to all of the systems that came before it, whose papers were published pretty early in their development, Google published this paper after Dapper had been running in production, at Google scale, for many, many years. So they were able to say not only that it was viable at Google scale, but also that it was valuable.

Next comes Zipkin. That's a project started at Twitter during their very first hack week, with the goal of implementing Dapper. They open sourced it in 2012, and it's currently maintained by Adrian Cole, who is not actually at Twitter anymore; he's at Pivotal, and he spends most of his time working in the distributed tracing ecosystem. From here on out, when I use the term "distributed tracing," I'm going to mean Dapper-and-Zipkin-like systems, because "white box metadata propagation distributed tracing systems" is not quite as zippy. If you want to read about approaches beyond metadata propagation, there's a pretty cool paper that gives an overview of tracing distributed systems beyond this.

So how do we actually do this? I'm going to walk us through the main components that power most systems of this kind. First is the tracer. That's the instrumentation you actually install in your application itself. There's a transport component, which takes the data the tracers collect and sends it over to the distributed tracing collector. The collector is a separate app that processes the data and saves it to a storage component. And finally there's a UI component, typically running inside the collector, that lets you view your tracing data.

We'll talk first about the level closest to your application itself: the tracer. It's how you trace individual requests, and it lives inside your application. In the Ruby world, it's installed as a gem, just like any other performance monitoring
agent that would monitor a single process. A tracer's job is to record data from each system so that we can tell a full story about your request. You can think of the entire story of a single request lifecycle as a tree: this whole system, captured in a single trace.

Next vocab word: span. Within a single trace are many spans. A span is a chapter in that story. In this case, our e-commerce app calling out to the order service and getting a response back is a single span. In fact, any discrete piece of work can be captured by a span; it doesn't have to be a network request.

So if we want to start mapping out this system, what kind of information do we pass along? You could start with just a request ID, so that you know every path the request took: you query your logs and you can see it's all one request. But then you have the same issue as with black box tracing: you can't guarantee causality based on timestamps alone. You need to explicitly create a relationship between each of these components, and a really good way to do that is with a parent-child relationship. The first request in the system doesn't have a parent, because somebody just clicked a button to load a website, so we know it's at the top of the tree. Then, when your auth process calls the e-commerce process, it modifies the request headers to pass along a randomly generated ID as a parent ID. Here it's set to one, but it could really be anything. And it keeps going with each request. The trace is ultimately made up of many of these parent-child relationships, and it forms what's called a directed acyclic graph. By tying all of these things together, we're able to understand this not just as an image but as a data structure. We'll talk in a few minutes about how the tracer actually accomplishes that in our code.

So we've got our relationships. If that's all we wanted to know, we could stop there, but
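As an aside, the parent-child idea can be shown in a few lines of Ruby. This is a sketch with made-up span IDs and service names, not a real tracer's data model: each span carries its parent's ID, and following those pointers reassembles the tree.

```ruby
# A trace as a flat list of spans tied together by parent_id.
Span = Struct.new(:id, :parent_id, :name, keyword_init: true)

spans = [
  Span.new(id: 1, parent_id: nil, name: "auth"),            # root: no parent
  Span.new(id: 2, parent_id: 1,   name: "ecommerce"),
  Span.new(id: 3, parent_id: 2,   name: "orders"),
  Span.new(id: 4, parent_id: 2,   name: "recommendations")
]

# Rebuild the directed acyclic graph: group children under their parents.
children = spans.group_by(&:parent_id)
root = children[nil].first
tree = ->(span) { { span.name => (children[span.id] || []).map { |c| tree.call(c) } } }
p tree.call(root)
# => {"auth"=>[{"ecommerce"=>[{"orders"=>[]}, {"recommendations"=>[]}]}]}
```

The spans arrive at the collector in any order; the parent IDs are what let it rebuild this structure.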
that's not really going to help us in the long term with debugging. Ultimately we want to know about timing, and we can use annotations to build a richer set of information around these requests. By explicitly annotating, with timestamps, when each of these things occurs in the cycle, we can begin to understand latency. Hopefully you're not seeing a second of latency between every event, and these would definitely not be reasonable real-world timestamps, but this is just an example.

Let's zoom in on our auth process and how it talks to the e-commerce process. In addition to passing along the trace ID, parent ID, and child span ID, we'll also annotate the request with a tag and a timestamp. By having our auth app annotate that it's sending the request, and our e-commerce app annotate that it received the request, you actually get the network latency between the two. So if you see a lot of requests queuing up, you'd see that time go up. On the other hand, you can compare the two timestamps between the server receiving the request and the server sending back the response, and you'd be able to see if your app is getting slow; you'll see latency increase between those two points. And finally, you close out the full cycle by indicating that the client has received the final response.

Let's talk about what happens to that data. Each process sends information, via the transport layer, to a separate application that aggregates that data and does a bunch of stuff to it. So how does that process not add latency? First, it only propagates those IDs in band, by adding information to your headers. Then it gathers the rest of the data and reports it out of band to a collector, and that's what actually does the processing and the storing. For example, Zipkin uses SuckerPunch to make a threaded async call out of the server, and this is going to be similar to things that
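The four annotations just described are conventionally abbreviated cs, sr, ss, and cr in Zipkin-style tracers. Here's a tiny worked example; the timestamps are invented milliseconds, just like the slide's.

```ruby
# One client/server round trip with the four Zipkin-style annotations:
# cs = client sent, sr = server received, ss = server sent, cr = client received.
annotations = {
  "cs" => 100, # auth app sends the request
  "sr" => 120, # e-commerce app receives it
  "ss" => 450, # e-commerce app sends the response
  "cr" => 470  # auth app receives the response
}

network_out = annotations["sr"] - annotations["cs"] # request transit / queueing
server_time = annotations["ss"] - annotations["sr"] # time spent inside the app
total_time  = annotations["cr"] - annotations["cs"] # what the caller experienced
p [network_out, server_time, total_time] # => [20, 330, 370]
```

If network_out starts creeping up across many traces, requests are queuing; if server_time grows, the app itself is getting slow.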
you would see in metrics libraries like Librato, or any of your logging and metrics systems that use threads.

So: data is collected by the tracer, transported via the transport layer, collected, and finally ready to be displayed to you in the UI. The request-tree graph we were reviewing is a good way to understand how the request travels, but it's not actually good at helping us understand latency, or even the relationships between calls within systems. So we use Gantt charts, or swim lanes, instead. The opentracing.io documentation has a request tree similar to ours, and looking at it in this format, you can see each of the different services in the same way we did before, but now we're able to better visualize how much time is spent in each sub-request, and how much time that takes relative to the other requests. You can also, like I mentioned earlier, instrument and visualize internal traces happening inside a service, not just service-to-service communication. Here you can see the billing service is being blocked by the authorization service. You can also see that we have threaded or parallel job execution inside the resource allocation service. And if there started to be a widening gap between these two adjacent services, it could mean that network requests are queuing. I still can't help saying it like that and doing a little dance when I do.

All right, we know what we want; how are we going to get it done? At a minimum, we want to record information when a request comes in and when a request goes out. How do we do that programmatically in Ruby? Usually with the power of Rack middleware. If you're running a Ruby app, the odds are you're also running a Rack app. It's a common interface for servers and applications to talk to each other; Sinatra and Rails both use it. It serves as a single entry and exit point for client requests coming into the system. The powerful thing about Rack is that it's very easy to add
middleware that can fit between your server and your application and let you customize these requests. A basic Rack app, if you're not familiar with it, is a Ruby object that responds to call, takes one argument, the environment, and returns a status, headers, and a body. That's the basic Rack app, and under the hood Rails and Sinatra are doing this. The middleware format has a very similar structure: it accepts an app, which could be your app itself or another piece of middleware, responds to call, calls app.call so that it keeps following down the chain, and at the end returns the response.

So if we wanted to do some tracing inside of our middleware, what might that call method look like? Like we talked about earlier, we want to start a new span on every request. It records that it received the request, with a "server received" annotation, yields to our Rack app to make sure the next step in the chain executes and your code actually runs, and then records that the server has sent the response back to the client. This is just pseudocode, not an actual running tracer; Zipkin has a really great implementation you can check out online. Then we can just tell our application to use our middleware to instrument incoming requests. And you're never going to want to sample every single request that comes in, because that is crazy overkill when you have a lot of traffic, so tracing solutions will typically ask you to configure a sample rate.

So we've got our requests coming in. But in order to generate that big relationship tree we saw earlier, we also need to keep recording information when a request leaves our system. These can be requests to external APIs, like Stripe, GitHub, whatever, but if you control the next service being talked to, you can keep building up this chain. And we do that with more middleware. If you use an HTTP client that
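Here's a runnable toy version of that middleware, under some loud assumptions: the B3-style header names are Zipkin's convention, and a global array stands in for a real tracer, which would report spans out of band instead of appending to memory.

```ruby
require "securerandom"

TRACER = [] # stand-in for a real tracer client that reports out of band

class TracingMiddleware
  def initialize(app, sample_rate: 1.0)
    @app = app
    @sample_rate = sample_rate
  end

  def call(env)
    # Honor the sample rate: untraced requests just pass straight through.
    return @app.call(env) unless rand < @sample_rate

    span = {
      trace_id:  env["HTTP_X_B3_TRACEID"] || SecureRandom.hex(8),
      parent_id: env["HTTP_X_B3_SPANID"], # nil at the top of the tree
      id:        SecureRandom.hex(8),
      sr:        Time.now                 # "server received"
    }
    status, headers, body = @app.call(env) # run the rest of the chain
    span[:ss] = Time.now                   # "server sent"
    TRACER << span
    [status, headers, body]
  end
end

# A one-lambda "Rack app" wrapped in the middleware.
app = TracingMiddleware.new(->(env) { [200, {}, ["ok"]] })
status, _headers, body = app.call("HTTP_X_B3_TRACEID" => "abc123")
```

In a real Rails app this would be a `config.middleware.use` line; the zipkin-tracer gem's Rack middleware is the production-grade version of this shape.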
supports middleware, like Faraday or Excon, you can easily incorporate tracing into the client. I'll use Faraday as an example, because it has a pretty similar pattern to Rack. So: match the method signature, just like we did with Rack. Honestly, Faraday's is very similar to Rack's; if you're using something like Excon it's going to look a little different, but this is just an example. We pass in our HTTP client app, do some tracing, and keep calling down the chain. It's pretty similar, but the tracing itself is a little different, because we actually need to manipulate the headers to pass along some tracing information. That way, if we're calling out to an external service like Stripe, they'll completely ignore these headers, because they don't know what they are; but if you're calling another service that's within your purview, you'll be able to see further down the chain. Each of these colors represents an instrumented application. So we record that we're starting a client request, record when we receive the client response, and add in the middleware just like we did with Rack. It's pretty easy; for some HTTP clients you can even do it automatically for all of your requests.

So we've got some of the basics of how distributed tracing is implemented. Let's talk about how to choose, in this ecosystem, which system is right for you. The first question is: how are you going to get this working? I'll give a caveat that this ecosystem is ever-changing, so this information could actually be incorrect right now, and it could be obsolete, especially if you're watching this at home on the web. Let's talk about whether you should buy a system. Yes, if the math works out for you. It's kind of hard for me to really say whether you should do that, but if your resourcing is limited and you can find a solution that works for you and it's not too expensive, probably,
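The client-side half can be reduced to the header logic. This sketch uses Zipkin's B3 header names; the function name and span hash shape are assumptions for illustration, and in Faraday you'd do the same thing inside a middleware's call method before passing the request on.

```ruby
require "securerandom"

# Before an outgoing HTTP call leaves our app, copy the current trace id
# into B3-style headers and mint a new child span id. Our current span
# becomes the parent of whatever the downstream service does.
def inject_trace_headers(headers, current_span)
  headers.merge(
    "X-B3-TraceId"      => current_span[:trace_id],
    "X-B3-ParentSpanId" => current_span[:id],     # we are the parent
    "X-B3-SpanId"       => SecureRandom.hex(8)    # the child span's id
  )
end

current  = { trace_id: "abc123", id: "span-1" }
outgoing = inject_trace_headers({ "Accept" => "application/json" }, current)
# Services we own read these headers and continue the trace;
# third parties like Stripe simply ignore them.
```

The receiving service's Rack middleware reads these same headers, which is how the parent-child chain keeps growing across system boundaries.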
unless you're running a super complex system. LightStep and TraceView are examples that offer Ruby support, and your APM provider might actually offer it too. Adopting an open source solution is another option; for us, the paid solutions just didn't work. If you have people on your team who are comfortable with the underlying framework, and you have some capacity for managing infrastructure, then this could really work for you. For us, with a small team of just four engineers, we got Zipkin up and running in a couple of months while also doing a million other things, partially because we were able to leverage Heroku to make the infrastructure components pretty easy. And if you want a fully open source solution with Ruby, Zipkin is pretty much your only option, as far as I know.

You may have heard of OpenTracing, and you might be like, "Stella, what about this OpenTracing thing? That seems cool." A common misunderstanding: OpenTracing is not actually a tracing implementation. It is an API. Its job is just to standardize the instrumentation, like what we walked through before, so that all of the tracing providers conforming to the API are interchangeable on your app's side. So if you want to switch from an open source provider to a paid provider, or vice versa, you don't need to re-instrument each and every service you maintain, given that, in theory, they're all being good citizens conforming to this consistent API.

So where is OpenTracing at today? They did publish Ruby API guidelines back in January, but only LightStep, which is a product in private beta, has actually implemented a tracer that conforms to that API. Existing tracer implementations like Zipkin's are going to need a bridge between the tracing implementation they have today and the OpenTracing API. And the other thing that's still just not clear is interoperability. For example, say you have a Ruby app using the OpenTracing API,
everything's great, and then you have a paid provider that doesn't support Go: you can't necessarily use two providers that both speak OpenTracing and still send data to the same collection system. It really only standardizes things at the app level.

Another thing to keep in mind is that, for both open source and hosted solutions, "Ruby support" can mean a really wide range of things. At minimum it means you can start and end a trace in your Ruby app, which is good, but you might still have to write all of your own Rack middleware and your HTTP library middleware. It's not a deal breaker; we ended up having to do that for Excon with Zipkin, but it may be an engineering time commitment that you are not prepared to make. And then, unfortunately, because this is tracing everything everywhere, you'll need to rinse and repeat for every language your company supports. You'll have to walk through all of these questions and guidelines for Go, or for JavaScript, or for any other language.

Some big companies find that, with the custom nature of their infrastructure, they need to build out some or all of the elements in house. Etsy, and obviously Google, are running fully custom infrastructure, but other companies are building custom components that tap into open source solutions. Pinterest's Pintrace is an open source add-on to Zipkin, and Yelp has done something similar. If you're really curious about what other companies are doing, large and small, Jonathan Mace at Brown University published a snapshot of 26 companies and what they're using. It's already out of date; one of those entries is already wrong even though it was literally published a month ago. But 15 are using Zipkin and nine are using custom internal solutions, so yeah, most people are using Zipkin.

Another component of this is: what are you running in house? What does your team, or your ops team, want to run in house, and are there any restrictions? There's this dependency matrix of
the tracer and the transport layer, which need to be compatible with every one of your services. So for JavaScript, Go, and Ruby, both the tracer and the transport layer need to be compatible across the board. For example, for us, HTTP and JSON are totally fine as a transport layer; we literally just make web requests out to our Zipkin collector. But if you have a ton of data and you need something like Kafka, you might think, cool, it's totally supported. Then you look at the documentation, and it says Ruby, and when you dig four layers deep into that documentation, it turns out it's JRuby only. That's a total gotcha, and for each of these you really should build a spreadsheet, because it's pretty challenging to make sure you're covering everything. The collection and storage layers aren't really tied to the services you run, but they might not be the kind of apps you're used to running. For example, Zipkin is a Java app, which is totally different from the apps my team runs.

Another thing to figure out is whether you need to run a separate agent on the host machine itself. For some solutions, and this is why we had to exclude a lot of them, you actually need to install an agent on each host, for each service that you run. Because we're on Heroku, we can't really do that; we can't just give root-level privileges to an agent running on a dyno.

Another thing to consider is authentication and authorization: who can see and submit data to your tracing system? For us, Zipkin was missing both of those components, and that makes sense, because it needs to be everything for everybody, and adding authentication and authorization on top of that for every single company using the open source library is not really reasonable. So you can run it inside a VPN without authentication; the other option is
using a reverse proxy, which is what we ended up doing. We used two buildpacks, apt and runit, to get nginx onto the Heroku slug, which is just a bundle of your dependencies and your code. Apt is a package manager for Linux, so we can download and install a specific version of nginx to run as a reverse proxy, and runit lets us run our Zipkin application and nginx alongside each other on the same host. We didn't want anybody on the internet to just be able to send data to Zipkin; if you suddenly started sending data to our Zipkin instance, that would be pretty weird. So we wanted to make sure only Heroku applications were interacting with it, and we decided to use basic authorization for that. We used htpasswd to set some team-based credentials in a flat file, because we only had about 25 different basic auth configurations that we thought we'd be using.

From an architecture diagram standpoint, it ends up looking like this: the client makes the request, nginx intercepts it and checks it against basic auth, and if it's valid, forwards it along to Zipkin; otherwise it returns an error. Adding authentication on the client side was as easy as going back to that Rack middleware file and updating our hostname with the basic auth credentials. So that was a really good solution for us.

We also didn't want any of y'all to be able to see our Zipkin data on the internet, because right now, if you just run a Zipkin instance, there's no authorization; there's nothing to keep you from seeing anybody's data. So we used Bitly's oauth2_proxy, which is super awesome. It lets us restrict access to only people with heroku.com email addresses. If you're in a browser and you try to access our Zipkin instance, we check to see if you're authorized; otherwise the oauth2 proxy handles the full
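The nginx-in-front-of-Zipkin arrangement described here might look roughly like the fragment below. This is a sketch, not our actual config: the port, file path, and realm name are all placeholders.

```nginx
server {
  listen 80;

  location / {
    # Team-based credentials generated with htpasswd, stored in a flat file.
    auth_basic           "Zipkin";
    auth_basic_user_file /app/.htpasswd;

    # Valid requests are forwarded along to the Zipkin app on the same host;
    # anything else gets a 401 from nginx before it ever reaches Zipkin.
    proxy_pass http://127.0.0.1:9411;
  }
}
```

9411 is Zipkin's conventional listen port; the point is just that nginx fields every request first.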
authentication. It's configurable with different load balancers slash reverse proxies and OAuth providers, so it's actually really cool if you need to run any kind of auth in front of a process. But even if you're going the hosted route and you don't need to handle any of this infrastructure, you'll need to ask how you're going to get access to the people who need it, because you don't want to be the team that has to manage the hand-off of sign-ons and sign-ins and "oh, you need to email this person." You don't want to manage all that, so just make sure it's clear with your hosted provider how you're going to manage access.

Security: if you have sensitive data in your systems, which a lot of people do, there are two places specifically where we had to really keep an eye out for security issues. One is custom instrumentation. For example, my team, the tools team, added some custom internal tracing of our own services, using prepend to trace all of our Postgres calls. Just like we did with the middleware earlier, we're wrapping that behavior with tracing. But the problem is, if you're calling to_sql and that SQL statement has any kind of private data, you want to make sure you're not just storing it blindly in your system, especially if you have PII or any kind of compliance-sensitive information. The second thing is that you need to talk through, before it happens, what to do when your data leaks. For us, running our own system is a benefit, because if we accidentally leak data into our tracing system, it's easier to validate that we've wiped that data when we own it than to coordinate with a third-party provider. That doesn't mean you shouldn't use a third-party solution, but you should ask them ahead of time: what do you do when data leaks? What's the turnaround? How can we
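The prepend technique mentioned here can be sketched in a few lines. This is not our actual instrumentation: FakePG stands in for a real database adapter (like the pg gem), and the scrubbing regex is a deliberately crude illustration. The point is wrapping a library method without editing its source, and scrubbing the SQL before it's stored, so raw statements with PII never reach the tracing system.

```ruby
RECORDED = [] # stand-in for the tracer's span store

# Stand-in for a real Postgres adapter class.
class FakePG
  def exec(sql)
    "result of: #{sql}"
  end
end

# Module#prepend inserts this module *before* FakePG in the ancestor
# chain, so our exec runs first and reaches the original via super.
module TracedExec
  def exec(sql)
    started = Time.now
    result = super
    RECORDED << {
      name: "db.query",
      sql:  sql.gsub(/'[^']*'/, "?"), # crude scrub: drop quoted literals
      took: Time.now - started
    }
    result
  end
end

FakePG.prepend(TracedExec)
FakePG.new.exec("SELECT * FROM users WHERE email = 'pii@example.com'")
# RECORDED now holds the query shape, with the email address scrubbed out.
```

The same wrap-and-super shape works for any client library call you want to time, which is why prepend is such a handy instrumentation hook.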
verify it? You don't want to be doing that in the middle of a crisis.

The last thing to consider is the people part: is everybody on board with this? The nature of distributed tracing is that the work is distributed. Your job is probably not done when you just get the service up and running; you're going to actually need to instrument apps, and there's a lot of cognitive load, as you can see from the thirty minutes we've spent talking about this, in understanding how distributed tracing works. So set yourself up for success ahead of time by getting it onto teams' roadmaps if you can. Otherwise, start opening PRs; that's the other option. Even then, you're probably going to need to talk through what it is and why you're adding it, but it's a lot easier when you can show people code and how it actually interacts with their system.

So here's a full checklist for evaluation. We'll cover one last thing before I let y'all go. If you're thinking, "this is so much information, where do I even go next from here?", my advice is: if you have some free time at work, 20% time or a hack week, start by trying to get docker-zipkin up and running, even if you don't plan to use it at all. It includes a test version of Cassandra built in, so you just need to get the Java app itself running, and you don't have to worry about all of these different components right off the bat. If you're just instrumenting Ruby apps and Zipkin is compatible, you can even deploy it onto Heroku. And once you're able to get it deployed and the UI loaded, just instrument one single app, even if the only thing that app does is make a third-party Stripe call. It'll help you turn some of these really abstract concepts into concrete ones.

So that's all I've got today, folks. If you have any questions, I'm heading straight to the Heroku booth after this, in the big expo hall, so stop by. I'll be there for about an hour, you know, come
see me, ask me any questions, or, you know, talk about Heroku or get some stickers. So yeah. [Applause]