Performance Profiling for V8


So we started off today with GPU programming and we're now moving on to compilers. Nice. My name is Franziska Hinkelmann. I'm a software engineer on the Chrome V8 team, I'm also a Node.js core collaborator, and I'll talk about profiling V8.
I think we can all agree that JavaScript is incredibly powerful, not only in what it can do but also in how fast it can do it. It's amazing that with JavaScript, a scripting language, you can run enterprise Node servers, and you can run websites with these huge, complex frameworks, like the Facebook website or LinkedIn or YouTube, and all of that with a scripting language. The performance of JavaScript, or the possibility of running such huge applications with JavaScript, of course comes down to the JavaScript engines: the virtual machines that take your source code, turn it into executable machine code, and make your website dynamic. So in this talk we'll look at V8, the JavaScript engine in Chrome, at what it does to make JavaScript so fast, and at how you can profile these optimizations.
To put V8 into context: V8 is not the only JavaScript engine. The major browsers all have their own JavaScript engine. We have ChakraCore, which is in Microsoft Edge; you might have heard the talk about it yesterday.
We have JavaScriptCore in Safari, we have SpiderMonkey and various other monkeys in Firefox, and we have V8 in Chrome. If you're familiar with Node.js, that also runs on a JavaScript engine: by default Node comes with V8, but you can also build it with ChakraCore. Electron, which we've heard talks about, is essentially Chromium and Node.js, so it has a V8 instance. And then there are other engines that are smaller; you may have seen talks on those. They are very good for IoT devices because they're much, much smaller. They are not as fast as the big ones, but you can fit them on your little microcontrollers. Duktape and JerryScript are examples there.
In order to talk successfully about profiling, in this talk we'll look a little bit at some of the internal key concepts of V8. I'm hoping to give you some insight into this black box: what V8 is doing under the hood, and how you go from source code to really fast machine code. And I'll show you some tools for profiling V8. So not profiling JavaScript on a higher level, which would probably be more useful if you want the first performance gains, but really down at the low level, the compiler level of the JavaScript engine. The tools I'm showing you are, on the one hand, tools that we use internally at V8. If we want to make it faster, if we want to figure out why something is slow and what is slow, those are the tools we use, and then we change V8 internally to make it faster. But of course you can also use these tools to profile your own code and maybe make some changes there to gain performance.
If you've used the Chrome DevTools, you might have seen there is a Profiles tab. You have the console and you can inspect your HTML and CSS, but there's also a Profiles tab where you can record CPU profiles or heap snapshots. You can do that for JavaScript on a website, or for a Node.js server application. You start your app or load your website, you start recording, and after a while you stop it.
And then you get a profile of which functions are being run a lot. I did this here for compiling some TypeScript; you see which functions the time is being spent on. But one thing you see here, unfortunately, is this little exclamation mark, which of course means some kind of trouble. In this case the function isRelatedTo gets a warning from the profiler saying "Not optimized", and the reason is "Optimized too many times". So for the next 22 minutes, I want to dig into what this optimization means: what are we trying to optimize, and why is it not happening, or not possible, in this case?
So let's start with some fundamentals. JavaScript is dynamically typed. It's not statically typed like C++ or what you have in Rust. Dynamically typed means the types of your variables can change all the time, and the compiler can only infer at runtime what kind of object or type it has; it doesn't know that when it just sees the source code.
Here's a very simple example. It's an object defined as an object literal: x is 1 and y is 1. Do you know what the properties of this object are? It's very obvious: x and y clearly are properties. But are they the only properties? No, this object also gets all the properties of Object.prototype, or if you change the prototype, it would get all the properties of that new prototype. OK, so we figured out the properties are x and y and everything from the prototype chain. Can the compiler rely on this information? No, because you can at any time delete properties of an object. So this object can change all the time, and only at runtime can the compiler say: yes, it has an x; yes, it has a y; or not. And of course you can also add properties. So this type information is not available up front. It's dynamic, and the compiler always has to infer it.
Since JavaScript is dynamically typed, what all the modern JavaScript engines do, those that care about performance, those that we use in browsers, not necessarily the ones on the smallest IoT devices, is so-called just-in-time compilation: we compile as we execute the program. In C++ this is very different. If you've ever done any C++, it is clearly ahead-of-time compilation, because you actually do two separate steps: you first compile it, you wait a little bit, and then you get an executable that you can run. There's no such thing in JavaScript; you just run it. You might transpile it first, depending on what you use, but there's no compilation step separate from the execution. For JavaScript, because we don't have any type information and anything can change all the time, we need to do just-in-time compilation.
I'm going to look at this example a lot. It's super simple: it's a function that I called load, and all it does is access a property. You wouldn't even put that in a function usually, because it doesn't do much; it just returns object.x, given a parameter object. And obviously you do property access all the time in JavaScript. Even if you do console.log, console is your object and log is your property. So it's just everywhere; you can't do without property access. And this looks super simple; you have it all the time.
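The slides are not part of this transcript, so the following is only a sketch of the two examples described so far: the object literal whose properties can change at any time, and the load function. The variable names and the value 42 are illustrative assumptions.

```javascript
// Sketch of the slide examples; names and values are illustrative.
function load(object) {
  return object.x;
}

const point = { x: 1, y: 1 };
const before = load(point);              // x is an own property

delete point.x;                          // properties can be removed at any time
const afterDelete = load(point);         // x is gone now

Object.setPrototypeOf(point, { x: 42 }); // the prototype can be swapped as well
const fromPrototype = load(point);       // x is now found on the prototype

point.z = 3;                             // and properties can be added
```

The same call, `load(point)`, gives three different answers here, which is why the engine cannot decide anything about `object.x` from the source code alone.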
It looks very harmless. You know what it does. But if you think about it carefully, what are all the things that could happen here as the compiler tries to get the value of object.x? First of all, if you pass undefined as the parameter, it'll throw a TypeError. The object might not have the property x, so the result would be undefined. Or it might be that the object itself doesn't have an own property x, but somewhere up the prototype chain x was defined, so the compiler would have to walk up the prototype chain recursively until it finds x. The object might be a proxy, so we have to call its get handler. And x might have been defined, ECMAScript 6 style, as an accessor, where you have to call the get function, so you can have any kind of arbitrary side effects. So for the compiler, this very simple object.x is a lot of stuff it has to worry about.
This is a snippet from the ECMAScript specification. Don't worry about reading all of this, but this is how OrdinaryGet, as they call it, is defined, and you can tell this is quite involved for just a tiny little object.x down here. Since we have property accesses everywhere in the program, V8 needs to do a little trick, because we can't afford to do these steps every time you do a console.log or any kind of access. So let me show you what we do.
Here's the simple load function, and we call it with this very simple object. It's just a regular object, no changes on the prototype chain; it has one property x, and it's an integer. When the compiler encounters this call of load, well, we do implement correct ECMAScript, so the compiler has to follow these steps from the specification, which is quite involved, and eventually it figures out: OK, it's a data descriptor, I have to return this value, which is at a certain offset in this object. OK, so we found object.x.
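The cases listed for object.x can each be reproduced in a few lines. This is a sketch with made-up objects, not code from the talk:

```javascript
function load(object) {
  return object.x;
}

// 1. undefined as the parameter throws a TypeError.
let threw = false;
try {
  load(undefined);
} catch (e) {
  threw = e instanceof TypeError;
}

// 2. No property x anywhere: the result is undefined.
const missing = load({ y: 1 });

// 3. x is not an own property but is defined further up the prototype chain.
const inherited = load(Object.create({ x: 'from prototype' }));

// 4. The object is a Proxy: its get handler is called.
const proxied = load(
  new Proxy({}, { get: (target, key) => `trapped ${String(key)}` })
);

// 5. x is an accessor: the getter runs and can have side effects.
let getterCalls = 0;
const viaGetter = load({
  get x() {
    getterCalls++;
    return 5;
  },
});
```

All five calls go through the same `object.x` in the source, which is what makes the generic lookup so involved.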
And now what we do is cache this information. What we are caching is: for these kinds of objects, this is how you get the value. So we're not caching for that specific object and the 5; we're caching for objects that look exactly like this simple literal. In this case we know the value for x is at a constant offset in our object. That's a V8 internal, that we know where it's stored. But we know that for any object that looks like this, so just one property, it's called x, and it's an integer, this is how we get to the value if we need it. And we associate that cache with this property access.
So when we later in the program call load again, we're calling it now with a different object. This one has x as 17 and not 5, but they look very much the same if you think about the shape of these objects. When we get to this property access again, we're being smart now: we check the cache first to see if maybe we have already figured out how to do that. And we realize: yes, we have an entry in the cache for objects that look like this. So let's just read at the offset, because that is where the value is, and we can forget about these 10 steps that we'd otherwise have to do.
The formal name for these caches is inline caches; we abbreviate them to ICs. If you have a look at the V8 source code, there are a lot of ICs everywhere. You have an IC associated with every property access: not with the function or the object, but with every single access. So in this case you have two completely separate inline caches for these two lines, even though it looks the same and it's the same function. What you store in an IC: the keys are shapes of objects, and the values are the fast paths for how to get that value.
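To make the idea concrete, here is a toy model of one inline cache attached to one property access. It is only a sketch: real V8 maps and ICs are internal engine structures, and `Shape`, `makeObject`, and `makeLoadSite` are hypothetical names invented for this illustration.

```javascript
// Toy model of an inline cache; not how V8 is actually implemented.
class Shape {
  constructor(names) {
    this.names = names; // property names, in slot order
  }
  offsetOf(name) {
    return this.names.indexOf(name);
  }
}

// A toy object: a shape plus a flat array of slot values.
function makeObject(shape, values) {
  return { shape, slots: values };
}

// One cache per access site, keyed by shape, storing the slot offset.
function makeLoadSite(name) {
  const cache = new Map();
  let misses = 0;
  function load(object) {
    const cachedOffset = cache.get(object.shape);
    if (cachedOffset !== undefined) {
      return object.slots[cachedOffset]; // fast path: known shape
    }
    misses++;
    // Slow path: stands in for the full lookup from the specification.
    const offset = object.shape.offsetOf(name);
    if (offset === -1) return undefined;
    cache.set(object.shape, offset);
    return object.slots[offset];
  }
  load.misses = () => misses;
  load.cacheSize = () => cache.size;
  return load;
}

const onlyX = new Shape(['x']);
const loadX = makeLoadSite('x');
const first = loadX(makeObject(onlyX, [5]));   // miss, fills the cache
const second = loadX(makeObject(onlyX, [17])); // hit, same shape
```

The second call never runs the slow path, because the cache is keyed by the shape and not by the individual object or its value.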
So in our case, we're storing what the object looks like and how to get to the 5 or the 17 for x. When I say shape of objects: internally we call this the map of an object, and sometimes it's also referred to as a hidden class. Since JavaScript doesn't have classes, or didn't have classes until ES6, and those are very different, we internally give objects such a hidden class. OK, so this is one way V8 makes property accesses faster.
That alone would not yet be enough to run a React or Ember app in any reasonable time. What any modern engine has these days is at least two compilers: a basic one and an optimizing compiler. In V8 we actually have two optimizing compilers at the moment, referred to as Crankshaft and TurboFan, if you've ever come across those. How that works: when you first run your program, the basic compiler compiles your code very, very fast, but into very naive machine code, so it might not be super fast when you first execute it. But after a while V8 notices: this function here is executed all the time, we should really turn it into better machine code. So the optimizing compiler takes over and recompiles such a so-called hot function into better machine code. You have more work because you have to compile again, but you get much, much faster code; when you run it a lot, it's worth it. And if you combine optimization with these inline caches, that's where you get a big, big jump in speed. OK, so I hope you're still with me.
It's early. If I lost you a little bit, now would be a good time to get back. I'm going to show you actual machine code now for a simple function like that. Think about it: you write JavaScript source code with cool frameworks and lots of stuff, and everything is changing all the time, everybody can put modules up. But when you compile it, you get down to machine code that has been basically the same for decades, and I think it's really cool to see what the high-level stuff ends up as in machine code.
So just to recap: the JavaScript engines compile JavaScript code down to machine code, and I want to show you some machine code now, still for this load example. This here is optimized machine code for the simple load function that returns object.x. It might look like a lot, I mean, it's a full page of code, but for low-level machine code this is really not much, and I'm going to show you what it actually does.
Up here it says: call stack check. This is where we enter the function. We just have to make sure we don't have too much recursion and still have space on the stack. The next thing we do is a check non-Smi. Smi is the V8-internal term for a small integer. We distinguish between small integers and objects on the heap; anything that's an object must be on the heap. If it's a Smi, you can't check the hidden class of it or anything, so this is basically just a backup check, making sure we really are dealing with some kind of object internally.
What the function should do now is get object.x and return it. What we do in this optimized piece of machine code is a so-called check maps, and as I said, map is our word for the internal class or the shape of the object.
What this optimized machine code is doing now is comparing two maps: the map of the parameter that you passed into the function, and the map that we saved in the inline cache when the basic compiler ran this before. If those are the same, just like in our example before, then we can do the cheap, fast property access, which in this case is a load named field: just load this field at the specific offset, return it, and be done with it. Now, if we call the load function with an object that looks very different, the map check fails. We don't have the same map as what we had in the cache, and we have to jump, and we end up down here, where it says deoptimization bailout.
So you run your code, you fill your cache, and eventually you decide: let's make really fast machine code. You use the information from the cache to make this machine code; it's hard-coded in there. It says: check the map at exactly this address. And now, when you're running this code and you put in objects that are very different from what you were expecting, something you can't handle with what was in the inline cache, you end up in a deoptimization. That's where the optimizing compiler says: I can't handle this, and you go back to the basic, slower compiler.
As I said, here we make one map check, because we had exactly one entry in our cache, so we only compare to this one map. That's why we distinguish different states of inline caches. If there's exactly one entry, we call them monomorphic. If there's a handful, up to four entries, we call it polymorphic, and anything more is megamorphic. If you can, you want your code to have mostly monomorphic, or maybe polymorphic, caches, but avoid the megamorphic ones.
In the example that I just showed you there was one map check, because the cache was monomorphic. If our cache is polymorphic and we then generate this optimized code, it would look like this, which is exactly the same except that it has four map checks in there: we compare against the four entries that we have in the cache. If it's megamorphic, we can't keep going on with all these branches, and we have to do computationally more expensive things. That's why, if possible, you always want objects of the same shape at a property access. So it's the same game as before: we check, and if the map matches one of the entries in the cache, we can just take the fast path and return; if it doesn't match any of them, we're back to this deoptimization.
And you can actually try all of that. You can just take the Chrome that you have anyway and start it from the command line. Behind chrome, just put this flag, --js-flags=, and then the V8 flags that you want. For this case, put --print-opt-code, print optimized code; you can also add --code-comments, and that gives you exactly this output. So you start Chrome with this behind it, open a website, and your console will fill up with output like this. OK, so we care about what the state of the inline caches is.
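The three cache states can be illustrated with a small toy classifier. This is a sketch under loud assumptions: `shapeKey` and `icState` are hypothetical helpers, the list of property names is only a simplified stand-in for a real V8 map, and the limit of four entries is the one mentioned in the talk. The 'uninitialized' label for an empty cache is added here for completeness.

```javascript
// Toy classification of an inline cache by how many shapes it has seen.
// The ordered list of property names stands in for a V8 map (an assumption).
function shapeKey(object) {
  return Object.keys(object).join(',');
}

function icState(objectsSeen) {
  const shapes = new Set(objectsSeen.map(shapeKey));
  if (shapes.size === 0) return 'uninitialized';
  if (shapes.size === 1) return 'monomorphic';
  if (shapes.size <= 4) return 'polymorphic';
  return 'megamorphic';
}

// Same shape twice: one cache entry.
const mono = icState([{ x: 5 }, { x: 17 }]);

// Three distinct shapes; in this toy model property order counts too.
const poly = icState([{ x: 1 }, { x: 1, y: 2 }, { y: 2, x: 1 }]);

// Five distinct shapes: more than the four entries mentioned in the talk.
const mega = icState([
  { x: 1 },
  { x: 1, a: 0 },
  { x: 1, b: 0 },
  { x: 1, c: 0 },
  { x: 1, d: 0 },
]);
```

In practice this means that creating objects with the same properties in the same order at a given access keeps that access in the cheap monomorphic state.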
If you want to dig a little deeper into what's going on in your application, we have a nice little tool for it: the IC Explorer, where you can explore the states of your inline caches. You generate the output like I just said, by putting the flags behind chrome, and then load the file here, and you can group your inline caches very nicely by different keys. You really do want to use this tool and not just a text editor, because here I was compiling some TypeScript, and it's a million entries, so it is very nice to get this grouped for you. You can sort by different things. Here, for example, I sorted by code location and drilled down into the function that I saw earlier, the one with the exclamation mark, so I can see exactly what is happening at the property accesses in this function. So it's a nice little tool to see what the inline caches are doing, because they affect the speed and the optimizations.
There's also a nice overview if you want to see which functions are actually being optimized and whether any functions are deoptimized. If you use --trace-opt and --trace-deopt, you get this information. Here, if we run an example that calls load, and you run it often enough, you'll eventually see: optimizing the function load. So the optimizing compiler recompiles it and generates machine code similar to what we just saw. Then, if you keep calling it and eventually call it with very different objects, you will see: deoptimizing, evicting entry from the code database. So what's happening here?
In the optimized machine code that I showed you, when the map check didn't work out, we jumped all the way to the bottom, where it said deoptimization bailout. What V8 is doing here is deleting this optimized code, because it figured: well, we optimized it, but it doesn't quite work, so we throw it away and start over with the slow basic compiler for this function. You don't want to see this too much when you profile your app, because it slows it down.
So back to the original problem, where we saw the exclamation mark warning: not optimized, optimized too many times. Now we have a better understanding of what's going on here. We have this function isRelatedTo, and there are things like property accesses in it. We run this function with the basic compiler and put entries into our inline caches. Eventually we optimize it using the information from the inline caches, but then we run it with stuff that doesn't match what's in the cache, and we deoptimize it. We run it again with the slow compiler and eventually say: oh, let's optimize it again. We do this ten times, and eventually the compiler says: there's no point in optimizing this, because generating the optimized code of course costs performance; let's just not optimize this function anymore, I give up. This is where the exclamation mark comes from.
By looking at the CPU profile, looking at the inline caches, looking at the optimized or non-optimized machine code, and looking at the original JavaScript code, it's possible to figure out what kind of changes to make to get rid of this optimization, deoptimization, optimization, deoptimization thing.
So when you make those changes and profile again, you don't get the exclamation mark anymore.
A general warning, as always with optimizations: be very, very careful. Only optimize if you really must, if you have performance problems. And if you do optimize, make sure you really measure and find your bottlenecks. Don't blindly optimize. Don't say: oh, I've heard something about optimizing compilers, I'm going to start changing things because here I can tweak an inline cache or something. You'll just introduce bugs and make your code really hard to maintain. That's the usual optimization warning. But in this case specifically, I would almost advise against optimizing for things like that, because those are V8 internals. They're not agreed upon by the ECMAScript committee, and they change internally: with every commit that we make, with every Canary version you get every night, things might have changed. We might have figured out a better way to do something. So if you did micro-optimizations for something that is true today, it might actually be negatively affecting your performance tomorrow. And obviously this is just V8; the other JavaScript engines work very differently. So if you write your JavaScript code in a specific way, it may only be faster in this one engine, and probably not in the other engines.
So I hope I was able to give you a little bit of insight into this black box of JavaScript engines and what they do to get from source code to really fast machine code. If you want to play with any of this yourself and see what your app is doing in general, or what a little code snippet is doing, you can always just use Chrome, started from the command line with --js-flags=, and some of the flags we looked at were --trace-opt, --trace-deopt, --print-opt-code, and --trace-ic. If you're familiar with Node, you can also start a Node server and put the V8 flags here.
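To try those flags on something small, a script along these lines can be used. It is a sketch: the file name, loop counts, and object shapes are arbitrary assumptions, and whether and when V8 optimizes or deoptimizes load, as well as the exact trace output, depends on the V8 version.

```javascript
// Save as load-demo.js (the name is arbitrary) and run, for example:
//   node --trace-opt --trace-deopt load-demo.js
// In Chrome, the same V8 flags go inside --js-flags="...".
function load(object) {
  return object.x;
}

let sum = 0;

// Phase 1: always the same shape, so the property access stays monomorphic
// and load becomes a candidate for optimization once it is hot.
for (let i = 0; i < 100000; i++) {
  sum += load({ x: 1 });
}

// Phase 2: differently shaped objects. Code that was specialized for the
// first shape may now fail its map check and be deoptimized.
const shapes = [{ x: 1, a: 0 }, { b: 0, x: 1 }, { x: 1, c: 0, d: 0 }];
for (let i = 0; i < 100000; i++) {
  sum += load(shapes[i % shapes.length]);
}

console.log(sum);
```

Comparing the trace output of the two phases gives a feel for what the profiler warning from the beginning of the talk is summarizing.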
Just put them right behind node. And if you are more adventurous, you can compile V8 yourself and get d8, which is the V8 debugging shell, and run that with all the flags. If you use d8, you have the advantage that you don't get the overhead of things that happen in the browser, or things that happen in Node before your function actually starts up. And if you want to explore your inline caches, the IC Explorer is distributed with the V8 source code; there's a link to it. If you have any questions or feedback or performance questions, please find me during the breaks, and feel free to reach out on Twitter or by email. Thank you.