JRuby+Truffle: Why it's important to optimise the tricky parts


Thanks. This talk is about why it's important to optimise the tricky parts, and some of the lessons we've learned from implementing Ruby about where it's important to spend time on optimisations for a dynamic language. I also work for Oracle, so the same safe harbor disclosure applies: this is just a research project, not something ready for people to actually use yet.

So I'm going to be talking about Ruby. Ruby is an imperative language. Ten years ago it would probably have been called a scripting language; that's a little unfair and outmoded these days, when it powers behemoth websites like GitHub and used to power Twitter — it's used for really large-scale enterprise applications, so calling it a scripting language is playing it down a bit. I think the scripting side of it comes from Perl, but people don't really use the Perlisms any more; these days people use it more like a Smalltalk. It's object-oriented: everything is an object, including primitive numbers and things like that, and people write explicit classes rather than prototypes as in JavaScript. Ruby is very much a batteries-included language: you get quite a large standard library. That's in contrast to languages like JavaScript; it's more like Python, where lots and lots of features and libraries are built into the language, usually shipped as part of the implementation, and often implemented in the implementation language rather than in Ruby itself.

This is some example Ruby code, a random snippet from Rails. Rails is the big Ruby web framework that lots of people use — for a lot of people, Ruby *is* Rails, that's how important it is. As you can see it's pretty easy to understand: if you don't know Ruby you can squint and pretend it's Python or JavaScript, and if there are important differences I'll point them out.
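The claim above that everything in Ruby is an object, including primitive numbers, is easy to see directly: even arithmetic operators are method calls.

```ruby
# In Ruby, integers are full objects and operators are methods.
puts 1.class          # Integer (called Fixnum on Rubies before 2.4)
puts 2 + 3            # sugar for 2.+(3)
puts 2.send(:+, 3)    # the explicit method-call form, also 5
puts 255.to_s(16)     # even literals respond to methods: "ff"
```

This is exactly why, later in the talk, "every operation is a method call" matters so much for optimisation.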
Ruby is quite unusual in that there have historically been quite a few good implementations of it. Other languages like Python or Lua generally have one main implementation; Ruby has had a lot over the years. The main one is called MRI, for Matz's Ruby Interpreter — Matz is the creator of Ruby. It's a very simple bytecode interpreter implemented in C; it used to be an AST interpreter, and only in the last six or seven years did it become a bytecode interpreter and get a little bit faster. Almost all of that built-in core library is implemented in C: one C function per Ruby method. There are two main competing alternative implementations of Ruby. There's JRuby: Ruby implemented on top of the JVM, written in Java. It does have a JIT — it compiles by emitting JVM bytecode at runtime, creating classes dynamically and passing them to the JVM. Of all the languages which do that, and quite a lot of JVM languages do, I think JRuby is probably the most sophisticated by quite a long way. When the JVM specification was modified several years ago to add better support for dynamic languages, JRuby was the main use case for doing that, so we can even go so far as to say that the JVM has to some extent been redesigned and specialised to better support JRuby. Over time they've added more and more complex machinery: they now have their own IR, a conventional data-flow and control-flow graph, built before they emit bytecode, and they run some optimisations on it first — it really is quite sophisticated. The only other runtime that emits JVM bytecode at runtime in a more sophisticated way is Nashorn, which was done by Oracle. In JRuby the core library is mostly implemented in Java: when they started, they literally translated the runtime from C into Java and just started using that. The other main alternative implementation of Ruby is called Rubinius.
Rubinius is based on the blue-book implementation of Smalltalk that was shown on Tuesday, and it JITs by emitting LLVM IR. LLVM is a compiler framework available as a library: you emit its fairly low-level IR and it gives you back machine code. The Rubinius VM itself is written in C++, but what's really interesting for us is that the core library — the batteries-included stuff — is, to some extent, actually implemented in Ruby in Rubinius. Both JRuby and Rubinius are used in production; they're not just toy implementations on the sidelines, they're fairly heavily used. The implementation of Ruby I'll talk about today is a combination of JRuby and Rubinius, called JRuby+Truffle: it uses the Truffle framework, which was introduced earlier and which I'll talk more about, and it uses the Graal compiler — hence the name. It's an interesting story about language implementation, and about the difficulty of writing VMs, that both JRuby and Rubinius had to implement almost everything from scratch. They both wrote their own lexer; they both wrote their own parser. Rubinius has its own bytecode built from scratch; it has a JIT via LLVM, but it has its own primitives and its own core library. JRuby had to implement its own intermediate representation, its own bytecode generator, and its own core library. This is why writing VMs is difficult and takes so long: you have to do all these steps pretty much from scratch to build anything with any kind of reasonable performance. JRuby+Truffle combines elements of both: we took the lexer and the parser from JRuby, along with lots of other support routines; we wrote our own AST using the Truffle framework; and we wrote our own core library, reusing some of the core library from Rubinius — so it's a mixture of the two. JRuby+Truffle isn't a toy implementation of Ruby. It's still not quite ready for general use yet, but it isn't just a research prototype either: we actually support 100% of the language
specification, as measured by the test suite that has been produced for Ruby — and this is more than JRuby and Rubinius have achieved. We support literally everything that anyone has thought useful to write a test for. For the core library we run at about 90%, and that gives you an idea of the scale of the core library: we've been working on it for three years, we've reused some of the core library from Rubinius, and we're still at 90% — it's huge. I don't want to talk too much about benchmarks, but just to give you an idea of scale, this chart shows several implementations of Ruby. Blue is a recent version of MRI; next is JRuby using the advanced invokedynamic technology — taller is faster, and it goes up maybe three to seven times. Grey is Rubinius, using LLVM. Green is Topaz, which uses a technique called meta-tracing via RPython, which I think will be talked about tomorrow. And we're here in purple; the little ticks are error bars. This is for a set of five or six of the synthetic shootout benchmarks — benchmarks most of us probably know pretty well — and we do look like we do really well there. JRuby and Rubinius manage a pretty modest improvement over MRI; Topaz does really well; we're above that. But these are the shootout benchmarks, and often people will say: you do fantastically on the shootout benchmarks, that's great, but what about real applications? Well, one really interesting research result from this project is that we actually do much better on real code. This is a set of about forty kernels — the kernels of applications, the tight loops inside them — from two Ruby gems, chunky_png and psd.rb, which are for processing PNG and Photoshop files. These are compute-intensive benchmarks, that's true, but they are also things that people are using to make money in business today — this code is deployed and in use. And when we run these kernels from real gems, rather than synthetic benchmarks, we're way, way ahead of
everyone else, at around 65–70 times faster. What this talk is about is explaining why that is: why are we able to do that, what are we doing differently — we've obviously broken through some kind of barrier to achieve this. One of the questions I often get asked by people in the Ruby community is: why aren't you using more of JRuby? They've done so much work to implement the Ruby core library in Java, and we're not reusing most of it. I'm going to use that question as the backbone for describing the optimisations we've done, and the thesis is this: the core library of Ruby is what's actually important to optimise — not the basic language constructs — and you need to optimise it at a very deep level, much deeper than anyone else has done, to achieve that kind of performance. So let's talk about what makes Ruby difficult to optimise. It has the same basic challenges as JavaScript: it comes in source code form, it's a dynamic language, and it has no explicit types. But it's more than that — it's about how people want to write Ruby. I'm going to philosophise for a second. A lot of people who work on language implementations spend a lot of time disparaging the languages and the people who program in them. I understand that, because working with these languages is often very frustrating, but the philosophy I'm trying to take is: whatever code people want to write, I'll try to get it running as fast as possible. So what I'm going to show here is not a criticism of anyone's programming style or programming ability — it's an empirical observation of how people choose to write Ruby. This is an implementation of a clamp routine, from the Photoshop processing library; it clamps a number between a minimum and a maximum. The way they've chosen to implement it is: they create an array literal with the three values, they sort it, which produces a new sorted array, and then they take the middle element, index one,
index one being the middle one. That's a valid implementation of clamp — though it may not be how you would choose to implement it if you were writing in a low-level language and trying to achieve performance. This routine is used for clamping pixel values, so it's called many times for each pixel in potentially very large Photoshop and PNG files. It needs to be high-performance, and that's how they chose to write it. We could criticise them for that, but the point is that they've used high-level operations — core library routines — rather than low-level constructs. If you look at the code, there are no low-level language constructs at all: there's a call to the Array constructor, there's a call to sort, and then an array indexing. And almost all operations in Ruby, including indexing, are method calls. So we have method call, method call, method call, and there's very little scope for applying conventional compiler optimisations to that code as written.

But it gets worse. This is a modification to the Range class — a range being everything between a low value and a high value — which adds an extra method called include? (question marks are legal parts of method names in Ruby) to test whether a value is included. Now, ranges in Ruby can have an exclusive end or an inclusive end, so they check that here, and based on that they choose either the less-than operator or the less-than-or-equals operator. The thing you see in the code is simply an interned string — a symbol — and then they call send, using metaprogramming, to dispatch on that operator. So this seems like an innocent, in-a-loop sort of operation that should be high-performance, and yet it's using metaprogramming, driven by control flow, to choose a method and call it.

Another example: this is a method added to the Object class, blank?, which tells you whether something is empty, or nil, or false, or somesuch. They want to call an empty? method, which exists on some classes but not others.
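The clamp idiom described above can be reconstructed in a few lines. This is a hypothetical sketch of the pattern, not the exact chunky_png/psd.rb source:

```ruby
# Hypothetical reconstruction of the clamp-via-sort idiom described above.
def clamp(value, min, max)
  # Build a three-element array, sort it, take the middle element.
  # Every step is a method call: Array literal, #sort, #[].
  [min, value, max].sort[1]
end

clamp(300, 0, 255)  # => 255
clamp(-20, 0, 255)  # => 0
clamp(128, 0, 255)  # => 128
```

Three method calls per pixel, on every pixel of a large image — which is why this "tricky part" is exactly what the implementation needs to optimise.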
So they first check whether it responds to it — again we're using metaprogramming, or reflection, and again blank? is something you can imagine being used innocently within an inner loop, expected to be high-performance, and potentially not being. This next one is again part of the inner loop of that graphics processing: it's hard_mix, a Photoshop filter blending a foreground colour and a background colour. What's interesting here is that they call methods r, g and b to extract those values from the packed pixel — but there are no methods r, g and b. What they've done is define a fallback, method_missing, which says: if this other module has the method, go and call it there. So rather than using a module in the normal way, to import those methods, they've decided to do this. Again, that's probably not ideal programming style, but this is what the Ruby code out there that we'd like to run is like, so it's what we want to accommodate. And this is a really common pattern: this is Duration, a facade class that wraps a time value and provides some extra methods — but if you call a method which isn't defined directly on the class, it passes it straight through to the wrapped value. So you think you've got a class you're using to do timing stuff, you expect it to be high-performance, and this is how it's actually implemented. It gets more extreme. This next code is for decoding the grayscale entry from a palette for a particular bit depth, because how the entry is packed depends on the bit depth. What they do is use send again, to call a resampling method — but there are different methods depending on the bit depth, so the syntax here is a string which interpolates the bit_depth variable into the method name. They're calling a different version of the method based on the bit depth, and passing the result to send: at runtime we're creating a string and then making a dynamic call through it.
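The facade pattern just described can be sketched like this. This is a minimal stand-in for the real Duration class, with illustrative names, not its actual source:

```ruby
# Minimal sketch of the method_missing facade pattern described above.
class Wrapper
  def initialize(value)
    @value = value
  end

  # Any method not defined here is forwarded to the wrapped value,
  # so every such call goes through a runtime method lookup.
  def method_missing(name, *args, &block)
    @value.send(name, *args, &block)
  end

  def respond_to_missing?(name, include_private = false)
    @value.respond_to?(name, include_private) || super
  end
end

w = Wrapper.new(10)
w + 5     # => 15, via method_missing(:+, 5)
w.zero?   # => false, forwarded to Integer#zero?
```

Each forwarded call pays for a method_missing dispatch plus a send lookup — cheap-looking code with an expensive fast path.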
eval is very common in Ruby too. This is delegate, a helper which tries to speed up that pass-through pattern: it defines a real forwarding method, once for each method on the wrapped value, and it defines it using eval. Templating is another really common thing to do in Ruby — generating an HTML page from a template and some values — and that also uses eval: the template is compiled to Ruby code and then eval'd with particular runtime values. People talk about eval as something that should be avoided; the reality of a Ruby web application that processes an image and returns an HTML page is that it's calling eval for every single request, and calling method_missing several times for every pixel in the image it's processing, and so on.

So why can't a conventional VM optimise this? Why can't we make JRuby as fast as we want — and we know from those earlier graphs that the speedup it achieves is quite limited? There are several key problems, and they're all to do with the implementation of the core library inside JRuby. The first problem is that JRuby's core library is megamorphic: it handles every case, every time. This is the implementation of add. Remember I said that every operation is a method call in Ruby: when you use +, you make a method call, and you can add a Fixnum to another Fixnum, or to a Bignum, or a Float, or something else entirely — and JRuby has all of those cases written out in one method. You might think speculative optimisations would help here, the sort of things that Jill talked about, but this code is already behind an invokedynamic call, already several method calls deep from where the original call site was.
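The eval-based delegation described above can be sketched as follows. This is a minimal stand-in for the real delegate helper, with illustrative names:

```ruby
# Sketch of eval-based delegation: define one real forwarding method per
# delegated name, so later calls are plain method calls rather than
# method_missing hits. Class and method names are illustrative.
class TimeFacade
  def initialize(value)
    @value = value
  end

  def self.delegate(*names)
    names.each do |name|
      # class_eval compiles a real method definition at runtime.
      class_eval <<-RUBY
        def #{name}(*args, &block)
          @value.#{name}(*args, &block)
        end
      RUBY
    end
  end

  delegate :abs, :zero?, :to_s
end

TimeFacade.new(-42).abs  # => 42
TimeFacade.new(0).zero?  # => true
```

The string of source code only exists at runtime, so the implementation has to be able to compile and optimise code that appears long after the program started.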
You might hope that HotSpot could replace the cases that are actually being used with guards, but in practice it does not. The second problem is that JRuby's core library is stateless. A lot of people at this meetup have talked about the importance of caching, of storing profiling state, of remembering what you've seen before — but JRuby's core library isn't able to do that, because it implements every Ruby method as one Java method. When you come to the implementation of something like send — remember, send calls a method based on a runtime name — they can look the method up, but there's nowhere to store any information: they're inside a method activation, and there's no per-call-site object to act as a cache of what's been seen before. So their implementation of send looks the method up from scratch, every single time. You could have a global method cache, but that sort of thing gets overwhelmed so quickly that it becomes next to useless; the point is there's nowhere to store the information you want for next time. Compounding this is the third problem: JRuby's core library is very deep. These core library routines don't just do one little thing — they may call back into your Ruby code, they may have to convert types, and so on; it just keeps going down the rabbit hole. Look at the example of sort — remember, sort was being used inside that tight clamp method we wanted to optimise. JRuby's sort calls an internal sort, the internal sort calls a quicksort, and it constructs and passes in a comparator object which calls back into Ruby. We said there was nowhere to store something like a cache even when we were in an ordinary JRuby method; it's even worse here. We're in one Java method, which has called another, which has allocated a comparator object, and there's certainly nowhere to store state by the time we get there — and we're even further away from the original method call. If we expected HotSpot to inline all of this effectively, we'd be expecting it to inline through many, many methods.
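The send problem described above is visible from plain Ruby: the method name only exists as a runtime value, so a stateless implementation has to repeat the full lookup on every call.

```ruby
# #send receives the method name as a runtime value; without a per-site
# cache, each call repeats the whole method lookup from scratch.
ops = [:upcase, :reverse, :length]
results = ops.map { |op| "graal".send(op) }
# => ["GRAAL", "laarg", 5]
```

What you want instead is a per-call-site cache keyed on the name seen last time — which is exactly what a single stateless Java method has nowhere to store.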
And that's just to get this far — let alone the complexity of the actual sort routine when it gets there. Even if all the inputs to that quicksort were constant, there's no way HotSpot is going to partially evaluate it, constant-fold it and remove all that overhead. It's just not going to happen. Combined with that, the fourth problem is that JRuby's core library isn't amenable to optimisation: the sort routine just includes too much stuff — too much that defeats escape analysis, too much that's generic and megamorphic. Now, I said that Rubinius implemented much of the core library in Ruby, and that seems like a really good solution to the problem. Many VMs implement functionality in the guest language itself: Rubinius has most of the core library implemented in Ruby, JavaScript implementations have started to do this, and in Java, of course, most of the JDK is implemented in Java. This is part of Rubinius's sort routine, and it looks good: it's normal Ruby code, so if we have normal inline caches we can use them here, we can cache calls, and so on. But the problem is that an implementation in Ruby has to bottom out somewhere — unless you're going to implement your integers using Church numerals, at some point you reach some kind of genuine primitive — and unfortunately that happens very, very soon, and it becomes a barrier. This is the implementation of the Fixnum comparison being made in that sort: it isn't written in Ruby, it's a C++ method. The Rubinius JIT has no knowledge of these C++ methods; all it knows is a code address it can jump to and call. Even though Rubinius is compiled with LLVM, that doesn't matter, because it's compiled statically: when you're generating code at runtime, you can't compile through code that was already compiled ahead of time. Perhaps in the future someone could ship LLVM IR alongside the VM and inline through it, but that's not something that's done at the moment, so we'll set that aside.
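The layering just described — core library methods written in Ruby that bottom out in opaque primitives — can be illustrated with a toy sort written in plain Ruby (this is not Rubinius's actual routine):

```ruby
# Toy illustration: a sort written in plain Ruby, Rubinius-style,
# ultimately bottoms out in the primitive <=> comparison, which in
# Rubinius is a C++ method the JIT cannot see through.
def insertion_sort(array)
  result = array.dup
  (1...result.size).each do |i|
    j = i
    # (result[j - 1] <=> result[j]) is where the Ruby code reaches
    # the integer-comparison primitive.
    while j > 0 && (result[j - 1] <=> result[j]) > 0
      result[j - 1], result[j] = result[j], result[j - 1]
      j -= 1
    end
  end
  result
end

insertion_sort([3, 1, 2])  # => [1, 2, 3]
```

Everything above the `<=>` is visible to the JIT; everything below it is a black box — and the boundary sits right in the hottest part of the loop.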
So this becomes another barrier to doing any kind of optimisation, and for something really simple: this isn't a complex operation, it's just comparing two integers, and yet once the leaf operations are primitives you very quickly find you still can't do anything. Now, this problem is handled in other systems. Here's arraycopy — you may have used System.arraycopy in Java. If you've seen the native modifier before: it means this is a primitive called from the runtime, much in the same way as the Rubinius primitives. The difference in Java is that there are comparatively very few of these, which means compiler knowledge of each one can be implemented relatively easily — the Graal compiler can have a special node for arraycopy that it can optimise and see through. We can't do that for every single Ruby method.

OK, let's take an interlude and talk about Truffle and Graal — from a language implementer's perspective rather than a Graal engineer's perspective. So we have HotSpot: a fantastic piece of technology, very powerful, and people have written many successful languages on top of it. But as a language implementer — somebody implementing something like JRuby — it is just a black box, and all you can do is feed it things. When JRuby emits bytecode, all it does is pour the bytecode in at the top, and there's very little else it can do. It knows that somewhere in there is a JIT compiler, and people like Charles Nutter who work on JRuby are very knowledgeable about how that process works, but they still can't do anything about it, because they've got no levers, no lights showing them what's going on. The path the bytecode has to take to reach the JIT is very tortuous, and there's no guarantee that the code they're pouring in will get compiled at all — it could just be ignored. So when you're using the JVM to implement guest languages, you
have this path where you take a piece of code, generate an AST, emit JVM bytecode, and get machine code out, with the idea that HotSpot takes care of that last step — you only need to do the first two stages. But making sure it actually happens, and making sure the output is good, has proved very difficult. What we're doing in the Graal project, in the VM research group at Oracle Labs, is this: we have reimplemented the JIT compiler that normally sits inside HotSpot as a Java application — a normal Java library. And that means you can use it as a normal Java library: instead of just pouring things into the top of it, you can tune things finely, you can get things out of it, you can deoptimise, you can control what's going on. That's hard to do and requires specialist knowledge, which is where Truffle comes in: Truffle is the layer on top, which talks to Graal back and forth on your behalf and lets you implement things in an easy way. So let's talk about that easy way. The model in Truffle is that you implement an AST interpreter — roughly what an undergraduate would write as a first interpreter. The novel idea is that it's a self-specialising AST interpreter. When you first load a program, you emit each node in your AST as an uninitialised node — one that could handle anything, or perhaps I should say can handle nothing — and the idea is that as the program runs, you replace nodes with specialised nodes that do exactly what's needed. The easiest example is types: if you're an add node and you see two integers, you rewrite yourself into an integer-add node. We have a lattice of types: uninitialised at the top, then string, integer, double and so on — an integer-add might move to a double-add — but everything moves down towards a bottom point, so nodes converge on something stable eventually. And if your program really is megamorphic, then we
have a megamorphic node that can handle it. The lattice is a DAG, so rewriting always moves downwards. Let's see what this looks like in practice — I don't want to give you a full Truffle tutorial, just enough to understand what this looks like in Java. What we have here is a class which is a SimpleLanguage add node — SimpleLanguage is our demonstrator language — and you write a method for each combination of types you might expect: this one can add two longs, or two BigIntegers, or two strings. For strings the parameters are just Object, and we write a guard that asks: is it a string? The specialisations work almost like multimethods, if you've heard that term — multiple dispatch: which specialisation is called depends on the runtime types. But it's not just about types. We often emphasise types in Truffle, but really you can specialise on anything you like. This next one is quite complicated — don't worry about understanding it fully, I just want to show how far you can go. This is the eval node from SimpleLanguage, and there's a specialisation here which caches the code it has seen previously, along with the compiled version of it: it takes a MIME type and a piece of code, and it caches the MIME type, the code, and the result of parsing it into an AST. The next time through, it can simply call that directly, with guards to check that nothing has changed. So it's not just types: you can cache, you can specialise on anything you like — you could specialise on the time of day if you wanted to. Now, this sounds a bit slow — these AST interpreters are all Java methods making Java method calls. So what we do is compile the AST, using partial evaluation, into a single piece of efficient machine code: we literally take the bodies of all the methods involved in your AST, inline them into one, and apply lots of optimisations — partial escape analysis, that sort of thing — which produces a single piece of machine code.
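The self-specialising node idea can be sketched in a few lines — here in Ruby for familiarity, though real Truffle nodes are Java classes, and all the names here are illustrative:

```ruby
# Toy self-specialising add node: starts uninitialised, rewrites itself
# to an integer specialisation on first use, and falls back to a generic
# node if that speculation later fails (the "move down the lattice" step).
class AddNode
  def initialize
    @state = :uninitialized
  end

  def execute(a, b)
    case @state
    when :uninitialized
      # First execution: specialise on the observed types.
      @state = (a.is_a?(Integer) && b.is_a?(Integer)) ? :integer : :generic
      execute(a, b)
    when :integer
      if a.is_a?(Integer) && b.is_a?(Integer)
        a + b
      else
        @state = :generic   # speculation failed: rewrite, then retry
        execute(a, b)
      end
    when :generic
      a + b
    end
  end
end

node = AddNode.new
node.execute(1, 2)      # specialises to :integer
node.execute(1.5, 2.5)  # rewrites to :generic
```

Because the states only ever move downwards (uninitialised → integer → generic), the rewriting terminates — the same convergence argument as the lattice in the talk.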
Personally, I think partial evaluation is most easily explained as just inlining plus constant folding — I'm sure there's a more technical distinction, but that's how I explain it to people. If you want to know more about partial evaluation, there's a great talk by Tom Stuart called "Compilers for Free"; it's aimed at a non-technical Ruby audience, but it's great for learning the basics. So we have this pipeline: the AST becomes specialised, and we emit compiled code. But we talked earlier about speculative optimisations, and we've made lots of speculative assumptions here — notably that the specialisations aren't going to change. So we can also reverse the process: we can go from the compiled code back to the real, mutable AST, change our nodes, re-specialise, and compile again. Perhaps the values are doubles now rather than integers: we re-specialise and compile back to machine code. Obviously we have mechanisms to stop this cycling eternally — the fact that the specialisations form a lattice means we always move downwards, hopefully to some convergence point. So that's the second half of the pipeline: we can deoptimise, keep updating, and recompile. You get all the benefits of producing really efficient machine code, but the only part you have to write is the bit that updates your AST — the deoptimisation and recompilation are done for you, more or less transparently. Truffle does other things for you as well. Your ASTs form a kind of forest: there are lots of them, for different methods, all calling each other, and Truffle can automatically do things like inlining a particularly frequently executed call. Inlining is really simple to understand at the AST level: you take the callee's AST, copy and paste it, and instead of a call you have a normal edge to it — and that becomes an inlined function. Now, we talked about specialising, and one of the
problems with specialising can be that things become megamorphic — go generic — very quickly. When I talked about JRuby's Java core library not being able to hold any kind of state or cache, the risk, even if it could, would be that the cache goes generic super quickly: send would see ten different methods in the first millisecond of running and would simply say, I can't cache this any more. We have the same problem here: a method might specialise beautifully, but if it's bombarded with integers and doubles and BigIntegers, it's going to become megamorphic. Truffle can automatically fix this problem for you: it can split — generate multiple copies of the same method and let them specialise independently. Truffle keeps track of how much it thinks trees cost, so if it sees lots of different specialisations accumulating in one method, it can automatically say: I'll split this for you, and keep splitting, so that each copy becomes nice and tightly — statically, if you like — typed. And inlining still applies, so we can inline the one specialised copy at the site where it's called. This gets around the fact that in Ruby everything is a method call: we don't need special cases to work around it, we simply have lots of copies of those method calls, and we let them specialise and be inlined — it's just like having a primitive there. And of course we can deoptimise if any of that turns out to be incorrect. So: the model on the JVM for guest languages is normally that you have some source code, you emit an AST, you emit bytecode, and then machine code is made for you.
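The splitting problem described above is easy to provoke in plain Ruby: one shared method body sees many types, while split copies would each stay monomorphic. This is a toy illustration of the motivation, not Truffle API:

```ruby
# One shared copy of `double` sees Integer, Float and String arguments
# and would go megamorphic; with splitting, each call site gets its own
# copy of the method that specialises for just the types it sees there.
def double(x)
  x + x
end

double(21)     # Integer use at one call site
double(1.5)    # Float use at another
double("ab")   # String use at a third
```

Each call site is actually monomorphic on its own; it's only the shared method body that mixes the types — which is exactly what per-site copies fix.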
Well, with Truffle and Graal, both of those last steps are done for you, and all you need to do is write an AST interpreter. I'm overselling it a little here — in reality you have to do quite a lot of tuning to make a really high-performance interpreter — but I think it's honest to say that you can get reasonably fast very, very easily, and then extremely fast with a bit more specialist knowledge: JRuby+Truffle is the fastest implementation of Ruby by far, with reasonable effort — a few students working for a few years. Truffle and Graal are research projects, so you probably want to know whether they're available for real use. One of the interesting things about the way Graal is developing is the stack we end up with: HotSpot at the bottom, which is of course written in C++, and everything above it implemented in Java. Graal is implemented in Java, with just a small place where it needs to interface with HotSpot — and the interface is really simple, not much more than "here's some metadata, give me back a byte array of machine code to install". On top of Graal we have Truffle, and implemented on top of Truffle there's a JavaScript implementation, there's an R implementation, there's the Ruby implementation I work on, and there are others. What should happen is that in Java 9 the little interface Graal needs will ship as part of normal Java: it's been merged in and is part of the early access builds, though it's not functional yet — I think it should be functional in the next early access build. And then you get this whole new environment — how fantastic is this: a new JIT compiler for standard HotSpot, obtained potentially via Maven, or shipped bundled with your application if you wanted. And of course you could modify it, or use a different compiler behind the same interface, and if other people
implementing JVMs wanted to, they could even adopt this interface themselves. Again, this means it should be reasonable to get people to start using this stuff with JRuby at some point in the future. As I said, the current JDK 9 early access builds aren't working for this, and you'd have to get everything else separately, so what you can also do is search for "OTN Graal" — in Google or any other search engine — and you'll find a page where you can download a package we call the GraalVM. It includes the JDK, Graal, Truffle and all the languages, and you get an executable — a Ruby executable or a Java executable — that you can just run.

So let's talk about how Truffle solves the problems of optimising Ruby. We said the first problem was that JRuby's core library is megamorphic, and the solution Truffle offers — self-specialising ASTs — helps solve this: instead of having one version of each core library method, we can have many different versions that handle different types. So for add we have a version that handles strings, one for doubles, one for integers, one for generic values, and we choose the most appropriate one. And we don't just get one: it's like a polymorphic inline cache. If a call site sees two different specialisations being used, it can keep both, with a guard to choose between them; it could keep three, or it could just decide to go to the generic version — and deciding to go generic might provoke Truffle to split and generate multiple copies. Here's a bit more of how this looks in practice, zooming in on the kind of thing you saw in the SimpleLanguage example: instead of one method that handles anything, we have multiple methods, each of which handles a particular set of types. There's more logic that I'm not showing here, which does things such as
Here's a bit more of how this looks in practice, zooming in on the kind of thing you saw in the SimpleLanguage talk. Instead of having one method that handles anything, we now have multiple methods, each of which handles a particular set of types. There's more logic that I'm not showing here, which does things such as promoting an integer to a long in order to meet a specialization. We've also got annotations that let us handle exceptional cases, such as when we have to overflow to a long. Ruby has arbitrary-precision integer semantics: if you add two integers and they would overflow, it gives you a long, or eventually a big integer, instead. We can model that in Java using Math.addExact, and we can tell Truffle that if you get this exception, stop here and try the next specialization instead. So we can dictate what the lattice looks like; we can say, after you've tried this one, if something went wrong, try the next one. If add fails because we get the overflow exception, we'll try the addWithOverflow specialization instead, which turns the operands into big integers and adds them together. You can think of this like multiple dispatch, or like pattern matching: multiple versions of each method, chosen by type. This helps compilation, because each of these methods is much smaller and much simpler. It isn't reasonable to expect HotSpot to compile that big megamorphic method JRuby had, but for these little methods it is reasonable to expect Graal to optimise a simple little call. As we said, if the types aren't what was expected, there'll be a deoptimisation, as has already been described. The method names themselves are meaningless; they're just there for readability. What actually happens is that a Truffle node is a Java class, and with what we call the DSL, these @Specialization annotations, we use code generation at compile time to generate a different class for each one. Truffle nodes are just Java objects that you can connect together to build something. The second problem we talked about was that JRuby's core library is stateless.
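The overflow lattice described above can be sketched in plain Java without the Truffle DSL. The try/catch stands in for the @Specialization rewrite from the fast long path to the big-integer fallback; FixnumAdd and the method names are invented for illustration:

```java
import java.math.BigInteger;

public class FixnumAdd {
    // Specialization 1: both operands are longs and the sum does not
    // overflow. Mirrors Ruby, where Integer addition promotes on overflow.
    static Object add(long a, long b) {
        try {
            return Math.addExact(a, b);      // fast path
        } catch (ArithmeticException overflow) {
            return addWithOverflow(a, b);    // "try the next specialization"
        }
    }

    // Specialization 2: arbitrary precision, like Ruby's big integers.
    static BigInteger addWithOverflow(long a, long b) {
        return BigInteger.valueOf(a).add(BigInteger.valueOf(b));
    }

    public static void main(String[] args) {
        System.out.println(add(2, 3));              // 5
        System.out.println(add(Long.MAX_VALUE, 1)); // 9223372036854775808
    }
}
```

In the real DSL the exception is declared with `rewriteOn`, so the node permanently rewrites itself rather than catching on every call.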
When you're inside a core library implementation in Java, you've got no way to store anything, and multiple people have talked here about the importance of caching and profiling. That really is the biggest weapon we've got against these dynamic languages, and the conventional implementation technique provides nowhere to do it. What we can do here is use the fact that each of these Truffle nodes is an instantiated object on the heap, which means that as well as having a method that lets you execute some action, you can store things in it. So in our implementation of send in Truffle we have a specialization that takes the object, the name of the method and so on; there's complexity in here you don't need to worry about, but one thing we can store is a call site, which we call a DispatchHeadNode, our complicated name for a call site in JRuby+Truffle. The @Child annotation simply says that it's a child node of this node, so I've got one node that does my send, and then another node hanging off it which is my inline cache for making method calls. That means I've got an inline cache I can use when I make this call, and that cache has the method name in it. So I can turn a send into a normal method call via an inline cache.

You may be wondering: doesn't that mean the cache will become generic very quickly? We've got one copy of send in the core library, implemented using Truffle, and multiple callers are calling it with send :bar or send :foo, and in a big Ruby application the number of sends going on means there'll be thousands of these. So it looks like that inline cache will simply become generic very quickly; it's still a limited inline cache. But remember, Truffle can detect the cost of a node, and it will see that the node is becoming more and more polymorphic, and then it will split it, so you get one copy of send for each call site that's using it. This copy is specialised just for calling bar; this one just for calling foo. And if the name bar is a literal earlier in the method, then it's going to be a literal all the way down, and the only thing left to do is guard: is this literal what I expected? It is a literal, so it's always going to be what I expected, the guards fold away, and there's zero overhead for using dynamic send. Most of this happens automatically; I'm not having to split the nodes myself, Truffle is doing that profiling work for me.

We can use this for other things too; it's not just about types and it's not just about inline caching. This is our integer array builder node. It's a node that helps you build up arrays, and it shows that nodes don't always correspond to simple things in the language: it's not just an add node and an assign node, we have lots of nodes to help us build things up. You can push things onto this node as it builds an array, and one thing it stores as a field is the expected length of the array. That means that when we allocate the array to start with, we can give it that expected length, and when we finish creating the array, if it's larger than expected, we can say: next time, create it with this extra space to start with, because we're going to use it. So we can cache the expected size of new arrays. When you start to implement a core library using this technique, there's a whole wealth of things you can cache, and you can use some really advanced techniques. For example, V8 uses a technique called mementos, which helps you profile how long an object lives; if it looks like it lives a long time, you can allocate it in the old generation to start with. We can do something similar here, and we did in the past: we stored a link back to the allocation site, which is a node, and then an object can simply tell the node "I'm hanging around for a long time, you might as well allocate me differently". So this works for any kind of state you want.
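The DispatchHeadNode idea, a call-site object with a field caching the last method name it dispatched, can be sketched in plain Java. All names here are invented; a real Truffle dispatch cache is a chain of nodes rather than a single field:

```java
import java.util.function.IntUnaryOperator;

// Invented sketch of a call-site object: it caches the last method name it
// saw, guarded by an equality check, like a one-entry inline cache.
public class DispatchNode {
    private String cachedName;              // inline cache: last name seen
    private IntUnaryOperator cachedTarget;  // the method it resolved to
    int cacheMisses;                        // how often the guard failed

    // "send": dynamic dispatch by name, like Ruby's obj.send(name, arg).
    int send(String name, int arg) {
        if (name.equals(cachedName)) {
            return cachedTarget.applyAsInt(arg); // fast path, guard passed
        }
        cacheMisses++;
        cachedName = name;                       // re-specialise the cache
        cachedTarget = lookup(name);
        return cachedTarget.applyAsInt(arg);
    }

    private static IntUnaryOperator lookup(String name) {
        switch (name) {
            case "succ": return x -> x + 1;
            case "pred": return x -> x - 1;
            default: throw new IllegalArgumentException(name);
        }
    }
}
```

With splitting, each call site gets its own copy of this node, so cachedName is effectively constant there and the guard can fold away entirely.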
It's an open field; you can store whatever state you like in there. The third problem was that JRuby's core library is very deep: you'd be a very long way down into dense Java code before you got anywhere. Truffle solves this because by default it will compile through everything that is a node, and inline any call to another Truffle method. So in this case I can guarantee we're going to compile from this, into this, down into this. The model in Truffle is that by default you have to tell it to stop inlining; it keeps inlining as far as it can go. So you can guarantee that the contents of your nodes will get inlined, and if you find that's too much, or the partial evaluator isn't able to do it because some operation is unbounded, then you can tell it to stop.

The fourth problem was that JRuby's core library isn't amenable to optimisation. We can have an implementation of something like sort, but it's not reasonable to expect HotSpot to compile sort as it is. This is the Truffle version of sort; I'll zoom in on this part. There's a sort node that has a specialization for sorting short arrays. I've used a Truffle annotation here, @ExplodeLoop, which says: if there's a loop in here, I want you to completely unroll it, to explode it. There are a few other little annotations like that, so you can contribute to the optimisation yourself using annotations and methods. What I've got here is this: I know that in Ruby we often have very small arrays, like in that clamp example, which was [a, b, c].sort, so we can implement a specialization that handles those short arrays specially. I wrote a selection sort, which wouldn't normally be a great sorting algorithm, and I made the loop bound a constant, the maximum size for a small array, and then inside the body I check against the actual size of the array to keep it correct. I use my call node here to make a Truffle call, and I can rely on Truffle exploding this into something that's not a loop, just the body copied out many times. That becomes quite a lot of code, it's true, but it's all simple code; we get rid of all the loops, and you simply have a get from an array, a call to a compare, and a set if we need to swap. That means Graal can optimise through all of this, and I'll show you the impact of it in a minute.

Let's talk through a simple example first; we'll get crazy in a minute. This is an implementation of min, a bit like that clamp but reduced a little. We can compute the minimum of a and b by doing [a, b].sort and taking the first element, and the calling code asks for the minimum of 2 and 8. Let's walk through what Truffle and Graal do with this. The first thing is to inline min, and Truffle will do that automatically, because it sees that the AST involved in the implementation of min is very simple, so it inlines it without any extra hinting. We now have [2, 8].sort and take the first element. When you inline that sort routine and explode the loop, the residual code is this: we compare 2 and 8; t1 tests whether t0 is less than zero, because a compare returns minus one, zero or plus one; then depending on that we take the 2 or the 8, put them into the array in sorted order, and then index it. Keep in mind we only do this for small arrays, so it's not unbounded; we're doing it for arrays we know are small. Now, we're taking the first element of the array, which is t1, so we don't need the t2, and no one's actually using the t3, so we can get rid of those and just refer directly to t1.
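The bounded selection sort can be sketched in plain Java: the loop bounds are a compile-time constant, so an unrolling compiler can explode them into straight-line code, while the dynamic size is checked inside the body. MAX_EXPLODE and the class name are invented; the real threshold and the @ExplodeLoop machinery live in JRuby+Truffle:

```java
import java.util.Arrays;

public class SmallSort {
    // Stand-in for the small-array threshold; because this bound is a
    // constant, @ExplodeLoop-style unrolling can turn both loops into
    // straight-line compare-and-swap code.
    static final int MAX_EXPLODE = 16;

    static void sortSmall(int[] array, int size) {
        // Loop bounds are the CONSTANT, not the dynamic size...
        for (int i = 0; i < MAX_EXPLODE; i++) {
            for (int j = i + 1; j < MAX_EXPLODE; j++) {
                // ...and the dynamic size is checked inside the body, which
                // keeps the unrolled code correct for shorter arrays.
                if (j < size && array[j] < array[i]) {
                    int tmp = array[i];
                    array[i] = array[j];
                    array[j] = tmp;
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] a = {8, 2, 5};
        sortSmall(a, 3);
        System.out.println(Arrays.toString(a)); // [2, 5, 8]
    }
}
```

After unrolling, each copy of the body is just a guarded get, compare and set, exactly the "simple code" the talk describes.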
That's what the partial evaluator is doing: it sees that these things are constant and removes them. We can do that comparison at compile time and plug the value in, because we know it's true, and we get a result: we simply produce 2. And that happens for real if you run that Ruby code. People say to me, though, that programs don't spend most of their time computing on constants. As someone else here said, when you do inlining and code generation you do get some silly code that is obviously constant-foldable, but let's look at what happens if, instead of literal values, the inputs are variables a and b. We keep going as far as we can: we get this far and no further, so we can't completely constant-fold it, but we've still achieved something. We've got rid of the allocation of the array, we've got rid of the call, and we've still inlined everything. This is called the residual program, and even if it doesn't reduce to a constant, we've still achieved a lot. When you implement a VM in a high-level way in Java you have lots of temporary objects; boxing, for example: if your AST's execute method always returns Object, then you're always boxing. But the partial evaluator sees through the boxing, and all we're left with is extremely efficient code. And when we were looking at the original Ruby code and thinking that was a silly way to write minimum: we've now got what is a good way to write minimum, and we got that abstraction removed for free.

Okay, let's look at a deliberately extreme example. This is some code which I found actually worked without special effort; I haven't engineered any tiny optimisation to make this work, but I have made the benchmark deliberately show off things which I knew were working anyway. So here's the code; I'll zoom in on bits of it individually. We've got a module called Foo with a method foo taking a, b and c, and I've just done a load of stuff in here that creates objects and processes things: I create a hash of a, b and c; I map that hash to get just the values out; I index that array; I create another array of a, b and c; I sort it; I take the middle one; and I add things together. Bar is one of those classes which just forwards stuff: it has method_missing, which takes an array of arguments, checks whether Foo responds to the method, and if so sends it on to Foo, and otherwise says it doesn't know anything. And the outer loop of the benchmark creates a new Bar and simply calls bar.foo. That's the whole benchmark.

Let's think about the control flow: foo is called on bar, but bar has no such method, so it forwards the method from this send, as well as calling respond_to?, up to here. And the data flow is that the values are passed in here and implicitly allocated into an array here. This is another part of the problem with these kinds of languages: things are often allocated, or called, very implicitly, and there's a lot going on behind the scenes. So we actually create an array of the arguments here, and then in order to call send we have to turn the array back into arguments straight away; it goes up to foo, and the values flow all the way through until they come back out here, and then this returns some value. The value it returns is 22, as you'd find if you sat down with pen and paper and worked it out.

This is how that performs on other implementations of Ruby; again, taller is better, and it's all relative to MRI, the reference implementation of Ruby in C, which is up here. This is JRuby with invokedynamic, and that's pretty respectable: four times faster is a really useful speed-up. Rubinius is a bit slower; I think it has quite poor support for metaprogramming, send and things like that. Topaz does quite well here, though I think there are some known things they just haven't done yet in the particular routines I'm using, so this may not be Topaz at its best. So what does JRuby+Truffle do? We run this benchmark and we are now a thousand times faster than the next best implementation. I said it was a deliberately extreme example, and what's happening here is, in reality, what you usually don't want to happen with a benchmark: the whole thing is being optimised away, and the output of the compiler is that constant value 22. That's how we're able to run it so fast: we've completely optimised the benchmark away.

We can confirm that in several different ways. This is the Graal IR that comes out of that compilation; I've taken a screenshot of the actual tool we have for visualising Graal graphs. It's a sea-of-nodes graph, as has already been described. I think the yellow nodes are side-effecting and the blue are side-effect-free; I don't remember what all the colours mean, but I do know that the red edges are control flow and the blue edges are data flow. I'm going to zoom in at the bottom, and by the way, you read these graphs bottom-up, if you didn't know. We have a return, and a box of that constant value 22. So we've reduced the whole thing to a constant. Now, there is quite a lot of logic on the control flow path, and the graph does continue up and up, because there are a lot of things we have to check before we know that 22 is the right result. You can see the profiled values were 14 and 8 and so on: here we're checking that the value flowing in from the argument is still 8, and is still 14, and if not we're going to transfer back to the interpreter and do something else. But as I say, we do produce that constant value; reducing to a constant shows what we can do in the best case.
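What those guards amount to can be sketched as ordinary Java: compiled code that returns the folded constant so long as the profiled inputs still hold, and falls back to the general path otherwise. Everything here is an invented stand-in; in the real system the fallback is a deoptimisation back to the interpreter, not a method call:

```java
public class GuardedConstant {
    // Invented stand-in for the benchmark's full foo/method_missing chain:
    // some arithmetic the compiler can fold for fixed inputs.
    static int slowPath(int a, int b) {
        int[] pair = a < b ? new int[]{a, b} : new int[]{b, a};
        return pair[0] + pair[1] * 2;
    }

    static int deopts = 0; // counts "transfers back to the interpreter"

    // Shape of the compiled code after partial evaluation with the
    // profiled arguments 8 and 14: guards first, then the constant.
    static int compiled(int a, int b) {
        if (a == 8 && b == 14) {
            return 36;         // 8 + 14 * 2, computed at "compile" time
        }
        deopts++;              // guard failed: bail out to the general path
        return slowPath(a, b);
    }
}
```

As long as the guards pass, the array allocation, comparison and multiplication in slowPath never execute; the guard checks are all that remain, which is why the measured overhead stays tiny.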
If other things are still runtime values, they just get left in where necessary, and the guards obviously aren't costing far too much, because we're still a thousand times faster; so we know the guards are reasonable. The final way we can check that this has been reduced to a constant is to look at the generated machine code. There's quite a lot of it that implements the guards, but the important bits are a return instruction and a move into the return register of a literal, and the literal is the already-boxed value 22. The JVM keeps around boxed versions of small integers, so we don't even allocate anything here; we simply return an already-allocated value. So for all the complexity in that benchmark, which was made up of patterns I pulled out of gems people are actually using to make money today, we can completely remove it, and the only way we've done that is by implementing the core library using specialization, using splitting, using the machinery Truffle provides. If we hadn't done that, and we'd left the core library as something opaque, implemented in Java code, or in HotSpot's case implemented in C++, or anything else, then we would never have got anything like this; we'd be left with nothing more than lots of calls to runtime routines. I used to work on parallelism, and there's a law there, Amdahl's law, that says however fast you make your parallel part, you've still got the sequential part left. It's a bit like that: even if we made calls to these intrinsics take no time at all, it would still be really slow, because the actual core routines take so long. We were talking about the core library being a barrier; we can take this to the next level if we talk about C extensions. Ruby, like Python, like V8, supports C extensions: you can write code in C and dynamically load it into the interpreter.
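That "already boxed" detail is observable from plain Java: the language specification guarantees that Integer.valueOf (and autoboxing) returns cached, shared boxes for values in at least the range -128 to 127, so returning a folded constant like 22 allocates nothing:

```java
public class BoxCache {
    public static void main(String[] args) {
        // Small values come from the Integer cache: same object every time.
        Integer a = Integer.valueOf(22);
        Integer b = Integer.valueOf(22);
        System.out.println(a == b);        // true: no allocation, shared box

        // Values outside the cached range are, with the default cache size,
        // freshly allocated each time.
        Integer c = Integer.valueOf(100000);
        Integer d = Integer.valueOf(100000);
        System.out.println(c == d);
    }
}
```

So the compiled benchmark really does just move a pointer to a pre-existing object into the return register.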
That gives you what are effectively new built-in functions to call, and these are pretty essential for pragmatic use of the language if you want to do things like talk to a database using system drivers. But they're also used to work around performance problems: if your Ruby code is too slow, you're told to write a C extension. With C extensions, of course, we've painted ourselves into a corner, in that they're written against an API which is very inflexible, and it's very hard to optimise the implementation of the language while still meeting the C API specification. The people who wrote the JVM found this very early on: the original JNI equivalent exposed everything for real, direct pointers, as Cliff talked about, and they went back and said no, we need to abstract this behind handles, but then that adds overhead. In Ruby and Python the C API has been a real limiting factor for alternative implementations of those languages. I think the PyPy people are working on it a lot at the moment, but it's a big problem. Ironically, JRuby, who suffer from this problem, have reinvented it themselves, in that their extension API in Java is just their entire internals exposed. So they complained about the Ruby C API, but we can't use their Java extensions, because they're written against this massive API. All of this talk has been about removing barriers to optimisation, and C extensions create a new barrier, because they're more native code.

The people who wrote this clamp routine knew it was slow: they profiled it, realised it was slow, and wanted to make a faster version, so they wrote a C extension. It's written using this API, where there are these opaque VALUEs, which are pointers to Ruby structs, and they've written a nice high-performance version of clamp, the kind you or I might write. But they use it inside this conversion-from-CMYK-to-RGB method, and look at what else is going on around it. They're creating a hash to store the RGB components of the pixel (this is an image-processing application producing images in real time), they're mapping over key and value to clamp the value of each component, and then they're constructing another hash with those clamped values. So we create two hashes, and then we call the C routine right in the middle, which means we've got another optimisation barrier right in the middle: every value goes into this black hole of C code, where the compiler can see nothing, and comes back out again.

What we're doing in JRuby+Truffle is that we have implemented an interpreter for the C code of Ruby's C extensions. I can't cover this in depth today, because it's quite complicated, and it may seem like a crazy thing to do, to implement an interpreter for C; C is a native language that uses direct access to memory. But if you look at C, it doesn't really have much that other languages we dynamically compile all the time don't have: it has if statements, it has functions, it has arrays. Those are all things we can dynamically compile using a conventional JIT compiler. When we prototyped this we wrote a direct C interpreter; we're now going through LLVM, so we compile the C extensions to LLVM and then interpret the LLVM IR as if it was the AST of a real language. Imagine LLVM IR written out like a BASIC program, with jumps, and you interpret it; it's a bit trickier in reality, but if we interpret it like that, it means we can take that native clamp method and inline it from where it's called into the Truffle graph, and back out again if it calls back into Ruby.

This is something which has produced real results for us. We took all the C extensions for psd_native (the native part of psd.rb, the Photoshop library) and for chunky_png (the PNG library): everything the authors found was important enough to write a C extension for. So we didn't choose these routines; someone else decided they were worth writing as C extensions. We looked at running them on MRI with just the pure Ruby code, and then with the C extensions, which are about ten times faster on MRI. On Rubinius there's a smaller margin, and on JRuby there's a smaller margin, because they have to emulate that C API, which JRuby attempts to do. When we ran the C extensions on JRuby+Truffle, we achieved around three times faster than the natively compiled code, and the reason is that, because we could inline it all as one, we go from a collection of really fast routines calling backwards and forwards across the boundary the whole time, to one tight loop, all optimised together, partially evaluated into one really tight piece of machine code. You can see that a large part of the effect is the cross-language inlining, because when we disabled it, performance was about the same as native. And again, this is real code people use to make money today.

So, conclusions. The blocker for the performance of idiomatic Ruby code, the code people want to run and actually write, is the core library, not the basic language features. If you tackle a language like Ruby and you think "I'll implement this optimisation and that optimisation, I'll implement inlining", but you don't implement the core library, and you leave it as an opaque box, then I don't think you'll get more than a couple of times speed-up. And this extends to everything that gets in your way: it's the core library, it's also C extensions, and it's also the other little languages that are embedded, for example regular expressions, which people often use for string processing in Ruby. If regular expressions are a barrier you can't compile through, then they limit your performance in the same way.
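The clamp situation can be sketched in plain Java. clamp itself is trivial; the point is that once the compiler can inline it into the surrounding per-component conversion, the call boundary disappears entirely. The conversion formula here is a simplified invention, not the real chunky_png/psd_native code:

```java
public class CmykSketch {
    // The "C extension" side: trivial on its own, but an optimisation
    // barrier whenever the compiler cannot see into it.
    static int clamp(int value) {
        return value < 0 ? 0 : value > 255 ? 255 : value;
    }

    // The "Ruby" side: per-component conversion calling clamp three times.
    // With cross-boundary inlining this collapses into straight-line
    // arithmetic with no calls left. (Simplified, invented formula.)
    static int[] cmykToRgb(int c, int m, int y, int k) {
        return new int[] {
            clamp(255 - c - k),
            clamp(255 - m - k),
            clamp(255 - y - k),
        };
    }

    public static void main(String[] args) {
        int[] rgb = cmykToRgb(0, 300, -20, 0);
        System.out.println(java.util.Arrays.toString(rgb)); // [255, 0, 255]
    }
}
```

When clamp lives on the other side of a native boundary instead, each of those three calls marshals its argument in and its result out, and nothing around the call can be optimised through it.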
Specialization, splitting, inlining, partial evaluation, inline caching: they're all solutions to this problem, and Truffle makes them easy to use. It makes them easy to use on the core library, because in our core library everything is an object, everything can be cached, everything can store state, and we've found that this can result in order-of-magnitude increases in performance with reasonable effort. For the first two years this was one person working part time on JRuby+Truffle; more recently there have been a couple more people, but we've surpassed all of our expectations with very reasonable effort. I need to make a couple of important acknowledgements. JRuby+Truffle is built on the work of JRuby and Rubinius, and the JRuby people host us in their repository, so I'm very grateful to them for allowing us to do that. The GraalVM team at Oracle Labs is large and always growing; this particular work on Ruby is joint work between me, Benoit Daloze, who's here, Kevin Menard and Petr Chalupa. The work on C extensions was originally work with Matthias Grimmer, and is now with Manuel Rigger, who's also here. And there's a large group of people who work on things like the Graal compiler and Truffle, who make what we're doing possible. Thanks very much.