Polyglot in High Performance Computing


I'm a computational physicist at Caltech, and I work in a variety of programming languages. In my undergraduate degree at MIT I studied both computer science and physics, and it's great to work at the intersection of the two. I also had another life as a mobile developer, for applications like Circle of 6 and Signal, so I'm definitely getting a lot out of this polyglot conference. The languages I'm going to be giving examples in today are C, C++, even Fortran, and a little bit of CUDA, and that's the "high performance" in the title. To start out, since some people might not have done much parallel computing in the past, I'll give a little introduction to what exactly that might mean. You could have shared memory or distributed memory; you could have single instruction, multiple data (SIMD) or multiple instruction, multiple data (MIMD). You can also have functional parallelism, where different tasks that may not have dependencies on each other are broken up, executed in parallel, and combined at the end. In high performance computing we mostly deal with distributed memory and multiple instruction, multiple data: you might have multiple networked machines which each individually have their own memory and data store, perhaps even a shared, networked hard drive as well, and you want to perform some computation across those machines. The languages of choice are mostly C, C++, and Fortran, again because of performance, but I'm really excited about languages like Julia, which you're going to hear about (I believe that talk has been moved to tomorrow), and other languages that will help us as programmers. I think we're all going to be forced into this era of distributed computing, whether or not it's high performance, as Moore's Law comes to an end: we're going to have to leverage multiple cores on the same machine, of course, but eventually probably distributed memory systems as well to get the performance we need. So one standard that is often used with
these languages (though the standard can also be implemented in other languages) is the Message Passing Interface, or MPI, for distributed memory; and for single instruction, multiple data, potentially CUDA, which is a higher-level GPU programming language that lets you execute the same instruction on a lot of different data at once. It was originally used in the gaming community, of course, but now we've leveraged it for scientific computing as well. In parallel computing you might want to quantify how much faster an algorithm will run on a parallel computer with n processes. If T_p is the time it takes to execute the algorithm on p processes, then the speedup is the ratio between the time on one process and the time on n processes, S(n) = T_1 / T_n. Generically, every algorithm has some serial component, and minimizing that serial component is what will give you the best speedup. As processes are added to the computation, a decreasing return may be seen, because that serial component begins to come into play. If s is the fraction of the algorithm that's inherently serial, then the speedup is instead upper bounded by 1/s: take the limit as n, your number of processes, goes to infinity, and fundamentally you're bounded by the serial component. Next, a visualization. I'm not going to talk about this parallel visualization framework, ParaView, today, but it's an open source visualization framework that allows you to visualize terabytes or even more of data on a distributed computer, and I like this visualization of this algorithm. Basically we have four processes, and these are actually different computers, and we can see that prior to this algorithm being run, when the data was initially read in, in the center there was a lot of mixing between where the data was located on different processes.
For many algorithms this creates a problem: for example, if your algorithm depends on neighboring particles or neighboring elements, then you might have to query another machine. So one way to speed up your algorithm might be to first try to minimize that mixing between processes. Another nice concept from parallel computing is the concept of ghost zones. Here's a little space invader divided across three processes, and I want to run an algorithm that finds the boundary of the space invader. If I were to do this in parallel and then recombine the results, I would get a lot of spurious boundaries. However, if on the space invader on the right-hand side I add some white blocks, where a white block (called a ghost block, or ghost zone) is always added wherever a block used to be next to another block on another process, then run the same algorithm, throw away the ghost blocks, and recombine, I get the right result. That shows that by paying some cost in terms of memory and storage, you can create algorithms that nevertheless function as planned, as long as you throw away those ghost zones when you're done. Next is a slightly complicated formula, but it's a simple example from physics: finding the center of mass. Let me parse it for you. On the bottom of the formula I sum up the mass of every particle; on the top I sum the mass of each particle times its coordinate, where i refers to the x, y, or z coordinate, so you get the center of mass in the x, y, or z direction. The important thing about this algorithm is that it doesn't depend on any neighboring particle per se. You could break your data into chunks and have each chunk compute the top quantity and the bottom quantity on its individual portion, then communicate and combine the partial results, computing the final sums of the top and bottom quantities.
The final division then gives the result. So this parallelizes very neatly, because it's a commutative operation: the processes calculate those top and bottom sums independently, the results are combined, and the final division is performed. You can either do that naively, by sending everything to some root node, or you can combine it in a binary tree structure, and standards such as MPI actually support those operations directly. If you're familiar with MapReduce, you can think of it as a reduce operation of sorts, done very efficiently. So this is a nice algorithm; hairier algorithms are ones that depend on many communication steps, because not only are you killed by serial components, you're also killed by communication, especially as you scale up. Okay, so a little bit of an example of MPI, if you haven't seen it before; I believe most people in the audience would be more familiar with MapReduce than with MPI. In something that implements the standard, you have the ability to tell which process you are, 0 to n. You have the ability to broadcast messages, so you can broadcast the result of your computation to everyone; you have the ability to collect results; you have the ability to send your result to a specific process; and you have the ability to recognize whether you're the root process. (We need some comedic relief in a talk that's going to feature so much Fortran; Fortran can't be the only joke in the talk.) You also have the ability to erect barriers, which are basically synchronization steps. You can do a blocking send, or a non-blocking send or receive, where you send or receive data but don't wait; of course, you then have to check that the data has properly been received or sent before you go do something important with it, but you're able to do other computations in the meantime. Okay, so here's some C
code. You can just look at the comments: basically we start with an initialization call, we find out how many processes we have, and we find out the rank of this particular process. With an mpiexec command, the same code runs on every node; it's just that the result, in terms of my_id, will be different, while num_procs will be the same for everyone. Then we can actually do something with that: we can print something, and you'll see it printed; and we can send and receive data. Here everyone receives something from the root process, in this case rank 0, and then sends some information back to rank 0. So that's a little bit of MPI. We also have OpenMP, which does the same kind of thing but locally, using threads instead of distributed machines; here we get the number of threads we have and say hello to everyone else. Then here's a little bit of CUDA. CUDA is, again, parallel computing on a GPU, so you have the concept of host memory, which is the CPU side, and device memory, which is your GPU. There's a special call to allocate memory on the GPU; once the memory is allocated, you can copy information there. Then there's this concept of a kernel on the GPU, and I'll show you some kernel code: basically we're going to add across all n elements, copy the result back to the host, our CPU, and then free the memory. Here we didn't actually do anything further with the computation, but we could eventually. Here's a small kernel, and you can see that it's quite simple; the only thing different from your standard C is this __global__ keyword. CUDA is basically an extension of C that you compile with a CUDA compiler, straight C with some small extras. So that's a little introduction to some of the tools of the trade in high-performance computing.
So what do we actually want to do with these? I'll give an example from my field, which is solving differential equations on a grid. Sounds kind of complicated, right? And here are some more complications you might remember from calculus, if you ever did take calculus; if not, I'll teach you calculus in one minute. We have two points on this curve, and we want to find the slope at the red point. Calculus would tell us the exact slope of the curve at that point, but we don't want to do calculus, so we make our best guess: we take two points near the red point, draw a line between them, and say that's basically the slope. The smaller we make the distance between the two blue points, the closer we get to the result calculus would give us. Now, calculus is continuous and computers are discrete, so mostly we have to do the discrete thing on our modern computers. You can do this in a more complicated manner in three dimensions using volumes, if you want to compute volumetric things, but this is the basic idea: if you want to solve a differential equation, that is, some equation involving these funky derivatives, you might create a grid, do some approximation like this to get the derivatives, and then solve your equations with regular algebra. Okay, so here's an example solving something called advection, which is basically moving material from one place to another. If our computers were perfect, and you had this much material, one Christine, here, and you put it into the equation and figured out where it ended up, you would get one Christine there. In practice we have errors in computation, so you might get 99.999% of Christine, with a little bit lost to numerical error. But this is a way you can test your code: if you move material over time from A to B, you should get a hundred percent of the stuff at B if you moved one hundred percent of the stuff.
We can actually do that with an equation and test that our algorithms work; and sometimes you actually want to solve exactly this kind of equation for various physics problems, such as heat transfer. Okay, so that was on a 1D grid, and I said you could do it in two or three dimensions with volumetric stuff. In practice, if you have, say, billions of particles, like we do on the left here, you might not want a regular grid everywhere, because that would be a very large grid. So in these grid-based methods we sometimes do something called adaptive mesh refinement, where you follow where the most particles are when creating your grid to do numerics: you have big grid cells where you don't have a lot of stuff and small grid cells where you do. Okay, so those are some examples of differential equations, but these come up in chemistry, biology, biochemistry, physics, fluid dynamics, environmental simulations, weather forecasting, geophysics, space science, astronomy, artificial intelligence, all sorts of really cool stuff, and being able to solve them on supercomputers gives us scientific insight into any of these fields. Traditionally physics has led the way, because we had a lot of computer-inclined people already in the field, and weather forecasting depended on it very early on, but nowadays biologists and chemistry folks are getting into the fray as well. We've already seen, with Caffe and other things, what GPU-accelerated and other high-performance computing techniques can do for artificial intelligence, and that will hopefully result in some very cool stuff in the next few years. Okay, so in the last part of my talk I'll do what I promised to do and introduce a framework that allows physicists, and other scientists who want to solve these differential equations on large grids, to do so very efficiently in
the high performance computing language of their choice, and that is the Cactus framework, a modular, open source environment for scientists and engineers that allows parallel computation across architectures and common code development across groups and languages. Okay, here's a cactus, and it's named this way because there's the "flesh", which is the core. The flesh has a bunch of APIs; it allows you to define parameters; and it has a scheduler, grid variables, a whole make system, and error handling. Then there are things called "thorns", which are the modules, and they allow people to define their own parallelism, coordinate infrastructure, I/O, interpolation, various types of equations, and so on. As for how this toolkit is used: there's a set of core developers who work on the flesh, but mostly people work on these modules, and the make system lets you very easily select which modules you want to use and integrate other people's modules; I'll show you that towards the end. So what is in a thorn? It contains Cactus declarations in a special language; the source code, which can be in any of these languages (I believe it's said to be very easy to add an additional language; no one has found the need to yet, but it's certainly possible); makefile fragments; documentation; and test cases, along with example parameter files. Once an executable is built, there are options you can change at runtime in parameter files. The API of a thorn is defined in the Cactus configuration language, in a file called interface.ccl; it has the implementation name and various grid functions, it can inherit grid functions, and it declares the routines, which are basically the APIs that are provided or used. There's also the scheduler, which calls routines at certain times: the main execution body of the flesh goes through things in a certain sequence, and you can insert your own computations into that sequence. Schedule groups can be created, so you can have a whole
group where you say: execute this entire group at the same time. You can also synchronize variables. It's rule-based, so you can schedule things after, before, or while others, and also allocate storage for the grid variables. Okay, so your parameters are basically the runtime options I mentioned, which you can set in the parameter file; you also provide defaults when you define your thorn. You can declare parameters of five types: integer, real, boolean, keyword, and string, and you can define allowed ranges, where it will give a warning or stop running without proper data; basically it will stop and say: hey, we only accept this parameter within this range. For example, in this example the amplitude must be non-negative, so if the user tried to give a parameter file with minus one there, it would stop running and say: hey, in your parameter file this has to be non-negative. A thorn can also inherit public parameters from other implementations. So this is an actual example of a parameter file. Each of the things on the left (driver, grid, and base thorns, and so on) is a thorn name; then a double colon, then the actual parameter you want to set, then equals whatever you're setting it to. And at the top you have the active thorns: even if you compile a range of thorns, you don't necessarily have to activate them all at the same time. This will set your parameter values. Here's a little bit of example source; I'm not going to show the whole thing. It's a subroutine that's within a particular thorn: it inherits a bunch of arguments from everywhere else, declares some local variables, and does some computations, and it can be used within that thorn or called from other thorns which depend upon it.
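A parameter file of the shape just described might look something like this; the thorn and parameter names here are hypothetical, invented only to show the `ActiveThorns` line and the `thorn::parameter = value` syntax:

```
# activate only the thorns this run needs, even if more were compiled in
ActiveThorns = "MyDriver MyCoords MyWaveToy"

# thorn::parameter = value
MyDriver::global_nx   = 100
MyCoords::domain      = "unit_cube"
MyWaveToy::amplitude  = 1.0    # the declared range forbids negative values
```

Only the thorns named in the active-thorns list participate in the run, even when many more were compiled into the executable.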
Okay, so the driver is part of the flesh: a specialized thorn for memory management and parallelization that contains all the information about the grid. I showed you earlier how you might want to solve differential equations on a grid; you might not want to design all the parallel infrastructure to do that yourself, and that's why Cactus is great. You can use a multi-block or refined grid structure and just say: I want this portion of the grid, to the left or the right of me, or above or below me, and so on, and not have to worry about doing all the efficient parallelization. Things that are driver-specific are interpolation, reduction, I/O, and checkpointing. Okay, we're down to the last few slides, so no worries about me boring you with high-performance computing. I promised some polyglottery: how the cactus is made. These CCL files are parsed by Perl scripts. Cactus was originally designed in the 1990s, and it's still mature and maintained almost two decades later, which is impressive. From the CCL files it creates the interface, the parameters, and so on, as C code, plus LaTeX files for documentation; the makefiles require GNU make; and the flesh is written in C. The thorns you can write in C, C++, or Fortran: each thorn can be in any one of these languages, or even a combination, and more can be added. The flesh helps with the function calls between the different languages and really enables developers to write and build upon the modules of their choice in their particular thorn, while still leveraging all the infrastructure and modules written by many other people in the community. And the Fortran code is preprocessed with the C++ preprocessor (combining Fortran with C++: you thought it couldn't get worse) and then sanitized with Perl. I don't think that's sanity. Anyway, the documentation uses LaTeX, so after you compile you get the entire documentation as a PDF, in beautiful LaTeX, with mathematical formulas if you put those in your docs. Okay, so that was how the cactus is made; now, how to make the cactus.
So on the right, this is just highlighting that we have one thorn that comes from one particular SVN server in Germany, and I even have a username there, with access to that server via my public SSH keys, and I'm checking out a particular thorn. Right below, there's another thorn that's a git repo on a Bitbucket account, with no username. So you can do a variety of these things, and when I'm actually building code to run a simulation, I'll have hundreds of these, perhaps. It's very nice that you can integrate multiple types of version control on multiple servers, with multiple different types of authentication or no authentication at all, and then you just give it this thorn list. You can even build different configurations in the same tree: I might want one configuration for one simulation and another for another, and I don't have to check out multiple copies; you choose a list of thorns and options for each configuration, and, as I mentioned, different version control systems and repository locations. Okay, so what is this all used for? A variety of things. I mentioned a lot of different applications, but it was primarily the numerical relativity community, who basically had to go this route to solve their very gnarly differential equations, which came from Einstein (so don't give him access to your differential equations, he'll make them too complicated). Black holes, neutron stars, supermassive stars, gravitational waves, quantum gravity, supernovae, all sorts of things. This is a star exploding with magnetic fields, on the right, and that was all done in a simulation using the Cactus framework and one particular group of thorns called the Einstein Toolkit, but people are using this in a variety of different fields as well. So you can do very powerful things with this framework, and I think something we can take away as well is the fact that, as a scientist or as a programmer, you're allowed to work in your
preferred language and yet integrate with APIs from a variety of different languages, which is neat, because you can build on legacy code without having to write everything in Fortran. I write almost zero Fortran, but I get to use Fortran code without having to rewrite, parse, and port everything to C or C++. And even if you are writing Fortran, you don't have to write everything in Fortran 77, or Fortran from the time of Jesus. Okay, so here's a little picture of some cacti, and I will take a few questions if you have any; if not, I'll be around all week. I'm based in Los Angeles, I'm @Corbett on Twitter, and I work with Kip Thorne, the producer of Interstellar, at the university where Feynman used to work, so it's a pretty cool place to be. All right, thanks.

Q: Just to clarify, when you have multiple languages interoperating, is that via old-school Unix linking?

A: For the multiple languages interacting, I believe the Fortran is actually parsed into C++ at some level, something like that.

Q: Okay, so everything gets built into C++ and then built from there?

A: Yeah.

Q: Hi, I was wondering how much time the design of the equations usually takes, compared to the development of the code?

A: So the equations are mostly well known, and the question is how many of them you can solve at once efficiently and what approximations you need to make. One example from gravitational physics: solving the gravity equations directly is an order N squared problem, where every particle depends on every other particle, and that works for objects up to the scale of, say, globular clusters, balls of stars with millions of stars. But if you want to simulate portions of the universe, you end up having to sample with billions of particles, so you have to make approximations and say: well, it's an N squared problem, but we're going to group distant particles out there into one single particle. It's called
a multipole method. So the majority of the time is not spent developing equations, but more on trying to find the right approximations for the right problem and trying to quantify your errors, because you don't want to make the effort of computing something exactly if your numerical error is above that level, for example. Just figuring out what you really have to solve to get the physical result is, I would say, the majority of the time; how to implement it is another portion.

Q: Hi, thanks for the talk. This is just a side question, but what are the coolest scientific discoveries made with the Cactus framework, that you know about?

A: I think there's been a lot of cool research into how black holes collide, for example, and the predictions there. One thing that I'm working on is predicting how the first supermassive black holes might form, and I think that's going to be pretty cool. So I'd say the coolest discoveries are yet to be made, but there have been a lot of fundamental theoretical predictions. The thing is that the majority of these will not be verified until the gravitational wave observatories get a little more sensitive: the primary prediction in this super-strong general relativity regime is the radiation of gravitational waves, and we're still not able to observe that with enough sensitivity. But there are detectors in the U.S., like
LIGO, and eventual space detectors like eLISA, that should observe this. So there are a lot of predictions the community is making, so that once you see a spectrum you'll be able to say: oh, those are two black holes merging. You'll be able to say exactly what it is because of predictions we've made; otherwise you would just look at the signal and say: I don't know what these waves mean.

Q: Would you code something scientific in JavaScript?

A: JavaScript does have fairly good performance, comparable to, say, Java, and I have seen people code scientific things in Java, but more for demonstration or educational purposes. I definitely think there are neat demos where you can do unexpected things in JavaScript. But at some point, when you want to scale things up, you're going to want to run on a distributed machine, and you won't be able to scale that with JavaScript. What I would do scientifically with JavaScript: for example, I mentioned this parallel visualization framework; the front end for that is built on Qt, and Qt is a little bit of a pain to build, if you've ever tried it; it does build ninety percent of the time, but it takes like four hours. So if you want rapid, iterative development of front-end tools, where you have a client-server architecture and JavaScript is the client hooking into a server whose back end is some HPC cluster, JavaScript would be very well suited for that, in my opinion.