Big Data with Python & Hadoop


Hopefully interesting. So, hello and welcome everyone. I hope you are enjoying EuroPython as much as I do, and for the next 45 minutes you can just sit, relax and enjoy this talk about Big Data with Python and Hadoop. The slides are already on slideshare.net and I'll give you the link at the end of the talk. This is our agenda for today: first a quick introduction about me and my company, so you get an idea of what we use Hadoop for; then a few words about Big Data, Apache Hadoop and its ecosystem; next we'll talk about HDFS and third-party tools that can help us work with HDFS; after that we'll briefly discuss MapReduce concepts and talk about how we can use Python with Hadoop, what options we have, what third-party libraries written in Python are out there, and of course their pros and cons; next we'll briefly discuss a thing called Pig; and finally we'll see benchmarks of all the things we've talked about. These are freshly baked benchmarks which I made a week ago, just before coming to EuroPython, and they are actually quite interesting. And of course, conclusions. By the way, can you please raise your hands: who here works with Hadoop, or maybe worked with Hadoop in the past? Okay, thanks, not too many.

All right, well, this is me. My name is Max, I live in Moscow, Russia. I'm the author of several Python libraries; there's a link to my GitHub page if you are interested. I also give talks at different conferences from time to time and contribute to other Python libraries. I work for a company called Aidata. We collect and process online and offline user data to get an idea of users' interests, intentions, demography and so on. In general we process more than 70 million users per day. There are more than 2,000 segments in our database, like users who are interested in buying a BMW, or users who like dogs, or maybe users who watch porn online. We have partners like Google DBM, Turn, AppNexus and many more. We have quite a big worldwide user coverage and we process data for more than 1 billion unique users in total. We have one of the biggest user coverages in Russia and Eastern Europe; for example, there it's about 75 percent of all users. Having said all that, you can see that we have a lot of data to process, and we consider ourselves a data-driven company, or a big data company, as some people like to call it now.

So what exactly is big data? There is actually a great quote by Dan Ariely about big data: "Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it." Nowadays big data is mostly a marketing term or a buzzword. There is even a tendency to argue about how much data counts as big data, and different people say different things. In reality, of course, only a few have real big data, like Google or CERN, but to keep it simple for the rest of us, data can probably be considered big if it doesn't fit onto one machine, or it can't be processed by one machine, or it takes too much time to process on one machine. The last point, though, can also be a sign of big problems in the code rather than of big data.

Now that we've figured out that we probably have a big data problem, we need to solve it somehow, and this is where Apache Hadoop comes into play. Apache Hadoop is a framework for distributed processing of large data sets across clusters of computers. It's often used for batch processing, and this is the use case where it really shines. It provides linear scalability,
which means that if you have twice as many machines, your jobs will run twice as fast, and if you have twice as much data, your jobs will run twice as slow. It doesn't require super cool, expensive hardware: it is designed to work on unreliable machines that are expected to fail frequently. It doesn't expect you to have knowledge of inter-process communication, threading, RPC, network programming and so on, because parallel execution across the whole cluster is handled for you transparently. Hadoop has a giant ecosystem which includes a lot of projects designed to solve different kinds of problems; some of them are listed on this slide, many just didn't fit. HDFS and MapReduce are actually not part of the ecosystem but part of Hadoop itself, and we'll talk about them on the next slides. We'll also discuss Pig, which is a high-level language for parallel data processing using Apache Hadoop. I won't talk about the others because we simply don't have time, so if you are interested you can google them for yourself.

So, HDFS. It stands for Hadoop Distributed File System. It just stores files and folders: it chunks files into blocks, and the blocks are scattered randomly all over the cluster. By default a block is 64 megabytes, but this is configurable. It also provides replication of blocks: by default three replicas of each block are created, but this is also configurable. HDFS doesn't allow you to edit files, only create, read and delete them, because it is very hard to implement editing functionality in a distributed system with replication. So what they did was basically say: why bother implementing file editing when we can just make files not editable?

Hadoop provides a command line interface to HDFS, but the downside is that it is implemented in Java and needs to spin up a JVM, which takes from one to three seconds before a command can even be executed. That is a real pain, especially if you are trying to write scripts around it. But thankfully, thanks to the great guys at Spotify, there is an alternative called snakebite. It's an HDFS client written in Python. It can be used as a library in your Python scripts or as a command line client. It communicates with Hadoop via RPC, which makes it amazingly fast, much faster than the native Hadoop command line interface, and finally, it's a little less typing to execute a command. So, Python for the win. There is one problem though: snakebite doesn't handle write operations at the moment, so while you are able to do metadata operations like moving or renaming files, you can't write a file to HDFS using snakebite. But it is under very active development, so I'm sure this will be implemented at some point. This is an example of how snakebite can be used as a library in Python scripts (I'll show a small sketch of it in a moment): you just import Client, connect to Hadoop and start working with HDFS. It's really amazingly simple.

There is also a thing called Hue. Hue is a web interface for analyzing data with Hadoop. It provides an awesome HDFS file browser; this is how it looks. You can do everything through Hue that you can do through the native HDFS command line interface. It also has a job browser and a job designer, so you can develop Pig scripts, Impala and Hive queries and a lot of other stuff; it supports ZooKeeper, Oozie and many more. I won't go into details about Hue because, again, we don't have time for this, but this is a tool that you will love, so if you don't use it yet, give it a try. By the way, it's built on top of Python and Django, so again, Python for the win.
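As promised, here is a minimal sketch of the kind of snakebite library usage shown on the slide. It's a hedged example, not the slide's exact code: the NameNode address and the paths are made up, and the constructor arguments depend on your snakebite version and cluster configuration.

    # A rough sketch of using snakebite as a library (read-only operations)
    from snakebite.client import Client

    # hypothetical NameNode host and RPC port -- adjust for your cluster
    client = Client('namenode.example.com', 8020)

    # snakebite methods take lists of paths and return generators of dicts
    for entry in client.ls(['/user/max']):
        print(entry['path'])

    # metadata operations work too; remember that writes aren't supported yet
    print(client.df())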
Now that we know how Hadoop stores its data, we can talk about MapReduce. It's a pretty simple concept: there are mappers and reducers, and you have to code both of them, because they are the parts actually doing the data processing. What mappers basically do is load data from HDFS, transform, filter or otherwise prepare this data, and output pairs of key and value. The mappers' output then goes to the reducers, but before that some magic happens inside Hadoop and the mappers' output is grouped by key. This allows you to do things like aggregation, counting, searching and so on in the reducer. So what you get in the reducer is a key and all the values for that key, and after all reducers are complete the output is written to HDFS. Actually, the workflow between mappers and reducers is a little more complicated: there is also a shuffle phase, sorting, sometimes a secondary sort, combiners, partitioners and a lot of other stuff, but we won't discuss that at the moment. It doesn't matter for us; it's perfectly fine to consider that there are just mappers and reducers with some magic happening between them.

Now let's have a look at an example of MapReduce. We will use the canonical word count example that everybody uses. We have a text used as input, which consists of three short lines, something like "python is cool", "hadoop is cool", "java is bad". This text will be processed line by line, and inside the mapper each line will be split into words. For each word, the map function will return the word and the digit 1; it doesn't matter if we meet the same word two or three times, we still just output the word and a 1. Then some magic happens, provided by Hadoop, and inside the reducer we get all the values for a word, grouped by that word, so we just need to sum up these values in the reducer to get the desired output. This may seem unintuitive or complicated at first, but actually it's perfectly fine: when you're just starting to do MapReduce you have to make your brain think in terms of MapReduce, and after you get used to it, it all becomes very clear. So this is the final result of our job.

Now let's have a look at what our word count example looks like in Java. Now you probably understand why you earn so much money when you code in Java, because more typing means more money. But can you imagine how much code you would have to write for a real-world use case in Java? So, now that you've been impressed by the simplicity of Java, let's talk about how we can use Python with Hadoop. Hadoop doesn't provide a way to work with Python natively; instead it has a thing called Hadoop streaming. The idea behind streaming is that you can supply any executable to Hadoop as a mapper or a reducer. It can be standard Unix tools like cat or uniq, or Python scripts, Perl scripts, Ruby, PHP, whatever you like. The executable just has to read from standard input and write to standard output. This is the code for a mapper and a reducer. The mapper is actually very simple: we just read from standard input line by line, split each line into words and output the word and the digit 1, using a tab as the separator, because that's Hadoop's default separator (you can change it if you like). One of the disadvantages of using streaming directly is the input to the reducer: it's not grouped by key, it comes in line by line, so you have to figure out the boundaries between keys yourself.
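The slide code isn't reproduced in this transcript, so here is a minimal sketch of what such a streaming mapper and reducer might look like. It follows the standard Hadoop streaming word count pattern rather than the exact slide; the jar path and HDFS paths in the submission comment are placeholders.

    # mapper.py -- a minimal Hadoop streaming mapper sketch (word count)
    import sys

    for line in sys.stdin:
        for word in line.split():
            # emit "word<TAB>1"; tab is Hadoop streaming's default key/value separator
            print('%s\t%d' % (word, 1))

    # reducer.py -- a minimal Hadoop streaming reducer sketch (word count)
    import sys
    from itertools import groupby
    from operator import itemgetter

    def read_pairs(stream):
        for line in stream:
            # each input line is "word<TAB>count"
            yield line.rstrip('\n').split('\t', 1)

    # streaming guarantees the reducer input is sorted by key, so groupby
    # yields one group per word; each item in a group is a (word, count) pair
    for word, group in groupby(read_pairs(sys.stdin), key=itemgetter(0)):
        total = sum(int(count) for _, count in group)
        print('%s\t%d' % (word, total))

    # The job is then submitted with something roughly like:
    #   hadoop jar /path/to/hadoop-streaming.jar \
    #       -file mapper.py -file reducer.py \
    #       -mapper mapper.py -reducer reducer.py \
    #       -input /data/text -output /data/wordcount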
Figuring out those boundaries is exactly what we do in the reducer: we use groupby from itertools (as in the reducer sketched above), which groups the word count pairs by word and creates an iterator that returns consecutive keys together with their group. The first item of each pair inside the group is the key again, so we just throw it away using an underscore, then cast the value to int and sum everything up. It's pretty awesome compared to how much you have to type in Java, but it's still a little more complicated than it needs to be because of the manual work in the reducer.

This is the command which sends our MapReduce job to Hadoop via Hadoop streaming. We need to specify the Hadoop streaming jar, the paths to the mapper and reducer using the -mapper and -reducer arguments, and the input and output. One interesting thing here is the -file arguments, where we specify the paths to the mapper and reducer again. We do that to make Hadoop understand that we want it to upload these two files to the whole cluster, into what is called the Hadoop distributed cache. The distributed cache is the place where Hadoop stores all the files and resources needed to run a job, and it's a really cool thing. Imagine you have a small cluster of four machines, you just wrote a pretty cool script for your job, and you used an external library which, obviously, is not installed on your cluster. With four machines you can log into every machine and install this library by hand, but what if you have a big cluster of 100 or 1,000 machines? That just won't work anymore. Of course you could write some bash script to do the automation for you, but why do that if Hadoop already provides a way? You just specify what you want Hadoop to copy to the whole cluster before the job starts, and that's it. Also, after the job is completed, Hadoop deletes everything and your cluster is back in its initial state. It's pretty cool. And after our job is complete, we get the desired results.

So, Hadoop streaming is really cool. It requires a little bit of extra work, but it's still much simpler than Java, and we can simplify it even more with the help of different Python frameworks for working with Hadoop. Let's do a quick overview of them. The first one is Dumbo. It was one of the earliest Python frameworks for Hadoop, but for some reason it's not developed anymore: no support, no downloads, so let's just forget about it. Then there is Hadoopy. It's the same situation as with Dumbo: the project seems to be abandoned, but there are still some people trying to use it according to the PyPI downloads, so if you want, you can also give it a try. Then there is Pydoop, a very interesting project. While the other projects are just wrappers around Hadoop streaming, Pydoop uses a thing called Hadoop Pipes, which is basically a C++ API to Hadoop, and that makes it really fast, as we will see. There is also the Luigi project, which is also very cool. It was developed at Spotify and is maintained by Spotify. Its distinguishing feature is the ability to build complex pipelines of jobs, and it supports many different technologies which can be used to run those jobs. And there is a thing called mrjob. It's the most popular Python framework for working with Hadoop; it was developed by Yelp, and it's also cool, but there are some things to keep in mind while working with it. We'll talk about Pydoop, Luigi and mrjob in more detail on the next slides.
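Since mrjob is up first, here is roughly what the same word count looks like in it. This is a hedged sketch based on mrjob's documented word count pattern, not the slide itself.

    # word_count_mrjob.py -- a minimal mrjob word count sketch
    from mrjob.job import MRJob

    class MRWordCount(MRJob):

        def mapper(self, _, line):
            # the input key is ignored; the value is a line of text
            for word in line.split():
                yield word, 1

        def reducer(self, word, counts):
            # counts iterates over all the 1s emitted for this word
            yield word, sum(counts)

    if __name__ == '__main__':
        MRWordCount.run()

You can run it locally for testing with something like "python word_count_mrjob.py input.txt", or add "-r hadoop" to send it to a cluster.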
So, the most popular framework is called mrjob, or "Mr. Job" as some people like to call it, and I like that too. mrjob is a wrapper around Hadoop streaming, and it is actively developed, maintained and used by Yelp. This is how our word count example can be written using mrjob (the sketch above gives the idea); it's even more compact. While the mapper looks absolutely the same as in raw Hadoop streaming, just notice how much typing we saved in the reducer. Behind the scenes mrjob is actually doing the same groupby aggregation which we saw previously in the Hadoop streaming example. But, as I said, there are some things to keep in mind. mrjob uses so-called protocols for data serialization and deserialization between phases, and by default it uses a JSON protocol which itself uses Python's json library, which is kind of slow. So the first thing you should do is install simplejson, because it is faster. Or, starting from mrjob 0.5.0, which I think is still in development, it supports the ultrajson library, which is even faster. This is how you can specify the ultrajson protocol, and again, this is only available starting from 0.5.0; lower versions use simplejson, which is slower. mrjob also supports a raw protocol, which is the fastest protocol available, but then you have to take care of serialization and deserialization yourself, as shown on this slide: notice how we cast the 1 to a string in the mapper and the sum to a string in the reducer. Also, with the introduction of ultrajson in the next version of mrjob, I don't think there is much need for the raw protocols, because they are not that much faster than ultrajson, at least most of the time. Of course it depends on the job, so you have to experiment for yourself and see what fits best for you.

So, mrjob pros and cons. In my opinion it has the best documentation of the Python frameworks. It has the best integration with Amazon's EMR (Elastic MapReduce) of the Python frameworks, which is understandable, since Yelp operates inside EMR. It has very active development and the biggest community. It provides really cool local testing without Hadoop, which is very convenient during development. It also automatically uploads itself to the cluster, and it supports multi-step jobs, which means that one job will start only after another one has successfully finished, and you can also use jar files or other executables in such a multi-step workflow. The only downside I can think of is the slow serialization and deserialization compared to raw Python streaming, but considering how much typing it saves you, we can probably forgive it for that, so this is not a really big con.

Next on our list is Luigi. Luigi is also a wrapper around Hadoop streaming, and it is developed by Spotify. This is how our word count example can be written using Luigi; it is a little more verbose compared to mrjob.
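Again, the slide isn't in the transcript, so here is a rough sketch of a Luigi Hadoop job in the same spirit. It is hedged: module paths changed between Luigi versions (older releases used luigi.hadoop and luigi.hdfs instead of luigi.contrib.*), the HDFS paths are made up, and running it for real also requires pointing Luigi at your Hadoop streaming jar in its configuration.

    # word_count_luigi.py -- a rough sketch of the word count as a Luigi Hadoop task
    import luigi
    import luigi.contrib.hadoop
    import luigi.contrib.hdfs

    class InputText(luigi.ExternalTask):
        # the input file, assumed to already exist on HDFS (hypothetical path)
        def output(self):
            return luigi.contrib.hdfs.HdfsTarget('/tmp/text.txt')

    class WordCount(luigi.contrib.hadoop.JobTask):
        # input and output are defined on the class, not on the command line
        def requires(self):
            return InputText()

        def output(self):
            return luigi.contrib.hdfs.HdfsTarget('/tmp/wordcount-output')

        def mapper(self, line):
            for word in line.split():
                yield word, 1

        def reducer(self, word, counts):
            yield word, sum(counts)

    if __name__ == '__main__':
        luigi.run()

You would then run it with something like "python word_count_luigi.py WordCount --local-scheduler".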
That's because Luigi concentrates mainly on the whole workflow and not just on a single job, and it also forces you to define your input and output inside a class rather than on the command line. As for the mapper and reducer implementations, they are absolutely the same. Oh, four minutes left? Oh my god, I have so much to say. For me? Okay, okay. So, Luigi also has this problem with serialization and deserialization, and you also have to use ultrajson: just use ultrajson and everything will be cool. Okay, we'll probably skip that slide. It's also cool, which is cool, but not so good for local testing, and we'll also skip Pydoop for now. Okay, okay, oh man, all right.

Benchmarks: this is the most important part, and probably why a lot of people are here. This is the cluster and the software that I used to run the benchmarks. The job was a simple word count on a well-known book about Python by Mark Lutz, which I multiplied 10,000 times, giving me 35 gigabytes of data. I also used a combiner between the map and reduce phases; a combiner is basically a local reducer which runs right after the map phase, as a kind of optimization. So this is the table. Java is the fastest, of course, no surprise here, so it is used as the baseline for performance: all numbers for the other frameworks are ratios relative to the Java values. For example, the job run time for Java is 187 seconds, which is three minutes and something; to get the number for Pydoop you multiply 187 by 1.86, which gives you about 347 seconds, almost six minutes. I ran each job three times and took the best time.

Let's discuss a few things about this performance comparison. Pydoop is second after Java because it uses the Hadoop Pipes C++ API; it is still almost twice as slow as native Java. Another thing that may seem strange is the 5.97 ratio in reduce input records: it looks as if the combiner didn't run. But there is an explanation for that in the Pydoop manual, which says the following: one thing to remember is that the current Hadoop Pipes architecture runs the combiner under the hood of the executable run by Pipes, so it doesn't update the combiner counters of the general Hadoop framework. This is why we see that number. Then comes Pig. I actually thought that Pig would be second after Java before I ran these benchmarks, but unfortunately I didn't really have time to investigate the reasons, so I can't say why it is slower; Pig translates itself into Java, so it really should be almost as fast as Java. Then comes raw streaming with CPython and PyPy. (Do you have any questions, or can I just continue? Okay. Actually, I've only been speaking for half an hour and this is a 45-minute talk, so I still have fifteen minutes. No questions? You see, okay.) So, CPython and PyPy: you may be a bit surprised that PyPy is slower, but the thing is that word count is a really simple job, and PyPy is currently slower than CPython when it comes to reading and writing from standard input and standard output. So it really depends on the job: in real-world use cases PyPy is actually a lot faster than CPython. So what we usually do is implement a job and then
run it under both PyPy and CPython and see what the difference is, and like I said, in most cases PyPy wins. So just try it for yourself and see what fits best for you. Then comes mrjob, and as you can see, ultrajson is just a little bit slower than the raw protocols, but it saves you the pain of dealing with the manual work, so I think you should just use ultrajson. And finally Luigi, which is much, much slower than mrjob even with ultrajson, and I don't even want to talk about its terrible performance with its default serialization scheme.

Okay, since we still have a little time, maybe not fifteen minutes, I can probably go back. We stopped at, I think, this one... yes, this one. So, Luigi. As we just saw, by default Luigi uses its own serialization scheme, which is really, really slow. This is how you can switch it to JSON. I didn't really have time to investigate this either, but after switching to JSON I needed to specify an encoding by hand, so that's also something to keep in mind. And don't forget to install ultrajson, because by default Luigi falls back to the standard library's json, which is slow.

Okay, Luigi pros and cons. Luigi is the only framework that really concentrates on the workflow in general. It provides a central scheduler which has a nice dependency graph of the whole workflow, and it records all the tasks and all the history, so it can be really useful. It is also under very active development, and it has a big community, not as big as mrjob's, but still very big. It also automatically uploads itself to the cluster, and this is the only framework that has integration with snakebite, which is awesome, just believe me. It provides not-so-good local testing compared to mrjob, because you need to mimic the map and reduce functions yourself in the run method, which is not very convenient, and it has the worst serialization and deserialization performance, even with ultrajson.

The last of the Python frameworks that I want to talk about is Pydoop. Unlike the others, it doesn't wrap Hadoop streaming but uses Hadoop Pipes. It is developed by CRS4, the Center for Advanced Studies, Research and Development in Sardinia, Italy. This is an example of word count in Pydoop, which looks very similar to mrjob.
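Here is a rough sketch of what a Pydoop word count looks like, following the structure of Pydoop's documented example; module paths differ between Pydoop versions (older releases used pydoop.pipes), so treat this as illustrative rather than the slide's exact code.

    # word_count_pydoop.py -- a sketch of the word count using Pydoop's MapReduce API
    import pydoop.mapreduce.api as api
    import pydoop.mapreduce.pipes as pipes

    class Mapper(api.Mapper):
        def map(self, context):
            # context.value is the current line of text
            for word in context.value.split():
                context.emit(word, 1)

    class Reducer(api.Reducer):
        def reduce(self, context):
            # context.values iterates over all counts emitted for context.key
            context.emit(context.key, sum(context.values))

    def __main__():
        # Pydoop's runner calls __main__ inside the Pipes task
        pipes.run_task(pipes.Factory(Mapper, Reducer))

The job itself is then submitted with Pydoop's own submitter (the "pydoop submit" command in recent versions) rather than with the streaming jar.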
But unlike mrjob or Luigi, you don't need to think about different serialization and deserialization schemes; you just concentrate on your mappers and reducers, on your code, and do your job, which is cool.

Okay, pros and cons. I'll do my best. Pydoop has pretty good documentation: it could be better, but in general it's very good. Due to the use of Hadoop Pipes it is amazingly fast. It also has active development, and it provides an HDFS API based on the libhdfs library, which is cool because it is faster than the native Hadoop HDFS command line client, but it is still slower than snakebite. I didn't benchmark this, but the Spotify guys claim it's slower, and it is slower because it still needs to spin up a JVM instance, so I can believe them that snakebite is faster. This is the only framework that gives you the ability to implement a record reader, a record writer and a partitioner in pure Python. These are somewhat advanced Hadoop concepts, so we won't discuss them, but the ability to do that is really cool. The biggest con is that Pydoop is very difficult to install, because it is written in C, Python and Java, so you have to have all the needed dependencies, plus you need to correctly set some environment variables and so on. I saw a lot of posts on Stack Overflow and other sites where people just got stuck on the installation process, and probably because of that Pydoop has a much smaller community, so the only place where you can ask for help is Pydoop's GitHub repository. But the authors are really very helpful, they're cool guys, and they answer all the questions. Also, Pydoop doesn't upload itself to the cluster like the other Python frameworks do, so you need to do this manually, and that's not such a trivial process if you are just starting to work with Hadoop. So that's it for Pydoop.

Now, Pig. Pig is an Apache project; it is a high-level platform for analyzing data. It runs on top of Hadoop, but it's not limited to Hadoop. This is our word count example written in Pig. It will be translated into map and reduce jobs behind the scenes for you, so you don't have to think about what your mapper is and what your reducer is, you just write your Pig script. Also, most of the time, in real-world use cases, Pig is faster than Python, which is really cool. It is a very easy language which you can learn in a day or two, and it provides a lot of functions to work with data, to filter it and so on. And the biggest thing is that you can extend Pig's functionality with Python, using Python UDFs. You can write them in CPython, which gives you access to more libraries, but it's slower because the UDF runs as a separate process and sends and receives data via streaming. Or you can use Jython, which is much faster because it compiles your UDFs to Java and you don't need to leave the JVM to execute them, but then you don't have access to libraries like NumPy, SciPy and so on. This is an example of a Pig UDF for getting geo data from an IP address, using the well-known library from MaxMind. It may seem complicated at first, but it's actually not. First we import some stuff from Java and from the library itself, then we instantiate the reader object and define the UDF, which is simple: it accepts the IP address as its only parameter and then tries to get the country code and the city geoname from the MaxMind database.
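The slide code isn't in the transcript, so here is a hedged sketch of what such a Jython UDF might look like. The MaxMind class and method names are assumptions based on the GeoIP2 Java API and may differ between library versions, the database path is made up, and the registration lines in the trailing comment are only a rough illustration of Pig syntax.

    # geoip.py -- a rough sketch of a Jython UDF for Pig that looks up geo data for an IP
    from java.io import File
    from java.net import InetAddress
    from com.maxmind.geoip2 import DatabaseReader   # assumed GeoIP2 Java API

    # instantiate the reader once, at module load time (hypothetical database path)
    reader = DatabaseReader.Builder(File('GeoLite2-City.mmdb')).build()

    # @outputSchema is provided by Pig's script engine when the UDF is registered
    # with Jython (for CPython UDFs it is imported from pig_util instead)
    @outputSchema('geo:(country:chararray, city:chararray)')
    def get_geo(ip):
        # error handling for unknown or invalid addresses is omitted in this sketch
        response = reader.city(InetAddress.getByName(ip))
        return response.getCountry().getIsoCode(), response.getCity().getName()

    # In the Pig script the UDF is registered and used roughly like this:
    #   REGISTER 'geoip.py' USING jython AS geo;
    #   enriched = FOREACH visits GENERATE ip, geo.get_geo(ip);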
It is also decorated with Pig's outputSchema decorator (as in the sketch above): you need to specify the output schema of the UDF because Pig is statically typed. Then we put this UDF into a file called geoip.py, and as for the Pig part, we need to register this UDF first and then we can simply use it, as shown here. It's a really simple concept once you get used to it. There is also a thing called embedded Pig, this one, but we already saw the benchmarks, so let's go to the conclusions.

For complex workflow organization, job chaining and HDFS manipulation, use Luigi and snakebite; this is the use case where they really shine. Snakebite is the fastest option out there for working with HDFS, although you have to fall back to the native Hadoop command line interface, of course, if you need to write something to HDFS. But just don't use Luigi for the actual MapReduce implementation, at least until its performance problems are fixed. For writing lightning-fast MapReduce jobs, and if you aren't afraid of difficulties in the beginning, use Pydoop and Pig; these are the two fastest options out there except for Java. The problem with Pig is that it's not Python, so you have to learn it, it's a new technology to learn, but it's worth it. And Pydoop, while it may be very difficult to start using because of the installation problems and so on, is the fastest Python option, and it gives you the ability to implement record readers and writers in Python, which is priceless. For development, local testing or perfect Amazon EMR integration, use mrjob: it provides the best integration with EMR, and it also gives you the best local testing and development experience compared to the other Python frameworks.

In conclusion, I would like to say that Python has really, really good integration with Hadoop: it provides us with great libraries to work with Hadoop. Well, the speed is not that great compared to Java, of course, but we love Python not for its speed but for its simplicity and ease of use. And by the way, if you are wondering what the most frequently used word in Mark Lutz's book Learning Python is, not counting things like prepositions, conjunctions and so on: that word was used 3979 times, and the word is, of course, "python". So that's all I've got. You can find the slides and the code I used for the benchmarks on SlideShare and GitHub. Thank you.