For our 51st meetup, we'll dive deep into the trenches of big data.
We all know from experience that big data can be a struggle. Let's help each other and learn by sharing our experiences! That's the core of our community ...
20h15 Profile your Hadoop pipelines (Maxime Kestemont, Criteo)
21h00 I know you didn't test your Airflow DAGs last summer (Mathias Lavaert, DataMinded)
Criteo is a major player in the ad-tech industry and the leader in ad retargeting. Data is at their core: every solution is centered around ML models that determine the right amount to bid for each ad display. And Criteo operates at internet scale, so they have a lot of data. Some numbers: the largest Hadoop cluster in Europe (>3000 machines on premises, growing every few months), ingesting >120TB of new data per day, the biggest Vertica cluster worldwide, etc. The interesting part behind those numbers is that many of the tools companies elsewhere rely on day to day have, over the years, been broken simply by running them at Criteo scale. Maxime Kestemont will talk about how his team refactored their biggest Hadoop job. To do so, they built (and open-sourced) a profiler for large-scale distributed applications (Spark, MapReduce, Scalding, Hive, etc.) to deeply inspect, debug and optimise these pipelines.
Every big data platform has to schedule complex jobs. Airflow is a very common solution, defining jobs as directed acyclic graphs (DAGs). Those DAGs get complex as well. In his talk "I know you didn't test your Airflow DAGs last summer", Mathias Lavaert (DataMinded) goes in depth on why we should test our Airflow DAGs, and how we could; a small sketch of the idea follows below.
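As a taste of the topic (not taken from Mathias's talk), here is a minimal sketch of a DAG integrity test, assuming a pytest setup and a hypothetical dags/ folder: it checks that every DAG file imports cleanly and that each DAG actually contains tasks.

```python
import pytest
from airflow.models import DagBag

@pytest.fixture(scope="session")
def dag_bag():
    # Parse every DAG file under the (hypothetical) dags/ folder once;
    # import errors are collected into a dict instead of being raised.
    return DagBag(dag_folder="dags/", include_examples=False)

def test_no_import_errors(dag_bag):
    # A syntax error, a missing dependency, or (typically) a cycle in a
    # DAG shows up as an entry in import_errors.
    assert dag_bag.import_errors == {}

def test_every_dag_has_tasks(dag_bag):
    # A DAG that parses but ends up empty is almost always a bug.
    for dag_id, dag in dag_bag.dags.items():
        assert len(dag.tasks) > 0, f"{dag_id} has no tasks"
```

Run it with pytest; since the session-scoped fixture parses the folder only once, the suite stays fast even with many DAGs.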
This meetup is kindly hosted by KBC at their offices at the railway station in Leuven.
We are forever grateful to KBC and to you, our community!
PS: because we have to provide the attendee list to security, RSVPs close at noon the day before the event! So RSVP fast!!!