Two talks (not necessarily in order):
Food/drinks/merriment provided by Yelp!
Title for #1: Using Beam and Flink to power Yelp’s real-time indexing pipeline
Today, Yelp’s streaming infrastructure powers a variety of critical use cases. The data processing pipelines used by our Ranking Platform team to index documents in Elasticsearch are one such critical system, now fully powered by streaming. Each indexing pipeline is a complex DAG of data transformations that join, filter and manipulate data from different sources. While some of these transformations can be expressed as SQL queries powered by Flink SQL, many others require business logic that has been implemented over the years in Python libraries. Thanks to its Python SDK, Apache Beam allows Yelp to fulfill this use case by providing a rich API for stream data processing. In this talk, we’ll present how we introduced Apache Beam into our Apache Flink-based data pipeline, the challenges we faced along the way, and how we are using Beam to power one of our indexing pipelines in production.
Speaker bios for #1:
Guenther Starnberger is a Software Engineer at Yelp. As Search Infrastructure Tech Lead, he architected and implemented a new microservice-based search stack using Elasticsearch as the primary backend. Prior to that he worked as a Research Assistant at the Distributed Systems Group of the Vienna University of Technology where he also received his doctorate. He currently works on integrating Apache Beam into Yelp’s streaming infrastructure.
Enrico works as a tech lead on the Data Infrastructure team at Yelp, designing, building and maintaining data streaming and real-time processing infrastructure. Since 2013 he has been working on real-time processing systems, designing and scaling Yelp’s data pipeline to move and process, in real time, hundreds of terabytes of data and tens of billions of messages every day. Enrico loves designing robust stream processing solutions that scale, and building tools to make application developers’ interaction with the infrastructure as simple as possible.
Title for #2:
A crash-course introduction to the Apache Beam Python SDK
Apache Beam provides a simple, powerful programming model for building batch and streaming parallel data processing pipelines. The Apache Beam SDK for Python provides access to Apache Beam capabilities from the Python programming language. In this talk we will give a high-level overview of the Python SDK and discuss recent improvements in Beam portability, customizing Python runtime environments, and support for Python 3-style type hints (PEP 484).
The talk may be of interest to both new and seasoned users of Beam, as well as to Pythonistas not (yet) familiar with Beam.
Udi Meiri is a Software Engineer in Google Cloud Platform and an Apache Beam committer. Most recently Udi has been working on strengthening type inference capabilities in the Beam Python SDK, Beam IO connectors, and improving the developer experience.
Valentyn Tymofieiev is a Software Engineer in Google Cloud Platform and an Apache Beam committer. Most recently Valentyn has been coordinating the efforts to offer Python 3 support in Apache Beam and Google Cloud Dataflow.