We typically have a main presentation or a series of lightning talks, followed by discussion and Q&A. Attendees come from a wide range of domains and experience levels, so bring your questions and be prepared to talk about how you use Python!
Talk 1: Samuel Oranyeli - Helping Pandas with Pyjanitor
Talk 2: Niels Bantilan - Pandera: A Statistical Data Testing Toolkit for Dataframe-like Objects
Talk 1: Helping Pandas with Pyjanitor
Pyjanitor aims to make data cleaning in Pandas easier. It offers verb-like methods that abstract common cleaning and wrangling steps, while remaining chainable and interoperable with Pandas.
Samuel Oranyeli is a Senior Engineer at Slalom Australia who loves wrangling data. Find him on Stack Overflow (@sammywemmy).
Talk 2: Pandera: A Statistical Data Testing Toolkit for Dataframe-like Objects
Data manipulation is a core part of any computational process. Whether it’s processing data for business analytics reports, statistical scientific studies, or predictive machine learning models, data needs to be reshaped into a form intended for a particular use case. Data testing is the act of validating not only data but also the functions that produce those data based on a priori assumptions obtained through domain expertise or exploratory analysis.
This talk will dive deep into Pandera, a data testing toolkit for dataframe-like objects in Python, including pandas, modin, dask, and koalas. We’ll cover the basics of defining schemas, creating custom checks, and type-checking dataframes in functions. We’ll also introduce you to more advanced data testing concepts like property-based testing, data profiling, and statistical hypothesis testing. Finally, this talk will outline the roadmap for the project and highlight newly released integrations with other libraries in the Python ecosystem. By the end of this talk you’ll be able to define your own schemas, validate dataframes flowing through your data pipelines, and create property-based unit tests using the tools provided by Pandera.
Niels is a machine learning engineer and core maintainer of Flyte, an open source ML orchestration tool, and author and maintainer of Pandera, a data testing tool for dataframes. He has a Master's in Public Health with a specialization in sociomedical science and public health informatics, and prior to that a background in developmental biology and immunology. His research interests include reinforcement learning, AutoML, creative machine learning, and fairness, accountability, and transparency in automated systems. He enjoys developing open source tools for improving data science and machine learning practice.