Apache Spark™ Python Data Source for Hugging Face AI Datasets


Please join us to learn about the Apache Spark™ Python Data Source for Hugging Face AI Datasets and its use in AI workloads. 🤝
Agenda:
✅ Welcome and Introductions
✅ Talk 1: Overview of Python Data Source in Apache Spark 4.x, Jules S. Damji, Databricks
✅ Talk 2: Presenting the new Hugging Face Data Source for AI Datasets, Quentin Lhoest, Hugging Face
✅ Q&A
Talk 1: Overview of Python Data Source in Apache Spark 4.x
Abstract: In this introductory talk, we will cover the key concepts behind, and the motivations for, the Python Data Source in the recently released Apache Spark 4.0:
🔹 What the Python Data Source is, and why it was introduced
🔹 How to write a custom data source
🔹 Examples of implemented data sources
This short introduction will set the context for the following talk on the Python Data Source for Hugging Face Datasets.
Bio: Jules S. Damji is a developer advocate at Databricks Inc., an MLflow contributor, and a coauthor of Learning Spark, 2nd Edition. He is a hands-on developer with over 25 years of experience. He has worked at leading companies, such as Sun Microsystems, Netscape, @Home, Opsware/LoudCloud, VeriSign, ProQuest, Hortonworks, Anyscale, and Databricks, building large-scale distributed systems. He holds a B.Sc. and an M.Sc. in computer science (from Oregon State University and Cal State, Chico, respectively) and an MA in political advocacy and communication (from Johns Hopkins University).
Talk 2: Presenting the new Hugging Face Data Source for AI Datasets
Abstract: In this talk, we will examine the challenges of working with AI datasets and show how to efficiently load, process, use, and share them with the Hugging Face Data Source for Spark and recent Spark features such as Arrow support.
Bio: Quentin Lhoest is a Machine Learning and Data Engineer at Hugging Face. He develops tools for AI builders with a focus on data libraries and the Hugging Face Dataset Hub. He is the main maintainer of the datasets Python package and a contributor to various AI-related open source projects like transformers. He previously worked at Feedly in the Bay Area on large-scale natural language processing. He studied in France and holds an M.Sc. in Engineering (from Centrale Paris) and an M.Sc. in Maths (from ENS Paris-Saclay).
