
What we’re about
This is a meetup for Bay Area users of Apache Spark (http://spark.apache.org), a unified analytics engine for large-scale data processing. We rotate hosting among locations in San Francisco, the Peninsula, and the South Bay.
We discuss Spark and its ecosystem, including Spark SQL, MLlib, GraphX, and Structured Streaming. Meetups also feature introductions to Spark features, tutorials, case studies from users and community contributors, best practices for deployment and tuning, and updates on upcoming development and releases.
Upcoming events (2)
- Apache Spark™ Python Data Source for Hugging Face AI Datasets
Please join us to learn more about the Apache Spark™ Python Data Source for Hugging Face AI Datasets for AI workloads. 🤝
Agenda:
✅ Welcome and Introductions
✅ Talk 1: Overview of Python Data Source in Apache Spark 4.x, Jules S. Damji, Databricks
✅ Talk 2: Presenting the new Hugging Face Data Source for AI Datasets, Quentin Lhoest, Hugging Face
✅ Q&A
Talk 1: Overview of Python Data Source in Apache Spark 4.x
Abstract: In this introductory talk, we will cover the key concepts and motivations for Python Data Source in the recently released Apache Spark 4.0:
🔹 What and Why Python Data Source
🔹 How to write a custom data source
🔹 Examples of implemented data sources
This short introduction will set the context for the following talk on the Python data source for Hugging Face Datasets.
Bio: Jules S. Damji is a developer advocate at Databricks Inc., an MLflow contributor, and a coauthor of Learning Spark, 2nd Edition. He is a hands-on developer with over 25 years of experience. He has worked at leading companies, such as Sun Microsystems, Netscape, @Home, Opsware/LoudCloud, VeriSign, ProQuest, Hortonworks, Anyscale, and Databricks, building large-scale distributed systems. He holds a B.Sc. and an M.Sc. in computer science (from Oregon State University and Cal State, Chico, respectively) and an MA in political advocacy and communication (from Johns Hopkins University).
Talk 2: Presenting the new Hugging Face Data Source for AI Datasets
Abstract: In this talk, we will look into the challenges of working with AI datasets and show how to efficiently load, process, use, and share them with the Hugging Face Data Source for Spark and recent Spark features such as Arrow support.
Bio: Quentin Lhoest is a machine learning and data engineer at Hugging Face. He develops tools for AI builders, with a focus on data libraries and the Hugging Face Dataset Hub. He is the main maintainer of the datasets Python package and a contributor to various AI-related open source projects such as transformers. He previously worked at Feedly in the Bay Area on large-scale natural language processing. He studied in France and holds an M.Sc. in Engineering (from Centrale Paris) and an M.Sc. in Maths (from ENS Paris-Saclay).
- Apache Spark™ and Lance Spark Connector
📅 Date: September 25, 2025
⏰ Time: 9:30 AM - 10:30 AM PST
📍 Location: online
RSVP HERE 👉 https://lu.ma/76o36xuk 👈
Agenda:
- Welcome and Introductions
- Talk 1: Scalable Multimodal AI Data Processing on Apache Spark™ with Lance Spark Connector, Jack Ye, LanceDB
- Q&A
Abstract:
In this talk, we’ll introduce the Lance Spark Connector and show how it brings Lance’s high-performance, AI-native multimodal storage to Apache Spark™ for large-scale data processing. You’ll learn how Spark can leverage Lance’s unique capabilities (random access, built-in indexing, and native support for vector and blob data types) to work seamlessly with embeddings, images, videos, documents, and more.
We’ll explore how the connector integrates with any Spark-compatible catalog, from Hive Metastore to Unity Catalog, enabling unified governance and discovery. Through real-world examples with Spark, we’ll demonstrate running ingestion, analytics, feature engineering, and retrieval-augmented generation workflows directly on the same multimodal Lance dataset, without costly format conversions, making it the ideal solution in a modern multimodal lakehouse.
Bio:
Jack Ye is a software engineer at LanceDB. He is a PMC member of Apache Iceberg and a contributor to various open source projects, including Apache Spark and Trino. Prior to joining LanceDB, Jack was a tech lead at AWS for initiatives including SageMaker Lakehouse, S3 Tables, and the integration of EMR and Athena with open table formats.