
What we’re about
This is a meetup for Bay Area users of Apache Spark (http://spark.apache.org), a unified analytics engine for large-scale data processing. We rotate hosting meetups among locations in San Francisco, Peninsula, and South Bay.
We discuss Spark and its related ecosystem projects, including Spark SQL, MLlib, GraphX, and Structured Streaming. Meetups also include introductions to Spark features, tutorials, case studies from users and community contributors, best practices for deployment and tuning, and updates on future development and releases.
Upcoming events (1)
Apache Spark™ and Lance Spark Connector
📅 Date: September 25, 2025
⏰ Time: 9:30 AM - 10:30 AM PST
📍 Location: online
RSVP HERE 👉 https://lu.ma/76o36xuk 👈
Agenda:
- Welcome and Introductions
- Talk 1: Scalable Multimodal AI Data Processing on Apache Spark™ with Lance Spark Connector, Jack Ye, LanceDB
- Q&A
Abstract:
In this talk, we’ll introduce the Lance Spark Connector and show how it brings Lance’s high-performance, AI-native multimodal storage to Apache Spark™ for large-scale data processing. You’ll learn how Spark can leverage Lance’s unique capabilities (random access, built-in indexing, and native support for vector and blob data types) to work seamlessly with embeddings, images, videos, documents, and more.
We’ll explore how the connector integrates with any Spark-compatible catalog, from Hive Metastore to Unity Catalog, enabling unified governance and discovery. Through real-world examples with Spark, we’ll demonstrate running ingestion, analytics, feature engineering, and retrieval-augmented generation workflows directly on the same multimodal Lance dataset, without costly format conversions, making it the ideal solution in a modern multimodal lakehouse.
Bio:
Jack Ye is a software engineer at LanceDB. He is a PMC member of Apache Iceberg and a contributor to various open source projects, including Apache Spark and Trino. Prior to joining LanceDB, Jack was a tech lead at AWS for initiatives including SageMaker Lakehouse, S3 Tables, and EMR & Athena integration with open table formats.