Daft Python Distributed DataFrame & Exploring Computer Vision Data
Details
Want to learn more about Python and meet other Pythonistas?
👉 Submit your 5-, 15-, or 25-minute talk proposals here: https://bit.ly/sfpythoncfp
Join SF Python on https://live.remo.co/e/sf-python-xxxxx
Enjoy a virtual platform that lets you interact with others as you would at an in-person event.
SCHEDULED TALKS
🔎 Lightning talk (5 mins)
Could be you!
Could be you!
Could be you!
🔎 1st talk (~25 mins + Q&A)
Exploring computer vision data using DuckDB with Arrow and Lance
Chang She
For tabular data, DuckDB with Arrow forms the core of a truly serverless data warehousing experience. Why can’t we do the same for working with image and video data? Lance is a new open-source project designed to enable ML engineers and researchers to analyze and query imaging data via SQL. Together with DuckDB and Arrow, Lance gives you the ML infrastructure you need, without needing to manage infrastructure.
The core of Lance is an Arrow-compatible columnar format that delivers blazing-fast performance for computer vision data. The Lance format is optimized for nested data, large blobs, and partial reads from remote storage, and it comes with data versioning out of the box. Lance also supports rich indexing for embeddings and full-text search.
With Lance’s DuckDB extension, you can use SQL for model inference, vector search, model evaluation, and a plethora of other tasks that currently require loads of ad hoc Python scripts and/or third-party services.
In this talk we’ll go over some of the use cases that Lance enables and dive into Lance's design to see how it delivers much faster performance than Parquet for computer vision.
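To give a rough feel for what "vector search in SQL" can look like, here is a minimal sketch. It is not Lance's or DuckDB's API: it uses Python's built-in sqlite3 module as a stand-in SQL engine, and the `images` table, its columns, and the `cosine_similarity` and `nearest_images` functions are all hypothetical names made up for illustration. Storing embeddings as comma-separated text is a simplification for the demo.

```python
import math
import sqlite3


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of two comma-separated float vectors (hypothetical UDF)."""
    x = [float(v) for v in a.split(",")]
    y = [float(v) for v in b.split(",")]
    dot = sum(p * q for p, q in zip(x, y))
    norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return dot / norm if norm else 0.0


def nearest_images(conn: sqlite3.Connection, query: str, k: int = 2) -> list:
    """Return the URIs of the k images whose embeddings are closest to `query`."""
    rows = conn.execute(
        "SELECT uri FROM images "
        "ORDER BY cosine_similarity(embedding, ?) DESC LIMIT ?",
        (query, k),
    ).fetchall()
    return [row[0] for row in rows]


# In-memory database standing in for a real image dataset (assumption).
conn = sqlite3.connect(":memory:")
# Register the Python function so SQL queries can call it by name.
conn.create_function("cosine_similarity", 2, cosine_similarity)
conn.execute("CREATE TABLE images (uri TEXT, embedding TEXT)")
conn.executemany(
    "INSERT INTO images VALUES (?, ?)",
    [
        ("cat.jpg", "1.0,0.0,0.0"),
        ("dog.jpg", "0.9,0.1,0.0"),
        ("car.jpg", "0.0,0.0,1.0"),
    ],
)

print(nearest_images(conn, "1.0,0.0,0.0"))
```

The point of the sketch is only that similarity ranking becomes an ordinary `ORDER BY ... LIMIT` query once the engine exposes a similarity function. A real system would use a vector index rather than scanning every row.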
https://github.com/eto-ai/lance
🔎 2nd talk (~25 mins + Q&A)
Introducing Daft: The Python Distributed DataFrame for "Complex Data"
Jay Chia
DataFrame libraries like pandas and PySpark have made Python the tool of choice for many in the data community. However, when it comes to data that doesn't traditionally fit into tables ("Complex Data" such as images, audio, and video), most teams build bespoke systems.
Daft (www.getdaft.io) is an open-source framework for processing Complex Data using a DataFrame API. It makes it easy to run queries and define heavy computations locally in your notebook for experimentation, and to distribute the same work across a Ray cluster for larger workloads.
1. Pythonic and built for "Complex Data" such as images, video and unstructured documents. Columns of the DataFrame can hold any Python type, such as NumPy vectors, PIL Images or any user-defined type! Daft exposes an easy functional interface for loading, querying and processing this data.
2. Built for both interactive experimentation and distributed computing. Daft is built for a smooth local development experience in a REPL/notebook environment with a dynamic type system and intelligent caching. When running large workloads that require more computing power, it scales up seamlessly to thousands of machines on a cluster using Ray.
3. Built for Machine Learning workloads. Daft is well suited to data curation for ML training and to scaling up large-scale ML inference. It integrates natively with the Ray and PyTorch ecosystems to transport input data efficiently into ML training jobs.
FAQ
👉 How does one network at a virtual event?
https://youtu.be/k87zAKm60UA - join different virtual tables to chat with speakers, find out how others are using Python, and start your own discussion topic. Simply turn on your mic and video when you arrive at the event link, then double-click on different tables to join different conversations.
AGENDA
6:30p Get familiar with remo.co and reconnect with friends!
7:00p Opening remarks, sponsor acknowledgements
7:10p Scheduled talks and Q&A + networking & yoga break
8:30p Wrap up last talk, more networking
THIS EVENT IS PRODUCED BY
SF Python, a volunteer-run organization aiming to foster the Python community in the Bay Area
Video Sponsor is IBM
For over a century, IBM has led world-changing progress by uniting, empowering, and relentlessly reinventing itself and its customers. The IBM Data Science Community is the place for data scientists and developers to learn, share, and engage with their peers and industry-renowned data scientists. Join the IBM Data Science Community and participate in shaping the digital future.
Virtual Platform sponsor is Sauce Labs
Continuous testing is a key enabler of digital confidence: the knowledge that you’re delivering the best possible user experience to your customers. Digitally confident organizations know that their web and mobile applications look, function and perform exactly as intended, every single time they’re used. That’s the value of Sauce Labs.



