Details

Welcome back to another year of open source programming with PyData NYC 🪩 We've got a new venue partnership with St. John's at 101 Astor Pl, New York, NY 10003. Join us on Feb 17th at 6:30 pm for a talk/demo night with Ming Zhao (IBM) and Andy Walner/Chandra Krishnan from Onehouse.

Please bring your 💻 to code, and sign up with your legal name as it appears on your government-issued ID.

๐Ÿ• Pizza & drinks sponsored by IBM - thank you!

Agenda:
Unlocking Document Intelligence with Docling
Speaker: Ming Zhao (Developer Advocate at IBM)

Most organizational knowledge is still locked inside complex documents, making it difficult to extract and use the information effectively. Traditional tools often fail when working with real-world document formats, particularly PDFs. Tables lose their structure, figures get separated from captions, and multi-column layouts become unreadable text. These failures make it difficult to bring AI to document-heavy workflows. Docling is an open-source project that takes a different approach, using deep learning models to parse documents the way humans read them. It preserves hierarchy, extracts structured data through a consistent API, and supports 15+ file formats out of the box. In this session we'll explore how you can leverage Docling in your own AI workflows.

From OLAP to AI: How Hudi Brings Vector Search Directly to the Data Lakehouse
Speakers: Andy Walner (Product Manager at Onehouse) and Chandra Krishnan (Sales Engineering at Onehouse)

Vector search is rapidly becoming table stakes for AI workloads, but most teams are forced to bolt a separate vector database onto their lakehouse. In this talk, we introduce a new capability in Apache Hudi that brings vector support directly into the data lake, merging large-scale analytics and AI workloads in a single system.

We will demo native vector search on Hudi tables using PySpark, including a new vector search function that runs directly on lake data. You will see how swapping the base file format from Parquet to Lance unlocks better support for unstructured data and faster vector retrieval, while preserving warehouse-style analytics on the same tables.

This approach enables use cases like RAG, similarity search, and AI training directly on existing OLAP data, without duplicating data or introducing new storage systems. The design is engine-agnostic. While the demo uses PySpark, the same data can be processed with Ray, Daft, or other compute engines, pointing toward a single lakehouse architecture that supports both structured and unstructured data for analytics and AI.

Networking
Connect with fellow data enthusiasts, professionals, and community leaders. Build meaningful connections and forge collaborations.
----------------------------------------------------------------
Doors open @ 6 pm
Doors close @ 7 pm
Event @ 6:30 - 8:30 pm
Venue provided by St. John's: 101 Astor Pl, New York, NY 10003
----------------------------------------------------------------
The building requires a government-issued photo ID for entrance. This, like all PyData NYC events, is open to all levels; newcomers and beginners are welcome. This and all NumFOCUS-affiliated events and spaces, both in-person and online, are governed by a Code of Conduct.
----------------------------------------------------------------
This event may be recorded.
