Details

Join our in-person meetup to hear talks from experts on cutting-edge topics across AI, ML, and computer vision.

Register to reserve your spot!

Date, Time and Location

Jul 15, 2026
5:30 PM - 8:30 PM PT

Union.ai Offices
400 112th Ave NE #115
Bellevue, WA 98004

Building Foundation Models for Robotic Perception

3D spatial understanding is a critical skill for robotics, but it typically requires tedious manual design, expensive data collection, and per-domain training. This presentation focuses on the development and application of foundation models that address several fundamental challenges in robotic perception, and on how they facilitate robotic loco-manipulation skills.

First, we introduce FoundationStereo (CVPR'25 best paper candidate), a novel architecture optimized for zero-shot performance. The model leverages a 1M-pair self-curated synthetic dataset, bridges the sim-to-real gap using monocular priors, and incorporates an advanced filtering module for long-range context reasoning.

Second, we address its computational bottlenecks with Fast-FoundationStereo (CVPR'26). We propose a "divide-and-conquer" acceleration strategy that retains the teacher model's robustness while achieving a 10x speedup, making it suitable for real-time applications.

About the Speaker

Bowen Wen is a Staff Research Scientist at NVIDIA Research. His research areas include robotic perception and computer vision. More recently, he has focused on large foundation models for 3D visual perception and learning to facilitate robotics or embodied AI.

STELLAR: Learning Sparse Visual Concepts for Unified Vision Models

Modern vision models often split into two regimes: models that learn strong semantics for recognition, and models that preserve spatial detail for reconstruction.

In this talk, we present STELLAR, a self-supervised framework for learning sparse visual concepts as a unified representation for vision models. The key idea is to factorize visual features into semantic concept tokens (the "what") and spatial assignment maps (the "where"), allowing the model to align concepts across views while preserving the geometry needed for reconstruction.

This sparse, low-rank representation creates a compact interface that supports recognition, dense prediction, and image reconstruction, while also suggesting future directions for efficient visual encoding, video self-supervision, generative modeling, and world-model-style visual reasoning.
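As a rough illustration of the "what" and "where" factorization described above, the following minimal sketch rebuilds per-location features as a weighted mix of a few concept tokens. It is not STELLAR's actual implementation: the function name `reconstruct_features`, the toy values, and the plain matrix-product form are all assumptions made for illustration.

```python
from typing import List

Matrix = List[List[float]]


def reconstruct_features(assignments: Matrix, concepts: Matrix) -> Matrix:
    """Return assignments @ concepts: one feature vector per spatial location.

    assignments: N x K weights (the "where"), one row per spatial location.
    concepts:    K x D tokens (the "what"), one row per concept.
    The result is N x D and has rank at most K, which is what makes the
    representation low-rank when K is much smaller than N and D.
    """
    num_concepts = len(concepts)
    dim = len(concepts[0])
    features = []
    for weights in assignments:
        if len(weights) != num_concepts:
            raise ValueError("each location needs one weight per concept")
        row = [
            sum(w * concepts[k][d] for k, w in enumerate(weights))
            for d in range(dim)
        ]
        features.append(row)
    return features


# Hypothetical toy values: 2 concept tokens, each a 3-dimensional vector.
concepts = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
# 4 spatial locations; most rows use a single concept (sparse assignment).
assignments = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

features = reconstruct_features(assignments, concepts)
for location, feature in enumerate(features):
    print(location, feature)
```

In this toy setup the first two locations share one concept, the third uses the other, and the fourth blends both, so four feature vectors are described by only two tokens plus a small assignment table.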

We discuss the core method, empirical results, and why concept-centric visual representations may be a useful building block for the next generation of unified vision systems.

About the Speaker

Theodore Zhao is a researcher working on multimodal foundation models, visual representation learning, and biomedical AI. His recent work focuses on learning compact, interpretable, and grounded visual representations that support both semantic understanding and reconstruction.

Motivation and Challenges in Working with Multimodal Time Series Data

Physical AI is having its moment, with many companies and research teams focused on this space. But working with physical AI data means wrestling with high-cardinality, complex, unsynchronized multi-sensor streams that are hard to explore, align, and curate. In this talk, we will break down the challenges that come with multimodal time series data, then look at the research directions the industry is pursuing: the ones that become possible once you can actually work with your data effectively.

About the Speaker

Prerna Dhareshwar is a Machine Learning Engineer at Voxel51, where she helps customers leverage FiftyOne to accelerate dataset curation, model development, and evaluation in real-world AI workflows. She brings extensive experience building and deploying computer vision and machine learning systems across industries.

Orchestrating Scalable AI Workflows with Flyte and Union.ai

Modern AI systems require infrastructure that can reliably orchestrate training, inference, and production workflows at scale. This session explores approaches to AI orchestration, distributed compute, and resilient ML infrastructure for real-world machine learning and computer vision applications.

Topics may include production AI pipelines, workflow automation, scalable deployment strategies, and operating AI systems securely within cloud environments. Attendees will gain a high-level look at emerging patterns shaping the next generation of AI infrastructure and operational workflows.

About the Speaker

Sage Elliott is an AI Engineer at Union.ai (core maintainers of Flyte).

Sponsors

Voxel51

Administration, promotion, giveaways and charitable contributions.