What we're about

This virtual group is for data scientists, machine learning engineers, and open source enthusiasts who want to expand their knowledge of computer vision and complementary technologies. Every month we’ll bring you two diverse speakers working at the cutting edge of computer vision.

What’s computer vision? It’s how systems can derive meaningful information from digital images, videos and other visual inputs — and how they can take actions or make recommendations based on that information.

Use cases for computer vision include autonomous vehicles, facial recognition, inventory management, medical imaging, and more.

  • Are you interested in speaking at a future Meetup?
  • Is your company interested in sponsoring a Meetup?

Contact the Meetup organizers!

This Meetup is sponsored by Voxel51, the lead maintainer of the open source FiftyOne computer vision toolset. To learn more about FiftyOne, visit the project page on GitHub: https://github.com/voxel51/fiftyone

Upcoming events (4)

Getting Started with FiftyOne Workshop (Americas)

Network event

23 attendees from 12 groups hosting

Zoom Registration
https://voxel51.com/computer-vision-events/getting-started-with-fiftyone-workshop-jun-28/

About the Workshop
Want greater visibility into the quality of your computer vision datasets and models? Then join Jacob Marks, PhD, of Voxel51 for this free 90-minute, hands-on workshop to learn how to leverage the open source FiftyOne computer vision toolset.

In the first part of the workshop we’ll cover:

  • FiftyOne Basics (terms, architecture, installation, and general usage)
  • An overview of useful workflows to explore, understand, and curate your data
  • How FiftyOne represents and semantically slices unstructured computer vision data

The second half will be a hands-on introduction to FiftyOne, where you will learn how to do the following (sketched in code after the list):

  • Load datasets from the FiftyOne Dataset Zoo
  • Navigate the FiftyOne App
  • Programmatically inspect attributes of a dataset
  • Add new sample and custom attributes to a dataset
  • Generate and evaluate model predictions
  • Save insightful views into the data
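
For a taste of what the hands-on portion covers, here is a minimal sketch of that workflow using FiftyOne's public Python API (the "quickstart" zoo dataset and the "reviewed" field are illustrative choices, not part of the workshop materials):

```python
import fiftyone as fo
import fiftyone.zoo as foz

# Load a small sample dataset from the FiftyOne Dataset Zoo
dataset = foz.load_zoo_dataset("quickstart")

# Programmatically inspect attributes of the dataset
print(dataset)          # name, media type, and sample fields
print(dataset.first())  # the first sample and its field values

# Add a custom attribute to every sample
for sample in dataset:
    sample["reviewed"] = False
    sample.save()

# Save an insightful view into the data: the 25 most "unique" samples
view = dataset.sort_by("uniqueness", reverse=True).limit(25)
dataset.save_view("most_unique", view)

# Launch the FiftyOne App to browse the dataset interactively
session = fo.launch_app(dataset)
```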

Prerequisites
A working knowledge of Python and basic computer vision. All attendees will get access to the tutorials, videos, and code examples used in the workshop.

July '23 Computer Vision Meetup (Virtual - EU and Americas)

Network event

32 attendees from 13 groups hosting

Zoom Link

https://voxel51.com/computer-vision-events/july-2023-computer-vision-meetup/

Unleashing the Potential of Visual Data: Vector Databases in Computer Vision

Discover the game-changing role of vector databases in computer vision applications. These specialized databases excel at handling unstructured visual data, thanks to their robust support for embeddings and lightning-fast similarity search. Join us as we explore advanced indexing algorithms and showcase real-world examples in healthcare, retail, finance, and more using the FiftyOne engine combined with the Milvus vector database. See how vector databases unlock the full potential of your visual data.
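
As a rough illustration of the kind of integration the talk covers, FiftyOne's brain module can back a similarity index with Milvus; the sketch below assumes a running Milvus server that FiftyOne is configured to reach, with the zoo "quickstart" dataset standing in for your own data:

```python
import fiftyone.brain as fob
import fiftyone.zoo as foz

# Stand-in dataset; in practice this would be your own visual data
dataset = foz.load_zoo_dataset("quickstart")

# Compute embeddings and index them in a Milvus collection
fob.compute_similarity(
    dataset,
    brain_key="milvus_index",
    backend="milvus",
)

# Similarity search: the 15 samples most similar to a chosen image
query_id = dataset.first().id
view = dataset.sort_by_similarity(query_id, k=15, brain_key="milvus_index")
```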

Speaker

Filip Haltmayer is a Software Engineer at Zilliz working in both software and community development.

Computer Vision Applications at Scale with Vector Databases

Vector databases enable semantic search at scale over hundreds of millions of unstructured data objects. In this talk, I will show how you can use multi-modal encoders with the Weaviate vector database to semantically search over images and text, including demos across multiple domains such as e-commerce and healthcare.
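
Here is a rough sketch of what text-to-image semantic search can look like with the Weaviate Python client (v3-style API); the "Product" class, its properties, and a locally running instance with a CLIP-style multi-modal vectorizer module enabled are all assumptions for illustration:

```python
import weaviate

# Connect to a local Weaviate instance (assumed to have a multi-modal
# vectorizer module enabled so text and images share one vector space)
client = weaviate.Client("http://localhost:8080")

# Because images and text are embedded into the same space, a natural
# language query can retrieve matching images
result = (
    client.query
    .get("Product", ["name", "image"])
    .with_near_text({"concepts": ["red running shoes"]})
    .with_limit(5)
    .do()
)
print(result)
```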

Speaker

Zain Hasan is a senior developer advocate at Weaviate, an open source vector database.

Reverse Image Search for Ecommerce Without Going Crazy

Traditional full-text search engines have been on the market for a while, and we are all currently trying to extend them with semantic search. Still, it might be more beneficial for some ecommerce businesses to introduce reverse image search capabilities instead of relying on text only. However, semantic search and reverse image search can and should coexist! You may encounter common pitfalls while implementing both, so why don't we discuss the best practices? Let's learn how to extend your existing search system with reverse image search, without getting lost in the process!
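
As a sketch of the moving parts, reverse image search with the Qdrant Python client might look like the following; the collection name, the 512-dimensional embedding size (matching a common CLIP image encoder), and the placeholder vectors are illustrative assumptions:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

# Connect to a Qdrant instance (assumed running locally)
client = QdrantClient("localhost", port=6333)

# Create a collection sized for the image embeddings
client.recreate_collection(
    collection_name="products",
    vectors_config=VectorParams(size=512, distance=Distance.COSINE),
)

# Index embeddings produced by your image encoder (placeholder vector here)
client.upsert(
    collection_name="products",
    points=[PointStruct(id=1, vector=[0.0] * 512, payload={"sku": "A-100"})],
)

# Reverse image search: embed the query image with the same encoder,
# then fetch its nearest neighbors
hits = client.search(
    collection_name="products",
    query_vector=[0.0] * 512,  # replace with the query image's embedding
    limit=5,
)
```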

Speaker

Kacper Łukawski is a Developer Advocate at Qdrant, an open-source neural search engine.

Fast and Flexible Data Discovery & Mining for Computer Vision at Petabyte Scale

Improving model performance requires methods to discover computer vision data, sometimes from large repositories, whether it's examples similar to previously seen errors, new examples and scenarios, or more advanced techniques such as active learning and RLHF. LanceDB makes this fast and flexible for multi-modal data, with support for vector search, SQL, Pandas, Polars, Arrow, and a growing ecosystem of tools you're already familiar with. We'll walk through some common search examples and show how you can find needles in a haystack to improve your metrics!
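
To make "fast and flexible" concrete, here is a small hedged sketch with the LanceDB Python API; the table schema, file paths, and toy two-dimensional embeddings are invented for illustration:

```python
import lancedb
import pandas as pd

# Connect to a local LanceDB database (a directory on disk)
db = lancedb.connect("./cv_data")

# Create a table from a DataFrame of toy image embeddings and metadata
df = pd.DataFrame({
    "vector": [[0.1, 0.2], [0.3, 0.4]],  # toy 2-dim embeddings
    "uri": ["img_001.jpg", "img_002.jpg"],
    "label": ["pedestrian", "cyclist"],
})
table = db.create_table("images", data=df)

# Vector search combined with an SQL-style filter, returned as Pandas:
# nearest neighbors to a query embedding among "pedestrian" rows only
results = (
    table.search([0.15, 0.25])
    .where("label = 'pedestrian'")
    .limit(5)
    .to_pandas()
)
print(results)
```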

Speaker

Jai Chopra is Head of Product at LanceDB.

How-To Build Scalable Image and Text Search for Computer Vision Data using Pinecone and Qdrant

Have you ever wanted to find the images most similar to an image in your dataset? What if you haven't picked out an illustrative image yet, but you can describe what you are looking for using natural language? And what if your dataset contains millions, or even tens of millions, of images? In this talk, Jacob will show you step by step how to integrate all the technology required to enable search for similar images and search with natural language, and how to scale those searches with Pinecone and Qdrant. He'll dive deep into the tech and show you a variety of practical examples that can help transform the way you manage your image data.
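
For a flavor of how these pieces can fit together, the sketch below uses FiftyOne's brain module with a CLIP model so the index supports both image and natural-language queries; backend="pinecone" assumes configured Pinecone credentials, and backend="qdrant" would work analogously against a running Qdrant instance:

```python
import fiftyone.brain as fob
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")  # stand-in for your images

# Build a CLIP-powered similarity index backed by Pinecone
fob.compute_similarity(
    dataset,
    model="clip-vit-base32-torch",
    brain_key="pinecone_index",
    backend="pinecone",
)

# Image-to-image: samples most similar to a chosen image
query_id = dataset.first().id
similar = dataset.sort_by_similarity(query_id, k=25, brain_key="pinecone_index")

# Text-to-image: describe what you're looking for in natural language
matches = dataset.sort_by_similarity(
    "kites flying in the sky", k=25, brain_key="pinecone_index"
)
```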

Speaker

Jacob Marks is a Machine Learning Engineer and Developer Evangelist at Voxel51.

July '23 Computer Vision Meetup (Virtual - APAC)

Network event

4 attendees from 13 groups hosting

Zoom Link

https://us02web.zoom.us/webinar/register/WN_2H2Kjg8vQoqI0xyMQVLeOw

MARLIN: Masked Autoencoder for Facial Video Representation LearnINg

This talk proposes a self-supervised approach to learning universal facial representations from videos that transfer across a variety of facial analysis tasks such as Facial Attribute Recognition (FAR), Facial Expression Recognition (FER), DeepFake Detection (DFD), and Lip Synchronization (LS). Our proposed framework, named MARLIN, is a facial video masked autoencoder that learns highly robust and generic facial embeddings from abundantly available, non-annotated, web-crawled facial videos. As a challenging auxiliary task, MARLIN reconstructs the spatio-temporal details of the face from densely masked facial regions, which mainly include the eyes, nose, mouth, lips, and skin, to capture local and global aspects that in turn help encode generic and transferable features. Through a variety of experiments on diverse downstream tasks, we demonstrate MARLIN to be an excellent facial video encoder and feature extractor that performs consistently well across FAR (1.13% gain over supervised benchmark), FER (2.64% gain over unsupervised benchmark), DFD (1.86% gain over unsupervised benchmark), and LS (29.36% gain in Fréchet Inception Distance), even in low-data regimes.
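
As a purely conceptual sketch of the masked-autoencoder objective described above (not the authors' code: MARLIN's actual masking is guided toward facial regions, while the plain random masking and the encoder/decoder signatures here are simplifications for illustration):

```python
import torch
import torch.nn.functional as F

def mae_step(encoder, decoder, patches, mask_ratio=0.9):
    """One masked-autoencoding step over flattened video patches.

    patches: (batch, num_patches, patch_dim) spatio-temporal patches
    """
    B, N, _ = patches.shape
    num_masked = int(mask_ratio * N)

    # Randomly split patches into masked (hidden) and visible sets
    perm = torch.rand(B, N).argsort(dim=1)
    masked_idx, visible_idx = perm[:, :num_masked], perm[:, num_masked:]
    batch = torch.arange(B).unsqueeze(1)

    latent = encoder(patches[batch, visible_idx])  # encode visible patches only
    pred = decoder(latent, masked_idx)             # reconstruct the hidden ones

    # The loss is computed only on the masked regions, forcing the
    # encoder to capture transferable local and global structure
    target = patches[batch, masked_idx]
    return F.mse_loss(pred, target)
```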

Zhixi Cai is a Ph.D. student in the Data Science and Artificial Intelligence Department of the Monash University IT Faculty, supervised by Dr. Munawar Hayat, Dr. Kalin Stefanov, and Dr. Abhinav Dhall. His research interests include computer vision, deepfakes, and affective computing.

Aug '23 Computer Vision Meetup (Virtual - APAC)

Network event

1 attendee from 13 groups hosting

Zoom Link

https://us02web.zoom.us/webinar/register/WN_6Qthi0A8QvGcAVlLmxImqQ

Removing Backgrounds Automatically or with the User's Native Language

Image matting, also known as background removal, refers to extracting the accurate foreground from an image, which benefits many downstream applications such as film production and augmented reality. To solve this ill-posed problem, previous methods required extra user inputs involving large amounts of manual effort, such as trimaps or scribbles. In this session, we will introduce our research, which allows users to remove the background automatically or even flexibly select a specific foreground using their native language. We'll also show some fancy demos and illustrate some downstream applications.
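
The compositing equation underlying matting is I = αF + (1 − α)B, where α is the per-pixel opacity of the foreground F over the background B. Below is a minimal NumPy sketch of using a predicted alpha matte to swap backgrounds; it treats the original image as an approximation of the foreground, which holds where α is close to 0 or 1:

```python
import numpy as np

def replace_background(image, alpha, new_background):
    """Composite a matted foreground onto a new background.

    image, new_background: (H, W, 3) float arrays in [0, 1]
    alpha: (H, W) matte in [0, 1], predicted by a matting model
    """
    a = alpha[..., None]  # broadcast the matte over the color channels
    return a * image + (1.0 - a) * new_background
```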

Jizhizi Li recently completed her Ph.D. in Artificial Intelligence at the University of Sydney. She has published several papers in top-tier conferences and journals, including CVPR, IJCV, IJCAI, and Multimedia; her research interests include computer vision, image matting, multi-modal learning, and AIGC.

Past events (17)

June '23 Computer Vision Meetup (Virtual - EU and Americas)