Details

Welcome to the Best of CVPR series — your virtual front row to groundbreaking research, insights, and innovations from one of computer vision's premier conferences. Live from the authors to you.

Date, Time and Location

Jul 10, 2026
9 AM - 11 AM PT
Online. Register for Zoom!

Advancing Generative Quality and Reasoning in Multimodal AI

This talk exposes hidden limitations of frontier multimodal models across reasoning and visual generation, demonstrates the inherent brittleness of VLMs and audio-visual MLLMs, and introduces simple yet effective techniques to build robustness. It also covers human-centric metrics for perceptually accurate evaluation of generative media.

About the Speaker

Deepti Ghadiyaram is an Assistant Professor of Computer Science at Boston University. Her research focuses on building safe, interpretable, and robust computer vision systems with advanced reasoning capabilities. Before joining BU, she worked at Runway and Meta AI. She earned her Ph.D. from UT Austin in 2017.

HyperRealm: Hyperbolic Vision Language Models for Real-World Hierarchical Multimodal Understanding

Real-world multimodal data naturally exhibits hierarchical structure, yet standard VLMs like CLIP align images and text in Euclidean space, which cannot preserve tree-like hierarchies. HyperRealm embeds images and text in a Poincaré ball to encode hierarchical relationships, introducing an adaptive entropy-driven entailment loss. Evaluated on 18 zero-shot classification benchmarks, it shows consistent improvements over Euclidean CLIP baselines.

About the Speaker

Kathy Wu holds a Ph.D. in Applied Mathematics from USC. She is currently an Applied Scientist at Amazon within the Global Store organization, leading projects in e-commerce recommendation, multimodal VLMs, and LLM/GenAI applications. Her research has been published at ICCV, CVPR, ICLR, SIGIR, and WACV.

Cross-Modal Domain Adaptation using Semantic Parametric Mapping

XD-MAP is a framework that transfers semantic knowledge from image datasets to LiDAR by constructing semantic parametric maps from monocular detections and geometric priors. Unlike previous approaches, XD-MAP does not require overlapping sensor views and enables scalable 360° supervision for LiDAR perception without manual annotation.

About the Speaker

Frank Bieder is a researcher in computer vision and autonomous systems, leading the Visual and Spatial Learning group at FZI Research Center for Information Technology. His research covers multimodal perception, map-based learning, and cross-sensor domain adaptation for autonomous driving. He received his Ph.D. from KIT in 2026.

WalkGPT: Pixel-Grounded Navigation Guidance for Pedestrians

Pedestrian navigation requires more than generic scene description; users need to understand walkable areas, obstacles, and the distance of surrounding objects. In this talk, I will present WalkGPT, a grounded vision-language model for accessibility-aware pedestrian navigation. WalkGPT connects language reasoning with segmentation masks and object-level distance estimates to generate grounded navigation guidance from pedestrian-view images. I will also introduce PAVE, a 41k-sample benchmark for depth-aware accessibility reasoning in real pedestrian environments. The talk will highlight how grounded multimodal AI can support safer and more interpretable pedestrian assistance.

About the Speaker

Rafi Ibn Sultan is a Ph.D. researcher in Computer Science at Wayne State University, working on computer vision, multimodal AI, and vision-language models. His research focuses on grounded and interpretable AI systems for real-world visual reasoning, including pedestrian navigation and medical image segmentation. His recent work includes WalkGPT, accepted at CVPR 2026, and GeoSAM, accepted at ECAI 2025.

Related topics

Artificial Intelligence
Computer Vision
Machine Learning
Open Source

Sponsors

Voxel51

Administration, promotion, giveaways and charitable contributions.
