July 8 - Best of CVPR (Day 1)
Details
Welcome to the Best of CVPR series — your virtual front row to groundbreaking research, insights, and innovations from one of computer vision's premier conferences. Live from the authors to you.
Date, Time and Location
Jul 08, 2026
9 AM - 11 AM PT
Online. Register for Zoom!
Some Modalities Are More Equal Than Others: Understanding and Improving Multimodal Integration in MLLMs
Multimodal large language models can process vision, audio, and text, but it remains unclear whether they truly integrate these modalities or rely on shortcut cues. In this talk, I will present our recent work, “Some Modalities Are More Equal Than Others,” where we introduce MMA-Bench, a benchmark designed to probe MLLMs under controlled audio–visual conflict, misleading text, and modality-specific queries. Through black-box evaluation and white-box attention analysis, we show that current MLLMs often struggle when modalities disagree, exhibit model-specific modality biases, and can be distracted by irrelevant textual context. We further propose an alignment-aware tuning strategy that trains models to answer based on the queried modality, improving robustness and multimodal grounding. This talk will highlight both the failure modes of current MLLMs and practical directions toward more reliable cross-modal reasoning.
About the Speaker
Tianle Chen is a Ph.D. student in Computer Science at Boston University, advised by Prof. Deepti Ghadiyaram. His research focuses on multimodal large language models, audio–visual reasoning, robustness, and trustworthy multimodal AI. He is interested in understanding how models allocate evidence across modalities and designing methods that improve reliable multimodal reasoning.
LinkedOut: Linking World Knowledge Out of Video LLMs for Next-Generation Video Recommendation
This CVPR 2026 work draws structured world-knowledge representations out of Video LLMs for next-generation video recommendation. The talk covers how large vision-language models can provide rich semantic priors for video understanding, and how to address the efficiency and deployment challenges of real recommendation systems.
About the Speaker
Haichao Zhang is a Ph.D. candidate in Computer Engineering at Northeastern University. His research focuses on computer vision, vision-language models, video understanding and generation, and efficient multimodal foundation models. He has research experience at Google CoreML, Meta Reality Labs, LinkedIn Video AI, Amazon AWS AI Labs, and Tencent.
CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation
This paper presents CylinderDepth, a self-supervised surround depth estimation method leveraging cylindrical spatial attention for multi-view consistency across camera rigs.
About the Speaker
Samer Abualhanud is a PhD student and research staff member at Leibniz University Hannover, Germany, supervised by Dr.-Ing. Max Mehltretter and Prof. Christian Heipke. Samer's research focuses on multi-view consistency in 3D reconstruction.
Your ViT is Secretly Also a Video Segmentation Model
Existing online video segmentation models typically combine a per-frame segmentation module with complex, specialized tracking modules. This work shows that a plain Vision Transformer encoder with a lightweight temporal module can match their performance. The resulting model, VidEoMT, is 5–10x faster and runs at up to 160 FPS with a ViT-L encoder.
About the Speaker
Daan de Geus is an Assistant Professor in the Mobile Perception Systems Lab at Eindhoven University of Technology (TU/e). He received his PhD (cum laude) from TU/e in 2024, and his research focuses on machine learning for visual and multimodal scene understanding.
