Details

The Best of CVPR is a three-day virtual meetup series featuring researchers presenting their accepted papers from the 2026 Conference on Computer Vision and Pattern Recognition (CVPR).

Date, Time and Location

Jul 09, 2026
9 AM - 11 AM PT
Online. Register for Zoom!

Efficient Representation and Coding of Dynamic Light Fields

This talk presents a data-driven approach that integrates aperture and pixel-wise exposure coding with Dynamic Mode Decomposition (DMD) to achieve a compact representation of dynamic light fields. By modeling dynamic light fields as dynamical systems, the framework captures coherent structures across all dimensions and achieves scalable compression, bitrate savings, and high-quality reconstructions.
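For readers unfamiliar with DMD, the following is a minimal sketch of the standard "exact DMD" algorithm in NumPy, assuming each time frame has been flattened into one column of a snapshot matrix. It is a generic illustration, not the speaker's pipeline: the aperture and exposure coding are not modeled, and the names `exact_dmd`, `reconstruct`, and the toy `frames` data are hypothetical.

```python
import numpy as np

def exact_dmd(snapshots, rank):
    """Rank-truncated exact DMD of a (features x time) snapshot matrix."""
    X, Y = snapshots[:, :-1], snapshots[:, 1:]        # frame k and frame k+1
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]
    V = Vh.conj().T
    # Linear operator Y ~ A X projected onto the leading singular vectors.
    A_tilde = U.conj().T @ Y @ V / s                  # "/ s" scales columns, i.e. right-multiplies by S^-1
    eigvals, W = np.linalg.eig(A_tilde)
    modes = (Y @ V / s) @ W                           # spatial DMD modes
    # Mode amplitudes that best explain the first frame.
    amplitudes = np.linalg.lstsq(modes, snapshots[:, 0], rcond=None)[0]
    return modes, eigvals, amplitudes

def reconstruct(modes, eigvals, amplitudes, n_steps):
    """Rebuild frames as sum_j modes[:, j] * amplitudes[j] * eigvals[j]**k."""
    powers = eigvals[:, None] ** np.arange(n_steps)   # (rank, time)
    return (modes * amplitudes) @ powers

# Toy data (an assumption for illustration): two spatial patterns that
# decay at different rates, observed over 10 frames of 50 values each.
rng = np.random.default_rng(0)
patterns = rng.standard_normal((50, 2))
rates = np.array([0.9, 0.5])
frames = patterns @ (rates[:, None] ** np.arange(10))

modes, eigvals, amplitudes = exact_dmd(frames, rank=2)
approx = reconstruct(modes, eigvals, amplitudes, n_steps=10)
```

In this toy setting the sequence is summarized by two modes, two eigenvalues, and two amplitudes instead of ten full frames, which is the sense in which a low-rank dynamical model gives a compact representation.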

About the Speaker

Joshitha Ravishanker is a PhD scholar in the Department of Electrical Engineering at IIT Madras, supervised by Dr. Mansi Sharma and Dr. Kaushik Mitra. She is a Prime Minister's Research Fellow, and her doctoral research focuses on the efficient representation and compression of light fields for display applications.

PHANTOM: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics

Recent video generation models can produce visually striking results, but they often fail to capture the physical dynamics that govern how real-world scenes evolve. In this talk, I will present PHANTOM, a physics-infused video generation model that jointly predicts visual content and latent physical dynamics. PHANTOM uses a physics-aware video representation to guide generation toward videos that are both visually realistic and physically consistent, without requiring explicit simulator-based physical specifications. I will discuss the model design, key results on standard and physics-aware video generation benchmarks, and how this work supports broader progress toward multimodal world models for physical AI and embodied reasoning.

About the Speaker

Ismini Lourentzou is an Assistant Professor at the University of Illinois Urbana-Champaign and Director of the Perception and LANguage Lab. Her research focuses on multimodal machine learning, vision-language models, generative modeling, and embodied AI, with applications in physical reasoning, robotics, healthcare, and trustworthy AI.

LoST: Level of Semantics Tokenization for 3D Shapes

Tokenization is fundamental to generative modeling and especially important for autoregressive 3D generation. However, current 3D shape tokenizers rely on geometric level-of-detail hierarchies that are token-inefficient and poorly aligned with semantic structure. We propose Level-of-Semantics Tokenization (LoST), which orders tokens by semantic salience so early tokens produce complete, plausible shapes and later tokens refine detailed geometry and semantics.

LoST is trained with Relational Inter-Distance Alignment (RIDA), a semantic alignment loss that matches relationships in 3D shape latent space to those in DINO feature space. Experiments show that LoST achieves state-of-the-art reconstruction and efficient high-quality AR 3D generation while using only 0.1%–10% of the tokens required by prior methods.
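The abstract does not give the RIDA formulation, so the following is only a generic sketch of what a relational distance-matching loss can look like: it compares scale-normalized pairwise distances within a batch of shape latents to those within a batch of image features. The function names and the normalization choice are assumptions for illustration, not the authors' definition.

```python
import math

def pairwise_distances(points):
    """Euclidean distance between every pair of vectors in a batch."""
    return [[math.dist(p, q) for q in points] for p in points]

def normalize(matrix):
    """Divide by the mean off-diagonal distance so the two spaces are
    compared up to scale (assumes the points are not all identical)."""
    n = len(matrix)
    mean = sum(sum(row) for row in matrix) / (n * (n - 1))
    return [[value / mean for value in row] for row in matrix]

def relational_alignment_loss(shape_latents, image_features):
    """Mean squared difference between the two normalized distance matrices.

    The two batches may have different dimensionality; only the
    relationships between samples are compared, not the vectors themselves.
    """
    d_shape = normalize(pairwise_distances(shape_latents))
    d_image = normalize(pairwise_distances(image_features))
    n = len(shape_latents)
    total = sum(
        (d_shape[i][j] - d_image[i][j]) ** 2
        for i in range(n) for j in range(n) if i != j
    )
    return total / (n * (n - 1))

# Hypothetical batches: 2-D "shape latents" and 3-D "image features".
latents = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
same_layout = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
other_layout = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]

aligned = relational_alignment_loss(latents, same_layout)
misaligned = relational_alignment_loss(latents, other_layout)
```

Here `aligned` is close to zero because the two batches have the same relative geometry, while `misaligned` is larger. In training, a loss of this kind would be minimized with respect to the encoder producing the shape latents.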

About the Speaker

Niladri Dutt is an ELLIS PhD student at University College London (UCL), sponsored by Adobe Research. He is advised by Prof. Niloy Mitra (UCL) and Duygu Ceylan (Adobe). His research interests are in representation learning for 3D and multimodal learning.

3D Reconstruction Improves Weakly-Supervised Semantic Segmentation

Semantic segmentation typically requires expensive, dense annotations, making large-scale training a significant bottleneck. We address this by introducing a framework that leverages recent advances in feed-forward 3D reconstruction to improve weakly supervised semantic segmentation on 2D images, using only sparse labels such as points, scribbles, or coarse masks.

Our core insight is that 3D geometric structure recovered directly from casual 2D video sequences provides powerful cross-view consistency constraints that can propagate sparse annotations across entire scenes. A dual student-teacher architecture enforces semantic consistency between 2D images and reconstructed 3D point clouds, injecting geometric supervision into the learning process while keeping inference purely 2D. Our solution achieves state-of-the-art performance, outperforming existing methods by 2–7% across a range of datasets and annotation types, without requiring additional labels or inference overhead.

About the Speaker

Wolfgang Boettcher is an ELLIS doctoral researcher in the Computer Vision and Machine Learning group at the Max Planck Institute for Informatics. Since completing his master's degree at ETH Zurich, his research has focused on visual perception, semantic scene understanding, and dynamic 3D reconstruction. He is particularly interested in models that can reason about the physical environment for applications in autonomous systems and robotics.

Related topics

Artificial Intelligence
Computer Vision
Machine Learning
Data Science