
Details

Welcome to the Best of ICCV series, your virtual pass to some of the groundbreaking research, insights, and innovations that defined this year’s conference. Live streaming from the authors to you.

When and Where

Nov 24, 2025
9 AM Pacific
Online. Register for the Zoom!

VLM4D: Towards Spatiotemporal Awareness in Vision Language Models

Are Vision-Language Models Ready for Physical AI? Humans easily understand how objects move, rotate, and shift, while current AI models that connect vision and language still make mistakes in what seem like simple situations: deciding “left” versus “right” when something is moving, recognizing how perspective changes, or keeping track of motion over time. To reveal these limitations, we created VLM4D, a testing suite made up of real-world and synthetic videos, each paired with questions about motion, rotation, perspective, and continuity. When we put modern vision-language models through these challenges, they performed far below human levels, especially when visual cues must be combined or the sequence of events must be maintained. But there is hope: new methods such as reconstructing visual features in 4D and fine-tuning focused on space and time show noticeable improvement, bringing us closer to AI that truly understands a dynamic physical world.
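
For a sense of how a benchmark like this is typically scored, here is a minimal sketch of a multiple-choice evaluation loop. The file layout and the `answer_video_question` helper are hypothetical placeholders, not the authors' released code.

```python
# Minimal sketch of a VLM4D-style evaluation loop (illustrative only).
# The JSON layout and the `answer_video_question` helper are assumptions.
import json

def answer_video_question(video_path: str, question: str, choices: list[str]) -> str:
    """Placeholder for a vision-language model call; returns one of `choices`."""
    raise NotImplementedError("plug in the VLM you want to evaluate here")

def evaluate(benchmark_file: str) -> float:
    """Score a model on multiple-choice spatiotemporal questions."""
    with open(benchmark_file) as f:
        # Assumed format: [{"video": ..., "question": ..., "choices": [...], "answer": ...}, ...]
        items = json.load(f)
    correct = 0
    for item in items:
        pred = answer_video_question(item["video"], item["question"], item["choices"])
        correct += int(pred == item["answer"])
    return correct / len(items)
```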

About the Speaker

Shijie Zhou is a final-year PhD candidate at UCLA, recipient of the 2026 Dissertation Year Award and the Graduate Dean’s Scholar Award. His research focuses on spatial intelligence, spanning 3D/4D scene reconstruction and generation, vision-language models, generative AI, and interactive agentic systems. His work has been recognized at top conferences including CVPR, ICCV, ECCV, ICLR, and NeurIPS, and has also led to practical impact through research internships at Google and Apple.

DuoLoRA: Cycle-consistent and Rank-disentangled Content-Style Personalization

We tackle the challenge of jointly personalizing content and style from a few examples. A promising approach is to train separate Low-Rank Adapters (LoRA) and merge them effectively, preserving both content and style. Existing methods, such as ZipLoRA, treat content and style as independent entities, merging them by learning masks in LoRA's output dimensions. However, content and style are intertwined, not independent. To address this, we propose DuoLoRA, a content-style personalization framework featuring three key components: (i) rank-dimension mask learning, (ii) effective merging via layer priors, and (iii) Constyle loss, which leverages cycle-consistency in the merging process. First, we introduce ZipRank, which performs content-style merging within the rank dimension, offering adaptive rank flexibility and significantly reducing the number of learnable parameters.

Additionally, we incorporate SDXL layer priors to apply implicit rank constraints informed by each layer's content-style bias and adaptive merger initialization, enhancing the integration of content and style. To further refine the merging process, we introduce Constyle loss, which leverages the cycle-consistency between content and style. Our experimental results demonstrate that DuoLoRA outperforms state-of-the-art content-style merging methods across multiple benchmarks.
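
To make the rank-dimension merging idea concrete, here is a minimal sketch, assuming two pre-trained LoRA adapters (content and style) for the same layer. The class name and gating scheme are illustrative, not the official DuoLoRA implementation.

```python
# Minimal sketch of merging a content LoRA and a style LoRA in the rank
# dimension with a learnable mask (illustrative; not the official DuoLoRA code).
import torch
import torch.nn as nn

class RankMaskedMerge(nn.Module):
    def __init__(self, A_c, B_c, A_s, B_s):
        # A_*: (rank, in_features), B_*: (out_features, rank) for each adapter.
        super().__init__()
        self.register_buffer("A", torch.cat([A_c, A_s], dim=0))  # (2r, in)
        self.register_buffer("B", torch.cat([B_c, B_s], dim=1))  # (out, 2r)
        # One learnable gate per rank component instead of per output dimension.
        self.mask_logits = nn.Parameter(torch.zeros(self.A.shape[0]))

    def delta_weight(self):
        gate = torch.sigmoid(self.mask_logits)        # (2r,) soft selection of rank components
        return self.B @ torch.diag(gate) @ self.A     # (out, in) merged LoRA update
```

Gating individual rank components keeps the number of learnable merge parameters at twice the adapter rank per layer, rather than one per output dimension, which is where the parameter reduction mentioned in the abstract comes from.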

About the Speaker

Aniket Roy is currently a PhD student in Computer Science at Johns Hopkins University. Prior to that, he earned a Master’s degree from the Indian Institute of Technology Kharagpur. During his Master’s program, he demonstrated strong research capabilities, publishing multiple papers in prestigious conferences and journals (including ICIP, CVPR Workshops, TCSVT, and IWDW). He was recognized with the Best Paper Award at IWDW 2016 and the Markose Thomas Memorial Award for the best research thesis at the Master’s level. Aniket continued to pursue research as a PhD student under the guidance of renowned vision researcher Professor Rama Chellappa at Johns Hopkins University. There, he has explored few-shot learning, multimodal learning, diffusion models, LLMs, and LoRA merging, publishing in leading venues such as NeurIPS, ICCV, TMLR, WACV, and CVPR. He has also gained valuable industrial experience through internships at esteemed organizations, including Amazon, Qualcomm, MERL, and SRI International. He was named an Amazon Fellow (2023-24) at JHU and invited to attend the ICCV'25 doctoral consortium.

Rethinking Few Shot CLIP Benchmarks: A Critical Analysis in the Inductive Setting

CLIP is a foundational model with transferable classification performance in the few-shot setting. Several methods have shown improved performance of CLIP using few-shot examples. However, so far, all these techniques have been benchmarked using standard few-shot datasets. We argue that this mode of evaluation does not provide a true indication of the inductive generalization ability using few-shot examples. As most datasets have been seen by the CLIP model, the resultant setting can be termed partially transductive. To solve this, we propose a pipeline that uses an unlearning technique to obtain true inductive baselines. In this new inductive setting, the methods show a significant drop in performance (-55% on average among 13 baselines with multiple datasets). We validate the unlearning technique using oracle baselines. We also propose an improved few-shot classification technique that consistently obtains state-of-the-art performance over 13 other recent baseline methods in a comprehensive analysis of 5880 experiments, varying the datasets, the number of few-shot examples, the unlearning setting, and the random seeds. Thus, we identify the issue with the evaluation of CLIP-based few-shot classification, provide a solution using unlearning, propose new benchmarks, and provide an improved method.
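
As a point of reference for the few-shot setting itself, here is a minimal sketch of a standard prototype classifier over frozen, pre-extracted CLIP image features. It is a common baseline, not the improved method proposed in the talk.

```python
# Minimal sketch of a few-shot prototype classifier on frozen CLIP image
# features (a standard baseline in this setting, not the paper's method).
import torch
import torch.nn.functional as F

def build_prototypes(support_feats: torch.Tensor, support_labels: torch.Tensor, n_classes: int):
    """support_feats: (N, D) L2-normalized CLIP features; returns (n_classes, D) class prototypes."""
    protos = torch.stack([support_feats[support_labels == c].mean(0) for c in range(n_classes)])
    return F.normalize(protos, dim=-1)

def classify(query_feats: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Assign each query to the nearest prototype by cosine similarity; query_feats: (M, D) normalized."""
    return (query_feats @ prototypes.T).argmax(dim=-1)
```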

About the Speaker

Alexey Kravets is a PhD student in AI at the University of Bath, with over five years of experience as a Lead Data Scientist at Aviva. His current research focuses on vision and language models, few-shot learning, machine unlearning, and mechanistic interpretability. Before his PhD, he led significant machine learning projects at Aviva, a FTSE 100 insurer in the UK, including the development of NLP tools for insurance predictions. His passion for AI extends into writing, where he regularly shares insights through articles on Medium.

Forecasting Continuous Non-Conservative Dynamical Systems in SO(3)

Tracking and forecasting the rotation of objects is fundamental in computer vision and robotics, yet SO(3) extrapolation remains challenging as (1) sensor observations can be noisy and sparse, (2) motion patterns can be governed by complex dynamics, and (3) application settings can demand long-term forecasting. This work proposes modeling continuous-time rotational object dynamics on SO(3) using Neural Controlled Differential Equations guided by Savitzky-Golay paths. Unlike existing methods that rely on simplified motion assumptions, our method learns a general latent dynamical system of the underlying object trajectory while respecting the geometric structure of rotations. Experimental results on real-world data demonstrate compelling forecasting capabilities compared to existing approaches.
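
One ingredient the abstract mentions is the Savitzky-Golay path that guides the Neural CDE. The sketch below only shows a plausible way to smooth noisy quaternion observations into such a path; it is not the authors' full model.

```python
# Minimal sketch: smoothing noisy quaternion observations with a Savitzky-Golay
# filter to obtain a smooth control path (one ingredient, not the full Neural CDE).
import numpy as np
from scipy.signal import savgol_filter

def smooth_rotation_path(quats: np.ndarray, window: int = 11, polyorder: int = 3) -> np.ndarray:
    """quats: (T, 4) noisy unit quaternions; returns smoothed unit quaternions."""
    q = quats.copy()
    # Enforce a consistent hemisphere so neighbouring quaternions do not flip sign.
    for t in range(1, len(q)):
        if np.dot(q[t], q[t - 1]) < 0:
            q[t] = -q[t]
    # Filter each quaternion component over time, then project back to the unit sphere.
    q_smooth = savgol_filter(q, window_length=window, polyorder=polyorder, axis=0)
    return q_smooth / np.linalg.norm(q_smooth, axis=1, keepdims=True)
```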

About the Speaker

Lennart Bastian is a PhD candidate at TU Munich's CAMP lab under Prof. Nassir Navab, and an incoming research fellow at Imperial College London. Originally trained in applied mathematics (with early stints in NYC and California's tech scene), he found his calling at the intersection of geometry, machine learning, and clinical applications. His work focuses on making sense of the real world in 3D, teaching computers to understand geometry and what happens in complex surgical environments.

UnMix-NeRF: Spectral Unmixing Meets Neural Radiance Fields

Neural Radiance Field (NeRF)-based segmentation methods focus on object semantics and rely solely on RGB data, lacking intrinsic material properties. This limitation restricts accurate material perception, which is crucial for robotics, augmented reality, simulation, and other applications. We introduce UnMix-NeRF, a framework that integrates spectral unmixing into NeRF, enabling joint hyperspectral novel view synthesis and unsupervised material segmentation. Our method models spectral reflectance via diffuse and specular components, where a learned dictionary of global endmembers represents pure material signatures, and per-point abundances capture their distribution. For material segmentation, we use spectral signature predictions along the learned endmembers, allowing unsupervised material clustering. Additionally, UnMix-NeRF enables scene editing by modifying the learned endmember dictionaries for flexible material-based appearance manipulation. Extensive experiments validate our approach, demonstrating superior spectral reconstruction and material segmentation compared to existing methods.
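
The linear unmixing idea at the core of the method can be sketched in a few lines, assuming per-point abundance logits and a global endmember dictionary; the function below is illustrative rather than the UnMix-NeRF implementation.

```python
# Minimal sketch of linear spectral unmixing (illustrative, not the UnMix-NeRF code):
# per-point abundances over a global endmember dictionary reconstruct a diffuse
# spectrum, and the argmax abundance gives an unsupervised material label.
import torch
import torch.nn.functional as F

def unmix(abundance_logits: torch.Tensor, endmembers: torch.Tensor):
    """abundance_logits: (N, M) per-point scores; endmembers: (M, B) pure spectra over B bands."""
    abundances = F.softmax(abundance_logits, dim=-1)  # non-negative, sum-to-one mixing weights
    spectra = abundances @ endmembers                 # (N, B) reconstructed diffuse reflectance
    materials = abundances.argmax(dim=-1)             # (N,) unsupervised material assignment
    return spectra, materials
```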

About the Speaker

Fabian Perez is a computer science student at Universidad Industrial de Santander (UIS) in Colombia, where he is currently pursuing a Master’s degree. He has strong skills in software development and deep learning, and bringing these two areas together allows him to create innovative solutions.

Artificial Intelligence
Computer Vision
Machine Learning
Data Science
Open Source
