Jan 14 - Best of NeurIPS
Details
Welcome to the Best of NeurIPS series, your virtual pass to some of the groundbreaking research, insights, and innovations that defined the conference. Live streaming from the authors to you.
Jan 14, 2025
9 AM Pacific
Online. Register for the Zoom!
EgoExOR: An Ego-Exo-Centric Operating Room Dataset for Surgical Activity Understanding
Operating rooms (ORs) demand precise coordination among surgeons, nurses, and equipment in a fast-paced, occlusion-heavy environment, necessitating advanced perception models to enhance safety and efficiency. Existing datasets provide either partial egocentric views or sparse exocentric multi-view context, but none explores the comprehensive combination of both. We introduce EgoExOR, the first OR dataset and accompanying benchmark to fuse first-person and third-person perspectives. Spanning 94 minutes (84,553 frames at 15 FPS) of two emulated spine procedures, Ultrasound-Guided Needle Insertion and Minimally Invasive Spine Surgery, EgoExOR integrates egocentric data (RGB, gaze, hand tracking, audio) from wearable glasses, exocentric RGB and depth from RGB-D cameras, and ultrasound imagery. Its detailed scene graph annotations, covering 36 entities and 22 relations (568,235 triplets), enable robust modeling of clinical interactions, supporting tasks like action recognition and human-centric perception. We evaluate the surgical scene graph generation performance of two adapted state-of-the-art models and offer a new baseline that explicitly leverages EgoExOR's multimodal and multi-perspective signals. Together, the dataset and benchmark lay a foundation for OR perception, offering a rich, multimodal resource for next-generation clinical perception.
About the Speaker
Ege Özsoy is a final-year PhD student researching multimodal computer vision and vision–language models for surgical scene understanding, focusing on semantic scene graphs, multimodality, and ego-exocentric modeling in operating rooms.
SANSA: Unleashing the Hidden Semantics in SAM2 for Few-Shot Segmentation
Few-shot segmentation requires recognizing novel object categories from only a few annotated examples, demanding both accurate mask generation and strong visual correspondence. While Segment Anything 2 (SAM2) provides powerful prompt-based segmentation and built-in feature matching, its representations are entangled with tracking-specific cues that limit higher-level semantic generalization. We show that SAM2 nonetheless encodes rich latent semantic structure despite its class-agnostic training. To leverage this, we introduce SANSA, a lightweight framework that makes this structure explicit and adapts SAM2 for few-shot segmentation with minimal modifications. SANSA achieves state-of-the-art generalization performance, outperforms generalist in-context methods, supports flexible prompting, and remains significantly faster and smaller than prior approaches.
About the Speaker
Claudia Cuttano is a PhD student in the VANDAL Lab at Politecnico di Torino and is currently conducting a research visit at TU Darmstadt with Prof. Stefan Roth in the Visual Inference Lab. Her work centers on semantic segmentation, particularly on multi-modal scene understanding and leveraging foundation models for pixel-level vision tasks.
Nested Learning: The Illusion of Deep Learning Architectures
We present Nested Learning (NL), a new learning paradigm for continual learning that views machine learning models and their training process as a set of nested and/or parallel optimization problems, each with its own context flow, update frequency, and learning algorithm. Based on NL, we design a new architecture, called Hope, that is capable of continual learning and of modifying itself when needed.
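To make the idea of optimization levels with different update frequencies concrete, here is a minimal toy sketch. It is not the Hope architecture or the authors' method: the function name, the two-parameter model, and the learning rates are all assumptions made up for illustration. A "fast" parameter is updated on every sample, while a "slow" parameter is updated only every few steps from the gradients accumulated in between.

```python
def train_multi_frequency(data, steps, slow_every=4, lr_fast=0.1, lr_slow=0.05):
    """Toy two-level optimizer (illustrative only, not the Hope architecture).

    Fits y ~ (slow + fast) * x, where `fast` is updated every step and
    `slow` is updated once every `slow_every` steps.
    """
    slow, fast = 0.0, 0.0
    slow_grad_acc = 0.0   # gradients collected between slow updates
    slow_updates = 0
    for t in range(steps):
        x, y = data[t % len(data)]
        err = (slow + fast) * x - y
        grad = err * x    # gradient of 0.5 * err**2 w.r.t. slow and fast alike

        # Fast level: plain gradient step on every sample.
        fast -= lr_fast * grad

        # Slow level: step on the averaged gradient at a lower frequency.
        slow_grad_acc += grad
        if (t + 1) % slow_every == 0:
            slow -= lr_slow * slow_grad_acc / slow_every
            slow_grad_acc = 0.0
            slow_updates += 1
    return slow, fast, slow_updates


if __name__ == "__main__":
    samples = [(1.0, 2.0), (2.0, 4.0)]   # generated by y = 2 * x
    slow, fast, n_slow = train_multi_frequency(samples, steps=200)
    print(f"slow={slow:.4f} fast={fast:.4f} slow updates={n_slow}")
```

In this sketch the two levels share one loss; the abstract describes a more general setting in which each level also has its own context flow and learning algorithm.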
About the Speaker
Ali Behrouz is a Ph.D. student in the Computer Science Department at Cornell University and a research intern at Google Research. His research spans topics from deep learning architectures to continual learning and neuroscience, and has appeared at NeurIPS, ICML, KDD, WWW, CHIL, VLDB, and other conferences. His work has been recognized with two Best Paper awards, a Best Paper Honorable Mention award, a Best Paper Award candidacy, and oral and spotlight presentations.
Are VLM Explanations Faithful? A Counterfactual Testing Approach
VLMs sound convincing, but are their explanations actually true? This talk introduces Explanation-Driven Counterfactual Testing (EDCT), a simple and model-agnostic method that evaluates whether VLM explanations align with the evidence the models actually use. By perturbing the very features a model claims to rely on, EDCT exposes mismatches between stated reasoning and real decision pathways. I will show surprising failure cases across state-of-the-art VLMs and highlight how EDCT can guide more trustworthy explanation methods.
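The core test, perturbing what a model says it relies on and checking whether its answer changes, can be sketched in a few lines. This is a toy illustration under strong assumptions, not the EDCT implementation: real VLMs take images and produce free-text explanations, whereas here the "model" is a hypothetical function over a feature dictionary that returns a label plus the names of the features it cites. All names below are invented for the example.

```python
from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]
# Hypothetical interface: returns (label, names of features the model cites).
Model = Callable[[Features], Tuple[str, List[str]]]


def counterfactual_flip_rate(model: Model, inputs: List[Features],
                             neutral: float = 0.0) -> float:
    """Fraction of inputs whose label changes once the cited features
    are replaced by a neutral value. A low rate suggests the stated
    explanation is not what drives the decision."""
    if not inputs:
        return 0.0
    flipped = 0
    for features in inputs:
        label, cited = model(features)
        perturbed = dict(features)
        for name in cited:
            perturbed[name] = neutral   # remove the claimed evidence
        new_label, _ = model(perturbed)
        if new_label != label:
            flipped += 1
    return flipped / len(inputs)


def faithful_model(f: Features) -> Tuple[str, List[str]]:
    # Decides on red_light and cites red_light.
    return ("stop" if f["red_light"] > 0.5 else "go", ["red_light"])


def unfaithful_model(f: Features) -> Tuple[str, List[str]]:
    # Decides on red_light but claims to rely on pedestrian.
    return ("stop" if f["red_light"] > 0.5 else "go", ["pedestrian"])


if __name__ == "__main__":
    scenes = [{"red_light": 1.0, "pedestrian": 1.0},
              {"red_light": 1.0, "pedestrian": 0.0}]
    print(counterfactual_flip_rate(faithful_model, scenes))
    print(counterfactual_flip_rate(unfaithful_model, scenes))
```

The faithful toy model changes its answer whenever its cited feature is removed, while the unfaithful one keeps answering "stop", which is the kind of mismatch between stated and actual evidence the talk describes.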
About the Speaker
Santosh Vasa is a Machine Learning Engineer at Mercedes-Benz R&D North America, working on multimodal perception and VLM safety for autonomous driving. He co-authored the EDCT framework and focuses on explainability, counterfactual testing, and trustworthy AI.
