Details

Join the Meetup to hear talks from experts on cutting-edge topics across AI, ML, and computer vision.

Register to reserve your spot

Date, Time and Location

Apr 22, 2026
5:30 - 8:30 PM

Impact Hub Munich
Gotzinger Str. 8
München, Germany

Learning Disentangled Motion Representations for Open-World Motion Transfer

Recent progress in image- and text-to-video generation has made it possible to synthesize visually compelling videos, yet these models typically lack an explicit, reusable notion of motion. In this talk, I will present recent work on learning high-level, content-independent motion representations directly from open-world video data, with a focus on our NeurIPS spotlight paper introducing DisMo.

By disentangling motion semantics from appearance and object identity, such representations enable open-world motion transfer across semantically unrelated entities and provide a flexible interface for adapting and fine-tuning modern video generation models. Beyond generation, I will discuss how abstract motion representations support downstream motion understanding tasks and why they offer a promising direction for more controllable, general, and future-proof video models. The talk will conclude with a broader perspective on the opportunities and challenges of motion-centric representations in computer vision and video learning.

About the Speaker

Thomas Ressler-Antal is a PhD student at the Computer Vision & Learning Lab at LMU Munich, advised by Björn Ommer. His research focuses on representation learning from large-scale, open-world video data, with an emphasis on disentangling motion from appearance. He is particularly interested in motion understanding, video generation, and transferable representations that enable controllable and general-purpose video models. His work on learning abstract motion representations from raw video has been published at NeurIPS as a spotlight paper.

Towards Generating Fully Navigable 3D Scenes

3D world generation is a longstanding goal of computer vision with applications in VR, gaming, film, robotics, and digital twins. Recent progress in generative models, in particular image and video diffusion models, enables the automatic generation of photorealistic 3D environments. This talk describes a simple yet effective framework for exploiting these models for 3D scene generation. We'll briefly cover early approaches (Text2Room, ViewDiff) and then dive deep into our recent state-of-the-art approach, WorldExplorer.

About the Speaker

Lukas Höllein is a PhD student at the Visual Computing & Artificial Intelligence Lab at the Technical University of Munich, supervised by Prof. Dr. Matthias Nießner. His research lies at the intersection of computer vision/graphics and machine learning, focusing mostly on 3D reconstruction and generation. He is especially interested in the creation of fully navigable 3D worlds with the help of generative AI.

The Future of 3D Vision Data: From Human Annotation to AI-Generated Data

Dataset accuracy is one of the most important, yet often overlooked, aspects of 3D computer vision. This talk will start by revisiting my earlier work on 6D pose and depth estimation to highlight how ground-truth errors can distort evaluations, then present practical techniques for accurate data annotation and demonstrate these issues in practice. Finally, we discuss leveraging a diffusion model as a scalable way to create a large-scale synthetic dataset that replicates realistic sensor noise.

About the Speaker

Hyunjun Jung received his PhD from the chair of Computer Aided Medical Procedures (CAMP) at the Technical University of Munich, supervised by Prof. Dr. Nassir Navab. During his PhD, Hyunjun's research covered a broad range of topics in 3D computer vision, including 6D pose estimation, depth estimation, robotics, accurate 3D datasets, animatable human avatars, and diffusion models. He is currently a postdoctoral researcher in Prof. Dr. Benjamin Busam's Photogrammetry and Remote Sensing (PRS) lab and will soon join the LG Graduate School of AI (Seoul, Korea) as an industrial professor.

Data Foundations for Vision-Language-Action Models

Model architectures get the papers, but data decides whether robots actually work. This talk introduces VLAs from a data-centric perspective: what makes robot datasets fundamentally different from image classification or video understanding, how the field is organizing its data (Open X-Embodiment, LeRobot, RLDS), and what evaluation benchmarks actually measure. We'll examine the unique challenges such as temporal structure, proprioceptive signals, and heterogeneity in embodiment, and discuss why addressing them matters more than the next architectural innovation.

About the Speaker

Harpreet Sahota is a hacker-in-residence and machine learning engineer with a passion for deep learning and generative AI. He has a deep interest in VLMs, Visual Agents, Document AI, and Physical AI.

Related topics

Events in München, DE
Artificial Intelligence
Computer Vision
Machine Learning
Data Science
Open Source