
About us
🖖 This virtual group is for data scientists, machine learning engineers, and open source enthusiasts.
Every month we’ll bring you diverse speakers working at the cutting edge of AI, machine learning, and computer vision.
- Are you interested in speaking at a future Meetup?
- Is your company interested in sponsoring a Meetup?
This Meetup is sponsored by Voxel51, the lead maintainers of the open source FiftyOne computer vision toolset. To learn more, visit the FiftyOne project page on GitHub.
Upcoming events

April 22 - Munich AI, ML and Computer Vision Meetup
Impact Hub Munich GmbH, Gotzinger Straße 8, München, DE
Join the Meetup to hear talks from experts on cutting-edge topics across AI, ML, and computer vision.
Date, Time and Location
Apr 22, 2026
5:30 - 8:30 PM
Impact Hub Munich
Gotzinger Str. 8
München, Germany

Learning Disentangled Motion Representations for Open-World Motion Transfer
Recent progress in image- and text-to-video generation has made it possible to synthesize visually compelling videos, yet these models typically lack an explicit, reusable notion of motion. In this talk, I will present recent work on learning high-level, content-independent motion representations directly from open-world video data, with a focus on our NeurIPS spotlight paper introducing DisMo.
By disentangling motion semantics from appearance and object identity, such representations enable open-world motion transfer across semantically unrelated entities and provide a flexible interface for adapting and fine-tuning modern video generation models. Beyond generation, I will discuss how abstract motion representations support downstream motion understanding tasks and why they offer a promising direction for more controllable, general, and future-proof video models. The talk will conclude with a broader perspective on the opportunities and challenges of motion-centric representations in computer vision and video learning.
About the Speaker
Thomas Ressler-Antal is a PhD student at the Computer Vision & Learning Lab at LMU Munich, advised by Björn Ommer. His research focuses on representation learning from large-scale, open-world video data, with an emphasis on disentangling motion from appearance. He is particularly interested in motion understanding, video generation, and transferable representations that enable controllable and general-purpose video models. His work has been published at NeurIPS as a spotlight paper on learning abstract motion representations from raw video.
Towards Generating Fully Navigable 3D Scenes
3D world generation is a longstanding goal of computer vision, with applications in VR, gaming, movies, robotics, and digital twins. Recent progress in generative models, in particular image and video diffusion models, enables automatic generation of photorealistic 3D environments. This talk describes a simple yet effective framework that exploits these models for 3D scene generation. We'll briefly cover early approaches (Text2Room, ViewDiff) and then dive deep into our recent state-of-the-art approach, WorldExplorer.
About the Speaker
Lukas Höllein is a PhD student at the Visual Computing & Artificial Intelligence Lab at the Technical University of Munich, supervised by Prof. Dr. Matthias Nießner. His research lies at the intersection of computer vision/graphics and machine learning, concerning mostly 3D reconstruction and generation. He is especially interested in the creation of fully navigable 3D worlds with the help of generative AI.
Finding Motion in Commotion: Estimating and Anticipating Motion in Everyday Visual Scenes
Motion is an intrinsic property of video data. How do we harness motion from the abundance of videos to advance vision foundation models? This talk will examine key challenges and emerging opportunities in motion estimation and motion-aware representation learning at scale. Drawing on our latest results from NeurIPS and ICCV, the talk will show how motion-centric learning can enable more versatile and generalisable vision foundation models.
About the Speaker
Nikita Araslanov is a postdoctoral researcher in the Computer Vision Group at TU Munich. His research focuses on semantic and 3D visual inference from video data, with the goal of bridging visual perception and reasoning about complex phenomena. He earned his PhD in Computer Science from TU Darmstadt (2022) and was a visiting researcher at Google (2024–2025).
Small Models, Big Intelligence: How vLLM Semantic Router Uses Sub-2B Language Models for Production-Scale Routing
The vLLM Semantic Router introduces a groundbreaking approach to intelligent LLM request routing through its MoM (Mixture of Models) family, a collection of specialized small language models that make split-second routing decisions for production systems. This system operates between users and models, capturing signals from requests, responses, and context to make intelligent routing decisions, including model selection, safety filtering (jailbreak, PII), semantic caching, and hallucination detection.
In this talk, we'll explore how the router leverages tiny but powerful models like ModernBERT (encoder-based) and Qwen3 (0.6B-1.7B parameter decoder models) to achieve sub-10ms latency classification at over 10,000 queries per second. We'll dive into the technical architecture showing how these small models handle domain classification, jailbreak detection, PII protection, and hallucination detection, proving that for routing intelligence, size isn't everything.
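The core idea, classifying each incoming query with a cheap model and dispatching it to an appropriate backend, can be illustrated with a deliberately tiny sketch. Everything below (the bag-of-words "embeddings", the route names, the threshold) is hypothetical and stands in for the trained sub-2B encoders the actual router uses:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; the real router uses trained encoders."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical route prototypes: each route name maps to a backend model pool.
ROUTES = {
    "code": embed("write debug python function code compile error stack trace"),
    "math": embed("solve equation integral proof theorem calculate derivative"),
    "general": embed("hello thanks tell me about explain summarize"),
}

def route(query, threshold=0.1):
    """Pick the best-aligned route; fall back to 'general' on weak matches."""
    scores = {name: cosine(embed(query), proto) for name, proto in ROUTES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "general"

print(route("please debug this python stack trace"))  # -> code
```

In the production system this decision step also covers safety filtering, caching, and hallucination checks; the sketch only shows the routing dispatch itself.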
About the Speaker
Peter Bouda is an AI Engineer and tech leader with over 20 years of experience building cutting-edge solutions across AI/ML, NLP, and full-stack development. Currently architecting AI platforms at EY, he previously led the AI Lab at Apiax, where he built sophisticated NLP stacks for regulatory compliance in the financial services industry, successfully raised over €1.5 million in EU R&D funding, and deployed production-grade microservices on Azure Kubernetes. Outside of work, he regularly embarks on long-distance cycling trips, discovering new roads and trails in Portugal and Spain.
Data Foundations for Vision-Language-Action Models
Model architectures get the papers, but data decides whether robots actually work. This talk introduces VLAs from a data-centric perspective: what makes robot datasets fundamentally different from image classification or video understanding, how the field is organizing its data (Open X-Embodiment, LeRobot, RLDS), and what evaluation benchmarks actually measure. We'll examine the unique challenges such as temporal structure, proprioceptive signals, and heterogeneity in embodiment, and discuss why addressing them matters more than the next architectural innovation.
About the Speaker
Harpreet Sahota is a hacker-in-residence and machine learning engineer with a passion for deep learning and generative AI. He has a deep interest in VLMs, Visual Agents, Document AI, and Physical AI.
204 attendees - Network event

April 23 - Advances in AI at Johns Hopkins University
Online · 236 attendees from 48 groups
Join our virtual Meetup to hear talks from researchers at Johns Hopkins University on cutting-edge AI topics.
Date, Time and Location
Apr 23, 2026
9 AM Pacific
Online. Register for the Zoom!

Recent Advancements in Image Generation and Understanding
In this talk, I will provide an overview of my research and then take a closer look at three recent works. Image generation has progressed rapidly in the past decade, evolving from Gaussian Mixture Models (GMMs) to Variational Autoencoders (VAEs), GANs, and more recently diffusion models, which have set new standards for quality. I will begin with DiffNat (TMLR’25), which draws inspiration from a simple yet powerful observation: the kurtosis concentration property of natural images. By incorporating a kurtosis concentration loss together with a perceptual guidance strategy, DiffNat can be plugged directly into existing diffusion pipelines, leading to sharper and more faithful generations across tasks such as personalization, super-resolution, and unconditional synthesis.
Continuing the theme of improving quality under constraints, I will then discuss DuoLoRA (ICCV’25), which tackles the challenge of content–style personalization from just a few examples. DuoLoRA introduces adaptive-rank LoRA merging with cycle-consistency, allowing the model to better disentangle style from content. This not only improves personalization quality but also achieves it with 19× fewer trainable parameters, making it far more efficient than conventional merging strategies.
Finally, I will turn to Cap2Aug (WACV’25), which directly addresses data scarcity. This approach uses captions as a bridge for semantic augmentation, applying cross-modal backtranslation (image → text → image) to generate diverse synthetic samples. By aligning real and synthetic distributions, Cap2Aug boosts both few-shot and long-tail classification performance on multiple benchmarks.
About the Speaker
Aniket Roy is currently a Research Scientist at NEC Labs America. He recently earned a PhD from the Computer Science department at Johns Hopkins University under the guidance of Bloomberg Distinguished Professor Rama Chellappa.
From Representation Analysis to Data Refinement: Understanding Correlations in Deep Models
This talk examines how deep learning models encode information beyond their intended objectives and how such dependencies influence reliability, fairness, and generalization. Representation-level analysis using mutual information–based expressivity estimation is introduced to quantify the extent to which attributes such as demographics or anatomical structural factors are implicitly captured in learned embeddings, even when they are not explicitly used for supervision. These analyses reveal hierarchical patterns of attribute encoding and highlight how correlated factors emerge across layers. Data attribution techniques are then discussed to identify influential training samples that contribute to model errors and reinforce dependencies that reduce robustness. By auditing the training data through influence estimation, harmful instances can be identified and removed to improve model behavior. Together, these components highlight a unified, data-centric perspective for analyzing and refining correlations in deep models.
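As a toy illustration of the underlying quantity, the snippet below estimates discrete mutual information between a binarized embedding feature and an attribute label. The synthetic data and the simple plug-in estimator are illustrative assumptions, not the expressivity estimation method from the talk:

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in estimate of discrete MI (in nats) between two label arrays."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            px, py = np.mean(x == xv), np.mean(y == yv)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

rng = np.random.default_rng(0)
attr = rng.integers(0, 2, 10_000)                 # e.g. a binary demographic attribute
leaky = (attr + (rng.random(10_000) < 0.1)) % 2   # feature correlated with the attribute
indep = rng.integers(0, 2, 10_000)                # independent feature

# A feature that implicitly encodes the attribute carries far more MI.
print(mutual_information(leaky, attr) > mutual_information(indep, attr))
```

High MI between an embedding dimension and an attribute that was never used for supervision is exactly the kind of implicit encoding the talk's analysis is designed to surface.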
About the Speaker
Basudha Pal is a recent PhD graduate from the Electrical and Computer Engineering Department at Johns Hopkins University. Her research lies at the intersection of computer vision and representation learning, focusing on understanding and refining correlations in deep neural network representations for biometric and medical imaging using mutual information analysis, data attribution, and generative modeling to improve robustness, fairness, and reliability in high-stakes AI systems.
Scalable & Precise Histopathology: Next-Gen Deep Learning for Digital Histopathology
Whole slide images (WSIs) present a unique computational challenge in digital pathology, with single images reaching gigapixel resolution, equivalent to 500+ photos stitched together. This talk presents two complementary deep learning solutions for scalable and accurate WSI analysis. First, I introduce a Task-Specific Self-Supervised Learning (TS-SSL) framework that uses spatial-channel attention to learn domain-optimized feature representations, outperforming existing foundation models across multiple cancer classification benchmarks. Second, I present CEMIL, a contextual attention-based MIL framework that leverages instructor-learner knowledge distillation to classify cancer subtypes using only a fraction of WSI patches, achieving state-of-the-art accuracy with significantly reduced computational cost. Together, these methods address critical bottlenecks in generalization and efficiency for clinical-grade computational pathology.
About the Speaker
Tawsifur Rahman is a Ph.D. candidate in Biomedical Engineering at Johns Hopkins University, advised by Prof. Rama Chellappa and Dr. Alex Baras, with research focused on weakly supervised and self-supervised deep learning for computational pathology. He has completed two clinical data science internships at Johnson & Johnson MedTech and has published extensively in venues including Nature Modern Pathology, Nature Digital Medicine, MIDL, and IEEE WACV, accumulating over 8,500 citations and recognition in Stanford's Top 2% Scientists ranking.
Towards Trustworthy AI Under Real-World Data Challenges
The current paradigm of training AI models relies on fundamental assumptions that the data we have is clean, properly annotated, and sufficiently diverse across domains. However, this is not always true in the real world. In practice, data may be physically corrupted, incompletely annotated, and specific to certain domains. As we move towards large-scale, general-purpose models like LLMs and foundation models, it becomes even more important to address these data challenges so that we can train trustworthy AI models even with noisy real-world data. In this presentation, we discuss some methods to tackle these potential issues.
About the Speaker
Ayush Gupta is a Ph.D. student at the AIEM lab in the Department of Computer Science at Johns Hopkins University. He is advised by Prof. Rama Chellappa and works on problems in computer vision and deep learning. His research has two focus areas: general-purpose vision-language models, where he works on multimodal LLMs for tasks like VQA, video grounding, and LLM interpretability; and fine-grained computer vision problems, where he works on person re-identification and gait recognition.
43 attendees from this group - Network event

April 30 - Best of WACV 2026
Online · 67 attendees from 48 groups
Welcome to the Best of WACV series, your virtual pass to some of the groundbreaking research, insights, and innovations that defined this year’s conference. Live streaming from the authors to you!
Date, Time and Location
Apr 30, 2026
9 AM - 11 AM Pacific
Online. Register for the Zoom!

Zero-Shot Coreset Selection via Iterative Subspace Sampling
Deep learning's reliance on massive datasets leads to significant costs in storage, annotation, and training. Although coreset selection aims to mitigate these costs by finding performant data subsets, state-of-the-art methods typically require expensive ground-truth labels and dataset-specific training. To overcome these scalability issues, ZCore introduces a zero-shot approach that functions without labels or prior training on candidate data. Instead, ZCore uses foundation models to generate a zero-shot embedding space for unlabeled data, then quantifies the relative importance of each example based on overall coverage and redundancy within the embedding distribution. On ImageNet, ZCore outperforms previous label-based methods at a 90% prune rate while eliminating the need to annotate over one million images.
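To make the coverage/redundancy intuition concrete, here is a minimal greedy farthest-point selection over a synthetic embedding space. This is an illustrative stand-in, not ZCore's actual scoring, which quantifies per-example importance from coverage and redundancy within a zero-shot foundation-model embedding distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def greedy_coreset(embeddings, k):
    """Select k indices covering the embedding space: repeatedly pick the
    point farthest from the current selection (high coverage, low redundancy)."""
    n = embeddings.shape[0]
    selected = [int(rng.integers(n))]
    # Distance of every point to its nearest selected point so far.
    dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(dists.argmax())  # least-covered point
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return selected

X = rng.normal(size=(1000, 32))   # stand-in for foundation-model embeddings
idx = greedy_coreset(X, k=100)    # keep 10%, prune 90%
print(len(set(idx)))              # 100 distinct examples
```

Each iteration picks the point least covered by the current selection, so the chosen subset spreads across the embedding distribution instead of duplicating dense regions.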
About the Speaker
Brent Griffin is a Principal Machine Learning Scientist at Voxel51 specializing in low-cost machine learning on unstructured data. Previously, he was the Perception Lead at Agility Robotics and an assistant research scientist at the University of Michigan conducting research at the intersection of computer vision, control, and robot learning. He is lead author on publications in all of the top IEEE conferences for computer vision, robotics, and control, and his work has been featured in Popular Science, in IEEE Spectrum, and on the Big Ten Network.
ENCORE: A Neural Collapse Perspective on Out-of-Distribution Detection in Deep Neural Networks
We present ENCORE, a post-hoc out-of-distribution (OOD) detection method grounded in the geometric properties of neural collapse in deep neural networks. By leveraging the observation that in-distribution features align with class means while OOD features tend to be misaligned or orthogonal, ENCORE modifies inference through cosine-based scoring and adaptive feature scaling to enhance separation between known and unknown inputs. The method approximates neural collapse behavior at test time without requiring retraining, enabling more reliable uncertainty estimation. It is lightweight, memory-efficient, and compatible with a wide range of architectures, including convolutional networks and vision transformers. Extensive experiments on standard benchmarks demonstrate consistent improvements over existing OOD detection approaches in both near- and far-distribution shifts.
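The geometric intuition, that in-distribution features align with a class mean while OOD features do not, can be sketched with plain cosine scoring against class means. Note this toy omits ENCORE's adaptive feature scaling and runs on synthetic features:

```python
import numpy as np

def ood_score(features, class_means):
    """Max cosine alignment with any class mean; low alignment => likely OOD."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    m = class_means / np.linalg.norm(class_means, axis=1, keepdims=True)
    return (f @ m.T).max(axis=1)

rng = np.random.default_rng(1)
means = rng.normal(size=(10, 64))  # one mean direction per in-distribution class

# In-distribution features cluster tightly around their class means;
# OOD features point in unrelated directions.
id_feats = means[rng.integers(0, 10, 100)] + 0.1 * rng.normal(size=(100, 64))
ood_feats = rng.normal(size=(100, 64))

print(ood_score(id_feats, means).mean() > ood_score(ood_feats, means).mean())
```

Thresholding this score at test time separates known from unknown inputs without any retraining, which is the post-hoc property the abstract highlights.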
About the Speaker
A.Q.M. Sazzad Sayyed is a Ph.D. candidate in Electrical and Computer Engineering at Northeastern University, focusing on robust, secure, and efficient deep learning. His research centers on out-of-distribution detection, uncertainty modeling, and machine learning reliability for safety-critical and edge AI systems.
Synthesizing Compositional Videos from Text Description
Existing pre-trained text-to-video diffusion models can generate high-quality videos, but often struggle with misalignment between the generated content and the input text, particularly while composing scenes with multiple objects. To tackle this issue, we propose a straightforward, training-free approach for compositional video generation from text. We introduce Video-ASTAR for test-time aggregation and segregation of attention with a novel centroid loss to enhance alignment, which enables the generation of multiple objects in the scene, modeling the actions and interactions.
Additionally, we extend our approach to the Multi-Action video generation setting, where only the specified action should vary across a sequence of prompts. To ensure coherent action transitions, we introduce a novel token-swapping and latent interpolation strategy.
About the Speaker
Shanmuganathan Raman is a prominent academic and researcher in the fields of computer vision, deep learning, computational photography, and computer graphics. He is a Professor at the Indian Institute of Technology Gandhinagar (IIT Gandhinagar), where he holds a joint appointment in the Departments of Electrical Engineering and Computer Science and Engineering. He serves as the Head of the Department of Computer Science and Engineering at IIT Gandhinagar.
The Perceptual Observatory: Characterizing Robustness and Grounding in MLLMs
Multimodal large language models can answer impressively complex visual questions, but do they truly understand what they see? We present The Perceptual Observatory, a framework for characterizing robustness and grounding in MLLMs beyond standard leaderboard scores. We evaluate models on interpretable tasks such as image matching, grid pointing game, and attribute localization across pixel-level corruptions and diffusion-based stylized illusions. Our analysis reveals that scaling the language model alone does not guarantee better perceptual grounding, uncovering systematic weaknesses in robustness, spatial invariance, fairness, and reasoning-based perception. The Perceptual Observatory offers a more principled way to study multimodal perception and provides actionable insights for building future MLLMs that are reliable and truly grounded in visual evidence.
About the Speaker
Fenil Bardoliya is a Researcher at the Complex Data Reasoning & Analysis Lab (CORAL) at Arizona State University. His research revolves around Multimodal Model Evaluation and Benchmarking, Machine Unlearning, and Structured Reasoning.
10 attendees from this group - Network event

May 6 - Building Composable Computer Vision Workflows in FiftyOne
Online · 66 attendees from 48 groups
This workshop explores the FiftyOne plugin framework to build custom computer vision applications. You’ll learn to extend the open source FiftyOne App with Python-based panels and server-side operators, as well as integrate external tools for labeling, vector search, and model inference into your dataset views.
Date, Time and Location
May 6, 2026
10 AM - 11 AM Pacific
Online. Register for the Zoom!

What You'll Learn
- Build Python plugins. Define plugin manifests and directory structures to register custom functionality within the FiftyOne ecosystem.
- Develop server-side operators. Write functions to execute model inference, data cleaning, or metadata updates from the App interface.
- Build interactive panels. Create custom UI dashboards to visualize model metrics or specialized dataset distributions.
- Manage operator execution contexts. Pass data between the App front end and your backend to build dynamic user workflows.
- Implement delegated execution. Configure background workers to handle long-running data processing tasks without blocking the user interface.
- Build labeling integrations. Streamline the flow of data between FiftyOne and annotation platforms through custom triggers and ingestion scripts.
- Extend vector database support. Program custom connectors for external vector stores to enable semantic search across large sample datasets.
- Package and share plugins. Distribute your extensions internally and externally.
12 attendees from this group
Past events

