
About us
🖖 This virtual group is for data scientists, machine learning engineers, and open source enthusiasts.
Every month we’ll bring you diverse speakers working at the cutting edge of AI, machine learning, and computer vision.
- Are you interested in speaking at a future Meetup?
- Is your company interested in sponsoring a Meetup?
This Meetup is sponsored by Voxel51, the lead maintainers of the open source FiftyOne computer vision toolset. To learn more, visit the FiftyOne project page on GitHub.
Upcoming events (14)

June 30 - Beyond Annotation Tools: Building a Complete Physical AI Data Engine
Online · 138 attendees from 48 groups
In this workshop we’ll demonstrate workflows for image and video annotation, instance segmentation, polylines, QA and review, collaborative labeling operations in FiftyOne, and smart data selection strategies that help teams reduce wasted labeling spend.
Date, Time and Location
Jun 30, 2026
9 AM PST
Online. Register for the Zoom!
Annotation is no longer just about drawing boxes. Modern physical AI teams need an end-to-end system for labeling, QA, dataset curation, project management, auto-labeling, and video understanding, all tightly integrated into the workflows where models are actually built and evaluated.
You’ll also get an early look at new agentic labeling workflows powered by “Labeling Agents” — intelligent systems that can learn from text prompts and visual examples to automatically label datasets at scale. We’ll walk through how teams can rapidly create reusable labeling agents, validate outputs, and apply them across large datasets as background tasks.
Whether you’re building computer vision models for robotics, autonomous systems, manufacturing, retail, or multimodal AI applications, this session will show how integrated annotation and data-centric workflows can dramatically accelerate iteration speed while improving dataset quality.
What You’ll Learn
- How smart data selection strategies reduce annotation costs and improve model performance
- Why integrated annotation is becoming a core requirement for modern physical AI platforms
- How to unify data curation, annotation, evaluation, and model iteration inside a single workflow
- How FiftyOne supports annotation workflows for classification, object detection, instance segmentation, polylines, and video detection and tracking
- How to create, edit, QA, and manage 2D and 3D labels directly in context
- How annotation project management workflows help coordinate labeling teams and reviews
- How SAM2-powered click-to-segment workflows enable fast browser-based segmentation
- How agentic labeling works, including training reusable “Labeling Agents”, prompting with text and visual examples, iterating on outputs before deployment, and running large-scale auto-labeling workflows
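One common form of smart data selection is uncertainty sampling: send the samples the model is least sure about to annotators first. The sketch below is only an illustration of that idea in plain Python, not the workshop's method or the FiftyOne API. The sample IDs, probabilities, and function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(predictions, budget):
    """Return the IDs of the `budget` most uncertain samples.

    `predictions` maps a sample ID to the model's class probabilities.
    Higher entropy means the model is less certain about the sample.
    """
    ranked = sorted(
        predictions,
        key=lambda sample_id: entropy(predictions[sample_id]),
        reverse=True,
    )
    return ranked[:budget]

# Hypothetical model outputs for four unlabeled images (assumed values).
predictions = {
    "img_001": [0.98, 0.01, 0.01],  # confident prediction
    "img_002": [0.40, 0.35, 0.25],  # uncertain prediction
    "img_003": [0.70, 0.20, 0.10],
    "img_004": [0.34, 0.33, 0.33],  # close to uniform, most uncertain
}

to_label = select_for_labeling(predictions, budget=2)
print(to_label)
```

With a labeling budget of two, only the two highest-entropy images are queued for annotation, which is how this kind of strategy can cut labeling spend on samples the model already handles well.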
11 attendees from this group

July 1 - Getting Started with FiftyOne
Online · 120 attendees from 48 groups
This workshop is part of our Getting Started with FiftyOne monthly series, a recurring session designed to help you build a strong foundation in data-centric AI workflows.
Date, Time and Location
July 1, 2026
9 AM PST - 10 AM PST
Online. Register for the Zoom!
In this session, you’ll learn how to manage large-scale computer vision datasets using open source FiftyOne. We’ll cover how to curate, visualize, and evaluate your data and models, with a focus on improving data quality over brute-force model iteration.
You’ll walk away with a repeatable framework for building data-centric AI pipelines across research and production.
What you’ll learn:
- Structure unstructured data into queryable schemas (images, video, point clouds)
- Query datasets using the FiftyOne SDK with filters, tags, and confidence thresholds
- Visualize high-dimensional embeddings to identify clusters, gaps, and outliers
- Automate data curation and prioritize high-value samples for labeling
- Debug model performance using evaluation tools (confusion matrices, PR curves)
- Customize FiftyOne with dashboards and interactive panels
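To make the confidence-threshold and confusion-matrix items above concrete, here is a minimal sketch in plain Python. It deliberately does not use the FiftyOne SDK. The prediction records, labels, and function names are hypothetical and only illustrate the underlying idea.

```python
from collections import Counter

# Hypothetical predictions: (ground-truth label, predicted label, confidence).
predictions = [
    ("cat", "cat", 0.95),
    ("cat", "dog", 0.55),
    ("dog", "dog", 0.88),
    ("dog", "cat", 0.30),
    ("bird", "bird", 0.72),
]

def filter_by_confidence(preds, threshold):
    """Keep only predictions whose confidence is at least `threshold`."""
    return [p for p in preds if p[2] >= threshold]

def confusion_counts(preds):
    """Count (ground truth, predicted) pairs, i.e. confusion-matrix cells."""
    return Counter((truth, pred) for truth, pred, _ in preds)

confident = filter_by_confidence(predictions, threshold=0.5)
matrix = confusion_counts(confident)

# Print one line per non-empty confusion-matrix cell.
for (truth, pred), count in sorted(matrix.items()):
    print(f"truth={truth:<5} predicted={pred:<5} count={count}")
```

Off-diagonal cells such as ("cat", "dog") point to the classes a model confuses. In FiftyOne itself, a comparable filter is typically written with `filter_labels` and a `ViewField` expression on `confidence`.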
Prerequisites:
- Working knowledge of Python
- Familiarity with machine learning and/or computer vision fundamentals
7 attendees from this group

July 8 - Best of CVPR (Day 1)
Online · 68 attendees from 48 groups
Welcome to the Best of CVPR series, your virtual front row to groundbreaking research, insights, and innovations from one of computer vision's premier conferences. Live from the authors to you.
Date, Time and Location
Jul 08, 2026
9 AM - 11 AM PT
Online. Register for Zoom!
Some Modalities Are More Equal Than Others: Understanding and Improving Multimodal Integration in MLLMs
Multimodal large language models can process vision, audio, and text, but it remains unclear whether they truly integrate these modalities or rely on shortcut cues. In this talk, I will present our recent work, “Some Modalities Are More Equal Than Others,” where we introduce MMA-Bench, a benchmark designed to probe MLLMs under controlled audio–visual conflict, misleading text, and modality-specific queries. Through black-box evaluation and white-box attention analysis, we show that current MLLMs often struggle when modalities disagree, exhibit model-specific modality biases, and can be distracted by irrelevant textual context. We further propose an alignment-aware tuning strategy that trains models to answer based on the queried modality, improving robustness and multimodal grounding. This talk will highlight both the failure modes of current MLLMs and practical directions toward more reliable cross-modal reasoning.
About the Speaker
Tianle Chen is a Ph.D. student in Computer Science at Boston University, advised by Prof. Deepti Ghadiyaram. His research focuses on multimodal large language models, audio–visual reasoning, robustness, and trustworthy multimodal AI. He is interested in understanding how models allocate evidence across modalities and designing methods that improve reliable multimodal reasoning.
LinkedOut: Linking World Knowledge Out of Video LLMs for Next-Generation Video Recommendation
This CVPR 2026 work links structured world knowledge representations out of Video LLMs for next-generation video recommendation, covering how large vision-language models can provide rich semantic priors for video understanding while addressing efficiency and deployment challenges in real recommendation systems.
About the Speaker
Haichao Zhang is a Ph.D. candidate in Computer Engineering at Northeastern University. His research focuses on computer vision, vision-language models, video understanding and generation, and efficient multimodal foundation models. He has research experience at Google CoreML, Meta Reality Labs, LinkedIn Video AI, Amazon AWS AI Labs, and Tencent.
CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation
This paper presents CylinderDepth, a self-supervised surround depth estimation method leveraging cylindrical spatial attention for multi-view consistency across camera rigs.
About the Speaker
Samer Abualhanud is a PhD student and research staff member at Leibniz University Hannover, Germany, supervised by Dr.-Ing. Max Mehltretter and Prof. Christian Heipke. Abualhanud's research focuses on multi-view consistency in 3D reconstruction.
Your ViT is Secretly Also a Video Segmentation Model
Existing online video segmentation models typically combine a per-frame segmentation module with complex, specialized tracking modules. This work shows that a plain Vision Transformer encoder with a lightweight temporal module can match that performance. The resulting model, VidEoMT, is up to 5–10x faster, running at up to 160 FPS with a ViT-L encoder.
About the Speaker
Daan de Geus is an Assistant Professor in the Mobile Perception Systems Lab at TU/e. He received his PhD (cum laude) from TU/e in 2024, and his research focuses on machine learning for visual and multimodal scene understanding.
5 attendees from this group

July 9 - Best of CVPR (Day 2)
Online · 100 attendees from 50 groups
The Best of CVPR is a three-day virtual meetup series featuring researchers presenting their accepted papers from the 2026 Conference on Computer Vision and Pattern Recognition (CVPR).
Date, Time and Location
Jul 09, 2026
9 AM - 11 AM PT
Online. Register for Zoom!
Efficient Representation and Coding of Dynamic Light Fields
This talk presents a data-driven approach that integrates aperture and pixel-wise exposure coding with Dynamic Mode Decomposition (DMD) to achieve compact representation of dynamic light fields. By modeling them as mathematical dynamical systems, the framework captures coherent structures across all dimensions and achieves scalable compression, bitrate savings, and high-quality reconstructions.
About the Speaker
Joshitha Ravishanker is a PhD scholar in the Department of Electrical Engineering at IIT Madras, supervised by Dr. Mansi Sharma and Dr. Kaushik Mitra. She is a Prime Minister's Research Fellow and her doctoral research focuses on the efficient representation and compression of light fields for display applications.
PHANTOM: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics
Recent video generation models can produce visually striking results, but they often fail to capture the physical dynamics that govern how real-world scenes evolve. In this talk, I will present PHANTOM, a physics-infused video generation model that jointly predicts visual content and latent physical dynamics. PHANTOM uses a physics-aware video representation to guide generation toward videos that are both visually realistic and physically consistent, without requiring explicit simulator-based physical specifications. I will discuss the model design, key results on standard and physics-aware video generation benchmarks, and how this work supports broader progress toward multimodal world models for physical AI and embodied reasoning.
About the Speaker
Ismini Lourentzou is an Assistant Professor at the University of Illinois Urbana-Champaign and Director of the Perception and LANguage Lab. Her research focuses on multimodal machine learning, vision-language models, generative modeling, and embodied AI, with applications in physical reasoning, robotics, healthcare, and trustworthy AI.
LoST: Level of Semantics Tokenization for 3D Shapes
Tokenization is fundamental to generative modeling and especially important for autoregressive 3D generation. However, current 3D shape tokenizers rely on geometric level-of-detail hierarchies that are token-inefficient and poorly aligned with semantic structure. We propose Level-of-Semantics Tokenization (LoST), which orders tokens by semantic salience so early tokens produce complete, plausible shapes and later tokens refine detailed geometry and semantics.
LoST is trained with Relational Inter-Distance Alignment (RIDA), a semantic alignment loss that matches relationships in 3D shape latent space to those in DINO feature space. Experiments show that LoST achieves state-of-the-art reconstruction and efficient high-quality AR 3D generation while using only 0.1%–10% of the tokens required by prior methods.
About the Speaker
Niladri Dutt is an ELLIS PhD student at University College London (UCL), sponsored by Adobe Research. He is advised by Prof Niloy Mitra (UCL) and Duygu Ceylan (Adobe). His research interests are in representation learning for 3D and multimodal learning.
3D Reconstruction Improves Weakly-Supervised Semantic Segmentation
Semantic segmentation typically requires expensive, dense annotations, making large-scale training a significant bottleneck. We address this by introducing a framework that leverages recent advances in feed-forward 3D reconstruction to improve weakly supervised semantic segmentation on 2D images, using only sparse labels such as points, scribbles, or coarse masks.
Our core insight is that 3D geometric structure recovered directly from casual 2D video sequences provides powerful cross-view consistency constraints that can propagate sparse annotations across entire scenes. A dual student-teacher architecture enforces semantic consistency between 2D images and reconstructed 3D point clouds, injecting geometric supervision into the learning process while keeping inference purely 2D. Our solution achieves state-of-the-art performance, outperforming existing methods by 2–7% across a range of datasets and annotation types, without requiring additional labels or inference overhead.
About the Speaker
Wolfgang Boettcher is an ELLIS doctoral researcher in the Computer Vision and Machine Learning group at the Max Planck Institute for Informatics. Since completing his master's degree at ETH Zurich, his research has focused on visual perception, semantic scene understanding, and dynamic 3D reconstruction. He is particularly interested in models that can reason about the physical environment for applications in autonomous systems and robotics.
9 attendees from this group
Past events (238)

