
About us
🖖 This virtual group is for data scientists, machine learning engineers, and open source enthusiasts.
Every month we’ll bring you diverse speakers working at the cutting edge of AI, machine learning, and computer vision.
- Are you interested in speaking at a future Meetup?
- Is your company interested in sponsoring a Meetup?
This Meetup is sponsored by Voxel51, the lead maintainers of the open source FiftyOne computer vision toolset. To learn more, visit the FiftyOne project page on GitHub.
Upcoming events (10)
Feb 5 - AI, ML and Computer Vision Meetup
Online · 368 attendees from 47 groups
Join our virtual Meetup to hear talks from experts on cutting-edge topics across AI, ML, and computer vision.
Time and Location
Feb 5, 2026
9 - 11 AM Pacific
Online. Register for the Zoom!

Unlocking Visual Anomaly Detection: Navigating Challenges and Pioneering with Vision-Language Models
Visual anomaly detection (VAD) is pivotal for ensuring quality in manufacturing, medical imaging, and safety inspections, yet it continues to face challenges such as data scarcity, domain shifts, and the need for precise localization and reasoning. This seminar explores VAD fundamentals, core challenges, and recent advancements leveraging vision-language models and multimodal large language models (MLLMs). We contrast CLIP-based methods for efficient zero/few-shot detection with MLLM-driven reasoning for explainable, threshold-free outcomes. Drawing from recent studies, we highlight emerging trends, benchmarks, and future directions toward building adaptable, real-world VAD systems. This talk is designed for researchers and practitioners interested in AI-driven inspection and next-generation multimodal approaches.
About the Speaker
Hossein Kashiani is a fourth-year Ph.D. student at Clemson University. His research focuses on developing generalizable and trustworthy AI systems, with publications in top venues such as CVPR, WACV, ICIP, IJCB, and TBIOM. His work spans diverse applications, including anomaly detection, media forensics, biometrics, healthcare, and visual perception.
Data-Centric Lessons To Improve Speech-Language Pretraining
Spoken Question-Answering (SQA) is a core capability for useful and interactive artificial intelligence systems. Recently, several speech-language models (SpeechLMs) have been released with a specific focus on improving their SQA performance. However, a lack of controlled ablations of pretraining data processing and curation makes it challenging to understand what factors account for performance, despite substantial gains from similar studies in other data modalities. In this work, we address this gap by conducting a data-centric exploration for pretraining SpeechLMs.
We focus on three research questions fundamental to speech-language pretraining data:
- How to process raw web-crawled audio content for speech-text pretraining;
- How to construct synthetic pretraining datasets to augment web-crawled data;
- How to interleave (text, audio) segments into training sequences.
We apply the insights from our controlled data-centric ablations to pretrain a 3.8B-parameter SpeechLM, called SpeLangy, which outperforms models up to 3x larger by 10.2% absolute. We hope our findings highlight the impact of effective data curation for speech-language pretraining and guide future data-centric exploration in SpeechLMs.
About the Speaker
Vishaal Udandarao is a third-year ELLIS PhD student, jointly working with Matthias Bethge at the University of Tuebingen and Samuel Albanie at the University of Cambridge/Google DeepMind. He is also part of the International Max Planck Research School for Intelligent Systems. He is mainly interested in understanding the generalisation properties of foundation models, both vision-language models (VLMs) and large multimodal models (LMMs), through the lens of their pre-training and test data distributions. His research is funded by a Google PhD Fellowship in Machine Intelligence.
A Practical Pipeline for Synthetic Data with Nano Banana Pro + FiftyOne
Most computer-vision failures come from the rare cases, the dark corners, odd combinations, and edge conditions we never capture enough in real datasets. In this session, we walk through a practical end-to-end pipeline for generating targeted synthetic data using Google’s Nano Banana Pro and managing it with FiftyOne. We’ll explore how to translate dataset gaps into generation prompts, create thousands of high-quality synthetic images, automatically enrich them with metadata, and bring everything into FiftyOne for inspection, filtering, and validation. By the end, you’ll understand how to build a repeatable synthetic-first workflow that closes real vision gaps and improves model performance on the scenarios that matter most.
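The “translate dataset gaps into generation prompts” step can be sketched as a small templating helper. Everything below is a hypothetical illustration, not part of Nano Banana Pro or FiftyOne: `gap_to_prompts` and its prompt wording are stand-ins for whatever gap descriptors and prompt style a real pipeline would use.

```python
from itertools import product

def gap_to_prompts(scene: str, conditions: list[str], subjects: list[str]) -> list[str]:
    """Expand one dataset-gap description into concrete generation prompts.

    The template is a hypothetical example; a real pipeline would tune the
    wording per generator and per gap before sending prompts for synthesis.
    """
    return [
        f"A photorealistic {subject} in a {scene}, {condition}, candid angle"
        for subject, condition in product(subjects, conditions)
    ]

# One gap ("warehouse scenes in poor conditions") fans out into 4 prompts.
prompts = gap_to_prompts(
    scene="warehouse aisle",
    conditions=["low light", "heavy motion blur"],
    subjects=["forklift", "worker wearing a hi-vis vest"],
)
print(len(prompts))  # 4: one prompt per (subject, condition) pair
```

Generated images would then be imported into FiftyOne alongside their prompt metadata, so each synthetic sample can be filtered and validated against the gap it was meant to close.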
About the Speaker
Adonai Vera is a Machine Learning Engineer & DevRel at Voxel51, with over 7 years of experience building computer vision and machine learning models using TensorFlow, Docker, and OpenCV. He started as a software developer, moved into AI, led teams, and served as CTO. Today, he connects code and community to build open, production-ready AI, making technology simple, accessible, and reliable.
Making Computer Vision Models Faster: An Introduction to TensorRT Optimization
Modern computer vision applications demand real-time performance, yet many deep learning models struggle with high latency during deployment. This talk introduces how TensorRT can significantly accelerate inference by applying optimizations such as layer fusion, precision calibration, and efficient memory management. Attendees will learn the core concepts behind TensorRT, how it integrates into existing CV pipelines, and how to measure and benchmark improvements. Through practical examples and performance comparisons, the session will demonstrate how substantial speedups can be achieved with minimal model-accuracy loss. By the end, participants will understand when and how to apply TensorRT to make their CV models production-ready.
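The measuring-and-benchmarking step the talk mentions reduces to timing a warmed-up model over many runs and comparing latency distributions. Here is a minimal, library-agnostic sketch; the two lambdas merely simulate a slow baseline and a faster optimized engine, since a real comparison would pass a PyTorch forward call and a TensorRT engine execution as the `infer` callables.

```python
import statistics
import time

def benchmark(infer, n_warmup: int = 10, n_runs: int = 100) -> dict:
    """Time an inference callable and report latency statistics in ms."""
    for _ in range(n_warmup):  # warm up caches and lazy initialization first
        infer()
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        infer()
        samples.append((time.perf_counter() - start) * 1e3)
    return {
        "mean_ms": statistics.mean(samples),
        "p99_ms": sorted(samples)[int(0.99 * len(samples)) - 1],
    }

# Stand-in workloads simulating a baseline model vs. an optimized engine.
baseline = benchmark(lambda: sum(i * i for i in range(50_000)))
optimized = benchmark(lambda: sum(i * i for i in range(5_000)))
print(f"speedup: {baseline['mean_ms'] / optimized['mean_ms']:.1f}x")
```

Reporting a tail percentile alongside the mean matters for real-time pipelines: optimizations like layer fusion often help p99 latency even more than the average.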
About the Speaker
Tushar Gadhiya is a Technical Lead at Infocusp Innovations, specialising in deep learning, computer vision, graph learning, and agentic AI. His experience spans academic research, as a PhD holder, and industry work, where he has contributed to multiple patents.
9 attendees from this group
Feb 11 - Visual AI for Video Use Cases
Online · 204 attendees from 47 groups
Join our virtual Meetup to hear talks from experts on cutting-edge topics at the intersection of Visual AI and video use cases.
Time and Location
Feb 11, 2026
9 - 11 AM Pacific
Online. Register for the Zoom!

VIDEOP2R: Video Understanding from Perception to Reasoning
Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL), has shown promising results in improving the reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning.
In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO, and demonstrate that the model's perception output is information-sufficient for downstream reasoning.
About the Speaker
Yifan Jiang is a third-year Ph.D. student at the Information Sciences Institute at the University of Southern California (USC-ISI), advised by Dr. Jay Pujara, focusing on natural language processing, commonsense reasoning, and multimodal large language models.
Layer-Aware Video Composition via Split-then-Merge
Split-then-Merge (StM) is a novel generative framework that overcomes data scarcity in video composition by splitting unlabeled videos into separate foreground and background layers for self-supervised learning. By utilizing a transformation-aware training pipeline with multi-layer fusion, the model learns to realistically compose dynamic subjects into diverse scenes without relying on expensive annotated datasets. This presentation will cover the problem of video composition and the details of StM, an approach that tackles this problem from a generative AI perspective. We will conclude by demonstrating how StM works and how it outperforms state-of-the-art methods in both quantitative benchmarks and qualitative evaluations.
About the Speaker
Ozgur Kara is a 4th year Computer Science PhD student at the University of Illinois Urbana-Champaign (UIUC), advised by Founder Professor James M. Rehg. His research builds the next generation of video AI by tackling three core challenges: efficiency, controllability, and safety.
Video-native VLMs and control
We show how image-native vision–language models can be extended to support native video understanding, structured reasoning, tool use, and robotics. Our approach focuses on designing data, modeling, and training recipes that optimize for multimodal input and interaction patterns, treating vision and perception as first-class citizens. We discuss lessons learned from scaling these methods in an open-source model family and their implications for building flexible multimodal systems.
About the Speaker
Akshat Shrivastava is the CTO and co-founder of Perceptron, previously leading AR On-Device at Meta and conducting research at UW.
Video Intelligence Is Going Agentic
Video content has become ubiquitous in our digital world, yet the tools for working with video have remained largely unchanged for decades. This talk explores how the convergence of foundation models and agent architectures is fundamentally transforming video interaction and creation. We'll examine how video-native foundation models, multimodal interfaces, and agent transparency are reshaping enterprise media workflows through a deep dive into Jockey, a pioneering video agent system.
About the Speaker
James Le currently leads the developer experience function at TwelveLabs - a startup building foundation models for video understanding. He previously operated in the MLOps space and ran a blog/podcast on the Data & AI infrastructure ecosystem.
7 attendees from this group
Feb 18 - Feedback-Driven Annotation Pipelines for End-to-End ML Workflows
Online · 114 attendees from 47 groups
In this technical workshop, we’ll show how to build a feedback-driven annotation pipeline for perception models using FiftyOne. We’ll explore real model failures and data gaps and turn them into focused annotation tasks, which then route through a repeatable workflow for labeling and QA. The result is an end-to-end pipeline that keeps annotators, tools, and models aligned, closing the loop from annotation and curation back to model training and evaluation.
Time and Location
Feb 18, 2026
10 - 11 AM PST
Online. Register for the Zoom!

What you'll learn
- Techniques for labeling the data that matters most, saving annotation time and cost
- How to structure human-in-the-loop workflows that find and fix model errors, data gaps, and targeted relabeling needs instead of bulk labeling
- How to combine auto-labeling and human review in a single, feedback-driven pipeline for perception models
- How to use label schemas and metadata as “data contracts” that enforce consistency between annotators, models, and tools, especially for multimodal data
- How to detect and manage schema drift, and tie schema versions to dataset and model versions for reproducibility
- QA and review steps that surface label issues early and tie changes back to model behavior
- An annotation architecture that accommodates new perception tasks and feedback signals without rebuilding your entire data stack
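The “data contract” and schema-drift ideas above can be made concrete with a small check. This is a sketch only: the field and class names are hypothetical, and a real pipeline would run such a check against schemas exported from the annotation tool rather than hand-written dicts.

```python
def schema_drift(contract: dict[str, set[str]], observed: dict[str, set[str]]) -> dict:
    """Compare an observed label schema against the agreed 'data contract'.

    Both arguments map label fields to their allowed class names. Returns
    fields and classes that appeared without a contract change, so QA can
    flag them before they silently reach model training.
    """
    return {
        "unknown_fields": sorted(set(observed) - set(contract)),
        "unknown_classes": {
            field: sorted(observed[field] - classes)
            for field, classes in contract.items()
            if field in observed and observed[field] - classes
        },
    }

# Annotators introduced a "scooter" class and a "weather" field without a
# contract update; both show up as drift to review.
contract = {"detections": {"car", "pedestrian", "cyclist"}}
observed = {"detections": {"car", "pedestrian", "scooter"}, "weather": {"rain"}}
print(schema_drift(contract, observed))
# {'unknown_fields': ['weather'], 'unknown_classes': {'detections': ['scooter']}}
```

Versioning the `contract` alongside dataset and model versions is what makes the drift report reproducible: any flagged class either triggers a relabel task or a deliberate contract bump.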
7 attendees from this group 
Feb 18 - Boston AI, ML and Computer Vision Meetup
Microsoft NERD New England Research & Development Center, One Memorial Drive, Cambridge, MA, US
Join the Meetup to hear talks from experts on cutting-edge topics across AI, ML, and computer vision.
Pre-registration is mandatory to clear security.
Date and Location
Feb 18, 2026
5:30 - 8:30 PM EST
Microsoft Research Lab – New England (NERD) at MIT
Deborah Sampson Conference Room
One Memorial Drive, Cambridge, MA 02142

SNAP: Towards Segmenting Anything in Any Point Cloud
Segmenting objects in 3D point clouds is a core problem in 3D scene understanding and scalable data annotation. In this talk, I will present SNAP: Segmenting Anything in Any Point Cloud, a unified framework for interactive point cloud segmentation that supports both point-based and text-based prompts across indoor, outdoor, and aerial domains. SNAP is trained jointly on multiple heterogeneous datasets and achieves strong cross-domain generalization through domain-adaptive normalization. The model enables both spatially prompted instance segmentation and text-prompted panoptic and open-vocabulary segmentation directly on point clouds. Extensive experiments demonstrate that SNAP matches or outperforms domain-specific methods on a wide range of zero-shot benchmarks.
About the Speaker
Hanhui Wang is a first-year Ph.D. student at the Visual Intelligence Lab at Northeastern University. His research centers on 3D scene understanding, with recent work on point cloud segmentation and structured representations, and broader interests in generation and reasoning for multimodal 3D/4D perception.
Culturally Adaptive AI
AI can now generate videos, images, speech, and text that are almost indistinguishable from human-created content. As generative AI systems become more sophisticated, we end up questioning our feeds' credibility and whether they're even real. There is a need, now more than ever, to develop models that help humans distinguish between real and AI-generated content. How can we shape the next generation of AI models to be more explainable, safe, and creative? How can we make these models teach humans about different cultures, bridging the gap between human and AI collaboration?
This talk highlights emerging techniques and the future of AI that will improve trust in generative AI systems by integrating insights from multimodality, reasoning, and factuality. Tomorrow's AI won't just process data and generate content; rather, we imagine it will amplify our creativity, extend our compassion, and help us rediscover what makes us fundamentally human.
About the Speaker
Anku Rani is a doctoral researcher at the Massachusetts Institute of Technology, investigating machine learning models for video generation along with projects at the intersection of natural language processing and human-computer interaction. Her research spans multimodality, mathematical reasoning, attribution, and fact verification, with work published in leading AI conferences.
Data Foundations for Vision-Language-Action Models
Model architectures get the papers, but data decides whether robots actually work. This talk introduces VLAs from a data-centric perspective: what makes robot datasets fundamentally different from image classification or video understanding, how the field is organizing its data (Open X-Embodiment, LeRobot, RLDS), and what evaluation benchmarks actually measure. We'll examine the unique challenges such as temporal structure, proprioceptive signals, and heterogeneity in embodiment, and discuss why addressing them matters more than the next architectural innovation.
About the Speaker
Harpreet Sahota is a hacker-in-residence and machine learning engineer with a passion for deep learning and generative AI. He’s got a deep interest in RAG, Agents, and Multimodal AI.
Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models
Traditional multimodal learners find unified representations for tasks like visual question answering, but rely heavily on paired datasets. However, an overlooked yet potentially powerful question is: can one leverage auxiliary unpaired multimodal data to directly enhance representation learning in a target modality? We introduce UML: Unpaired Multimodal Learner, a modality-agnostic training paradigm in which a single model alternately processes inputs from different modalities while sharing parameters across them. This design exploits the assumption that different modalities are projections of a shared underlying reality, allowing the model to benefit from cross-modal structure without requiring explicit pairs. Theoretically, under linear data-generating assumptions, we show that unpaired auxiliary data can yield representations strictly more informative about the data-generating process than unimodal training. Empirically, we show that using unpaired data from auxiliary modalities, such as text, audio, or images, consistently improves downstream performance across diverse unimodal targets such as image and audio.
About the Speaker
Sharut Gupta is a fourth-year Ph.D. student at MIT CSAIL, advised by Prof. Phillip Isola and Prof. Stefanie Jegelka. Prior to this, she completed her undergraduate studies in Mathematics and Computing at the Indian Institute of Technology, Delhi (IIT Delhi), during which she worked with Prof. Yoshua Bengio on her thesis. She has also spent time at Meta SuperIntelligence Labs (Meta AI) and Google DeepMind.
Neural Radiance Fields for Image Verification
We propose an image verification method that embeds physical refraction as an authenticity signature. To verify an image, we compare it to a pixel-aligned reconstruction derived from the refraction and flag inconsistencies. Manipulations are detectable because maintaining geometric consistency with the refractive object is difficult without knowing its refractive properties. Unlike prior work that relies on simple analytic refractions and slow per-scene NeRF optimization, we train a compact, scene-agnostic neural refraction field that models complex geometries and enables instant, high-fidelity reconstruction for detection and localization.
About the Speaker
Sage Simhon completed her BSc in Electrical Engineering and Computer Science and her MEng in Computer Science, both at MIT. She is interested in the intersection of physics, AI, and computer vision. She currently works on AI for physics simulation at Pasteur Labs.
83 attendees
Past events (212)