
About us
This virtual group is for data scientists, machine learning engineers, and open source enthusiasts.
Every month we'll bring you diverse speakers working at the cutting edge of AI, machine learning, and computer vision.
- Are you interested in speaking at a future Meetup?
- Is your company interested in sponsoring a Meetup?
This Meetup is sponsored by Voxel51, the lead maintainers of the open source FiftyOne computer vision toolset. To learn more, visit the FiftyOne project page on GitHub.
Upcoming events (11)

May 1 - Best of WACV (Day 2)
Online · 116 attendees from 48 groups
Welcome to the Best of WACV series, your virtual pass to some of the groundbreaking research, insights, and innovations that defined this year's conference. Live streaming from the authors to you.
Date, Time and Location
May 01, 2026
9 AM - 11 AM Pacific
Online. Register for the Zoom!
Beyond Pixels: Type-Aware Contrastive Learning for Global Urban Similarity
Standard visual models often fail to distinguish between superficial appearances and meaningful structural variations in urban environments. We present a type-aware contrastive learning framework that measures city similarity by explicitly modeling infrastructure elements like intersections and bus lanes. Our framework integrates a type-conditioned Vision Transformer that actively fuses visual features with CLIP-derived semantic embeddings via a novel adaptive per-type contrastive loss. This allows the model to dynamically prioritize the most discriminative infrastructure categories while down-weighting less informative visual noise. We demonstrate that this method significantly improves clustering quality and generalizes to unseen cities, providing a scalable, interpretable foundation for urban analysis.
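To make the adaptive per-type loss concrete, here is a minimal, hypothetical PyTorch sketch of a type-conditioned InfoNCE objective with learnable per-type weights; the tensor names and the softmax weighting scheme are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def per_type_contrastive_loss(visual, semantic, type_ids, type_logits, tau=0.07):
    """Type-weighted InfoNCE between visual and CLIP-derived semantic embeddings.

    visual, semantic: (N, D) patch embeddings, positives aligned by index
    type_ids:         (N,) infrastructure-type label per patch (e.g. intersection)
    type_logits:      (T,) learnable scalar per type; softmax gives adaptive weights
    """
    v = F.normalize(visual, dim=-1)
    s = F.normalize(semantic, dim=-1)
    logits = v @ s.t() / tau                                # (N, N) similarities
    targets = torch.arange(len(v), device=v.device)         # diagonal = positives
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    w = torch.softmax(type_logits, dim=0)[type_ids]         # up-weight informative types
    return (w * per_sample).sum() / w.sum()                 # weighted mean loss
```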
About the Speaker
Idan Kligvasser is a Ph.D. researcher and Machine Learning Engineer specializing in multimodal generative AI and large-scale ML systems. With a background in developing state-of-the-art diffusion models and VLMs, his work has been recognized at top-tier venues including CVPR, NeurIPS, and ICLR.
Perceptually Guided 3DGS Streaming and Rendering for Mixed Reality
Recent advances in 3D Gaussian Splatting (3DGS) enable high-quality rendering but fall short of mixed reality's demanding requirements for high refresh rates, stereo viewing, and limited compute budgets. We propose a perception-guided, continuous level-of-detail framework that exploits human visual system limitations through a lightweight, gaze-contingent model to predict and adaptively modulate rendering quality across the visual field, maximizing perceived quality under compute constraints.
Combined with an edge-cloud collaborative rendering framework for untethered MR devices, our method achieves superior computational efficiency with minimal perceptual quality loss compared to vanilla and foveated baselines, validated through objective metrics and user studies.
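As a rough illustration of gaze-contingent quality modulation, the sketch below maps eccentricity to a continuous level-of-detail budget using the common cortical-magnification approximation; the constant and the mapping are textbook assumptions for illustration, not the authors' trained perceptual model.

```python
import math

def eccentricity_deg(gaze_dir, point_dir):
    """Angle in degrees between the (unit) gaze direction and a point's direction."""
    dot = max(-1.0, min(1.0, sum(g * p for g, p in zip(gaze_dir, point_dir))))
    return math.degrees(math.acos(dot))

def lod_budget(ecc_deg, e2=2.3, min_lod=0.1):
    """Continuous level-of-detail in (0, 1], shrinking with eccentricity.

    Follows the standard acuity falloff M(e) ~ 1 / (1 + e / e2), so splats far
    from the gaze point can be rendered at a fraction of full quality.
    """
    return max(min_lod, 1.0 / (1.0 + ecc_deg / e2))
```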
About the Speaker
Sai Harsha Mupparaju is an MS Computer Science student at NYU working in the Immersive Computing Lab with Prof. Qi Sun, where he focuses on 3D Gaussian Splatting, neural rendering, and perceptual VR/MR systems. He previously earned his undergraduate degree from BITS Pilani and conducted research at the Indian Institute of Science (IISc). His research has been published at IEEE WACV 2026, ACM Transactions on Graphics, and ACM SIGGRAPH 2024.
SAVIOR: Sample-efficient Adaptation of Vision-Language Models for OCR Representation
OCR pipelines and vision-language models systematically underperform on document patterns critical to financial workflows, such as vertical text, logo-embedded vendor names, degraded scans, and complex multi-column layouts. While underrepresented in public datasets, these patterns constitute a substantial portion of real-world failure cases.
We introduce SAVIOR, a sample-efficient data curation methodology that targets such high-impact failure scenarios to adapt vision-language models for robust financial OCR, and PaIRS, a structure-aware evaluation metric that measures layout fidelity by comparing pairwise spatial relationships between tokens. When fine-tuned with SAVIOR-Train, Qwen2.5-VL-Instruct demonstrates robust financial OCR performance, outperforming both open and closed-source baselines including GPT-4o, Mistral-OCR, PaddleOCR-VL, and DeepSeek-OCR.
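The idea behind a pairwise layout-fidelity score can be sketched in a few lines; this is a deliberately simplified stand-in for intuition, not the paper's PaIRS definition.

```python
from itertools import combinations

def rel(a, b):
    """Coarse spatial relation between two token boxes (x0, y0, x1, y1)."""
    horiz = "left" if a[2] <= b[0] else "right" if b[2] <= a[0] else "overlap"
    vert = "above" if a[3] <= b[1] else "below" if b[3] <= a[1] else "overlap"
    return horiz, vert

def pairwise_layout_fidelity(pred_boxes, gt_boxes):
    """Fraction of token pairs whose predicted spatial relation matches the
    ground truth. Assumes pred/gt boxes are aligned by token index."""
    pairs = list(combinations(range(len(gt_boxes)), 2))
    if not pairs:
        return 1.0
    hits = sum(rel(pred_boxes[i], pred_boxes[j]) == rel(gt_boxes[i], gt_boxes[j])
               for i, j in pairs)
    return hits / len(pairs)
```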
About the Speaker
Akshata Bhat is an AI/ML Research Engineer at Hyprbots Inc. Her research interests include multimodal learning, vision-language models, and large-scale document understanding systems.
SynthForm: Towards a DLA-free E2E Form Understanding Model
We present SynthForm-3k, the first large-scale publicly available dataset of synthetically perturbed forms, comprising 3,417 samples across six domains: taxation, immigration, finance, healthcare, dental, and insurance. Ground-truth Markdown is constructed via an intermediate HTML representation generated by GPT-5 under high-reasoning inference, followed by deterministic HTML-to-Markdown conversion and scan-like perturbations (dust, scan lines, blur, rotation) that simulate real-world faxed and scanned documents.
We further introduce SynthForm-VL, a family of 2B, 4B, and 8B models obtained via full-parameter supervised fine-tuning of Qwen3-VL on this dataset. All three variants outperform their respective baselines, yielding ANLS improvements of +5.8, +9.3, and +10.3, with the fine-tuned 2B model exceeding the performance of the 4× larger Qwen3-VL-8B baseline, demonstrating that targeted domain adaptation on perturbation-robust data offers a more favorable cost-performance tradeoff than scale alone for structured form understanding.
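For a sense of what the scan-like perturbation pass described above might look like, here is a hypothetical Pillow/NumPy sketch; the noise rates and parameter ranges are guesses for illustration, not the dataset's actual generation code.

```python
import numpy as np
from PIL import Image, ImageFilter

def scanify(img: Image.Image, seed: int = 0) -> Image.Image:
    """Apply scan-like perturbations (rotation, blur, dust, scan lines)."""
    rng = np.random.default_rng(seed)
    img = img.convert("L").rotate(rng.uniform(-2, 2), fillcolor=255)  # slight skew
    img = img.filter(ImageFilter.GaussianBlur(radius=rng.uniform(0.3, 1.0)))
    arr = np.asarray(img).astype(np.float32)
    dust = rng.random(arr.shape) < 0.001                # sparse dark specks
    arr[dust] = 0
    arr[:: rng.integers(40, 80)] *= 0.85                # faint horizontal scan lines
    return Image.fromarray(arr.clip(0, 255).astype(np.uint8))
```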
About the Speaker
Andre Fu is an ML researcher and founder whose work spans multimodal learning, GPU inference infrastructure, and document understanding, with prior publications at NeurIPS, ICCV, CVPR, and WACV. He has worked in document processing and ML for the last 4 years, specializing in InsurTech, FinTech & HealthTech use cases.
7 attendees from this group

May 6 - Building Composable Computer Vision Workflows in FiftyOne
Online · 82 attendees from 48 groups
This workshop explores the FiftyOne plugin framework to build custom computer vision applications. You'll learn to extend the open source FiftyOne App with Python-based panels and server-side operators, as well as integrate external tools for labeling, vector search, and model inference into your dataset views. A minimal operator sketch follows the topic list below.
Date, Time and Location
May 6, 2026
10 AM - 11 AM Pacific
Online. Register for the Zoom!
What You'll Learn
- Build Python plugins. Define plugin manifests and directory structures to register custom functionality within the FiftyOne ecosystem.
- Develop server side operators. Write functions to execute model inference, data cleaning, or metadata updates from the App interface.
- Build interactive panels. Create custom UI dashboards to visualize model metrics or specialized dataset distributions.
- Manage operator execution contexts. Pass data between the App front end and your backend to build dynamic user workflows.
- Implement delegated execution. Configure background workers to handle long running data processing tasks without blocking the user interface.
- Build labeling integrations. Streamline the flow of data between FiftyOne and annotation platforms through custom triggers and ingestion scripts.
- Extend vector database support. Program custom connectors for external vector stores to enable semantic search across large sample datasets.
- Package and share plugins. Distribute your extensions internally and externally.
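As promised above, here is a minimal sketch of a Python operator in the documented FiftyOne plugin pattern (a complete plugin also needs a `fiftyone.yml` manifest alongside this file); the operator name and the `aspect_ratio` field are invented for illustration.

```python
import fiftyone.operators as foo

class ComputeAspectRatio(foo.Operator):
    @property
    def config(self):
        return foo.OperatorConfig(
            name="compute_aspect_ratio",   # invoked from the App's operator browser
            label="Compute aspect ratio",
        )

    def execute(self, ctx):
        # ctx.view is the dataset view currently loaded in the App;
        # assumes image metadata has already been computed
        for sample in ctx.view.iter_samples(autosave=True):
            sample["aspect_ratio"] = sample.metadata.width / sample.metadata.height
        return {"processed": len(ctx.view)}

def register(p):
    p.register(ComputeAspectRatio)
```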
3 attendees from this group

May 7 - Visual AI in Healthcare
Online · 258 attendees from 48 groups
Join us to hear experts on cutting-edge topics at the intersection of AI, ML, computer vision, and healthcare.
Date, Time, and Location
May 07, 2026
9 AM Pacific
Online. Register for the Zoom!
Representation Learning Under Weak Supervision in Computational Pathology
Computational pathology has advanced rapidly with deep learning and, more recently, pathology foundation models that provide strong transferable representations from whole-slide images. Yet important gaps remain: pretrained features often retain domain shift relative to downstream clinical datasets, and most existing pipelines do not explicitly model the geometric organization of tissue architecture that underlies disease progression.
In this talk, I will present our work on weak- and semi-supervised representation learning methods designed to address these challenges, including adaptive stain separation for contrastive learning, bag-label-aware contrastive pretraining for multiple-instance learning, and distance-aware spatial modeling that injects tissue geometry into slide-level prediction. These methods reduce dependence on dense annotations while improving the quality, robustness, and clinical relevance of learned representations in histopathology. Across kidney and prostate cancer studies, they produce stronger downstream performance than standard self-supervised, semi-supervised, and MIL baselines, including improved classification on ccRCC datasets and more accurate prediction of metastatic risk from diagnostic prostate biopsies.
About the Speaker
Dr. Tolga Tasdizen is Professor and Associate Chair of Electrical and Computer Engineering and a faculty member of the Scientific Computing and Imaging Institute at the University of Utah, where he works on AI and machine learning for image analysis with applications in biomedical imaging, public health, and materials science. His research spans self- and semi-supervised learning, domain adaptation, and interpretability.
Efficient and Reliable AI for Real-World Healthcare Deployment
Healthcare is one of the highest-impact domains for AI, yet reliable deployment at scale remains difficult. To truly improve patient care and clinical workflows, AI must operate under real clinical constraints, not just in ideal lab settings. In practice, deployment is limited by high compute and memory costs, scarce labeled data, and distribution shifts across sites and time. Many clinically important findings are also rare and long-tailed, which makes generalization especially challenging. My research makes deployability a design objective by developing methods that stay accurate under strict resource and data constraints.
In this talk, I will first discuss high-performance lightweight deep learning architectures built by redesigning core building blocks. I will then present training-time generative supervision strategies that improve data efficiency and generalization to rare and long-tailed cases with no inference overhead. I will conclude with a forward-looking direction toward real-time perception for surgical assistance, where reliable performance under strict constraints is non-negotiable.
About the Speaker
Md Mostafijur Rahman is a Ph.D. candidate at The University of Texas at Austin, advised by Radu Marculescu. His research sits at the intersection of AI, biomedical imaging, and computer vision, with a focus on building efficient, reliable, and scalable AI systems for deployment in healthcare under real-world constraints. His work has been translated to practice through research internships at GE Healthcare, the National Institutes of Health (NIH), and Bosch Research.
VIGIL: Vectors of Intelligent Guidance in Long-Reach Rural Healthcare
VIGIL (Vectors of Intelligent Guidance in Long-Reach Rural Healthcare) is an AI-driven system designed to support generalist clinicians through interactive, multimodal guidance. The system combines perception, language understanding, and tool use to assist with tasks such as ultrasound acquisition and interpretation in real time. In this talk, we focus on the overall system architecture, highlighting how individual components, ranging from visual models to medical reasoning agents, interact to produce coherent guidance. We also discuss key challenges we have encountered, including tool orchestration, latency, and robustness across components. This presentation aims to provide a systems-level perspective on building embodied AI agents for real-world healthcare settings.
About the Speaker
Andrew Krikorian is a Ph.D. student in Robotics at the University of Michigan, where he is a member of the Corso Group (COG). His research focuses on building physically grounded AI agents that combine perception, tool use, and planning to operate effectively in real-world environments, with a particular emphasis on healthcare applications. He is actively involved in the ARPA-H PARADIGM program, developing intelligent systems for rural clinical settings.
Scaling Healthcare AI with Synthetic Data and World Models
The scarcity of labeled, privacy-compliant medical imaging data remains one of the biggest bottlenecks in healthcare AI development. Emerging world models are changing this landscape by generating high-fidelity synthetic data, from radiology scans to surgical scene simulations, that can augment real-world datasets without compromising patient privacy. However, synthetic data is only as valuable as your ability to curate, validate, and evaluate it alongside real clinical data. In this talk, we explore how teams are using FiftyOne to build rigorous quality pipelines around synthetic medical imagery, enabling them to detect distribution gaps, measure model performance across rare pathologies, and ensure that generated samples meaningfully improve downstream diagnostics. We'll walk through practical workflows that combine world model outputs with real-world medical datasets to accelerate Visual AI in healthcare, responsibly and at scale.
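As one example of what such a quality pipeline can look like, the sketch below tags real and synthetic samples and projects them into a shared embedding map in FiftyOne; the directory paths and dataset names are placeholders.

```python
import fiftyone as fo
import fiftyone.brain as fob

# Placeholder paths; tag provenance so the two populations stay distinguishable
real = fo.Dataset.from_images_dir("/data/real_scans")
synth = fo.Dataset.from_images_dir("/data/world_model_outputs")
real.tag_samples("real")
synth.tag_samples("synthetic")

combined = fo.Dataset("synthetic-vs-real")
combined.add_collection(real)
combined.add_collection(synth)

# Project everything into one 2D embedding map; regions populated only by
# synthetic samples are candidate distribution gaps worth investigating
fob.compute_visualization(combined, brain_key="gap_check", method="umap")
session = fo.launch_app(combined)
```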
About the Speaker
Daniel Gural is an expert in Physical AI and has been working in the field for over 8 years. Working across healthcare, he has experience both with operational use cases and with using Visual AI as an aid in psychology applications.
18 attendees from this group

May 11 - Best of 3DV 2026
Online · 124 attendees from 48 groups
Welcome to the Best of 3DV series, your virtual pass to some of the groundbreaking research, insights, and innovations that defined this year's conference. Live streaming from the authors to you.
Date, Time and Location
May 11, 2026
9 AM Pacific
Online. Register for the Zoom!
Navigating a 3D Vision Conference with VLMs and Embeddings
Attending the 3D Vision Conference means confronting 177 accepted papers across 3.5 days, far more than any one person can absorb. Skimming titles the night before isn't enough.
This talk shows how to build a systematic, interactive map of an entire conference using modern open-source tools. We load all 177 papers from 3DV 2026 (full PDF page images plus metadata) into a FiftyOne grouped dataset. We then run three annotation passes using Qwen3.5-9B on each cover page: topic classification, author affiliation extraction, and project page detection. Document embeddings from Jina v4 are computed across all 3,019 page images, pooled to paper-level vectors, and fed into FiftyOne Brain for UMAP visualization, similarity search, representativeness scoring, and uniqueness scoring.
The result is an interactive dataset you can query, filter, and explore in the FiftyOne App. Sort by uniqueness to find distinctive work, filter by topic and sort by representativeness to understand each research area, and cross-reference with scheduling metadata to build a personal agenda.
I demonstrate the end-to-end pipeline and discuss design decisions regarding grouped datasets, reasoning model output parsing, and embedding pooling strategies.
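A condensed sketch of the FiftyOne Brain side of that pipeline, assuming pooled paper-level vectors have already been stored on each sample in an `embedding` field (the dataset name is hypothetical):

```python
import fiftyone as fo
import fiftyone.brain as fob

papers = fo.load_dataset("3dv-2026-papers")  # hypothetical dataset name

# UMAP map, similarity index, and per-sample uniqueness/representativeness,
# all computed from the precomputed "embedding" field
fob.compute_visualization(papers, embeddings="embedding",
                          brain_key="paper_umap", method="umap")
fob.compute_similarity(papers, embeddings="embedding", brain_key="paper_sim")
fob.compute_uniqueness(papers, embeddings="embedding")
fob.compute_representativeness(papers, embeddings="embedding")

# Distinctive papers first; filter by topic in the App to see representatives
view = papers.sort_by("uniqueness", reverse=True)
session = fo.launch_app(view)
```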
About the Speaker
Harpreet Sahota is a hacker-in-residence and machine learning engineer with a passion for deep learning and generative AI. He's got a deep interest in VLMs, Visual Agents, Document AI, and Physical AI.
Seeing Through Clutter: Structured 3D Scene Reconstruction via Iterative Object Removal
We present SeeingThroughClutter, a method for reconstructing structured 3D representations from single images by segmenting and modeling objects individually. Prior approaches rely on intermediate tasks such as semantic segmentation and depth estimation, which often underperform in complex scenes, particularly in the presence of occlusion and clutter.
We address this by introducing an iterative object removal and reconstruction pipeline that decomposes complex scenes into a sequence of simpler subtasks. Using VLMs as orchestrators, foreground objects are removed one at a time via detection, segmentation, object removal, and 3D fitting. We show that removing objects allows for cleaner segmentations of subsequent objects, even in highly occluded scenes. Our method requires no task-specific training and benefits directly from ongoing advances in foundation models. We demonstrate state-of-the-art robustness on 3D-Front and ADE20K datasets.
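The decomposition loop itself is simple enough to sketch; every callable below is an injected, hypothetical stand-in for an off-the-shelf foundation model (detector, VLM orchestrator, segmenter, inpainter, 3D fitter), not the authors' code.

```python
def reconstruct_scene(image, detect, pick_next, segment, inpaint, fit_3d):
    """Iterative object removal: detect -> choose -> segment -> fit -> inpaint.

    All five callables are hypothetical stand-ins for foundation models.
    """
    objects_3d = []
    while detections := detect(image):
        target = pick_next(image, detections)     # VLM chooses the next object
        mask = segment(image, target)
        objects_3d.append(fit_3d(image, mask))    # model this object in 3D
        image = inpaint(image, mask)              # removing it de-occludes the rest
    return objects_3d
```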
About the Speaker
Rio Aguina-Kang is currently a Machine Learning Engineer at Drafted AI, a startup focused on generative architecture. He has previously worked at Adobe Research, Brown Visual Computing, and the Stanford Institute for Human-Centered Artificial Intelligence. He is broadly interested in building systems that let users generate and control visual content through structured representations that reflect their intent.
Physically Realistic 4D Generation
Generating dynamic 3D content that moves and deforms over time is a key frontier in visual computing, with applications in VR/AR, robotics, and digital humans. In this talk, I present our series of works on physically realistic 4D generation: from neural surface deformation with explicit velocity fields (ICLR 2025) to our 4Deform framework for robust shape interpolation (CVPR 2025). Both methods use implicit neural representations with physically constrained velocity fields that enforce volume preservation, spatial smoothness, and geometric consistency. I will also introduce TwoSquared (3DV 2026, oral), which achieves full 4D generation from just two 2D image pairs, demonstrating a practical path toward controllable, physically plausible 4D content creation.
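For intuition on the volume-preservation constraint, there is a standard characterization in terms of the velocity field (a textbook fact, not necessarily the papers' exact loss): a flow preserves volume exactly when its velocity field is divergence-free, which yields a natural soft penalty.

```latex
\frac{\mathrm{d}x}{\mathrm{d}t} = v(x,t),
\qquad
\nabla \cdot v = 0 \;\;\text{(volume preservation)},
\qquad
\mathcal{L}_{\text{vol}} = \mathbb{E}_{x,t}\!\left[\big(\nabla \cdot v(x,t)\big)^{2}\right]
```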
About the Speaker
Lu Sang is a PhD researcher in Computer Vision at TU Munich (Prof. Daniel Cremers), specializing in 3D/4D reconstruction, neural implicit surfaces, and inverse rendering, with several publications at top venues including CVPR, ICLR, and ECCV. She is currently a research intern at Google XR in Zurich. With a strong mathematical foundation and a track record spanning photometric stereo to 4D generation, she brings both theoretical depth and hands-on engineering to cutting-edge visual computing research.
Finding NeMO: A Geometry-Aware Representation of Template Views for Few-Shot Perception
How can we use and perceive objects without training a new model, given only a few images? We present NeMO, a novel object representation that enables 6DoF object pose estimation, detection, and segmentation given only a handful of RGB images of an unknown object.
About the Speaker
Sebastian Jung studied physics at LMU Munich. He started his PhD in Computer Science at the German Aerospace Center (DLR) in 2025, where he works on object-centric few-shot perception with a focus on robotic applications. Additionally, he is a student researcher at Google, focusing on computer vision algorithms for XR.
5 attendees from this group
Past events (222)

