
What we're about
This virtual group is for data scientists, machine learning engineers, and open source enthusiasts.
Every month we'll bring you diverse speakers working at the cutting edge of AI, machine learning, and computer vision.
- Are you interested in speaking at a future Meetup?
- Is your company interested in sponsoring a Meetup?
This Meetup is sponsored by Voxel51, the lead maintainers of the open source FiftyOne computer vision toolset. To learn more, visit the FiftyOne project page on GitHub.
Upcoming events (4)
- July 9 - Best of CVPR
Join us for a series of virtual events focused on the most interesting and groundbreaking research presented at this year's CVPR conference!
When
July 9, 2025 at 9 AM Pacific
Where
Online. Register for the Zoom!
What Foundation Models really need to be capable of for Autonomous Driving: The Drive4C Benchmark
Foundation models hold the potential to generalize the driving task and support language-based interaction in autonomous driving. However, they continue to struggle with specific reasoning tasks essential for robotic navigation. Current benchmarks typically provide only aggregate performance scores, making it difficult to assess the underlying capabilities these models require. Drive4C addresses this gap by introducing a closed-loop benchmark that evaluates semantic, spatial, temporal, and physical understanding, enabling more targeted improvements to advance foundation models for autonomous driving.
About the Speaker
Tin Stribor Sohn is a PhD student at Porsche AG and the Karlsruhe Institute of Technology, working on foundation models for scenario understanding and decision making in autonomous robotics, and Tech Lead for Data Driven Engineering for Autonomous Driving. He previously completed a Master's in Computer Science at the University of Tuebingen with a focus on computer vision and deep learning, and co-founded a software company for smart EV charging.
Human Motion Prediction: Enhanced Realism via Nonisotropic Gaussian Diffusion
Predicting future human motion is a key challenge in generative AI and computer vision, as generated motions should be realistic and diverse at the same time. This talk presents a novel approach that leverages top-performing latent generative diffusion models with a novel paradigm. Nonisotropic Gaussian diffusion leads to better performance, fewer parameters, and faster training at no additional computational cost. We will also discuss how such benefits can be obtained in other application domains.
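As a rough illustration of the nonisotropic idea (not the speaker's implementation), the sketch below contrasts a standard isotropic forward-noising step with a variant whose noise covariance is diagonal but non-identity; the covariance values here are invented for demonstration.

```python
import numpy as np

def forward_diffuse(x0, alpha_bar, sigma_diag=None, rng=np.random):
    """Sample x_t ~ q(x_t | x_0) for a DDPM-style forward process.

    Isotropic case:    x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps, eps ~ N(0, I)
    Nonisotropic case: the noise covariance is diag(sigma_diag) instead of I,
    so different coordinates (e.g. different latent motion dimensions)
    receive different amounts of noise.
    """
    eps = rng.standard_normal(x0.shape)
    if sigma_diag is not None:
        eps = eps * np.sqrt(sigma_diag)   # scale noise per dimension
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

# Toy example: a "pose" vector of 6 latent coordinates.
x0 = np.zeros(6)
iso = forward_diffuse(x0, alpha_bar=0.5)
# Hypothetical per-dimension variances: noisier on the last coordinates.
noniso = forward_diffuse(x0, alpha_bar=0.5,
                         sigma_diag=np.array([0.1, 0.1, 0.5, 0.5, 2.0, 2.0]))
print(iso.round(2), noniso.round(2))
```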
About the Speaker
Cecilia Curreli is a Ph.D. student at the Technical University of Munich, specializing in generative models. A member of the AI Competence Center at MCML, she has conducted research in deep learning, computer vision, and quantum physics through international collaborations with the University of Tokyo and the Chinese Academy of Sciences.
Efficient Few-Shot Adaptation of Open-Set Detection Models
We propose an efficient few-shot adaptation method for the Grounding-DINO open-set object detection model, designed to improve performance on domain-specific specialized datasets like agriculture, where extensive annotation is costly. The method circumvents the challenges of manual text prompt engineering by removing the standard text encoder and instead introduces randomly initialized, trainable text embeddings. These embeddings are optimized directly from a few labeled images, allowing the model to quickly adapt to new domains and object classes with minimal data. This approach demonstrates superior performance over zero-shot methods and competes favorably with other few-shot techniques, offering a promising solution for rapid model specialization.
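A minimal sketch of the general recipe, assuming a hypothetical detector interface (the real Grounding-DINO API differs): freeze the detector, replace the text-encoder output with randomly initialized trainable embeddings, and optimize only those embeddings on a few labeled images.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a frozen open-set detector that consumes
# text embeddings instead of raw prompts.
class FrozenDetector(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.backbone = nn.Linear(3 * 32 * 32, dim)   # toy image encoder
        self.head = nn.Linear(dim, 4)                  # toy box regressor

    def forward(self, images, text_embeds):
        feats = self.backbone(images.flatten(1))       # (B, dim)
        scores = feats @ text_embeds.t()               # similarity to each "class" embedding
        boxes = self.head(feats)
        return scores, boxes

num_classes, dim = 3, 256
detector = FrozenDetector(dim)
for p in detector.parameters():
    p.requires_grad_(False)                            # keep the detector frozen

# Randomly initialized, trainable embeddings replace the text encoder output.
text_embeds = nn.Parameter(torch.randn(num_classes, dim) * 0.02)
opt = torch.optim.AdamW([text_embeds], lr=1e-3)

# One toy optimization step on a tiny labeled batch (labels are class indices).
images = torch.randn(4, 3, 32, 32)
labels = torch.tensor([0, 2, 1, 0])
scores, _ = detector(images, text_embeds)
loss = nn.functional.cross_entropy(scores, labels)
loss.backward()
opt.step()
```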
About the Speaker
Dr. Sudhir Sornapudi is a Senior Data Scientist II at Corteva Agriscience. He leads the Advanced Vision Intelligence team, driving computer vision innovation internally, from cell to space, across Biotechnology, Crop Health, and Seed Operations.
OpticalNet: An Optical Imaging Dataset and Benchmark Beyond the Diffraction Limit
Optical imaging capable of resolving nanoscale features would revolutionize scientific research and engineering applications across biomedicine, smart manufacturing, and semiconductor quality control. However, due to the physical phenomenon of diffraction, the optical resolution is limited to approximately half the wavelength of light, which impedes the observation of subwavelength objects such as the native state coronavirus, typically smaller than 200 nm. Fortunately, deep learning methods have shown remarkable potential in uncovering underlying patterns within data, promising to overcome the diffraction limit by revealing the mapping pattern between diffraction images and their corresponding ground truth object images.
However, the absence of suitable datasets has hindered progress in this field: collecting high-quality optical data of subwavelength objects is highly difficult, as these objects are inherently invisible under conventional microscopy, making it impossible to perform standard visual calibration and drift correction. Therefore, we provide the first general optical imaging dataset based on the "building block" concept for challenging the diffraction limit. Drawing an analogy to modular construction principles, we construct a comprehensive optical imaging dataset comprising subwavelength fundamental elements, i.e., small square units that can be assembled into larger and more complex objects.
We then frame the task as an image-to-image translation task and evaluate various vision methods. Experimental results validate our "building block" concept, demonstrating that models trained on basic square units can effectively generalize to realistic, more complex unseen objects. Most importantly, by highlighting this underexplored AI-for-science area and its potential, we aspire to advance optical science by fostering collaboration with the vision and machine learning communities.
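To make the "building block" idea concrete, here is a toy sketch (sizes and positions are arbitrary, not taken from the dataset) that assembles small square units into a larger composite object, in the same spirit as composing subwavelength elements.

```python
import numpy as np

def assemble(unit_positions, unit_size=4, canvas=32):
    """Place identical small square 'building blocks' onto a canvas
    to form a larger, more complex object."""
    img = np.zeros((canvas, canvas), dtype=np.uint8)
    for r, c in unit_positions:
        img[r:r + unit_size, c:c + unit_size] = 1
    return img

# An L-shaped object built from three small square units.
obj = assemble([(4, 4), (8, 4), (8, 8)])
print(obj.sum(), "pixels set")   # 3 non-overlapping blocks of 16 pixels each
```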
About the Speakers
Wang Benquan is a final-year PhD candidate at Nanyang Technological University, Singapore. His research interests include AI for science, scientific deep learning, and optical metrology and imaging.
Ruyi is a PhD student at the University of Texas at Austin, working on generative models and reinforcement learning and their applications.
- July 10 - Best of CVPR
Join us for a series of virtual events focused on the most interesting and groundbreaking research presented at this year's CVPR conference!
When
July 10, 2025 at 9 AM Pacific
Where
Online. Register for the Zoom!
OFER: Occluded Face Expression Reconstruction
Reconstructing 3D face models from a single image is an inherently ill-posed problem, which becomes even more challenging in the presence of occlusions where multiple reconstructions can be equally valid. Despite the ubiquity of the problem, very few methods address its multi-hypothesis nature.
In this paper we introduce OFER, a novel approach for single-image 3D face reconstruction that can generate plausible, diverse, and expressive 3D faces. It trains two diffusion models to generate the shape and expression coefficients of a parametric face model, conditioned on the input image. To maintain consistency across diverse expressions, the challenge is to select the best matching shape. To achieve this, we propose a novel ranking mechanism that sorts the outputs of the shape diffusion network based on predicted shape accuracy scores.
Paper: OFER: Occluded Face Expression Reconstruction
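As a loose illustration of the multi-hypothesis-plus-ranking idea (the paper's actual diffusion and ranking networks differ, and these components are placeholders), a sketch might look like:

```python
import torch
import torch.nn as nn

def sample_shape_hypotheses(image_feat, k=8, dim=100):
    """Stand-in for drawing k shape-coefficient samples from a conditional diffusion model."""
    return torch.randn(k, dim) + image_feat[:dim]

class ShapeRanker(nn.Module):
    """Predicts a scalar accuracy score for each candidate shape."""
    def __init__(self, dim=100):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, shapes):
        return self.mlp(shapes).squeeze(-1)

image_feat = torch.randn(256)                       # features of the input image
candidates = sample_shape_hypotheses(image_feat)    # (k, dim) shape hypotheses
ranker = ShapeRanker()
scores = ranker(candidates)                         # predicted accuracy per hypothesis
best_shape = candidates[scores.argmax()]            # keep the top-ranked shape
```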
About the Speaker
Pratheba Selvaraju has a PhD from the University of Massachusetts, Amherst, and is currently a researcher in the Perceiving Systems department at the Max Planck Institute. Her research interests are 3D reconstruction and modeling, geometry processing, and generative modeling. She is also currently exploring virtual try-on, combining vision and 3D techniques.
SmartHome-Bench: Benchmark for Video Anomaly Detection in Smart Homes Using Multi-Modal LLMs
Video anomaly detection is crucial for ensuring safety and security, yet existing benchmarks overlook the unique context of smart home environments. We introduce SmartHome-Bench, a dataset of 1,203 smart home videos annotated according to a novel taxonomy of seven anomaly categories, such as Wildlife, Senior Care, and Baby Monitoring. We evaluate state-of-the-art closed- and open-source multimodal LLMs with various prompting techniques, revealing significant performance gaps. To address these limitations, we propose the Taxonomy-Driven Reflective LLM Chain (TRLC), which boosts detection accuracy by 11.62%.
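For intuition only, and not the paper's actual prompts or pipeline, a taxonomy-conditioned call followed by a reflection pass could be sketched as below; call_llm is a hypothetical stand-in for any multimodal LLM API, and only three of the seven taxonomy categories are named in the abstract.

```python
# Only three of the seven taxonomy categories are named in the abstract;
# the last entry is a placeholder, not from the paper.
TAXONOMY = ["Wildlife", "Senior Care", "Baby Monitoring", "<other categories from the paper>"]

def call_llm(prompt, video_frames):
    """Hypothetical stand-in for a multimodal LLM call; swap in a real API client."""
    return f"[model response to a {len(prompt)}-char prompt over {len(video_frames)} frames]"

def detect_anomaly(video_frames):
    # Step 1: taxonomy-driven initial judgment.
    initial = call_llm(
        "Classify this smart-home clip into one of: " + ", ".join(TAXONOMY) +
        ". Then state whether it shows an anomaly and why.", video_frames)
    # Step 2: reflection pass that asks the model to re-check its own reasoning.
    final = call_llm(
        "Here is a draft analysis:\n" + initial +
        "\nReflect on it against the category definitions and give a final "
        "anomaly / no-anomaly verdict.", video_frames)
    return final

print(detect_anomaly(["frame_0.jpg", "frame_1.jpg"]))
```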
About the Speaker
Xinyi Zhao is a fourth-year PhD student at the University of Washington, specializing in multimodal large language models and reinforcement learning for smart home applications. This work was conducted during her summer 2024 internship at Wyze Labs, Inc.
Interactive Medical Image Analysis with Concept-based Similarity Reasoning
What if you could tell an AI model exactly "where to focus" and "where to ignore" on a medical image? Our work enables radiologists to interactively guide AI models at test time for more transparent and trustworthy decision-making. This paper introduces the novel Concept-based Similarity Reasoning network (CSR), which offers (i) patch-level prototypes with intrinsic concept interpretation, and (ii) spatial interactivity. First, the proposed CSR provides localized explanation by grounding prototypes of each concept on image regions. Second, our model introduces novel spatial-level interaction, allowing doctors to engage directly with specific image areas, making it an intuitive and transparent tool for medical imaging.
Paper: Interactive Medical Image Analysis with Concept-based Similarity Reasoning
About the Speaker
Huy Ta is a PhD student at the Australian Institute for Machine Learning, The University of Adelaide, specializing in Explainable and Interactive AI for medical imaging. He brings with him four years of industry experience in medical imaging AI prior to embarking on his doctoral studies.
Multi-view Anomaly Detection: From Static to Probabilistic Modelling
The advent of 3D Gaussian Splatting has revolutionized and revitalized interest in multi-view image data. Applications of these techniques to fields such as anomaly detection have been a logical next step. However, some of the limitations of these models may warrant a return to already applied probabilistic techniques. New approaches, difficulties and possibilities in this field will be explored in this talk.
About the Speaker
Mathis Kruse is a PhD student in the group of Bodo Rosenhahn in Hanover, Germany, where he studies anomaly detection (especially in images). He has a particular interest in multi-view data and its learning-based representations.
- July 11 - Best of CVPR Virtual Event
Join us on July 11 at 9 AM Pacific for the third of several virtual events showcasing some of the most thought-provoking papers from this year's CVPR conference.
OpenMIBOOD: Open Medical Imaging Benchmarks for Out-Of-Distribution Detection
As AI becomes more prevalent in fields like healthcare, ensuring its reliability under unexpected inputs is essential. We present OpenMIBOOD, a benchmarking framework for evaluating out-of-distribution (OOD) detection methods in medical imaging. It includes 14 datasets across three medical domains and categorizes them into in-distribution, near-OOD, and far-OOD groups to assess 24 post-hoc methods. Results show that OOD detection approaches effective in natural images often fail in medical contexts, highlighting the need for domain-specific benchmarks to ensure trustworthy AI in healthcare.
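For context, two widely used post-hoc OOD scores of the kind such benchmarks compare, maximum softmax probability and the energy score, can be computed directly from a trained classifier's logits:

```python
import torch
import torch.nn.functional as F

def msp_score(logits):
    """Maximum softmax probability: higher means more in-distribution."""
    return F.softmax(logits, dim=-1).max(dim=-1).values

def energy_score(logits, T=1.0):
    """Negative free energy: higher means more in-distribution."""
    return T * torch.logsumexp(logits / T, dim=-1)

logits = torch.randn(5, 10)     # logits from any trained classifier
print(msp_score(logits))
print(energy_score(logits))
```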
About the Speaker
Max Gutbrod is a PhD student in Computer Science at OTH Regensburg, Germany, with a research focus on medical imaging. He's working on improving the resilience of AI systems in healthcare, so they can continue performing reliably even when faced with unfamiliar or unexpected data.
RANGE: Retrieval Augmented Neural Fields for Multi-Resolution Geo-Embeddings
The choice of representation for geographic location significantly impacts the accuracy of models for a broad range of geospatial tasks, including fine-grained species classification, population density estimation, and biome classification. Recent works learn such representations by contrastively aligning geolocation (latitude, longitude) with co-located images.
While these methods work exceptionally well, in this paper we posit that current training strategies fail to fully capture the important visual features. We provide an information-theoretic perspective on why the resulting embeddings from these methods discard crucial visual information that is important for many downstream tasks. To solve this problem, we propose a novel retrieval-augmented strategy called RANGE. We build our method on the intuition that the visual features of a location can be estimated by combining the visual features from multiple similar-looking locations. We show this retrieval strategy outperforms existing state-of-the-art models by significant margins in most tasks.
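A toy sketch of that retrieval intuition (not the RANGE architecture itself): estimate a query location's visual features as a similarity-weighted combination of the visual features stored for the most similar database locations.

```python
import numpy as np

def retrieval_augmented_embedding(query_loc_emb, db_loc_embs, db_visual_feats, k=5, temp=0.1):
    """Estimate visual features for a query location by combining the visual
    features of its k most similar database locations (cosine similarity)."""
    q = query_loc_emb / np.linalg.norm(query_loc_emb)
    db = db_loc_embs / np.linalg.norm(db_loc_embs, axis=1, keepdims=True)
    sims = db @ q                                  # similarity of query to each stored location
    topk = np.argsort(sims)[-k:]                   # indices of the k nearest locations
    w = np.exp(sims[topk] / temp)
    w /= w.sum()                                   # softmax weights over the neighbors
    return w @ db_visual_feats[topk]               # weighted average of their visual features

rng = np.random.default_rng(0)
db_locs = rng.normal(size=(1000, 64))              # stored location embeddings
db_feats = rng.normal(size=(1000, 128))            # co-located visual features
query = rng.normal(size=64)
print(retrieval_augmented_embedding(query, db_locs, db_feats).shape)   # (128,)
```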
About the Speaker
Aayush Dhakal is a Ph.D. candidate in Computer Science at Washington University in St. Louis (WashU), advised by Dr. Nathan Jacobs in the Multimodal Vision Research Lab (MVRL). His work focuses on solving geospatial problems using deep learning, often combining computer vision, remote sensing, and self-supervised learning. He loves developing methods that allow seamless interaction of multiple modalities, such as images, text, audio, and geocoordinates.
FLAIR: Fine-Grained Image Understanding through Language-Guided Representations
CLIP excels at global image-text alignment but struggles with fine-grained visual understanding. In this talk, I present FLAIR (Fine-grained Language-informed Image Representations), which leverages long, detailed captions to learn localized image features. By conditioning attention pooling on diverse sub-captions, FLAIR generates text-specific image embeddings that enhance retrieval of fine-grained content. Our model outperforms existing methods on standard and newly proposed fine-grained retrieval benchmarks, and even enables strong zero-shot semantic segmentation, despite being trained on only 30M image-text pairs.
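A minimal sketch of the core mechanism as described, with a sub-caption embedding acting as the attention-pooling query over local patch tokens; the real FLAIR model differs in its details.

```python
import torch
import torch.nn as nn

class TextConditionedPool(nn.Module):
    """Pool patch tokens into one image embedding, with attention weights
    driven by a text (sub-caption) embedding used as the query."""
    def __init__(self, dim=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, patch_tokens, text_embed):
        # patch_tokens: (B, N, dim) local image features; text_embed: (B, dim)
        query = text_embed.unsqueeze(1)                 # (B, 1, dim)
        pooled, _ = self.attn(query, patch_tokens, patch_tokens)
        return pooled.squeeze(1)                        # (B, dim) text-specific image embedding

pool = TextConditionedPool()
patches = torch.randn(2, 196, 512)      # e.g. a 14x14 patch grid from a ViT
captions = torch.randn(2, 512)          # embeddings of two different sub-captions
print(pool(patches, captions).shape)    # torch.Size([2, 512])
```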
About the Speaker
Rui Xiao is a PhD student in the Explainable Machine Learning group, supervised by Zeynep Akata from Technical University of Munich and Stephan Alaniz from Telecom Paris. His research focuses on learning across modalities and domains, with a particular emphasis on enhancing fine-grained visual capabilities in vision-language models.
DyCON: Dynamic Uncertainty-aware Consistency and Contrastive Learning for Semi-supervised Medical Image Segmentation
Semi-supervised medical image segmentation often suffers from class imbalance and high uncertainty due to pathology variability. We propose DyCON, a Dynamic Uncertainty-aware Consistency and Contrastive Learning framework that addresses these challenges via two novel losses: UnCL and FeCL. UnCL adaptively weights voxel-wise consistency based on uncertainty, initially focusing on uncertain regions and gradually shifting to confident ones. FeCL improves local feature discrimination under imbalance by applying dual focal mechanisms and adaptive entropy-based weighting to contrastive learning.
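A rough sketch of an uncertainty-weighted voxel-wise consistency loss in the spirit of UnCL, using the teacher's predictive entropy as the uncertainty signal; the actual DyCON losses are defined in the paper.

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_consistency(student_logits, teacher_logits, focus_uncertain=1.0):
    """Voxel-wise MSE consistency between student and teacher predictions,
    weighted by the teacher's predictive entropy.

    focus_uncertain=1.0 emphasizes uncertain voxels (early training);
    focus_uncertain=0.0 emphasizes confident voxels (late training).
    """
    p_s = F.softmax(student_logits, dim=1)
    p_t = F.softmax(teacher_logits, dim=1)
    entropy = -(p_t * torch.log(p_t.clamp_min(1e-8))).sum(dim=1)   # (B, D, H, W)
    ent_norm = entropy / entropy.max().clamp_min(1e-8)             # normalize to [0, 1]
    weight = focus_uncertain * ent_norm + (1 - focus_uncertain) * (1 - ent_norm)
    per_voxel = ((p_s - p_t) ** 2).mean(dim=1)                     # (B, D, H, W)
    return (weight * per_voxel).mean()

student = torch.randn(1, 2, 8, 32, 32)    # (batch, classes, depth, height, width)
teacher = torch.randn(1, 2, 8, 32, 32)
print(uncertainty_weighted_consistency(student, teacher, focus_uncertain=0.8))
```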
About the Speaker
Maregu Assefa is a postdoctoral researcher at Khalifa University in Abu Dhabi, UAE. His current research focuses on advancing semi-supervised and self-supervised multi-modal representation learning for medical image analysis. Previously, his doctoral studies centered on visual representation learning for video understanding tasks, including action recognition and video retrieval.
- July 17 - AI, ML and Computer Vision Meetup
When and Where
July 17, 2025 | 10:00 to 11:30 AM Pacific
Using VLMs to Navigate the Sea of Data
At SEA.AI, we aim to make ocean navigation safer by enhancing situational awareness with AI. To develop our technology, we process huge amounts of maritime video from onboard cameras. In this talk, we'll show how we use Vision-Language Models (VLMs) to streamline our data workflows, from semantic search using embeddings to automatically surfacing rare or high-interest events like whale spouts or drifting containers. The goal: smarter data curation with minimal manual effort.
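As a simple illustration of this kind of workflow (file names are placeholders and this is not SEA.AI's pipeline), an off-the-shelf CLIP model can embed both frames and a text query, and cosine similarity then surfaces the best-matching frames:

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")      # CLIP model that embeds both images and text

# Placeholder frame paths; in practice these come from onboard video.
frame_paths = ["frame_0001.jpg", "frame_0002.jpg", "frame_0003.jpg"]
frame_embs = model.encode([Image.open(p) for p in frame_paths], convert_to_tensor=True)

query_emb = model.encode("a whale spout near the horizon", convert_to_tensor=True)
scores = util.cos_sim(query_emb, frame_embs)[0]    # similarity of the query to every frame
for path, score in sorted(zip(frame_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {path}")
```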
About the Speaker
Daniel Fortunato, an AI Researcher at SEA.AI, is dedicated to enhancing efficiency through data workflow optimizations. Daniel's background includes a Master's degree in Electrical Engineering, providing a robust framework for developing innovative AI solutions. Beyond the lab, he is an enthusiastic amateur padel player and surfer.
SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation
Referring Video Object Segmentation (RVOS) involves segmenting objects in video based on natural language descriptions. SAMWISE builds on Segment Anything 2 (SAM2) to support RVOS in streaming settings, without fine-tuning and without relying on external large Vision-Language Models. We introduce a novel adapter that injects temporal cues and multi-modal reasoning directly into the feature extraction process, enabling both language understanding and motion modeling. We also unveil a phenomenon we denote tracking bias, where SAM2 may persistently follow an object that only loosely matches the query, and propose a learnable module to mitigate it. SAMWISE achieves state-of-the-art performance across multiple benchmarks with less than 5M additional parameters.
About the Speaker
Claudia Cuttano is a PhD student at Politecnico di Torino (VANDAL Lab), currently on a research visit at TU Darmstadt, where she works with Prof. Stefan Roth in the Visual Inference Lab. Her research focuses on semantic segmentation, with particular emphasis on multi-modal understanding and the use of foundation models for pixel-level tasks.
Building Efficient and Reliable Workflows for Object Detection
Training complex AI models at scale requires orchestrating multiple steps into a reproducible workflow and understanding how to optimize resource utilization for efficient pipelines. Modern MLOps practices help streamline these processes, improving the efficiency and reliability of your AI pipelines.
About the Speaker
Sage Elliott is an AI Engineer with a background in computer vision, LLM evaluation, MLOps, IoT, and Robotics. He's taught thousands of people at live workshops. You can usually find him in Seattle biking around to parks or reading in cafes, catching up on the latest read for AI Book Club.
Your Data Is Lying to You: How Semantic Search Helps You Find the Truth in Visual Datasets
High-performing models start with high-quality data, but finding noisy, mislabeled, or edge-case samples across massive datasets remains a significant bottleneck. In this session, we'll explore a scalable approach to curating and refining large-scale visual datasets using semantic search powered by transformer-based embeddings. By leveraging similarity search and multimodal representation learning, you'll learn to surface hidden patterns, detect inconsistencies, and uncover edge cases. We'll also discuss how these techniques can be integrated into data lakes and large-scale pipelines to streamline model debugging, dataset optimization, and the development of more robust foundation models in computer vision. Join us to discover how semantic search reshapes how we build and refine AI systems.
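One simple heuristic in this spirit, assuming you already have embeddings for your images: flag samples whose label disagrees with most of their nearest neighbors in embedding space. A toy sketch with scikit-learn:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def flag_suspect_labels(embeddings, labels, k=10, disagreement=0.7):
    """Flag samples whose label differs from at least `disagreement` of their
    k nearest neighbors in embedding space (a rough mislabel heuristic)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    _, idx = nn.kneighbors(embeddings)              # idx[:, 0] is the sample itself
    neighbor_labels = labels[idx[:, 1:]]            # (N, k) labels of the neighbors
    mismatch = (neighbor_labels != labels[:, None]).mean(axis=1)
    return np.where(mismatch >= disagreement)[0]

rng = np.random.default_rng(0)
embs = rng.normal(size=(500, 128))                  # image embeddings (e.g. from CLIP)
labels = rng.integers(0, 5, size=500)               # noisy class labels
print(flag_suspect_labels(embs, labels)[:10])       # indices of suspicious samples
```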
About the Speaker
Paula Ramos has a PhD in Computer Vision and Machine Learning and more than 20 years of experience in the technology field. Since the early 2000s in Colombia, she has been developing novel integrated engineering technologies, mainly in computer vision, robotics, and machine learning applied to agriculture. During her PhD and postdoctoral research, she deployed multiple low-cost, smart edge and IoT computing technologies that farmers can operate without expertise in computer vision systems. The central objective of Paula's research has been to develop intelligent systems and machines that can understand and recreate the visual world around us to solve real-world needs, such as those in the agricultural industry.
Past events (172)
- June 27 - Visual AI in Healthcare (This event has passed)