Part of AI, Machine Learning and Computer Vision Meetup Network - 47 groups

SF AI, Machine Learning and Computer Vision Meetup

4.3•15 ratings

San Francisco, CA, US

About us

🖖 This virtual group is for data scientists, machine learning engineers, and open source enthusiasts.

Every month we’ll bring you diverse speakers working at the cutting edge of AI, machine learning, and computer vision.

Are you interested in speaking at a future Meetup?
Is your company interested in sponsoring a Meetup?

Send me a DM on LinkedIn

This Meetup is sponsored by Voxel51, the lead maintainers of the open source FiftyOne computer vision toolset. To learn more, visit the FiftyOne project page on GitHub.

Upcoming events

Network event
Jan 28 - AI, ML and Computer Vision Meetup
Wed, Jan 28 · 9:00 AM PST
Online
274 attendees from 47 groups
Join us for a special edition of the monthly AI, ML and Computer Vision Meetup focused on Physical AI!

Date and Location

Jan 28, 2026
9 - 11 AM Pacific
Online. Register for the Zoom!

Hybrid Cognition for Robotics: LLM-Guided Reinforcement Learning for Physical Decision-Making

Physical systems operate in dynamic, uncertain, and constraint-heavy environments where classical reinforcement learning often struggles. In this talk, I present a hybrid framework where large language models act as a reasoning layer that guides an RL agent through high-level interpretation, constraint awareness, and adaptive strategy shaping. Instead of generating actions, the LLM provides structured contextual guidance that improves robustness, sample efficiency, and policy generalization in physical decision-making tasks. Early experiments demonstrate significant benefits under distribution shifts and safety-critical constraints that break standard RL. This work highlights a path toward more reliable, interpretable, and adaptable AI controllers for next-generation robotics and embodied systems.

About the Speaker

Fatemeh Lotfi is a Ph.D. researcher specializing in reinforcement learning, optimization, and hybrid intelligence for autonomous and physical systems. Her work explores integrating LLM-driven reasoning with RL to create adaptive and safety-aware controllers for dynamic environments. She has contributed to projects involving multi-agent RL, meta-learning, and real-time decision systems across wireless networks, UAVs, and embodied AI.

The World of World Models: How the New Generation of AI Is Reshaping Robotics and Autonomous Vehicles

World Models are emerging as the defining paradigm for the next decade of robotics and autonomous systems. Instead of depending on handcrafted perception stacks or rigid planning pipelines, modern world models learn a unified representation of an environment—geometry, dynamics, semantics, and agent behavior—and use that understanding to predict, plan, and act. This talk will break down why the field is shifting toward these holistic models, what new capabilities they unlock, and how they are already transforming AV and robotics research.

We then connect these advances to the Physical AI Workbench, a practical foundation for teams who want to build, validate, and iterate on world-model-driven pipelines. The Workbench standardizes data quality, reconstruction, and enrichment workflows so that teams can trust their sensor data, generate high-fidelity world representations, and feed consistent inputs into next-generation predictive and generative models. Together, world models and the Physical AI Workbench represent a new, more scalable path forward—one where robots and AVs can learn, simulate, and reason about the world through shared, high-quality physical context.

About the Speaker

Daniel Gural leads technical partnerships at Voxel51, where he’s building the Physical AI Workbench, a platform that connects real-world sensor data with realistic simulation to help engineers better understand, validate, and improve their perception systems.

From Data to Understanding in Physical AI

Data-centric workflows have driven major advances in computer vision, but they break down in physical, real-world robotic systems where data is costly, incomplete, and dominated by long-tail edge cases. In enterprise robotics, scaling labeled datasets alone is insufficient to achieve reliable perception, reasoning, and action under changing physical conditions. This talk examines how physics-informed foundation models incorporate world understanding and physical priors directly into vision and multimodal learning pipelines. By combining data with structure, constraints, and simulation on modern Physical AI stacks, robots can generalize more effectively, reduce data requirements, and operate with greater safety and reliability in deployment.

About the Speaker

Dr. Ashutosh Saxena is the Founder and Chief AI Officer of TorqueAGI. He earned his Ph.D. in Computer Science from Stanford University under Andrew Ng and previously served as a professor at Cornell University, leading the “Wikipedia for Robots” project recognized as an MIT Technology Review Top 10 Breakthrough Technology. His work in 3D vision and embodied AI has been cited over 20,000 times and recognized with honors including MIT TR35 and a Sloan Fellowship.

Data Foundations for Vision-Language-Action Models

Model architectures get the papers, but data decides whether robots actually work. This talk introduces VLAs from a data-centric perspective: what makes robot datasets fundamentally different from image classification or video understanding, how the field is organizing its data (Open X-Embodiment, LeRobot, RLDS), and what evaluation benchmarks actually measure. We'll examine the unique challenges such as temporal structure, proprioceptive signals, and heterogeneity in embodiment, and discuss why addressing them matters more than the next architectural innovation.

About the Speaker

Harpreet Sahota is a hacker-in-residence and machine learning engineer with a passion for deep learning and generative AI. He’s got a deep interest in VLMs, Visual Agents, Document AI, and Physical AI.
1 attendee from this group
Network event
Jan 29 - Silicon Valley AI, ML and Computer Vision Meetup
Thu, Jan 29 · 5:30 PM PST
YugaByte, Inc., 771 Vaqueros Ave, Sunnyvale, CA, US
23 attendees from 14 groups
Join our in-person Meetup to hear talks from experts on cutting-edge topics across AI, ML, and computer vision.

Pre-register to reserve your seat

Date, Time and Location

Jan 29, 2026
5:30 - 8:30 PM
Yugabyte Offices
771 Vaqueros Ave, Sunnyvale, CA 94085

The World of World Models: How the New Generation of AI Is Reshaping Robotics and Autonomous Vehicles

World Models are emerging as the defining paradigm for the next decade of robotics and autonomous systems. Instead of depending on handcrafted perception stacks or rigid planning pipelines, modern world models learn a unified representation of an environment—geometry, dynamics, semantics, and agent behavior—and use that understanding to predict, plan, and act. This talk will break down why the field is shifting toward these holistic models, what new capabilities they unlock, and how they are already transforming AV and robotics research.

We then connect these advances to the Physical AI Workbench, a practical foundation for teams who want to build, validate, and iterate on world-model-driven pipelines. The Workbench standardizes data quality, reconstruction, and enrichment workflows so that teams can trust their sensor data, generate high-fidelity world representations, and feed consistent inputs into next-generation predictive and generative models. Together, world models and the Physical AI Workbench represent a new, more scalable path forward—one where robots and AVs can learn, simulate, and reason about the world through shared, high-quality physical context.

About the Speaker

Daniel Gural leads technical partnerships at Voxel51, where he’s building the Physical AI Workbench, a platform that connects real-world sensor data with realistic simulation to help engineers better understand, validate, and improve their perception systems.

Beyond Vector Search: How Distributed PostgreSQL Powers Resilient, Enterprise-Grade AI Applications

As enterprises move from GenAI prototypes to in-production applications, standalone vector databases often fall short on synchronization, ACID compliance, and resilience. This session demonstrates how PostgreSQL-compatible distributed SQL databases address these challenges while maintaining a familiar developer experience. We’ll cover scaling RAG architectures with pgvector across regions and multi-agent patterns.

Attendees will learn how to achieve ultra-resilience for peak traffic, grey failures, and disasters, along with key design principles such as unified data sources, open standards, and multi-tenant security. Engineers and architects will leave with practical strategies for building globally scalable, enterprise-grade GenAI applications.
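For readers new to vector search, the retrieval step in a RAG pipeline comes down to ranking stored embeddings by their distance to a query embedding. The sketch below is a plain-Python, in-memory stand-in for that ranking, using the cosine distance that pgvector exposes through its `<=>` operator. The `documents` rows, the three-dimensional embeddings, and the `top_k` helper are hypothetical and purely illustrative; this is not the speaker's architecture and it does not touch a database.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, rows, k=2):
    """Return the k rows whose embeddings are closest to the query."""
    return sorted(rows, key=lambda row: cosine_distance(query, row["embedding"]))[:k]


# Hypothetical rows; a real table would hold model-generated embeddings.
documents = [
    {"id": 1, "embedding": [1.0, 0.0, 0.0]},
    {"id": 2, "embedding": [0.9, 0.1, 0.0]},
    {"id": 3, "embedding": [0.0, 1.0, 0.0]},
]

results = top_k([1.0, 0.0, 0.0], documents, k=2)
print([row["id"] for row in results])  # [1, 2]
```

In a database, the same ranking would be expressed as an ORDER BY on the distance operator with a LIMIT, typically backed by an approximate index rather than the full sort used here.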

About the Speaker

Karthik Ranganathan is Co-CEO and Co-Founder at Yugabyte, the company behind YugabyteDB, the open-source, high-performance distributed SQL database for building global, cloud-native applications. Karthik was one of the original database engineers at Meta (Facebook), responsible for building distributed databases such as Cassandra and HBase. He is an Apache HBase committer and was an early contributor to Cassandra before it was open-sourced by Meta.

Distributed Training at Scale

As deep learning models grow in complexity, particularly with the rise of Large Language Models (LLMs) and generative AI, scalable and cost-effective training has become a critical challenge. This talk introduces Ray Train, an open-source, production-ready library built for seamless distributed deep learning. We will explore its architecture, advanced resource scheduling, and intuitive APIs that simplify integration with popular frameworks such as PyTorch, Lightning, and HuggingFace. Attendees will leave with a clear understanding of how Ray Train accelerates large-scale model training while ensuring reliability and efficiency in production environments.

About the Speaker

Suman Debnath is a Technical Lead (ML) at Anyscale, where he focuses on distributed training, fine-tuning, and inference optimization at scale on the cloud. His work centers around building and optimizing end-to-end machine learning workflows powered by distributed computing frameworks like Ray, enabling scalable and efficient ML systems.

Self-improving AI-Models via Reasoning in the loop

During this presentation, we demonstrate efficient uses of reasoning to automate data flywheels toward continuous model improvement.

About the Speaker

Jose Alvarez is Director of Research at NVIDIA, where he leads an applied AV research team within the Spatial Intelligence Lab. His team focuses on scaling deep learning and driving advancements in Autonomous Driving and, more broadly, in Physical AI, with work spanning end-to-end models, foundation models, and data flywheels for real-world applications.
Network event
Feb 5 - AI, ML and Computer Vision Meetup
Thu, Feb 5 · 9:00 AM PST
Online
260 attendees from 47 groups
Join our virtual Meetup to hear talks from experts on cutting-edge topics across AI, ML, and computer vision.

Feb 5, 2026
9 - 11 AM Pacific
Online. Register for the Zoom!

Unlocking Visual Anomaly Detection: Navigating Challenges and Pioneering with Vision-Language Models

Visual anomaly detection (VAD) is pivotal for ensuring quality in manufacturing, medical imaging, and safety inspections, yet it continues to face challenges such as data scarcity, domain shifts, and the need for precise localization and reasoning. This seminar explores VAD fundamentals, core challenges, and recent advancements leveraging vision-language models and multimodal large language models (MLLMs). We contrast CLIP-based methods for efficient zero/few-shot detection with MLLM-driven reasoning for explainable, threshold-free outcomes. Drawing from recent studies, we highlight emerging trends, benchmarks, and future directions toward building adaptable, real-world VAD systems. This talk is designed for researchers and practitioners interested in AI-driven inspection and next-generation multimodal approaches.

About the Speaker

Hossein Kashiani is a fourth-year Ph.D. student at Clemson University. His research focuses on developing generalizable and trustworthy AI systems, with publications in top venues such as CVPR, WACV, ICIP, IJCB, and TBIOM. His work spans diverse applications, including anomaly detection, media forensics, biometrics, healthcare, and visual perception.

Data-Centric Lessons To Improve Speech-Language Pretraining

Spoken Question-Answering (SQA) is a core capability for useful and interactive artificial intelligence systems. Recently, several speech-language models (SpeechLMs) have been released with a specific focus on improving their SQA performance. However, a lack of controlled ablations of pretraining data processing and curation makes it challenging to understand what factors account for performance, despite substantial gains from similar studies in other data modalities. In this work, we address this gap by conducting a data-centric exploration for pretraining SpeechLMs.

We focus on three research questions fundamental to speech-language pretraining data:
- How to process raw web-crawled audio content for speech-text pretraining;
- How to construct synthetic pretraining datasets to augment web-crawled data;
- How to interleave (text, audio) segments into training sequences.
We apply the insights from our controlled data-centric ablations to pretrain a 3.8B-parameter SpeechLM, called SpeLangy, that outperforms models that are up to 3x larger by 10.2% absolute performance. We hope our findings highlight the impact of effective data curation for speech-language pretraining and guide future data-centric exploration in SpeechLMs.

About the Speaker

Vishaal Udandarao is a third-year ELLIS PhD student, jointly working with Matthias Bethge at The University of Tuebingen and Samuel Albanie at The University of Cambridge/Google DeepMind. He is also a part of the International Max Planck Research School for Intelligent Systems. He is mainly interested in understanding the generalisation properties of foundation models, both vision-language models (VLMs) and large multi-modal models (LMMs), through the lens of their pre-training and test data distributions. His research is funded by a Google PhD Fellowship in Machine Intelligence.

A Practical Pipeline for Synthetic Data with Nano Banana Pro + FiftyOne

Most computer-vision failures come from the rare cases: the dark corners, odd combinations, and edge conditions we never capture enough of in real datasets. In this session, we walk through a practical end-to-end pipeline for generating targeted synthetic data using Google’s Nano Banana Pro and managing it with FiftyOne. We’ll explore how to translate dataset gaps into generation prompts, create thousands of high-quality synthetic images, automatically enrich them with metadata, and bring everything into FiftyOne for inspection, filtering, and validation. By the end, you’ll understand how to build a repeatable synthetic-first workflow that closes real vision gaps and improves model performance on the scenarios that matter most.

About the Speaker

Adonai Vera - Machine Learning Engineer & DevRel at Voxel51, with over 7 years of experience building computer vision and machine learning models using TensorFlow, Docker, and OpenCV. I started as a software developer, moved into AI, led teams, and served as CTO. Today, I connect code and community to build open, production-ready AI, making technology simple, accessible, and reliable.

Making Computer Vision Models Faster: An Introduction to TensorRT Optimization

Modern computer vision applications demand real-time performance, yet many deep learning models struggle with high latency during deployment. This talk introduces how TensorRT can significantly accelerate inference by applying optimizations such as layer fusion, precision calibration, and efficient memory management. Attendees will learn the core concepts behind TensorRT, how it integrates into existing CV pipelines, and how to measure and benchmark improvements. Through practical examples and performance comparisons, the session will demonstrate how substantial speedups can be achieved with minimal model-accuracy loss. By the end, participants will understand when and how to apply TensorRT to make their CV models production-ready.
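As a rough illustration of the "measure and benchmark" step mentioned above, the sketch below times any zero-argument callable with warmup runs and reports mean and 95th-percentile latency. It uses only the Python standard library; `benchmark` and `fake_inference` are hypothetical names, and the stand-in workload is not a real model or a TensorRT engine.

```python
import statistics
import time


def benchmark(fn, warmup=10, iters=100):
    """Return mean and p95 latency in milliseconds for a zero-argument callable."""
    # Warmup runs are discarded so one-time setup costs do not skew the numbers.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {"mean_ms": statistics.fmean(samples), "p95_ms": samples[p95_index]}


def fake_inference():
    # Stand-in for a model forward pass (assumption: replace with a real call).
    sum(i * i for i in range(10_000))


stats = benchmark(fake_inference, warmup=5, iters=50)
print(stats)
```

Running the same harness on the baseline model and on the optimized one gives a like-for-like comparison. When the callable launches GPU work, the device should be synchronized before each clock read, because GPU kernels are queued asynchronously and the timer would otherwise stop early.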

About the Speaker

Tushar Gadhiya is a Technical Lead at Infocusp Innovations, specialising in deep learning, computer vision, graph learning, and agentic AI. My experience spans academic research as a PhD holder and industry work, where I have contributed to multiple patents.
1 attendee from this group
Network event
Feb 11 - Visual AI for Video Use Cases
Wed, Feb 11 · 9:00 AM PST
Online
175 attendees from 47 groups
Join our virtual Meetup to hear talks from experts on cutting-edge topics at the intersection of Visual AI and video use cases.

Time and Location

Feb 11, 2026
9 - 11 AM Pacific
Online. Register for the Zoom!

VIDEOP2R: Video Understanding from Perception to Reasoning

Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL), has shown promising results in improving the reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning.

In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO and demonstrate that the model's perception output is information-sufficient for downstream reasoning.
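The abstract does not spell out how PA-GRPO computes its updates. As background, standard GRPO scores each sampled response relative to the other responses in its group by subtracting the group mean reward and dividing by the group standard deviation. The sketch below applies that normalization independently to two reward streams. Treating perception and reasoning this way is only an assumption about what "separate rewards" could look like, not the authors' algorithm, and the reward values are made up.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for a group of 4 sampled responses to one video question.
perception_rewards = [1.0, 0.0, 1.0, 0.0]
reasoning_rewards = [1.0, 1.0, 0.0, 0.0]

perception_adv = group_relative_advantages(perception_rewards)
reasoning_adv = group_relative_advantages(reasoning_rewards)

print(perception_adv)
print(reasoning_adv)
```

Keeping the two streams apart means a response with accurate perception but faulty reasoning receives a positive signal for the first and a negative signal for the second, rather than one blended score.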

About the Speaker

Yifan Jiang is a third-year Ph.D. student in the Information Sciences Institute at the University of Southern California (USC-ISI), advised by Dr. Jay Pujara, focusing on natural language processing, commonsense reasoning, and multimodal large language models.

Layer-Aware Video Composition via Split-then-Merge

Split-then-Merge (StM) is a novel generative framework that overcomes data scarcity in video composition by splitting unlabeled videos into separate foreground and background layers for self-supervised learning. By utilizing a transformation-aware training pipeline with multi-layer fusion, the model learns to realistically compose dynamic subjects into diverse scenes without relying on expensive annotated datasets. This presentation will cover the problem of video composition and the details of StM, an approach that looks at this problem from a generative AI perspective. We will conclude by demonstrating how StM works and how it outperforms state-of-the-art methods in both quantitative benchmarks and qualitative evaluations.

About the Speaker

Ozgur Kara is a fourth-year Computer Science PhD student at the University of Illinois Urbana-Champaign (UIUC), advised by Founder Professor James M. Rehg. His research builds the next generation of video AI by tackling three core challenges: efficiency, controllability, and safety.

Video-native VLMs and control

We show how image-native vision–language models can be extended to support native video understanding, structured reasoning, tool use, and robotics. Our approach focuses on designing data, modeling, and training recipes to optimize for multimodal input and interaction patterns, treating vision and perception as first-class citizens. We discuss lessons learned from scaling these methods in an open-source model family and their implications for building flexible multimodal systems.

About the Speaker

Akshat Shrivastava is the CTO and co-founder of Perceptron, previously leading AR On-Device at Meta and conducting research at UW.

Video Intelligence Is Going Agentic

Video content has become ubiquitous in our digital world, yet the tools for working with video have remained largely unchanged for decades. This talk explores how the convergence of foundation models and agent architectures is fundamentally transforming video interaction and creation. We'll examine how video-native foundation models, multimodal interfaces, and agent transparency are reshaping enterprise media workflows through a deep dive into Jockey, a pioneering video agent system.

About the Speaker

James Le currently leads the developer experience function at TwelveLabs - a startup building foundation models for video understanding. He previously operated in the MLOps space and ran a blog/podcast on the Data & AI infrastructure ecosystem.
1 attendee from this group