Part of AI, Machine Learning and Computer Vision Meetup Network - 51 groups

Santa Cruz AI and Machine Learning Meetup Group

4.3 (3 ratings)

About us

Our group exists for like-minded individuals to explore machine learning and AI technologies and to share knowledge. Anyone with a software development interest or background is welcome to attend. We will host guest speakers and presentations, and provide opportunities to network with others.

Upcoming events


Network event
July 21 - Best of ICRA
Tue, Jul 21 · 9:00 AM PDT
Online
83 attendees from 51 groups
The Best of ICRA is a three-day virtual meetup series featuring researchers presenting their accepted papers from the 2026 International Conference on Robotics and Automation (ICRA).

Date, Time and Location

Jul 21, 2026
9:00 AM - 11:00 AM PDT
Online. Register for the Zoom!

Outdoor Robot Navigation in the Unstructured World: From Traversability to Physical Scene Understanding

Outdoor robot navigation in the unstructured world requires robots to reason about more than obstacles: they must understand where they can move, what terrain is suitable, and how scene context affects navigation decisions. In sidewalks, campuses, trails, and off-road environments, these decisions depend on geometric structure, terrain conditions, semantic cues, and robot-environment interaction.

In this talk, I will present our recent work on scene understanding for outdoor navigation, including a large-scale multimodal dataset for studying outdoor traversability, approaches for trajectory generation and selection, vision-language reasoning for contextual navigation, and Gaussian-based 3D scene modeling. I will also discuss how physical reasoning can extend scene understanding from visual and geometric perception toward terrain properties and interaction cues.

Together, these works explore how robots can better interpret unstructured outdoor environments and use that understanding for navigation decision-making.

About the Speaker

Jing Liang is a postdoctoral researcher at the Stanford Robotics Center, working on robot navigation, perception, and human-centered autonomy in complex real-world environments.

Scene Graphs and the Future of Mapping

In this talk, I will question whether 3D reconstruction is still a necessary part of mapping in the age of feedforward models and present some alternatives. Then, I discuss scene graphs as an alternative map representation and their applications for mobile manipulation.

About the Speaker

Hermann Blum is a Junior Professor at the University of Bonn and the Lamarr Institute. Hermann's research focuses on machine learning for robotic perception and scene understanding, developing models and methods to understand an agent's environment semantically and geometrically.

Toward Zero-Shot 6D Pose Estimation and Tracking of Cluttered Objects on Edge Devices

Robust 6D pose estimation of textured objects under diverse illumination conditions remains a significant challenge, often requiring a trade-off between accurate initial pose estimation and efficient real-time tracking. We present a unified framework explicitly designed for efficient execution on edge devices, which fuses a robust initial estimation module with a fast motion-based tracker.

The key to our approach is a shared, lighting-invariant color-pair feature representation that forms a consistent foundation for both stages. For initial estimation, this representation facilitates robust registration between the live RGB-D view and the object's 3D mesh.

For tracking, the same representation validates temporal correspondences, enabling a lightweight model to reliably regress the object's pose. Experiments on benchmark datasets demonstrate that our integrated approach is both effective and robust, providing competitive pose estimation accuracy while maintaining high-fidelity tracking even through abrupt pose changes.

This is joint work with Xingjian Yang.

About the Speaker

Ashis Banerjee is an Associate Professor of Industrial & Systems Engineering and Mechanical Engineering at the University of Washington, Seattle. Prior to joining UW, he was a Research Scientist at GE Global Research and a Postdoctoral Associate at MIT.

Trustworthy Geometric Perception: Certifiable Optimization and Robust Estimation

Autonomous robots in safety-critical settings require geometric perception that is not merely accurate on average, but provably correct under adversarial conditions. Yet most pipelines rely on local optimization methods that fail silently when poorly initialized.

This talk presents GlobustVP, a certifiably optimal vanishing point estimator that reformulates joint VP localization and line association as a quadratically constrained quadratic program (QCQP) and relaxes it to a tight semidefinite program (SDP), achieving the first globally optimal and outlier-robust solution to this problem. Recognized as a Best Paper Award Candidate at CVPR 2025 (top 0.1%, 14 of 13,008 submissions), GlobustVP demonstrates that certifiable global optimization is both practically feasible and highly competitive.

More broadly, this work is part of a research program toward trustworthy geometric perception: systems that know when they are wrong, and can communicate that to the robots and humans that depend on them.
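
For readers unfamiliar with the QCQP-to-SDP step mentioned in this abstract, the generic form of such a relaxation can be written as follows. This is the standard textbook lifting, not the specific GlobustVP formulation; the symbols $x$, $Q_0$, $Q_i$, $b_i$, and $X$ are placeholders introduced here for illustration.

```latex
\begin{align*}
\text{QCQP:}\quad & \min_{x \in \mathbb{R}^n} \; x^\top Q_0 x
  \quad \text{s.t.} \quad x^\top Q_i x = b_i, \quad i = 1, \dots, m \\[4pt]
\text{Lifted form } (X = x x^\top)\text{:}\quad & \min_{X} \; \operatorname{tr}(Q_0 X)
  \quad \text{s.t.} \quad \operatorname{tr}(Q_i X) = b_i, \quad X \succeq 0, \quad \operatorname{rank}(X) = 1 \\[4pt]
\text{SDP relaxation:}\quad & \min_{X} \; \operatorname{tr}(Q_0 X)
  \quad \text{s.t.} \quad \operatorname{tr}(Q_i X) = b_i, \quad X \succeq 0
\end{align*}
```

Dropping the rank constraint makes the problem convex. The relaxation is called tight when its optimal $X$ has rank one, in which case a globally optimal $x$ can be recovered from it.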

About the Speaker

Zhenjun Zhao is a postdoctoral researcher at the University of Zaragoza, working with Javier Civera.
1 attendee from this group
Network event
July 22 - Best of ICRA
Wed, Jul 22 · 9:00 AM PDT
Online
75 attendees from 51 groups
The Best of ICRA is a three-day virtual meetup series featuring researchers presenting their accepted papers from the 2026 International Conference on Robotics and Automation (ICRA).

Date, Time and Location

Jul 22, 2026
9:00 AM - 11:00 AM PDT
Online. Register for the Zoom!

Contrastive Learning on 3D Point Clouds for Geometric Defect Detection

Reliable 3D defect detection in manufacturing is hard: the input is a point cloud — an unordered set that standard neural backbones cannot process directly; high-quality training data is scarce; and real scans are noisy and arrive in arbitrary orientations. We address these challenges in COSARAD, a contrastive learning framework that learns highly discriminative representations of object surface geometry under weak supervision.

When a test object arrives, we extract its features and compare them against a library of defect-free reference shapes for precise, interpretable defect localization — achieving state-of-the-art accuracy on industrial benchmarks such as Real3D-AD. In my talk, I'll cover the design choices behind the system, why contrastive representation learning is the right fit for sparse 3D data, and open problems in scaling inspection to production.
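
The idea of comparing test features against a library of defect-free references can be illustrated with a minimal nearest-neighbor sketch. This is not COSARAD's actual procedure, which the abstract does not detail; the function names, the Euclidean distance, and the fixed threshold are all assumptions made for illustration.

```python
import math

def anomaly_scores(test_features, reference_features):
    """Distance from each test feature to its nearest defect-free reference feature."""
    return [
        min(math.dist(feature, ref) for ref in reference_features)
        for feature in test_features
    ]

def flag_defects(test_features, reference_features, threshold):
    """Indices of test features whose nearest-reference distance exceeds the threshold."""
    scores = anomaly_scores(test_features, reference_features)
    return [index for index, score in enumerate(scores) if score > threshold]

# Toy 2-D features standing in for learned per-point descriptors (hypothetical data).
reference = [[0.0, 0.0], [1.0, 0.0]]
test = [[0.0, 0.1], [1.0, 0.0], [5.0, 5.0]]
defect_indices = flag_defects(test, reference, threshold=1.0)
```

Because each score is tied to a single point, the flagged indices point at where on the surface the deviation occurs, which is what makes this style of comparison interpretable.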

About the Speaker

Alexander Tarvo is a researcher at the University of Washington's MACS Lab, where he works on computer vision with applications in robotics. He holds a PhD in Software Engineering from Brown University and previously held research and engineering roles at Google, Microsoft, and IBM Research. His current research focuses on 3D vision and reinforcement learning for industrial robotics.

A Semantic and Occlusion-Aware Gaussian Mixture Probability Hypothesis Density Filter

Reliable and resilient multi-target tracking is foundational for safe autonomous driving, yet most perception pipelines struggle with sensor noise, heavy clutter, and severe environmental occlusions. To address these limitations, this talk presents a novel Semantic-Occlusion Aware (S-OA) Gaussian Mixture Probability Hypothesis Density (GM-PHD) filter.

By combining geometric occlusion reasoning with deep learning-derived environmental semantics, the proposed framework adaptively initializes target tracking in regions where new targets are likely to appear. Evaluations demonstrate that this context-aware tracking system minimizes track initiation latency and preserves high tracking precision even under intense clutter.

Ultimately, this work demonstrates how embedding spatial and semantic structure into filtering yields a significantly more robust and resilient perception stack for autonomous navigation.

About the Speaker

Jovan Menezes is a PhD student at Cornell University, advised by Prof. Mark Campbell. His research focuses on developing scalable and resilient perception algorithms for autonomous driving. He draws on concepts from probabilistic estimation and deep learning-based computer vision, with the goal of enabling autonomous vehicles to perceive and navigate in challenging environments.

An Annotation-to-Detection Framework for Autonomous and Robust Vine Trunk Localization in the Field by Mobile Agricultural Robots

Autonomous robots struggle to detect objects in unstructured fields, requiring in-domain tuning with laborious manual data collection. In this work, we introduce a comprehensive annotation-to-detection framework designed to train a robust multi-modal detector using limited and partially labeled training data.

Our method combines cross-modal annotation transfer, early sensor fusion, and a multi-stage detection architecture to train and enhance multi-modal detection. Validated on vineyard trunk detection and paired with a custom LOAM algorithm, it localised over 70% of trees in one pass with under 0.37 m mean error.

Our system demonstrated that robust detection is achievable even with minimal initial annotations and human intervention.

About the Speaker

Dimitrios Chatziparaschis is a PhD candidate in EE at the University of California, Riverside. His main research lies at the intersection of computer vision, machine learning, and robotics. Main topics include 3D perception, multi-modal sensing, landmark detection, and localization in outdoor and dynamic settings.

vS-Graphs: Tightly Coupling Visual SLAM and 3D Scene Graphs Exploiting Hierarchical Scene Understanding

We introduce vS-Graphs, a novel real-time VSLAM framework that integrates vision-based scene understanding with map reconstruction and comprehensible graph-based representation. The framework infers structural elements (i.e., rooms and floors) from detected building components (i.e., walls and ground surfaces) and incorporates them into optimizable 3D scene graphs.

This solution enhances the reconstructed map's semantic richness, comprehensibility, and localization accuracy.

About the Speaker

Ali Tourani is an R&D Specialist and a Senior Software Engineer with 8+ years of experience in practical computer vision and AI system design and deployment. Currently, he holds a Postdoctoral Research Associate position at the University of Luxembourg, where he develops vision-language models and generative AI solutions for real-world robotic applications.
2 attendees from this group
Network event
Aug 6 - Audio and AI Meetup
Thu, Aug 6 · 9:00 AM PDT
Online
130 attendees from 51 groups
Join our virtual meetup to hear talks from experts on cutting-edge topics across AI, ML, and computer vision.

Date, Time and Location

Aug 06, 2026
9:00 AM - 11:00 AM PDT
Online. Register for the Zoom!

Do Speech Models Actually Understand Speech? Evaluating Speech LLMs Under Realistic Spoken Instruction Conditions

Speech Large Language Models (SLLMs) are increasingly capable, but are we evaluating them the right way? Most benchmarks rely on text prompts, yet real users interact with these systems through speech, a modality that introduces noise, disfluencies, and stylistic variation that text simply doesn't capture.
In this talk, we present findings from a systematic study across 11 tasks, 12 languages, and five prompt styles, examining how prompt modality, language, and task type shape SLLM performance.

About the Speaker

Maike Züfle is a PhD student at the Karlsruhe Institute of Technology (KIT), working in Prof. Jan Niehues's group on interactive speech systems for more natural human–machine communication. Her research focuses on instruction-following speech models with speech as both input and output, with a recent emphasis on full-duplex systems. Beyond her research, she co-organises the instruction-following and speech translation metrics shared tasks at IWSLT. She is a 2026 Apple Scholar in AI/ML.

AI-Based Audio Forensics

In this presentation, attendees will discover several modules developed by Gradiant for the detection and analysis of synthetically generated or manipulated audio. The session will be delivered by one of the developers involved in the design and implementation of these technologies, providing first-hand insight into their capabilities and underlying methodology.

The presentation will cover the traceability module, which helps identify the origin of AI-generated content. It will also cover the segment detection tool, designed to locate manipulated regions within an audio recording, as well as the complete audio detection tool, which assesses whether an entire recording has been synthetically generated.

About the Speaker

Daniel Paniagua Ares is a research engineer at Gradiant. He graduated in computer engineering from the FIC and holds a master's degree in AI from the VIU.

Curating, Searching, and Evaluating Audio Datasets in FiftyOne

In this talk, we'll start with the ESC-50 environmental-sound dataset to show how FiftyOne represents audio: browsing clips in the tabular view, rendering spectrograms directly in the sample grid with a custom renderer, and turning sounds into searchable vectors with CLAP embeddings. Then we'll demo a similarity-search panel that lets you query an entire audio collection by example clip or a natural-language prompt to quickly find matching sounds.
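
As a rough illustration of what similarity search over embeddings involves, here is a minimal sketch in plain Python. It does not use the FiftyOne or CLAP APIs; the function names and the toy three-dimensional vectors are hypothetical stand-ins for real audio embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

def top_k_similar(query, embeddings, k=3):
    """Return the k (clip_id, score) pairs most similar to the query vector."""
    scored = [
        (clip_id, cosine_similarity(query, vector))
        for clip_id, vector in embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors standing in for real audio embeddings (hypothetical data).
toy_embeddings = {
    "dog_bark": [0.9, 0.1, 0.0],
    "rain": [0.0, 0.2, 0.9],
    "dog_growl": [0.8, 0.3, 0.1],
}
results = top_k_similar([1.0, 0.0, 0.0], toy_embeddings, k=2)
```

Querying by example clip or by text prompt follows the same pattern, provided the query is embedded into the same vector space as the clips.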

We'll conclude with a live research problem: Audio Moment Retrieval from the DCASE 2026 Challenge, where the goal is to localize the exact moment in a long recording that matches a text query. We'll frame this as temporal detection, evaluate predictions, and visualize ground-truth vs. predicted moments on an interactive timeline to intuitively expose model failure modes.
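
One common way to score temporal localization is intersection-over-union between the predicted and ground-truth time intervals. The sketch below assumes that metric for illustration; whether the challenge or the talk uses exactly this measure is not stated here, and the function names and the 0.5 threshold are hypothetical.

```python
def temporal_iou(predicted, truth):
    """Intersection-over-union of two (start, end) intervals, in seconds."""
    pred_start, pred_end = predicted
    true_start, true_end = truth
    intersection = max(0.0, min(pred_end, true_end) - max(pred_start, true_start))
    union = (pred_end - pred_start) + (true_end - true_start) - intersection
    return intersection / union if union > 0 else 0.0

def recall_at_iou(predictions, ground_truths, threshold=0.5):
    """Fraction of queries whose predicted moment overlaps the truth by at least the threshold."""
    if not ground_truths:
        return 0.0
    hits = sum(
        1
        for predicted, truth in zip(predictions, ground_truths)
        if temporal_iou(predicted, truth) >= threshold
    )
    return hits / len(ground_truths)
```

For example, `temporal_iou((0, 10), (5, 15))` overlaps for 5 seconds out of a 15-second union, giving one third, which would count as a miss at a 0.5 threshold.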

Attendees will leave with a concrete blueprint and open code for applying visual data-centric AI practices to their own audio and multimodal datasets.

About the Speaker

John Duncan is a Machine Learning Engineer, Customer Success at Voxel51. His research interests include vision, LiDAR, and audio perception for robots and intelligent systems.
1 attendee from this group
Network event
Sept 24 - AI, ML and Computer Vision Meetup
Thu, Sep 24 · 9:00 AM PDT
Online
52 attendees from 51 groups
Join our virtual meetup on September 24 to hear talks from experts on cutting-edge topics across AI, ML, and computer vision.

Date, Time and Location

Sep 24, 2026
9:00 AM - 11:00 AM PDT
Online. Register for the Zoom!

How Do Mercedes-Benz AI Principles Drive our Innovation?

At Mercedes-Benz, our AI Principles guide every step of innovation, emphasizing responsible use, safety and reliability, explainability, and the protection of privacy. These principles go beyond statements and actively shape how we design, test, and deploy AI systems in real-world automotive and enterprise settings. In this talk, I will present how these principles inspired our recent research on when reusing LoRA (Low-Rank Adaptation) is effective. By combining theoretical analysis with synthetic data as a proxy for enterprise scenarios, we uncovered the strengths and limitations of modular AI components under constrained data access. Our findings provide practical guidance on when reused LoRAs could deliver high-quality results.
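
For context on what a LoRA is, the sketch below shows the usual low-rank update, in which a frozen weight matrix W is adjusted by a scaled product of two small matrices B and A. It is a generic illustration in plain Python, not the speaker's code or experimental setup; the helper names and the toy matrices are hypothetical.

```python
def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(left[i][k] * right[k][j] for k in range(len(right)))
         for j in range(len(right[0]))]
        for i in range(len(left))
    ]

def lora_effective_weight(w, b, a, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the weight a LoRA-adapted layer applies."""
    delta = matmul(b, a)   # (d_out x rank) @ (rank x d_in) -> (d_out x d_in)
    scale = alpha / rank
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]

# Toy example: a 2x2 frozen weight with a rank-1 adapter (hypothetical values).
frozen_w = [[1.0, 0.0], [0.0, 1.0]]
adapter_b = [[1.0], [0.0]]   # d_out x rank
adapter_a = [[0.0, 2.0]]     # rank x d_in
adapted_w = lora_effective_weight(frozen_w, adapter_b, adapter_a, alpha=1.0, rank=1)
```

Because only B and A are stored per task, an adapter trained in one setting can be attached to the same frozen base in another, which is the kind of reuse the talk examines.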

About the Speaker

Mei-Yen Chen is a Senior Data Scientist at Mercedes-Benz Tech Innovation GmbH in Germany with 10 years of industry experience in AI and data solutions. She leads early-stage AI projects across business functions and collaborates with research institutions on machine learning and responsible AI.

Region Tokens as the Visual Primitive: From Recognition to World Modeling

Patch-based tokenization has become the default interface between vision encoders and downstream models, yet patches carry no semantic structure and scale poorly with resolution and temporal extent. This talk presents a research program centered on replacing patch tokens with region-level representations — semantically dense tokens grounded in visual entities rather than arbitrary grid crops.

I will describe RELOCATE, REN, and T-REN, a progression of methods that produce region tokens via pooling, train them with region-level objectives, and extend them to video with temporal coherence. I will then present ongoing work integrating region tokens into VLMs to directly expand visual context capacity, and preliminary results on future region trajectory prediction as a foundation for world modeling.

The broader thesis is that region-level tokens are a more natural unit of visual computation than patches, and their advantage compounds as task complexity, resolution, and temporal horizon increase.

About the Speaker

Savya Khosla is a second-year Ph.D. student at the University of Illinois Urbana-Champaign, advised by Prof. Derek Hoiem and Prof. Alex Schwing.

Leveraging Text-To-Image Diffusion Models for Consistent Set-to-Set Generation

Image collections are humans' primary way of capturing the world, yet advances in generative editing remain largely inapplicable to this modality. We address this gap by introducing Match-and-Fuse, a zero-shot, training-free method for consistent set-to-set generation from image collections that share a common visual element but differ in viewpoint, capture time, and surrounding content.
Our key idea is a unified graph-based framework that combines dense correspondences with an emergent prior in text-to-image diffusion models to generate coherent canvases. We achieve state-of-the-art consistency and visual quality, and unlock new creative capabilities for content generation.

About the Speaker

Kate Feingold is a PhD student in Computer Vision at the Weizmann Institute of Science. Her research sits at the intersection of generative models, 3D/4D perception, and multimodal learning, focusing on problems where vision meets other modalities or paradigms in creative tasks.

Yield Estimation of Coffee in a Dense Environment

This presentation details a workflow for estimating coffee yield in a dense environment. Starting from photos of pre-harvest coffee plants from a couple of coffee estates, it covers pre-processing, annotation to detect regions of interest (ROI), object detection training and inference results with various YOLO models, and segmentation with SAM2 and YOLO*-seg, including training and inference results. The goal is to determine the count of raw, pre-mature, mature, and over-mature coffee berries and, finally, the yield of the entire estate. All of this is based on real-world data captured on iPhone and Android phones.

About the Speaker

Raghu M. Rao is a consultant working on applications of computer vision AI models. He was previously with AMD and Xilinx. He has a Ph.D. in Wireless Communications from UCLA and is a Senior Member, IEEE. His current interests are in applications of AI for agriculture, health care and wireless communications.