
What we’re about
This is the official group for the Data Science Salon Community in San Francisco/Bay Area: https://datascience.salon/
Our mission is to bring together everyone that is interested in and working in the cloud technologies. Learn, network, present and meet others in the cloud space.
Access on demand content on YouTube: https://www.youtube.com/c/DataScienceSalon
Sponsors
Upcoming events
8
- Network event
•OnlineNov 14 - Workshop: Document Visual AI with FiftyOne
Online141 attendees from 47 groupsThis hands-on workshop introduces you to document visual AI workflows using FiftyOne, the leading open-source toolkit for computer vision datasets.
Date and Location
Nov 14, 2025
9:00-10:30 AM Pacific
Online. Register for the ZoomIn document understanding, a pixel is worth a thousand tokens. While traditional text-extraction pipelines tokenize and process documents sequentially, modern visual AI approaches can understand document structure, layout, and content directly from images—making them more efficient, accurate, and robust to diverse document formats.
In this workshop you'll learn how to:
- Load and organize document datasets in FiftyOne for visual exploration and analysis
- Compute visual embeddings using state-of-the-art document retrieval models to enable semantic search and similarity analysis
- Leverage FiftyOne workflows including similarity search, clustering, and quality assessment to gain insights from your document collections
- Deploy modern vision-language models for OCR and document understanding tasks that go beyond simple text extraction
- Evaluate and compare different OCR models to select the best approach for your specific use case
Whether you're working with invoices, receipts, forms, scientific papers, or mixed document types, this workshop will equip you with practical skills to build robust document AI pipelines that harness the power of visual understanding. Walk away with reproducible notebooks and best practices for tackling real-world document intelligence challenges.
2 attendees from this group - Network event
•OnlineNov 19 - Best of ICCV (Day 1)
Online128 attendees from 44 groupsWelcome to the Best of ICCV series, your virtual pass to some of the groundbreaking research, insights, and innovations that defined this year’s conference. Live streaming from the authors to you.
Date, Time and Location
Nov 19, 2025
9 AM Pacific
Online. Register for the Zoom!AnimalClue: Recognizing Animals by their Traces
Wildlife observation plays an important role in biodiversity conservation, necessitating robust methodologies for monitoring wildlife populations and interspecies interactions. Recent advances in computer vision have significantly contributed to automating fundamental wildlife observation tasks, such as animal detection and species identification. However, accurately identifying species from indirect evidence like footprints and feces remains relatively underexplored, despite its importance in contributing to wildlife monitoring.
To bridge this gap, we introduce AnimalClue, the first large-scale dataset for species identification from images of indirect evidence. Our dataset consists of 159,605 bounding boxes encompassing five categories of indirect clues: footprints, feces, eggs, bones, and feathers. It covers 968 species, 200 families, and 65 orders. Each image is annotated with species-level labels, bounding boxes or segmentation masks, and fine-grained trait information, including activity patterns and habitat preferences. Unlike existing datasets primarily focused on direct visual features (e.g., animal appearances), AnimalClue presents unique challenges for classification, detection, and instance segmentation tasks due to the need for recognizing more detailed and subtle visual features. In our experiments, we extensively evaluate representative vision models and identify key challenges in animal identification from their traces.
About the Speaker
Risa Shinoda received her M.S. and Ph.D. in Agricultural Science from Kyoto University in 2022 and 2025. Since April 2025, she has been serving as a Specially Appointed Assistant Professor at the Graduate School of Information Science and Technology, the University of Osaka. She is engaged in research on the application of image recognition to plants and animals, as well as vision-language models.
LOTS of Fashion! Multi-Conditioning for Image Generation via Sketch-Text Pairing
Fashion design is a complex creative process that blends visual and textual expressions. Designers convey ideas through sketches, which define spatial structure and design elements, and textual descriptions, capturing material, texture, and stylistic details. In this paper, we present LOcalized Text and Sketch for fashion image generation (LOTS), an approach for compositional sketch-text based generation of complete fashion outlooks. LOTS leverages a global description with paired localized sketch + text information for conditioning and introduces a novel step-based merging strategy for diffusion adaptation.
First, a Modularized Pair-Centric representation encodes sketches and text into a shared latent space while preserving independent localized features; then, a Diffusion Pair Guidance phase integrates both local and global conditioning via attention-based guidance within the diffusion model’s multi-step denoising process. To validate our method, we build on Fashionpedia to release Sketchy, the first fashion dataset where multiple text-sketch pairs are provided per image. Quantitative results show LOTS achieves state-of-the-art image generation performance on both global and localized metrics, while qualitative examples and a human evaluation study highlight its unprecedented level of design customization.
About the Speaker
Federico Girella is a third-year Ph.D. student at the University of Verona (Italy), supervised by Prof. Marco Cristani, with expected graduation in May 2026. His research involves joint representations in the Image and Language multi-modal domain, working with deep neural networks such as (Large) Vision and Language Models and Text-to-Image Generative Models. His main body of work focuses on Text-to-Image Retrieval and Generation in the Fashion domain.
ProtoMedX: Explainable Multi-Modal Prototype Learning for Bone Health Assessment
Early detection of osteoporosis and osteopenia is critical, yet most AI models for bone health rely solely on imaging and offer little transparency into their decisions. In this talk, I will present ProtoMedX, the first prototype-based framework that combines lumbar spine DEXA scans with patient clinical records to deliver accurate and inherently explainable predictions.
Unlike black-box deep networks, ProtoMedX classifies patients by comparing them to learned case-based prototypes, mirroring how clinicians reason in practice. Our method not only achieves state-of-the-art accuracy on a real NHS dataset of 4,160 patients but also provides clear, interpretable explanations aligned with the upcoming EU AI Act requirements for high-risk medical AI. Beyond bone health, this work illustrates how prototype learning can make multi-modal AI both powerful and transparent, offering a blueprint for other safety-critical domains.
About the Speaker
Alvaro Lopez is a PhD candidate in Explainable AI at Lancaster University and an AI Research Associate at J.P. Morgan in London. His research focuses on prototype-based learning, multi-modal AI, and AI security. He has led projects on medical AI, fraud detection, and adversarial robustness, with applications ranging from healthcare to financial systems.
CLASP: Adaptive Spectral Clustering for Unsupervised Per-Image Segmentation
We introduce CLASP (Clustering via Adaptive Spectral Processing), a lightweight framework for unsupervised image segmentation that operates without any labeled data or fine-tuning. CLASP first extracts per-patch features using a self-supervised ViT encoder (DINO); then, it builds an affinity matrix and applies spectral clustering. To avoid manual tuning, we select the segment count automatically with a eigengap-silhouette search, and we sharpen the boundaries with a fully connected DenseCRF. Despite its simplicity and training-free nature, CLASP attains competitive mIoU and pixel-accuracy on COCO-Stuff and ADE20K, matching recent unsupervised baselines. The zero-training design makes CLASP a strong, easily reproducible baseline for large unannotated corpora—especially common in digital advertising and marketing workflows such as brand-safety screening, creative asset curation, and social-media content moderation.
About the Speaker
Max Curie is a Research Scientist at Integral Ad Science, building fast, lightweight solutions for brand safety, multi-media classification, and recommendation systems. As a former nuclear physicist at Princeton University, he brings rigorous analytical thinking and modeling discipline from his physics background to advance ad tech.
1 attendee from this group - Network event
•OnlineNov 20 - Best of ICCV (Day 2)
Online93 attendees from 44 groupsWelcome to the Best of ICCV series, your virtual pass to some of the groundbreaking research, insights, and innovations that defined this year’s conference. Live streaming from the authors to you.
Date, Time and Location
Nov 20, 2025
9 AM Pacific
Online. Register for the Zoom!SGBD: Sharpness-Aware Mirror Gradient with BLIP-Based Denoising for Robust Multimodal Product Recommendation
The growing integration of computer vision and machine learning into the retail industry—both online and in physical stores—has driven the adoption of multimodal recommender systems to help users navigate increasingly complex product landscapes. These systems leverage diverse data sources, such as product images, textual descriptions, and user-generated content, to better model user preferences and item characteristics. While the fusion of multimodal data helps address issues like data sparsity and cold-start problems, it also introduces challenges such as information inconsistency, noise, and increased training instability.
In this paper, we analyze these robustness issues through the lens of flat local minima and propose a strategy that incorporates BLIP—a Vision-Language Model with strong denoising capabilities—to mitigate noise in multimodal inputs. Our method, Sharpness-Aware Mirror Gradient with BLIP-Based Denoising (SGBD), is a concise yet effective training strategy that implicitly enhances robustness during optimization. Extensive theoretical and empirical evaluations demonstrate its effectiveness across various multimodal recommendation benchmarks. SGBD offers a scalable solution for improving recommendation performance in real-world retail environments, where noisy, high-dimensional, and fast-evolving product data is the norm, making it a promising paradigm for training robust multi-modal recommender systems in retail industry.
About the Speaker
Kathy Wu holds a Ph.D. in Applied Mathematics and dual M.S. degrees in Computer Science and Quantitative Finance from the University of Southern California (USC), Los Angeles, CA, USA. At USC, she served as a course lecturer, offering ML Foundations and ML for Business Applications in the science school and business school. Her academic research spans high-dimensional statistics, deep learning, and causal inference, etc.
Kathy brings industry experience from Meta, LinkedIn, and Morgan Stanley in the Bay Area and New York City, US, where she focused on AI methodologies and real-world applications. She is currently an Applied Scientist at Amazon, within the Global Store organization, leading projects in E-Commerce Recommendation Systems, Search Engines, Multi-Modal Vision-Language Models (VLMs), and LLM/GenAI in retails.
Her work has been published in top-tier conferences including ICCV, CVPR, ICLR, SIGIR, WACV, etc. At ICCV 2025, she won the Best Paper Award in Retail Vision.
Spatial Mental Modeling from Limited Views
Can VLMs imagine the unobservable space from just a few views, like humans do? Humans form spatial mental models, as internal representations of "unseen space" to reason about layout, perspective, and motion. On our proposed MINDCUBE, we see critical gap systematically on VLMs building robust spatial mental models through representing positions (cognitive mapping), orientations (perspective-taking), and dynamics (mental simulation for ''what-if'' movements). We then explore three approaches to help VLMs approximate spatial mental models, including unseen intermediate views, natural language reasoning chains, and cognitive maps.
The significant improvement comes from ''map-then-reason'' that jointly trains the model to first abstract a cognitive map and then reason upon it. By training models to construct and reason over these internal maps, we boosted accuracy from 37.8% to 60.8% (+23.0%). Adding reinforcement learning pushed performance even further to 70.7% (+32.9%). Our key insight is that such scaffolding of spatial mental models, actively constructing and utilizing internal structured spatial representations with flexible reasoning processes, significantly improves understanding of "unobservable space".
We aim to understand why geometric concepts remain challenging for VLMs and outlining promising research directions towards fostering more robust spatial intelligence.
About the Speaker
Manling Li is an Assistant Professor at Northwestern University and Amazon Scholar. She was a postdoc at Stanford University, and obtained the PhD degree in Computer Science at University of Illinois Urbana-Champaign in 2023. She works on the intersection of language, vision, and robotics, recognized by the MIT TR 35 Under 35, ACL Inaugural Dissertation Award Honorable Mention, ACL’24 Outstanding Paper Award, ACL'20 Best Demo Paper Award, and NAACL'21 Best Demo Paper Award, Microsoft Research PhD Fellowship, EE CS Rising Star, etc.
Forecasting and Visualizing Air Pollution via Sky Images and VLM-Guided Generative Models
Air pollution monitoring is traditionally limited by costly sensors and sparse data coverage. Our research introduces a vision-language model framework that predicts air quality directly from real-world sky images and also simulates skies under varying pollution levels to enhance interpretability and robustness. We further develop visualization techniques to make predictions more understandable for policymakers and the public. This talk will present our methodology, key findings, and implications for sustainable urban environments.
About the Speaker
Mohammad Saleh Vahdatpour is a PhD candidate in Computer Science at Georgia State University specializing in deep learning, vision–language models, and sustainable AI systems. His research bridges generative AI, environmental monitoring, and motion perception, focusing on scalable and energy-efficient models that connect scientific innovation with real-world impact.
Sari Sandbox: A Virtual Retail Store Environment for Embodied AI Agents
We present Sari Sandbox, a high-fidelity, photorealistic 3D retail store simulation for benchmarking embodied agents against human performance in shopping tasks. Addressing a gap in retail-specific sim environments for embodied agent training, Sari Sandbox features over 250 interactive grocery items across three store configurations, controlled via an API. It supports both virtual reality (VR) for human interaction and a vision language model (VLM)-powered embodied agent.
We also introduce SariBench, a dataset of annotated human demonstrations across varied task difficulties. Our sandbox enables embodied agents to navigate, inspect, and manipulate retail items, providing baselines against human performance. We conclude with benchmarks, performance analysis, and recommendations for enhancing realism and scalability.
About the Speakers
Emmanuel G. Maminta is a fourth-year Artificial Intelligence Ph.D. student at the Ubiquitous Computing Laboratory (UCL) in the University of the Philippines Diliman, advised by Prof. Rowel O. Atienza.
Janika Deborah B.Gajo is an undergraduate student studying for a Bachelor of Science in Computer Engineering at the University of the Philippines, Diliman.
1 attendee from this group - Network event
•OnlineNov 21 - Best of ICCV (Day 3)
Online106 attendees from 44 groupsWelcome to the Best of ICCV series, your virtual pass to some of the groundbreaking research, insights, and innovations that defined this year’s conference. Live streaming from the authors to you.
Date, Time and Location
Nov 21, 2025
9 AM Pacific
Online. Register for the Zoom!GECO: Geometrically Consistent Embedding with Lightspeed Inference
Recent advances in feature learning have shown that self-supervised vision foundation models can capture semantic correspondences but often lack awareness of underlying 3D geometry. GECO addresses this gap by producing geometrically coherent features that semantically distinguish parts based on geometry (e.g., left/right eyes, front/back legs).
We propose a training framework based on optimal transport, enabling supervision beyond keypoints, even under occlusions and disocclusions. With a lightweight architecture, GECO runs at 30 fps, 98.2% faster than prior methods, while achieving state-of-the-art performance on PFPascal, APK, and CUB, improving PCK by 6.0%, 6.2%, and 4.1%, respectively. Finally, we show that PCK alone is insufficient to capture geometric quality and introduce new metrics and insights for more geometry-aware feature learningAbout the Speaker
Regine Hartwig is a PHD Graduate Student at the Technical University of Munich
Proactive Comorbidity Prediction in HIV: Towards Fair and Trustworthy Care
HIV is a chronic infection that weakens the immune system and exposes patients to a high burden of comorbidities. While antiretroviral therapy has improved life expectancy, comorbidities remain a major challenge, and traditional screening protocols often fail to capture subtle risk patterns early enough. To address this, we develop a novel method trained on lab tests and demographic data from 2,200 patients in SE London. The method integrates feature interaction modeling, attention mechanisms, residual fusion and label-specific attention heads, outperforming TabNet, MLPs and classical machine learning models.
Our experiments show that incorporating demographic information improves predictive performance, though demographic recoverability analyses reveal that age and gender can still be inferred from lab data alone, raising fairness concerns. Finally, robustness checks confirm stable feature importance across cross-validation folds, reinforcing the trustworthiness of our approach.
About the Speaker
Dimitrios Kollias is an Associate Professor in Multimodal AI at Queen Mary University of London, specializing in machine/deep learning, trustworthy AI, computer vision, medical imaging & healthcare, behavior analysis, HMI. I have published 80+ papers (h-index 39; 6100+ citations) in top venues (e.g., CVPR, ICCV, ECCV, AAAI, IJCV, ECAI), invented a patent in behavior analysis (Huawei) and my research is widely adopted by academia and industry. I also serve as AI consultant and advisor to global companies, and have played leading roles in major international AI workshops and competitions.
Toward Trustworthy Embodied Agents: From Individuals to Teams
Modern intelligent embodied agents, such as service robots and autonomous vehicles, interact frequently with humans in dynamic, uncertain environments. They may also collaborate with each other as a team through effective communication to enhance task success, safety, and efficiency. These brings a few significant challenges. First, building reliable agents that safely navigate multi-agent scenarios requires scalable and generalizable prediction of surrounding agents’ behaviors and robust decision making under environmental uncertainty in out-of-distribution (OOD) scenarios. Second, effective cooperation between agents requires efficient communication and information fusion strategies and reliable task planning for complex long-horizon tasks.
In this talk, I will introduce a series of our recent work that addresses these challenges to enable safe and trustworthy embodied agents and their application to autonomous driving and service robots. Specifically, I will first demonstrate principled uncertainty quantification techniques and how they enable generalizable prediction and planning in out-of-distribution scenarios. Then, I will talk about effective approaches to enable efficient multi-agent communication and cooperation in centralized and decentralized settings.
About the Speaker
Dr. Jiachen Li is an Assistant Professor in the Department of Electrical and Computer Engineering (ECE) and a cooperating faculty in the Department of Computer Science and Engineering (CSE) at the University of California, Riverside. He is the Director of the Trustworthy Autonomous Systems Laboratory and is affiliated with the Riverside Artificial Intelligence Research Institute (RAISE), the Center for Robotics and Intelligent Systems (CRIS), and the Center for Environmental Research and Technology (CE-CERT).
DRaM-LHM: A Quaternion Framework for Iterative Camera Pose Estimation
We explore a quaternion adjugate matrix-based representation for rotational motion in the Perspective-n-Point (PnP) problem. Leveraging quadratic quaternion terms within a Determinant Ratio Matrix (DRaM) estimation framework, we extend its application to perspective scenarios, providing a robust and efficient initialization for iterative PnP pose estimation. Notably, by solving the orthographic projection least-squares problem, DRaM provides a reliable initialization that enhances the accuracy and stability of iterative PnP solvers. Experiments on synthetic and real data demonstrate its efficiency, accuracy, and robustness, particularly under high noise conditions. Furthermore, our nonminimal formulation ensures numerical stability, making it effective for real-world applications.
About the Speaker
Chen Lin was a Research Fellow at the Simons Foundation, where she specialized in 3D computer vision and visual(-inertial) SLAM. Her research spans from classical multiview geometry to learning-based pose estimation and scene understanding. Her ICCV 2025 paper introduces a new framework for rotation and pose estimation built on advanced algebraic paradigms.
1 attendee from this group
Past events
113









