
May 29 - Best of WACV 2025

123 attendees from 36 groups hosting
Hosted By
Jimmy G.

Details

This is a virtual event taking place on May 29, 2025 at 9 AM Pacific.

Register for the Zoom

Welcome to the Best of WACV 2025 virtual series, which highlights some of the groundbreaking research, insights, and innovations that defined this year’s conference, live-streamed from the authors to you. The IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) is the premier international computer vision event, comprising the main conference and several co-located workshops and tutorials.

DreamBlend: Advancing Personalized Fine-tuning of Text-to-Image Diffusion Models

Given a small number of images of a subject, personalized image generation techniques can fine-tune large pre-trained text-to-image diffusion models to generate images of the subject in novel contexts, conditioned on text prompts. In doing so, a trade-off is made between prompt fidelity, subject fidelity, and diversity. As the pre-trained model is fine-tuned, earlier checkpoints synthesize images with low subject fidelity but high prompt fidelity and diversity. In contrast, later checkpoints generate images with low prompt fidelity and diversity but high subject fidelity. This inherent trade-off limits the prompt fidelity, subject fidelity, and diversity of generated images. In this work, we propose DreamBlend to combine the prompt fidelity from earlier checkpoints and the subject fidelity from later checkpoints during inference. We perform cross-attention-guided image synthesis from a later checkpoint, guided by an image generated by an earlier checkpoint for the same prompt. This enables the generation of images with better subject fidelity, prompt fidelity, and diversity on challenging prompts, outperforming state-of-the-art fine-tuning methods.

Paper: DreamBlend: Advancing Personalized Fine-tuning of Text-to-Image Diffusion Models
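
To make the checkpoint trade-off concrete, here is a minimal sketch of the inference-time idea: draw a reference image from an early fine-tuning checkpoint and use it to steer sampling from a later one. This is not the paper’s cross-attention guidance; as a stand-in it uses plain image-to-image conditioning from the diffusers library, and the checkpoint paths, prompt, and hyperparameters are hypothetical.

```python
# Illustrative sketch only (not the DreamBlend implementation).
# Checkpoint paths below are placeholders for two checkpoints saved
# at different points of the same personalization fine-tuning run.
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

prompt = "a photo of sks dog wearing a chef hat in a kitchen"

# Early checkpoint: high prompt fidelity and diversity, low subject fidelity.
early = StableDiffusionPipeline.from_pretrained(
    "path/to/finetune-checkpoint-400", torch_dtype=torch.float16
).to("cuda")
reference = early(prompt, num_inference_steps=50).images[0]

# Later checkpoint: high subject fidelity, weaker prompt fidelity.
# DreamBlend steers this checkpoint with cross-attention from the reference;
# simple image-to-image conditioning is used here as a rough stand-in.
late = StableDiffusionImg2ImgPipeline.from_pretrained(
    "path/to/finetune-checkpoint-1600", torch_dtype=torch.float16
).to("cuda")
blended = late(prompt=prompt, image=reference, strength=0.6).images[0]
blended.save("dreamblend_sketch.png")
```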

About the Speaker

Shwetha Ram is an Applied Scientist at Amazon, where she focuses on advancing multimodal capabilities for Rufus, Amazon’s generative AI-powered conversational shopping assistant. Her work has contributed to a range of innovative initiatives across Amazon, including Lab126, Scout (the autonomous sidewalk delivery robot), and M5 (Amazon’s foundation models). Prior to joining Amazon, Shwetha was part of the Image Technology Incubation team at Dolby Laboratories, where she explored emerging opportunities for Dolby in AR/VR and immersive media technologies.

Robust Multi-Class Anomaly Detection under Domain Shift

Robust multi-class anomaly detection under domain shift is a fundamental challenge in real-world scenarios, where detectors must distinguish different types of anomalies despite significant distribution shifts. Traditional approaches often struggle to generalize across domains and handle inter-class interference. ROADS addresses these limitations through a prompt-driven framework that combines a hierarchical class-aware prompt mechanism with a domain adapter to jointly encode discriminative, class-specific prompts and learn domain-invariant representations. Extensive evaluations on the MVTec-AD and VisA datasets show that ROADS achieves superior performance in both anomaly detection and localization, particularly in out-of-distribution settings.

Paper: ROADS: Robust Prompt-driven Multi-Class Anomaly Detection under Domain Shift
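
As a rough illustration of the two components named in the abstract, the sketch below pairs learnable class-specific prompt tokens with a small residual adapter over image features. Every name, shape, and layer choice here is an assumption for illustration, not the ROADS architecture.

```python
# Illustrative-only sketch: class-aware prompts + a lightweight domain adapter.
# Shapes, dimensions, and the toy "image features" are placeholders for the
# output of a frozen backbone; this is not the ROADS code.
import torch
import torch.nn as nn

class ClassAwarePrompts(nn.Module):
    def __init__(self, num_classes: int, prompt_len: int, dim: int):
        super().__init__()
        # One learnable prompt (prompt_len tokens) per anomaly class.
        self.prompts = nn.Parameter(torch.randn(num_classes, prompt_len, dim) * 0.02)

    def forward(self, class_idx: torch.Tensor) -> torch.Tensor:
        return self.prompts[class_idx]            # (B, prompt_len, dim)

class DomainAdapter(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Residual bottleneck that nudges features toward a shared space.
        self.net = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(), nn.Linear(dim // 4, dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return feats + self.net(feats)            # domain-adapted features

# Toy usage with random features standing in for a frozen vision backbone.
feats = torch.randn(8, 197, 512)                  # (batch, tokens, dim)
prompts = ClassAwarePrompts(num_classes=15, prompt_len=4, dim=512)
adapter = DomainAdapter(dim=512)
cls_ids = torch.randint(0, 15, (8,))
tokens = torch.cat([prompts(cls_ids), adapter(feats)], dim=1)
print(tokens.shape)                               # torch.Size([8, 201, 512])
```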

About the Speaker

Hossein Kashiani is a fourth-year Ph.D. student at Clemson University. His research focuses on developing generalizable and trustworthy AI systems, with publications in top venues such as CVPR (2025), WACV (2025), ICIP, and TBIOM. His work spans diverse applications, including anomaly detection, media forensics, biometrics, healthcare, and visual perception.

What Remains Unsolved in Computer Vision? Rethinking the Boundaries of State-of-the-Art

Despite rapid progress and increasingly powerful models, computer vision still struggles with a range of foundational challenges. This talk revisits the “blind spots” of state-of-the-art vision systems, focusing on problems that remain difficult in real-world applications. I will share insights from recent work on multi-object tracking—specifically cases involving prolonged occlusions, identity switches, and visually indistinguishable subjects such as identical triplets in motion. Through examples from DragonTrack and other methods, I’ll explore why these problems persist and what they reveal about the current limits of our models. Ultimately, this talk invites us to look beyond benchmark scores and rethink how we define progress in visual perception.

About the Speaker

Bishoy Galoaa is an incoming PhD student in Electrical and Computer Engineering at Northeastern University, under the supervision of Prof. Sarah Ostadabbas. His research centers on multi-object tracking and scene understanding in complex environments, with a focus on problems that challenge the assumptions of current deep learning models.

LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living

Current Large Language Vision Models trained on web videos perform well in general video understanding but struggle with fine-grained details, complex human-object interactions (HOI), and the view-invariant representation learning essential for Activities of Daily Living (ADL). In this talk, I will introduce LLAVIDAL, a foundation model catered toward understanding ADL, and the tricks used to train such models.

Paper: LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living

About the Speaker

Srijan Das is an Assistant Professor in the Department of Computer Science at the University of North Carolina at Charlotte. At UNC Charlotte, he works on Video Representation Learning and Robotic Vision. He is a member of the AI4Health Center and one of the founding members of the Charlotte Machine Learning Lab (CharMLab) at UNC Charlotte.

San Francisco AI Meetup Group