August 7 - Understanding Visual Agents


Details
Join us for a virtual event to hear talks from experts on the current state of visual agents.
When
Aug 7, 2025 at 9 AM Pacific
Where
Virtual. Register for the Zoom.
Foundational capabilities and models for generalist agents for computers
As we move toward a future where language agents can operate software, browse the web, and automate tasks across digital environments, a pressing challenge emerges: how do we build foundational models that can act as generalist agents for computers? In this talk, we explore the design of such agents: ones that combine vision, language, and action to understand complex interfaces and carry out user intent accurately.
As a case study, we present OmniACT, a benchmark that grounds this vision by pairing natural language prompts with UI screenshots and executable scripts for both desktop and web environments. Through OmniACT, we examine the performance of today’s top language and multimodal models, highlight the limitations in current agent behavior, and discuss research directions needed to close the gap toward truly capable, general-purpose digital agents.
About the Speaker
Raghav Kapoor works in machine learning at Adobe, where he is part of the Brand Services team and contributes to projects in brand intelligence. His work blends research with applied machine learning. Prior to joining Adobe, Raghav earned his Master’s degree from Carnegie Mellon University, where his research focused on multimodal machine learning and web-based agents. He also brings industry experience from his time as a strategist at Goldman Sachs India.
BEARCUBS: Evaluating Web Agents' Real-World Information-Seeking Abilities
This talk covers the challenges of evaluating AI agents in dynamic web settings, the design and implementation of the BEARCUBS benchmark, and insights gained from comparing human and agent performance. We will also discuss the significant performance gap between human users and current state-of-the-art agents, highlighting areas for future improvement in AI web navigation and information retrieval.
About the Speaker
Yixiao Song is a Ph.D. candidate in Computer Science at the University of Massachusetts Amherst. Her research focuses on enhancing the evaluation of natural language processing systems, particularly in assessing factuality and reliability in AI-generated content. Her work encompasses the development of tools and benchmarks such as VeriScore, an automatic metric for evaluating the factuality of long-form text generation, and BEARCUBS, a benchmark for assessing AI agents' ability to identify factual information from web content.
Visual Agents: What it takes to build an agent that can navigate GUIs like humans
We’ll examine conceptual frameworks, potential applications, and future directions of technologies that can “see” and “act” with increasing independence. The discussion will touch on both current limitations and promising horizons in this evolving field.
About the Speaker
Harpreet Sahota is a hacker-in-residence and machine learning engineer with a passion for deep learning and generative AI. He’s got a deep interest in RAG, Agents, and Multimodal AI.
