April 9 - Workshop: Build a Visual Agent that can Navigate GUIs like Humans
Hosted by Computer Vision Israel Meetup
Details
This hands-on workshop provides a comprehensive introduction to building and evaluating visual agents for GUI automation using modern tools and techniques.
Date, Time and Location
April 9, 2026 at 9 AM Pacific
Online. Register for the Zoom
Visual agents that can understand and interact with graphical user interfaces represent a transformative frontier in AI automation. These systems combine computer vision, natural language understanding, and spatial reasoning to enable machines to navigate complex interfaces—from web applications to desktop software—just as humans do. However, building robust GUI agents requires careful attention to dataset curation, model evaluation, and iterative improvement workflows.
Participants will learn how to leverage FiftyOne, an open-source toolkit for dataset curation and computer vision workflows, to build production-ready GUI agent systems.
What You'll Learn:
- Dataset Creation & Management: How to structure, annotate, and load GUI interaction datasets using the COCO4GUI standardized format
- Data Exploration & Analysis: Using FiftyOne's interactive interface to visualize datasets, analyze action distributions, and understand annotation patterns
- Multimodal Embeddings: Computing embeddings for screenshots and UI element patches to enable similarity search and retrieval
- Model Inference: Running state-of-the-art models like Microsoft's GUI-Actor to predict interaction points from natural language instructions
- Performance Evaluation: Measuring model accuracy using standard metrics and normalized click distance to assess localization precision
- Failure Analysis: Investigating model failures through attention maps, error pattern analysis, and systematic debugging workflows
- Data-Driven Improvement: Tagging samples based on error types (attention misalignment vs. localization errors) to prioritize fine-tuning efforts
- Synthetic Data Generation: Using FiftyOne plugins to augment training data with synthetic task descriptions and variations
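To give a flavor of the evaluation and tagging steps listed above, here is a minimal sketch in plain Python. It is not the workshop's code and does not use the FiftyOne API. It assumes that "normalized click distance" means the Euclidean distance between the predicted click and the center of the target element, with x and y divided by the screenshot width and height; the workshop may define it differently. The `ClickSample` class, the function names, the tag names, and the 0.05 threshold are all hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ClickSample:
    """Hypothetical container for one prediction (not a FiftyOne type)."""
    pred_xy: Tuple[float, float]                    # predicted click, in pixels
    target_box: Tuple[float, float, float, float]   # x, y, width, height in pixels (COCO-style)
    image_size: Tuple[int, int]                     # screenshot width, height in pixels


def normalized_click_distance(sample: ClickSample) -> float:
    """Distance from the predicted click to the target center, in normalized units.

    Assumption: each axis is divided by the screenshot size, so the value
    is comparable across screenshots of different resolutions.
    """
    width, height = sample.image_size
    x, y, box_w, box_h = sample.target_box
    center_x, center_y = x + box_w / 2, y + box_h / 2
    pred_x, pred_y = sample.pred_xy
    return math.hypot((pred_x - center_x) / width, (pred_y - center_y) / height)


def is_hit(sample: ClickSample) -> bool:
    """True when the predicted click lands inside the target bounding box."""
    x, y, box_w, box_h = sample.target_box
    pred_x, pred_y = sample.pred_xy
    return x <= pred_x <= x + box_w and y <= pred_y <= y + box_h


def error_tag(sample: ClickSample, near_threshold: float = 0.05) -> str:
    """Assign a coarse tag based on geometry alone (threshold is arbitrary).

    Diagnosing attention misalignment would need the model's attention
    maps, which this sketch does not have, so misses are split by distance only.
    """
    if is_hit(sample):
        return "hit"
    if normalized_click_distance(sample) <= near_threshold:
        return "localization_error"
    return "far_miss"


if __name__ == "__main__":
    samples = [
        ClickSample((120, 110), (100, 100, 40, 20), (1000, 500)),  # inside the box
        ClickSample((150, 110), (100, 100, 40, 20), (1000, 500)),  # just outside
        ClickSample((600, 110), (100, 100, 40, 20), (1000, 500)),  # far away
    ]
    for s in samples:
        print(error_tag(s), round(normalized_click_distance(s), 3))
```

Tags like these could then be attached to dataset samples so that near misses and far misses can be filtered and reviewed separately when deciding what to fine-tune on.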
About the Speaker
Harpreet Sahota is a hacker-in-residence and machine learning engineer with a passion for deep learning and generative AI. He has a deep interest in RAG, agents, and multimodal AI.




