This Meetup is sponsored by Voxel51, the lead maintainers of the open source FiftyOne computer vision toolset. To learn more, visit the FiftyOne project page on GitHub.

Upcoming events (4+)

Network event
408 attendees from 39 groups hosting
Thu, Aug 7, 2025, 4:00 PM UTC
August 7 - Understanding Visual Agents
Join us for a virtual event to hear talks from experts on the current state of visual agents.

When

Aug 7, 2025 at 9 AM Pacific

Where

Virtual. Register for the Zoom.

Foundational capabilities and models for generalist agents for computers

As we move toward a future where language agents can operate software, browse the web, and automate tasks across digital environments, a pressing challenge emerges: how do we build foundational models that can act as generalist agents for computers? In this talk, we explore the design of such agents, ones that combine vision, language, and action to understand complex interfaces and carry out user intent accurately.

We present OmniACT as a case study, a benchmark that grounds this vision by pairing natural language prompts with UI screenshots and executable scripts for both desktop and web environments. Through OmniACT, we examine the performance of today’s top language and multimodal models, highlight the limitations in current agent behavior, and discuss research directions needed to close the gap toward truly capable, general-purpose digital agents.

About the Speaker

Raghav Kapoor works in machine learning at Adobe, where he is part of the Brand Services team and contributes to cutting-edge projects in brand intelligence. His work blends research with machine learning, reflecting his deep expertise in both areas. Prior to joining Adobe, Raghav earned his Master’s degree from Carnegie Mellon University, where his research focused on multimodal machine learning and web-based agents. He also brings industry experience from his time as a strategist at Goldman Sachs India.

BEARCUBS: Evaluating Web Agents' Real-World Information-Seeking Abilities

The talk focuses on the challenges of evaluating AI agents in dynamic web settings, the design and implementation of the BEARCUBS benchmark, and insights gained from human and agent performance comparisons. In the talk, we will discuss the significant performance gap between human users and current state-of-the-art agents, highlighting areas for future improvement in AI web navigation and information retrieval capabilities.

About the Speaker

Yixiao Song is a Ph.D. candidate in Computer Science at the University of Massachusetts Amherst. Her research focuses on enhancing the evaluation of natural language processing systems, particularly in assessing factuality and reliability in AI-generated content. Her work encompasses the development of tools and benchmarks such as VeriScore, an automatic metric for evaluating the factuality of long-form text generation, and BEARCUBS, a benchmark for assessing AI agents' ability to identify factual information from web content.

Visual Agents: What it takes to build an agent that can navigate GUIs like humans

We’ll examine conceptual frameworks, potential applications, and future directions of technologies that can “see” and “act” with increasing independence. The discussion will touch on both current limitations and promising horizons in this evolving field.

About the Speaker

Harpreet Sahota is a hacker-in-residence and machine learning engineer with a passion for deep learning and generative AI. He’s got a deep interest in RAG, Agents, and Multimodal AI.

Implementing a Practical Vision-Based Android AI Agent

In this talk I will share with you practical details of designing and implementing Android AI agents, using deki.

From theory we will move to practice and the usage of these agents in industry/production.

For end users, this means remote use of Android phones or automation of standard tasks, such as:

"Write my friend 'some_name' in WhatsApp that I'll be 15 minutes late"

"Open Twitter in the browser and write a post about 'something'"

"Read my latest notifications and say if there are any important ones"

"Write a LinkedIn post about 'something'"

For professionals, these agents enable agentic testing, a new type of test that only became possible with the popularization of LLMs and the AI agents that use them as a reasoning core.

About the Speaker

Rasul Osmanbayli is a senior Android developer at Kapital Bank in Baku, Azerbaijan, the largest private bank in the country. He created deki, an image description model that served as the foundation for an Android AI agent that achieved high results on two benchmarks: Android World and Android Control.

He previously worked in Istanbul, Türkiye, for various companies as an Android and backend developer. He is also an MS student at Istanbul Aydin University in Istanbul, Türkiye.
Network event
227 attendees from 44 groups hosting
Fri, Aug 15, 2025, 4:00 PM UTC
Aug 15 - Visual Agent Workshop Part 1: Navigating the GUI Agent Landscape
Welcome to the three-part Visual Agents Workshop virtual series, your hands-on opportunity to learn about visual agents: how they work, how to develop them, and how to fine-tune them.

Date and Time

Aug 15, 2025 at 9 AM Pacific

Register for the Zoom

Part 1: Navigating the GUI Agent Landscape

Understanding the Foundation Before Building

The GUI agent field is evolving rapidly, but success requires an understanding of what came before. In this opening session, we'll map the terrain of GUI agent research—from the early days of MiniWoB's simplified environments to today's complex, multimodal systems tackling real-world applications. You'll discover why standard vision models fail catastrophically on GUI tasks, explore the annotation bottlenecks that make GUI datasets so expensive to create, and understand the platform fragmentation that makes "click a button" mean twenty different things across datasets.
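
To make the fragmentation point concrete, here is a minimal Python sketch of mapping differently shaped action records onto one shared schema. The record layouts, field names, and action aliases are hypothetical and chosen only for illustration; they are not the actual formats of any dataset named in this session.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Action:
    """A canonical GUI action with coordinates normalized to [0, 1]."""
    kind: str                                   # "click" or "type"
    point: Optional[Tuple[float, float]] = None
    text: Optional[str] = None


# Hypothetical aliases: different sources may name the same action differently.
CLICK_ALIASES = {"click", "tap", "press", "select", "left_click"}
TYPE_ALIASES = {"type", "input_text"}


def normalize(record: dict, width: int, height: int) -> Action:
    """Convert one raw action record (hypothetical layout) to the canonical form."""
    name = record["action"].lower()
    if name in CLICK_ALIASES:
        if "bbox" in record:
            # Pixel box given as (left, top, right, bottom): click its center
            left, top, right, bottom = record["bbox"]
            x, y = (left + right) / 2, (top + bottom) / 2
        else:
            # Pixel point given as (x, y)
            x, y = record["point"]
        return Action("click", point=(x / width, y / height))
    if name in TYPE_ALIASES:
        return Action("type", text=record["text"])
    raise ValueError(f"unmapped action: {name}")


# Two records that look different but describe the same click
a = normalize({"action": "TAP", "point": (540, 1200)}, width=1080, height=2400)
b = normalize({"action": "click", "bbox": (500, 1150, 580, 1250)}, width=1080, height=2400)
print(a == b)
```

Normalizing coordinates to the screen size is one possible design choice; it lets records captured at different resolutions be compared directly.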

We'll dissect the most influential datasets (Mind2Web, AITW, Rico) and models that have shaped the field, examining their strengths, limitations, and the research gaps they reveal. By the end, you'll have a clear picture of where GUI agents excel, where they struggle, and, most importantly, where the opportunities lie for your own contributions.

About the Instructor

Harpreet Sahota is a hacker-in-residence and machine learning engineer with a passion for deep learning and generative AI. He’s got a deep interest in RAG, Agents, and Multimodal AI.
Network event
170 attendees from 44 groups hosting
Fri, Aug 22, 2025, 4:00 PM UTC
Aug 22 - Visual Agent Workshop Part 2: From Pixels to Predictions
Welcome to the three-part Visual Agents Workshop virtual series, your hands-on opportunity to learn about visual agents: how they work, how to develop them, and how to fine-tune them.

Date and Time

Aug 22, 2025 at 9 AM Pacific

Register for the Zoom

Part 2: From Pixels to Predictions - Building Your GUI Dataset

Hands-On Dataset Creation and Curation with FiftyOne

The best GUI models are only as good as their training data, and the best datasets are built by understanding what makes GUI interactions fundamentally different from natural images. In this practical session, you'll build a complete GUI dataset from scratch, learning to capture the precise annotations that GUI agents need.

Using FiftyOne as your data management backbone, you'll import diverse GUI screenshots, explore annotation strategies that go beyond bounding boxes, and implement efficient labeling workflows. We'll tackle the real challenges: handling platform differences, managing annotation quality, and creating datasets that transfer to new domains. You'll also learn advanced techniques like synthetic data generation and automated prelabeling to scale your annotation efforts.
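
As a small illustration of one annotation step, the Python sketch below converts a UI element's pixel corners into the relative `[top-left-x, top-left-y, width, height]` form, with values in [0, 1], that FiftyOne uses for detection bounding boxes. The helper name and the example button are assumptions made for illustration and are not taken from the workshop materials.

```python
def to_relative_box(left, top, right, bottom, img_width, img_height):
    """Convert pixel corners to [x, y, width, height] as fractions of the image size."""
    if not (0 <= left < right <= img_width and 0 <= top < bottom <= img_height):
        raise ValueError("box must lie inside the image and have positive area")
    return [
        left / img_width,
        top / img_height,
        (right - left) / img_width,
        (bottom - top) / img_height,
    ]


# Hypothetical "Submit" button on a 1920x1080 screenshot
box = to_relative_box(960, 540, 1200, 594, 1920, 1080)
print(box)
```

With FiftyOne installed, a list in this form could be passed as the `bounding_box` argument of `fo.Detection`; that step is left out here so the sketch runs without the library.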

Walk away with a production-ready dataset and the skills to build more—because in GUI agents, data quality determines everything.

By the end, you'll have both a dataset and the methodology to build the next generation of GUI training data.

About the Instructor

Harpreet Sahota is a hacker-in-residence and machine learning engineer with a passion for deep learning and generative AI. He’s got a deep interest in RAG, Agents, and Multimodal AI.
Network event
183 attendees from 44 groups hosting
Thu, Aug 28, 2025, 5:00 PM UTC
Aug 28 - AI, ML and Computer Vision Meetup
Date and Time

Aug 28, 2025 at 10 AM Pacific

Location

Virtual - Register for the Zoom

Exploiting Vulnerabilities In CV Models Through Adversarial Attacks

As AI and computer vision models are leveraged more broadly in society, we should be better prepared for adversarial attacks by bad actors. In this talk, we'll cover some of the common methods for performing adversarial attacks on CV models. Adversarial attacks are deliberate attempts to deceive neural networks into generating incorrect predictions by making subtle alterations to the input data.
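
The abstract does not say which methods the talk covers. As one well-known example of a "subtle alteration", the sketch below applies the fast gradient sign method (FGSM) to a toy logistic-regression classifier in plain Python. The weights, input, and epsilon are made up for illustration; a real attack would target a neural network and would normally clip the result to the valid pixel range.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Probability of class 1 under a toy logistic-regression model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def sign(value):
    """Return -1, 0, or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def fgsm(weights, bias, x, label, epsilon):
    """Shift each input feature by epsilon in the direction that raises the loss."""
    p = predict(weights, bias, x)
    # For binary cross-entropy, the gradient with respect to x is (p - label) * w
    grad = [(p - label) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


# Made-up model and input, for illustration only
weights, bias = [2.0, -3.0, 1.0], 0.0
x, label = [0.2, 0.1, 0.1], 1

x_adv = fgsm(weights, bias, x, label, epsilon=0.1)
print("clean prediction:      ", predict(weights, bias, x))
print("adversarial prediction:", predict(weights, bias, x_adv))
```

Each feature moves by at most epsilon, yet the predicted probability crosses the 0.5 decision threshold for this toy input.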

About the Speaker

Elisa Chen is a data scientist at Meta on the Ads AI Infra team with 5+ years of experience in the industry.

EffiDec3D: An Optimized Decoder for High-Performance and Efficient 3D Medical Image Segmentation

Recent 3D deep networks such as SwinUNETR, SwinUNETRv2, and 3D UX-Net have shown promising performance by leveraging self-attention and large-kernel convolutions to capture the volumetric context. However, their substantial computational requirements limit their use in real-time and resource-constrained environments.

In this paper, we propose EffiDec3D, an optimized 3D decoder that employs a channel reduction strategy across all decoder stages and removes the high-resolution layers when their contribution to segmentation quality is minimal. Our optimized EffiDec3D decoder achieves a 96.4% reduction in #Params and a 93.0% reduction in #FLOPs compared to the decoder of the original 3D UX-Net. Our extensive experiments on 12 different medical imaging tasks confirm that EffiDec3D not only significantly reduces computational demands but also maintains a performance level comparable to the original models, thus establishing a new standard for efficient 3D medical image segmentation.
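
To give a rough sense of why channel reduction shrinks a decoder, the Python sketch below counts the parameters of a chain of dense 3D convolutions before and after narrowing every stage. The channel widths and layer layout are made up; this is not the EffiDec3D or 3D UX-Net architecture, and its output is not meant to reproduce the figures reported above.

```python
def conv3d_params(c_in, c_out, kernel=3, bias=True):
    """Learnable parameters in one dense 3D convolution layer."""
    return kernel ** 3 * c_in * c_out + (c_out if bias else 0)


def decoder_params(channels):
    """Total parameters for a chain of 3D convolutions over consecutive channel widths."""
    return sum(conv3d_params(a, b) for a, b in zip(channels, channels[1:]))


# Made-up decoder widths, for illustration only
wide = [384, 192, 96, 48]
narrow = [48, 48, 48, 48]   # every stage reduced to one small width

reduction = 1 - decoder_params(narrow) / decoder_params(wide)
print(f"parameter reduction: {reduction:.1%}")
```

Because a convolution's parameter count scales with the product of its input and output channels, narrowing both sides of each layer cuts parameters roughly quadratically.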

About the Speaker

Md Mostafijur Rahman is a final-year Ph.D. candidate in Electrical and Computer Engineering at The University of Texas at Austin, advised by Dr. Radu Marculescu, where he builds efficient AI methods for biomedical imaging tasks such as segmentation, synthesis, and diagnosis. By uniting efficient architectures with data-efficient training, his work delivers robust and efficient clinically deployable imaging solutions.

What Makes a Good AV Dataset? Lessons from the Front Lines of Sensor Calibration and Projection

Getting autonomous vehicle data ready for real use, whether for training, simulation, or evaluation, isn’t just about collecting LIDAR and camera frames. It’s about making sure every point lands where it should, in the right frame, at the right time.

In this talk, we’ll break down what it actually takes to go from raw logs to a clean, usable AV dataset. We’ll walk through the practical process of validating transformations, aligning coordinate systems, checking intrinsics and extrinsics, and making sure your projected points actually show up on camera images. Along the way, we’ll share a checklist of common failure points and hard-won debugging tips.
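
The projection check described above can be sketched in a few lines of plain Python: apply the extrinsics to move a LIDAR point into the camera frame, then apply pinhole intrinsics to get a pixel. The axis conventions, calibration values, and function name below are assumptions for illustration, and lens distortion is ignored.

```python
def project_point(point_lidar, rotation, translation, fx, fy, cx, cy, width, height):
    """Project one 3D LIDAR point to pixel coordinates, or return None if not visible."""
    # Extrinsics: move the point from the LIDAR frame into the camera frame
    x, y, z = (
        sum(r * p for r, p in zip(row, point_lidar)) + t
        for row, t in zip(rotation, translation)
    )
    if z <= 0:  # the point is behind the camera
        return None
    # Intrinsics: pinhole projection onto the image plane
    u = fx * x / z + cx
    v = fy * y / z + cy
    if 0 <= u < width and 0 <= v < height:
        return (u, v)
    return None


# Assumed conventions: LIDAR is x-forward, y-left, z-up;
# the camera is x-right, y-down, z-forward, mounted at the same position.
ROTATION = [[0, -1, 0], [0, 0, -1], [1, 0, 0]]
TRANSLATION = [0.0, 0.0, 0.0]
CAMERA = dict(fx=1000.0, fy=1000.0, cx=960.0, cy=540.0, width=1920, height=1080)

ahead = project_point([10.0, 0.0, 0.0], ROTATION, TRANSLATION, **CAMERA)
left = project_point([10.0, 1.0, 0.0], ROTATION, TRANSLATION, **CAMERA)
behind = project_point([-5.0, 0.0, 0.0], ROTATION, TRANSLATION, **CAMERA)
print(ahead, left, behind)
```

A point straight ahead should land at the principal point, a point to the left should land left of center, and a point behind the sensor should be rejected. Simple sanity checks like these catch many swapped-axis and inverted-transform mistakes.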

Finally, we’ll show how doing this right unlocks downstream tools like Omniverse NuRec and Cosmos, enabling powerful workflows like digital reconstruction, simulation, and large-scale synthetic data generation.

About the Speaker

Daniel Gural is a seasoned Machine Learning Engineer at Voxel51 with a strong passion for empowering Data Scientists and ML Engineers to unlock the full potential of their data.