Agent Evaluation: Measuring Adaptability and Evaluating Multi-Agent Systems

Name: Agent Evaluation: Measuring Adaptability and Evaluating Multi-Agent Systems
Start: 2026-03-29T18:00:00+03:00
End: 2026-03-29T20:00:00+03:00

Hosted by Shay Palachy A.

Super Organizer

DataHack - Data Science, Machine Learning & Statistics

Details

Our next DataTalks meetup will take place online on Zoom and feature two timely talks on one of the most important questions in AI right now: how we evaluate agents.

As general-purpose agents and multi-agent systems become more capable, the need for robust, theory-grounded, and safety-aware evaluation is becoming impossible to ignore. In this session, we’ll hear from two researchers working at the forefront of AI benchmarking and evaluation, each bringing a different perspective on what it will take to assess agents well.

Time: Sunday, March 29th, 6:00 PM IST / 5:00 PM CET

Location: Zoom only

Language: English

Agenda
---------
🍕 17:50 - 18:00 — Online gathering
🔶 18:00 - 18:45 — Ready For General Agents? Let’s Test It.
🧻 18:45 - 18:55 — Short break
🔷 18:55 - 19:40 — Evaluating Agent-to Agent Communication in Multi-Agent Systems

---

Talk #1: Ready For General Agents? Let’s Test It.

Speaker: Michal, Distinguished Engineer for AI Benchmarking and Evaluation at IBM Research

Abstract: General-purpose agents are emerging, promising seamless deployment across domains. However, we still lack effective ways to measure their adaptability to diverse, unseen environments—a core requirement for true generality. In this talk, I will outline the key challenges in evaluating general agents and present a path toward a unified evaluation framework designed to guide their development. I will introduce Exgentic- exgentic.ai, a new evaluation framework that aims to systematically assess agent generality across tasks, environments, and models.

Bio: Michal is a Distinguished Engineer for AI Benchmarking and Evaluation at IBM Research. Her expertise spans Natural Language Generation and Natural Language Processing, with a focus on efficient and robust evaluation. She has published extensively in leading AI conferences, and has given tutorials on LLM evaluation and benchmarking at COLING 2024, and on Agent Evaluation at IJCAI 2025. Michal has also organized numerous workshops and shared tasks, including the 1st and 2nd Scientific Document Processing (SDP) workshops at EMNLP 2020 and COLING 2021, the User-Aware Conversational Agents workshop at IUI 2019 and 2020, and the GEM2 workshop on evaluation at ACL 2025. She earned her Ph.D. from the University of California, Irvine, in 2009.

---

Talk #2: Evaluating Agent-to-Agent Communication in Multi-Agent Systems

Speaker: Ruchira Dhar, PhD Student in Computer Science, University of Copenhagen

Abstract: As LLM-based multi-agent systems become more common, most evaluations still focus on individual agent capabilities rather than how agents communicate and interact with each other. This talk explores how current benchmarks miss key aspects of agent-to-agent communication, including coordination, bias, and safety in dialogue. It also outlines directions for building theory-grounded and safety-aware evaluation methods for multi-agent AI.

Bio: Ruchira Dhar is a PhD student in Computer Science at the University of Copenhagen. Her research focuses on the evaluation of large language models and NLP systems, with interests spanning AI safety and Responsible AI. Her work has been presented at venues including ICML, NeurIPS, ACL, EMNLP, and AIES. Prior to her PhD, she worked in industry developing NLP and generative AI applications. Her work aims to bridge technical AI research with interdisciplinary insights from cognitive science and the social sciences to support more reliable AI.

---

Note: Both talks will be given in English.

DataHack - Data Science, Machine Learning & Statistics

Agent Evaluation: Measuring Adaptability and Evaluating Multi-Agent Systems

DataHack - Data Science, Machine Learning & Statistics

Details

Related topics

You may also like