SEA: IR Evaluation for LLMs and Agents
SEA (Search Engines Amsterdam) is coming!
This session focuses on IR evaluation for LLMs and agents, featuring two speakers: Arthur Câmara from Zeta Alpha and Mouly Dewan from the University of Washington.
Location: Science Park 904, Room C3.161
Date: February 27
Zoom link: https://uva-live.zoom.us/j/65011610507
Details below:
Speaker #1: Arthur Câmara from Zeta Alpha
Title: Evaluating (and automatically improving!) Deep Research agents
Abstract: Given a user's complex information need, a multi-agent Deep Research system iteratively plans, retrieves, and synthesizes evidence across hundreds of documents to produce a high-quality answer. In our case, an orchestrator agent coordinates the process, while parallel worker agents execute tasks. Current Deep Research systems, however, often rely on hand-engineered prompts and static architectures, making improvement brittle, expensive, and time-consuming. Similarly, evaluating the quality of these agents is far from trivial, especially when no gold answer exists.
At Zeta Alpha, we are looking into improving both of these aspects and into how they may interact. Starting from our LLM-as-a-judge project RAGElo, we are exploring how the LLM-as-a-judge framework, especially using pairwise comparisons, can lead to better agents that self-play and explore different strategies, producing high-quality Deep Research systems that match or outperform those built on expert-crafted prompts.
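For readers curious what Elo-style pairwise evaluation looks like in practice, here is a minimal Python sketch of the general idea: an LLM judge compares two agents' answers to the same query head to head, and each verdict updates the agents' ratings with the standard Elo rule. This is a generic illustration under assumed names, not RAGElo's actual API; in particular, the judge function below is a hypothetical placeholder for a real LLM call.

```python
# Minimal sketch of pairwise LLM-as-a-judge evaluation with Elo updates.
# All names are illustrative; this is NOT RAGElo's API.
from itertools import combinations

K = 32  # Elo step size: how strongly a single verdict moves a rating

def expected(r_a: float, r_b: float) -> float:
    """Expected win probability of A over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, score_a: float) -> tuple[float, float]:
    """score_a: 1.0 if A wins, 0.5 for a tie, 0.0 if B wins."""
    e_a = expected(r_a, r_b)
    return r_a + K * (score_a - e_a), r_b + K * ((1.0 - score_a) - (1.0 - e_a))

def judge(query: str, answer_a: str, answer_b: str) -> float:
    """Hypothetical stand-in for an LLM judge. A real judge would prompt
    an LLM with both answers (in randomized order, to reduce position
    bias) and parse its verdict; here, a dummy length heuristic."""
    if len(answer_a) == len(answer_b):
        return 0.5
    return 1.0 if len(answer_a) > len(answer_b) else 0.0

def rank_agents(agents, queries, answers):
    """answers[agent][query] -> that agent's answer for the query."""
    ratings = {a: 1000.0 for a in agents}
    for q in queries:
        for a, b in combinations(agents, 2):
            s = judge(q, answers[a][q], answers[b][q])
            ratings[a], ratings[b] = update(ratings[a], ratings[b], s)
    return dict(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

The appeal of this setup, as the abstract notes, is that pairwise verdicts require no gold answers: relative preferences alone are enough to rank competing agent configurations.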
Bio: I'm a Senior Research Engineer working on agentic search and evaluation at Zeta Alpha, an R&D-oriented startup based in Amsterdam. Zeta Alpha develops a sovereign AI platform for enterprise customers, allowing large companies to deploy agents that work on their private data, backed by a high-quality search backend. Before joining Zeta Alpha, I obtained my PhD from TU Delft, working on Search-as-Learning and Neural IR.
Speaker #2: Mouly Dewan from the University of Washington
Title: LLM-Driven Usefulness Judgment for Web Search Evaluation
Abstract: Evaluation is fundamental to optimizing search experiences and supporting diverse user intents in Information Retrieval (IR). Traditional search evaluation methods rely primarily on relevance labels, which assess how well retrieved documents match a user's query. However, relevance alone fails to capture a search system's effectiveness in helping users achieve their search goals, making usefulness a critical evaluation criterion. Recent LLM-enabled evaluation work has mostly focused on relevance label generation. In this paper, we explore an alternative approach: LLM-generated usefulness labels, which incorporate both implicit and explicit user behavior signals, along with relevance, to evaluate document usefulness. We introduce Task-aware Rubric-based Usefulness Evaluation (TRUE), a reproducible rubric-driven evaluation framework that leverages iterative sampling and Chain-of-Thought reasoning to model complex search behavior patterns. Our comprehensive study shows that: (i) pre-trained LLMs can generate moderately accurate usefulness labels when provided with rich session-level context; and (ii) LLMs with the TRUE framework outperform existing state-of-the-art methods as well as our systematically constructed baseline. We further examine whether LLMs can distinguish between relevance and usefulness, particularly in cases where the two diverge. Additionally, we conduct an ablation study to identify the key features needed for accurate usefulness label generation, enabling cost-effective evaluation. Overall, this work advances LLM-based evaluation beyond relevance by proposing a reproducible and scalable framework for usefulness judgment, addressing key reproducibility challenges and supporting large-scale LLM-based evaluation.
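As a rough illustration of the rubric-driven, Chain-of-Thought labelling with repeated sampling that the abstract describes, here is a small Python sketch. The rubric text, the 0-3 scale, the prompt wording, and the majority-vote aggregation are assumptions chosen for illustration; they are not taken from the TRUE paper.

```python
# Illustrative sketch of rubric-based usefulness labelling with
# chain-of-thought reasoning and repeated sampling. The rubric, scale,
# and prompt below are hypothetical, not the actual TRUE framework.
from collections import Counter

RUBRIC = """Rate how USEFUL the document was for the searcher's task:
0 = not useful, 1 = marginally useful, 2 = useful, 3 = very useful.
Consider relevance to the query AND behavioural evidence from the
session, such as clicks and dwell time."""

def build_prompt(query: str, doc: str, session_context: str) -> str:
    return (
        f"{RUBRIC}\n\n"
        f"Query: {query}\n"
        f"Document: {doc}\n"
        f"Session context (implicit/explicit signals): {session_context}\n\n"
        "Reason step by step, then finish with a line 'Label: <0-3>'."
    )

def parse_label(completion: str) -> int:
    """Extract the final 'Label: <n>' line from the model's reasoning."""
    for line in reversed(completion.splitlines()):
        if line.strip().lower().startswith("label:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no label found in completion")

def usefulness_label(llm, query, doc, session_context, n_samples=5):
    """Sample the judge several times and majority-vote the labels,
    which smooths variance across chain-of-thought runs. `llm` is any
    callable mapping a prompt string to a completion string."""
    prompt = build_prompt(query, doc, session_context)
    votes = [parse_label(llm(prompt)) for _ in range(n_samples)]
    return Counter(votes).most_common(1)[0][0]
```

Feeding session-level behavior signals into the prompt is the key move here: it is what lets the judge estimate usefulness rather than plain topical relevance.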
Bio: I am Mouly, a third-year PhD student in the Information School at the University of Washington, Seattle. My research lies at the intersection of Interactive Information Retrieval (IIR) and Conversational AI, with an emphasis on developing and applying novel evaluation methodologies, particularly leveraging LLMs (LLM4Eval), to improve search, recommendation, and proactive conversational information-seeking systems.
Counter: SEA Talks #299 and #300.
