About us
Information drives the planet. We organize talks around implementations of information retrieval: in search engines, in recommender systems, and in conversational assistants. Our meetups are usually held on the last Friday of the month at Science Park Amsterdam. We usually have two talks in a row, one industrial and one academic, 25+5 minutes each: no marketing, just algorithms, followed by drinks. We also host ad hoc "single shot" events whenever an interesting visitor stops by to share their work.
Search Engines Amsterdam is supported by the ELLIS unit Amsterdam.
Follow @irlab_amsterdam on Twitter for the latest updates.
Upcoming events
SEA: IR Evaluation for LLMs and Agents
Science Park 904, Amsterdam, NL
SEA (Search Engines Amsterdam) is coming 🎉!
This session focuses on IR evaluation for LLMs and agents, featuring two speakers: Arthur Câmara from Zeta Alpha and Mouly Dewan from the University of Washington.
Location: Science Park 904, Room C3.161
Date: February 27
Zoom link: https://uva-live.zoom.us/j/65011610507
Details below:
Speaker #1: Arthur Câmara from Zeta Alpha
Title: TBD
Abstract: TBD
Bio: TBD
Speaker #2: Mouly Dewan from the University of Washington
Title: LLM-Driven Usefulness Judgment for Web Search Evaluation
Abstract: Evaluation is fundamental to optimizing search experiences and supporting diverse user intents in Information Retrieval (IR). Traditional search evaluation methods rely primarily on relevance labels, which assess how well retrieved documents match a user's query. However, relevance alone fails to capture a search system's effectiveness in helping users achieve their search goals, making usefulness a critical evaluation criterion. Recent LLM-enabled evaluation work has mostly focused on relevance label generation. In this paper, we explore an alternative approach: LLM-generated usefulness labels, which incorporate both implicit and explicit user behavior signals along with relevance to evaluate document usefulness. We introduce Task-aware Rubric-based Usefulness Evaluation (TRUE), a reproducible rubric-driven evaluation framework that leverages iterative sampling and Chain-of-Thought reasoning to model complex search behavior patterns. Our comprehensive study shows that: (i) pre-trained LLMs can generate moderate usefulness labels when provided with rich session-level context; and (ii) LLMs with the TRUE framework outperform existing state-of-the-art methods as well as our systematically constructed baseline. We further examine whether LLMs can distinguish between relevance and usefulness, particularly in cases where the two diverge. Additionally, we conduct an ablation study to identify the key features for accurate usefulness label generation, enabling cost-effective evaluation. Overall, this work advances LLM-based evaluation beyond relevance by proposing a reproducible and scalable framework for usefulness judgment, addressing key reproducibility challenges and supporting large-scale LLM-based evaluation.
Bio: I am Mouly, a third-year PhD student in the Information School at the University of Washington, Seattle. My research lies at the intersection of Interactive Information Retrieval (IIR) and Conversational AI, with an emphasis on developing and applying novel evaluation methodologies, particularly leveraging LLMs (LLM4Eval), to improve search, recommendation, and proactive conversational information-seeking systems.
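For a taste of the topic, here is a minimal illustrative sketch of what rubric-based usefulness labelling with an LLM can look like: a rubric plus session-level behavior signals go into a chain-of-thought prompt, the model is sampled several times, and the labels are majority-voted. This is not the TRUE implementation from the talk; the rubric text, the 0-3 scale, and the call_llm stub are assumptions made to keep the example self-contained.

import re
from collections import Counter

# Hypothetical rubric; the actual TRUE rubric is task-aware and more detailed.
RUBRIC = """Rate how USEFUL the document was for the searcher's task, 0-3:
0 = not useful, 1 = marginally useful, 2 = useful, 3 = highly useful.
Consider relevance to the query AND session behavior signals (clicks, dwell time)."""

def call_llm(prompt: str) -> str:
    """Stub: replace with a call to your LLM provider of choice."""
    return "Reasoning: the user dwelled 90s and then ended the session.\nLabel: 3"

def label_usefulness(query: str, doc: str, signals: str, n_samples: int = 5) -> int:
    """Sample the LLM several times with chain-of-thought and majority-vote the label."""
    prompt = (
        f"{RUBRIC}\n\nQuery: {query}\nDocument: {doc}\n"
        f"Session signals: {signals}\n"
        "Think step by step, then end with 'Label: <0-3>'."
    )
    votes = []
    for _ in range(n_samples):
        reply = call_llm(prompt)
        match = re.search(r"Label:\s*([0-3])", reply)
        if match:
            votes.append(int(match.group(1)))
    return Counter(votes).most_common(1)[0][0]  # most frequent label wins

print(label_usefulness("best hiking boots", "Review of ten hiking boots...",
                       "clicked, dwell=90s, last result in session"))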
Counter: SEA Talks #299 and #300.
Past events: 133