About us
Information drives the planet. We organize talks around implementations of information retrieval, in search engines, in recommender systems, or in conversational assistants. Our meetups are usually held on the last Friday of the month, at Science Park Amsterdam. Usually, we have two talks in a row, one industrial, the other academic, 25+5 minutes each, no marketing, just algorithms, followed by drinks. We also host ad hoc "single shot" events whenever an interesting visitor stops by to share their work.
Search Engines Amsterdam is supported by the ELLIS unit Amsterdam.
Follow @irlab_amsterdam on Twitter for the latest updates.
Upcoming events
SEA: Search Engines Amsterdam
Science Park 904, Amsterdam, NL
This session focuses on sparse retrieval and ranking in modern retrieval, featuring two speakers: Simon Lupart from the University of Amsterdam and Chuan Meng from the University of Edinburgh.
Location: Science Park 904, Room C3.161
Date: Friday, April 24
Time: 16:00-17:00
Zoom link: https://uva-live.zoom.us/j/65011610507

Details below:
Speaker #1: Simon Lupart from University of Amsterdam
Title: On the Challenges and Opportunities of Learned Sparse Retrieval for Code
Abstract: Retrieval over large codebases is a key component of modern LLM-based software engineering systems. Existing approaches predominantly rely on dense embedding models, while learned sparse retrieval (LSR) remains largely unexplored for code. However, applying sparse retrieval to code is challenging due to subword fragmentation, semantic gaps between natural-language queries and code, diversity of programming languages and sub-tasks, and the length of code documents, which can harm sparsity and latency. We introduce SPLADE-Code, the first large-scale family of learned sparse retrieval models specialized for code retrieval (600M-8B parameters). Despite a lightweight one-stage training pipeline, SPLADE-Code achieves state-of-the-art performance among retrievers under 1B parameters (75.4 on MTEB Code) and competitive results at larger scales (79.0 with 8B). We show that learned expansion tokens are critical to bridge lexical and semantic matching, and provide a latency analysis showing that LSR enables sub-millisecond retrieval on a 1M-passage collection with little effectiveness loss.
Bio: Simon is a third-year PhD student at the University of Amsterdam, working on Information Retrieval and LLMs. He graduated from a computer science engineering school in France, did an Erasmus exchange at Imperial College London during his master's, and worked in industry for two years at Naver Labs Europe. He will present his latest work from an internship at Naver Labs Europe.
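For readers unfamiliar with learned sparse retrieval, the scoring model behind the abstract above can be sketched in a few lines. This is a toy illustration, not SPLADE-Code itself: documents and queries are sparse bags of weighted vocabulary terms (including learned "expansion" terms that bridge lexical and semantic matching), scoring is a dot product, and an inverted index makes it fast. All data and names here are made up for illustration.

```python
# Toy sketch of learned-sparse-retrieval scoring (hypothetical data, not SPLADE-Code).
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the (doc_id, weight) pairs that contain it."""
    index = defaultdict(list)
    for doc_id, term_weights in docs.items():
        for term, w in term_weights.items():
            index[term].append((doc_id, w))
    return index

def search(index, query, k=2):
    """Score documents by the sparse dot product with the query; return top-k."""
    scores = defaultdict(float)
    for term, qw in query.items():
        for doc_id, dw in index.get(term, []):
            scores[doc_id] += qw * dw
    return sorted(scores.items(), key=lambda x: -x[1])[:k]

# Toy corpus: code files represented by (expanded) term weights.
# "order" acts as a learned expansion term for sort.py.
docs = {
    "sort.py":  {"sort": 1.2, "list": 0.8, "order": 0.5},
    "parse.py": {"parse": 1.1, "json": 0.9, "decode": 0.4},
}
index = build_inverted_index(docs)
print(search(index, {"sort": 1.0, "order": 0.3}))
```

Because only non-zero terms are touched, retrieval latency scales with the number of matching postings rather than the corpus size, which is what makes the sub-millisecond figures in the abstract plausible at scale.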
Speaker #2: Chuan Meng from University of Edinburgh
Title: Revisiting Text Ranking in Deep Research
Abstract: Deep research has emerged as an important task that aims to address hard queries through extensive open-web exploration. To tackle it, most prior work equips large language model (LLM)-based agents with opaque web search APIs, enabling agents to iteratively issue search queries, retrieve external evidence, and reason over it. Despite search's essential role in deep research, black-box web search APIs hinder systematic analysis of search components, leaving the behaviour of established text ranking methods in deep research largely unexplored. In this talk, we revisit key findings and best practices for text ranking methods in the deep research setting. In particular, we examine their effectiveness from three perspectives: (i) retrieval units (documents vs. passages), (ii) pipeline configurations (different retrievers, re-rankers, and re-ranking depths), and (iii) query characteristics (the mismatch between agent-issued queries and the training queries of text rankers). We perform experiments on BrowseComp-Plus, a recent deep research dataset with a fixed document corpus, evaluating a broad spectrum of text ranking methods across diverse setups. We find that agent-issued queries typically follow web-search-style syntax (e.g., quoted exact matches), favouring lexical, learned sparse, and multi-vector retrievers. Passage-level units are more efficient under limited context windows, and avoid the difficulties of document length normalisation in lexical retrieval. Re-ranking consistently improves performance, with deeper re-ranking depths amplifying gains. Translating agent-issued queries into natural-language questions significantly bridges the query mismatch and improves effectiveness.
Related paper: https://arxiv.org/pdf/2602.21456
Bio: Chuan Meng is a postdoc at the University of Edinburgh, working with Dr. Jeff Dalton. He received his PhD from the University of Amsterdam in June 2025, supervised by Prof. Maarten de Rijke and Dr. Mohammad Aliannejadi. He was formerly an Applied Scientist Intern at Amazon. His research focuses on agentic information retrieval (IR) and deep research. He has published 25+ papers in top-tier venues such as SIGIR, ACL, EMNLP, NAACL, CIKM, AAAI, ECIR, and TOIS, with 650+ citations on Google Scholar and an h-index of 16. He has co-organised tutorials and workshops at SIGIR 2025, WSDM 2025, ECIR 2024–2026, and SIGIR-AP 2024.
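The retrieve-then-rerank pipeline examined in the talk above can be sketched as follows. This is a deliberately simplified stand-in, not the systems evaluated in the paper: the first stage uses plain term overlap in place of a real retriever, and the second stage uses a hand-written phrase-match bonus in place of a neural re-ranker. Corpus, queries, and function names are all hypothetical.

```python
# Toy retrieve-then-rerank pipeline (hypothetical stand-ins for real rankers).

def retrieve(query, corpus, depth=3):
    """First stage: rank passages by simple term overlap with the query."""
    q_terms = set(query.lower().split())
    scored = [(pid, len(q_terms & set(text.lower().split())))
              for pid, text in corpus.items()]
    scored.sort(key=lambda x: -x[1])
    return [pid for pid, s in scored[:depth] if s > 0]

def rerank(query, candidates, corpus):
    """Second stage: stand-in for a neural re-ranker; rewards exact phrase match."""
    def score(pid):
        text = corpus[pid].lower()
        phrase_bonus = 2.0 if query.lower() in text else 0.0
        overlap = len(set(query.lower().split()) & set(text.lower().split()))
        return phrase_bonus + overlap
    return sorted(candidates, key=score, reverse=True)

corpus = {
    "p1": "sparse retrieval uses an inverted index",
    "p2": "dense retrieval uses embedding models",
    "p3": "text ranking in deep research agents",
}
cands = retrieve("sparse retrieval", corpus)
print(rerank("sparse retrieval", cands, corpus))
```

The `depth` parameter corresponds to the re-ranking depth studied in the talk: a deeper first-stage cut gives the re-ranker more candidates to reorder, trading cost for the gains the abstract reports.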
Counter: SEA Talks #303 and #304.
3 attendees
Past events: 135