Here we are with the sixth edition of the Search Technology Meetup Hamburg. Come and join us for pizza, drinks and really great talks. If you want to participate by giving a short lightning talk (around 15 minutes), you're welcome to do so. Just drop us a line so we can plan. Topics can be anything around search technology (Elasticsearch, Solr, Kibana, quality, how do you use it? What's your experience? ...).
We thank Codecentric AG (https://www.codecentric.de/) for sponsoring food, drinks and the location, and Shopping24 (http://developer.s24.com) for organizing. All three talks will be held in English.
1. BM25 demystified
Lucene will change the default scoring from TF/IDF to BM25 in the next major release. So unless you really enjoy surprises, you'd better learn about it now! TF/IDF was easy enough to understand intuitively, but how is it with BM25? What do all these parameters do? And what do people mean when they say it is "probabilistic"? In this talk I will tell the story of how we got from the Probability Ranking Principle to BM25 with a minimum of math and a maximum of explaining. I will also show how BM25 differs from TF/IDF and what that means in practice, and give an intuition for what the parameters of this method actually do. You will leave this talk feeling good about Lucene changing the default. And of course you will learn many fancy buzzwords to show off with during the breaks.
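For readers who want a preview of the parameters mentioned in the abstract, here is a minimal sketch of the textbook BM25 score for a single query term. It is not taken from the talk and it is not Lucene's actual code; the function names are made up for illustration, and the defaults k1=1.2 and b=0.75 are the commonly cited ones. k1 controls how quickly repeated occurrences of a term stop adding to the score, and b controls how strongly long documents are penalised.

```python
import math


def bm25_idf(num_docs, doc_freq):
    """Inverse document frequency in the always-positive form ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one query term to one document (illustrative only)."""
    # Length normalisation: 1.0 for an average-length document,
    # larger for longer documents, switched off entirely when b == 0.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # The term-frequency part saturates: it approaches (k1 + 1) as tf grows.
    tf_part = (tf * (k1 + 1.0)) / (tf + k1 * length_norm)
    return bm25_idf(num_docs, doc_freq) * tf_part


if __name__ == "__main__":
    # Hypothetical corpus statistics, chosen only to show the saturation effect.
    for tf in (1, 2, 5, 20, 100):
        score = bm25_term_score(tf, doc_len=100, avg_doc_len=100,
                                num_docs=10_000, doc_freq=50)
        print(tf, round(score, 3))
```

Unlike a term-frequency weight that keeps growing with every extra occurrence, the BM25 score here levels off below idf * (k1 + 1) no matter how often the term appears.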
Britta Weber was lured away from a career in academic research on image processing by all that’s awesome in Elasticsearch. She joined the company in May 2013 as a software engineer in its Berlin office. When not writing code and thinking about machine learning, she enjoys singing renaissance madrigals.
Britta Weber, Software Engineer - elastic.co @a2tirb (https://twitter.com/a2tirb)
2. Building a real-time news search engine
What challenges could a search engine have? Large number of documents? Large query load? Very complex queries? A challenging privileging model? Expected low query latency? High volume of document updates? Updates to documents reflected in milliseconds? Realtime alerting for any search? Absolutely no downtime any time of the day, week or year? What if a search engine had all these challenges? Meet the backend which drives News Search at Bloomberg LP. In this talk, Ramkumar Aiyengar talks about how he and his colleagues have pushed Solr into uncharted territory over the last three years to deliver a real-time search engine critical to the workflow of hundreds of thousands of customers worldwide.
Ramkumar leads the News Search backend team at the Bloomberg R&D office in London. He joined Bloomberg from his university in India and has been with the News R&D team for eight years now. He started working with Apache Solr/Lucene three years ago and is now a committer on the project, usually curious about Solr's search distribution, architecture and cloud functionality. He considers himself a Linux evangelist, and is one of those weird geeky creatures who consider Lisp beautiful and believe that Emacs is an operating system.
Ramkumar Aiyengar, Team Lead Search Backend, Bloomberg R&D London, @andyetitmoves (https://twitter.com/andyetitmoves)
3. Serving real-time push notifications for 5 million saved searches
Ebay Kleinanzeigen is one of the most visited sites in Germany and still grows at an amazing speed. Currently, we have about 19 million ads and over 18 million unique visitors each month. One of our most popular features is saved searches: when on a search result page, users can register for push notifications in case new matching ads are posted. Introduced last summer, the feature now accounts for close to 5 million saved searches in our database, with the number steadily growing.
Matching newly posted ads with these saved searches seems like information retrieval upside down: normally, we store ads and, given a user query, return matching results. In this case, we need to store queries and, given an ad, identify the queries that match. Elasticsearch's Percolator conveniently provides the base functionality out of the box and is at the heart of our implementation. With 5 million saved searches and [masked] new ads being posted every day, this is one of the biggest production applications of the percolation feature, according to Elasticsearch support.
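To make the "upside down" idea concrete, here is a toy sketch in plain Python. It does not use Elasticsearch or its Percolator API, and it says nothing about how the speakers' system is built; the SavedSearch class, its fields and the matching_searches function are hypothetical names invented for this illustration. The point is only the direction of the lookup: the queries are stored, and each incoming ad is checked against them.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class SavedSearch:
    """A stored query (hypothetical shape): required title terms plus an optional price cap."""
    search_id: str
    required_terms: FrozenSet[str]
    max_price: Optional[float] = None


def matching_searches(ad_title: str, ad_price: float,
                      saved_searches: List[SavedSearch]) -> List[str]:
    """Return the ids of all stored searches that the new ad satisfies."""
    tokens = set(ad_title.lower().split())
    return [
        s.search_id
        for s in saved_searches
        # Every required term must appear in the ad title ...
        if s.required_terms <= tokens
        # ... and the ad must not exceed the price cap, if one was set.
        and (s.max_price is None or ad_price <= s.max_price)
    ]


if __name__ == "__main__":
    stored = [
        SavedSearch("s1", frozenset({"bike"}), max_price=200.0),
        SavedSearch("s2", frozenset({"bike", "red"})),
        SavedSearch("s3", frozenset({"sofa"})),
    ]
    print(matching_searches("Red mountain bike", 150.0, stored))  # -> ['s1', 's2']
```

This linear scan over every stored query is only meant to show the principle; it would not scale to millions of saved searches, which is where a dedicated percolation feature comes in.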
However, identifying queries that match newly incoming ads is only one part of the story. The massive throughput of data and the need for near-realtime notifications require a modular, highly scalable infrastructure, which we designed and implemented using Kafka, a state-of-the-art distributed messaging system.
In this talk, we will speak about the challenges that came with the saved searches feature and how we are dealing with them. We will look at necessary general optimisations as well as tweaks specific to our application.
André Charton, Christiana Lemke, Ebay Kleinanzeigen Berlin