Spark, Topic Models and Content Recommendation



This Friday, we will have two talks: one academic and one industrial. We will have drinks afterwards. The industry talk will be given by Anne Schuth from Blendle; Rolf Jagerman from UvA-ILPS will give the academic talk.


16:00 - 16:30 Rolf Jagerman (UvA)

16:30 - 17:00 Anne Schuth (Blendle)

17:00 - 18:00 Drinks & Snacks

Details of the talks:


Rolf Jagerman--Web-scale Topic Models in Spark: An Asynchronous Parameter Server

Spark recently emerged as a popular tool for performing large-scale data analysis and has seen a lot of success in both industry and academia. Unfortunately, Spark is bound to the map-reduce programming paradigm. Inference algorithms for topic models such as LDA are not easily implemented in such a paradigm, since they typically rely on a large mutable parameter space that is updated concurrently.

We aim to solve this problem by extending Spark with an asynchronous parameter server, which provides a distributed, concurrently accessible parameter space for the model. We implement a distributed LDA inference algorithm, based on LightLDA, in Spark and show that it is significantly faster and more scalable than the default LDA implementations provided by Spark. We are able to train a topic model with 1000 topics on a 27 TB dataset in just 80 hours using a computing cluster with 480 CPU cores and 4 TB of RAM.

In this presentation we'll explore the architectural and algorithmic details that enable topic modeling at such a large scale.
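To give a feel for the parameter-server interface the abstract describes, here is a minimal single-process sketch in Python. The class name and methods are illustrative, not the talk's actual API: a real deployment (as in the system described above) shards the parameter space over many machines and serves pull/push requests asynchronously, while this stand-in uses a lock-protected dictionary.

```python
import threading
from collections import defaultdict

class ParameterServer:
    """Toy stand-in for a distributed parameter server (hypothetical API)."""

    def __init__(self):
        # e.g. topic-word counts for LDA, keyed by (topic, word)
        self._params = defaultdict(float)
        self._lock = threading.Lock()

    def pull(self, keys):
        # Workers fetch only the slice of the model they need.
        with self._lock:
            return {k: self._params[k] for k in keys}

    def push(self, deltas):
        # Workers send additive updates; because addition commutes,
        # applying them asynchronously and out of order is safe.
        with self._lock:
            for k, d in deltas.items():
                self._params[k] += d

ps = ParameterServer()
ps.push({("topic0", "spark"): 2.0})  # update from worker A
ps.push({("topic0", "spark"): 1.0})  # update from worker B
snapshot = ps.pull([("topic0", "spark")])
print(snapshot)
```

The key design point is that workers never hold the whole model: they pull a small slice, compute updates locally, and push deltas back, which is what makes a mutable parameter space workable on top of Spark's otherwise immutable data flow.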

Rolf Jagerman received an MSc in computer science from ETH Zürich and is currently a PhD candidate at ILPS, advised by Maarten de Rijke. His main research interest is solving scalability challenges in information retrieval and machine learning. He is currently working on building systems and models for contextual learning to rank.


Anne Schuth--Real-time Content Recommendation

Every morning at Blendle, we have a huge cold-start problem when over 6,000 new articles from the latest newspapers arrive in our system. These articles have been read by virtually no one yet when we are tasked with sending out personalized newsletters to many of our users. We can thus not rely on collaborative-filtering-style recommendations, nor can we use the popularity of the articles as a clue for what our users might want to read. We overcome our cold-start problem with a mix of curation by our editorial team and an automated analysis of the content of these articles. We extract named entities, semantic links, authors, the language, and plenty of stylometric features.
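The content-based fallback described above can be sketched in a few lines. This is not Blendle's pipeline; it is a hedged toy example in which bag-of-words term frequencies stand in for the richer features (entities, semantic links, stylometrics) the talk mentions, and a user profile is matched against unseen articles by cosine similarity.

```python
import math
from collections import Counter

def tf_vector(text):
    # Term-frequency bag of words: a simple stand-in for real
    # content features such as named entities or stylometrics.
    words = text.lower().split()
    n = len(words)
    return {w: c / n for w, c in Counter(words).items()}

def cosine(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical user profile built from previously read articles.
profile = tf_vector("machine learning search ranking algorithms")

# Brand-new articles with zero reads: no behavioral signal exists yet.
articles = {
    "a1": "new ranking algorithms for search engines",
    "a2": "parliament debates the annual budget",
}
ranked = sorted(articles,
                key=lambda a: cosine(profile, tf_vector(articles[a])),
                reverse=True)
print(ranked)
```

Because the ranking depends only on article text, it works the moment an article arrives, which is exactly the property needed at 5am when no click data exists yet.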
Much of our setup for analyzing content is implemented in Spark, as a (mini-)batch process. And the `batch` part is (or rather, was) a problem. Our editorial team gets up at around 5am and is done reading and recommending their selection of articles around 8am, which is also the time we would ideally send out the newsletter. Starting our batch process only then would mean a prohibitively long delay. We therefore started switching to a combination of Spark and a streaming infrastructure with Kafka at its core. In this talk I will outline both our batch processing setup and our streaming setup, and how the two work together.
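The batch-versus-streaming trade-off above can be illustrated with a small producer/consumer sketch. This is a hypothetical simulation, not the Kafka setup from the talk: a thread-safe queue stands in for a Kafka topic, and the point is simply that each article is analyzed as it arrives, so no large job remains to be started at send-out time.

```python
import queue
import threading

def producer(q, articles):
    # Stands in for Kafka: articles are published as they
    # arrive overnight, one at a time.
    for a in articles:
        q.put(a)
    q.put(None)  # end-of-stream sentinel (real topics never end)

def consumer(q, analyze):
    # Streaming consumer: analyze each article immediately,
    # instead of waiting to run one big batch job at 8am.
    results = []
    while True:
        a = q.get()
        if a is None:
            break
        results.append(analyze(a))
    return results

q = queue.Queue()
t = threading.Thread(target=producer, args=(q, ["art1", "art2", "art3"]))
t.start()
out = consumer(q, lambda a: a.upper())  # toy "analysis" step
t.join()
print(out)
```

In the batch version the same `analyze` work would only begin once the full morning's input is available; streaming amortizes it over the night, which is why the delay at newsletter time disappears.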

Anne Schuth is a data scientist at Blendle, where you can read all newspapers and magazines and only pay for what you read. Anne recently obtained his PhD from the University of Amsterdam (UvA). His PhD research focused on online learning to rank: optimizing search engine algorithms based on interactions with users. Anne was previously an intern at Microsoft Research in Cambridge and at Yandex in Moscow.