
MOOCs, harvesting & recommendation

Hosted By
Dolf T.


Interested in MOOCs, web harvesting and news recommendation? Join our next meetup!


18:00 - 18:30 Pizzas, drinks, networking
18:30 - 19:00 Large-scale data analysis of Massive Open Online Courses (Claudia Hauff)
19:00 - 19:30 The Combine Web Harvester architecture (Brend Wanders)
19:30 - 20:00 Streaming Recommendation (Anne Schuth)
20:00 - 21:00 More drinks & networking

Abstracts and bios can be found below.

Large-scale data analysis of Massive Open Online Courses (MOOCs) - Claudia Hauff, TU Delft


Since their inception, Massive Open Online Courses (MOOCs) have garnered a lot of attention for their potential to drastically change and improve the education landscape. In the last few years, realism has caught up with this ideal: the percentage of MOOC learners who actually engage with and pass a MOOC is low and, more concerning, many of the learners who do earn a MOOC certificate are already highly qualified.
To alter these dynamics, we need to establish where in this large-scale online learning process things are going wrong and what we can do, automatically and at scale, to engage more learners, in particular those from underrepresented groups.
At TU Delft we offer more than 30 MOOCs on the edX platform and recently hit the landmark of 1 million MOOC enrollments. We employ large-scale data analysis and data visualization techniques to derive actionable insights from the digital traces our MOOC learners leave behind; we also design, develop and deploy novel educational technologies in our MOOCs, often grounded in theories and frameworks established within educational psychology and pedagogy.
In this talk, I will present a number of our recent findings in this domain.


Claudia Hauff is an Assistant Professor in the Web Information Systems group at Delft University of Technology. Between 2011 and 2012 she worked as a postdoc in the same group, conducting research within the ImREAL project. Claudia received her PhD from the University of Twente, where she worked in the Human Media Interaction group. The Otto-von-Guericke University of Magdeburg in Germany was her home during her undergraduate years as a student in computer science. In the past, she has worked on a variety of topics in the fields of information retrieval (IR) and data science, including query performance prediction, social search, computational social science, learning to search and IR for specific user groups (e.g. children). She is currently focusing on large-scale learning analytics and how to incorporate search into the learning process at scale.

The Combine Web Harvester architecture - Brend Wanders, University of Twente


This talk presents the Combine Web Harvester architecture. The architecture is developed as part of the SmartCOPI project (funded by the COMMIT research programme), which aims to harvest and consolidate online product information. The Combine is a component-based web harvesting architecture aimed at handling and processing online information by using probabilistic data to represent alternatives. We will explain the high-level goals and design philosophy of the architecture and prototype, and show how a simple harvesting task can be modelled.


Brend Wanders received the M.Sc. degree in computer science from the University of Twente, Enschede, Netherlands, in 2011. He joined the Databases Group at the University of Twente, where he received his doctorate in 2016. He is currently working part-time as a postdoc researcher on the SmartCOPI project. Next to his academic research, he is a part of the specialist consulting firm Bloom & Wanders, together with dr. Niels Bloom.
His research interests include probabilistic datalog, probabilistic databases, data integration and quality management, and unlocking semantic web technologies for a broad public with a focus on semantic wikis.

Streaming Recommendation - Anne Schuth, Blendle


Every morning at Blendle, we have a huge cold-start problem when over 6,000 new articles from the latest newspapers arrive in our system. Virtually no one has read these articles yet when we are tasked with sending out personalized newsletters to many of our users. We therefore cannot rely on collaborative-filtering-style recommendations, nor can we use the popularity of the articles as a clue to what our users might want to read. We overcome our cold-start problem with a mix of curation by our editorial team and an automated analysis of the content of these articles. We extract named entities, semantic links, authors, the language and plenty of stylometrics. Much of our setup to analyze content is implemented in Spark, as a (mini) batch process. And the `batch` part is (or better, was) a problem. Our editorial team gets up at around 5am and is done reading and recommending their selection of articles around 8am, which is also the time we would ideally send out the newsletter. Starting our batch process only then would mean a prohibitively long delay. We therefore started switching to a combination of Spark and a streaming infrastructure with Kafka at its core. In this talk I will outline both our batch processing setup and our streaming setup and how these work together.
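
To give a feel for the kind of content-based ranking the abstract alludes to, here is a minimal sketch in plain Python. It is not Blendle's actual pipeline: the feature strings, the function names and the `editor_boost` weight are all hypothetical, and cosine similarity is just one simple choice for comparing a reader's history with unread articles.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse feature-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_profile(read_articles):
    """Sum the content features of the articles a user has read before."""
    profile = Counter()
    for features in read_articles:
        profile.update(features)
    return profile


def rank_new_articles(profile, new_articles, editor_picks=frozenset(),
                      editor_boost=0.2):
    """Rank unread articles by content similarity plus an editorial boost.

    No click or popularity data is used, so this works on articles that
    nobody has read yet. `editor_boost` is an assumed, hypothetical weight.
    """
    scored = []
    for article_id, features in new_articles.items():
        score = cosine(profile, Counter(features))
        if article_id in editor_picks:
            score += editor_boost
        scored.append((score, article_id))
    # Highest score first; ties broken by article id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [article_id for _, article_id in scored]


# Hypothetical reading history and morning batch of new articles.
history = [["entity:ajax", "author:jansen"], ["entity:ajax", "lang:nl"]]
fresh = {
    "a1": ["entity:ajax", "lang:nl"],
    "a2": ["entity:ecb", "lang:en"],
    "a3": ["entity:ecb", "lang:nl"],
}
profile = build_profile(history)
ranking = rank_new_articles(profile, fresh)
```

In a streaming setup, the same scoring function could be applied to each article as it arrives instead of to the whole morning batch at once; how that is wired up with Spark and Kafka is the subject of the talk.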


Anne Schuth is a data scientist at Blendle, where you can read all newspapers and magazines and only pay for what you read. Anne recently obtained his PhD from the University of Amsterdam (UvA). His PhD research focused on online learning to rank: optimizing search engine algorithms based on interactions with users. Anne was previously an intern at Microsoft Research in Cambridge and at Yandex in Moscow.

Werkhorst 36 · Meppel