Search in Historical Newspapers and Sentence Embeddings


As usual, we will have one academic and one industry talk, followed by drinks. The academic talk will be given by Tom Kenter from ILPS-UVA and the industry talk will be given by Theo van Veen from KBResearch.

This edition of SEA will be held in SPUI25.


16:00 - 16:30 Tom Kenter

16:30 - 17:00 Theo van Veen

17:00 - 18:00 Drinks & Snacks

Details of the talks:


Tom Kenter---Siamese CBOW: Optimizing Word Embeddings for Sentence Representations

Word embeddings (vector representations of words in a high-dimensional space) have proven to be beneficial in a broad range of tasks in natural language processing, such as machine translation, parsing, semantic search, and tracking the meaning of words and concepts over time. It is not evident, however, how word embeddings should be combined to represent larger pieces of text, like sentences, paragraphs or documents. Surprisingly, simply averaging word embeddings of all words in a text has proven to be a strong baseline. We, therefore, present Siamese Continuous Bag of Words (Siamese CBOW), a neural network that efficiently estimates high-quality sentence embeddings, by directly optimizing word embeddings for the task of being averaged to form sentence representations.
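To make the averaging baseline mentioned in the abstract concrete, here is a minimal sketch in Python. It is not Siamese CBOW itself, which trains the word embeddings; it only shows how a sentence vector can be formed by averaging word vectors and compared with cosine similarity. The tiny three-dimensional vectors and the names `TOY_EMBEDDINGS`, `sentence_embedding` and `cosine` are invented for illustration.

```python
import math

# Toy 3-dimensional word vectors, invented purely for illustration.
# Real embeddings would be learned from data and have far more dimensions.
TOY_EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "sleeps": [0.1, 0.9, 0.0],
    "naps": [0.2, 0.8, 0.0],
    "stocks": [0.0, 0.1, 0.9],
    "fell": [0.0, 0.2, 0.8],
}


def sentence_embedding(sentence, embeddings):
    """Average the vectors of all known words in the sentence."""
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        raise ValueError("no known words in sentence")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    s1 = sentence_embedding("cat sleeps", TOY_EMBEDDINGS)
    s2 = sentence_embedding("dog naps", TOY_EMBEDDINGS)
    s3 = sentence_embedding("stocks fell", TOY_EMBEDDINGS)
    print("related:  ", cosine(s1, s2))
    print("unrelated:", cosine(s1, s3))
```

With these toy vectors, the two sentences about sleeping animals end up closer to each other than either is to the sentence about stocks, which is the behaviour the averaging baseline relies on.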

Tom Kenter (1975) is a PhD candidate at the University of Amsterdam (ILPS). His main areas of interest are short text semantics and Natural Language Understanding. Tom has extensive experience working outside academia, from his first job as a computational linguist at Q-go, to his most recent internship at Google Research. You can read all about his research and the holy grail he pursues in the New Scientist (in Dutch) and on his homepage.


Theo van Veen---Using Wikidata Properties to Improve Search in Dutch Historical Newspapers

In the research environment of the Koninklijke Bibliotheek we are improving access to the collection of Dutch historical newspapers by linking named entities occurring in the newspaper articles to the corresponding DBpedia and Wikidata descriptions. The linking and disambiguation of named entities are continuously improved by applying machine learning techniques. Indexing the Wikidata identifiers for named entities together with the newspaper articles opens up new possibilities for retrieving articles that mention these resources, such as searching the newspaper collection using semantic relations from Wikidata. In this talk I describe the steps we have taken so far in setting up a combination of entity linking, machine learning and crowdsourcing, and I talk about our plans for improving the quality of the links, using user feedback and extending the semantic search capabilities.
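As a rough illustration of searching via semantic relations, the Python sketch below uses a small in-memory stand-in for an index of articles and their linked entities. It is an assumption-laden toy, not the KB's actual system: the identifiers are placeholders rather than real Wikidata IDs, and the data structures and the function names `entities_matching` and `articles_mentioning` are hypothetical.

```python
# Hypothetical, hand-made data: the identifiers are placeholders, not real
# Wikidata IDs, and the property names are simplified labels.
ENTITY_STATEMENTS = {
    "Q_PAINTER_A": {"place_of_birth": "Q_CITY_X", "occupation": "Q_PAINTER"},
    "Q_PAINTER_B": {"place_of_birth": "Q_CITY_Y", "occupation": "Q_PAINTER"},
    "Q_MAYOR_C": {"place_of_birth": "Q_CITY_X", "occupation": "Q_POLITICIAN"},
}

# Which entity identifiers were linked to which (made-up) article.
ARTICLE_ENTITIES = {
    "article-1": {"Q_PAINTER_A"},
    "article-2": {"Q_PAINTER_B", "Q_MAYOR_C"},
    "article-3": {"Q_CITY_Y"},
}


def entities_matching(statements, **constraints):
    """Return entities whose statements satisfy every property=value constraint."""
    return {
        entity
        for entity, props in statements.items()
        if all(props.get(prop) == value for prop, value in constraints.items())
    }


def articles_mentioning(article_entities, entity_ids):
    """Return the sorted ids of articles linked to at least one given entity."""
    return sorted(
        article for article, linked in article_entities.items() if linked & entity_ids
    )


if __name__ == "__main__":
    # "Find articles mentioning a painter born in city X."
    wanted = entities_matching(
        ENTITY_STATEMENTS, place_of_birth="Q_CITY_X", occupation="Q_PAINTER"
    )
    print(articles_mentioning(ARTICLE_ENTITIES, wanted))
```

The point of the sketch is that the query never names the painter: the relation lookup selects the entities first, and the indexed identifiers then lead to the articles.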

Theo van Veen has been a member of the research and development department of the Koninklijke Bibliotheek, National Library of the Netherlands, since 1998. He received his degree in physics from the Technical University Delft in 1979 and started working in library automation at the University Library in Utrecht in 1988. He has been involved in several projects related to the European Library and Europeana, both hosted at the Koninklijke Bibliotheek. His research interest is currently focused on machine learning, text enrichment and service integration.