Topic Models & Multilingual Capabilities: The 10th NLP Dublin Meetup

Hosted By
Demian and Sebastian R.

Details

We are excited to announce our 10th NLP Dublin meetup! We are happy to be hosted by Zalando (https://www.zalando.com/) for the third time at their office at Grand Canal Quay. There will be two talks, one from academia and one from industry: the first on topic modeling and the second on annotation projection across languages. Pizza and beer will be provided.

Mark Belford, PhD at Insight@UCD

Challenges in Topic Modeling

Applying topic modeling algorithms to text data raises a number of challenges. One major issue, which is rarely considered, is that of "instability": an algorithm can generate different models when applied to the same data. We show the inherent instability of popular topic modeling approaches, using a number of new measures to assess stability. To address this issue in the context of matrix factorization, we also propose a new ensemble learning strategy.

Another major challenge is the appropriate evaluation of topic modeling results. This is particularly difficult for online topic modeling algorithms, which are designed to analyze streams of text data. When developing a new method for this task, we need to be able to quantify its ability to identify an appropriate number of topics that are also semantically coherent. To support this, we propose a semi-synthetic dataset generator, which can introduce concept drift and concept shift into existing annotated non-temporal datasets via user-controlled parameterization. This allows for the creation of multiple different artificial streams of data, where the "correct" number and composition of the topics are available for evaluation purposes.

---

Pascal Pompey, Zalando

Annotation Projections: Towards Multilingual Language Capabilities

Labelled data is the raw material required to build any machine-learning solution. Outside the academic world and its carefully curated standard datasets, getting reliable labelled data is often the number one priority of applied machine-learning teams. In natural language processing tasks, the labelled data problem is magnified by the number of languages in which text is produced: data labelled in one language is all but useless for developing models in another language.

We present a projection methodology that has come out of a collaboration between the NLP teams of Zalando Research and Zalando Dublin. The projection methodology enables transferring labels from one language to another, hence making it possible to scale NLP solutions across multiple languages.

The contributions are: (1) motivating the problem of label projection across languages, (2) presenting a methodology to achieve it, and (3) reporting a qualitative analysis of the method along with its strengths and weaknesses.

Natural Language Processing Dublin
Zalando Ltd.
3 Grand Canal Quay, Dublin, Ireland