DSPT#69 - Can Somebody make sense of this Noisy Text? (Porto)


“Noisy Text? Am I reading this right?” you are probably thinking... Yep.
Text can be one of the noisiest kinds of data, and we will learn about it at the next meetup. Daniel Loureiro, a PhD candidate at the University of Porto, will start by taking us on a journey through neural language models and their benefits in several areas, including the medical domain. Then, Diego Esteves from Farfetch will help us make sense of noisy text data and show how to beat the state of the art (SOTA).
Ready to see how loud your text data is?

=== SCHEDULE ===

The preliminary agenda for the meetup is the following:

• 18:30-19:00: Welcome and get together
• 19:00-19:30: Language Modelling Makes Sense by Daniel Loureiro, INESC TEC
• 19:40-19:45: Group photo
• 19:45-20:15: Networking / Coffee Break
• 20:15-20:45: HORUS-NER: A Multimodal Named Entity Recognition Framework for Noisy Text by Diego Esteves, Farfetch/SDA Research
• 20:50: Closing
• 21:00: Dinner (optional, but an excellent opportunity for networking; register here: http://bit.ly/dinner_dspt69)

This meetup is sponsored by Talkdesk (https://www.talkdesk.com/) and COCUS (https://www.cocus.pt). Thank you for your support!

See you there!

Talk 1
Title: Language Modelling Makes Sense

Abstract: The latest Neural Language Models (NLMs), based on Transformer architectures, have quickly become the most influential factor in the progress of several Natural Language Processing (NLP) tasks, from Machine Translation to Question Answering. However, since this is a relatively new advancement in the field, there are still many open questions around which properties of NLMs are responsible for these improvements. In this talk, we explore the representational ability of NLMs, and how they enable unprecedented gains for Word Sense Disambiguation. On a related note, we also explore how NLMs can be used for Entity Linking in the Medical Domain, with a solution that effectively combines the representational abilities of NLMs with more traditional approximate string matching methods.

Short bio: Daniel Loureiro is a PhD candidate in Computer Science at the University of Porto. He has been working in Natural Language Processing (NLP) for nearly a decade, in both academia and startups, including founding and exiting a startup (PepFeed). His main interests are word-level semantics and common-sense reasoning, and he has published and reviewed at top-tier NLP conferences.

Talk 2
Title: HORUS-NER: A Multimodal Named Entity Recognition Framework for Noisy Text

Abstract: Recent work based on Deep Learning presents state-of-the-art (SOTA) performance in the named entity recognition (NER) task. However, the performance of such models is still drastically reduced on noisy text (e.g., microblogs) when compared to newswire datasets. Thus, designing and exploring new methods and architectures is highly necessary to overcome current challenges. In this talk, we shift the focus of existing solutions to an entirely different perspective. We investigate the potential of embedding word-level global features extracted from images and news. We performed a comprehensive study to validate the hypothesis that images and news queried from the Web boost the task on noisy data, revealing very interesting findings. When our proposed features are used: (1) we beat SOTA in precision using simple CRFs; (2) the overall performance of decision-tree-based models can be drastically improved; and (3) this approach outperforms off-the-shelf Named Entity Recognizers on microblogs.

Short bio: D. Esteves is a computer scientist with 15+ years of combined experience in industry and academia, currently a Principal Data Scientist at Farfetch and a Research Associate at SDA Research. Before moving to Germany to obtain his Ph.D. (#fact-checking), he worked on national and international IT projects for 10+ years in large companies such as Accenture, B2W Inc., Wilson Sons, and BTG Pactual Investment Bank.