Tim Baldwin on robust NLP and Antonio Jimeno Yepes on biomedical text mining

Melbourne Natural Language Processing (NLP) Meetup
Public group

Culture Amp HQ

Level 10, 31 Queen Street · Melbourne

How to find us

After 6pm, the main doors to the building will be locked. You'll need to enter via the side door, which you can find by walking along the southern side of the building (up the ramp) and turning right. Then proceed to level 10 via the lifts.

For our inaugural meetup, we'll have two talks on applications of NLP: one from the academic side, and one from industry. Talks will run from 6:30 until about 8:00, with the remainder of the evening available for mingling and networking. Food and drinks for this meetup will be generously provided by Culture Amp.

The first speaker will be Tim Baldwin, a leading NLP researcher and Professor in the School of Computing and Information Systems at the University of Melbourne and Associate Dean (Research Training) in the
Melbourne School of Engineering. The title of Tim's talk is "Robust, Unbiased Natural Language Processing".

The second speaker will be Antonio Jimeno Yepes, a specialist in biomedical NLP and technical team lead at IBM Research, who will talk about text mining over the biomedical literature. Antonio's talk is titled "A hybrid approach for automated mutation annotation of the extended human mutation landscape in scientific literature".

Abstract for Tim Baldwin's talk "Robust, Unbiased Natural Language Processing":

Natural Language Processing systems are notoriously brittle to linguistic noise and shifts in domain, and are also generally biased by the composition of their training data. I will detail a number of approaches for improving the
robustness of NLP systems through a range of techniques, including: (1) data augmentation through the generation of linguistically-motivated training data perturbations, using lexical semantic and syntactic methods; (2) joint learning of a structured model with domain-specific and domain-general components, coupled with adversarial training for domain; and (3) explicitly learning representations that obscure author characteristics at training time. Over tasks including sentiment analysis, language identification, and POS tagging, I will show that the resulting models are more robust out-of-domain, and in the case of the final method, are less demographically biased.
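As a rough illustration of the first technique, generating perturbed training examples by lexical substitution can be sketched as follows. The synonym table here is a hypothetical stand-in for the lexical-semantic resources (e.g. WordNet) that the actual method draws on, and real perturbations would also cover the syntactic side:

```python
# Minimal sketch of data augmentation via lexical perturbation.
# SYNONYMS is a toy stand-in for a real lexical-semantic resource.
SYNONYMS = {
    "movie": ["film", "picture"],
    "great": ["excellent", "superb"],
    "bad": ["poor", "terrible"],
}

def perturb(sentence):
    """Generate training variants by swapping each word for its synonyms."""
    tokens = sentence.split()
    variants = []
    for i, tok in enumerate(tokens):
        for syn in SYNONYMS.get(tok.lower(), []):
            variants.append(" ".join(tokens[:i] + [syn] + tokens[i + 1:]))
    return variants

print(perturb("a great movie"))
# → ['a excellent movie', 'a superb movie', 'a great film', 'a great picture']
```

Each variant keeps the original's label (e.g. its sentiment), so a model trained on the augmented set sees the same meaning expressed in more surface forms.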

Abstract for Antonio Jimeno Yepes's talk "A hybrid approach for automated mutation annotation of the extended human mutation landscape in scientific literature":

As the cost of DNA sequencing continues to fall, an increasing amount of information on human genetic variation is being produced that could help advance precision medicine. However, information about such mutations typically first appears in the scientific literature, and is only later manually curated into more standardized genomic databases. This curation process is expensive and time-consuming, and many variants end up only partially curated, or not at all.

Detecting mutations in the literature is the first key step towards automating this process.
However, most current methods have focused on identifying mutations that follow existing nomenclatures. We show that a large number of mutations are missed by this standard approach. Furthermore, we implement the first mutation annotator to cover an extended mutation landscape, and we show that its F1 performance matches that of human annotation (F[masked] for manual annotation vs F[masked] for automatic annotation).
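To make the contrast concrete, here is a sketch of the nomenclature-based matching the abstract refers to: a pattern for protein point mutations written in the standard one-letter code (e.g. "V600E"). This is not the authors' hybrid annotator, only an illustration of the standard approach it goes beyond:

```python
import re

# Amino-acid one-letter codes; matches substitutions like "V600E" or "G12D".
POINT_MUTATION = re.compile(
    r"\b([ACDEFGHIKLMNPQRSTVWY])(\d+)([ACDEFGHIKLMNPQRSTVWY])\b"
)

def find_mutations(text):
    """Return mutation mentions that follow the standard nomenclature."""
    return [m.group(0) for m in POINT_MUTATION.finditer(text)]

print(find_mutations("The BRAF V600E mutation and KRAS G12D were observed."))
# → ['V600E', 'G12D']
```

A mention such as "the valine-to-glutamate change at position 600" follows no nomenclature and slips past patterns like this, which is exactly the extended mutation landscape the talk addresses.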