NLP @ Google Zurich


Details
Join the 19th NLP Meetup at Google Zurich. Aliaksei Severyn and Eric Muxagata will present their work and are looking forward to meeting the NLP Zurich community. See you soon!
IMPORTANT: This event is FULL. Only registered members will be able to attend the event.
Agenda:
17:30 Registration
18:00 Welcome
18:05 Aliaksei Severyn: Leveraging Pre-trained Checkpoints for Sequence Generation and Text Edit Tasks
18:35 Eric Muxagata: Understanding categorical semantic compatibility in KG
Leveraging Pre-trained Checkpoints for Sequence Generation and Text Edit Tasks by Aliaksei Severyn
Unsupervised pre-training of large neural models has recently revolutionized Natural Language Processing. Warm-starting from publicly released checkpoints, NLP practitioners have pushed the state of the art on multiple benchmarks while saving significant amounts of compute time. So far, the focus has been mainly on Natural Language Understanding tasks. We present an extensive empirical study on the utility of initializing large Transformer-based sequence-to-sequence models with the publicly available pre-trained BERT and GPT-2 checkpoints for sequence generation. We ran over 300 experiments, spending thousands of TPU hours, to find the recipe that works best, and demonstrate that it sets new state-of-the-art results on Machine Translation, Summarization, Sentence Splitting and Sentence Fusion.
Having established a strong set of baselines for sequence-to-sequence models, we present LaserTagger -- a sequence tagging approach that casts text generation as a text editing task. Target texts are reconstructed from the inputs using three main edit operations: keeping a token, deleting it, and adding a phrase before the token. To predict the edit operations, we propose a novel model that combines a BERT encoder with an autoregressive Transformer decoder. This approach is evaluated on four tasks: sentence fusion, sentence splitting, abstractive summarization, and grammar correction. LaserTagger achieves new state-of-the-art results on three of these tasks, performs comparably to a set of strong seq2seq baselines when many training examples are available, and outperforms them when the number of examples is limited. Furthermore, we show that at inference time tagging can be more than two orders of magnitude faster than comparable seq2seq models, making it more attractive for running in a live environment.
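To make the tagging scheme concrete, here is a minimal Python sketch of how a tagged source sentence is turned back into a target sentence. The tag encoding (KEEP/DELETE plus an optional phrase inserted before the token) and the fusion example are illustrative assumptions, not the released LaserTagger implementation.

# Minimal sketch of applying LaserTagger-style edit operations to a source
# sentence. The tag format below is a simplified illustration, not the
# authors' released code.

from typing import List, Optional, Tuple

# Each tag is (base_op, phrase_to_add_before_token); base_op is "KEEP" or "DELETE".
Tag = Tuple[str, Optional[str]]

def apply_tags(tokens: List[str], tags: List[Tag]) -> str:
    """Reconstruct the target text from source tokens and predicted edit tags."""
    out: List[str] = []
    for token, (op, phrase) in zip(tokens, tags):
        if phrase is not None:   # additions are placed before the current token
            out.append(phrase)
        if op == "KEEP":         # DELETE simply drops the token
            out.append(token)
    return " ".join(out)

# Sentence fusion example: merge two sentences into one.
tokens = "Dylan won the prize . He was absent .".split()
tags = [
    ("KEEP", None), ("KEEP", None), ("KEEP", None), ("KEEP", None),
    ("DELETE", None),     # drop the period between the sentences
    ("DELETE", "but"),    # replace "He" with the connective "but"
    ("KEEP", None), ("KEEP", None), ("KEEP", None),
]
print(apply_tags(tokens, tags))  # -> "Dylan won the prize but was absent ."

Because the output is assembled from the input tokens plus a small vocabulary of added phrases, the model only has to predict one tag per token rather than generate every output token, which is where the inference speedup comes from.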
Understanding categorical semantic compatibility in KG by Eric Muxagata
The Knowledge Graph (KG) is Google's main repository of factual knowledge, organized by entities and their relationships. To grow this database at scale, vertical teams across the company focus on structured data extraction, and the extracted data is loaded into the graph at an ever-expanding rate. This environment requires a system that can identify and cluster identical entities across sources (source-to-source identity clustering), which is the main problem Refcon is challenged with solving. A fundamental aspect of identity is categorical similarity. In this talk, we propose a method to learn a semantic embedding space for KG types from the existing correlations in the graph, by means of a Deep Sets autoencoding architecture (source paper: https://arxiv.org/abs/1703.06114). This space can then be used directly for comparing entity type sets, providing a useful signal for the downstream identity classification task. In sum, the Deep Sets Autoencoder is a neural network trained to predict all element types from a given variable-length type set -- a self-supervised multi-label classification task with an extremely large output space (~20K classes, the total number of unique KG types in the schema).
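As a rough illustration of this idea, the following PyTorch sketch encodes a variable-length set of type ids with a permutation-invariant sum-pooling encoder (the Deep Sets recipe of Zaheer et al., linked above) and decodes multi-label logits over the full type vocabulary. All sizes, layer shapes and the training objective shown here are assumptions for illustration, not Refcon's production model.

# Sketch of a Deep Sets-style autoencoder over entity type sets.
import torch
import torch.nn as nn

NUM_TYPES = 20_000   # ~20K unique KG types in the schema
EMB_DIM = 256        # illustrative embedding size

class DeepSetsAutoencoder(nn.Module):
    def __init__(self, num_types: int = NUM_TYPES, emb_dim: int = EMB_DIM):
        super().__init__()
        self.type_emb = nn.Embedding(num_types, emb_dim)
        # phi: applied to each set element independently
        self.phi = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.ReLU())
        # rho: applied to the pooled (summed) representation
        self.rho = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.ReLU())
        # decoder: multi-label logits over the full type vocabulary
        self.decoder = nn.Linear(emb_dim, num_types)

    def encode(self, type_ids: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # type_ids: (batch, max_set_len) padded type indices
        # mask:     (batch, max_set_len) 1.0 for real elements, 0.0 for padding
        x = self.phi(self.type_emb(type_ids))         # per-element transform
        pooled = (x * mask.unsqueeze(-1)).sum(dim=1)  # permutation-invariant sum
        return self.rho(pooled)                       # set embedding

    def forward(self, type_ids, mask):
        return self.decoder(self.encode(type_ids, mask))

model = DeepSetsAutoencoder()
loss_fn = nn.BCEWithLogitsLoss()  # multi-label reconstruction objective

# Toy batch: one entity carrying 3 types, padded to length 4.
type_ids = torch.tensor([[5, 17, 42, 0]])
mask = torch.tensor([[1.0, 1.0, 1.0, 0.0]])
targets = torch.zeros(1, NUM_TYPES)
targets[0, [5, 17, 42]] = 1.0   # predict back every type in the input set

loss = loss_fn(model(type_ids, mask), targets)

The sum over per-element transforms is what makes the set embedding order-invariant, which matters when comparing type sets of candidate-identical entities coming from different sources.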
Sponsors
Google
