Paris NLP Season 4 Meetup #3
Details
Seating is on a first-come, first-served basis whether you have RSVPed or not, so we suggest arriving early. We can host 70 people.
The room can accommodate 70 people. Registration is mandatory but does not guarantee entry, so we recommend arriving a little early.
----------
• Thomas Belhalfaoui, Lead Data Scientist @ JobTeaser
Siamese CNN for job-candidate matching: learning document embeddings with triplet loss.
Summary:
At JobTeaser, we are the official career center of more than 500 schools and universities throughout Europe, where we can multi-post companies' job offers.
Our mission: help students and recent graduates find their dream job. Among other tools we develop, we recommend job offers of interest to our users.
For this purpose, we build a Siamese Convolutional Neural Network that takes job offer and student resume texts as inputs and yields job and resume embeddings in a shared Euclidean space. Recommendation then simply amounts to finding the nearest neighbors.
We train the network with a triplet loss on historical application feedback.
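To make the setup concrete, here is a minimal sketch of a Siamese text CNN trained with a triplet loss, written in PyTorch. The encoder architecture, hyperparameters, and data handling are illustrative assumptions, not the actual JobTeaser model.

```python
# Minimal sketch: shared text-CNN encoder + triplet loss (PyTorch).
# All sizes and names below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNNEncoder(nn.Module):
    """Encodes a tokenized document (word-id sequence) into a fixed-size embedding."""
    def __init__(self, vocab_size=30000, embed_dim=128, out_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.conv = nn.Conv1d(embed_dim, 128, kernel_size=3, padding=1)
        self.proj = nn.Linear(128, out_dim)

    def forward(self, token_ids):                   # token_ids: (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)   # (batch, embed_dim, seq_len)
        x = F.relu(self.conv(x))                    # (batch, 128, seq_len)
        x = x.max(dim=2).values                     # global max pooling over time
        return F.normalize(self.proj(x), dim=1)     # unit-norm embedding in the shared space

# One shared encoder embeds both job offers and resumes -> "Siamese".
encoder = TextCNNEncoder()
triplet_loss = nn.TripletMarginLoss(margin=0.2)

# One training step on a toy batch: anchor = resume, positive = job with positive
# application feedback, negative = job without it (feedback data is assumed).
resume  = torch.randint(1, 30000, (8, 200))
pos_job = torch.randint(1, 30000, (8, 200))
neg_job = torch.randint(1, 30000, (8, 200))
loss = triplet_loss(encoder(resume), encoder(pos_job), encoder(neg_job))
loss.backward()

# At serving time, recommendation reduces to nearest-neighbor search in the shared space.
with torch.no_grad():
    jobs = encoder(pos_job)            # precomputed job embeddings
    query = encoder(resume[:1])        # one candidate's resume
    scores = query @ jobs.T            # cosine similarity (embeddings are unit-norm)
    best_jobs = scores.topk(3, dim=1).indices
```

Because the embeddings are unit-normalized, inner products double as cosine similarities, so off-the-shelf nearest-neighbor indexes can serve the recommendations.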
----------
• Djamé Seddah, Associate Professor in CS @ Inria
Sesame Street-based naming schemes must fade out, long live CamemBERT et le French fromage!
Summary:
As cliché as it sounds, pretrained language models are now ubiquitous in Natural Language Processing, the most prominent one arguably being BERT (Devlin et al., 2018). Many works have shown that BERT-based models can capture meaningful syntactic information using nothing but raw text for training (e.g. Jawahar et al., 2019), and this ability is probably one of the reasons for their success.
However, until very recently, most available models had been trained either on English data or on a concatenation of data in multiple languages. In this talk, we present the results of a work investigating the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition, and natural language inference. We show that using web-crawled data is preferable to using Wikipedia data. More surprisingly, we show that a relatively small web-crawled dataset (a few gigabytes) leads to results as good as those obtained with datasets two orders of magnitude larger. Our best-performing model, CamemBERT, reaches or improves the state of the art on all four downstream tasks. A short usage sketch of the released model appears after the credits below.
Presented by Djamé Seddah, joint work with Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, and Benoît Sagot.
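For readers who want to try the model, here is a minimal sketch of querying the publicly released CamemBERT checkpoint for masked-word prediction with the Hugging Face transformers library. The checkpoint name "camembert-base" refers to the public release on the Hugging Face hub; the example sentence and the top-5 readout are only illustrative.

```python
# Minimal sketch: fill-in-the-blank probing of CamemBERT via Hugging Face transformers.
import torch
from transformers import CamembertTokenizer, CamembertForMaskedLM

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = CamembertForMaskedLM.from_pretrained("camembert-base")
model.eval()

# The model predicts the masked French word from its context.
sentence = "Le camembert est un <mask> délicieux."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits                      # (1, seq_len, vocab_size)

# Locate the <mask> position and read off the 5 most likely replacement tokens.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_pos].topk(5, dim=-1).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))  # subword tokens, e.g. "▁fromage"
```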
----------