Named Entity Recognition for Tweets - a Hands-on Session

Grenoble Data Science

Cowork in Grenoble

16 Boulevard Maréchal Lyautey · Grenoble

How to find us

Co-Work Grenoble (tentatively)



In this Meetup we will focus on developing models for Named Entity Recognition (NER). NER is the task of classifying text segments into a predefined set of categories, such as persons, organizations and locations. For instance, given the sentence “Jim bought 300 shares of Acme Corp. in 2006”, a NER system should recognize the three entities and return “Jim” tagged as person, “Acme Corp.” tagged as organization and “2006” tagged as time.
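To make the example concrete, here is a minimal sketch of one common way to represent NER output: one tag per token in the BIO scheme (B = beginning of an entity, I = inside an entity, O = outside any entity). The tokenization, the label names (PER, ORG, TIME) and the helper `extract_entities` are assumptions chosen for illustration, not the code that will be used in the session.

```python
# Tokens and BIO tags for the example sentence (hand-labeled for illustration).
tokens = ["Jim", "bought", "300", "shares", "of", "Acme", "Corp.", "in", "2006"]
tags = ["B-PER", "O", "O", "O", "O", "B-ORG", "I-ORG", "O", "B-TIME"]


def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (text, label) pairs.

    Hypothetical helper: an I- tag that does not continue the current
    entity is treated like an O tag.
    """
    entities = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the previous one if it is still open.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # The current entity continues.
            current_tokens.append(token)
        else:
            # Outside any entity: close the open entity, if there is one.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities


entities = extract_entities(tokens, tags)
print(entities)
# [('Jim', 'PER'), ('Acme Corp.', 'ORG'), ('2006', 'TIME')]
```

With this representation, NER becomes a sequence labeling problem: the model predicts one tag per token, and the entities are read off the predicted tag sequence.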

NER systems are a critical component of many information extraction architectures, such as those used for document retrieval and question answering. While current state-of-the-art systems achieve high performance for a narrow set of entity types, particularly in grammatically well-formed text, the task becomes challenging in settings like Twitter, where text is short and informal.

Motivated by the wide use of NER systems and by the wide range of methods used to build them, we propose an interactive Data Science Meetup session on the topic. The session is structured as follows:

First, we will present the NER problem and discuss systems that have been proposed to solve it (~20 min.). Then, we will form groups to develop a NER system using machine learning methods (~90 min.). The development will be guided by experienced tutors, and the goal is to emphasize data pre-processing, feature engineering, and model selection and evaluation. By the end of this part, everybody will have a working NER system!
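As a taste of the feature engineering step mentioned above, here is a minimal sketch of per-token features that are often useful for NER on tweets. The function `token_features`, the feature names and the sample tweet are all made up for illustration; the session's notebooks may use a different feature set and different libraries.

```python
import re


def token_features(tokens, i):
    """Return a dict of simple hand-crafted features for tokens[i].

    Hypothetical feature set: word shape, Twitter-specific markers,
    and the neighboring words as context.
    """
    token = tokens[i]
    return {
        "lower": token.lower(),
        "is_capitalized": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        "is_mention": token.startswith("@"),
        "is_hashtag": token.startswith("#"),
        "is_url": re.match(r"https?://", token) is not None,
        "suffix3": token[-3:].lower(),
        # Context features, with sentinel values at the tweet boundaries.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


# An invented tweet, split on whitespace for simplicity.
tweet = "omg @jim just bought shares of Acme Corp #stocks".split()
features = [token_features(tweet, i) for i in range(len(tweet))]

for token, feats in zip(tweet, features):
    print(token, feats["is_capitalized"], feats["is_mention"], feats["is_hashtag"])
```

Feature dictionaries of this kind can be fed to a sequence model or to a standard classifier after vectorization. Which features actually help is an empirical question, answered during model selection and evaluation.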

For the practical session we have chosen Python. Before the meeting, we will distribute instructions for the required packages, along with Docker files that will make the installation easy. During the meeting, we will also distribute code samples in the form of IPython notebooks, to avoid boilerplate coding.