SEA: Beyond English - NLP & IR for Low Resource Languages


Details
In this edition of SEA we expand our focus to languages that are less well studied than English. We have two amazing speakers lined up: David Ifeoluwa Adelani from Saarland University and Abhishek Pandit from Translators without Borders.
*** IMPORTANT: You will be able to view the Zoom link once you 'attend' the meetup on this page. ***
** 17:00 - 17:30 - David Ifeoluwa Adelani, Saarland University **
Development of NLP datasets and models for African Languages
In recent years, deep learning models have been very successful for many natural language processing tasks, including machine translation, text generation, information extraction, and dialogue understanding. However, many of these models are evaluated only on English and other high-resourced languages, because those languages have large unlabelled text collections and numerous labelled datasets, both of which are absent for low-resourced African languages. These high-resourced languages number only a few dozen and are concentrated in a few regions of the world with many similarities to one another, which limits how well these models generalize to low-resourced languages.
In this talk, I will discuss some of the challenges of working on low-resourced languages, including the non-availability of training data and data quality issues, as well as the development of high-quality labelled datasets and models for African languages. First, I will introduce MENYO-20k, the first multi-domain parallel corpus for the low-resource Yoruba-English language pair, along with benchmark models that outperform pre-trained machine translation (MT) models and massively multilingual models such as Facebook's M2M-100 and Google's multilingual neural MT in the English-to-Yoruba translation direction. Second, I will present our participatory approach (with Masakhane) to addressing the under-representation of African languages in NLP research by introducing our ongoing work on creating a large, publicly available, high-quality dataset for named entity recognition (NER) in twenty African languages.
** 17:30 - 18:00 - Abhishek Pandit, Translators without Borders **
Building Restricted-Domain Chatbots in Low Resource Languages using Custom Training Data
This presentation will explore TWB's people-centric approach to using the Rasa framework for developing and augmenting training data for restricted-domain chatbots in low-resource languages.
The challenges of low-resource languages for NLP applications are widespread, and particularly obstructive when building general conversational interfaces. However, innovative approaches may become available when the conversation is constrained to specific domains with a limited vocabulary and an expected set of utterances. TWB explores precisely this use case. We develop multilingual chatbots with partner non-profits who focus on two-way information flows with end users around a restricted subset of operations, primarily information about COVID-19 health and immigration for displaced populations. This constraint limits the number of intents in text classification tasks, so less training data is required than for general conversational AI.
Even with fewer intents, training data must be artificially constructed to match best guesses about how users will phrase questions and utterances. Once actual messages from users arrive and are added to the training data, we find wide disparities in grammar and orthography. This creates a vicious cycle in which imprecise classification by the bot leads to a negative user experience, which reduces the likelihood of future use and of building up sufficient data to improve the bot's accuracy. We therefore include a discussion of our staff's annotation strategy, which aims to course-correct the bot as rapidly as possible.
Finally, we compare our work against other potential areas of exploration, such as paraphrasing and semantic similarity. We invite your suggestions and look forward to a lively, interactive and enriching discussion.
