Mixed-script Information Retrieval


Details
Doors and networking at 6:00 PM; talk followed by a Q&A from 7:00 to 8:30 PM.
Venue: 44 Tehama St, San Francisco, CA 94105
Classroom 311
Speaker: Parth Gupta (bio: http://users.dsic.upv.es/~pgupta/), researcher at the Natural Language Engineering Lab (http://users.dsic.upv.es/grupos/nle/?file=kop1.php) at the Technical University of Valencia, Spain
Title: Mixed-script Information Retrieval
Abstract: For many languages that use non-Roman indigenous scripts (e.g., Arabic, Greek, and Indic languages), one can often find a large amount of user-generated transliterated content on the Web in the Roman script. Such content creates a monolingual or multilingual space with more than one script, which is referred to as the Mixed-Script space. IR in the mixed-script space is challenging because queries written in either the native or the Roman script need to be matched to documents written in both scripts. Moreover, transliterated content features extensive spelling variation. Through an analysis of Bing search engine query logs, the talk will introduce Mixed-Script IR and discuss its prevalence. It will then explain a principled deep-learning solution to the term modeling challenge, in which Mixed-Script terms are modeled jointly through a deep autoencoder. The talk will close by discussing the impact of Mixed-Script IR on popular NLP applications such as sentiment analysis, recommendations, machine translation, and cross-language text analysis in user-generated content.
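
To make the Mixed-Script space concrete before the talk, here is a minimal Python sketch that labels each query term with a coarse script name and flags queries that mix scripts. It is not the speaker's method (the talk covers a deep-autoencoder model). The function names and the heuristic of reading the script from the first word of the Unicode character name are assumptions made for illustration only.

```python
import unicodedata
from collections import Counter
from typing import List, Optional, Tuple


def char_script(ch: str) -> Optional[str]:
    """Return a coarse script label for one letter, or None for non-letters.

    Heuristic (an assumption of this sketch): Unicode character names begin
    with the script, e.g. "DEVANAGARI LETTER KA" or "LATIN SMALL LETTER A".
    """
    if not ch.isalpha():
        return None  # digits, punctuation, and combining marks are skipped
    name = unicodedata.name(ch, "")
    return name.split(" ", 1)[0] if name else None


def token_script(token: str) -> str:
    """Label a token with the most frequent script among its letters."""
    counts = Counter(s for s in map(char_script, token) if s)
    if not counts:
        return "OTHER"
    return counts.most_common(1)[0][0]


def query_scripts(query: str) -> List[Tuple[str, str]]:
    """Split a query on whitespace and label each term with its script."""
    return [(tok, token_script(tok)) for tok in query.split()]


def is_mixed_script(query: str) -> bool:
    """True when the query's terms span more than one script."""
    scripts = {s for _, s in query_scripts(query) if s != "OTHER"}
    return len(scripts) > 1


if __name__ == "__main__":
    # A Hindi phrase written once in Roman script and once in Devanagari.
    for term, script in query_scripts("tere bina तेरे बिना"):
        print(term, script)
    print(is_mixed_script("tere bina तेरे बिना"))
```

Script detection only identifies which script a term is written in. Matching "tere" to "तेरे", and coping with spelling variants of the Roman form, is the harder term modeling problem the talk addresses.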
