December 17, 2014 · 6:30 PM
Multilingual search requires the developer to address challenges that don’t exist in the monolingual case. In Solr, a robust multilingual search engine needs a separate analysis chain for each language, because each language has its own logic for tokenization, lemmatization, stemming, synonyms, and stop words. Short queries make the problem harder: a query string is typically only a handful of words, so identifying its language is difficult and sometimes ambiguous even to a human (“pie” is a valid query in both English and Spanish). We’ll explore the breadth of Solr schema and configuration options available to a multilingual search application developer, and how to use them to balance functionality, performance, and complexity. We’ll also dive deep into specific experiments with a practical application.
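As a rough illustration of the ambiguity problem, here is a minimal Python sketch of one possible approach: when the query language is uncertain, search one field per plausible language and weight each by the detector's confidence. This is not necessarily an approach covered in the talk. The field names `text_en` and `text_es`, the `build_query_params` helper, and the confidence scores are hypothetical; only `defType=edismax` and its `qf` parameter are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Hypothetical per-language fields; each is assumed to be bound in the Solr
# schema to a field type with its own language-specific analysis chain.
LANGUAGE_FIELDS = {"en": "text_en", "es": "text_es"}


def build_query_params(query, language_scores, min_score=0.2):
    """Build edismax parameters that search one field per plausible language.

    language_scores maps a language code to a detector confidence in [0, 1].
    Languages below min_score, or without a configured field, are skipped.
    """
    weighted_fields = [
        f"{LANGUAGE_FIELDS[lang]}^{score:.1f}"
        for lang, score in sorted(language_scores.items())
        if lang in LANGUAGE_FIELDS and score >= min_score
    ]
    if not weighted_fields:
        # No confident guess: fall back to every configured language field.
        weighted_fields = sorted(LANGUAGE_FIELDS.values())
    return {"q": query, "defType": "edismax", "qf": " ".join(weighted_fields)}


# "pie" is equally plausible as English or Spanish, so both fields are searched.
params = build_query_params("pie", {"en": 0.5, "es": 0.5})
print(urlencode(params))
```

Searching several language fields costs extra query time and can surface matches from the wrong analysis chain, which is the kind of functionality, performance, and complexity trade-off the talk sets out to examine.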
Speaker Bio: David Troiano
David Troiano is a Principal Software Engineer at Basis Technology, where he develops the services and applications that consume Basis’s core natural language processing products. Over the past five years, he has worked on content search, discovery, and recommendation systems built on Lucene/Solr, with an eye toward scalability and performance. He also has professional experience with machine learning and predictive analytics tools in the quantitative finance industry. David holds a bachelor’s degree in Computer Science from Harvard College.