Lessons learnt building Domain Specific NLP Pipelines

Details
At Indix (acquired by Avalara), the goal was to build the "Google of Products". It was an ambitious goal that involved crawling the web to gather product information from 5,000+ brand and retailer websites, classifying the products into a taxonomy of 5,000+ nodes, and extracting relevant product attributes to match the same product across retailers. This structured data was then exposed through a search API that served customer use cases requiring product information. The product catalog currently has 3+ billion products. The team also built an e-commerce knowledge graph with 100 million nodes and about a billion edges to solve problems like Query Intent Recognition and Query Understanding for Product Search.
Naturally, a robust NLP pipeline was needed to solve these problems by making sense of the unstructured text data at this scale.
The first part of the talk will cover the evolution of the architecture, building blocks and algorithms of the NLP Pipeline.
The building blocks Rajesh will cover include Language Models, Word Embeddings and Knowledge Graph.
The algorithms he'll cover include classification, entity extraction, document similarity and query understanding (for the e-commerce domain).
After the acquisition by Avalara, the team was tasked with making sense of unstructured text data in the Tax Compliance domain with limited data.
The second part of the talk will focus on how Rajesh's team is fine-tuning the e-commerce NLP Pipeline and using Transfer Learning techniques from the e-commerce domain to solve problems with language understanding in the tax compliance domain.
Speaker Details:
Rajesh Muppalla is a Senior Director of Engineering at Avalara, leading the teams working on product classification and tax content sourcing automation. Earlier this year, he joined Avalara through the acquisition of his previous company Indix, which he co-founded in 2012. At the time of acquisition, Rajesh was leading the machine learning (ML) and data platform teams. Prior to Indix, he was at Thoughtworks as a tech lead on Go-CD, an open source Continuous Delivery (CD) tool, where he was fortunate to work with some of the pioneers in the area of CD.
Why should you attend this talk?
- The meetup will focus on the lessons learnt building domain specific NLP pipelines using recent techniques, especially deep learning approaches.
- The talk will not only cover the techniques that worked, but also shed some light on the approaches that did not.
