There is a significant knowledge gap in the machine learning industry between research and bringing applications to production. Going from a Jupyter notebook to serving live traffic is not trivial for more complex applications. In this talk, we introduce an application designed to answer questions given in plain text. As is typical of many research systems, it initially consists of multiple independent Python scripts, tools, and models. We’ll implement a production-ready application on Vespa that can scale to virtually any desired level.
During the talk, we’ll give a high-level overview of how such a retrieval-based question-answering system works. This includes classic information retrieval (BM25), modern retrieval using approximate nearest neighbor (ANN) search, and natural language understanding models based on Transformers, such as BERT. We’ll introduce Vespa, the open-source big data serving engine, and show how all of this can be implemented and scaled using techniques such as distillation and quantization.
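To build some intuition for the classic lexical-retrieval side mentioned above, here is a minimal Okapi BM25 sketch in plain Python. This is purely illustrative: it uses naive whitespace tokenization and the common default parameters k1 = 1.2 and b = 0.75, whereas in practice Vespa provides BM25 as a built-in rank feature.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with Okapi BM25.

    Illustrative only: whitespace tokenization, no stemming or stopwords.
    """
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for d in tokenized:
        for term in set(d):
            df[term] += 1

    scores = []
    for d in tokenized:
        tf = Counter(d)  # term frequencies within this document
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed inverse document frequency.
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            f = tf[term]
            # Term-frequency saturation (k1) and length normalization (b).
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

docs = [
    "vespa serves big data",
    "bm25 ranks documents",
    "vespa scales serving",
]
print(bm25_scores("vespa serving", docs))
```

The document matching the most query terms receives the highest score; documents sharing no terms with the query score zero, which is exactly the limitation that the dense ANN retrieval covered in the talk addresses.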
After this talk, you will have learned what it takes to build a real-time serving system that puts the latest AI models into production.
Agenda:
18:00 - 19:00 Background: information retrieval, approximate nearest neighbors, representation learning, transformers.
19:15 - 20:00 Implementation: introduction to Vespa, inference in production vs training, distillation, quantization. Demonstration.