Vespa – open source big data serving engine & Lambda Architecture in Practice

SF Big Analytics


We are excited to host two talks. The first speaker is traveling from Europe to the US, and we are lucky to have him speak at our meetup.

6:00 pm -- 6:30 pm Light dinner + networking
6:35 pm -- 7:20 pm Talk 1 + Q&A
7:25 pm -- 8:10 pm Talk 2 + Q&A
8:10 pm -- 8:30 pm Networking
8:30 pm -- 8:45 pm Closing

Talk 1 : Introduction to Vespa – the open source big data serving engine (Yahoo)

Offline and stream processing of big data sets can be done with tools such as Hadoop, Spark, and Storm, but what if you need to process big data at the time a user is making a request? This talk introduces Vespa – the open source big data serving engine. Vespa allows you to search, organize and evaluate machine-learned models from e.g. TensorFlow over large, evolving data sets with latencies in the tens of milliseconds. Vespa is behind the recommendation, ad targeting and search at Yahoo, where it handles billions of daily queries over billions of documents, and was recently open sourced at

Speaker: Jon Bratseth

Jon Bratseth is a distinguished architect at Oath (formerly Yahoo), and the architect of and one of the main contributors to Vespa, the open source big data serving engine. Jon has 20 years of experience as an architect and programmer on large distributed systems. He has a master's degree in computer science from the Norwegian University of Science and Technology.

Talk 2 : Lambda Architecture in Practice (Amplitude)

Over the last few years, Lambda Architecture has emerged as a common paradigm for building distributed data processing systems. We'll be looking at two case studies of custom analytics use cases in order to understand Lambda Architecture in practice and how it impacts the complexity of managing the long-term data.

The first is Sumo Logic, where we built a distributed full-text search and aggregation system on top of Lucene. This was before Lambda Architecture was popularized, and the infrastructure used batch processing only. However, because of the real-time requirements of log management, the batch layer had to commit new data quickly (approximately every minute), which led to fragmentation. In order to combat that, an additional reindexing mechanism was introduced that added significant complexity to the long-term data layer.

The second is Amplitude, where we have built a distributed column store using Lambda Architecture. Our system, Nova, is inspired by Druid, with streaming and batch processing layers that can handle the same query patterns. Some of the key practical problems we solved include avoiding the duplication of query logic across the two layers and ensuring proper handoffs of data, both potential pitfalls of using Lambda Architecture. While the resulting system is more complex operationally, it has made managing the long-term data much simpler and less error-prone.
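
As a rough illustration of the two pitfalls mentioned above, here is a minimal Python sketch of a toy Lambda-style store. It is not Nova's actual implementation; the class `LambdaStore`, the function `count_by_type`, and the watermark-based handoff are all hypothetical names and choices made for this example. The idea is that a single query function is reused by both layers, and a handoff moves events from the streaming layer to the batch layer so that each event is counted exactly once.

```python
from collections import Counter


def count_by_type(events, start, end):
    """Shared query logic: count events per type with start <= ts < end.

    Both layers call this same function, so the query is written once.
    """
    return Counter(etype for ts, etype in events if start <= ts < end)


class LambdaStore:
    """Toy two-layer store (hypothetical, for illustration only).

    Assumption: events arrive with timestamps at or after the current
    watermark; late-arriving events are not handled in this sketch.
    """

    def __init__(self):
        self.batch = []      # long-term data, owns timestamps < watermark
        self.streaming = []  # recent data, owns timestamps >= watermark
        self.watermark = 0   # boundary between the two layers

    def ingest(self, ts, etype):
        # New events always land in the streaming layer first.
        self.streaming.append((ts, etype))

    def handoff(self, new_watermark):
        # Move events older than the new watermark into the batch layer.
        moved = [e for e in self.streaming if e[0] < new_watermark]
        self.batch.extend(moved)
        self.streaming = [e for e in self.streaming if e[0] >= new_watermark]
        self.watermark = new_watermark

    def query(self, start, end):
        # Each layer answers only for its side of the watermark,
        # so no event is counted twice or dropped across a handoff.
        batch_part = count_by_type(
            self.batch, start, min(end, self.watermark))
        stream_part = count_by_type(
            self.streaming, max(start, self.watermark), end)
        return batch_part + stream_part


store = LambdaStore()
store.ingest(1, "click")
store.ingest(2, "view")
store.ingest(5, "click")
before = store.query(0, 10)
store.handoff(3)
after = store.query(0, 10)
```

In this sketch the query result is the same before and after the handoff, which is the property a real system would need to preserve while data moves between layers.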

Speaker: Jeffrey Wang

Jeffrey Wang is a Co-founder and Chief Architect at Amplitude Analytics. He works on a variety of product and engineering problems, including building out the columnar data store that powers Amplitude. He studied CS at Stanford and previously worked at Palantir and Sumo Logic.