Oryx 2: Lambda architecture on Spark, Kafka for real-time large scale ML

Hosted By
Max K. and Paul D.

Details

Strata month, and we have 2 meetups again!

This one is at Palantir's office. Please don't show up before 6:30, and bring ID for security.

Abstract

Building machine learning models is all well and good, but how do they get productionized into a service? It's a long way from a Python script on a laptop to a fault-tolerant system that learns continuously, serves thousands of queries per second, and scales to terabytes. The confederation of open source technologies we know as Hadoop now offers data scientists the raw materials from which to assemble an answer: the means not only to build models but also to ingest data and serve queries, at scale.

This short talk will introduce Oryx 2, a blueprint for building this type of service on Hadoop technologies. It will survey the problem and the standard technologies and ideas that Oryx 2 combines: Apache Spark, Kafka, HDFS, the lambda architecture, PMML, REST APIs. The talk will touch on a key use case for this architecture -- recommendation engines.

Bio

Sean Owen is director of data science at Cloudera in London. Before Cloudera, he founded Myrrix Ltd. (now the Oryx project) to commercialize large-scale real-time recommender systems on Hadoop. He is an Apache Spark committer and a co-author of O’Reilly Media’s Advanced Analytics with Spark. He was a committer and VP for Apache Mahout, and co-author of Mahout in Action. Previously, Sean was a senior engineer at Google. He holds an MBA from London Business School and a BA from Harvard University.

NYC Machine Learning
Palantir
15 Little W 12th Street · New York City, NY