Oryx 2: Lambda architecture on Spark, Kafka for real-time large scale ML


Details
Strata month, and we have 2 meetups again!
This one is at Palantir's office. Don't show up before 6:30. Please bring ID for security.
Abstract
Building machine learning models is all well and good, but how do they get productionized into a service? It's a long way from a Python script on a laptop to a fault-tolerant system that learns continuously, serves thousands of queries per second, and scales to terabytes. The confederation of open source technologies we know as Hadoop now offers data scientists the raw materials from which to assemble an answer: the means to build models but also ingest data and serve queries, at scale.
This short talk will introduce Oryx 2, a blueprint for building this type of service on Hadoop technologies. It will survey the problem and the standard technologies and ideas that Oryx 2 combines: Apache Spark, Kafka, HDFS, the lambda architecture, PMML, REST APIs. The talk will touch on a key use case for this architecture -- recommendation engines.
Bio
Sean Owen is director of data science at Cloudera in London. Before Cloudera, he founded Myrrix Ltd. (now the Oryx project) to commercialize large-scale real-time recommender systems on Hadoop. He is an Apache Spark committer and a co-author of O’Reilly Media’s Advanced Analytics with Spark. He was a committer and VP for Apache Mahout, and a co-author of Mahout in Action. Previously, Sean was a senior engineer at Google. He holds an MBA from London Business School and a BA from Harvard University.
