Pre-Summit Event: Factorization Machines in Spark and Spark DataFrames


Details
Spark Summit Europe will take place October 27-29 at the Beurs van Berlage in Amsterdam. On the evening before the conference, we will host a Spark Meetup.
Food and beverages for this event are kindly provided by SAP.
Preliminary agenda:
18.00 - 18.50: Arrive, eat pizza, drink
18.50 - 19.00: Short introduction by your humble organisers and a welcome from our sponsor: SAP
19.00 - 19.45: Talk 1, by Dafne van Kuppevelt, Data Scientist at ING
Factorization Machines on Spark at ING
As data scientists at ING, we face many interesting data challenges each day, and we find Apache Spark a very useful tool for processing large amounts of data and applying machine learning algorithms. When our business cases called for a model that was not yet available in Spark, namely Factorization Machines (FMs), we implemented it ourselves. We also used this opportunity to experiment with different optimization algorithms.
In this talk, we will first explain what Factorization Machines are and why they are useful for us. We will show details of the implementation in Spark and explain how Parallel Gradient Descent is done. Finally, we will present the results of experiments, which show that FMs are good predictors and can be trained quickly on a distributed platform. Last but not least, we will give a demo of an IPython Notebook that lets you easily use the code (available on GitHub).
19.45 - 20.00: Break
20.00 - 20.45: Talk 2, by Michael Armbrust, developer of Spark SQL (Databricks)
Spark DataFrames: Simple and Fast Analysis of Structured Data
This talk will provide a technical overview of Spark’s DataFrame API. First, we’ll review the DataFrame API and show how to create DataFrames from a variety of data sources such as Hive, relational databases, or structured file formats like Avro. We’ll then give example user programs that operate on DataFrames and point out common design patterns. The second half of the talk will focus on the technical implementation of DataFrames, such as the use of Spark SQL’s Catalyst optimizer to intelligently plan user programs, and the use of fast binary data structures in Spark’s core engine to substantially improve performance and memory use for common types of operations.
20.45 - ??.??: Socialize, drinks, etc.
