Data-intensive Recommenders and Machine Learning applications in Spark & Flink


Details
The third meetup event will focus on how to build data science applications at scale. The event will start with one of the most popular computation frameworks, Apache Spark (http://spark.apache.org/), followed by a second talk on its recently emerged alternative, Apache Flink (https://flink.apache.org/).
Thanks to Diego Liberati for offering to host our event in the conference room ("Sala conferenze") of the Dipartimento di Elettronica, Informazione e Bioingegneria at Politecnico di Milano.
How to find us:
Enter at Via Ponzio 34/5 (Politecnico di Milano). Right next to the main department entrance there is a secondary door that leads into an enclosed area with access to the conference room. Please do not close the street-level door, since it cannot be opened from outside without a badge. Closest public transportation: Piola (subway green line) or Lambrate (train station).
Agenda:
18:00 Doors open
18:30 (20 minutes)
"Introduction to Distributed Computing Engines for Data Processing" by Simone Robutti, Machine Learning Engineer @ Radicalbit (http://radicalbit.io/)
Brief introduction to get a basic familiarity with Map/Reduce, Spark & Flink.
19:00 (40 minutes + Q&A)
"The Barclays Data Science Hackathon: Building Retail Recommender Systems based on Customer Shopping Behaviour" by Gianmario Spacagna, Senior Data Scientist @ Pirelli (http://www.pirelli.com)
In the depths of the last cold, wet British winter, the Advanced Data Analytics team from Barclays escaped to a villa on Lanzarote, Canary Islands, for a one-week hackathon where they collaboratively developed a recommendation system on top of Apache Spark. The contest consisted of using Bristol customer shopping behaviour data to make personalised recommendations in a Kaggle-like competition, where each team's goal was to build an MVP and then repeatedly iterate on it using common interfaces defined by a purpose-built framework.
The talk will cover:
• How to rapidly prototype in Spark (via the native Scala API) on your laptop and magically scale to a production cluster without huge re-engineering effort.
• The benefits of doing type-safe ETLs representing data in hybrid, and possibly nested, structures like case classes.
• Enhanced collaboration and fair performance comparison by sharing ad-hoc APIs plugged into a common evaluation framework.
• The co-existence of machine learning models available in MLlib and domain-specific bespoke algorithms implemented from scratch.
• A showcase of different families of recommender models (business-to-business similarity, customer-to-customer similarity, matrix factorisation, random forest and ensembling techniques).
• How Scala (and functional programming) helped our cause.
Gianmario is a Senior Data Scientist at Pirelli Tyre, processing telemetry data for smart manufacturing and connected-vehicle applications. His main expertise is in building production-oriented machine learning systems. Co-author of the Professional Manifesto for Data Science (http://www.datasciencemanifesto.org), he loves evangelising his passion for best practices and effective methodologies amongst the community. Prior to Pirelli, he worked in Financial Services (Barclays), Cyber Security (Cisco) and Predictive Marketing (AgilOne).
20:00 (40 minutes + Q&A)
"Data intensive applications with Apache Flink" by Simone Robutti, Machine Learning Engineer @ Radicalbit (http://radicalbit.io/)
In the last 10 years, the IT industry has seen a complete revolution in the perceived value that computing has for businesses and in how engineers think about applications: in several application domains, the need for data has outgrown the capacity of commodity hardware, and the need for information has outpaced traditional processing technologies and approaches. In this talk we'll introduce Apache Flink, a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. It is an open source project that builds on proven approaches as well as innovative algorithms. We will go in depth on how this tool can be used to implement data-intensive applications, with particular attention to the current tools and future prospects for using machine learning algorithms in a distributed context.
Simone Robutti, 27, is a Machine Learning Engineer at Radicalbit. He earned a Master's degree from Università degli studi di Milano with a thesis on SVMs for datasets with noisy labels. Since then his interests have shifted towards the engineering side of Machine Learning and Big Data: implementation, deployment, portability and maintainability of ML-intensive systems. His current focus at Radicalbit is Flink and its Machine Learning library, FlinkML.
Please note that we must all leave the venue before 9pm.
Feel free to invite your friends and colleagues interested in Data Science.
If you are interested in giving a lightning talk, please contact the organizer privately.
P.S.
We have opened a Google form for reserving your seat at the post-event dinner on 13th July at the restaurant Vietnamonamour (http://www.thefork.it/ristorante/vietnamonamour-via-pestalozza/56981).
Dinner reservation form: https://docs.google.com/forms/d/1Ai2DuLTPcxmUUDyJd_zfu7r6lURYQ1ZDNmlaOIke_oA/viewform?entry.1985478587&entry.689106344=0
