Spark Streaming and Scaling


Details
Beyond Shuffling & A Structured Streaming Preview: From scaling Spark Jobs to exploring the new Structured Streaming API
Speaker: Holden Karau
Abstract: This talk walks through a number of common mistakes that can keep our Spark programs from scaling and examines the solutions, as well as general techniques useful for moving beyond a proof of concept to production. It covers topics like effective RDD re-use and considerations for working with key/value data, and finishes with an introduction to Datasets with Structured Streaming (new in Spark 2.0) and how to do weird things with them.
Bio: Holden Karau is a transgender Canadian, an active open source contributor, and co-author of Learning Spark & High Performance Spark. When not in San Francisco working as a software development engineer at IBM's Spark Technology Center, Holden talks internationally on Spark and holds office hours at coffee shops at home and abroad. She makes frequent contributions to Spark, specializing in PySpark and Machine Learning. Prior to IBM she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics in Computer Science. Outside of software she enjoys playing with fire, welding, scooters, poutine, and dancing.
Spark Streaming into context
Speaker: David Martinez Rego
Abstract: Streaming is arguably the area of the Big Data technology stack that has seen the most change in the last two years. Although Apache Spark is one of the big players, the modern streaming libraries are not fully equivalent to one another. We will review the differences in design and aims between the libraries, which can help you make the right choice for your next project.
Bio: David Martinez is a postdoctoral researcher at UCL and a consultant on Big Data technologies and backend architecture. He divides his time between research on applied Machine Learning and helping startups and companies make better use of ML algorithms and Data Science technologies. Previously, in his PhD studies, he investigated automated predictive maintenance of wind turbines.
