28. Scio: Moving Big Data to Google Cloud, a Spotify Story
Hosted by Stockholm Hadoop User Group
Agenda
• 17.45: Drink, socialize
• 18.00: First talk: Test strategies for data processing pipelines
Abstract: A good automated testing strategy is crucial for high product development productivity and for launching new features quickly with continuous deployment. Although backend software developers are well aware of the need for good test structure and discipline, testing is often added as an afterthought in data processing environments, resulting in slow code-test-debug cycles and long delays in getting data-driven features out the door. The time from deciding to collect a new type of data, through adapting the data collection and data processing pipelines involved, to creating a feature based on that data is often measured in weeks or months.
This talk will present recommended patterns and corresponding anti-patterns for testing data processing pipelines. We will suggest technology and architecture to improve testability, both for batch and streaming processing pipelines. We will primarily focus on testing for the purpose of development productivity and product iteration speed, but briefly also cover data quality testing.
Speaker: Lars Albertsson has worked with data-intensive and scalable applications at Google, Spotify, Schibsted Media Group, natural language processing startup Recorded Future, and with stock exchange systems. He worked in the engineering productivity organisation at Google, and has built continuous deployment solutions for data processing at Spotify and Schibsted. He is now an independent consultant, helping companies build scalable data processing solutions.
• 18.45: Eat, drink, socialize (more)
• 19.00: Second talk: Scio: Moving Big Data to Google Cloud, a Spotify Story
Abstract: We will talk about Spotify’s story of migrating our big data infrastructure to Google Cloud. Over the past year or so we moved away from maintaining our own 2500+ node Hadoop cluster to managed services in the cloud. We replaced two key components in our data processing stack, Hive and Scalding, with BigQuery and Scio, and are now able to iterate much faster. We will focus on the technical aspects of Scio, a Scala API for Apache Beam and Google Cloud Dataflow, and how it changed the way we process data.
Speaker: Neville Li - Neville is a software engineer at Spotify working on data and machine learning infrastructure. In the past few years he has been driving the adoption of Scala and new tools for data processing, including Scalding, Spark, Storm and Parquet. Before that he worked on search quality at Yahoo! and on old-school distributed systems like MPI.
• 19.45: Drink, socialize (even more)
