Building Data Pipelines: From Berlin Rent Prices to Streaming Architecture

Name: Building Data Pipelines: From Berlin Rent Prices to Streaming Architecture
Start: 2018-03-27T19:00:00+02:00
End: 2018-03-27T21:00:00+02:00
Location: The Hub @ Zalando

Hosted by Nana Y.

Tech In Berlin Meetup

Details:

For this Data Meetup, we welcome Jekaterina Kokatjuhha, Ricardo de Cillo, Dmitriy Sorokin, and Max Schultze with a series of short talks sharing experiences on building data pipelines, big and small. Topics range from very specific problems to a broader view of an end-to-end machine learning project built from scratch.

Schedule:

19:00 - 19:20 Doors Open: Drinks + Food
19:20 - 19:30 Welcome by Kshitij Kumar, VP Data Infrastructure, Zalando
19:30 - 19:45 Aggregating, Processing, and Querying 40 Million Events per Hour - Joachim Hofer
19:45 - 20:00 Building a Data Science Project From Scratch - Jekaterina Kokatjuhha
20:00 - 20:30 Break - Drinks, Snacks, and Discussions
20:30 - 20:45 Relational Data Ingestion into Zalando’s Data Lake - Max Schultze
20:45 - 21:00 Fast Distributed Locking - Dmitriy Sorokin
21:00 - 21:15 Data Quality and Json-Schema - Ricardo de Cillo
21:15 - Networking + Drinks
21:45 - Event ends

For more details on topics and speakers, please read below.

Aggregating, Processing, and Querying 40 Million Events per Hour
Joachim Hofer

In this talk, I would like to give an overview of our streaming architecture, the reasoning behind some of the architectural decisions we made (for example, how we deal with the ordering of events, or how we store the current state of the data), and some lessons learned.

Building a Data Science Project From Scratch: Analysis of Berlin Rental Prices
Jekaterina Kokatjuhha

This talk is about how to design a good data science project from scratch based on a real-world dataset. As a showcase project, we analyze the rental prices for apartments in Berlin. This talk will guide you through all the steps of a short-term data science project: motivation, extraction of data from the web, cleaning and engineering of features using external APIs, storytelling, and building machine learning models. We will dive into the pitfalls and design patterns of scraping data from the web. The importance of interactive dashboards should not be understated, as they help you find useful insights on your own. We will apply human judgment about an apartment's address to engineer new features using the Google API, and use correlated features to impute the feature of interest. In the end, several machine learning models will be used to explore the ideas of bagging and stacked models.

Relational Data Ingestion into Zalando’s Data Lake
Max Schultze

The Data Lake at Zalando is a centralized repository of all data sources in the company, accessible to every employee in a secure manner. This comes with the challenge of integrating with the whole data landscape of the company, which includes many different relational databases in both the data center and the cloud. What are the different ways to make this work, and which option do we prefer for which use case, and why?

Fast Distributed Locking
Dmitriy Sorokin

While operating the Nakadi messaging bus, we deal with dozens of instances that process hundreds of thousands of requests per second, and we face many challenges related to coordinated actions. In this talk, I will cover how we implemented fast distributed locking for timeline switches.

Data Quality and Json-Schema
Ricardo de Cillo

JSON is the most widely adopted format for data transfer between services. Even though other formats, like Avro, have gained a lot of attention in recent years, they have not overthrown JSON. In this talk, we share our experience with ingesting JSON data into our data lake in "The Good, the Bad and the Ugly" style.
