
"Extreme" Apache Spark:How in 3mo. we created a pipl for processing 2.5b rec/day

Hosted By
Vladimir Smida
"Extreme" Apache Spark:How in 3mo. we created a pipl for processing 2.5b rec/day

Details

You can stay up to date with upcoming events by subscribing to the BDD newsletter on BigDataDenmark.dk (http://bigdatadenmark.dk/#contact)

"Apache Spark is simply awesome" says our next speaker Josef Habdank. In this talk he will give you a crash course how to design an extremely scalable data processing pipeline on Apache Spark on using tech such as: Spark Streaming, Scala, Kinesis, Snappy, Avro, Parquet, S3, Zeppelin.

It will be the story of 3 crazy developers who in 3 months managed to develop and put into production a Spark data pipeline that can crunch through 2.5 billion airfares a day without breaking a sweat. It was an amazing journey in which they had to do everything themselves: take care of the hardware and deployment platform, research technologies, hack out all the code in Spark/Scala, test scalability, build the monitoring tools and deliver the complete business intelligence product to the customer.

Josef says: "Yes, it is possible, and it is possible in 3 months. If you come to the talk I will share with you the DOs and DON'Ts of such a process, and I will explain which technologies turned out to be right and which were a mistake." You will learn how to use the correct message compression and serialization (Avro + Snappy), best practices for in-stream error handling, how to build a successful 50 TB+ data warehouse (Parquet with metadata splitting) and more, with code samples provided.
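As a rough illustration of the data warehouse point, here is a minimal Spark/Scala sketch (not the code from the talk) that writes already-decoded records as Snappy-compressed Parquet on S3, partitioned by a date column so queries can prune whole partitions from the metadata. The S3 paths and the column name are placeholders.

import org.apache.spark.sql.SparkSession

object ParquetWarehouseSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("airfare-warehouse-sketch")
      .getOrCreate()

    // Hypothetical input: airfare records already decoded from Avro
    // into a DataFrame by an earlier stage of the pipeline.
    val fares = spark.read.parquet("s3a://example-bucket/staging/fares/")

    fares.write
      .option("compression", "snappy")   // Snappy-compressed Parquet files
      .partitionBy("observation_date")   // directory-level splitting; readers prune by partition
      .mode("append")
      .parquet("s3a://example-bucket/warehouse/fares/")

    spark.stop()
  }
}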

About the Speaker:
Josef Habdank is a Lead Data Scientist and Data Platform Architect at Infare Solutions, with previous experience at Big Data and Data Science practitioners such as Thomson Reuters, Adform and the Department of Defence. He is an expert in Apache Spark and Spark-enabled technologies such as Kafka, Kinesis, Cassandra, Tachyon and others. Additionally, he is a specialist in real-time modelling and non-linear forecasting, and has experience with systems processing tens of billions of data points daily and data warehouses holding hundreds of billions of rows.

Big Data Denmark
ITU, IT University of Copenhagen, Auditorium 1
Rued Langgaards Vej 7, 2300 København S · Copenhagen