Extreme Apache Spark: How to build a pipeline for processing 2.5B rows/day in 3 months

DataKRK (formerly Cracow Hadoop User Group)


Update: we have just confirmed that there will be a second talk as well (details below).

First talk: Extreme Apache Spark: How in 3 months you can create a pipeline for processing 2.5Bn rows/day

"Apache Spark is simply awesome," says our next speaker, Josef Habdank. In this talk he will give you a crash course in how to design an extremely scalable data processing pipeline on Apache Spark, using technologies such as Spark Streaming, Scala, Kafka/Kinesis, Snappy, Avro, Parquet, HDFS/S3 and Zeppelin.

It will be the story of 3 crazy developers who, in 3 months, managed to develop and put into production a Spark data pipeline that can crunch through 2.5 billion airfares a day without breaking a sweat. It was an amazing journey in which they had to do everything themselves: take care of the hardware and deploy the platform, research technologies, hack out all the code in Spark/Scala, test scalability, build the monitoring tools and deliver the complete business intelligence product to the customer.

Josef says: "Yes, it is possible, and it is possible in 3 months. If you come to the talk I will share with you the DOs and DON'Ts of such a process, and I will explain which technologies turned out to be right and which were a mistake." You will learn how to use correct message compression and serialization (Avro + Snappy), best practices for in-stream error handling, how to build a successful 50TB+ Parquet-based data warehouse and more, with code samples provided.

About the Speaker: Josef Habdank is a Lead Data Scientist and Data Platform Architect at Infare Solutions, with previous experience at Big Data and Data Science practitioners such as Thomson Reuters and Adform, as well as the Department of Defense. He is an expert in Apache Spark and Spark-enabled technologies, and a frequent speaker at prominent Big Data conferences such as Spark Summit and High Load Strategy. Additionally, he is a specialist in real-time modelling and nonlinear forecasting, and has experience with systems processing tens of billions of data points daily and data warehouses holding hundreds of billions of rows.

Second talk: Data Science Bowl 2017 - ups & downs of a data scientist

In the United States, lung cancer strikes 225,000 people every year and accounts for $12 billion in health care costs. In 2016 the office of the U.S. Vice President started the Cancer Moonshot initiative to make progress in cancer prevention, diagnosis and treatment. The 2017 Data Science Bowl competition supports the Cancer Moonshot by convening the data science and medical communities to develop lung cancer detection algorithms. This year the Data Science Bowl offered $1 million in prizes.
The IBM (Krakow) data science team took part in that journey and would like to share their experiences.

About the Speakers: Lukasz Cmielowski, PhD, is a Lead Data Scientist at IBM with a track record of developing enterprise-level applications that substantially increase clients' ability to turn data into actionable knowledge. Umit Mert Cakmak is a Data Scientist at IBM working on Watson Machine Learning cloud solutions.

P.S. The event will take place in the wonderful Pauza in Garden again. Let's show our appreciation to our hosts by making good use of their great choice of beverages and snacks.