
First Meeting in Karlsruhe

Hosted by Florian T. and Hendrik

Details

We are pleased to invite you to the first Karlsruhe edition of the new Big Data group.

For the first talk we welcome Robin Moffat (@rmoff) from Confluent. The second talk comes from Dominik Benz of Inovex GmbH.

Talk 1:

Look Ma, no Code! Building Streaming Data Pipelines with Apache Kafka

Have you ever thought that you needed to be a programmer to do stream processing and build streaming data pipelines? Think again!

Companies new and old are all recognising the importance of a low-latency, scalable, fault-tolerant data backbone, in the form of the Apache Kafka streaming platform. With Kafka, developers can integrate multiple sources and systems, which enables low-latency analytics, event-driven architectures and the population of multiple downstream systems. These data pipelines can be built using configuration alone.
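
To give a flavour of what "configuration alone" can mean in practice, here is a minimal, hypothetical sketch (not taken from the talk): a JDBC source connector is created by POSTing a JSON configuration to Kafka Connect's REST API. The connector name, database URL and table are placeholders; the connector class and REST endpoint are the standard Confluent JDBC source connector and Connect API.

# Sketch only: connector name, connection URL and table are hypothetical placeholders.
import json
import requests

connector = {
    "name": "oracle-orders-source",  # hypothetical connector name
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:oracle:thin:@dbhost:1521/ORCL",  # hypothetical database
        "table.whitelist": "ORDERS",
        "mode": "incrementing",
        "incrementing.column.name": "ORDER_ID",
        "topic.prefix": "oracle-",
        "tasks.max": "1",
    },
}

# Kafka Connect exposes a REST API (default port 8083) for managing connectors.
response = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
response.raise_for_status()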

In this talk, we'll see how easy it is to stream data from a database such as Oracle into Kafka using the Kafka Connect API. In addition, we'll use KSQL to filter, aggregate and join it to other data, and then stream this from Kafka out into multiple targets such as Elasticsearch and MySQL. All of this can be accomplished without a single line of code!
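
As a similarly hedged sketch of the KSQL side (the stream, topic and column names are invented), statements can be submitted to the KSQL server's REST endpoint:

# Sketch only: stream, topic and column names are hypothetical placeholders.
import requests

statements = """
CREATE STREAM orders WITH (KAFKA_TOPIC='oracle-ORDERS', VALUE_FORMAT='AVRO');
CREATE STREAM big_orders AS
  SELECT * FROM orders WHERE ORDER_TOTAL > 100;
"""

response = requests.post(
    "http://localhost:8088/ksql",  # default KSQL server port
    headers={"Content-Type": "application/vnd.ksql.v1+json"},
    json={"ksql": statements, "streamsProperties": {}},
)
response.raise_for_status()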

Why should Java geeks have all the fun?

Talk 2:
Flow is in the Air: Best Practices of Building Analytical Data Pipelines with Apache Airflow

Apache Airflow is an open-source Python project which facilitates an intuitive, programmatic definition of analytical data pipelines. Based on 2+ years of production experience, we summarize its core concepts, detail lessons learned, and set it in the context of the Big Data analytics ecosystem.

Creating, orchestrating and running multiple data processing or analysis steps can make up a substantial portion of a Data Engineer's and Data Scientist's work. A widely adopted notion for this process is the "data pipeline", which consists mainly of a set of "operators" that each perform a particular action on data, with the possibility to specify dependencies among them. Real-life examples include:
- importing several files with different formats into a Hadoop platform, performing data cleansing, and training a machine learning model on the result
- performing feature extraction on a given dataset, applying an existing deep learning model to it, and writing the results to the backend of a microservice
Apache Airflow is an open-source Python project developed by Airbnb which facilitates the programmatic definition of such pipelines. Features which differentiate Airflow from similar projects like Apache Oozie, Luigi or Azkaban include (i) its pluggable architecture with several extension points, (ii) the programmatic approach of "workflow is code", and (iii) its tight relationship with Python as well as the Big Data analytics ecosystem.

Based on several years of production usage, we briefly summarize the core concepts of Airflow and go into depth on lessons learned and best practices from our experience. These include hints for getting efficient with Airflow quickly, approaches to structuring workflows, integrating it into an enterprise landscape, writing plugins and extensions, and maintaining it in a production environment. We conclude with a comparison with other analytical workflow engines and summarize why we have chosen Airflow. We will put a special focus on the context of hybrid (realtime + batch) analytical platforms, and on how Airflow can complement the Apache Kafka / Confluent ecosystem.
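
To make the "operators plus dependencies" idea concrete, here is a minimal, hypothetical DAG sketch in the Airflow 1.x-style Python API, loosely following the first example pipeline above (ingest into Hadoop, cleanse, train a model). All names, paths and commands are placeholders, not code from the talk.

# Sketch only: DAG id, paths and commands are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator


def train_model(**context):
    # placeholder for the actual model training logic
    pass


dag = DAG(
    dag_id="import_clean_train",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

# Each operator performs one action on the data.
ingest = BashOperator(
    task_id="ingest_files",
    bash_command="hdfs dfs -put /data/incoming/*.csv /raw/",
    dag=dag,
)

clean = BashOperator(
    task_id="clean_data",
    bash_command="spark-submit clean_job.py",
    dag=dag,
)

train = PythonOperator(
    task_id="train_model",
    python_callable=train_model,
    provide_context=True,
    dag=dag,
)

# Dependencies: ingest, then cleanse, then train.
ingest >> clean >> train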

codecentric provides the location, drinks, and pizza.

Karlsruhe Big Data Meetup
codecentric AG
Gartenstr. 69a · 76135 Karlsruhe