Zero-Code Streaming Data Pipeline Using Open Source Technologies


Details
5:30 - Welcome, food, drinks & networking
6:00 - Presentation by Paul Brebner
6:45 - Post talk networking
-
- Access to the building is restricted, so all guests will be let in from 5:30 pm. If you are late (after 5:55) and require access, please call Liam on 0490 436 396.
-
Connect with Paul - https://www.linkedin.com/in/paul-brebner-0a547b4/
-
Join our LinkedIn Group - Software Developers in Canberra - https://www.linkedin.com/groups/12497585/
-
If you would like to nominate yourself or a member of your team to speak at a future event, please email liam.anderson@instaclustr.com or speak with us on the night.
Talk Details:
With the rapid onset of the global Covid-19 pandemic in 2020, the USA Centers for Disease Control and Prevention (CDC) quickly implemented a new Covid-19 pipeline to collect testing data from all of the USA's states and territories, enabling them to produce multiple consumable results for federal and public agencies. They did this in under 30 days, using Apache Kafka.
We built a similar (but simpler) demonstration streaming pipeline for ingesting, indexing, and visualizing some publicly available tidal data using multiple open source technologies including Apache Kafka, Apache Kafka Connect, Apache Camel Kafka Connector, Open Distro for Elasticsearch and Kibana, Prometheus, and Grafana.
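To give a flavour of the "zero-code" idea: each pipeline stage is just a connector configuration handed to Kafka Connect, with no application code. A minimal sketch of a source connector in Kafka Connect's standalone .properties format (the Camel connector class, topic name, and placeholder URL are illustrative assumptions, not the exact configuration from the talk):

  # Illustrative sketch only -- connector class, topic, and URL are assumptions.
  name=noaa-tidal-source
  connector.class=org.apache.camel.kafkaconnector.CamelSourceConnector
  tasks.max=1
  # Destination Kafka topic for the raw readings (example name):
  topics=tides-topic
  # Camel endpoint to poll; substitute the real NOAA tidal data REST URL:
  camel.source.url=<NOAA tidal data REST endpoint>
  value.converter=org.apache.kafka.connect.storage.StringConverter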
In this talk, we introduce each technology and the pipeline architecture, then walk through the steps, challenges, and solutions of building an initial integration pipeline to ingest USA NOAA (National Oceanic and Atmospheric Administration) tidal data and map and index it into Elasticsearch. The goal is to visualize the results with Kibana, where we'll see the period of the lunar day, and the size and location of tidal ranges.
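As background for the mapping step: before indexing, Elasticsearch needs an explicit mapping so that timestamps, readings, and station positions are typed correctly for Kibana (for example, a geo_point field for map visualizations). A minimal sketch, runnable in the Kibana Dev Tools console; the index and field names are illustrative, not the schema from the talk:

  # Sketch only -- index and field names are examples.
  PUT /tides-index
  {
    "mappings": {
      "properties": {
        "timestamp":   { "type": "date" },
        "water_level": { "type": "double" },
        "station_id":  { "type": "keyword" },
        "location":    { "type": "geo_point" }
      }
    }
  }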
But what can go wrong? The initial pipeline worked reliably for a few days, but then unexpectedly failed when it encountered erroneous data. To make the pipeline more robust, we investigate Apache Kafka Connect exception handling and evaluate the benefits of using Apache Camel Kafka Connectors and Elasticsearch schema validation.
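For context, Kafka Connect's built-in error handling (introduced in KIP-298) is driven by a handful of per-connector settings; a minimal sketch of the dead-letter-queue style of handling (the DLQ topic name is an example):

  # Sketch of Kafka Connect sink error-handling settings (KIP-298).
  # Keep the task running past bad records instead of failing fast:
  errors.tolerance=all
  # Log failing records for diagnosis:
  errors.log.enable=true
  errors.log.include.messages=true
  # Route bad records to a dead letter queue topic (sink connectors only):
  errors.deadletterqueue.topic.name=tides-dlq
  errors.deadletterqueue.context.headers.enable=true

On the Elasticsearch side, adding "dynamic": "strict" to the index mapping makes Elasticsearch reject documents with unmapped fields, one way of achieving the schema validation mentioned above.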
With a sufficiently robust pipeline in place, it's time to scale it up. The first step is to select and monitor the most relevant metrics across multiple technologies. We configured Prometheus to collect the metrics and Grafana to produce a dashboard. With the monitoring in place, we were able to systematically increase the pipeline throughput by increasing the number of Kafka connector tasks, while watching out for potential bottlenecks. We discovered and fixed two bottlenecks in the pipeline, proving the value of this approach to pipeline scaling.
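To sketch the monitoring side: Kafka and Kafka Connect JVM metrics are commonly exposed to Prometheus via the Prometheus JMX exporter agent, and Prometheus then scrapes them with a config like the fragment below (job names, hostnames, and ports are illustrative assumptions):

  # prometheus.yml fragment -- targets are example values.
  scrape_configs:
    - job_name: 'kafka-connect'
      # Connect worker JVM metrics exposed by the Prometheus JMX exporter
      static_configs:
        - targets: ['connect-worker-1:8080']
    - job_name: 'elasticsearch'
      static_configs:
        - targets: ['es-node-1:9114']

Throughput is then scaled by raising the connector's tasks.max setting while watching the dashboard for consumer lag or saturated components.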
We conclude the presentation with lessons learned and some potential future challenges.
