Building ETL Pipeline with Kafka Connect


Details
Agenda
6:30 PM to 6:45 PM - Gathering and Networking
6:45 PM to 7:00 PM - Introduction to Data Riders @Fremont
7:00 PM to 8:30 PM - Building ETL Pipeline with Kafka Connect By Liquan Pei
About the Speaker:
Liquan Pei is a Software Engineer at Confluent. He has worked primarily on the Kafka Connect HDFS connector and has contributed to other components of the Confluent Platform, such as the Schema Registry and Camus. Previously, he worked at Pivotal on MPP database internals, a scalable in-database machine learning library, and the integration of an in-memory database with Hadoop. He is also an open-source contributor to Apache Spark and Apache Kafka. Liquan graduated from UMass Amherst with degrees in Computer Science and Physics.
Building ETL Pipeline with Kafka Connect
Many companies are adopting Apache Kafka to power their data pipelines, including LinkedIn, Netflix, and Airbnb. Kafka's ability to handle high-throughput, real-time data makes it a perfect fit for solving the data integration problem, acting as the common buffer for all your data and bridging the gap between streaming and batch systems.
However, building a data pipeline around Kafka today can be challenging because it requires combining a wide variety of tools to collect data from disparate data systems: one tool streams updates from your database to Kafka, another imports logs, and yet another exports to HDFS. As a result, building a data pipeline can take significant engineering effort and carries relatively high operational overhead, because all of these different tools require ongoing monitoring and maintenance. Additionally, some of the tools are simply a poor fit for the job: the fragmented nature of the data integration ecosystem leads to creative but misguided solutions, such as misusing stream processing frameworks for data integration.

This talk describes the design and implementation of Kafka Connect, Kafka's new tool for scalable, fault-tolerant data import and export. First, we'll discuss some existing tools in the space and why they fall short when applied to data integration at large scale. Next, we'll explore Kafka Connect's design and how it compares to systems with similar goals, discussing key design decisions that trade off ease of use for connector developers, operational complexity, and reuse of existing connectors. Finally, we'll discuss how standardizing on Kafka and its Kafka Connect tool can ultimately simplify your entire data pipeline, making tasks like ETL into your data warehouse, or feeding stream processing applications, as simple as adding another Kafka Connect job.
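To make that last point concrete, here is a minimal sketch of what "adding another Kafka Connect job" can look like, using the Confluent HDFS sink connector as an example. The topic name, HDFS URL, and file paths below are illustrative assumptions; the exact properties depend on your connector and installation:

    # hdfs-sink.properties -- illustrative connector configuration
    name=hdfs-sink
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    tasks.max=1
    # Hypothetical topic to export; replace with your own
    topics=page_views
    # Hypothetical HDFS namenode address
    hdfs.url=hdfs://namenode:8020
    # Number of records to write per file before committing
    flush.size=1000

    # Launch a standalone worker with this connector (paths assume a
    # Confluent Platform layout; adjust for your installation):
    bin/connect-standalone etc/kafka/connect-standalone.properties hdfs-sink.properties

The worker framework handles offset tracking, fault tolerance, and parallelism, so the "job" itself is just this declarative configuration rather than custom integration code.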
