Designing Scalable Data Pipelines with Apache NiFi


An increasing number of companies are embarking on the journey to becoming truly data-driven; this profound change presents its own unique challenges. The velocity and diversity of new datasets are compelling organizations to search for new approaches to reliable data ingestion, whether the target system is Hadoop, a data warehouse, or a new-fangled NoSQL database. For years, enterprises have struggled to create dataflows between the diverse systems in their infrastructure that handle both internal and external business data.
This talk will cover a new project in the Apache ecosystem, NiFi. Throughout the talk, you will learn how NiFi has greatly improved the overall efficiency of data ingestion on the data platform team here at Cloudera.
Apache NiFi is a new project that aims to make architecting mission-critical dataflows as simple as designing a flow chart. NiFi’s core concepts are borrowed from a programming paradigm known as flow-based programming, a paradigm that has been around since the 1970s. Although NiFi is a brand-new open source project, it has been used in production by the National Security Agency for several years.
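To make the flow-based programming idea concrete, here is a minimal sketch of the paradigm in Python. This is purely illustrative and is not NiFi's actual API: the `FlowFile`, `generate`, and `uppercase` names are hypothetical stand-ins for the general model of independent processors exchanging data packets over queues.

```python
from queue import Queue

# Illustrative sketch of flow-based programming (NOT NiFi's real API):
# independent processing steps exchange data packets over queues,
# so each step can be developed, wired, and monitored separately.

class FlowFile:
    """A data packet: content plus a dictionary of attributes (metadata)."""
    def __init__(self, content, attributes=None):
        self.content = content
        self.attributes = dict(attributes or {})

def generate(out_queue, lines):
    # Source processor: wraps each input line in a FlowFile.
    for line in lines:
        out_queue.put(FlowFile(line, {"source": "generate"}))

def uppercase(in_queue, out_queue):
    # Transform processor: consumes packets from one queue,
    # emits transformed packets to the next queue in the flow.
    while not in_queue.empty():
        ff = in_queue.get()
        out_queue.put(FlowFile(ff.content.upper(), ff.attributes))

# Wire the two processors together with queues, like edges in a flow chart.
q1, q2 = Queue(), Queue()
generate(q1, ["hello", "nifi"])
uppercase(q1, q2)
results = [q2.get().content for _ in range(2)]
print(results)  # ['HELLO', 'NIFI']
```

Because each step only sees its input and output queues, steps can be rearranged or replaced without touching one another, which is the property that lets NiFi present dataflow design as a drag-and-drop flow chart.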
One of the primary benefits of using NiFi is the significant boost to data agility, that is, the ability to ingest new datasets within minutes, as opposed to hours or days. With NiFi, security, monitoring, and fault tolerance are first-class citizens, giving you the confidence that production pipelines will continue working even if failures occur.
In this talk we will cover how NiFi implements the core concepts behind flow-based programming. Using a real-world example, we will also cover how you can port custom internal code to run within NiFi. Finally, you will learn how even non-programmers can create dataflows using a robust web interface designed from the ground up. You should walk away from this talk with a good understanding of how you can use NiFi to automate dataflows between the various systems within an enterprise.
About the Speaker
Ricky Saltzer is a local data engineer on Cloudera's internal data platform team. His team is responsible for architecting scalable ingestion pipelines for a multitude of datasets, such as support data, business data, and internal machine data. Ricky has been using Hadoop for over three years and is a contributor to multiple open source big data projects. The data platform team he's on makes extensive use of many big data technologies (e.g., HBase, Impala, Kafka, NiFi).