Near Real-Time Ingest with StreamSets Data Collector


Details
Big Data tools such as Hadoop and Spark allow you to process data at unprecedented scale, but keeping your processing engine fed can be a challenge. Upstream data sources can 'drift' due to infrastructure, OS and application changes, causing ETL tools and hand-coded solutions to fail, inducing heartburn in even the most resilient data scientist.
Over the past couple of years, StreamSets Data Collector (SDC) has emerged as an open source platform for continuous, near real-time, big data ingest. In this session, Pat Patterson, community champion at StreamSets, will explain how SDC handles change in upstream data sources, keeping the data flowing into Hive, Kudu, Kafka and many more data targets.
Pat Patterson has been working with Internet technologies since 1997, building software and communities at Sun Microsystems, Huawei, Salesforce and StreamSets. At Sun, Pat was the community lead for the OpenSSO open source project, while at Huawei he developed cloud storage infrastructure software. As a developer evangelist at Salesforce, Pat focused on identity, integration and the Internet of Things. Now community champion at StreamSets, Pat is responsible for the care and feeding of the StreamSets open source community.
