Near Real-Time Ingest with StreamSets Data Collector

Hosted by Scott C.

Details

Big Data tools such as Hadoop and Spark allow you to process data at unprecedented scale, but keeping your processing engine fed can be a challenge. Upstream data sources can 'drift' due to infrastructure, OS and application changes, causing ETL tools and hand-coded solutions to fail, inducing heartburn in even the most resilient data scientist.

Over the past couple of years, StreamSets Data Collector (SDC) has emerged as an open source platform for continuous, near real-time, big data ingest. In this session, Pat Patterson, community champion at StreamSets, will explain how SDC handles change in upstream data sources, keeping the data flowing into Hive, Kudu, Kafka and many more data targets.

Pat Patterson has been working with Internet technologies since 1997, building software and communities at Sun Microsystems, Huawei, Salesforce and StreamSets. At Sun, Pat was the community lead for the OpenSSO open source project, while at Huawei he developed cloud storage infrastructure software. As a developer evangelist at Salesforce, Pat focused on identity, integration and the Internet of Things. Now community champion at StreamSets, Pat is responsible for the care and feeding of the StreamSets open source community.

STL Big Data - Innovation, Data Engineering, Analytics Group