Integrated Dataflow Processing with Spark and StreamSets

This is a past event

20 people went


Join us for a joint meetup with SF HUG to discuss StreamSets Data Collector integration with Spark.

You MUST register at the SF HUG page so we have accurate numbers for food, etc.!


6:00-6:30pm Food, drinks and networking

6:30-7:30pm Tech talk

7:30-8:00pm Networking

Big data tools such as Hadoop and Spark allow you to process data at unprecedented scale, but keeping your processing engine fed can be a challenge. Metadata in upstream sources such as relational databases and log files can ‘drift’ due to infrastructure, OS and application changes, causing ETL tools and hand-coded solutions to fail. StreamSets Data Collector (SDC) is Apache 2.0-licensed open source software that allows data scientists and data engineers to build robust big data ingest pipelines from pre-built and custom processing stages via a browser-based UI.

In this session, Hari will explain how SDC integrates with Apache Spark, and how developers can create their own custom reusable processing elements using Spark’s programming model and existing libraries such as GraphX or MLlib. You'll learn how Spark can run SDC pipelines in a wide variety of environments, from standalone systems such as a developer's laptop to on-premises and in-cloud clusters, allowing developers, data scientists and data engineers to process data at unprecedented scale.