350 Holger Way, San Jose, CA
This event is kindly hosted by MapR.
6 - 6:30 pm - Food and networking.
Big data tools such as Spark and Hadoop allow you to process data at unprecedented scale, but keeping your processing engine fed can be a challenge. Metadata in upstream sources can ‘drift’ due to infrastructure, OS and application changes, causing ETL tools and hand-coded solutions to fail. StreamSets Data Collector (SDC) is an Apache 2.0 licensed open source platform for building big data ingest pipelines that allows you to design, execute and monitor robust data flows.
As well as providing a browser-based graphical design tool for building data flows without coding, SDC lets developers create their own custom reusable processing elements using Spark’s programming model and existing libraries such as GraphX or MLlib. Data engineers can then build data flows from any mixture of ‘off-the-shelf’ and custom processors.
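As a rough illustration of what such a custom processing element looks like, here is a minimal, cluster-free sketch: a pure function over a batch of records, the same shape you would pass to a `map` transformation in Spark. The names (`enrich`, `enrich_batch`, the record fields) are purely illustrative, not SDC's actual API.

```python
# Hypothetical sketch of a reusable processing element: a pure function
# applied record-by-record, as you would with RDD.map() in Spark.
# All names and fields here are illustrative, not StreamSets' API.

def enrich(record):
    """Add a derived field; unexpected upstream fields pass through unchanged."""
    out = dict(record)  # never mutate the input record
    out["amount_usd"] = round(out.get("amount_cents", 0) / 100.0, 2)
    return out

def enrich_batch(records):
    # In Spark this would be records.map(enrich) on an RDD;
    # plain map() keeps the sketch runnable without a cluster.
    return list(map(enrich, records))

if __name__ == "__main__":
    batch = [{"id": 1, "amount_cents": 1999},
             {"id": 2, "amount_cents": 500, "note": "new upstream field"}]
    for r in enrich_batch(batch):
        print(r)
```

Because the processor never assumes a fixed schema and passes unknown fields through, it tolerates the kind of upstream metadata drift described above.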
In this session, we’ll look at how SDC’s “intent-driven” approach keeps the data flowing, and how your Spark code can run in standalone StreamSets, on an on-premises or in-cloud cluster, or on a fully managed service such as Databricks Cloud.
Hari is a Software Engineer at StreamSets where he works on StreamSets Data Collector. In his previous role at Cloudera, Hari contributed to Apache Flume, Apache Spark and Apache Sqoop.
The data warehouse is the de facto store for mining data, performing analysis, and running reports. Data is extracted and transformed before it is loaded into the warehouse, but that sequence of operations is not easily reversed, if it can be reversed at all.
When using MapR Streams, the streaming feature of the MapR Converged Data Platform, historical data can be replayed (i.e., re-extracted, re-transformed, and re-loaded into the data warehouse) in its original fidelity. Moreover, because MapR is a converged platform, storage and processing occur in place -- on the system of record -- with no data movement.
In this discussion, we will show how using MapR Streams as the system of record enables efficient repair of mis-transformed data in a data warehouse.
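The repair-by-replay idea can be sketched in a few lines. Here an in-memory append-only log stands in for a MapR Streams topic (MapR Streams exposes a Kafka-style API; the `produce` and `replay` helpers below are illustrative stand-ins, not MapR's API): raw events are kept in original fidelity, so a buggy transformation can be corrected and the warehouse table rebuilt by re-reading the log from the beginning.

```python
# Minimal sketch of repairing a warehouse by replaying the system of record.
# The append-only `log` stands in for a MapR Streams topic; `produce` and
# `replay` are hypothetical helpers, not MapR's actual API.

log = []  # immutable system of record: raw events in original fidelity

def produce(event):
    log.append(event)

def replay(transform):
    """Re-read the log from the beginning and rebuild the warehouse table."""
    return [transform(e) for e in log]

# Raw, untransformed events land in the stream exactly once.
produce({"sku": "A1", "price": "19.99"})
produce({"sku": "B2", "price": "5.00"})

# The original load used a buggy transform that truncated prices to dollars.
buggy = lambda e: {"sku": e["sku"], "price": int(float(e["price"]))}
warehouse = replay(buggy)

# The repair: replay the same raw events through the corrected transform.
fixed = lambda e: {"sku": e["sku"], "price": float(e["price"])}
warehouse = replay(fixed)
print(warehouse)
```

Because the raw events were never overwritten by the first (buggy) load, no upstream re-extraction is needed; the warehouse is rebuilt in place from the stream.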
James is a Principal Solutions Architect for MapR, where he develops and deploys big data solutions with Apache Hadoop.