Big Data Ingest for Data Scientists


Details
Abstract: It's estimated that most data scientists spend only 20 percent of their time on actual data analysis, with the remaining 80 percent spent finding, cleaning, and reorganizing data. StreamSets Data Collector is an Apache 2.0-licensed open source project that lets you create dataflow pipelines to read data from a wide variety of sources, either as a continuous stream or as a batch job, apply transformations to get it into shape, and write it to big data platforms such as Hadoop, S3, and Cassandra for analysis. Data scientists, data engineers, and developers at dozens of companies use Data Collector to ingest vast amounts of data every day.
In this session, Pat Patterson, a technical director at StreamSets, will explain how to create dataflow pipelines using the Data Collector UI. We'll see how a simple pipeline can read and write data with no coding required, while custom logic can be implemented with a minimal amount of scripting. As an example, we'll build a pipeline to read JSON-formatted city lot data from San Francisco's public records, calculate the area of each lot from its boundary coordinates, and write records simultaneously to Hive and Kafka for analysis and visualization.
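To give a flavor of the "minimal amount of scripting" involved, here is a rough sketch of how the lot area calculation might look in Data Collector's Jython Evaluator processor. The field names ('coordinates' and 'area') and the boundary format are assumptions for illustration, not the exact pipeline from the talk, and real longitude/latitude boundaries would first need projecting to a planar coordinate system for the result to be a meaningful area:

    def polygon_area(points):
        # Shoelace formula: area of a simple polygon from its vertex list
        # [[x1, y1], [x2, y2], ...], traversed in order.
        total = 0.0
        n = len(points)
        for i in range(n):
            x1, y1 = points[i]
            x2, y2 = points[(i + 1) % n]
            total += x1 * y2 - x2 * y1
        return abs(total) / 2.0

    # The Jython Evaluator exposes the current batch as 'records'; each
    # record's fields are accessed through record.value.
    for record in records:
        record.value['area'] = polygon_area(record.value['coordinates'])
        output.write(record)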
Bio: Pat Patterson has been working with Internet technologies since 1997, building software and communities at Sun Microsystems, Huawei, Salesforce and StreamSets. At Sun, Pat was the community lead for the OpenSSO open source project, while at Huawei he developed cloud storage infrastructure software. As a developer evangelist at Salesforce, Pat focused on identity, integration and the Internet of Things. Now a technical director at StreamSets, Pat helps enterprises unlock the value in their data.
P.S.: Food from Mi Patio Mexican Restaurant, sponsored by our friends at StreamSets!