Replicating Drifting Data - Going Beyond the Basics of Big Data Ingest
Details
Super easy access: right by the Dunwoody MARTA Station. Map: https://goo.gl/maps/Lo1Qiu6vRgG2
Doors open at 6:30pm for food and networking, and the presentation starts at 7:00pm.
Topic Update: Cox Automotive's use case will be explored as part of tonight's presentation. Cox Automotive comprises more than 25 companies dealing with different aspects of the car ownership lifecycle, with data as the common language they all share. The challenge for Cox was to build an efficient engine for the timely and trustworthy ingest of an unknown but large number of data assets from practically any source. Discover how their big data engineering team overcame data drift and is now populating a data lake, giving analysts easy access to data from the subsidiary companies and producing new data assets unique to the industry.
Abstract: Data drift, the gradual morphing of data structure and semantics, is a fact of life in enterprise IT. New requirements force schema changes, the meaning of database columns changes over time, and infrastructure upgrades add new fields to log files. Left unchecked, drift in data sources can cause applications and dataflows to fail, with costly downtime and, in the worst case, corruption in downstream data stores.
In this session, we'll start by looking at how we can deal with the problem of drift, focusing on the concrete example of replicating a relational database into Hive. We'll then examine some alternative approaches using open source tools such as Sqoop, NiFi and StreamSets Data Collector. Finally, we'll build a simple data pipeline to read the relational schema, create equivalent Hive tables, and then continuously ingest data from the relational database to Hive, altering the Hive schema as columns are added to the source tables.
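To make the last step concrete, here is a minimal sketch in Python of the schema side of such a pipeline: generating Hive DDL from a relational schema, then generating an ALTER TABLE statement when new columns appear in the source. It is an illustration only, not how Sqoop, NiFi or StreamSets Data Collector implement it. The type mapping, the function names and the `customers` table are all hypothetical, and the sketch only produces DDL strings; actually running them against Hive is left out.

```python
from typing import Dict, Iterable, Optional

# Hypothetical, deliberately tiny mapping from relational types to Hive types.
# A real pipeline would need a much fuller mapping (precision, scale, etc.).
SQL_TO_HIVE = {
    "INTEGER": "INT",
    "BIGINT": "BIGINT",
    "VARCHAR": "STRING",
    "TIMESTAMP": "TIMESTAMP",
}


def to_hive_type(sql_type: str) -> str:
    """Map a source type such as 'VARCHAR(50)' to a Hive type (assumed mapping)."""
    base = sql_type.split("(")[0].strip().upper()
    # Fall back to STRING for anything the sketch does not know about.
    return SQL_TO_HIVE.get(base, "STRING")


def column_list(columns: Dict[str, str]) -> str:
    """Render 'name type' pairs as a Hive column list, preserving order."""
    return ", ".join(f"`{name}` {to_hive_type(t)}" for name, t in columns.items())


def create_table_ddl(table: str, source_columns: Dict[str, str]) -> str:
    """Build a CREATE TABLE statement equivalent to the source schema."""
    return f"CREATE TABLE IF NOT EXISTS `{table}` ({column_list(source_columns)})"


def drift_ddl(
    table: str, source_columns: Dict[str, str], hive_columns: Iterable[str]
) -> Optional[str]:
    """Return an ALTER TABLE statement for columns added at the source, or None."""
    existing = {c.lower() for c in hive_columns}
    added = {n: t for n, t in source_columns.items() if n.lower() not in existing}
    if not added:
        return None
    return f"ALTER TABLE `{table}` ADD COLUMNS ({column_list(added)})"


if __name__ == "__main__":
    source = {"id": "INTEGER", "name": "VARCHAR(50)"}
    print(create_table_ddl("customers", source))

    # Later, a column is added to the source table.
    source["signup_ts"] = "TIMESTAMP"
    print(drift_ddl("customers", source, ["id", "name"]))
```

This sketch handles only added columns, which is the case the abstract describes. Dropped columns, renames and type changes are other forms of drift that would need their own policy.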
Speaker: Pat Patterson has been working with Internet technologies since 1997, building software and communities at Sun Microsystems, Huawei, Salesforce and StreamSets. At Sun, Pat was the community lead for the OpenSSO open source project, while at Huawei he developed cloud storage infrastructure software. As a developer evangelist at Salesforce, Pat focused on identity, integration and the Internet of Things. Now community champion at StreamSets, Pat is responsible for the care and feeding of the StreamSets open source community.








