BDAM 02/13: Maintaining full data lineage; Migration & Change Data Capture: CDAP



Big thanks to for hosting and sponsoring this meetup event!


6:00 - 6:30 - Socialize over food and beverages
6:30 - 7:30 - Tech Talks
7:30 - 8:00 - Networking


#1: Maintaining full data lineage and governance across billions of data partitions by Steven Parkes, Ascend

#2: Moving to the Cloud: Data Migration and Change Data Capture (CDC) with CDAP by Tony Hajdari, Google


#1: Maintaining full data lineage and governance across billions of data partitions

As organizations design and build data pipelines, much of the focus tends to be on the end-to-end operations and orchestration that happen during each run. As the data and the number of pipelines scale, however, tracking and optimizing each of these dependencies becomes brittle and requires manual upkeep. This can be alleviated by building context-awareness into the pipeline - automatically tracking lineage at the partition level to map and act on dependencies.

In this talk, we’ll discuss how Ascend has architected a cloud service for building these context-aware autonomous pipelines, leveraging open source technologies such as Spark and Kubernetes to support billions of partitions. We’ll also walk through a few use cases where these have been especially impactful - ranging from meeting regulatory requirements to decreasing time to productionize workflows.
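To make the idea of partition-level lineage concrete, here is a minimal Python sketch. It is an illustration only, not Ascend's implementation: the `PartitionLineage` class, its method names, and the `(dataset, partition)` tuple keys are all hypothetical. It records which input partitions produced each output partition, then walks the graph to find everything downstream of a changed partition.

```python
from collections import defaultdict, deque


class PartitionLineage:
    """Hypothetical in-memory lineage graph keyed by (dataset, partition) tuples."""

    def __init__(self):
        # Maps each partition to the set of partitions derived directly from it.
        self._downstream = defaultdict(set)

    def record(self, inputs, output):
        """Record that `output` was computed from every partition in `inputs`."""
        for src in inputs:
            self._downstream[src].add(output)

    def affected_by(self, changed):
        """Return every partition that depends, directly or transitively, on `changed`."""
        seen = set()
        queue = deque([changed])
        while queue:
            node = queue.popleft()
            # .get() avoids creating empty entries in the defaultdict.
            for child in self._downstream.get(node, ()):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        return seen


lineage = PartitionLineage()
lineage.record([("events", "2019-02-12")], ("sessions", "2019-02-12"))
lineage.record([("sessions", "2019-02-12"), ("users", "all")], ("report", "2019-02-12"))

# Partitions that would need recomputing if one day of raw events changed.
stale = lineage.affected_by(("events", "2019-02-12"))
```

With lineage recorded per partition, a change to one input only marks its actual descendants as stale, rather than forcing a rerun of the whole pipeline.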

#2: Moving to the Cloud: Data Migration and Change Data Capture (CDC) with CDAP

Moving enterprise data to the cloud can be a daunting process. Beyond the initial data offloading from an on-premises Enterprise Data Warehouse (EDW), enterprises require efficient and scalable mechanisms for keeping data in sync. Until recently, the open source community had limited options for CDC. CDAP enables Change Data Capture for relational databases: it consumes change data events and updates the corresponding cloud instance, continually keeping an on-premises warehouse and a cloud warehouse in sync. In this talk, we will discuss use cases for migrating an EDW to the cloud and keeping both on-premises and cloud instances in sync with CDAP pipelines and plugins.
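As a rough illustration of what consuming change data events means, the Python sketch below replays a stream of insert, update, and delete events against a target table. This is not the CDAP plugin API: the event shape (`op`, `key`, `row`) and the function names are assumptions made for the example, and the target is a plain dict standing in for a cloud warehouse table.

```python
def apply_change(target, event):
    """Apply one change event to `target`, a dict mapping primary key -> row.

    The event format here is hypothetical: {"op": ..., "key": ..., "row": ...}.
    """
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        # Upsert: copy the row so later mutation of the event cannot alter the target.
        target[key] = dict(event["row"])
    elif op == "delete":
        # Deleting a missing key is a no-op, which keeps replays idempotent.
        target.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op!r}")


def replay(target, events):
    """Apply events in order and return the updated target."""
    for event in events:
        apply_change(target, event)
    return target


changes = [
    {"op": "insert", "key": 1, "row": {"name": "alice", "plan": "free"}},
    {"op": "update", "key": 1, "row": {"name": "alice", "plan": "pro"}},
    {"op": "insert", "key": 2, "row": {"name": "bob", "plan": "free"}},
    {"op": "delete", "key": 2},
]
cloud_table = replay({}, changes)
```

Applying events in source order is what keeps the two copies consistent; treating inserts and updates as upserts and deletes as no-ops on missing keys means a batch can be replayed safely after a failure.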


- Steven Parkes is CTO at Ascend, where he guides architecture development and has implemented many of the core abstractions for Ascend’s semantic scheduler. Prior to Ascend, he built big data infrastructure and applications at both Twitter and Square. He also has experience working with these big data systems from his roles at IBM Research, where he was able to develop against them in their early days.

- Tony Hajdari is a Customer Engineer on the Big Data specialists team at Google, where he works on the open source Big Data Application Platform CDAP. Prior to Google, he worked at Cask Data, where he was responsible for field technical services and customer enablement, helping customers build the next generation of Big Data applications with less code and greater agility.


Venue: 541 Cowper St, Palo Alto, CA 94301.

There is a parking garage directly behind the office, with free parking after 5pm. There is also street parking in front of the office that's open after 6pm.