Building data lineage, data orchestration and data mesh
Details
We are jointly organizing this event with Google. It focuses on how to build a data platform: data lineage, data lake/mesh, and data storage.
The event will be held in the Araujo room at the Mountain View office.
Agenda
- 06:00 - 06:25 - Reception and networking
- 06:25 - 06:30 - Introductions
- 06:30 - 07:00 - talk 1 (Google)
- 07:00 - 07:30 - talk 2 (ThoughtWorks)
- 07:30 - 08:00 - talk 3 (Alluxio & Google)
- 08:00 - 08:30 - Social time
Talk1: Fine grained root cause and impact analysis with CDAP Lineage
Lineage is a critical aspect of data governance in large enterprises, and provides traceability for data as it flows through a data system. It can unlock various use cases such as root cause analysis (discover the cause of a bad data event) and impact analysis (gauge the impact of a change before making the change). In this talk, the speaker will demonstrate how CDAP’s granular data lineage capabilities can solve these use cases for enterprises.
Speaker: Yuki Jung - Google
Yuki is a software engineer at Google Cloud, where she is working on the open source Big Data Platform CDAP. Prior to Google, she worked on science content development at Khan Academy.
Talk2: How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh
Many enterprises are investing in their next generation data lake, with the hope of democratizing data at scale to provide business insights and ultimately make automated intelligent decisions. Data platforms based on the data lake architecture have common failure modes that lead to unfulfilled promises at scale. To address these failure modes, we need to shift away from the centralized paradigm of a lake, or its predecessor, the data warehouse. We need to shift to a paradigm that draws from modern distributed architecture: considering domains as the first-class concern, applying platform thinking to create self-serve data infrastructure, and treating data as a product.
Speaker: Zhamak Dehghani (ThoughtWorks)
Zhamak is a principal technology consultant at ThoughtWorks with a focus on distributed systems architecture and digital platform strategy in the enterprise. She is a member of the ThoughtWorks Technology Advisory Board and contributes to the creation of the ThoughtWorks Technology Radar.
Talk3: Accelerating workloads and bursting remote data with Google Dataproc using Alluxio
Google Cloud Dataproc is a popular managed on-demand service to run Spark, Presto and many other compute workloads. Alluxio, an open source data orchestration technology, helps speed up Dataproc workloads by providing a distributed caching layer within the Dataproc cluster. In addition, Alluxio enables “zero-copy” bursting, allowing users to run compute workloads even on remote data, whether it is on-prem or in another cloud. In this session, Dipti from Alluxio and Roderick from Google Cloud will give an overview of Alluxio and Google Dataproc and the benefits of using the two together. The session will include a demo of initializing a Dataproc cluster with Alluxio to run workloads on remote data.
Speakers: Dipti Borkar (Alluxio) & Roderick Yao (Google)
Dipti is VP, Products at Alluxio. She has deep experience in data and database technology, both relational and non-relational. Prior to Alluxio, Dipti was VP of Product Marketing at Kinetica and Couchbase. At Couchbase she held several leadership positions, including Head of Global Technical Sales and Head of Product Management. Dipti holds an M.S. in Computer Science from UC San Diego and an MBA from UC Berkeley.
Roderick Yao is a Strategic Cloud Engineer at Google. His focus is designing innovative solutions for Google Cloud customers to build and manage data pipelines and data migration to Google. Prior to Google, he was a Senior Solutions Consultant at Cloudera, where he drove solution architecture, helping Fortune 500 companies with their Hadoop deployments. Roderick has a BS from South China University of Technology and an MS from Bentley College.

