BDAM 11/21: Building data lineage; Running Spark with Alluxio; Data Mesh

Big Data Application Meetup


After a break of a few months, we look forward to welcoming everyone to the next BDAM meetup on Thursday, November 21st, in Mountain View!

We are grateful to Google for hosting this gathering.

We are hosting this event in collaboration with the SF Big Analytics Meetup group.

Food and drinks included!

Google Mountain View office - Room Araujo
US-MTV[masked] Charleston Rd
Mountain View, CA 94043

06:00 - 06:30 - Registration and networking
06:30 - 07:00 - Talk 1
07:00 - 07:30 - Talk 2
07:30 - 08:00 - Talk 3
08:00 - 08:30 - Socializing and networking

*Talk 1*
Fine grained root cause and impact analysis with CDAP Lineage

Lineage is a critical aspect of data governance in large enterprises, and provides traceability for data as it flows through a data system. It can unlock various use cases such as root cause analysis (discover the cause of a bad data event) and impact analysis (gauge the impact of a change before making the change). In this talk, the speaker will demonstrate how CDAP’s granular data lineage capabilities can solve these use cases for enterprises.

Speaker: Yuki Jung (Google)
Yuki is a software engineer at Google Cloud, where she is working on the open source Big Data Platform CDAP. Prior to Google, she worked on science content development at Khan Academy.

*Talk 2*
How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh

Many enterprises are investing in their next-generation data lake, hoping to democratize data at scale, provide business insights, and ultimately make automated intelligent decisions. Data platforms based on the data lake architecture have common failure modes that lead to unfulfilled promises at scale. To address these failure modes, we need to shift away from the centralized paradigm of the lake, or its predecessor, the data warehouse, toward a paradigm that draws from modern distributed architecture: considering domains as the first-class concern, applying platform thinking to create self-serve data infrastructure, and treating data as a product.

Speaker: Zhamak Dehghani (ThoughtWorks)
Zhamak is a principal technology consultant at ThoughtWorks, focusing on distributed systems architecture and digital platform strategy for enterprises. She is a member of the ThoughtWorks Technology Advisory Board and contributes to the creation of the ThoughtWorks Technology Radar.

*Talk 3*
Accelerating workloads and bursting remote data with Google Dataproc using Alluxio

Google Cloud Dataproc is a popular managed, on-demand service for running Spark, Presto, and many other compute workloads. Alluxio, an open source data orchestration technology, helps speed up Dataproc workloads by providing a distributed caching layer within the Dataproc cluster. In addition, Alluxio enables “Zero-copy” bursting, allowing users to run compute workloads even on remote data, whether on-prem or in another cloud. In this session, Dipti from Alluxio and Roderick from Google Cloud will give an overview of Alluxio and Google Dataproc and the benefits of using the two together. The session will include a demo of initializing a Dataproc cluster with Alluxio to run workloads on remote data.

Speakers: Dipti Borkar (Alluxio) & Roderick Yao (Google)
Dipti is VP, Products at Alluxio. She has deep experience in data and database technology, both relational and non-relational. Prior to Alluxio, Dipti was VP of Product Marketing at Kinetica and Couchbase. Earlier, she managed development teams at IBM DB2, where she started her career as a database software engineer. Dipti holds an M.S. in Computer Science from UC San Diego and an MBA from the Haas School of Business at UC Berkeley.

Roderick Yao is a Strategic Cloud Engineer at Google. He focuses on designing innovative solutions that help Google Cloud customers build and manage data pipelines and migrate data to Google.