
Using Apache Spark for Mastering Customer Data

Hosted By
Joe C.

Details

Caserta Concepts and Databricks address the number one operational and analytic goal of nearly every organization today: to have a complete view of every customer. Customer Data Integration (CDI) must be implemented to cleanse and match customer identities within and across various data systems. CDI has been a long-standing data engineering challenge, not just one of logic and complexity but also of performance and scalability. The speakers bring together best-practice techniques with Apache Spark to achieve complete CDI.

Speakers:

Joe Caserta, President, Caserta Concepts

Kevin Rasmussen, Big Data Engineer, Caserta Concepts

Vida Ha, Lead Solutions Engineer, Databricks

The session covers a series of problems that are adequately solved with Apache Spark, as well as those that require additional technologies to implement correctly. Topics include:

· Building an end-to-end CDI pipeline in Apache Spark

· What works, what doesn’t, and how we use Spark as we evolve

· Innovation with Spark including methods for customer matching from statistical patterns, geolocation, and behavior

· Using PySpark and Python’s rich module ecosystem for data cleansing, standardization, and matching

· Using GraphX for matching and scalable clustering

· Analyzing large data files with Spark

· Using Spark for ETL on large datasets

· Applying Machine Learning & Data Science to large datasets

· Connecting BI/Visualization tools to Apache Spark to analyze large datasets internally
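
To make the cleanse, match, and cluster steps named in the topics above concrete, here is a minimal sketch in plain Python. It is not the speakers' pipeline and it does not use Spark. The record schema (id, name, email, phone), the exact-match rule on email or phone, and all function names are assumptions made for illustration. At scale, the clustering step corresponds to what a connected-components computation in GraphX would do.

```python
import re
from collections import defaultdict


def standardize(record):
    """Cleanse one record. The schema (id, name, email, phone) is hypothetical."""
    name = re.sub(r"[^a-z ]", "", record["name"].lower())
    name = " ".join(name.split())                # collapse repeated whitespace
    email = record["email"].strip().lower()
    phone = re.sub(r"\D", "", record["phone"])   # keep digits only
    return {"id": record["id"], "name": name, "email": email, "phone": phone}


def match_keys(rec):
    """Assumed matching rule: records sharing an email or a phone match."""
    keys = []
    if rec["email"]:
        keys.append(("email", rec["email"]))
    if rec["phone"]:
        keys.append(("phone", rec["phone"]))
    return keys


def cluster(records):
    """Group record ids into customers via connected components (union-find)."""
    cleaned = [standardize(r) for r in records]
    parent = {r["id"]: r["id"] for r in cleaned}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]        # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    first_seen = {}                              # match key -> first record id
    for r in cleaned:
        for key in match_keys(r):
            if key in first_seen:
                union(first_seen[key], r["id"])
            else:
                first_seen[key] = r["id"]

    groups = defaultdict(list)
    for r in cleaned:
        groups[find(r["id"])].append(r["id"])
    return sorted(sorted(g) for g in groups.values())
```

Given three records where the first shares an email with the second and a phone number with the third, cluster() places all three ids in one group even though the second and third share nothing directly. That transitive linking is the reason a graph formulation fits customer matching.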

We’ll also touch on Data Governance, on-boarding new data rapidly, and how to balance agility and time to market with critical decision support and customer interaction. We’ll also share examples of problems that Apache Spark is not optimized for.

Big Data Warehousing
AWS Pop-Up Loft
350 West Broadway · New York, NY