This is a group for anyone interested in learning how to use and optimize Spark. All skill levels are welcome. We started this group to meet other Spark enthusiasts, and we look forward to learning together.
Join us for Brian Clapper's introduction to ETL with Apache Spark.
This presentation will be a notebook-based demonstration (with some live coding) of basic ETL in Apache Spark. Code will be presented in a mixture of Python and Scala. We’ll take files in a few formats (CSV and JSON, for example), use Spark to clean them up a bit, and then load them into a Parquet-based data lake. (Though, as lakes go, it’ll be a fairly small one.) If time permits, we may also dive into Delta Lake, a newly open-sourced Spark add-on that provides ACID guarantees on top of Parquet.