November, almost Northeast monsoon season! Right before the rains set in, let's make some Sparks fly! This time we'll be hosted by AWS Singapore, so join us on November 6th!
Two talks on the agenda:
- "Working effectively with Apache Spark on AWS" - during this demonstration-driven talk, you'll learn reference architectures for data engineering, data science, and machine learning use cases powered by Apache Spark on AWS. The talk will cover the following AWS services: SageMaker, Glue, Athena, Redshift, RDS, and ephemeral EC2 Spot and On-Demand instances. The demo relies on a regular AWS account in our preferred local region (ap-southeast-1) with an existing VPC containing data sources that Apache Spark integrates with. As a follow-up to this demo, you'll be able to repeat the same steps in your own AWS account.
- "Dynamic Partition Pruning and Apache Spark 3.0 File Sources" - in the Spark 3.0 release, all the built-in file source connectors (including Parquet, ORC, JSON, Avro, CSV, and Text) are re-implemented using the new Data Source API V2. We'll give a technical overview of how Spark reads and writes these file formats based on user-specified data layouts. We'll also present Dynamic Partition Pruning, a mechanism that prunes partitions at runtime by reusing the broadcast results of the dimension table in hash joins, and which shows significant improvements for most TPC-DS queries.
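To get a feel for the second talk's topic before the meetup: in Spark 3.0, Dynamic Partition Pruning is controlled by the `spark.sql.optimizer.dynamicPartitionPruning.enabled` configuration and kicks in for star-schema joins where the filter sits on the dimension side. A minimal sketch, assuming hypothetical tables `sales` (a fact table partitioned by `sale_date`) and `dates` (a small dimension table):

```sql
-- Dynamic Partition Pruning is enabled by default in Spark 3.0:
SET spark.sql.optimizer.dynamicPartitionPruning.enabled = true;

-- The filter is on the dimension table only; `sales` has no static
-- predicate on its partition column.
SELECT s.item_id, SUM(s.amount)
FROM sales s
JOIN dates d ON s.sale_date = d.sale_date
WHERE d.year = 2019
GROUP BY s.item_id;

-- At runtime, Spark reuses the broadcast result of `dates` as a filter
-- on `sales.sale_date`, so only the matching partitions are scanned.
```

Without this optimization, the scan of `sales` would read every partition; with it, the partition list is narrowed after the dimension-side filter has been evaluated.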
Speakers for Talk #1:
• Arseny Chernov joined Databricks in 2018 and is the APJ leader for Partner Solutions Architecture, based out of Singapore.
Speaker for Talk #2:
• Alena Melnikova, Data Engineer at Refinitiv, is an inspiring Apache Spark practitioner based in Singapore who works on challenging batch and streaming data pipelines.
Don't forget to join our Slack workspace here: https://dbricks.co/sparkslackapj (that's an invite link) -- we'll use it for follow-ups and Q&A.