Spark on AWS - Best practices & lessons learned

Details

A new talk on best practices & lessons learned from using Spark on AWS.
The main topics are:

  • Spark & S3
  • AWS Data Pipeline
  • Zeppelin: Setup & workarounds
  • Connecting AWS SageMaker & Spark
  • Metadata management: AWS Glue

We cover general considerations for structuring your data when storing to & reading from S3. We show how to use AWS Data Pipeline to work around some of the current limitations caused by Hadoop's S3 library and S3's eventual consistency, and we look at the improvements in the recent release of the Hadoop library.
We present our approach to using Zeppelin in a multi-user environment and how we bootstrap and stabilize it.
AWS SageMaker, an interesting new service focused on machine learning, launched this year. We will show how to connect its Jupyter notebooks to Spark on EMR and discuss how it differs from Zeppelin.
Finally, we will take a quick look at another new service, AWS Glue, and why you should use it.

Bio:
Lars Haferkamp works as a Data Engineer at comSysto Reply. For three years he has worked in teams focused on analyzing massive amounts of sensor data with Spark on AWS and building platforms for data scientists.

AI Performance Engineering Meetup (Munich)