What we're about
This meetup is focused on Data Science on AWS as well as open source AI/ML technologies.
Upcoming events (4+)
RSVP Webinar: https://www.eventbrite.com/e/webinarkubeflow-tensorflow-tfx-pytorch-gpu-spark-ml-amazonsagemaker-tickets-45852865154
Talk #0: Introductions and Meetup Announcements By Chris Fregly and Antje Barth
Talk #1: Optimizing large-scale, distributed training jobs using Nvidia GPUs, DeepSpeed ZeRO, and Kubernetes on AWS
by Justin Chiu, Software Engineer @ Amazon Alexa AI
Most modern natural-language-processing applications are built on top of pretrained language models, which encode the probabilities of word sequences for entire languages. These models contain billions - or even trillions - of parameters.
Training these models within a reasonable amount of time requires very large computing clusters - often with GPUs. Communication between the GPUs needs to be carefully managed to avoid performance bottlenecks.
In this talk, we will discuss techniques to optimize large-scale training jobs on cloud-based hardware using Nvidia GPUs and Kubernetes on AWS. The following steps will be covered:
(1) [Basic infrastructure] Profile NCCL bandwidth to confirm that you are getting ~100 Gbps all-reduce bandwidth on p3dn and ~350 Gbps all-reduce bandwidth on p4d. This confirms that your EKS-EFA setup (https://github.com/aws-samples/aws-efa-eks) is correct, along with other important EKS/EC2 settings such as cluster placement groups. For instructions on how to run the profiling, see: https://github.com/NVIDIA/nccl-tests
(2) [Training code and DNN framework settings] Once the above is done, confirm that the training throughput, measured in TFLOPS/GPU or samples/sec, matches expectations. The expected value depends on the model size, the input batch size, and the hardware.
Note: If (2) is successful, then you're good. If not, go back to (1) and optimize the NCCL bandwidth first, which helps isolate your problem.
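The two checks above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the talk. The bus-bandwidth formula is the one nccl-tests documents for all-reduce; the function names, the 0.9 tolerance, and the "6 FLOPs per parameter per token" rule of thumb are assumptions added here. It also assumes the ~100 Gbps and ~350 Gbps figures refer to bus bandwidth, which the description does not state explicitly.

```python
"""Sanity checks for steps (1) and (2); an illustrative sketch only."""

# Approximate all-reduce targets quoted in the talk description, in Gbps.
EXPECTED_ALLREDUCE_GBPS = {"p3dn": 100.0, "p4d": 350.0}


def allreduce_bus_bandwidth_gbps(message_bytes, seconds, num_ranks):
    """Bus bandwidth for all-reduce, as defined by nccl-tests.

    algbw = bytes / time, and busbw = algbw * 2 * (n - 1) / n.
    The result is converted to gigabits per second (1 Gb = 1e9 bits).
    """
    algbw_bytes_per_s = message_bytes / seconds
    busbw_bytes_per_s = algbw_bytes_per_s * 2 * (num_ranks - 1) / num_ranks
    return busbw_bytes_per_s * 8 / 1e9


def meets_target(measured_gbps, instance_family, tolerance=0.9):
    """True if the measurement reaches `tolerance` of the quoted target.

    The 0.9 default is an arbitrary, hypothetical threshold.
    """
    return measured_gbps >= tolerance * EXPECTED_ALLREDUCE_GBPS[instance_family]


def approx_tflops_per_gpu(num_params, tokens_per_second, num_gpus):
    """Rough training throughput for a dense transformer.

    Assumes about 6 FLOPs per parameter per token (forward + backward),
    ignoring attention FLOPs and any activation recomputation.
    """
    return 6 * num_params * tokens_per_second / num_gpus / 1e12


if __name__ == "__main__":
    # Made-up measurement: 1 GB all-reduce across 16 ranks in 0.16 s.
    busbw = allreduce_bus_bandwidth_gbps(1e9, 0.16, 16)
    print(f"bus bandwidth: {busbw:.2f} Gbps")
    print("p3dn target met:", meets_target(busbw, "p3dn"))
    # Made-up training run: 1.5B parameters, 100k tokens/s on 64 GPUs.
    print(f"{approx_tflops_per_gpu(1.5e9, 100_000, 64):.2f} TFLOPS/GPU")
```

nccl-tests reports bandwidth in GB/s, so its numbers need the same factor of 8 before comparing them with targets quoted in Gbps.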
References:
- https://www.amazon.science/blog/making-deepspeed-zero-run-efficiently-on-more-affordable-hardware
- https://github.com/aws-samples/aws-efa-eks
- https://github.com/NVIDIA/nccl-tests
Talk #2: Modin - Speed up your Pandas workflows by changing a single line of code
by Alejandro Herrera, Solution Architect at Ponder
Modin is a drop-in replacement for pandas. While pandas is single-threaded, Modin lets you instantly speed up your workflows by scaling pandas so it uses all of your cores. Modin works especially well on larger datasets, where pandas becomes painfully slow or runs out of memory.
GitHub: https://github.com/modin-project/modin
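As a minimal sketch of the "single line of code" change, the usual pandas import is swapped for `modin.pandas` and the rest of the workflow stays the same. The `try`/`except` fallback, the `summarize` helper, and the sample data are hypothetical additions so the example still runs where Modin is not installed; they are not part of Modin itself.

```python
try:
    import modin.pandas as pd  # the one-line change: was `import pandas as pd`
except ImportError:
    import pandas as pd  # fallback (assumption) so the sketch runs without Modin


def summarize(rows):
    """Sum scores per team using the ordinary pandas API."""
    df = pd.DataFrame(rows)
    return df.groupby("team")["score"].sum().to_dict()


if __name__ == "__main__":
    # Hypothetical sample data; a real speedup only shows on large datasets.
    rows = [
        {"team": "a", "score": 1},
        {"team": "b", "score": 5},
        {"team": "a", "score": 2},
    ]
    print(summarize(rows))
```

On a dataset this small Modin's setup overhead will likely outweigh any gain; the point is only that the calling code does not change.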
Zoom link: https://us02web.zoom.us/j/82308186562
Related Links
O'Reilly Book: https://www.amazon.com/dp/1492079391/
Website: https://datascienceonaws.com
Meetup: https://meetup.datascienceonaws.com
GitHub Repo: https://github.com/data-science-on-aws/
YouTube: https://youtube.datascienceonaws.com
Slideshare: https://slideshare.datascienceonaws.com
- 49 attendees
RSVP Webinar: https://www.eventbrite.com/e/webinarkubeflow-tensorflow-tfx-pytorch-gpu-spark-ml-amazonsagemaker-tickets-45852865154
Talk #0: Introductions and Meetup Announcements By Chris Fregly and Antje Barth
Talk #1: TBD
Talk #2: TBD
Zoom link: https://us02web.zoom.us/j/82308186562
- 13 attendees
RSVP Webinar: https://www.eventbrite.com/e/webinarkubeflow-tensorflow-tfx-pytorch-gpu-spark-ml-amazonsagemaker-tickets-45852865154
Talk #0: Introductions and Meetup Announcements By Chris Fregly and Antje Barth
Talk #1: TBD
Talk #2: TBD
Zoom link: https://us02web.zoom.us/j/82308186562
- 8 attendees
RSVP Webinar: https://www.eventbrite.com/e/webinarkubeflow-tensorflow-tfx-pytorch-gpu-spark-ml-amazonsagemaker-tickets-45852865154
Talk #0: Introductions and Meetup Announcements By Chris Fregly and Antje Barth
Talk #1: TBD
Talk #2: TBD
Zoom link: https://us02web.zoom.us/j/82308186562
- 18 attendees
Past events (315)
- 63 attendees