• [Demo+Webinar] Optimizing distributed training with GPUs/K8s & Scaling pandas with Modin

    Online - See Details Below

    RSVP Webinar: https://www.eventbrite.com/e/webinarkubeflow-tensorflow-tfx-pytorch-gpu-spark-ml-amazonsagemaker-tickets-45852865154

    Talk #0: Introductions and Meetup Announcements By Chris Fregly and Antje Barth

    Talk #1: Optimizing large-scale, distributed training jobs using Nvidia GPUs, DeepSpeed ZeRO, and Kubernetes on AWS

    by Justin Chiu, Software Engineer @ Amazon Alexa AI

    Most modern natural-language-processing applications are built on top of pretrained language models, which encode the probabilities of word sequences for entire languages. These models contain billions - or even trillions - of parameters. Training these models within a reasonable amount of time requires very large computing clusters - often with GPUs. Communication between the GPUs needs to be carefully managed to avoid performance bottlenecks. In this talk, we will discuss techniques to optimize large-scale training jobs on cloud-based hardware using Nvidia GPUs and Kubernetes on AWS. The following steps will be covered:

    (1) [Basic infrastructure] Profile NCCL bandwidth to confirm that you are getting ~100 Gbps all-reduce bandwidth on p3dn and ~350 Gbps all-reduce bandwidth on p4d. This confirms that your EKS-EFA setup (https://github.com/aws-samples/aws-efa-eks) is correct, along with other important EKS/EC2 settings such as cluster placement groups. For how to run the profile, see https://github.com/NVIDIA/nccl-tests

    (2) [Training code and DNN framework settings] Once (1) is done, confirm that the training throughput, measured in TFLOPS/GPU or samples/sec, matches expectations. The expected value depends on the model size, the input batch size, and the hardware.

    Note: If (2) is successful, you're good. If not, go back to (1) and optimize the NCCL bandwidth first, which helps isolate the problem.
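
    The arithmetic behind steps (1) and (2) can be sketched in a few lines of Python. This is a rough illustration, not the speaker's tooling. It assumes the bus-bandwidth convention documented in the nccl-tests repository (busbw = algbw * 2 * (n - 1) / n for all-reduce) and the common rule of thumb of about 6 FLOPs per parameter per token for training a dense transformer; neither comes from the talk. The function names, the 90% tolerance, and the example numbers are hypothetical. The ~100 Gbps and ~350 Gbps targets are the ones quoted above; the abstract does not say whether they are bus or algorithm bandwidth, so the comparison is illustrative.

```python
"""Back-of-the-envelope checks for steps (1) and (2). Illustrative only."""

# Targets quoted in the talk abstract (Gbps, all-reduce).
TARGET_GBPS = {"p3dn": 100.0, "p4d": 350.0}


def allreduce_busbw_gbps(message_bytes: float, seconds: float, num_ranks: int) -> float:
    """Bus bandwidth of one all-reduce in gigabits per second.

    Convention from the nccl-tests docs (an assumption here, not from the talk):
        algbw = bytes / time
        busbw = algbw * 2 * (n - 1) / n
    """
    if num_ranks < 2 or seconds <= 0:
        raise ValueError("need at least 2 ranks and a positive duration")
    algbw_gbytes = message_bytes / seconds / 1e9
    busbw_gbytes = algbw_gbytes * 2 * (num_ranks - 1) / num_ranks
    return busbw_gbytes * 8  # gigabytes/s -> gigabits/s


def meets_target(measured_gbps: float, instance: str, tolerance: float = 0.9) -> bool:
    """True if the measurement reaches `tolerance` of the target (hypothetical cutoff)."""
    return measured_gbps >= tolerance * TARGET_GBPS[instance]


def samples_per_sec(global_batch_size: int, step_seconds: float) -> float:
    """Training throughput in samples per second."""
    return global_batch_size / step_seconds


def approx_tflops_per_gpu(num_params: float, tokens_per_step: float,
                          step_seconds: float, num_gpus: int) -> float:
    """Rough achieved TFLOPS per GPU for a dense transformer.

    Assumes ~6 FLOPs per parameter per token (forward + backward); this
    ignores attention FLOPs and activation recomputation.
    """
    flops_per_step = 6 * num_params * tokens_per_step
    return flops_per_step / step_seconds / num_gpus / 1e12


if __name__ == "__main__":
    # Hypothetical measurement: 1 GB all-reduce across 16 ranks in 62.5 ms.
    busbw = allreduce_busbw_gbps(1e9, 0.0625, 16)
    print(f"bus bandwidth: {busbw:.1f} Gbps, p4d target met: {meets_target(busbw, 'p4d')}")

    # Hypothetical job: 1.5B parameters, 512 sequences of 1024 tokens per step,
    # 2 s per step on 64 GPUs.
    print(f"throughput: {samples_per_sec(512, 2.0):.1f} samples/sec")
    print(f"compute: {approx_tflops_per_gpu(1.5e9, 512 * 1024, 2.0, 64):.1f} TFLOPS/GPU")
```

    In practice the message size and time would come from an nccl-tests all-reduce run, and the step time from the training logs. A bandwidth result well below target points at the infrastructure in (1) rather than the training code in (2).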

    Talk #2: Modin - Speed up your Pandas workflows by changing a single line of code

    by Alejandro Herrera, Solution Architect at Ponder

    Modin is a drop-in replacement for pandas. While pandas is single-threaded, Modin lets you instantly speed up your workflows by scaling pandas so it uses all of your cores. Modin works especially well on larger datasets, where pandas becomes painfully slow or runs out of memory.

    GitHub: https://github.com/modin-project/modin
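
    The "single line" in the talk title is the import. A minimal sketch, assuming Modin and one of its execution engines are installed; the `total_by_key` helper and the sample rows are hypothetical and exist only to show that the rest of the code uses the ordinary pandas API. The fallback to plain pandas is there only so the sketch still runs where Modin is absent.

```python
# With Modin installed, the only change to an existing script is to replace
#     import pandas as pd
# with
#     import modin.pandas as pd
try:
    import modin.pandas as pd
except ImportError:
    import pandas as pd  # fallback so the sketch runs without Modin


def total_by_key(rows):
    """Sum the 'value' column per 'key' using the regular DataFrame API."""
    df = pd.DataFrame(rows)
    totals = df.groupby("key")["value"].sum()
    # Convert to plain Python types so the result is engine-independent.
    return {str(k): int(v) for k, v in totals.to_dict().items()}


if __name__ == "__main__":
    rows = [
        {"key": "a", "value": 1},
        {"key": "b", "value": 2},
        {"key": "a", "value": 3},
    ]
    print(total_by_key(rows))
```

    On a dataset this small the parallel engine brings no benefit and may add startup overhead. The gains described above apply to larger datasets.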

    Zoom link: https://us02web.zoom.us/j/82308186562

    Related Links

    O'Reilly Book: https://www.amazon.com/dp/1492079391/
    Website: https://datascienceonaws.com
    Meetup: https://meetup.datascienceonaws.com
    GitHub Repo: https://github.com/data-science-on-aws/
    YouTube: https://youtube.datascienceonaws.com
    Slideshare: https://slideshare.datascienceonaws.com

  • [Webinar] Data Science on AWS Monthly Webinar

    Online - See Details Below

    RSVP Webinar: https://www.eventbrite.com/e/webinarkubeflow-tensorflow-tfx-pytorch-gpu-spark-ml-amazonsagemaker-tickets-45852865154

    Talk #0: Introductions and Meetup Announcements By Chris Fregly and Antje Barth

    Talk #1: TBD

    Talk #2: TBD

    Zoom link: https://us02web.zoom.us/j/82308186562
