How Spark beat Hadoop@100TB Sort: Optimize Shuffle+Network+Netty+CPU Cache aware

Hosted by Chris Fregly

Details

Location Change
Now at Big Commerce @ 685 Market St, 3rd Floor
(2 blocks away from the original location)

Special Thanks to Big Commerce!!

Agenda

6:30-7pm: Arrive and Mingle

7-7:15pm: Announcements, Quick Recap of Last Meetup

7:15-8:30pm: Deep Dive into How Spark Beat Hadoop @ 100TB Daytona GraySort Challenge (http://sortbenchmark.org/)

8:30-9pm: Q&A, De-mingle, and Leave

This is a code-level deep dive into the optimizations that allowed Spark to win the Daytona GraySort Challenge.

We'll walk through the following:

  1. Sort-based Shuffle (uses fewer OS resources)

https://issues.apache.org/jira/browse/SPARK-2045

  2. Netty-based Network module (epoll, async, ByteBuffer reuse)

https://issues.apache.org/jira/browse/SPARK-2468

  3. External Shuffle Service (also allows for auto-scaling of Worker nodes)

https://issues.apache.org/jira/browse/SPARK-3796

  4. AlphaSort-style cache locality optimizations

http://www.slideshare.net/SparkSummit/deep-dive-into-project-tungsten-josh-rosen (slide 22)

https://issues.apache.org/jira/browse/SPARK-7082

  5. https://issues.apache.org/jira/browse/SPARK-9850 (https://issues.apache.org/jira/browse/SPARK-7082)
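
As a warm-up for the AlphaSort-style cache locality topic above, here is a minimal Python sketch of the core idea: sort small (key prefix, record pointer) pairs instead of whole records, and only look at the full key when two prefixes tie. This is not Spark's actual code. The function name `prefix_sort`, the 8-byte prefix length, and the use of a list index as the "pointer" are all assumptions made for illustration. The 100-byte record with a 10-byte key follows the sort benchmark's record format. Python lists do not give real cache locality, so this only shows the comparison logic.

```python
from functools import cmp_to_key

KEY_LEN = 10     # sort benchmark records: 10-byte key, 90-byte value
PREFIX_LEN = 8   # assumed prefix size: fits in one 64-bit word


def prefix_sort(records):
    """Sort records by key using (prefix, pointer) pairs (hypothetical helper)."""
    # Build a compact array of (8-byte key prefix as an int, record index).
    # Big-endian decoding keeps integer order equal to byte order.
    entries = [
        (int.from_bytes(rec[:PREFIX_LEN], "big"), idx)
        for idx, rec in enumerate(records)
    ]

    def compare(a, b):
        # Fast path: the prefixes differ, so the full record is never touched.
        if a[0] != b[0]:
            return -1 if a[0] < b[0] else 1
        # Slow path: prefixes tie, so follow the pointers and compare full keys.
        key_a = records[a[1]][:KEY_LEN]
        key_b = records[b[1]][:KEY_LEN]
        return (key_a > key_b) - (key_a < key_b)

    entries.sort(key=cmp_to_key(compare))
    # Materialize the sorted output by following the pointers once.
    return [records[idx] for _, idx in entries]
```

With random keys, almost every comparison is resolved by the prefix alone, which is why the compact pair array is the part that matters for cache behavior in a real implementation.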

Relevant Links:

https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html

http://0x0fff.com/spark-architecture-shuffle/

http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf

AI Performance Engineering Meetup (San Francisco, Global)