[MtnView] How Spark beat Hadoop@100TB Sort: Optimize Shuffle+Network+Netty+CPU

Name: [MtnView] How Spark beat Hadoop@100TB Sort: Optimize Shuffle+Network+Netty+CPU
Start: 2015-10-07T18:30:00-07:00
End: 2015-10-07T21:00:00-07:00
Location: Silicon Valley Data Science

Hosted by Chris F.

AI Performance Engineering Meetup (San Francisco, Global)

Details

Location

Silicon Valley Data Science, Mountain View

Thanks, SVDS!!

Agenda

6:30-7pm: Arrive and Mingle

7-7:15pm: Announcements, Quick Recap of Last Meetup

7:15pm-8:30pm: Deep Dive into How Spark Beat Hadoop @ 100TB Daytona GraySort Challenge (http://sortbenchmark.org/).

8:30pm-9pm: Q&A, De-mingle, and Leave

Details

I'll be giving a quick preview of my Oct 12th London Spark Meetup Talk (https://www.meetup.com/Advanced-Apache-Spark-Meetup/events/225815012/) on Project Tungsten. I'm giving this talk on Nov 12th in SF (https://www.meetup.com/Advanced-Apache-Spark-Meetup/events/223666812/), as well as down the peninsula shortly after, assuming we can find a host down that way. Please email me at chris@fregly.com if you're interested in hosting!

We'll cover Tungsten's "bare metal" approach to performance optimizations, including mechanical sympathy, CPU cache hierarchy awareness, Direct Cache Access (DCA), the MESI protocol for keeping CPU caches coherent across processors, cores, and threads, Linux perf for analyzing CPU data-cache misses, optimizing matrix multiplication to minimize CPU cache line misses, and a bunch of other low-level sweetness.
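To give a flavor of the matrix multiplication topic, here is a toy model (not code from the talk). It counts how often consecutive reads of B[k][j] land on a different cache line under two loop orders. The 64-byte line size, the 8-byte elements, the row-major layout, and the function name `line_changes` are all assumptions made for illustration; a real analysis would use Linux perf counters on compiled code.

```python
LINE_BYTES = 64  # assumption: 64-byte cache lines
ELEM_BYTES = 8   # assumption: 8-byte doubles stored row-major


def line_changes(n, order):
    """Count reads of B[k][j] that hit a different cache line than the previous read.

    This models only the access pattern of an n x n multiply C = A * B,
    not a real cache (no capacity, associativity, or prefetching).
    """
    changes = 0
    prev_line = None

    def touch(k, j):
        nonlocal changes, prev_line
        # Byte offset of B[k][j] in a row-major array, mapped to its cache line.
        line = (k * n + j) * ELEM_BYTES // LINE_BYTES
        if line != prev_line:
            changes += 1
            prev_line = line

    for i in range(n):
        if order == "ijk":
            # Inner loop walks down a column of B: stride of n elements.
            for j in range(n):
                for k in range(n):
                    touch(k, j)
        elif order == "ikj":
            # Inner loop walks along a row of B: stride of 1 element.
            for k in range(n):
                for j in range(n):
                    touch(k, j)
        else:
            raise ValueError("order must be 'ijk' or 'ikj'")
    return changes
```

Under this model, `line_changes(16, "ijk")` counts 4096 line changes while `line_changes(16, "ikj")` counts 512, because the interchanged loop reads eight neighboring elements from each line before moving on.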

This will be a hard-core session with demos and lots of audience participation, so please come ready with questions and comedy.

Code-level Deep Dive into the optimizations that allowed Spark to win the Daytona GraySort Challenge.

We'll discuss the following at a code level:

Sort-based Shuffle (less OS resources)

https://issues.apache.org/jira/browse/SPARK-2045

Netty-based Network module (epoll, async, ByteBuffer reuse)

https://issues.apache.org/jira/browse/SPARK-2468

External Shuffle Service (also allows for auto-scaling of Worker nodes)

https://issues.apache.org/jira/browse/SPARK-3796

AlphaSort style cache locality optimizations

http://www.slideshare.net/SparkSummit/deep-dive-into-project-tungsten-josh-rosen (slide 22)

https://issues.apache.org/jira/browse/SPARK-7082

https://issues.apache.org/jira/browse/SPARK-9850 (https://issues.apache.org/jira/browse/SPARK-7082)
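As a rough illustration of the AlphaSort-style item above, the sketch below sorts a compact array of (key prefix, record index) pairs and only touches the full record when two prefixes tie. It is a minimal Python sketch of the general idea, not Spark's implementation: the 8-byte prefix width and the names `key_prefix` and `prefix_sort` are assumptions chosen for illustration, and Python itself will not show the cache benefit.

```python
from functools import cmp_to_key

PREFIX_BYTES = 8  # assumption: fixed 8-byte key prefix


def key_prefix(key: bytes) -> int:
    """Pack the first PREFIX_BYTES of the key into an int (zero-padded, big-endian)."""
    return int.from_bytes(key[:PREFIX_BYTES].ljust(PREFIX_BYTES, b"\x00"), "big")


def prefix_sort(records):
    """Sort (key, value) records by key using (prefix, index) pairs.

    Most comparisons are decided by the small prefix alone; the full key
    is fetched from the record only when two prefixes are equal.
    """
    entries = [(key_prefix(key), i) for i, (key, _) in enumerate(records)]

    def compare(a, b):
        if a[0] != b[0]:
            return -1 if a[0] < b[0] else 1
        # Prefix tie: fall back to comparing the full keys.
        key_a, key_b = records[a[1]][0], records[b[1]][0]
        return (key_a > key_b) - (key_a < key_b)

    entries.sort(key=cmp_to_key(compare))
    return [records[i] for _, i in entries]
```

For example, `prefix_sort([(b"pear", 1), (b"apple", 2)])` returns the `b"apple"` record first. In a lower-level language the (prefix, pointer) pairs sit in one contiguous array, so the sort rarely has to chase a pointer to a record elsewhere in memory.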

Relevant Links

https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html

http://0x0fff.com/spark-architecture-shuffle/

http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
