Challenges that everyone struggles with while productionizing Apache Spark

Hosted By
Sean G. and 3 others
Details

Hi Scalators,

Join us at 500px on July 24th for a talk about productionizing Spark by Chetan Khatri! Details below.

Talk: Challenges that everyone struggles with while productionizing Apache Spark workloads

Description:

Spark is a good tool for processing large amounts of data, but there are many pitfalls to avoid when building large-scale systems in production. This talk will help you understand the kinds of challenges you face when you productionize Spark for terabytes of data. It will guide you through practical use cases, with best-practice solutions for Fast Data processing.

Detailed Description:

This talk covers:

  1. Primary data structures (RDD, Dataset, DataFrame)
  2. A pragmatic explanation of executors, cores, containers, stages, jobs, and tasks in Spark.
  3. Parallel read from JDBC: Challenges and best practices.
  4. Bulk Load API vs JDBC write
  5. An optimization strategy for joins: SortMergeJoin vs BroadcastHashJoin
  6. Avoiding unnecessary shuffles, and using coalesce, repartition, and HashPartitioner, with use cases. Impact on cache, disk I/O, memory leaks, internal shuffle, the Spark executor, and the Spark driver.
  7. What to do when Spark’s default sort doesn’t work, and the alternatives.
  8. Why dropDuplicates() doesn’t give consistent results, and what the alternative is.
  9. Optimizing the Spark stage generation plan: reducing unnecessary repeated actions.
  10. Predicate pushdown with partitioning and bucketing.
  11. Why not to use Scala’s concurrent ‘Future’ explicitly with Spark jobs.

Target audience: Mid-level

  1. Anyone who understands basic functional programming with Scala or has an understanding of Java.
  2. Anyone who understands concurrent programming or multithreading in Java / Scala.
  3. Anyone who is interested in distributed data processing and keen on data scaling optimization.
  4. Anyone who has worked with Big Data or Fast Data, or has a keen interest in them.

Speaker Bio:

Chetan Khatri is a Lead, Data Science at Accion Labs. He is an open source contributor to Apache Spark, Apache HBase, the Apache Spark - HBase Connector, and many other open source projects. He has authored curricula for Artificial Intelligence, Data Science, and Distributed Computing at KSKV Kachchh University, Government of Gujarat, India. He has delivered talks at Scala.IO, HBaseConAsia, HKOSCon, FossAsia, and PyCon India.

Scala Toronto
20 Duncan St · Toronto, ON