Experience Report: Large Data Processing in a Managed Language

Details
Abstract
The Hail project was founded to enable genetics research on data sets hundreds of terabytes in size. Three years ago we picked up Scala and Spark as the seemingly obvious foundation on which to build an expressive framework for processing genetics data. The project has since grown and generalized, becoming an embedded DSL in Python, much like pandas or NumPy. The long-term goal is now a Python DSL for SQL-like and linear-algebraic methods on matrix-structured data, backed by a compiler / query planner and a distributed runtime.
Unfortunately, on this journey we ran into challenges that seem fundamentally tied to our choice of a managed, immutable-first language. In this talk, I’ll give a general overview of the system and explain why DSLs and compiler technology are important to its success. I hope the bulk of the talk will cover the challenges we faced getting reasonable performance out of Scala and Spark, along with some of our solutions to those challenges. I think many of these challenges are not unique to Scala or the JVM but are fundamental to managed languages. I suspect this audience in particular will question the choices we’ve made, and I’m eager to foster a discussion of how one can best tackle these issues in managed languages like Haskell and Scala.
Bio
Dan King was an undergraduate at Northeastern University and briefly attempted a PhD in programming languages at Harvard before dropping out. He has found his way to the Broad Institute, where he works on a DSL and distributed system for SQL-like and linear-algebraic operations on large-scale biological data. His days are filled with wishing the functional code he wants to write were as fast as the imperative code he actually writes.
Logistics
The meetup will be at the Thoughtbot location in downtown Boston as usual. Food should be available near the 6:30 PM start time, and the talk will begin shortly after that.
Food and beverages will be provided by Thoughtbot.
Streaming
For people who are unable to attend in person, we plan to stream/record the talk. The stream will happen on the Boston Haskell YouTube channel (https://www.youtube.com/channel/UCUCpgCWjaniUkX88wZrK_Ig). Links to the stream will be posted here once it's scheduled.
