Toronto Apache Spark #6


Details
Title: "Scala and the JVM as a Big Data Platform - Lessons from Apache Spark" by Dean Wampler
Agenda:
6:30PM to 7:00PM - Opening, Organizational Updates (refreshments provided)
7:00PM to 8:00PM - "Scala and the JVM as a Big Data Platform - Lessons from Apache Spark" by Dean Wampler
Note:
This is an advanced talk for members with more experience with Apache Spark and big data.
In addition to this event, Dean Wampler will give another talk, "The Emerging Fast Data Architecture", on Feb. 25th at the Scala Toronto (https://www.meetup.com/scalator/events/227584833/) meetup.
Make sure that you attend both events! (https://www.linkedin.com/pulse/big-data-ideas-scala-apache-spark-building-new-mehrdad-pazooki)
Speaker:
Dean Wampler, Ph.D., is the Architect for Big Data Products and Services and a member of the Office of the CTO at Typesafe (http://typesafe.com/), where he focuses on the evolving Fast Data stack for streaming applications based on the Typesafe Reactive Platform (http://typesafe.com/platform), Spark (http://spark.apache.org/), Kafka (http://kafka.apache.org/), Mesos (http://mesos.apache.org/), and other tools.
Dean is a contributor to several open source projects and organizes the Chicago-Area Scala Enthusiasts (http://meetup.com/chicagoscala/) meetup group. He's the author of Programming Scala, 2nd Edition (http://shop.oreilly.com/product/0636920033073.do) and Functional Programming for Java Developers (http://shop.oreilly.com/product/0636920021667.do), and the co-author of Programming Hive (http://shop.oreilly.com/product/0636920023555.do), all from O'Reilly. He lurks on Twitter, @deanwampler (http://twitter.com/deanwampler).
Scala and the JVM as a Big Data Platform - Lessons from Apache Spark
Apache Spark (http://spark.apache.org/) is implemented in Scala and its user-facing Scala API is very similar to Scala’s own collections API. The power and concision of this API are bringing many developers to Scala. The core abstractions in Spark have created a flexible, extensible platform for applications like streaming, SQL queries, machine learning, and more.
Scala’s uptake reflects the following advantages over Java:
• A pragmatic balance of object-oriented and functional programming.
• An interpreter mode, which allows the same sort of exploratory programming that Data Scientists have enjoyed with Python and other languages. Scala-centric “Notebooks” are also now available.
• A rich collections library that enables composition of operations for concise, powerful code.
• Tuples are naturally expressed in Scala and very convenient for working with data.
• Pattern Matching makes data deconstruction fast and intuitive.
• Type inference provides safety and feedback to the developer, yet requires minimal typing of actual type signatures.
• Scala idioms lend themselves to the construction of small domain specific languages, which are useful for building libraries that are concise and intuitive for domain experts.
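To make several of these points concrete, here is a minimal sketch of a word count written against Scala's standard collections, using tuples, pattern matching, and type inference. The sample input and the object name are hypothetical and are not taken from the talk. On a Spark RDD the analogous pipeline would typically use flatMap, map, and reduceByKey, which is what makes the two APIs feel so similar.

```scala
object CollectionsSketch {
  // Hypothetical sample input; any Seq[String] would do.
  val lines: Seq[String] = Seq(
    "spark is implemented in scala",
    "scala runs on the jvm",
    "spark runs on the jvm"
  )

  // Operations compose left to right; intermediate types are inferred.
  val counts: Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))            // split every line into words
      .map(word => (word, 1))              // a tuple per word
      .groupBy { case (word, _) => word }  // pattern match on the tuple
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  // Most frequent words first; ties are broken alphabetically.
  val top: Seq[(String, Int)] =
    counts.toSeq.sortBy { case (word, n) => (-n, word) }.take(3)

  def main(args: Array[String]): Unit =
    top.foreach { case (word, n) => println(s"$word -> $n") }
}
```

The only explicit type annotations are on the top-level values, and those are there for the reader rather than the compiler.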
Spark, like almost all open-source Big Data tools, leverages the JVM, which is an excellent, general-purpose platform for scalable computing. However, its management of objects is suboptimal for high-performance data crunching. The way objects are organized in memory, and the impact that has on garbage collection, can be improved for the special case of Big Data. Hence, the Spark project recently started an initiative called “Tungsten” to build internal optimizations using the following techniques:
• Custom data layouts that use memory very efficiently with cache-awareness.
• Manual memory management, both on-heap and off-heap, to minimize “garbage” and GC pressure.
• Code generation to create optimal implementations of certain heavily used expressions from user code.
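As a rough illustration of the first two techniques, the sketch below packs fixed-width records into a single direct java.nio.ByteBuffer instead of allocating one JVM object per record. This is not Tungsten's actual implementation; the record layout (a Long id followed by a Double value) and the names PackedRecords, idAt, and valueAt are assumptions made up for this example.

```scala
import java.nio.ByteBuffer

object OffHeapSketch {
  // Hypothetical fixed-width record: 8-byte Long id, then 8-byte Double value.
  val RecordSize: Int = 16

  final class PackedRecords(capacity: Int) {
    // A direct buffer's storage is allocated outside the garbage-collected heap.
    private val buf: ByteBuffer = ByteBuffer.allocateDirect(capacity * RecordSize)
    private var count = 0

    def append(id: Long, value: Double): Unit = {
      require(count < capacity, "buffer full")
      val base = count * RecordSize
      buf.putLong(base, id)          // absolute write at a byte offset
      buf.putDouble(base + 8, value)
      count += 1
    }

    def size: Int = count
    def idAt(i: Int): Long = buf.getLong(i * RecordSize)
    def valueAt(i: Int): Double = buf.getDouble(i * RecordSize + 8)

    // Sequential scan over contiguous memory, with no per-record objects.
    def sumValues: Double = {
      var total = 0.0
      var i = 0
      while (i < count) { total += valueAt(i); i += 1 }
      total
    }
  }

  def main(args: Array[String]): Unit = {
    val records = new PackedRecords(capacity = 3)
    records.append(1L, 1.5)
    records.append(2L, 2.5)
    records.append(3L, 4.0)
    println(records.sumValues)
  }
}
```

The trade-off is the one the abstract describes: the records create no per-object garbage and sit contiguously in memory, but the code has to manage offsets and capacity by hand.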
Using these and other examples from the Spark project, this talk discusses the strengths and weaknesses of Scala and the JVM for Big Data, and how we might improve both to make them better tools for our needs.
Acknowledgement of Sponsors and Active Partners for this event:
Kevin Webber (https://ca.linkedin.com/in/kvnwbbr) from Reactive Toronto (https://www.meetup.com/Reactive-TO/) meetup.
Sean Glover (https://ca.linkedin.com/in/seanaglover), Katrin Shechtman (https://ca.linkedin.com/in/katrinshechtman) from Scala Toronto (https://www.meetup.com/scalator/events/227584833/) meetup.
